
Pandas Functions Too Slow - Optimise With Dict/numpy?

I have ~10 large DataFrames (5M+ rows each and growing) that I want to perform calculations on. Doing so with raw pandas, even on a very fast AWS machine, is unbearably slow. Most funct…

Solution 1:

The code is slow because there are so many groups: for every group, pandas has to create a DataFrame object and pass it to tick_features(), so the whole loop runs in Python.
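
For context, the slow pattern looks roughly like this (the body of tick_features is a hypothetical stand-in, since the original wasn't posted; the df it runs on is the dummy one built below):

# Slow: pandas materialises a sub-DataFrame for every 5-second group and
# calls a Python function on each one -- millions of interpreter round-trips.
def tick_features(bar):  # hypothetical stand-in for the OP's function
    return pd.Series({"high": bar["price"].max(), "low": bar["price"].min()})

df.groupby(pd.Grouper(freq="5s")).apply(tick_features)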

To speed up the calculation, call aggregation methods that execute in Cython loops instead:

Prepare some dummy data first:

import pandas as pd
import numpy as np

# one tick every 100 ms for a month: ~27.6 million rows
# (newer pandas prefers the lowercase alias, e.g. freq="100ms")
idx = pd.date_range("2018-05-01", "2018-06-02", freq="0.1S")
x = np.random.randn(idx.shape[0], 2)

# the signed "size" stands in for trade direction: positive = buy, negative = sell
df = pd.DataFrame(x, index=idx, columns=["size", "price"])

Add the extra columns to it; the calculation is fast if you have enough memory:

df["time"] = df.index
df["volume"] = df["size"].abs()
df["buy_volume"] = np.clip(df["size"], 0, np.inf)
df["sell_volume"] = np.clip(df["size"], -np.inf, 0)
df["buy_trade"] = df["size"] > 0
df["sell_trade"] = df["size"] < 0    

Then group the DataFrame and call the aggregation methods, which run in Cython:

g = df.groupby(pd.Grouper(freq="5s"))
df2 = pd.DataFrame(
    dict(
        open=g["time"].first(),
        close=g["time"].last(),
        high=g["price"].max(),
        low=g["price"].min(),
        volume=g["volume"].sum(),
        buy_volume=g["buy_volume"].sum(),
        sell_volume=-g["sell_volume"].sum(),
        num_trades=g["size"].count(),
        buy_trade=g["buy_trade"].sum(),
        sell_trade=g["sell_trade"].sum(),
        pct_buy_trades=g["buy_trade"].mean() * 100,
        pct_sell_trades=g["sell_trade"].mean() * 100,
    )
)

# @d refers to the local variable d inside eval()
d = df2.eval("buy_volume + sell_volume")  # total volume per bar
df2["pct_buy_volume"] = df2.eval("buy_volume / @d")    # fractions in [0, 1],
df2["pct_sell_volume"] = df2.eval("sell_volume / @d")  # unlike pct_*_trades above
