Pandas Functions Too Slow - Optimise With Dict/numpy?
I have ~10 large DataFrames (5 million+ rows each, and growing) that I want to perform calculations on. Doing so with raw pandas, even on a very fast AWS machine, is unbearably slow.
Solution 1:
The code is slow because there are so many groups: for every group, pandas needs to create a DataFrame object and pass it to tick_features(), so the loop is executed in Python.
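The question's tick_features() isn't shown, but a minimal sketch of the slow pattern (with assumed feature names, using the dummy df built below) would look like this:

# Hypothetical stand-in for the question's tick_features(); the real
# function is not shown, so the features here are assumptions.
def tick_features(group):
    # pandas builds a fresh DataFrame 'group' for every 5-second bucket
    return pd.Series({"high": group["price"].max(),
                      "low": group["price"].min()})

# one Python-level function call per group; with millions of groups
# this dominates the runtime
slow = df.groupby(pd.Grouper(freq="5s")).apply(tick_features)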
To speed up the calculation, call aggregation methods that are executed in Cython loops instead.
Prepare some dummy data first:
import pandas as pd
import numpy as np

# 0.1-second ticks over about a month: roughly 27.6 million rows
idx = pd.date_range("2018-05-01", "2018-06-02", freq="0.1S")
x = np.random.randn(idx.shape[0], 2)
df = pd.DataFrame(x, index=idx, columns=["size", "price"])
Add extra columns to it; the calculation is fast if you have enough memory:
df["time"] = df.index
df["volume"] = df["size"].abs()
df["buy_volume"] = np.clip(df["size"], 0, np.inf)
df["sell_volume"] = np.clip(df["size"], -np.inf, 0)
df["buy_trade"] = df["size"] > 0
df["sell_trade"] = df["size"] < 0
Then group the DataFrame and call the aggregation methods on each column:
g = df.groupby(pd.Grouper(freq="5s"))
df2 = pd.DataFrame(
    dict(
        open=g["time"].first(),
        close=g["time"].last(),
        high=g["price"].max(),
        low=g["price"].min(),
        volume=g["volume"].sum(),
        buy_volume=g["buy_volume"].sum(),
        sell_volume=-g["sell_volume"].sum(),  # flip sign so it is positive
        num_trades=g["size"].count(),
        buy_trade=g["buy_trade"].sum(),
        sell_trade=g["sell_trade"].sum(),
        pct_buy_trades=g["buy_trade"].mean() * 100,
        pct_sell_trades=g["sell_trade"].mean() * 100,
    )
)
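Each row of df2 is now one 5-second bar; a quick look confirms the shape before the percentage columns are derived:

# sanity check: one row per 5-second bar
print(df2.head())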
# sell_volume was negated above, so the sum is the total volume per bar
d = df2.eval("buy_volume + sell_volume")
df2["pct_buy_volume"] = df2.eval("buy_volume / @d")
df2["pct_sell_volume"] = df2.eval("sell_volume / @d")