Pandas: How To Make Algorithm Faster
I have a task: I need to find certain data in a big file and write it to another file. The file I search in has 22 million rows, so I read it in chunks using chunksize. The other file (an Excel sheet) holds the IDs and dates I have to match against.
Solution 1:
There are some issues in your code:

- zip takes arguments that may be of different lengths; it silently truncates to the shortest one, so a mismatch goes unnoticed.
- dateutil.relativedelta may not be compatible with pandas Timestamp. With pandas 0.18.1 and Python 3.5, I'm getting this:

now = pd.Timestamp.now()
now
Out[46]: Timestamp('2016-07-06 15:32:44.266720')
now + dateutil.relativedelta.relativedelta(day=5)
Out[47]: Timestamp('2016-07-05 15:32:44.266720')

(Note that relativedelta(day=5) sets the day of the month to 5; adding five days would be days=5.)
So it's better to use pd.Timedelta:

now + pd.Timedelta(5, 'D')
Out[48]: Timestamp('2016-07-11 15:32:44.266720')
But it's somewhat inaccurate for months:
now - pd.Timedelta(1, 'M')
Out[49]: Timestamp('2016-06-06 05:03:38.266720')
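If you need calendar-exact months, pd.DateOffset is an option; here is a quick, untested sketch (it reuses the example timestamp from above, and note that newer pandas versions may reject the ambiguous 'M' unit for Timedelta entirely):

import pandas as pd

now = pd.Timestamp('2016-07-06 15:32:44')

# the 'M' unit is a fixed span of roughly 30.44 days, not a calendar month
print(now - pd.Timedelta(1, unit='M'))   # 2016-06-06 05:03:38
# DateOffset shifts by one calendar month and keeps the time of day
print(now - pd.DateOffset(months=1))     # 2016-06-06 15:32:44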
This is a sketch of the code. I didn't test it, and I may be wrong about what you want. The crucial part is to merge the two data frames instead of iterating row by row.
import pandas as pd

# 1) convert to datetime here
# 2) optionally, you can select only the relevant columns with e.g. usecols=['ID', 'used_at', 'url']
# 3) the iterator argument is probably superfluous
el = pd.read_csv('df2.csv', chunksize=1000000, parse_dates=['used_at'])
buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
# consider loading only relevant columns to buys
# compute time intervals here (not in a loop!)
buys['date_min'] = buys['date'] - pd.Timedelta(1, unit='M')
buys['date_max'] = buys['date'] + pd.Timedelta(5, unit='D')
# now replace (probably it needs to be done row by row)
buys['date_min'] = buys['date_min'].apply(lambda x: x.replace(day=1, hour=0, minute=0, second=0))
buys['date_max'] = buys['date_max'].apply(lambda x: x.replace(day=1, hour=0, minute=0, second=0))
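# (untested) a vectorised alternative to the row-by-row replace above,
# flooring both columns to the start of their month in one shot:
# buys['date_min'] = pd.to_datetime(buys['date_min'].values.astype('datetime64[M]'))
# buys['date_max'] = pd.to_datetime(buys['date_max'].values.astype('datetime64[M]'))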
# not necessary
# dates1 = buys['date']
# ids1 = buys['id']
for chunk in el:
    # already converted to datetime
    # chunk['used_at'] = pd.to_datetime(chunk['used_at'])
    # defer sorting until later
    # df = chunk.sort_values(['ID', 'used_at'])
    # merge!
    # (the option how='inner' selects only rows that have the same id in both data frames; it's the default)
    merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')
    bool_idx = (merged['used_at'] < merged['date_max']) & (merged['used_at'] > merged['date_min'])
    selected = merged.loc[bool_idx]
    # probably don't need the additional columns from buys,
    # so either drop them or select the ones from chunk (beware of possible duplicates in names)
    selected = selected[chunk.columns]
    # sort now (possibly a smaller frame)
    selected = selected.sort_values(['ID', 'used_at'])
    if selected.empty:
        continue
    with open('3.csv', 'a') as f:
        selected.to_csv(f, header=False)
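To see what the merge-and-mask step does, here is a tiny self-contained example with made-up data (the column names ID / id, used_at, date_min and date_max follow the sketch above; the values are invented):

import pandas as pd

# toy "chunk": a few browsing-log rows
chunk = pd.DataFrame({
    'ID': [1, 1, 2, 3],
    'used_at': pd.to_datetime(['2016-06-10', '2016-07-20', '2016-06-15', '2016-06-01']),
    'url': ['a.com', 'b.com', 'c.com', 'd.com'],
})

# toy "buys": one row per purchase, with a precomputed date window
buys = pd.DataFrame({
    'id': [1, 2],
    'date_min': pd.to_datetime(['2016-06-01', '2016-07-01']),
    'date_max': pd.to_datetime(['2016-07-01', '2016-08-01']),
})

# one merge attaches each user's window to all of their log rows ...
merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')

# ... and one vectorised mask replaces the per-row date comparisons
mask = (merged['used_at'] > merged['date_min']) & (merged['used_at'] < merged['date_max'])
print(merged.loc[mask, chunk.columns])
# only the row ID=1, used_at=2016-06-10, url='a.com' survives

ID=3 is dropped by the inner join because it has no matching purchase, and the remaining out-of-window rows are dropped by the boolean mask.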
Hope this helps. Please double-check the code and adjust it to your needs.
Please take a look at the docs to understand the options of merge.