Pandas: How To Make Algorithm Faster
I have a task: I need to find some data in a big file and add it to another file. The file I search in has 22 million rows, so I read it in chunks with chunksize. The other file (an Excel sheet) contains the IDs and dates I need to match against.
Solution 1:
There are some issues in your code:

- zip takes arguments that may be of different length.
- dateutil.relativedelta may not be compatible with pandas Timestamp. With pandas 0.18.1 and Python 3.5, I'm getting this:

now = pd.Timestamp.now()
now
Out[46]: Timestamp('2016-07-06 15:32:44.266720')
now + dateutil.relativedelta.relativedelta(day=5)
Out[47]: Timestamp('2016-07-05 15:32:44.266720')

So it's better to use pd.Timedelta:

now + pd.Timedelta(5, 'D')
Out[48]: Timestamp('2016-07-11 15:32:44.266720')

But it's somewhat inaccurate for months:

now - pd.Timedelta(1, 'M')
Out[49]: Timestamp('2016-06-06 05:03:38.266720')
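As a side note, if you need exact calendar months rather than the average month length that pd.Timedelta(1, 'M') uses, pd.DateOffset is one alternative (a small sketch on my part, assuming pandas 0.18 or later):

now = pd.Timestamp('2016-07-06 15:32:44')
now - pd.DateOffset(months=1)
# Timestamp('2016-06-06 15:32:44') -- shifted by a whole calendar month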
Here is a sketch of the code. I didn't test it, and I may be wrong about what you want. The crucial part is to merge the two data frames instead of iterating row by row.
# 1) convert to datetime here
# 2) optionally, you can select only relevant cols with e.g. usecols=['ID', 'used_at', 'url']
# 3) iterator is prob. superfluous
el = pd.read_csv('df2.csv', chunksize=1000000, parse_dates=['used_at'])
buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
# consider loading only relevant columns to buys
# compute time intervals here (not in a loop!)
buys['date_min'] = buys['date'] - pd.Timedelta(1, unit='M')
buys['date_max'] = buys['date'] + pd.Timedelta(5, unit='D')
# now replace (probably it needs to be done row by row)
buys['date_min'] = buys['date_min'].apply(lambda x: x.replace(day=1, hour=0, minute=0, second=0))
buys['date_max'] = buys['date_max'].apply(lambda x: x.replace(day=1, hour=0, minute=0, second=0))
# not necessary
# dates1 = buys['date']
# ids1 = buys['id']
for chunk in el:
    # already converted to datetime
    # i['used_at'] = pd.to_datetime(i['used_at'])
    # defer sorting until later
    # df = i.sort_values(['ID', 'used_at'])
    # merge!
    # (the option how='inner' selects only rows that have the same id in both data frames; it's the default)
    merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')
    bool_idx = (merged['used_at'] < merged['date_max']) & (merged['used_at'] > merged['date_min'])
    selected = merged.loc[bool_idx]
    # probably don't need the additional columns from buys,
    # so either drop them or select the ones from chunk (beware of possible duplicate names)
    selected = selected[chunk.columns]
    # sort now (possibly a smaller frame)
    selected = selected.sort_values(['ID', 'used_at'])
    if selected.empty:
        continue
    with open('3.csv', 'a') as f:
        selected.to_csv(f, header=False)
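One detail to double-check: because every chunk is appended with header=False, the output file 3.csv never gets column names. A minimal variant of the last two lines (my own suggestion, not something the question asks for) writes the header only when the file is first created:

import os

# write the header only if 3.csv does not exist yet, so the column names appear exactly once
write_header = not os.path.exists('3.csv')
selected.to_csv('3.csv', mode='a', header=write_header, index=False)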
Hope this helps. Please double-check the code and adjust it to your needs.
Please take a look at the docs to understand the options of merge.
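For intuition, here is a tiny toy example (made-up data, not from the question) of what how='inner' with left_on/right_on does:

import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3], 'used_at': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'date': ['x', 'y', 'z']})
# only ID 2 and ID 3 exist in both frames, so how='inner' keeps just those two rows;
# the result has the columns of both frames: ID, used_at, id, date
pd.merge(left, right, left_on='ID', right_on='id', how='inner')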