
Pandas: How To Make Algorithm Faster

I have a task: I need to find certain data in a big file and append it to another file. The file I search in has 22 million rows, so I read it in chunks using chunksize. In the other file …

Solution 1:

There are some issues in your code

  1. zip silently truncates to the length of its shortest argument, so if your lists have different lengths the extra items are dropped without any error (see the short sketch after this list)

  2. dateutil.relativedelta probably doesn't do what you expect with a pandas Timestamp: relativedelta(day=5) sets the day of the month to 5 rather than adding 5 days (that would be days=5). With pandas 0.18.1 and Python 3.5, I'm getting this:

    now = pd.Timestamp.now()
    now
    Out[46]: Timestamp('2016-07-06 15:32:44.266720')
    now + dateutil.relativedelta.relativedelta(day=5)
    Out[47]: Timestamp('2016-07-05 15:32:44.266720')
    

    So it's better to use pd.Timedelta

    now + pd.Timedelta(5, 'D')
    Out[48]: Timestamp('2016-07-11 15:32:44.266720')
    

    But it is somewhat inaccurate for months, because the 'M' unit means a fixed average month length (about 30.44 days), not a calendar month:

    now - pd.Timedelta(1, 'M')
    Out[49]: Timestamp('2016-06-06 05:03:38.266720')
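
To illustrate point 1, a tiny sketch with made-up lists showing the silent truncation:

ids = [1, 2, 3, 4]
dates = ['2016-07-01', '2016-07-02', '2016-07-03']   # one element short

# zip stops at the shortest input, so id 4 is silently dropped
print(list(zip(ids, dates)))
# [(1, '2016-07-01'), (2, '2016-07-02'), (3, '2016-07-03')]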
    

This is a sketch of the code. I didn't test it, and I may be wrong about what you want. The crucial part is to merge the two data frames instead of iterating row by row.

# 1) convert to datetime here 
# 2) optionally, you can select only relevant cols with e.g. usecols=['ID', 'used_at', 'url']
# 3) iterator=True is not needed when chunksize is given
el = pd.read_csv('df2.csv', chunksize=1000000, parse_dates=['used_at'])

buys = pd.read_excel('smartphone.xlsx')
buys['date'] = pd.to_datetime(buys['date'])
# consider loading only relevant columns to buys

# compute time intervals here (not in a loop!)
buys['date_min'] = buys['date'] - pd.Timedelta(1, unit='M')
buys['date_max'] = buys['date'] + pd.Timedelta(5, unit='D')
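# (alternative) for calendar-accurate months instead of Timedelta's fixed-length 'M' unit,
# pd.DateOffset also works, e.g. buys['date'] - pd.DateOffset(months=1)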

# now replace (probably it needs to be done row by row)
buys['date_min'] = buys['date_min'].apply(lambda x: x.replace(day=1, hour=0, minute=0, second=0))
buys['date_max'] = buys['date_max'].apply(lambda x: x.replace(day=1, hour=0, minute=0, second=0))
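# (possible vectorized alternative, untested here)
# buys['date_min'] = buys['date_min'].values.astype('datetime64[M]')
# buys['date_max'] = buys['date_max'].values.astype('datetime64[M]')
# this snaps each timestamp to midnight on the first of its month in one step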

# not necessary
# dates1 = buys['date']
# ids1 = buys['id']

for chunk in el:
    # already converted to datetime
    # i['used_at'] = pd.to_datetime(i['used_at'])

    # defer sorting until later
    # df = i.sort_values(['ID', 'used_at'])

    # merge!
    # (how='inner' keeps only rows whose id appears in both data frames; it's the default)
    merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')
    bool_idx = (merged['used_at'] < merged['date_max']) & (merged['used_at'] > merged['date_min'])
    selected = merged.loc[bool_idx]

    # probably don't need additional columns from buys, 
    # so either drop them or select the ones from chunk (beware of possible duplicates in names)
    selected = selected[chunk.columns]

    # sort now (possibly a smaller frame)
    selected = selected.sort_values(['ID', 'used_at'])

    if selected.empty:
        continue
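    # note: header=False on every append means 3.csv will have no header row;
    # if you need column names, write the header once before the loop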
    with open('3.csv', 'a') as f:
        selected.to_csv(f, header=False)

Hope this helps. Please double-check the code and adjust it to your needs.

Please take a look at the docs to understand the options of merge.
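
For reference, here is a minimal, self-contained sketch (toy data, hypothetical values) of what the inner merge on differently named key columns does:

import pandas as pd

chunk = pd.DataFrame({'ID': [1, 2, 3],
                      'used_at': pd.to_datetime(['2016-07-01', '2016-07-02', '2016-07-03']),
                      'url': ['a.com', 'b.com', 'c.com']})
buys = pd.DataFrame({'id': [2, 3, 4],
                     'date': pd.to_datetime(['2016-07-05', '2016-07-06', '2016-07-07'])})

# how='inner' keeps only the IDs present in both frames (here 2 and 3);
# left_on/right_on pair up the differently named key columns
merged = pd.merge(chunk, buys, left_on='ID', right_on='id', how='inner')
print(merged)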

