Concat Python Dataframes Based On Unique Rows
My dataframe reads like:
df1
   user_id username firstname lastname
0      123      abc       abc      abc
1      456      def       def      def
2      789      ghi       ghi      ghi
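For reference, the sample frames can be reconstructed like this. The question only shows df1, so df2 below is an assumption inferred from the expected outputs in the answers (an overlapping user_id 123 plus the new ids 111 and 234):

```python
import pandas as pd

# df1 as shown in the question
df1 = pd.DataFrame({'user_id': [123, 456, 789],
                    'username': ['abc', 'def', 'ghi'],
                    'firstname': ['abc', 'def', 'ghi'],
                    'lastname': ['abc', 'def', 'ghi']})

# df2 is a plausible reconstruction, not given in the question:
# one row duplicating user_id 123 plus two new users
df2 = pd.DataFrame({'user_id': [123, 111, 234],
                    'username': ['abc', 'xyz', 'mnp'],
                    'firstname': ['abc', 'xyz', 'mnp'],
                    'lastname': ['abc', 'xyz', 'mnp']})
```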
Solution 1:
Use concat + drop_duplicates:
df = pd.concat([df1, df2]).drop_duplicates('user_id').reset_index(drop=True)
print (df)
   user_id username firstname lastname
0      123      abc       abc      abc
1      456      def       def      def
2      789      ghi       ghi      ghi
3      111      xyz       xyz      xyz
4      234      mnp       mnp      mnp
A solution with groupby and aggregating with first is slower:
df = pd.concat([df1, df2]).groupby('user_id', as_index=False, sort=False).first()
print (df)
   user_id username firstname lastname
0      123      abc       abc      abc
1      456      def       def      def
2      789      ghi       ghi      ghi
3      111      xyz       xyz      xyz
4      234      mnp       mnp      mnp
EDIT:
Another solution with boolean indexing and numpy.in1d:
df = pd.concat([df1, df2[~np.in1d(df2['user_id'], df1['user_id'])]], ignore_index=True)
print (df)
   user_id username firstname lastname
0      123      abc       abc      abc
1      456      def       def      def
2      789      ghi       ghi      ghi
3      111      xyz       xyz      xyz
4      234      mnp      mnp      mnp
Solution 2:
One approach with masking -
def app1(df1, df2):
    # keep only df2 rows whose user_id is not already in df1
    df20 = df2[~df2.user_id.isin(df1.user_id)]
    return pd.concat([df1, df20], axis=0)
Two more approaches use the underlying array data, with np.in1d or np.searchsorted to get the mask of matches, then stack the two pieces and construct an output dataframe from the stacked array data -
def app2(df1, df2):
    # mask df2 rows whose user_id does not appear in df1
    # (the mask must be built from df2's ids so its length matches df2)
    df20_arr = df2.values[~np.in1d(df2.user_id.values, df1.user_id.values)]
    arr = np.vstack((df1.values, df20_arr))
    df_out = pd.DataFrame(arr, columns=df1.columns)
    return df_out
def app3(df1, df2):
    a = df1.values
    b = df2.values
    # keep b's rows whose first column (user_id) is absent from a
    df20_arr = b[~np.in1d(b[:, 0], a[:, 0])]
    arr = np.vstack((a, df20_arr))
    df_out = pd.DataFrame(arr, columns=df1.columns)
    return df_out
def app4(df1, df2):
    a = df1.values
    b = df2.values
    b0 = b[:, 0].astype(int)
    as0 = np.sort(a[:, 0].astype(int))
    # clip guards against an out-of-bounds index when an id in b0
    # is larger than every id in as0
    idx = np.searchsorted(as0, b0).clip(max=len(as0) - 1)
    df20_arr = b[as0[idx] != b0]
    arr = np.vstack((a, df20_arr))
    df_out = pd.DataFrame(arr, columns=df1.columns)
    return df_out
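The searchsorted trick in app4 is a membership test in disguise: after sorting df1's ids, looking up each of df2's ids and comparing what sits at that position tells you whether the id is present. A minimal sketch on the sample ids (the arrays below are the assumed sample user_ids, not from the answer):

```python
import numpy as np

a_ids = np.array([123, 456, 789])   # df1's user_ids
b_ids = np.array([123, 111, 234])   # df2's user_ids

as0 = np.sort(a_ids)
# clip guards against an index of len(as0) for ids above max(a_ids)
idx = np.searchsorted(as0, b_ids).clip(max=len(as0) - 1)
mask_new = as0[idx] != b_ids        # True where b's id is absent from a
# mask_new -> [False, True, True]: only 111 and 234 are new
```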
Timings for the given sample -
In [49]: %timeit app1(df1,df2)
...: %timeit app2(df1,df2)
...: %timeit app3(df1,df2)
...: %timeit app4(df1,df2)
...:
1000 loops, best of 3: 753 µs per loop
10000 loops, best of 3: 192 µs per loop
10000 loops, best of 3: 181 µs per loop
10000 loops, best of 3: 171 µs per loop
# @jezrael's edited solution
In [85]: %timeit pd.concat([df1, df2[~np.in1d(df2['user_id'], df1['user_id'])]], ignore_index=True)
1000 loops, best of 3: 614 µs per loop
Would be interesting to see how these fare on larger datasets.
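One way to probe that is a hypothetical benchmark setup (not from the answer): generate larger single-column frames whose ids are drawn from an overlapping range, then time any of the approaches on them:

```python
import numpy as np
import pandas as pd

# Hypothetical larger test data: 100k unique ids per frame, drawn from a
# range twice that size so a sizeable fraction of user_ids collide
rng = np.random.default_rng(0)
n = 100_000
df1 = pd.DataFrame({'user_id': rng.choice(n * 2, n, replace=False)})
df2 = pd.DataFrame({'user_id': rng.choice(n * 2, n, replace=False)})

# e.g. the isin-based mask on this data (wrap in %timeit in IPython)
out = pd.concat([df1, df2[~df2.user_id.isin(df1.user_id)]],
                ignore_index=True)
```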
Solution 3:
Another approach is to use np.setdiff1d to keep only the user_id values that appear in df2 but not in df1.
pd.concat([df1,df2[df2.user_id.isin(np.setdiff1d(df2.user_id,df1.user_id))]])
Or to use a set to get unique rows from the merged records from df1 and df2. This one seems to be a few times faster.
pd.DataFrame(data=np.vstack({tuple(row) for row in np.r_[df1.values,df2.values]}),columns=df1.columns)
Timings:
%timeit pd.concat([df1, df2[df2.user_id.isin(np.setdiff1d(df2.user_id, df1.user_id))]])
1000 loops, best of 3: 2.48 ms per loop

%timeit pd.DataFrame(data=np.vstack({tuple(row) for row in np.r_[df1.values, df2.values]}), columns=df1.columns)
1000 loops, best of 3: 632 µs per loop
Solution 4:
One can also use append + drop_duplicates. Note that append returns a new dataframe rather than modifying df1 in place, so capture the result before dropping duplicates:
df = df1.append(df2)
df.drop_duplicates(inplace=True)
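Be aware that DataFrame.append was deprecated and then removed in pandas 2.0; the modern equivalent uses pd.concat. A sketch with made-up two-column frames:

```python
import pandas as pd

df1 = pd.DataFrame({'user_id': [123, 456], 'username': ['abc', 'def']})
df2 = pd.DataFrame({'user_id': [123, 111], 'username': ['abc', 'xyz']})

# pd.concat replaces the removed df1.append(df2); drop_duplicates on
# 'user_id' keeps the first occurrence, i.e. df1's version of a row
df = pd.concat([df1, df2], ignore_index=True).drop_duplicates('user_id')
```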