
Concat Python Dataframes Based On Unique Rows

My dataframe reads like:

df1
   user_id username firstname lastname
       123      abc       abc      abc
       456      def       def      def
       789      ghi       ghi      ghi
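For reference, the snippets in the answers below can be reproduced with frames like these. The contents of df2 are an assumption inferred from the outputs shown in the answers (it overlaps df1 on some user_ids and contributes the new ids 111 and 234):

```python
import pandas as pd
import numpy as np

# df1 exactly as shown in the question
df1 = pd.DataFrame({'user_id':   [123, 456, 789],
                    'username':  ['abc', 'def', 'ghi'],
                    'firstname': ['abc', 'def', 'ghi'],
                    'lastname':  ['abc', 'def', 'ghi']})

# A plausible df2 (assumed, not given in the question): shares
# user_id 123 and 456 with df1 and adds the ids 111 and 234
df2 = pd.DataFrame({'user_id':   [123, 456, 111, 234],
                    'username':  ['abc', 'def', 'xyz', 'mnp'],
                    'firstname': ['abc', 'def', 'xyz', 'mnp'],
                    'lastname':  ['abc', 'def', 'xyz', 'mnp']})
```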

Solution 1:

Use concat + drop_duplicates:

df = pd.concat([df1, df2]).drop_duplicates('user_id').reset_index(drop=True)
print (df)
   user_id username firstname lastname
0      123      abc       abc      abc
1      456      def       def      def
2      789      ghi       ghi      ghi
3      111      xyz       xyz      xyz
4      234      mnp       mnp      mnp
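Note that drop_duplicates('user_id') keeps the first occurrence by default, so when both frames share a user_id the row from df1 wins; pass keep='last' to prefer df2's version instead. A minimal sketch with toy frames (not the question's data) showing the difference:

```python
import pandas as pd

df1 = pd.DataFrame({'user_id': [123, 456], 'username': ['abc', 'def']})
df2 = pd.DataFrame({'user_id': [456, 111], 'username': ['DEF', 'xyz']})

# default keep='first': df1's row for 456 survives
first = pd.concat([df1, df2]).drop_duplicates('user_id').reset_index(drop=True)
# keep='last': df2's row for 456 survives
last = pd.concat([df1, df2]).drop_duplicates('user_id', keep='last').reset_index(drop=True)

print(first)  # user_id 456 -> 'def'
print(last)   # user_id 456 -> 'DEF'
```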

A solution with groupby and aggregating with first is slower:

df = pd.concat([df1, df2]).groupby('user_id', as_index=False, sort=False).first()
print (df)
   user_id username firstname lastname
0      123      abc       abc      abc
1      456      def       def      def
2      789      ghi       ghi      ghi
3      111      xyz       xyz      xyz
4      234      mnp       mnp      mnp

EDIT:

Another solution with boolean indexing and numpy.in1d:

df = pd.concat([df1, df2[~np.in1d(df2['user_id'], df1['user_id'])]], ignore_index=True)
print (df)
   user_id username firstname lastname
0      123      abc       abc      abc
1      456      def       def      def
2      789      ghi       ghi      ghi
3      111      xyz       xyz      xyz
4      234      mnp       mnp      mnp

Solution 2:

One approach with masking -

def app1(df1,df2):
    df20 = df2[~df2.user_id.isin(df1.user_id)]
    return pd.concat([df1, df20],axis=0)

Two more approaches work on the underlying array data: np.in1d or np.searchsorted builds the mask of matches, the two arrays are stacked, and an output dataframe is constructed from the stacked array data -

def app2(df1,df2):    
    # mask df2's rows by whether their user_id already appears in df1
    df20_arr = df2.values[~np.in1d(df2.user_id.values, df1.user_id.values)]
    arr = np.vstack(( df1.values, df20_arr ))
    df_out = pd.DataFrame(arr, columns= df1.columns)
    return df_out

def app3(df1,df2):
    a = df1.values
    b = df2.values

    # mask b's rows by whether their user_id already appears in a
    df20_arr = b[~np.in1d(b[:,0], a[:,0])]
    arr = np.vstack(( a, df20_arr ))
    df_out = pd.DataFrame(arr, columns= df1.columns)
    return df_out

def app4(df1,df2):
    a = df1.values
    b = df2.values

    b0 = b[:,0].astype(int)
    as0 = np.sort(a[:,0].astype(int))
    # clip the insertion indices so ids in df2 larger than df1's
    # maximum don't index past the end of as0
    idx = np.searchsorted(as0, b0).clip(max=len(as0) - 1)
    df20_arr = b[as0[idx] != b0]
    arr = np.vstack(( a, df20_arr ))
    df_out = pd.DataFrame(arr, columns= df1.columns)
    return df_out
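As a sanity check, the pandas mask in app1 and the array-level mask in app2/app3 should produce identical frames. A self-contained version of that check, with small assumed frames (app2/app3 must mask df2's ids against df1's, not the reverse):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'user_id': [123, 456, 789], 'username': ['abc', 'def', 'ghi']})
df2 = pd.DataFrame({'user_id': [456, 111], 'username': ['DEF', 'xyz']})

# app1-style: pandas boolean mask on df2
out1 = pd.concat([df1, df2[~df2.user_id.isin(df1.user_id)]], ignore_index=True)

# app3-style: the same mask computed on the underlying arrays
a, b = df1.values, df2.values
out3 = pd.DataFrame(np.vstack((a, b[~np.in1d(b[:, 0], a[:, 0])])),
                    columns=df1.columns)

# both keep all of df1 plus only df2's new user_id 111
assert (out1.values == out3.values).all()
print(out1)
```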

Timings for given sample -

In [49]: %timeit app1(df1,df2)
    ...: %timeit app2(df1,df2)
    ...: %timeit app3(df1,df2)
    ...: %timeit app4(df1,df2)
    ...: 
1000 loops, best of 3: 753 µs per loop
10000 loops, best of 3: 192 µs per loop
10000 loops, best of 3: 181 µs per loop
10000 loops, best of 3: 171 µs per loop

# @jezrael's edited solution
In [85]: %timeit pd.concat([df1, df2[~np.in1d(df2['user_id'], df1['user_id'])]], ignore_index=True)
1000 loops, best of 3: 614 µs per loop

Would be interesting to see how these fare on larger datasets.
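A quick, purely illustrative way to probe that (the sizes, data, and timing harness here are assumptions, not from the answer):

```python
import numpy as np
import pandas as pd
from timeit import timeit

rng = np.random.default_rng(0)  # assumes NumPy >= 1.17
n = 50_000
# two frames with distinct ids drawn from a shared, larger id space
df1 = pd.DataFrame({'user_id': rng.choice(10 * n, n, replace=False),
                    'username': 'u'})
df2 = pd.DataFrame({'user_id': rng.choice(10 * n, n, replace=False),
                    'username': 'v'})

t_drop = timeit(lambda: pd.concat([df1, df2]).drop_duplicates('user_id'),
                number=5)
t_mask = timeit(lambda: pd.concat([df1, df2[~df2.user_id.isin(df1.user_id)]]),
                number=5)
print(f'drop_duplicates: {t_drop:.3f}s  isin mask: {t_mask:.3f}s')
```

Both keep df1's row when a user_id appears in both frames, so they are comparable on results as well as speed.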

Solution 3:

Another approach is to use np.setdiff1d to keep only the user_id values of df2 that are absent from df1.

pd.concat([df1,df2[df2.user_id.isin(np.setdiff1d(df2.user_id,df1.user_id))]])

Or to use a set to get unique rows from the merged records from df1 and df2. This one seems to be a few times faster.

pd.DataFrame(data=np.vstack({tuple(row) for row in np.r_[df1.values,df2.values]}),columns=df1.columns)
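Two caveats with the set-based variant: it dedupes on the whole row (two rows sharing a user_id but differing elsewhere both survive), the set discards row order, and np.vstack over mixed-type tuples coerces every value to a common string dtype. A small sketch with toy frames (sorting and the dtype restore are additions for determinism, not part of the answer):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'user_id': [1, 2], 'username': ['a', 'b']})
df2 = pd.DataFrame({'user_id': [2, 3], 'username': ['b', 'c']})

# the set keeps one copy of the fully identical row (2, 'b')
rows = {tuple(row) for row in np.r_[df1.values, df2.values]}
out = pd.DataFrame(data=np.vstack(sorted(rows)), columns=df1.columns)

# np.vstack turned everything into strings; restore the numeric column
out['user_id'] = out['user_id'].astype(int)
print(out)
```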

Timings:

%timeit pd.concat([df1, df2[df2.user_id.isin(np.setdiff1d(df2.user_id, df1.user_id))]])
1000 loops, best of 3: 2.48 ms per loop

%timeit pd.DataFrame(data=np.vstack({tuple(row) for row in np.r_[df1.values, df2.values]}), columns=df1.columns)
1000 loops, best of 3: 632 µs per loop

Solution 4:

One can also use append + drop_duplicates. Note that append returns a new dataframe rather than modifying df1 in place (and in recent pandas versions it is deprecated in favor of pd.concat), so the result must be captured; pass 'user_id' as the subset to dedupe on that column rather than on fully identical rows:

df = df1.append(df2)
df = df.drop_duplicates('user_id')
