
Comparing One Record With All Other To Remove Duplicates - Python Or R

I have a data set which contains all the World Cup matches, with columns Date, Team A, Team B and some other columns. But this data set has duplicates in it, like for a India Vs Aust

Solution 1:

Something like this is probably OK. You just need to put the two teams in alphabetical order, so it doesn't matter which one was recorded as Team A and which as Team B:

df['team_tuple'] = df.apply(
    lambda row: tuple(
        sorted((row['Team A'], row['Team B']))
    ), 
    axis='columns'
)
df
Out[17]: 
          DATE     Team A     Team B          team_tuple
0  24-May-1983      India  Australia  (Australia, India)
1  24-May-1983  Australia      India  (Australia, India)

duplicates = df.loc[:, ['DATE', 'team_tuple']].duplicated()
cleaned_df = df.loc[~ duplicates, :]
In [16]: cleaned_df
Out[16]: 
          DATE Team A     Team B          team_tuple
0  24-May-1983  India  Australia  (Australia, India)
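
For anyone who wants to see the idea without pandas, here is a minimal plain-Python sketch of the same approach: build an order-independent key from the date and the sorted team pair, and keep only the first record for each key. The sample rows and the function name `drop_mirrored_duplicates` are made up for illustration; they are not from the original answer.

```python
# Hypothetical sample rows mirroring the two records shown above.
rows = [
    {"DATE": "24-May-1983", "Team A": "India", "Team B": "Australia"},
    {"DATE": "24-May-1983", "Team A": "Australia", "Team B": "India"},
]


def drop_mirrored_duplicates(records):
    """Keep the first record for each (date, unordered team pair)."""
    seen = set()
    kept = []
    for rec in records:
        # Sorting the two names makes the key independent of which side
        # was recorded as Team A.
        key = (rec["DATE"], tuple(sorted((rec["Team A"], rec["Team B"]))))
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


cleaned = drop_mirrored_duplicates(rows)
print(cleaned)
```

Like the pandas version, this keeps whichever orientation appears first and silently drops the mirrored one, so any other columns on the dropped row are lost.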

Solution 2:

In R, you could sort the two columns and then drop duplicates. Here's that with data.table:

library(data.table)
DT[`Team A` < `Team B`, `:=`(
  `Team A` = `Team B`,
  `Team B` = `Team A`
)]
unique(DT)
#          DATE Team A    Team B
# 1: 1983-05-24  India Australia

The liberal application of backticks is necessary because the OP used column names with spaces. The first step can be read as:

subset to where A < B, and within that subset, swap 'em.

`:=` is the assignment operator inside a data.table, and it is being applied like a function here.

# input data
DT <- data.table(DATE=as.IDate(c("24-May-1983","24-May-1983"), "%d-%b-%Y"), 
  `Team A`=c("India", "Australia"), `Team B` = c("Australia", "India"))
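
The "subset to where A < B, then swap" step described above can also be sketched in pandas, for readers following the Python side of the question. This is only a rough translation under assumptions: the sample frame is made up to match the input data, and `swap` is a hypothetical variable name.

```python
import pandas as pd

# Hypothetical sample data matching the R input above.
df = pd.DataFrame({
    "DATE": ["24-May-1983", "24-May-1983"],
    "Team A": ["India", "Australia"],
    "Team B": ["Australia", "India"],
})

# Rows where Team A sorts before Team B: the same condition as the R code.
swap = df["Team A"] < df["Team B"]

# Swap the two columns on those rows only. Converting the right-hand side
# with .to_numpy() stops pandas from realigning by column label, which
# would otherwise undo the swap.
df.loc[swap, ["Team A", "Team B"]] = df.loc[swap, ["Team B", "Team A"]].to_numpy()

# Mirrored records are now identical rows, so a plain de-duplication works.
cleaned_df = df.drop_duplicates()
print(cleaned_df)
```

Unlike the `team_tuple` approach in Solution 1, this overwrites the original Team A / Team B orientation, so it only suits data where that orientation carries no meaning.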
