Comparing One Record With All Other To Remove Duplicates - Python Or R
I have a data set which contains all the World Cup matches, with columns Date, Team A, Team B and some other columns. But this data set has duplicates in it, like for a India Vs Aust
Solution 1:
Something like this should work. You just need to put the teams in alphabetical order so that it doesn't matter which one was recorded as Team A and which as Team B:
df['team_tuple'] = df.apply(
    lambda row: tuple(
        sorted((row['Team A'], row['Team B']))
    ),
    axis='columns'
)
df
Out[17]:
          DATE     Team A     Team B          team_tuple
0  24-May-1983      India  Australia  (Australia, India)
1  24-May-1983  Australia      India  (Australia, India)
duplicates = df.loc[:, ['DATE', 'team_tuple']].duplicated()
cleaned_df = df.loc[~ duplicates, :]
In [16]: cleaned_df
Out[16]:
          DATE Team A     Team B          team_tuple
0  24-May-1983  India  Australia  (Australia, India)
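For larger tables, a row-wise `apply` can be slow. The following is a minimal, self-contained sketch of the same idea that sorts the two team columns with NumPy instead of building tuples. It swaps in a different technique from the answer above, and the helper column names `team_lo` and `team_hi` are made up for illustration. The sample frame simply mirrors the two rows shown above.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the two rows shown above.
df = pd.DataFrame({
    "DATE": ["24-May-1983", "24-May-1983"],
    "Team A": ["India", "Australia"],
    "Team B": ["Australia", "India"],
})

# Sort each row's pair of team names so the order they were recorded in
# no longer matters. The result is a 2-column array of sorted names.
sorted_teams = np.sort(df[["Team A", "Team B"]].to_numpy(), axis=1)

# Hypothetical helper columns holding the sorted pair.
keyed = df.assign(team_lo=sorted_teams[:, 0], team_hi=sorted_teams[:, 1])

# Flag every row whose (DATE, sorted pair) has already been seen.
duplicates = keyed.duplicated(subset=["DATE", "team_lo", "team_hi"])

# Keep only the first occurrence, with the original columns untouched.
cleaned_df = df.loc[~duplicates]
print(cleaned_df)
```

Because the helper columns live only in `keyed`, the cleaned frame keeps the original `Team A` and `Team B` values of whichever row came first.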
Solution 2:
In R, you could sort the two columns and then drop duplicates. Here's that with data.table:
library(data.table)
DT[`Team A` < `Team B`, `:=`(
  `Team A` = `Team B`,
  `Team B` = `Team A`
)]
unique(DT)
#          DATE Team A    Team B
# 1: 1983-05-24  India Australia
The liberal application of backticks is necessary because the OP used column names with spaces. The first step can be read as:
subset to where A < B, and within that subset, swap 'em.
`:=` is the assignment operator inside a data.table, and it is being applied like a function here.
# input data
DT <- data.table(DATE = as.IDate(c("24-May-1983", "24-May-1983"), "%d-%b-%Y"),
                 `Team A` = c("India", "Australia"),
                 `Team B` = c("Australia", "India"))
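For readers who want the same swap-then-deduplicate approach in Python, here is a rough pandas sketch of the data.table logic above. It is an illustration under assumptions rather than a direct port: the frame name `matches` is hypothetical, and the dates are kept as plain strings instead of being parsed.

```python
import pandas as pd

# Same two rows as the R input data, with dates left as strings.
matches = pd.DataFrame({
    "DATE": ["24-May-1983", "24-May-1983"],
    "Team A": ["India", "Australia"],
    "Team B": ["Australia", "India"],
})

# Rows where Team A sorts before Team B are the ones to swap,
# matching the `Team A` < `Team B` subset in the data.table code.
swap = matches["Team A"] < matches["Team B"]

# Assign from a NumPy array: pandas would otherwise realign the
# right-hand side by column label and silently undo the swap.
matches.loc[swap, ["Team A", "Team B"]] = (
    matches.loc[swap, ["Team B", "Team A"]].to_numpy()
)

# After the swap both rows are identical, so one of them is dropped.
deduped = matches.drop_duplicates()
print(deduped)
```

Note that, unlike the tuple-key approach in Solution 1, this overwrites `Team A` and `Team B` in place, so the original recording order is lost.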