
Searching CSV Files With Pandas (Unique IDs) - Python

I am looking into searching a CSV file with 242,000 rows and want to sum the unique identifiers in one of the columns. The column is named 'logid' and has a number of different values.

Solution 1:

As you haven't posted your code, I can only give you an answer about the general way it would work.

  1. Load the CSV file into a pd.DataFrame using pandas.read_csv.
  2. Reduce it to one row per value (keeping the first occurrence of each) in a separate DataFrame df1 using pandas.DataFrame.drop_duplicates, like:

    df1 = df.drop_duplicates(keep="first")

--> This will return a DataFrame which contains only the first occurrence of each value. E.g. if the value 1000 appears in 5 rows, only the first of those rows is kept while the other four are dropped.

--> Applying df1.shape[0] will then give you the number of unique values, and df.shape[0] - df1.shape[0] gives the number of duplicate rows that were dropped.
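Applied to the question's 'logid' column, steps 1 and 2 might look like the minimal sketch below (the file name logs.csv is an assumption; value_counts is an extra shortcut if you want the count per identifier directly):

    import pandas as pd

    # Load the CSV file (assumed file name).
    df = pd.read_csv("logs.csv")

    # One row per unique logid value, keeping the first occurrence.
    df1 = df.drop_duplicates(subset="logid", keep="first")

    print("total rows:", df.shape[0])
    print("unique logids:", df1.shape[0])
    print("duplicate rows dropped:", df.shape[0] - df1.shape[0])

    # If you want the number of occurrences per logid directly:
    print(df["logid"].value_counts())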

3. If you want to store all rows of df that contain a "duplicate value" in separate CSV files, you can do something like this:

    df = pd.DataFrame({"A": [0, 1, 2, 3, 0, 1, 2, 5, 5]})  # This should represent your original data set
    print(df)

    # I assume the column with the duplicate values is column "A"; if you
    # want to check the whole row, just omit the subset keyword.
    df1 = df.drop_duplicates(subset="A", keep="first")
    print(df1)

    frames = []  # renamed from "list" to avoid shadowing the built-in

    for m in df1["A"]:
        mask = (df == m)
        frames.append(df[mask].dropna())

    for dfx in range(len(frames)):
        name = "file{0}.csv".format(dfx)
        frames[dfx].to_csv(r"YOUR PATH\{0}".format(name))
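As a side note, the same per-value split can be done without the explicit mask loop by using pandas.DataFrame.groupby, which yields one sub-DataFrame per value. A sketch, with the same path placeholder as above:

    import pandas as pd

    df = pd.DataFrame({"A": [0, 1, 2, 3, 0, 1, 2, 5, 5]})

    # groupby("A") returns (value, sub-DataFrame) pairs, so each group can
    # be written straight to its own file.
    for value, group in df.groupby("A"):
        group.to_csv(r"YOUR PATH\file_{0}.csv".format(value))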
