Marking Duplicates in a CSV File
Solution 1:
The only thing you have to keep in memory while this is running is a map of phone numbers to their IDs.
import csv

phone_map = {}
with open(r'c:\temp\input.csv', 'r', newline='') as fin:
    reader = csv.reader(fin)
    with open(r'c:\temp\output.csv', 'w', newline='') as fout:
        writer = csv.writer(fout)
        # copy the header row across; omit this if the file has no header row
        writer.writerow(next(reader))
        for row in reader:
            (id, name, phone, ref, discard) = row
            if phone in phone_map:
                # duplicate: point ref at the first id and mark for discard
                ref = phone_map[phone]
                discard = "YES"
            else:
                phone_map[phone] = id
            writer.writerow((id, name, phone, ref, discard))
Solution 2:
Sounds like homework. Since this is a CSV file (and thus changing the record size in place is next to impossible), you are best off loading the whole file into memory and manipulating it there before writing it out to a new file. Create a list of strings holding the original lines of the file. Then create a map, inserting the phone number (the key) and the ID (the value). Before each insert, look the number up: if it already exists, update the line containing the duplicate phone number; if it isn't already in the map, insert the (phone, id) pair. A sketch of this approach follows.
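A minimal sketch of that load-then-fix idea, assuming the five-column layout (id, name, phone, ref, discard) used in the other answers; the file names are placeholders:

import csv

# Load the whole file into memory first (file names are assumptions).
with open('input.csv', 'r', newline='') as f:
    rows = list(csv.reader(f))

seen = {}  # phone number -> id of the first row with that number
for row in rows:
    rid, phone = row[0], row[2]
    if phone in seen:
        row[3] = seen[phone]   # ref points at the first occurrence
        row[4] = "YES"         # mark the duplicate
    else:
        seen[phone] = rid

with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)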
Solution 3:
I know one thing. I know you don't have to read the entire file into memory to accomplish this.
import csv

myfile = r"C:\Users\Eduardo\Documents\TEST2.csv"
dest = csv.writer(open(r"C:\Users\Eduardo\Documents\TESTFIXED.csv", "w", newline=""), dialect="excel")

phonedict = {}
for row in csv.reader(open(myfile, "r", newline="")):
    # setdefault sets the value to the second argument if it hasn't been set,
    # and then returns what the value in the dictionary is.
    firstid = phonedict.setdefault(row[2], row[0])
    row[3] = firstid
    if firstid != row[0]:
        row[4] = "Yes"
    dest.writerow(row)
Solution 4:
from operator import itemgetter
from itertools import groupby
import csv

# materialize the reader into a list so the rows can be sorted
verum = list(csv.reader(open('data.csv', 'r', newline='')))
verum.sort(key=itemgetter(2, 0))

def grouper(verum):
    for key, grp in groupby(verum, itemgetter(2)):
        # key = phone number, grp = records with that number
        first = next(grp)
        # the first item gets its own id written into the 4th column
        yield [first[0], first[1], first[2], first[0], '']  # or list(itemgetter(0,1,2,0,4)(first))
        for x in grp:
            # all others get the first item's id as ref
            yield [x[0], x[1], x[2], first[0], "Yes"]

for line in sorted(grouper(verum), key=itemgetter(0)):
    print(line)
Outputs:
['1', 'JOHN', '12345', '1', '']
['2', 'PETER', '6232', '2', '']
['3', 'JON', '12345', '1', 'Yes']
['4', 'PETERSON', '6232', '2', 'Yes']
['5', 'ALEX', '7854', '5', '']
['6', 'JON', '12345', '1', 'Yes']
Writing the data back is left to the reader ;-)
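If you do want to write it back, one possible sketch, reusing grouper, verum, and itemgetter from the script above (the output file name is an assumption):

import csv

# Persist the fixed rows; 'data_fixed.csv' is a placeholder name.
with open('data_fixed.csv', 'w', newline='') as f:
    csv.writer(f).writerows(sorted(grouper(verum), key=itemgetter(0)))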
Solution 5:
I work with large CSV files of 40k-plus records, and the easiest way to get rid of dupes is with Access.
1. Create a new database.
2. Tables tab: Get External Data.
3. Save the table.
4. Queries tab: New > Find Duplicates Wizard (match on the phone field, show all fields and the count).
5. Save the query (export as .txt, but name it dupes.txt).
6. Import the query result as a new table; do not import the field with the dupe count.
7. Query: Find Unmatched (match by the phone field, show all fields in the result; save the query, then export as .txt but name it unique.txt).
8. Import the unique file into the existing table (dupes).
9. You can now save and export again into whatever file type you need and not have any dupes.