
Reading CSV Files In SciPy/NumPy In Python

I am having trouble reading a CSV file, delimited by tabs, in Python. I use the following function: def csv2array(filename, skiprows=0, delimiter='\t', raw_header=False, missing=No…

Solution 1:

Check out the python CSV module: http://docs.python.org/library/csv.html

import csv
reader = csv.reader(open("myfile.csv", "rb"), 
                    delimiter='\t', quoting=csv.QUOTE_NONE)

header = []
records = []
fields = 16

if thereIsAHeader: header = reader.next()

for row, record in enumerate(reader):
    if len(record) != fields:
        print "Skipping malformed record %i, contains %i fields (%i expected)" % \
            (row, len(record), fields)
    else:
        records.append(record)

# do numpy stuff.
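
As a sketch of what the final "do numpy stuff" step might be, assuming the first column is a text label (e.g. a gene name) and the remaining columns are numeric; the column layout here is an assumption, not something the answer specifies:

import numpy as np

# Assumed layout: column 0 is a label, the remaining columns are numeric values.
labels = [record[0] for record in records]
values = np.array([[float(x) for x in record[1:]] for record in records])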

Solution 2:

May I ask why you're not using the built-in csv reader? http://docs.python.org/library/csv.html

I've used it very effectively with numpy/scipy. I would share my code, but unfortunately it's owned by my employer; it should be very straightforward to write your own, though.
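
Since the answer gives no code, here is a minimal sketch of the suggestion, assuming a tab-delimited file named 'myfile.csv' with a single header row; the file name and layout are placeholders, not part of the original answer:

import csv
import numpy as np

with open('myfile.csv') as handle:
    reader = csv.reader(handle, delimiter='\t')
    header = next(reader)                  # assumed single header row
    rows = [row for row in reader if row]  # drop empty lines

# Assumes every remaining row has the same number of fields;
# cast numeric columns explicitly once the layout is known.
data = np.array(rows)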

Solution 3:

I have successfully used two approaches: (1) if I simply need to read an arbitrary CSV, I use the csv module (as pointed out by other users), and (2) if I require repeated processing of a known CSV (or any other) format, I write a simple parser.

It seems that your problem fits in the second category, and a parser should be very simple:

f = open('file.txt', 'r').readlines()
for line in f:
    tokens = line.strip().split('\t')
    gene = tokens[0]
    vals = [float(k) for k in tokens[1:10]]
    stuff = tokens[10:]
    # do something with gene, vals, and stuff

You can add a line in the reader to skip comments (`if tokens[0] == '#': continue`) or to handle blank lines (`if not line.strip(): continue`, placed before the split). You get the idea.
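
Putting those guards into the parser, a sketch might look like this (the file name, column split, and ten-value layout are carried over from the snippet above, not prescribed by the answer):

records = []
with open('file.txt') as f:
    for line in f:
        if not line.strip():             # skip blank lines
            continue
        tokens = line.strip().split('\t')
        if tokens[0].startswith('#'):    # skip comment lines
            continue
        gene = tokens[0]
        vals = [float(k) for k in tokens[1:10]]
        stuff = tokens[10:]
        records.append((gene, vals, stuff))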

Solution 4:

I think Nick T's approach would be the better way to go, with one change: I would replace the following code:

for row, record in enumerate(reader):
    if len(record) != fields:
        print "Skipping malformed record %i, contains %i fields (%i expected)" % \
            (row, len(record), fields)
    else:
        records.append(record)

with

import numpy as np

rows = list(reader)  # an iterator has no length like a list or tuple, so materialize it first
records = np.asarray([row for row in rows if len(row) == fields])
print('Number of skipped records: %i' % (len(rows) - len(records)))

Wrapping the list comprehension in np.asarray returns a NumPy array and takes advantage of pre-compiled libraries, which should speed things up greatly. Also, I would recommend using print() as a function rather than the print "" statement, since the former is the standard in Python 3, which is most likely the future, and I would use logging over print.
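
As a sketch of the logging suggestion (the logger setup below is illustrative, not part of the answer):

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

# Same report as the print() call above, routed through logging instead.
log.info('Number of skipped records: %i', len(rows) - len(records))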

Solution 5:

Likely it came from Line 27100 in your data file... and it had 12 columns instead of 16. I.e. it had:

separator,1,2,3,4,5,6,7,8,9,10,11,12,separator

And it was expecting something like this:

separator,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,separator

I'm not sure how you want to convert your data, but if you have irregular line lengths, the easiest way would be something like this:

f = open('file.txt')  # an open handle to your data file
lines = f.read().split('someseparator')
for line in lines:
    splitline = line.split(',')
    # do something with splitline
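
To get from those irregular rows to something NumPy can use, one option is to keep only the rows with the expected number of fields, a sketch assuming the 16-column layout mentioned above (padding short rows would be the other choice):

import numpy as np

expected = 16  # adjust to your data
rectangular = [row for row in (line.split(',') for line in lines)
               if len(row) == expected]
data = np.array(rectangular)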
