Why Is Pandas Read_csv Not Reading The Right Number Of Rows?

May 09, 2023

I'm trying to open part of a CSV file using pandas read_csv. The section I am opening has a header on line 746 and goes to line 1120. gr = read_csv(inputfile, header=746, nrows=374)

Solution 1:

Unless I'm reading the docs wrong, this looks like a bug in read_csv (I recommend filing an issue on GitHub!).

A workaround, since your data is smallish, is to read the lines in as a string:

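from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
import pandas as pd
with open(inputfile) as f:
    # keep only the first 1120 lines so the parser never sees the rest of the file
    df = pd.read_csv(StringIO(''.join(f.readlines()[:1120])), header=746, nrows=374)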

I tested this with the CSV you provided and it works (it doesn't raise)!
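If the file were larger, the same idea might be written without loading every line first. The following is only a sketch: head_buffer is a hypothetical helper name (not part of pandas), and it assumes the file can be opened in text mode with the default encoding.

```python
from io import StringIO
from itertools import islice


def head_buffer(path, n_lines):
    """Return a StringIO holding only the first n_lines lines of the file at path.

    head_buffer is a hypothetical helper for illustration, not a pandas function.
    """
    with open(path) as f:
        # islice stops after n_lines lines, so the rest of the file is never read
        return StringIO("".join(islice(f, n_lines)))
```

The returned buffer could then stand in for the filename, for example pd.read_csv(head_buffer(inputfile, 1120), header=746, nrows=374), mirroring the call above.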

Solution 2:

I reckon this is an off-by-one/counting (user) error! That is, pd.read_csv(inputfile, header=746, nrows=374) reads through the 1121st 1-indexed line (0-indexed line 746 + 374 rows = 0-indexed line 1120), so you should read one fewer row. I could be mistaken, but here's what I'm thinking...

In Python, line indexing (as with most Python indexing) starts at 0.

In [11]: s = 'a,b\nA,B\n1,2\n3,4\n1,2,3,4'

In [12]: for i, line in enumerate(s.splitlines()): print(i, line)
0 a,b
1 A,B
2 1,2
3 3,4
4 1,2,3,4

The usual way you think of line numbers is from 1:

In [12]: for i, line in enumerate(s.splitlines(), start=1): print(i, line)
1 a,b
2 A,B
3 1,2
4 3,4
5 1,2,3,4

In the following we are reading up to the 3rd row (with Python indexing) or the 4th (with 1-indexing):

In [13]: pd.read_csv(StringIO(s), header=1, nrows=2)  # Note: header + nrows == 3
Out[13]:
   A  B
0  1  2
1  3  4

And if we include the next line it'll raise:

In [15]: pd.read_csv(StringIO(s), header=1, nrows=3)
CParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 4
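The counting above can be condensed into a couple of small helpers. This is a sketch under two assumptions: the line numbers you start from are 1-indexed (as in a text editor), and there are no blank or commented lines before the section, since pandas ignores those when counting the header if skip_blank_lines=True (the default). The names to_read_csv_args and last_line_read are made up for illustration.

```python
def to_read_csv_args(header_line, last_line):
    """Translate 1-indexed, inclusive file line numbers into header/nrows values."""
    header = header_line - 1         # pandas counts the header row from 0
    nrows = last_line - header_line  # data rows after the header, up to last_line
    return header, nrows


def last_line_read(header, nrows):
    """1-indexed number of the last line consumed for a given header/nrows pair."""
    return header + 1 + nrows


print(to_read_csv_args(2, 4))    # toy string: header on line 2, data to line 4 -> (1, 2)
print(last_line_read(1, 2))      # the call that worked -> 4
print(last_line_read(1, 3))      # the call that raised -> 5
print(last_line_read(746, 374))  # the call from the question -> 1121
```

Under those assumptions, a header on 1-indexed line 746 with data running to line 1120 would correspond to to_read_csv_args(746, 1120), which gives (745, 374).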
