
Reading A Large Numpy Save File Iteratively (i.e. With A Generator) When The Dtype=object

I have a large numpy save file (potentially larger than fits in memory). The dtype is object (it's a numpy array of variable-length numpy arrays). Can I avoid reading the entire file into memory at once, e.g. by reading it iteratively with a generator?

Solution 1:

The basic format for a non-object dtype is a header block (with shape, dtype, order, etc.), followed by a byte copy of the array's data buffer.

In other words something akin to this sequence:

In [129]: x
Out[129]: 
array([[1, 2, 3],
       [4, 5, 6]])
In [130]: x.tostring()
Out[130]: b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00'
In [132]: np.frombuffer(__, dtype=int)
Out[132]: array([1, 2, 3, 4, 5, 6])
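
For a plain (non-object) dtype this header-plus-buffer layout is what makes piecewise reading feasible. Here's a minimal sketch (the file name is made up, and read_magic/read_array_header_1_0 are internal numpy.lib.format helpers, so treat it as exploration rather than a supported API) that parses the header by hand and then pulls the raw buffer in chunks:

import numpy as np
from numpy.lib import format as npy_format

# a small 1-d, non-object array saved the normal way
np.save('plain.npy', np.arange(10, dtype=np.int64))

with open('plain.npy', 'rb') as fp:
    npy_format.read_magic(fp)                        # e.g. (1, 0)
    shape, fortran, dtype = npy_format.read_array_header_1_0(fp)
    # everything after the header is a raw copy of the data buffer,
    # so it can be consumed in chunks instead of all at once
    remaining = shape[0]
    while remaining > 0:
        chunk = np.fromfile(fp, dtype=dtype, count=min(4, remaining))
        remaining -= len(chunk)
        print(chunk)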

But if I change the dtype to object:

In [134]: X = x.astype(object)
In [135]: X
Out[135]: 
array([[1, 2, 3],
       [4, 5, 6]], dtype=object)
In [136]: X.tostring()
Out[136]: b'`\x1bO\x08p\x1bO\x08\x80\x1bO\x08\x90\x1bO\x08\xa0\x1bO\x08\xb0\x1bO\x08'

Those data buffer bytes are pointers to locations in memory. Since these are small integers, they may point to the unique cached integer objects:

In [137]: id(1)
Out[137]: 139402080
In [138]: id(2)
Out[138]: 139402096

If the elements are instead arrays, the pointers refer to those arrays stored elsewhere in memory (to the ndarray objects themselves, not to their data buffers).

To handle objects like this, np.save uses pickle. Now the pickle of an ndarray is essentially its save string. I don't know exactly where np.save puts those strings; maybe it streams them inline, maybe it uses pointers to locations later in the file.

You/we'd have to study the np.save code (and the functions it calls) to determine how this data is saved. I've looked enough to see how several arrays can be saved to and loaded from one file, but I haven't focused on the object-dtype layout. The relevant code is in numpy/lib/npyio.py and numpy/lib/format.py.

The format.py file has a doc block describing the .npy save format.

np.save
   format.write_array

If the dtype is not object, write_array uses array.tofile(fp). If it is object, it uses pickle.dump(array, fp).

Similarly, read_array uses np.fromfile(fp, dtype) for plain dtypes and pickle.load for object dtypes.
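
To see the two branches, here's a rough sketch (again with a made-up file name and the internal header helpers) that reads an object-dtype .npy by hand; once the header says the dtype is object, the only thing left to do is hand the rest of the file to pickle.load in one go:

import pickle
import numpy as np
from numpy.lib import format as npy_format

# an object array of variable-length arrays, saved with pickling allowed
obj = np.empty(3, dtype=object)
obj[0], obj[1], obj[2] = np.arange(2), np.arange(5), np.arange(1)
np.save('obj.npy', obj, allow_pickle=True)

with open('obj.npy', 'rb') as fp:
    npy_format.read_magic(fp)
    shape, fortran, dtype = npy_format.read_array_header_1_0(fp)
    print(dtype)                      # object
    # the rest of the file is a single pickle stream; there is no
    # per-element framing we could iterate over
    arr = pickle.load(fp)
    print([a.shape for a in arr])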

So that means we'd need to delve into how pickle.dump serializes the array.
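
One quick (if crude) way to probe that is to pickle an object array directly and check whether the sub-arrays' raw bytes show up inline in the stream. With the default in-band pickle protocol they appear to, which suggests the whole thing is one monolithic blob rather than a header plus separately addressable records (protocol-5 out-of-band buffers would behave differently):

import pickle
import numpy as np

arrs = np.empty(3, dtype=object)
arrs[0], arrs[1], arrs[2] = np.arange(2), np.arange(5), np.arange(1)

buf = pickle.dumps(arrs)

# if each sub-array's raw bytes are embedded in the pickle stream,
# the nested arrays are serialized inline, not referenced elsewhere
for a in arrs:
    print(a.tobytes() in buf)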


Solution 2:

You probably want to take a look at numpy memmap.

From the official documentation:

Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. NumPy’s memmap’s are array-like objects. This differs from Python’s mmap module, which uses file-like objects.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
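
For a .npy file with a plain dtype, the convenient route is np.load with mmap_mode, which gives you a memmap so only the slices you touch are read from disk. (As far as I know this does not help with dtype=object arrays, which have to go through pickle.) A minimal sketch with a made-up file name:

import numpy as np

# a large, plain-dtype array saved with np.save
np.save('big.npy', np.arange(1_000_000, dtype=np.float64))

# mmap_mode='r' returns a numpy.memmap backed by the file on disk
mm = np.load('big.npy', mmap_mode='r')
print(mm[:5])       # reads just the first few elements
print(mm[-5:])      # and the last few, without loading the rest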

