Reading A Large Numpy Save File Iteratively (i.e. With A Generator) When The Dtype=object
Solution 1:
The basic .npy format for a non-object dtype is a header block (recording shape, dtype, order, etc.), followed by a byte copy of the array's data buffer.
In other words, the data portion is something akin to this sequence:
In [129]: x
Out[129]:
array([[1, 2, 3],
[4, 5, 6]])
In [130]: x.tostring()
Out[130]: b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00'
In [132]: np.frombuffer(__, dtype=int)
Out[132]: array([1, 2, 3, 4, 5, 6])
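The header-plus-buffer layout can be checked directly. Below is a minimal sketch using the helpers in np.lib.format (read_magic and read_array_header_1_0, which parse the magic string and the header dict) to show that everything after the header is a verbatim copy of the data buffer:

```python
import io
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
buf = io.BytesIO()
np.save(buf, x)
buf.seek(0)

np.lib.format.read_magic(buf)          # consume the magic string and version
shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(buf)
print(shape, dtype)                    # (2, 3) and the platform's int dtype

# the rest of the file is a byte copy of the data buffer
rest = buf.read()
print(rest == x.tobytes())             # True
```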
But if I change the dtype to object:
In [134]: X = x.astype(object)
In [135]: X
Out[135]:
array([[1, 2, 3],
[4, 5, 6]], dtype=object)
In [136]: X.tostring()
Out[136]: b'`\x1bO\x08p\x1bO\x08\x80\x1bO\x08\x90\x1bO\x08\xa0\x1bO\x08\xb0\x1bO\x08'
Those data buffer bytes point to locations in memory. Since these are small integers, they may point to CPython's unique cached small-integer objects:
In [137]: id(1)
Out[137]: 139402080
In [138]: id(2)
Out[138]: 139402096
If the elements instead are arrays, the buffer would point to those arrays stored elsewhere in memory (to the ndarray objects, not to their data buffers).
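The pointer interpretation can be confirmed by the element size: each slot of an object array is one pointer wide. A small sketch (tobytes is the current name for tostring):

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
X = x.astype(object)

# each element of an object array is a pointer to a Python object
print(X.itemsize)                               # pointer size, typically 8 on 64-bit builds
print(len(X.tobytes()) == X.size * X.itemsize)  # True: the buffer is just pointers
```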
To handle objects like this, np.save uses pickle. Now the pickle of an ndarray is essentially its save string. I don't know exactly where np.save puts those strings; maybe it streams them inline, maybe it uses pointers to locations later in the file. You'd have to study np.save (and the functions it calls) to determine how this data is saved. I've looked at it enough to see how several arrays can be saved to and loaded from one file, but I haven't focused on the object-dtype layout. The relevant code is in numpy/lib/npyio.py and numpy/lib/format.py.
The format.py file has a doc block describing the save format.
np.save calls format.write_array. For a non-object dtype, write_array uses array.tofile(fp); for an object dtype, it uses pickle.dump(array, fp). Similarly, read_array uses np.fromfile(fp, dtype) for plain dtypes and pickle.load for object arrays.
So that means we need to delve into how the pickle.dump of the array is done.
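This is easy to verify from outside: after the header, an object-dtype .npy file is one pickle of the whole array, which is why it cannot be memory-mapped or read element by element. A sketch (assumes the header is the common 1.0 version, which holds for small arrays):

```python
import io
import pickle
import numpy as np

X = np.array([np.arange(3), np.arange(5)], dtype=object)  # ragged object array
buf = io.BytesIO()
np.save(buf, X)

buf.seek(0)
np.lib.format.read_magic(buf)                             # skip magic and version
shape, order, dtype = np.lib.format.read_array_header_1_0(buf)
print(dtype)                                              # object

# the remainder of the stream is a single pickle.dump of the array
X2 = pickle.load(buf)
print(X2[0].tolist(), X2[1].tolist())
```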
Solution 2:
You probably want to take a look at numpy memmap.
From the official documentation:
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory. NumPy’s memmap’s are array-like objects. This differs from Python’s mmap module, which uses file-like objects.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
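A minimal sketch of the memmap route via np.load's mmap_mode (note this only works for plain dtypes, since the file must be a flat byte copy of the data buffer; the file path here is a throwaway temp file):

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'big.npy')
np.save(path, np.arange(12).reshape(3, 4))

# mmap_mode='r' returns a numpy.memmap: pages are read from disk on demand
m = np.load(path, mmap_mode='r')
print(m[1])   # only this row's bytes are touched
```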