
MemoryError when using the read() method to read a large JSON file from Amazon S3

I'm trying to import a large JSON file from Amazon S3 into AWS RDS-PostgreSQL using Python, but these errors occurred:

Traceback (most recent call last): File 'my_code.py'

Solution 1:

Significant memory savings can be had by avoiding slurping your whole input file into memory as a list of lines.

Specifically, these lines are terrible for memory usage: at their peak they hold a bytes object the size of your whole file, plus a list of lines containing the complete contents of the file again:

file_content = obj['Body'].read().decode('utf-8').splitlines(True)
for line in file_content:

For a 1 GB ASCII text file with 5 million lines, on 64 bit Python 3.3+, that's a peak memory requirement of roughly 2.3 GB for just the bytes object, the list, and the individual strs in the list. A program that needs 2.3x as much RAM as the size of the files it processes won't scale to large files.
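For a rough sense of where that 2.3 GB comes from, here is a back-of-the-envelope sketch (the per-object overheads are approximations for a 64-bit CPython build, not exact figures):

import sys

# Back-of-the-envelope sizing for a 1 GB ASCII file split into 5 million lines
n_lines       = 5_000_000
file_bytes    = 1 * 1024**3                    # the bytes object returned by read(): ~1.0 GB
str_payload   = 1 * 1024**3                    # the same characters again, inside the line strs
str_overhead  = sys.getsizeof('') * n_lines    # per-str header (~40-50 bytes each): ~0.2 GB
list_slots    = 8 * n_lines                    # one 8-byte pointer per list element: ~0.04 GB
peak = file_bytes + str_payload + str_overhead + list_slots
print('peak ~ {:.1f} GB'.format(peak / 1024**3))   # roughly 2.2-2.3 GB depending on Python version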

To fix, change that original code to:

import io

file_content = io.TextIOWrapper(obj['Body'], encoding='utf-8')
for line in file_content:

Given that obj['Body'] appears to be usable for lazy streaming, this should remove both copies of the complete file data from memory. Using TextIOWrapper means obj['Body'] is lazily read and decoded in chunks (of a few KB at a time), and the lines are iterated lazily as well; this reduces memory demands to a small, largely fixed amount (the peak memory cost would depend on the length of the longest line), regardless of file size.
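For context, here is a minimal end-to-end sketch of that pattern. The bucket name, key, and the assumption that each line holds one JSON document are illustrative, not taken from your code, and whether TextIOWrapper accepts obj['Body'] directly depends on your botocore version (see the update below):

import io
import json
import boto3

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='big-file.json')   # hypothetical bucket/key

# Wrap the streaming body so bytes are read and decoded lazily, a chunk at a time
file_content = io.TextIOWrapper(obj['Body'], encoding='utf-8')
for line in file_content:
    record = json.loads(line)   # assumes one JSON document per line
    # ... hand `record` to your RDS-PostgreSQL insert logic here ...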

Update:

It looks like StreamingBody doesn't implement the io.BufferedIOBase ABC. It does have its own documented API, though, which can be used for a similar purpose. If you can't make the TextIOWrapper do the work for you (it's much more efficient and simple if it can be made to work), an alternative would be to do:

file_content = (line.decode('utf-8') for line in obj['Body'].iter_lines())
for line in file_content:

Unlike using TextIOWrapper, it doesn't benefit from bulk decoding of blocks (each line is decoded individually), but otherwise it should still achieve the same benefits in terms of reduced memory usage.
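Either way, the lazily produced lines can be fed straight into your database import in small batches so memory stays bounded end to end. Here is a sketch with psycopg2; the connection details, table name, and jsonb column are hypothetical placeholders, not part of your setup:

import psycopg2

# Hypothetical connection parameters and table; adjust for your RDS instance
conn = psycopg2.connect(host='my-rds-endpoint', dbname='mydb', user='me', password='...')
cur = conn.cursor()

batch = []
for line in file_content:           # file_content from either lazy approach above
    batch.append((line,))           # assumes a table like: CREATE TABLE my_table (doc jsonb)
    if len(batch) >= 1000:
        cur.executemany('INSERT INTO my_table (doc) VALUES (%s)', batch)
        batch.clear()
if batch:
    cur.executemany('INSERT INTO my_table (doc) VALUES (%s)', batch)

conn.commit()
cur.close()
conn.close()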
