
Python Encoding - Could Not Decode To Utf8

I have an SQLite database that was populated by an external program. I'm trying to read the data with Python. When I attempt to read the data I get the following error: OperationalE

Solution 1:

Python is trying to be helpful by converting pieces of text (stored as bytes in the database) into a Python str object for you. To do this conversion, Python has to guess which character each byte (or group of bytes) returned by your query represents. The default guess is an encoding called UTF-8. The error you are seeing means that guess is wrong for at least some of your data.
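To make the guess concrete, here is a minimal sketch using a made-up byte string (not the asker's data). The same bytes decode cleanly under one encoding and fail under UTF-8:

```python
# Hypothetical sample: "café" encoded as Latin-1, where é is the single byte 0xE9.
raw = "café".encode("latin1")

# Decoding with the encoding the bytes were written in recovers the text.
as_latin1 = raw.decode("latin1")

# Decoding as UTF-8 fails, because a lone 0xE9 byte is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

print(raw, as_latin1, utf8_failed)
```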

The solution is to give python a little hint as to how to do the mapping from bytes to letters (i.e., unicode characters). You've already come close with the line

conn.text_factory = str

However (based on your response in the comments above), since you are using Python 3, str is already the default text factory, so that line does nothing new for you (see the docs).

What happens behind the scenes with this line is that python tries to convert the bytes returned by the query using the str function, kind of like:

your_string = str(the_bytes, 'utf-8') # actually uses `conn.text_factory`, not `str`

...but you want a different encoding where 'utf-8' is. Since you can't change the default encoding of the str function, you will have to mimic it some other way. You can use a one-off nameless function called a lambda for this:

conn.text_factory = lambda x: str(x, 'latin1')
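As a rough end-to-end sketch of that fix, the snippet below builds a throwaway in-memory database and fakes what the external program might have done. The table name, the sample value, and the use of CAST to get non-UTF-8 bytes into a TEXT column are all assumptions made for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")

# Simulate an external program writing Latin-1 bytes into a TEXT column.
# Binding bytes gives SQLite a BLOB; CAST(... AS TEXT) relabels it as text
# without validating or changing the bytes.
conn.execute(
    "INSERT INTO people VALUES (CAST(? AS TEXT))",
    ("café".encode("latin1"),),
)

# With the default text factory (str, i.e. UTF-8), reading the row fails.
try:
    conn.execute("SELECT name FROM people").fetchall()
    default_error = None
except sqlite3.OperationalError as exc:
    default_error = exc

# With a Latin-1 text factory, the same query succeeds.
conn.text_factory = lambda x: str(x, "latin1")
names = [row[0] for row in conn.execute("SELECT name FROM people")]
print(default_error)
print(names)
```

The factory is consulted when rows are fetched, so it only needs to be set before the query that reads the problem column.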

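Now when the database hands the bytes to Python, Python will map them to characters using the 'latin1' scheme instead of the 'utf-8' scheme. Of course, I don't know if latin1 is the correct encoding of your data. Note that latin1 maps every possible byte to some character, so it never raises an error; the absence of an exception does not prove the text was decoded correctly, and you will need to look at the output. Realistically, you will have to try a handful of encodings to find the right one. I would try the following first:

  • 'iso-8859-1'
  • 'utf-16'
  • 'utf-32'
  • 'latin1' (in Python this is an alias for 'iso-8859-1', so the two give identical results)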

You can find a more complete list here.
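Trying encodings by hand gets tedious, so here is a small sketch of one way to compare candidates on a single raw value. The helper name try_encodings and the sample bytes are hypothetical; in practice you would feed it a value read with the text factory set to bytes:

```python
def try_encodings(raw, candidates):
    """Return {encoding: decoded text, or None if decoding failed}."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# Hypothetical sample value standing in for one problem cell.
sample = "café".encode("latin1")
results = try_encodings(sample, ["utf-8", "latin1"])
for encoding, text in results.items():
    print(encoding, repr(text))
```

A result of None rules an encoding out, but a successful decode only makes it a candidate. You still have to read the decoded text and judge whether it looks right.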

Another option is to simply let the bytes coming out of the database remain as bytes. Whether this is a good idea for you depends on your application. You can do it by setting:

conn.text_factory = bytes
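For illustration, a minimal sketch of what that looks like, using an in-memory database and a literal value as stand-ins for real data. Text columns come back as bytes, and you decode them yourself whenever you are ready:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.text_factory = bytes

# TEXT values are now returned undecoded.
raw_value = conn.execute("SELECT 'hello'").fetchone()[0]
print(raw_value)

# Decode later, with whatever encoding turns out to be correct.
text = raw_value.decode("utf-8")
print(text)
```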

Solution 2:

If the text in the database is actually mostly encoded in UTF-8, but you're still seeing this error (Could not decode to UTF-8), then the problem may be that one or more rows contain bogus data that is not valid UTF-8. By default, Python's bytes.decode() method raises a UnicodeDecodeError when it sees bytes like that. If you are in this situation and want to simply ignore these errors (the undecodable bytes are silently dropped), you can set up a text_factory like this:

import sqlite3

conn = sqlite3.connect('my-database.db')
# Decode as UTF-8 (the default for decode) and drop any bytes that are invalid
conn.text_factory = lambda b: b.decode(errors='ignore')
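If silently losing bytes is a concern, a possible variation (my suggestion, not part of the original answer) is errors='replace', which leaves a visible U+FFFD marker where each bad byte was. The sample bytes below are made up for the comparison:

```python
# Hypothetical value: mostly ASCII, with one stray Latin-1 byte (0xE9).
bad = b"caf\xe9 au lait"

# 'ignore' drops the invalid byte without a trace.
ignored = bad.decode(errors="ignore")

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = bad.decode(errors="replace")

print(repr(ignored))
print(repr(replaced))
```

The same keyword works inside the text factory, for example lambda b: b.decode(errors='replace').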
