Python 3 and b'\x92'.decode('latin1')
Solution 1:
My mistake was expecting byte 0x92 to decode to "RIGHT SINGLE QUOTATION MARK" in latin-1; it doesn't. The confusion arose because the byte appeared in a file that was declared as latin1-encoded. It now appears that the file was actually encoded in windows-1252. This is apparently a common source of confusion:
http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
If the byte is decoded with the correct encoding, the expected result is obtained:
>>> b'\x92'.decode('windows-1252').encode('ascii', 'namereplace')
b'\\N{RIGHT SINGLE QUOTATION MARK}'
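To see how far the two encodings actually diverge, a quick sketch like the following compares every byte value under both codecs. The variable names are my own, and errors='replace' is there only because Python's windows-1252 codec leaves a handful of bytes undefined.

```python
# Collect every byte value that decodes differently under the two codecs.
differing = []
for value in range(256):
    raw = bytes([value])
    as_latin1 = raw.decode('latin1')
    # errors='replace' avoids an exception for bytes the codec leaves undefined
    as_cp1252 = raw.decode('windows-1252', errors='replace')
    if as_latin1 != as_cp1252:
        differing.append(value)

print(f'{len(differing)} bytes differ: 0x{differing[0]:02x}..0x{differing[-1]:02x}')
```

The differences are confined to the 0x80-0x9f range, where latin1 has control characters and windows-1252 has printable punctuation such as the curly quotes.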
Solution 2:
I was expecting b'\x92'.decode('latin1') to result in U+2018
latin1 is an alias for ISO-8859-1. In that encoding, byte 0x92 maps to character U+0092, an unprintable control character.
The encoding you might have really meant is windows-1252, the Microsoft Western code page based on it. In that encoding, 0x92 is U+2019 (RIGHT SINGLE QUOTATION MARK), which is close to the U+2018 (LEFT SINGLE QUOTATION MARK) you expected.
(Further befuddlement arises because, for historical reasons, web browsers also conflate the two. When a web page is served as charset=iso-8859-1, web browsers actually use windows-1252.)
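As a rough illustration of the above, the sketch below decodes the same byte both ways. It also shows one way to repair text that was already decoded as latin1 by mistake. The helper name redecode_as_cp1252 is hypothetical, and the approach assumes the original bytes really were windows-1252; it will raise an error on the few byte values that codec leaves undefined.

```python
raw = b'\x92'

as_latin1 = raw.decode('latin1')        # U+0092, a control character
as_cp1252 = raw.decode('windows-1252')  # U+2019, the curly apostrophe

print(f'latin1:       U+{ord(as_latin1):04X}')
print(f'windows-1252: U+{ord(as_cp1252):04X}')


def redecode_as_cp1252(text):
    """Hypothetical helper: undo a mistaken latin1 decode, then decode as windows-1252.

    latin1 maps every byte to the code point with the same value, so encoding
    back to latin1 recovers the original bytes exactly.
    """
    return text.encode('latin1').decode('windows-1252')


fixed = redecode_as_cp1252('It\x92s')
print(fixed)
```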
Solution 3:
I just want to make clear that you're not encoding anything here.
'\xa3' has an ordinal value of 163 (0xa3 in hexadecimal). Since that ordinal does not fit in seven bits, it can't be encoded in ASCII. Your error handler just replaces the Unicode character with the name of the character. Unicode code point 163 maps to £.
'\x92', on the other hand, has an ordinal value of 146. According to this Wikipedia article, the character isn't printable: it's a private-use control code (PU2) in the C1 control block. It has no Unicode name, which explains why the handler falls back to the literal '\\x92'.
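The behaviour of the error handler can be checked directly. This is a small sketch with variable names of my own choosing; it relies on the 'namereplace' handler falling back to a backslash escape when a character has no name, which is how CPython behaves as far as I know.

```python
pound = '\xa3'    # U+00A3, has a Unicode name
control = '\x92'  # U+0092, a control character with no name

# 'namereplace' substitutes \N{...} where a name exists,
# and falls back to a backslash escape otherwise.
named = pound.encode('ascii', 'namereplace')
unnamed = control.encode('ascii', 'namereplace')

print(named)
print(unnamed)
```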
As an aside, if you need the name of the character, it's much better to do it like this:
import unicodedata
print(unicodedata.name('\xa3'))  # POUND SIGN
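One caveat: unicodedata.name raises ValueError for characters that have no name, such as control characters, unless a default is supplied. A minimal sketch that handles both cases might look like this (describe is a hypothetical helper, not part of the standard library):

```python
import unicodedata


def describe(ch):
    """Hypothetical helper: return the code point plus its name, if it has one."""
    # Passing a default as the second argument avoids ValueError for unnamed characters.
    name = unicodedata.name(ch, None)
    if name is None:
        return f'U+{ord(ch):04X} (no name, category {unicodedata.category(ch)})'
    return f'U+{ord(ch):04X} {name}'


for ch in ('\xa3', '\x92', '\u2019'):
    print(describe(ch))
```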