
Getting Ready To Convert From Python 2.x To 3.x

As we all know by now (I hope), Python 3 is slowly beginning to replace Python 2.x. Of course it will be many, MANY years before most of the existing code is finally ported.

Solution 1:

The biggest problem that cannot be adequately addressed by micro-level changes and 2to3 is the change of the default string type from bytes to Unicode.

If your code needs to do anything with encodings and byte I/O, it's going to need a bunch of manual effort to convert correctly, so that things that have to be bytes remain bytes, and are decoded appropriately at the right stage. You'll find that some string methods (in particular format()) and library calls require Unicode strings, so you may need extra decode/encode cycles just to use the strings as Unicode even if they're really just bytes.
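The decode/use/encode cycle described above can be sketched in miniature. This is an illustrative example, not from the original answer; the UTF-8 choice and the `raw` payload are assumptions standing in for whatever your byte I/O actually produces.

```python
# A minimal sketch of the decode/use/encode cycle: bytes arriving from I/O
# must be decoded before text-only operations like str.format(), then
# re-encoded before going back out over a byte-based interface.
raw = b"user=alice"          # pretend this arrived from a socket or file

text = raw.decode("utf-8")   # bytes -> str (the encoding here is an assumption)
message = "received: {}".format(text)   # format() wants a text string

wire = message.encode("utf-8")          # str -> bytes before writing back out
print(wire)                             # b'received: user=alice'
```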

This is not helped by the fact that some of the Python standard library modules have been crudely converted using 2to3 without proper attention to bytes/unicode/encoding issues, and so themselves make mistakes about what string type is appropriate. Some of this is being thrashed out, but at least from Python 3.0 to 3.2 you will face confusing and potentially buggy behaviour from packages like urllib, email and wsgiref that need to know about byte encodings.

You can ameliorate the problem by being careful every time you write a string literal. Use u'' strings for anything that's inherently character-based, b'' strings for anything that's really bytes, and '' for the ‘default string’ type where it doesn't matter or you need to match a library call's string use requirements.

Unfortunately the b'' syntax was only introduced in Python 2.6, so doing this cuts off users of earlier versions.
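For reference, here is what the three literal forms look like side by side. The snippet is shown under Python 3.3+ (where the `u''` prefix became legal again); under Python 3 the `u''` and `''` forms are the same type, and only `b''` differs.

```python
# The three string-literal forms, as seen by Python 3.3+:
text    = u'café'     # inherently character-based data
blob    = b'\x89PNG'  # really bytes (e.g. a file signature)
default = 'café'      # the 'default string' type -- Unicode in Python 3

print(type(text), type(blob), type(default))
# In Python 3, u'' and '' are identical; in Python 2 it was '' and b''
# that were identical, which is exactly why the discipline above helps.
```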

eta:

what's the difference?

Oh my. Well...

A byte contains a value in the range 0–255, and may represent a load of binary data (e.g. the contents of an image) or some text, in which case there has to be a standard chosen for how to map a set of characters into those bytes. Most of these ‘encoding’ standards map the normal ‘ASCII’ character set into the bytes 0–127 in the same way, so it's generally safe to use byte strings for ASCII-only text processing in Python 2.
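You can see that shared ASCII subset directly — the same text produces identical bytes under several common encodings:

```python
# ASCII-only text encodes to the same bytes under most common encodings,
# which is why byte-string ASCII processing was safe in Python 2.
s = "hello"
assert s.encode("ascii") == s.encode("latin-1") == s.encode("utf-8") == b"hello"
print(s.encode("ascii"))   # b'hello'
```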

If you want to use any of the characters outside the ASCII set in a byte string, you're in trouble, because each encoding maps a different set of characters into the remaining byte values 128–255, and most encodings can't map every possible character to bytes. This is the source of all those problems where you load a file from one locale into a Windows app in another locale and all the accented or non-Latin letters change to the wrong ones, making an unreadable mess. (aka ‘mojibake’.)
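Mojibake is easy to reproduce in two lines: encode text under one standard and decode it under another, and the non-ASCII characters come out wrong.

```python
# Mojibake in miniature: the two UTF-8 bytes of 'é' are read back as two
# separate Latin-1 characters.
original = "café"
wrong = original.encode("utf-8").decode("latin-1")
print(wrong)   # cafÃ©
```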

There are also ‘multibyte’ encodings, which try to fit more characters into the available space by using more than one byte to store each character. These were introduced for East Asian locales, as there are so very many Chinese characters. But there's also UTF-8, a better-designed modern multibyte encoding which can accommodate every character.

If you are working on byte strings in a multibyte encoding—and today you probably will be, because UTF-8 is very widely used; really, no other encoding should be used in a modern application—then you've got even more problems than just keeping track of what encoding you're playing with. len() is going to be telling you the length in bytes, not the length in characters, and if you start indexing and altering the bytes you're very likely to break a multibyte sequence in two, generating an invalid sequence and generally confusing everything.
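Both failure modes described above — byte counts masquerading as character counts, and slices cutting a multibyte sequence in half — are easy to demonstrate with UTF-8:

```python
# len() on a multibyte byte string counts bytes, not characters, and a
# careless slice can cut a UTF-8 sequence in two.
s = "naïve"
b = s.encode("utf-8")

print(len(s))   # 5 characters
print(len(b))   # 6 bytes -- 'ï' takes two bytes in UTF-8

try:
    b[:3].decode("utf-8")   # the slice lands mid-way through 'ï'
except UnicodeDecodeError as e:
    print("broken sequence:", e)
```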

For this reason, Python 1.6 and later have native Unicode strings (spelled u'something'), where each unit in the string is a character, not a byte. You can len() them, slice them, replace them, regex them, and they'll always behave appropriately. For text processing tasks they are indubitably better, which is why Python 3 makes them the default string type (without having to put a u before the '').

The catch is that a lot of existing interfaces, such as filenames on OSes other than Windows, or HTTP, or SMTP, are primarily byte-based, with a separate way of specifying the encoding. So when you are dealing with components that need bytes you have to take care to encode your unicode strings to bytes correctly, and in Python 3 you will have to do it explicitly in some places where before you didn't need to.
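The pattern that follows from this is to keep text as Unicode internally and convert only at the byte-oriented edge. A small sketch, with `io.BytesIO` standing in for a real file or socket:

```python
import io

# Encode at the boundary: text stays Unicode inside the program and is
# converted to bytes only at the byte-based interface.
payload = "Grüße"                      # processed internally as characters
stream = io.BytesIO()                  # stand-in for a file/socket
stream.write(payload.encode("utf-8"))  # explicit str -> bytes on the way out

received = stream.getvalue().decode("utf-8")  # bytes -> str on the way in
print(received)   # Grüße
```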

It is an internal implementation detail that Unicode strings take ‘two bytes’ of storage per unit internally. You never get to see that storage; you shouldn't think of it in terms of bytes. The units you are working on are conceptually characters, regardless of how Python chooses to represent them in memory.

...aside:

This isn't quite true. On ‘narrow builds’ of Python like the Windows build, each unit of a Unicode string is not technically a character, but a UTF-16 ‘code unit’. For characters in the Basic Multilingual Plane (0x0000–0xFFFF) you won't notice any difference, but if you're using characters from outside this 16-bit range, those in the ‘astral planes’, you'll find they take two units instead of one, and, again, you risk splitting a character when you slice them.

This is pretty bad, and has happened because Windows (and others, such as Java) settled on UTF-16 as an in-memory storage mechanism before Unicode grew beyond the 65,000-character limit. However, use of these extended characters is still pretty rare, and anyone on Windows will be used to them breaking in many applications, so it's likely not critical for you.

On ‘wide builds’, Unicode strings are made of real character ‘code point’ units, so even the extended characters outside of the BMP can be handled consistently and easily. The price to pay for this is efficiency: each string unit takes up four bytes of storage in memory.
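(A present-day footnote: since Python 3.3 every build behaves like a wide build, so `len()` always counts real code points. You can still see where a narrow build would have used two units by looking at the UTF-16 encoding of an astral-plane character.)

```python
# An astral-plane character: one code point, but two UTF-16 code units
# (a surrogate pair) -- which is what a narrow build stored internally.
emoji = "\U0001F600"

print(len(emoji))                            # 1 code point on a wide build
print(len(emoji.encode("utf-16-le")) // 2)   # 2 UTF-16 code units
```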
