
Command-line Arguments As Bytes Instead Of Strings In Python3

I'm writing a Python 3 program that gets the names of files to process from command-line arguments. I'm confused about the proper way to handle different encodings. I t

Solution 1:

"When I receive a filename argument which is invalid, however, it is handed to me as a unicode string with strange characters like \udce8."

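Those are lone surrogate code points. Python decodes command-line arguments with the surrogateescape error handler, which maps each undecodable byte 0xNN (0x80 to 0xFF) to the code point U+DCNN. The low 8 bits are therefore the original invalid byte: \udce8 stands for the byte 0xE8.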

See PEP 383: Non-decodable Bytes in System Character Interfaces.
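To make the round trip concrete, here is a minimal sketch of getting the original bytes back from such an argument. The helper names and the sample argument "caf\udce8.txt" are made up for illustration, and UTF-8 is assumed as the encoding. On POSIX systems, os.fsencode applies the same error handler using the real filesystem encoding, so it is usually the more convenient call for actual sys.argv values.

```python
import os


def to_original_bytes(arg, encoding="utf-8"):
    """Turn a str that may contain escaped surrogates back into raw bytes.

    Hypothetical helper; 'utf-8' is an assumed default encoding.
    """
    return arg.encode(encoding, "surrogateescape")


def from_original_bytes(raw, encoding="utf-8"):
    """Decode bytes the way PEP 383 describes, escaping undecodable bytes."""
    return raw.decode(encoding, "surrogateescape")


def argv_as_bytes(args):
    """Convert command-line arguments to bytes with the filesystem encoding."""
    return [os.fsencode(a) for a in args]


# Hypothetical argument as Python would present an invalid 0xE8 byte.
arg = "caf\udce8.txt"
raw = to_original_bytes(arg)
print(raw)                          # b'caf\xe8.txt'
print(hex(ord("\udce8") & 0xFF))    # 0xe8
```

In a real program you would call something like argv_as_bytes(sys.argv[1:]) and pass the resulting bytes straight to open or os.stat, both of which accept bytes paths.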

Solution 2:

Don't go against the grain: filenames are strings, not bytes.

You shouldn't use bytes where you should use a string. A bytes object is an immutable sequence of integers in the range 0 to 255. A string is an immutable sequence of Unicode characters. They are different concepts. What you're doing is like using an integer when you should use a boolean.
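A quick illustration of that difference, using a made-up sample value: iterating over bytes yields integers, while iterating over the decoded string yields characters, and the two need not even have the same length.

```python
# Sample value chosen for illustration: "café" encoded as UTF-8.
data = b"caf\xc3\xa9"
text = data.decode("utf-8")

print(list(data))            # [99, 97, 102, 195, 169]
print(list(text))            # ['c', 'a', 'f', 'é']
print(len(data), len(text))  # 5 4
```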

(Aside: Python holds every string in memory as Unicode, so all strings are stored the same way. An encoding specifies how Python converts bytes, such as those in a file, to and from this in-memory form.)

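Python treats the filename encoding as system-wide: sys.getfilesystemencoding() reports a single encoding, and functions like open use it by default when given a string filename. I'm surprised you say that some filenames have different encodings, although it can happen: on most Unix systems the kernel stores filenames as raw bytes, so a name created under a different encoding may fail to decode, and Python then represents the offending bytes as surrogate characters.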
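As a rough sketch of how the string and bytes forms of a path relate: open accepts either, and os.fsencode / os.fsdecode convert between them using the filesystem encoding. The helper name and the file name "example.txt" are placeholders for illustration.

```python
import os
import sys
import tempfile


def read_via_both_path_types(str_path):
    """Read the same file once through a str path and once through a bytes path."""
    bytes_path = os.fsencode(str_path)  # str -> bytes using the filesystem encoding
    with open(str_path, "rb") as f:
        via_str = f.read()
    with open(bytes_path, "rb") as f:
        via_bytes = f.read()
    return via_str, via_bytes


# The single encoding Python uses for filenames on this system.
print(sys.getfilesystemencoding())

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")  # placeholder file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello")
    via_str, via_bytes = read_via_both_path_types(path)

print(via_str == via_bytes)
```

So you can keep filenames as strings throughout the program and only drop to bytes, via os.fsencode, at the point where some other interface demands them.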
