Removing Right-to-left Mark And Other Unicode Characters From Input In Python
Solution 1:
In a hard-to-read comment on another answer, the OP gives an example that appears to start like this:
comment = comment.encode('ascii', 'ignore')
comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
With the two statements in this order, that would of course raise a different error (a NameError, since the first statement reads comment but only the second one binds that name), so let's assume the two lines are interchanged, as follows:
comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.encode('ascii', 'ignore')
This would indeed cause the error the OP seems to describe in that hard-to-read comment, but for a different reason: comment is a byte string (no leading u before the opening quote), while .encode applies to a unicode string. Python 2 therefore first tries to build a temporary unicode object from that byte string with the default codec, ascii, and that fails with a UnicodeDecodeError because the string is full of non-ASCII bytes.
Inserting the leading u in that literal would work:
comment = u'\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.encode('ascii', 'ignore')
This of course leaves comment empty, since all of its characters are non-ASCII and get ignored. Alternatively, for example if the original byte string comes from some other source rather than a literal:
comment = '\xc3\xa4\xc3\xb6\xc3\xbc'
comment = comment.decode('latin-1')
comment = comment.encode('ascii', 'ignore')
Here, the second statement explicitly builds the unicode object with a codec that seems applicable to this example (just a guess, of course: you can't tell with certainty which codec is supposed to apply from a bare byte string alone). The third statement then, again, removes all non-ASCII characters (and again leaves comment empty).
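The decode-then-encode sequence above can be sketched as a small helper. This is only an illustration, assuming Python 3 (where bytes and str are separate types and there is no implicit ascii decoding); the name to_ascii and its default codec are hypothetical and not part of the original answer.

```python
# Sketch for Python 3, where bytes and str are distinct types.
# to_ascii is a hypothetical helper name, not something from the original answer.
def to_ascii(raw, encoding="utf-8"):
    """Decode raw bytes with a caller-supplied codec, then drop non-ASCII characters."""
    text = raw.decode(encoding)                    # bytes -> str; the codec is the caller's guess
    ascii_bytes = text.encode("ascii", "ignore")   # silently drops everything outside ASCII
    return ascii_bytes.decode("ascii")             # back to str for convenience


# Every character of the decoded string is non-ASCII, so nothing survives.
only_accents = to_ascii(b"\xc3\xa4\xc3\xb6\xc3\xbc", "latin-1")
# The same kind of bytes read as UTF-8, surrounded by ASCII letters.
mixed = to_ascii(b"a\xc3\xa4b")
```

The codec is passed in explicitly because, as noted above, a bare byte string does not say which encoding produced it.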
Solution 2:
If you simply want to restrict the characters to those of a certain character set, you could encode the string in that character set and just ignore encoding errors:
>>> uc = u'aäöüb'
>>> uc.encode('ascii', 'ignore')
'ab'
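As a rough sketch of the same idea with a choice of character set, the hypothetical helper below (restrict_to_charset is an illustrative name, and Python 3 is assumed) round-trips the text through the target codec. With 'latin-1' the right-to-left mark is dropped while the umlauts survive; with 'ascii' both are dropped.

```python
# Sketch for Python 3; restrict_to_charset is a hypothetical name used for illustration.
def restrict_to_charset(text, charset):
    """Keep only the characters that the given codec can represent."""
    # Characters the codec cannot encode are skipped because of 'ignore'.
    return text.encode(charset, "ignore").decode(charset)


# 'a', RIGHT-TO-LEFT MARK, 'ä', 'ö', 'ü', 'b'
sample = "a\u200f\u00e4\u00f6\u00fcb"
ascii_only = restrict_to_charset(sample, "ascii")
latin1_only = restrict_to_charset(sample, "latin-1")
```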
Solution 3:
It's hard to guess the set of characters you want to remove from your Unicode strings. Could it be that they are all “Other, Format” characters (Unicode category Cf)? If so, you can do:
import unicodedata
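# Keep every character whose category is not 'Cf' (Other, Format).
# The join also works where filter() returns an iterator instead of a string.
your_unicode_string = u''.join(
    c for c in your_unicode_string
    if unicodedata.category(c) != 'Cf')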
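The category filter above could be wrapped in a function along the following lines. This is a minimal sketch assuming Python 3; strip_format_chars is a hypothetical name. Note that category Cf covers more than the directional marks (the zero-width joiner, for example), so this may remove characters that matter in some scripts.

```python
import unicodedata


# Sketch for Python 3; strip_format_chars is a hypothetical helper name.
def strip_format_chars(text):
    """Remove every character in Unicode category Cf (Other, Format)."""
    return "".join(c for c in text if unicodedata.category(c) != "Cf")


# U+200F (right-to-left mark) and U+200E (left-to-right mark) are both Cf;
# the space and the 'ä' are not, so they are kept.
cleaned = strip_format_chars("abc\u200f \u00e4\u200edef")
```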
Solution 4:
"example".replace(u'\u200e', '')
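You can remove specific characters by their code points with the .replace() method, which returns a new string. Note that U+200E, used above, is the left-to-right mark; the right-to-left mark is U+200F.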
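When more than one code point has to go, chaining .replace() calls gets unwieldy. One possible alternative, sketched here under the assumption of Python 3, is str.translate with a table that maps each unwanted code point to None. The names MARKS_TO_DROP and remove_marks are hypothetical.

```python
# Sketch for Python 3; MARKS_TO_DROP and remove_marks are hypothetical names.
# str.translate deletes any character whose code point maps to None.
MARKS_TO_DROP = {
    0x200E: None,  # LEFT-TO-RIGHT MARK
    0x200F: None,  # RIGHT-TO-LEFT MARK
}


def remove_marks(text):
    """Return text with the listed directional marks removed."""
    return text.translate(MARKS_TO_DROP)


cleaned = remove_marks("ex\u200eam\u200fple")
```

Unlike the category-based filter, this removes only the code points listed in the table, so everything else in the string is left untouched.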