
Python-re.sub() And Unicode

I want to replace all emoji with '' but my regex doesn't work. For example, content = u'?\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u

Solution 1:

You won't be able to recognize properly decoded Unicode codepoints that way (as strings containing \uXXXX, etc.). Once properly decoded, by the time the regex parser gets to them, each is a single character (or, on a 16-bit build, a surrogate pair), not the escape text you typed.
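To make the difference concrete, here is a minimal sketch. The pattern `escape_text` is a hypothetical stand-in for a regex that looks for the escape text itself; the question's actual regex is not shown, so treat it as an assumption.

```python
import re

decoded = u'\U0001f633'   # the decoded emoji character
literal = r'\U0001f633'   # raw string: backslash, "U", and eight hex digits

print(len(literal))       # 10

# Hypothetical pattern that searches for the escape *text* "\U" + 8 hex digits.
escape_text = re.compile(r'\\U[0-9a-fA-F]{8}')

print(escape_text.search(literal) is not None)   # True: the text is there
print(escape_text.search(decoded) is not None)   # False: only the character is there
```

In other words, the pattern has to describe the characters themselves (or ranges of them), not the way they are spelled in source code.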

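Depending on whether your Python was compiled with only 16-bit Unicode code units (a "narrow" build) or not, you'll want a pattern something like one of the following. Python 3.3 and later always support the full codepoint range, so only the second pattern applies there.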

# 16-bit codepoints
re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

# 32-bit codepoints
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')

And your code would look like:

import re

# Pick a pattern, adjust as necessary
#re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')

content= u'[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633]'
print(content)

stripped = re_strip.sub('', content)
print(stripped)

Both expressions reduce the stripped string to 26 characters.

These expressions strip out the emoji you were after, but they also strip every other character outside the Basic Multilingual Plane, which may include things you want to keep. It may be worth reviewing a Unicode codepoint range listing and narrowing the ranges.
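As a rough sketch of what a narrower pattern could look like, the following targets a few emoji-heavy Unicode blocks instead of everything above U+FFFF. The choice of blocks is my assumption for illustration and is not exhaustive; check it against a current listing. It also assumes a build with full codepoint support (any Python 3.3+).

```python
import re

# Assumed, non-exhaustive set of emoji-related blocks.
re_emoji = re.compile(
    u'['
    u'\U0001F300-\U0001F5FF'  # Miscellaneous Symbols and Pictographs
    u'\U0001F600-\U0001F64F'  # Emoticons
    u'\U0001F680-\U0001F6FF'  # Transport and Map Symbols
    u'\U0001F900-\U0001F9FF'  # Supplemental Symbols and Pictographs
    u']'
)

# U+20000 is a CJK ideograph outside the BMP; the broad pattern would remove it.
sample = u'\u86cb\u767d12\U0001f633\uff0c\U00020000'
cleaned = re_emoji.sub(u'', sample)

print(cleaned == u'\u86cb\u767d12\uff0c\U00020000')
```

The emoji is removed while the supplementary-plane ideograph survives, which is the trade-off the broad `[\U00010000-\U0010FFFF]` range cannot make.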

You can determine whether your python install will only recognize 16-bit codepoints by doing something like:

import sys
print(sys.maxunicode.bit_length())

If this displays 16, you'll need the first regex expression. If it displays something greater than 16 (for me it says 21), the second one is what you want.

Neither expression will work when used on a python install with the wrong sys.maxunicode.
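If the same script has to run on both kinds of install, one option is to pick the pattern at runtime from sys.maxunicode. This is only a sketch of that idea; the sample string `a...b` is made up for the check.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Full-range build: one codepoint per character.
    re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
else:
    # 16-bit build: supplementary characters are stored as surrogate pairs.
    re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

result = re_strip.sub(u'', u'a\U0001f633b')
print(result == u'ab')
```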


