Python \ufffd After Replacement With Chinese Content

March 23, 2024 Post a Comment

After we found the answer to this question we are faced with next unusual replacement behavior: Our regex is: [\\(（\\[{【]+(\\w+|\\s+|\\S+|\\W+)?[）\\)\\]}】]+ We are trying

Solution 1:

You should not mix UTF-8 and regular expressions. Process all your text as Unicode. Make sure you decoded both the regex and the input string to unicode values first:

>>> import re
>>> columnString = '\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4\xbc\xa4\xe7\xa7\x91\xe5\xa6\x99\xe6\x96\xb9(\xe7\xac\xac\xe5\x9b\x9b\xe7\x89\x88)'>>> regex = '[\\(\xef\xbc\x88\\[{\xe3\x80\x90]+(\\w+|\\s+|\\S+|\\W+)?[\xef\xbc\x89\\)\\]}\xe3\x80\x91]+'>>> utf8_compiled = re.compile(regex, flags=re.I)
>>> utf8_compiled.sub('', columnString)
'\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4'>>> print utf8_compiled.sub('', columnString).decode('utf8', 'replace')
当代骨�
>>> unicode_compiled = re.compile(regex.decode('utf8'), flags=re.I | re.U)
>>> unicode_compiled.sub('', columnString.decode('utf8'))
u'\u5f53\u4ee3\u9aa8\u4f24\u79d1\u5999\u65b9'>>> print unicode_compiled.sub('', columnString.decode('utf8'))
当代骨伤科妙方
>>> print unicode_compiled.sub('', u'物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))')
物理化学名校考研真题详解

When using UTF-8 in your pattern consists of separate bytes for the 【 codepoint:

>>> '【''\xe3\x80\x90'

which means your character class matches any of those bytes; \xe3, or \x80 or \x90 are each separately valid bytes in that character class.

Solution 2:

Decode your string first , and you can get rid of that � ( \ufffd) character .

In [1]: import re
   ...: subject = '物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))'.decode('utf-8')
   ...: reobj = re.compile(r"[\(（\[{【]+(\w+|\s+|\S+|\W+)?[）\)\]}】]+", re.IGNORECASE | re.MULTILINE)
   ...: result = reobj.sub("", subject)
   ...: print result
   ...: 
物理化学名校考研真题详解

Python Playground

Python \ufffd After Replacement With Chinese Content

Solution 1:

Solution 2:

Post a Comment for "Python \ufffd After Replacement With Chinese Content"