Skip to content Skip to sidebar Skip to footer

Python \ufffd After Replacement With Chinese Content

After we found the answer to this question we are faced with next unusual replacement behavior: Our regex is: [\\((\\[{【]+(\\w+|\\s+|\\S+|\\W+)?[)\\)\\]}】]+ We are trying

Solution 1:

You should not mix UTF-8 and regular expressions. Process all your text as Unicode. Make sure you decoded both the regex and the input string to unicode values first:

>>> import re
>>> columnString = '\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4\xbc\xa4\xe7\xa7\x91\xe5\xa6\x99\xe6\x96\xb9(\xe7\xac\xac\xe5\x9b\x9b\xe7\x89\x88)'>>> regex = '[\\(\xef\xbc\x88\\[{\xe3\x80\x90]+(\\w+|\\s+|\\S+|\\W+)?[\xef\xbc\x89\\)\\]}\xe3\x80\x91]+'>>> utf8_compiled = re.compile(regex, flags=re.I)
>>> utf8_compiled.sub('', columnString)
'\xe5\xbd\x93\xe4\xbb\xa3\xe9\xaa\xa8\xe4'>>> print utf8_compiled.sub('', columnString).decode('utf8', 'replace')
当代骨�
>>> unicode_compiled = re.compile(regex.decode('utf8'), flags=re.I | re.U)
>>> unicode_compiled.sub('', columnString.decode('utf8'))
u'\u5f53\u4ee3\u9aa8\u4f24\u79d1\u5999\u65b9'>>> print unicode_compiled.sub('', columnString.decode('utf8'))
当代骨伤科妙方
>>> print unicode_compiled.sub('', u'物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))')
物理化学名校考研真题详解 

When using UTF-8 in your pattern consists of separate bytes for the codepoint:

>>> '【''\xe3\x80\x90'

which means your character class matches any of those bytes; \xe3, or \x80 or \x90 are each separately valid bytes in that character class.

Solution 2:

Decode your string first , and you can get rid of that � ( \ufffd) character .

In [1]: import re
   ...: subject = '物理化学名校考研真题详解 (理工科考研辅导系列(化学生物类))'.decode('utf-8')
   ...: reobj = re.compile(r"[\((\[{【]+(\w+|\s+|\S+|\W+)?[)\)\]}】]+", re.IGNORECASE | re.MULTILINE)
   ...: result = reobj.sub("", subject)
   ...: print result
   ...: 
物理化学名校考研真题详解 

Post a Comment for "Python \ufffd After Replacement With Chinese Content"