Python Regular Expression With Utf8 Issue

March 02, 2024

I have a file that contains many lines of plain UTF-8 text, such as the one below (it's Chinese, by the way). PROCESS：类型：关爱积分[NOTIFY] 交易号：2012022900000109 订单...

Solution 1:

There are several issues with your code. First, you should build the pattern from a unicode raw literal: re.compile(ur'<unicode string>'). (The ur prefix is Python 2 syntax.) It is also nice to add the re.UNICODE flag, although I am not sure it is really needed here. Next, you still will not get a match, because \d+ matches only a run of digits and stops at the decimal point in 0.01. Use \d+\.?\d+ instead: digits, an optional dot, then more digits. Note that this pattern needs at least two digits, so a single-digit amount such as 5 would not match. Example code:

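#coding: utf-8
# Python 2 code: the ur'' prefix and the print statement are not valid in Python 3.
import re

text = u"PROCESS：类型：关爱积分[NOTIFY]   交易号：2012022900000109   订单号：W12022910079166    交易金额：0.01元    交易状态：true 2012-2-29 10:13:08"
# Capture the amount between the "交易金额：" label and the "元" suffix.
pattern = re.compile(ur'交易金额：(\d+\.?\d+)元', re.UNICODE)

print pattern.search(text).group(1)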
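
For readers on Python 3, where every str is already Unicode and the ur'' prefix no longer exists, the same idea might be sketched as below. The names AMOUNT_RE and extract_amount, and the stricter pattern \d+(?:\.\d+)? (which also accepts a single-digit amount), are assumptions of this sketch and not part of the answer above.

```python
import re

# Hypothetical variant of the pattern: digits, then an optional fractional part.
AMOUNT_RE = re.compile(r'交易金额：(\d+(?:\.\d+)?)元')


def extract_amount(line):
    """Return the amount as a string, or None if the line has no amount field."""
    m = AMOUNT_RE.search(line)
    return m.group(1) if m else None


text = "PROCESS：类型：关爱积分[NOTIFY]   交易号：2012022900000109   订单号：W12022910079166    交易金额：0.01元    交易状态：true 2012-2-29 10:13:08"
print(extract_amount(text))
```

Returning None for a line without the field avoids the AttributeError that calling .group(1) directly on a failed search would raise.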

Solution 2:

You need to use .search() since .match() is like starting your regex with ^, i.e. it only checks at the beginning of the string.
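
A small sketch of the difference, written for Python 3; the sample string is a shortened piece of the line from the question, and the variable names are only illustrative.

```python
import re

pattern = re.compile(r'(\d+\.?\d+)元')
line = "交易金额：0.01元"

# match() only tries the pattern at position 0, where the text starts with "交".
at_start = pattern.match(line)

# search() scans forward until the pattern fits somewhere in the string.
anywhere = pattern.search(line)

print(at_start)           # None
print(anywhere.group(1))  # 0.01
```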

Solution 3:

If your text is UTF-8 encoded, you can pass flags=re.LOCALE:

#coding: utf-8
import re

pattern = re.compile(r'交易金额：(\d+\.?\d+)元', flags=re.LOCALE)
for line in open('xx.txt'):
    # search(), not match(): the amount is not at the start of the line
    match = pattern.search(line)

For more details, see the documentation for re.LOCALE. There is no need to convert the UTF-8 text to unicode.
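
The "no need to convert" idea can also be sketched in Python 3 by matching raw UTF-8 bytes with a bytes pattern. This sketch leaves out re.LOCALE, on the assumption that it is not needed here: the pattern contains only literal bytes and \d, which matches the ASCII digits 0-9 in a bytes pattern. The io.BytesIO object is a stand-in for a file opened in binary mode, and its second line is made up for illustration.

```python
import io
import re

# Encode the pattern once so it can be matched against raw UTF-8 bytes.
pattern = re.compile(r'交易金额：(\d+\.?\d+)元'.encode('utf-8'))

# Stand-in for open('xx.txt', 'rb'); the second line is illustrative only.
fake_file = io.BytesIO(
    "交易号：2012022900000109 交易金额：0.01元\n"
    "a line without an amount\n".encode('utf-8')
)

amounts = []
for line in fake_file:
    m = pattern.search(line)
    if m:
        # The captured group is bytes; it holds only ASCII digits and a dot.
        amounts.append(m.group(1).decode('ascii'))

print(amounts)
```

Decoding only the captured group keeps the rest of each line as bytes, which is the point of skipping the UTF-8 to unicode conversion.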
