Python Emoji Search And Replace Not Working As Expected
I am trying to separate emoji in given text from other characters/words/emojis. I want to use emoji later as features in text classification. So it is important that I treat each e
Solution 1:
There are several issues here.
- There are no capturing groups in the regex pattern, but the replacement pattern refers to Group 1 with \1. The most natural workaround is a backreference to Group 0, i.e. the whole match: \g<0>.
- The \1 in the replacement is not actually parsed as a backreference, but as a character with octal value 1, because a backslash in a regular (not raw) string literal forms an escape sequence. Here, it is an octal escape.
- The + after the ] means that the regex engine must match one or more occurrences of text matching the character class, so you match sequences of emojis rather than each separate emoji.
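The three points above can be checked with a small sketch. The sample string and the names EMOJI_CLASS, sample, runs and singles are made up for illustration; the character class uses the same ranges as the question's pattern.

```python
import re

# Same ranges as the question's character class (illustrative constant name).
EMOJI_CLASS = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
sample = 'a\U0001F600\U0001F600b'  # hypothetical sample: two adjacent emoji

# 1. In a regular string literal, '\1' is an octal escape: one character, code point 1.
assert ' \1 ' == ' \x01 '
assert len('\1') == 1

# 2. In a raw string, \1 is a real backreference, but the pattern has no Group 1,
#    so re.sub rejects the replacement template.
try:
    re.sub(EMOJI_CLASS, r' \1 ', sample)
    group_error = None
except re.error as exc:
    group_error = str(exc)

# 3. With +, a run of adjacent emoji is one match; without it, each emoji matches alone.
runs = re.findall(EMOJI_CLASS + '+', sample)
singles = re.findall(EMOJI_CLASS, sample)

print(group_error)
print(len(runs), len(singles))
```

The last line should show that the + version finds one match where the version without it finds two.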
Use
import re

text = "I am very #happy man but😀😀 my wife😀 is not 😀😀"
print(text)  # line a

reg = re.compile(u'['
                 u'\U0001F300-\U0001F64F'
                 u'\U0001F680-\U0001F6FF'
                 u'\u2600-\u26FF\u2700-\u27BF]',
                 re.UNICODE)

# padding the emoji with space at both ends
new_text = reg.sub(r' \g<0> ', text)
print(new_text)  # line b

# this is just to test if it can still identify the emojis in new_text
new_text2 = reg.sub(r'#\g<0>#', new_text)
print(new_text2)  # line c

See the Python demo, which prints:

I am very #happy man but😀😀 my wife😀 is not 😀😀
I am very #happy man but 😀  😀  my wife 😀  is not  😀  😀
I am very #happy man but #😀#  #😀#  my wife #😀#  is not  #😀#  #😀#
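Since the goal is to use each emoji as a separate feature, the padding step can be wrapped in a small helper that also collapses the extra spaces. This is only a sketch: pad_emoji and tokenize are hypothetical names, and the ranges are the ones from the pattern above, which do not cover every emoji (for example, code points in U+1F900-U+1F9FF are not matched).

```python
import re

# Same ranges as the pattern above; no + quantifier, so each emoji is its own match.
EMOJI = re.compile('[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]')

def pad_emoji(text):
    """Put spaces around every matched emoji, then collapse runs of whitespace."""
    padded = EMOJI.sub(r' \g<0> ', text)
    return ' '.join(padded.split())

def tokenize(text):
    """Split text on whitespace so that each emoji becomes its own token."""
    return pad_emoji(text).split()

print(tokenize('my wife\U0001F61E is not \U0001F600\U0001F600'))
```

Collapsing whitespace after the substitution avoids the double spaces that appear when two emoji are adjacent or already next to a space.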