Removing Emojis From A String In Python
Solution 1:
On Python 2, you have to use a u'' literal to create a Unicode string. Also, you should pass the re.UNICODE flag and convert your input data to Unicode (e.g., text = data.decode('utf-8')):
#!/usr/bin/env python
import re

text = u'This dog \U0001f602'
print(text)  # with emoji

emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text))  # no emoji
Output
This dog 😂
This dog
Note: emoji_pattern matches only some emoji (not all). See Which Characters are Emoji.
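On Python 3 the same idea needs less ceremony, since every str is already Unicode and re matches str patterns in Unicode mode by default. The following is a minimal sketch under that assumption; the names EMOJI_PATTERN and remove_emoji are illustrative and not from the answer above. It also shows the limitation from the note: U+1F914 (thinking face) lies outside the four ranges and is left in place.

```python
import re

# Same four ranges as above; the u'' prefix and re.UNICODE are not needed on Python 3.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (regional indicator symbols)
    "]+"
)


def remove_emoji(text):
    """Return text with every run of characters in the four ranges removed."""
    return EMOJI_PATTERN.sub("", text)


print(repr(remove_emoji("This dog \U0001F602")))  # 'This dog '
print(repr(remove_emoji("Hmm \U0001F914")))       # unchanged: U+1F914 is not covered
```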
Solution 2:
I am updating this answer to match the one by @jfs, because my previous answer failed to account for other Unicode scripts such as accented Latin and Greek. Stack Overflow doesn't allow me to delete my previous answer, so I am updating it to match the most accepted answer to the question.
#!/usr/bin/env python
import re

text = u'This is a smiley face \U0001f602'
print(text)  # with emoji

def deEmojify(text):
    regrex_pattern = re.compile(pattern="["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags=re.UNICODE)
    return regrex_pattern.sub(r'', text)

print(deEmojify(text))
This was my previous answer. Do not use it: it removes every non-ASCII character, not just emoji.
def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')
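To see why the ASCII round-trip is not a safe way to remove emoji, it helps to run it on text that contains non-ASCII letters. A small Python 3 sketch (the function name ascii_only and the sample strings are made up for illustration):

```python
def ascii_only(input_string):
    """The discarded approach: drop every character that is not ASCII."""
    return input_string.encode('ascii', 'ignore').decode('ascii')


greek = '\u0395\u03bb\u03bb\u03ac\u03b4\u03b1'  # a Greek word, six letters

print(repr(ascii_only('This dog \U0001F602')))  # 'This dog '
print(repr(ascii_only('caf\u00e9')))            # 'caf' (the accented e is lost)
print(repr(ascii_only(greek)))                  # '' (every Greek letter is lost)
```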
Solution 3:
Complete Version of remove Emojis ✍ 🌷 📌 👈🏻 🖥
import re

def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # chinese char
        u"\U00002702-\U000027B0"
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030"
        "]+", re.UNICODE)
    return re.sub(emoj, '', data)
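Before adopting this pattern it is worth checking what else it removes. The range \U000024C2-\U0001F251 starts in the Enclosed Alphanumerics block and runs through most of the Basic Multilingual Plane, so it also matches CJK ideographs, kana, and Hangul. A quick Python 3 check that compiles that one range in isolation (the variable names and the sample string are illustrative):

```python
import re

# One of the ranges from the pattern above, compiled on its own.
wide_range = re.compile("[\U000024C2-\U0001F251]+")

sample = "\u4f60\u597d world"  # two CJK ideographs followed by ASCII text
cleaned = wide_range.sub("", sample)

print(repr(cleaned))  # ' world' (the ideographs are removed along with any emoji)
```

If the input may contain such scripts, a narrower set of ranges is the safer choice.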
Solution 4:
If you are not keen on using regex, the best solution could be using the emoji python package.
Here is a simple function to return emoji free text (thanks to this SO answer):
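import emoji

def give_emoji_free_text(text):
    # Python 2: text is a UTF-8 encoded byte string, so decode it first.
    decoded = text.decode('utf-8')
    # Collect every character of the input that the emoji package knows about.
    emoji_list = [c for c in decoded if c in emoji.UNICODE_EMOJI]
    # Keep only the whitespace-separated words that contain none of those characters.
    clean_text = ' '.join([word for word in decoded.split() if not any(e in word for e in emoji_list)])
    return clean_text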
If you are dealing with strings containing emojis, this is straightforward:
>> s1 = "Hi 🤔 How is your 🙈 and 😌. Have a nice weekend 💕👭👙"
>> print s1
Hi 🤔 How is your 🙈 and 😌. Have a nice weekend 💕👭👙
>> print give_emoji_free_text(s1)
Hi How is your and Have a nice weekend
If you are dealing with a unicode string (as in the example by @jfs), just encode it as UTF-8 first.
>> s2 = u'This dog \U0001f602'
>> print s2
This dog 😂
>> print give_emoji_free_text(s2.encode('utf8'))
This dog
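Note that the function above does not delete individual characters: it splits the text on whitespace and drops every token that contains an emoji, which is why the period after 😌 disappears from the first output. A Python 3 sketch of that logic, with a small hand-written set standing in for the emoji package's data (drop_tokens_with_emoji and SAMPLE_EMOJI are hypothetical names, not part of the package):

```python
# Hypothetical stand-in for the emoji package's lookup table.
SAMPLE_EMOJI = {'\U0001F914', '\U0001F648', '\U0001F60C'}


def drop_tokens_with_emoji(text, emoji_chars):
    """Drop every whitespace-separated token that contains a character in emoji_chars."""
    kept = [tok for tok in text.split() if not any(ch in emoji_chars for ch in tok)]
    return ' '.join(kept)


s1 = 'Hi \U0001F914 How is your \U0001F648 and \U0001F60C. Have a nice weekend'
print(drop_tokens_with_emoji(s1, SAMPLE_EMOJI))
# Hi How is your and Have a nice weekend
```

A word that has an emoji attached to it, such as "weekend💕", would be dropped as a whole for the same reason.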
Edits
Based on the comment, it should be as easy as:
def give_emoji_free_text(text):
    return emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))
Solution 5:
If you're using the example from the accepted answer and still getting "bad character range" errors, then you're probably using a narrow build (see this answer for more details). A reformatted version of the regex that seems to work is:
emoji_pattern = re.compile(
    u"(\ud83d[\ude00-\ude4f])|"  # emoticons
    u"(\ud83c[\udf00-\uffff])|"  # symbols & pictographs (1 of 2)
    u"(\ud83d[\u0000-\uddff])|"  # symbols & pictographs (2 of 2)
    u"(\ud83d[\ude80-\udeff])|"  # transport & map symbols
    u"(\ud83c[\udde0-\uddff])"  # flags (iOS)
    "+", flags=re.UNICODE)
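The surrogate pairs in this pattern come from the UTF-16 encoding of each code point, which is how a narrow build stores characters above U+FFFF. The sketch below shows one way to check the build type and how, for example, U+1F602 maps to \ud83d\ude02; to_surrogate_pair is an illustrative helper, not part of any library.

```python
import sys


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


# sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF otherwise.
print(sys.maxunicode == 0xFFFF)

print([hex(u) for u in to_surrogate_pair(0x1F602)])  # ['0xd83d', '0xde02']
print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
print([hex(u) for u in to_surrogate_pair(0x1F64F)])  # ['0xd83d', '0xde4f']
```

The last two lines match the first alternative of the pattern: the emoticon range U+1F600 to U+1F64F becomes \ud83d followed by \ude00 to \ude4f.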