How To Remove All The Escape Sequences From A List Of Strings?

February 25, 2024

I want to remove all types of escape sequences from a list of strings. How can I do this? input: ['william', 'short', '\x80', 'twitter', '\xaa', '\xe2', 'video', 'guy', 'ray'] out

Solution 1:

If you want to strip out characters you don't like, you can use the translate method. In Python 2:

>>> s = "\x01\x02\x10\x13\x20\x21hello world"
>>> print(s)
 !hello world
>>> s
'\x01\x02\x10\x13 !hello world'
>>> escapes = ''.join([chr(char) for char in range(1, 32)])
>>> t = s.translate(None, escapes)
>>> t
' !hello world'

This will strip out all these control characters:

Oct   Dec   Hex   Char
001   1     01    SOH (start of heading)
002   2     02    STX (start of text)
003   3     03    ETX (end of text)
004   4     04    EOT (end of transmission)
005   5     05    ENQ (enquiry)
006   6     06    ACK (acknowledge)
007   7     07    BEL '\a' (bell)
010   8     08    BS  '\b' (backspace)
011   9     09    HT  '\t' (horizontal tab)
012   10    0A    LF  '\n' (new line)
013   11    0B    VT  '\v' (vertical tab)
014   12    0C    FF  '\f' (form feed)
015   13    0D    CR  '\r' (carriage ret)
016   14    0E    SO  (shift out)
017   15    0F    SI  (shift in)
020   16    10    DLE (data link escape)
021   17    11    DC1 (device control 1)
022   18    12    DC2 (device control 2)
023   19    13    DC3 (device control 3)
024   20    14    DC4 (device control 4)
025   21    15    NAK (negative ack.)
026   22    16    SYN (synchronous idle)
027   23    17    ETB (end of trans. blk)
030   24    18    CAN (cancel)
031   25    19    EM  (end of medium)
032   26    1A    SUB (substitute)
033   27    1B    ESC (escape)
034   28    1C    FS  (file separator)
035   29    1D    GS  (group separator)
036   30    1E    RS  (record separator)
037   31    1F    US  (unit separator)

For Python newer than 3.1, the calls are different: str.translate() takes only a translation table, which you build with str.maketrans():

>>> s = "\x01\x02\x10\x13\x20\x21hello world"
>>> print(s)
 !hello world
>>> s
'\x01\x02\x10\x13 !hello world'
>>> escapes = ''.join([chr(char) for char in range(1, 32)])
>>> translator = str.maketrans('', '', escapes)
>>> t = s.translate(translator)
>>> t
' !hello world'
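
Since the question is about a list of strings, here is a rough Python 3 sketch that applies the same translation table to every item. The helper name strip_control_chars and the sample list are made up for illustration. Note that this only removes code points 1 to 31, so a character such as '\x80' is left alone.

```python
# Build the table once: delete every control character with code point 1..31.
CONTROL_CHARS = ''.join(chr(code) for code in range(1, 32))
TRANSLATOR = str.maketrans('', '', CONTROL_CHARS)


def strip_control_chars(strings):
    """Return a new list with control characters removed from each string.

    Hypothetical helper, not part of the original answer.
    """
    return [s.translate(TRANSLATOR) for s in strings]


# Illustrative input: a tab, a newline, and one non-ASCII character.
words = ['will\tiam', 'short\n', '\x80', 'twitter']
cleaned = strip_control_chars(words)
print(cleaned)  # '\x80' is outside the 1..31 range, so it survives
```

If the goal is to drop characters like '\x80' as well, the table would have to be extended, or one of the ASCII-based approaches in the other answers used instead.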

Solution 2:

Something like this?

>>> from ast import literal_eval
>>> s = r'Hello,\nworld!'
>>> print(literal_eval("'%s'" % s))
Hello,
world!

Edit: ok, that's not what you want. What you want can't be done in general, because, as @Sven Marnach explained, strings don't actually contain escape sequences. Those are just notation in string literals.

You can filter all strings with non-ASCII characters from your list with

def is_ascii(s):
    # Python 2: str is a byte string, so decoding fails on non-ASCII bytes
    try:
        s.decode('ascii')
        return True
    except UnicodeDecodeError:
        return False

[s for s in ['william', 'short', '\x80', 'twitter', '\xaa',
             '\xe2', 'video', 'guy', 'ray']
 if is_ascii(s)]
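
The function above relies on str.decode(), which Python 3 strings do not have. A possible Python 3 adaptation, offered as a sketch rather than as the answerer's own code, is to encode instead of decode and catch UnicodeEncodeError:

```python
def is_ascii(s):
    """Return True if every character in s can be encoded as ASCII."""
    try:
        s.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False


words = ['william', 'short', '\x80', 'twitter', '\xaa',
         '\xe2', 'video', 'guy', 'ray']

# Keep only the strings made up entirely of ASCII characters.
ascii_only = [s for s in words if is_ascii(s)]
print(ascii_only)
# ['william', 'short', 'twitter', 'video', 'guy', 'ray']
```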

Solution 3:

You could filter out "words" that are not alphanumeric using a list comprehension and str.isalnum():

>>> l = ['william', 'short', '\x80', 'twitter', '\xaa', '\xe2', 'video', 'guy', 'ray']
>>> [word for word in l if word.isalnum()]
['william', 'short', 'twitter', 'video', 'guy', 'ray']

If you wish to filter out numbers, too, use str.isalpha() instead:

>>> l = ['william', 'short', '\x80', 'twitter', '\xaa', '\xe2', 'video', 'guy', 'ray', '456']
>>> [word for word in l if word.isalpha()]
['william', 'short', 'twitter', 'video', 'guy', 'ray']
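
One caveat if you run this on Python 3: there '\xaa' and '\xe2' are the Unicode letters 'ª' and 'â', so isalnum() and isalpha() return True for them and they are not filtered out. A hedged workaround, assuming Python 3.7 or newer for str.isascii(), is to combine the two checks:

```python
words = ['william', 'short', '\x80', 'twitter', '\xaa', '\xe2',
         'video', 'guy', 'ray', '456']

# Require ASCII first, then apply the alphanumeric / alphabetic test.
ascii_alnum = [w for w in words if w.isascii() and w.isalnum()]
ascii_alpha = [w for w in words if w.isascii() and w.isalpha()]

print(ascii_alnum)
# ['william', 'short', 'twitter', 'video', 'guy', 'ray', '456']
print(ascii_alpha)
# ['william', 'short', 'twitter', 'video', 'guy', 'ray']
```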

Solution 4:

This cannot be done, at least not at the broad scope you are asking for. As others have mentioned, at runtime Python doesn't know the difference between a string written with escape sequences and one written without.

Example:

print ('\x61' == 'a')

This prints True, so there's no way to tell these two strings apart unless you try some static analysis of your Python script.
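
To make the point a bit more concrete, here is a small Python 3 sketch (the variable names are just for illustration). An escape in a normal literal becomes a single character, while the same text in a raw literal really does contain a backslash, and only in that second case is there anything left to detect:

```python
escaped = '\x61'   # the escape is resolved when the literal is parsed
raw = r'\x61'      # raw literal: backslash, 'x', '6', '1'

same_as_a = (escaped == 'a')   # True: one character, identical to 'a'
escaped_len = len(escaped)     # 1
raw_len = len(raw)             # 4

# repr() only *displays* an escape for non-printable characters;
# the string itself still holds a single character.
shown = repr('\x80')           # "'\\x80'", i.e. the text '\x80' in quotes

print(same_as_a, escaped_len, raw_len, shown)
```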

Solution 5:

I had similar issues while converting from hexadecimal to string. This is what finally worked in Python. Example:

list_l = ['william', 'short', '\x80', 'twitter', '\xaa', '\xe2', 'video', 'guy', 'ray']
decode_data=[]
for l in list_l:
    # Python 2: drop any bytes that cannot be decoded as ASCII
    data = l.decode('ascii', 'ignore')
    if data != "":
        decode_data.append(data)

# output :[u'william', u'short', u'twitter', u'video', u'guy', u'ray']
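
The snippet above is Python 2 (byte strings have .decode() there). A possible Python 3 equivalent, as a sketch and not the answerer's code, is to encode with errors='ignore' and decode back. Unlike the filtering answers, this strips non-ASCII characters from inside a string instead of dropping the whole string.

```python
list_l = ['william', 'short', '\x80', 'twitter', '\xaa', '\xe2',
          'video', 'guy', 'ray']

decoded = []
for item in list_l:
    # Characters that cannot be encoded as ASCII are silently dropped.
    data = item.encode('ascii', 'ignore').decode('ascii')
    if data != "":
        decoded.append(data)

print(decoded)
# ['william', 'short', 'twitter', 'video', 'guy', 'ray']

# A mixed string keeps its ASCII part:
partial = 'caf\xe9'.encode('ascii', 'ignore').decode('ascii')  # 'caf'
```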
