
Splitting A Python String

I have a string in Python that I want to split in a very particular manner. I want to split it into a list containing each separate word, except for the case when a group of words

Solution 1:

There is no out-of-the-box solution for this, but here is a reasonably Pythonic function that handles pipe-delimited groups made up of word characters and spaces, optionally preceded by a minus sign.

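import re

def extract_groups(s):
    # Optional minus sign, then a |...| group of word characters and spaces.
    # The capturing parentheses make split() keep the matched groups.
    separator = re.compile(r"(-?\|[\w ]+\|)")
    components = separator.split(s)
    groups = []
    for component in components:
        component = component.strip()
        if len(component) == 0:
            continue
        elif component[0] in ['-', '|']:
            # Delimited group: drop the pipes, keep the words together.
            groups.append(component.replace('|', ''))
        else:
            # Plain text: split on whitespace into individual words.
            groups.extend(component.split())

    return groups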

Using your examples:

>>> extract_groups('Jimmy threw his ball through the window.')
['Jimmy', 'threw', 'his', 'ball', 'through', 'the', 'window.']
>>> extract_groups('Jimmy |threw his ball| through the window.')
['Jimmy', 'threw his ball', 'through', 'the', 'window.']
>>> extract_groups('Jimmy |threw his| ball -|through the| window.')
['Jimmy', 'threw his', 'ball', '-through the', 'window.']

Solution 2:

A regular expression can solve this problem. The following example shows the idea:

import re
s = 'Jimmy -|threw his| ball |through the| window.'
# Raw string so the backslashes reach the regex engine unchanged.
r = re.findall(r'-?\|.+?\||[\w\.]+', s)
print(r)
print([i.replace('|', '') for i in r])

Output:

['Jimmy', '-|threw his|', 'ball', '|through the|', 'window.']
['Jimmy', '-threw his', 'ball', 'through the', 'window.']

Explanation:

  • -? optional minus sign
  • \|.+?\| pipes with at least one character in between
  • | or
  • [\w\.]+ at least one "word" character or .

If characters such as , or ' can appear in the original string, the character class [\w\.] needs to be extended to cover them.
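
One possible way to do that fine tuning, as a rough sketch: replace the [\w\.]+ alternative with [^\s|]+, which accepts any run of characters that are neither whitespace nor pipes. The name split_groups and the sample sentence are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical helper, not from the original answer.
# -?\|.+?\|  : optional minus sign, then a pipe-delimited group (non-greedy)
# [^\s|]+    : otherwise, any run of non-whitespace, non-pipe characters
TOKEN = re.compile(r"-?\|.+?\||[^\s|]+")

def split_groups(s):
    """Split s into words, keeping |...| groups together and dropping the pipes."""
    return [token.replace("|", "") for token in TOKEN.findall(s)]

print(split_groups("Jimmy's friend, -|threw his| ball |through the| window."))
```

With this variant, punctuation such as commas and apostrophes stays attached to the word it belongs to, at the cost of no longer restricting what counts as a "word" character.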

Solution 3:

You can parse that format using a regex, although your choice of delimiter makes it rather an ugly one!

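This code finds all sequences that consist either of a pair of pipe characters | enclosing zero or more non-pipe characters, or of one or more characters that are neither pipes nor whitespace. The re.X (verbose) flag makes the regex engine ignore the spaces inside the pattern, so they are there only for readability.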

import re

s = 'Jimmy |threw his| ball -|through the| window.'
for seq in re.finditer(r' \| [^|]* \| | [^|\s]+ ', s, flags=re.X):
    print(seq.group())

Output:

Jimmy
|threw his|
ball
-
|through the|
window.
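
Note that the output above reports the minus sign as its own item and keeps the pipes. If the goal is a list like the one in Solution 1, a small variation might look like the sketch below. It assumes a leading - should stay attached to its group and that the pipes should be removed; the name split_keeping_groups is hypothetical.

```python
import re

# Same verbose pattern as above, plus an optional leading minus sign.
# The "-?" is an assumption about the desired output, not part of the
# original answer.
PATTERN = re.compile(r' -? \| [^|]* \| | [^|\s]+ ', flags=re.X)

def split_keeping_groups(s):
    """Return words and |...| groups as a list, with the pipes removed."""
    return [m.group().replace('|', '') for m in PATTERN.finditer(s)]

print(split_keeping_groups('Jimmy |threw his| ball -|through the| window.'))
```

Because [^|]* allows zero characters, an empty group such as || would produce an empty string in the result; filter those out if they are not wanted.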
