Python Regex In Pyparsing
Solution 1:
Pyparsing's classes are pretty much left-to-right, with lookahead implemented using explicit expressions like FollowedBy (for positive lookahead) and NotAny or the '~' operator (for negative lookahead). This allows you to detect a terminator which would normally match an item that is being repeated. For instance,
OneOrMore(Word(alphas)) + Literal('end')
will never find a match in strings like "start blah blah end", because the terminating 'end' will get swallowed up in the repetition expression in OneOrMore. The fix is to add negative lookahead in the expression being repeated:
OneOrMore(~Literal('end') + Word(alphas)) + Literal('end')
That is, before reading another word composed of alphas, first make sure it is not the word 'end'.
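To see the difference between the two expressions side by side, here is a minimal sketch. It assumes pyparsing is installed; the variable names (greedy, guarded, text) are just for illustration.

```python
from pyparsing import Literal, OneOrMore, ParseException, Word, alphas

# Repetition with no lookahead: the final 'end' is consumed as just another word
greedy = OneOrMore(Word(alphas)) + Literal('end')

# Repetition guarded by negative lookahead: stop repeating when 'end' is next
guarded = OneOrMore(~Literal('end') + Word(alphas)) + Literal('end')

text = "start blah blah end"

try:
    greedy.parseString(text)
    greedy_matched = True
except ParseException:
    # Nothing is left for Literal('end') to match
    greedy_matched = False

guarded_tokens = guarded.parseString(text).asList()

print(greedy_matched)
print(guarded_tokens)
```

Note that ~Literal('end') also rejects any word that merely starts with 'end' (such as 'ending'), so a real grammar may want a stricter terminator.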
This breaks down when the repetition is within a pyparsing class, like Word.
Word(alphas)
will continue to read alpha characters until it reaches a non-alpha character, such as whitespace, so there is no point inside the word at which a lookahead can run. You would have to break into this repetition using something very expensive, like
Combine(OneOrMore(~Literal('lab') + Word(alphas, exact=1)))
I say expensive because building simple tokens character by character out of complex Combine expressions will make for a slow parser.
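For completeness, here is a rough sketch of that character-by-character approach applied to the "abclab1" sample used later in this answer. The results names (locn, env, identifier) and the trailing Word(nums) are my own additions for illustration, and unlike \w this version only accepts letters before 'lab'.

```python
from pyparsing import Combine, Literal, OneOrMore, Word, alphas, nums

# Read one letter at a time, checking before each letter that 'lab' is not next
locn = Combine(OneOrMore(~Literal('lab') + Word(alphas, exact=1)))

# Assumed overall layout: <letters> 'lab' <digits>
labword = locn('locn') + Literal('lab')('env') + Word(nums)('identifier')

result = labword.parseString("abclab1")
tokens = result.asList()
print(tokens)
```

Every letter costs a lookahead test plus a one-character Word match, which is where the slowness comes from.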
You might be able to compromise by using a regex wrapped in a pyparsing Regex object:
>>> labword = Regex(r'(\w+)(lab)(\d+)')
>>> print(labword.parseString("abclab1").dump())
['abclab1']
This does the right kind of grouping and detection, but does not expose the groups themselves. To do that, add names to each group - pyparsing will treat these like results names, and give you access to the individual fields, just as if you had called setResultsName:
>>> labword = Regex(r'(?P<locn>\w+)(?P<env>lab)(?P<identifier>\d+)')
>>> print(labword.parseString("abclab1").dump())
['abclab1']
- env: lab
- identifier: 1
- locn: abc
>>> print(labword.parseString("abclab1").asDict())
{'identifier': '1', 'locn': 'abc', 'env': 'lab'}
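As a small sketch of reading those fields back in a script rather than at the prompt (assuming pyparsing is installed), the named groups can be accessed either as attributes or by key:

```python
from pyparsing import Regex

labword = Regex(r'(?P<locn>\w+)(?P<env>lab)(?P<identifier>\d+)')
result = labword.parseString("abclab1")

# Named regex groups behave like pyparsing results names
locn = result.locn            # attribute-style access
env = result['env']           # dict-style access
identifier = result.identifier

print(locn, env, identifier)
```

The token list itself still holds the single matched string; only the named fields are split out.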
The only other non-regex approach I can think of would be to define an expression to read the whole string, and then break up the parts in a parse action.
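A minimal sketch of that parse-action idea might look like the following. The helper name split_lab_word and the use of str.rpartition are my own choices for illustration, not something taken from pyparsing itself; the field names mirror the regex groups above.

```python
from pyparsing import ParseException, Word, alphanums

def split_lab_word(s, loc, tokens):
    # Hypothetical helper: split on the last 'lab' and require digits after it
    head, sep, tail = tokens[0].rpartition('lab')
    if not (head and sep and tail.isdigit()):
        raise ParseException(s, loc, "expected <locn>lab<digits>")
    # Attach the pieces as named results on the existing tokens
    tokens['locn'] = head
    tokens['env'] = sep
    tokens['identifier'] = tail

# Read the whole string as one token, then break it up in the parse action
labword = Word(alphanums).setParseAction(split_lab_word)

result = labword.parseString("abclab1")
fields = (result['locn'], result['env'], result['identifier'])
print(fields)

try:
    labword.parseString("abcdef")
    rejected = False
except ParseException:
    rejected = True
```

Raising ParseException from the parse action makes a badly formed word fail the match, just as the regex would.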
Solution 2:
If you strip the grouping parentheses from the pattern, you'll get the right answer :)
>>> re.split(r"\w+lab\d+", "abclab1")
['', '']
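To show why the parentheses matter, here is a short comparison. It assumes the original pattern had the same three capturing groups as the regex in Solution 1; re.split includes the text of any capturing groups in its result list.

```python
import re

# With capturing groups, the captured text is inserted into the split result
with_groups = re.split(r"(\w+)(lab)(\d+)", "abclab1")

# Without groups, only the pieces around the match remain
without_groups = re.split(r"\w+lab\d+", "abclab1")

print(with_groups)     # ['', 'abc', 'lab', '1', '']
print(without_groups)  # ['', '']
```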