Skip to content Skip to sidebar Skip to footer

Matching Patterns In Python

I have an XML file which contains the following strings: abcdef pqrst this

Solution 1:

You could use lxml.etree.XMLParser with recover=True option:

import sys
from lxml import etree

invalid_xml = """
<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and
try to remove non xml compatible characters.</field>
"""
root = etree.fromstring("<root>%s</root>" % invalid_xml,
                        parser=etree.XMLParser(recover=True))
root.getroottree().write(sys.stdout)

Output

<root><fieldname="id">abcdef</field><fieldname="intro"> pqrst</field><fieldname="desc"> this is a test file. We will show 5&gt;2 and 35 and
try to remove non xml compatible characters.</field></root>

Note: > is left in the document as &gt; and < is completely removed (as invalid character in xml text).

Regex-based solution

For simple xml-like content you could use re.split() to separate tags from the text and make the substitutions in non-tag text regions:

import re
from itertools import izip_longest
from xml.sax.saxutils import escape  # '<' -> '&lt;'# assumptions:#   doc = *( start_tag / end_tag / text )#   start_tag = '<' name *attr [ '/' ] '>'#   end_tag = '<' '/' name '>'
ws = r'[ \t\r\n]*'# allow ws between any token
name = '[a-zA-Z]+'# note: expand if necessary but the stricter the better
attr = '{name} {ws} = {ws} "[^"]*"'# note: fragile against missing '"'; no "'"
start_tag = '< {ws} {name} {ws} (?:{attr} {ws})* /? {ws} >'
end_tag = '{ws}'.join(['<', '/', '{name}', '>'])
tag = '{start_tag} | {end_tag}'assert'{{'notin tag
while'{'in tag: # unwrap definitions
    tag = tag.format(**vars())

tag_regex = re.compile('(%s)' % tag, flags=re.VERBOSE)

# escape &, <, > in the text
iters = [iter(tag_regex.split(invalid_xml))] * 2
pairs = izip_longest(*iters, fillvalue='')  # iterate 2 items at a timeprint(''.join(escape(text) + tag for text, tag in pairs))

To avoid false positives for tags you could remove some of '{ws}' above.

Output

<fieldname="id">abcdef</field><fieldname="intro" > pqrst</field><fieldname="desc"> this is a test file. We will show 5&gt;2 and 3&lt;5 and
try to remove non xml compatible characters.</field>

Note: both <> are escaped in the text.

You could call any function instead of escape(text) above e.g.,

defescape4human(text):
    return text.replace('<', 'less than').replace('>', 'greater than')

Solution 2:

Seems I did it for >:

re.sub('(?<! " )(?<! ")(?! )>','greater than', xml_string)

?<! - negative lookbehind assertion,

?! - negative lookahead assertion,

(...)(...) is logical AND,

so whole expression means "substitute all occurences of '>' which (does not start with ' " ') and (does not start with ' "') and ( does not end with ' ')

case < is similar

Solution 3:

Use ElementTree for XML parsing.

Post a Comment for "Matching Patterns In Python"