Can I Get Lxml To Ignore Non-XML Content Before And After The Root Tag?

I'm trying to use lxml to process a file that may have some non-XML junk both before and after the XML content. Imagine someone captured a terminal buffer and I have something lik

Solution 1:

Here's another point in the balance between convenience and correctness:

import re

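# Greedy match: from the first opening tag to the last closing tag with the same name.
match = re.search(r"<(\w+).*</\1>", console_output, flags=re.DOTALL)
xml = match.group() if match else None  # None when no root element is found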

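It expects a single root element. It also assumes that nothing in the junk before the XML looks like an opening tag, and that the root's closing tag does not reappear in the junk after it.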
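As a rough sketch of how the extracted string could then be handed to a parser: the helper name extract_xml and the sample terminal buffer below are made up for illustration, and the fallback to the standard library's ElementTree is only there so the snippet runs where lxml is not installed.

```python
import re

try:
    from lxml import etree  # the library the question is about
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, also offers fromstring()


def extract_xml(console_output):
    """Return the text from the first opening tag to the last matching closing tag, or None."""
    match = re.search(r"<(\w+).*</\1>", console_output, flags=re.DOTALL)
    return match.group() if match else None


# Hypothetical captured terminal buffer with junk on both sides of the XML.
console_output = "user@host:~$ cat data.xml\n<root><item>1</item></root>\nuser@host:~$ "

xml = extract_xml(console_output)
root = etree.fromstring(xml)  # raises a parse error if the extracted text is not well-formed
print(root.tag, len(root))
```

Parsing the result right away is a cheap way to find out whether the regex grabbed something that is actually well-formed.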


Solution 2:

At most, you can strip everything before the first opening angle bracket and everything after the last closing angle bracket:

data = data[data.find('<'):data.rfind('>') + 1]  # + 1 keeps the final '>'

but this will fall over easily if there are any opening angle brackets before the actual XML data, or any extra closing angle brackets after it. This is not uncommon in shell environments, where prompts and redirections use both characters.

It'll be much easier on you if you just reject any such inputs instead.
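One way to act on that advice, sketched under a few assumptions: trim to the outermost angle brackets (keeping the final closing bracket), try to parse, and refuse the input if parsing fails. The function name parse_or_reject and the choice of ValueError are illustrative, not part of lxml. The standard library's ElementTree is used as a fallback so the snippet runs without lxml.

```python
try:
    from lxml import etree  # the library the question is about
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with a compatible ParseError


def parse_or_reject(data):
    """Trim to the outermost angle brackets, then parse; raise ValueError if that fails."""
    start = data.find("<")
    end = data.rfind(">")
    if start == -1 or end < start:
        raise ValueError("no XML-like content found")
    candidate = data[start:end + 1]  # + 1 keeps the final '>'
    try:
        return etree.fromstring(candidate)
    except etree.ParseError as exc:
        # Stray brackets in the surrounding junk end up here instead of being guessed around.
        raise ValueError(f"input rejected: {exc}") from exc


# Hypothetical inputs: the first is clean enough to trim, the second has a stray '<' up front.
good = "junk before\n<root><a/></root>\njunk after"
bad = "$ cat <file\n<root/>\n$ "

root = parse_or_reject(good)
try:
    parse_or_reject(bad)
    rejected = False
except ValueError:
    rejected = True
print(root.tag, rejected)
```

The point of the design is that ambiguous input produces an explicit error rather than a silently wrong document.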

