How To Remove Duplicate Nodes Xml Python
I have a special-case XML file whose structure is something like:
Solution 1:
First of all, what you're doing is a hard problem in the library you're using; see this question: How to remove a node inside an iterator in python xml.etree.ElementTree
The solution to this would be to use lxml, which "implements the same API but with additional enhancements". Then you can do the following fix.
You seem to be traversing only the second level of nodes in your XML tree. You're getting root, then walking the children of its children. This would get you parent2 from the first page and the element from your second page. Furthermore, you wouldn't be comparing across pages here: your comparison will only find second-level duplicates within the same page.
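To make the traversal problem concrete, here is a small sketch on a made-up document. The original XML isn't shown, so the `SAMPLE` layout (two `page` nodes, with `parent1`/`parent2` wrappers on the first page) is purely an assumption based on the names mentioned above. It uses the standard-library `xml.etree.ElementTree` only so that it runs without extra installs.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the real structure from the question is not known.
SAMPLE = """
<root>
  <page>
    <parent1><element id="1"/></parent1>
    <parent2><element id="2"/></parent2>
  </page>
  <page>
    <element id="1"/>
  </page>
</root>
"""

root = ET.fromstring(SAMPLE)

# Two nested loops only ever reach the children of root's children.
second_level = [node for page in root for node in page]
second_level_tags = [node.tag for node in second_level]

# Of those nodes, only one is actually an <element>.
ids_two_levels = [n.get("id") for n in second_level if n.tag == "element"]

# A recursive traversal finds every <element>, at any depth, on any page.
ids_iter = [el.get("id") for el in root.iter("element")]

print(second_level_tags)  # ['parent1', 'parent2', 'element']
print(ids_two_levels)     # ['1']
print(ids_iter)           # ['1', '2', '1']
```

On this sample the two-level walk sees a single `element`, so the duplicate `id="1"` nested under `parent1` is never compared against it.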
Select the right set of elements using a proper traversal function such as iter:
# `root` is the root element of a tree parsed with lxml.
# Use a `set` to keep track of "visited" elements with good lookup time.
visited = set()
# The iter method does a recursive traversal; list() takes a snapshot
# so that removing elements cannot disturb the walk.
for el in list(root.iter('element')):
    # Since the id is what defines a duplicate for you
    if 'id' in el.attrib:
        current = el.get('id')
        # In visited already means it's a duplicate, remove it
        if current in visited:
            el.getparent().remove(el)
        # Otherwise mark this ID as "visited"
        else:
            visited.add(current)
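If installing lxml is not an option, roughly the same idea can be sketched with the standard-library `xml.etree.ElementTree`. Its elements have no `getparent()`, so this version builds a child-to-parent map first. The helper name `remove_duplicate_ids` and the sample document are hypothetical, made up for illustration; the real file structure is not shown in the question.

```python
import xml.etree.ElementTree as ET


def remove_duplicate_ids(root, tag="element"):
    """Remove every `tag` element whose id was already seen.

    Returns the number of elements removed. Hypothetical helper, not part
    of any library.
    """
    # ElementTree has no getparent(), so map each child to its parent.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    visited = set()
    removed = 0
    # list() snapshots the matches so removals don't disturb the traversal.
    for el in list(root.iter(tag)):
        el_id = el.get("id")
        if el_id is None:
            continue  # elements without an id are left alone
        if el_id in visited:
            parent_of[el].remove(el)
            removed += 1
        else:
            visited.add(el_id)
    return removed


# Assumed sample layout; replace with your own parsed tree.
SAMPLE_XML = """
<root>
  <page>
    <parent1><element id="1"/></parent1>
    <parent2><element id="2"/></parent2>
  </page>
  <page>
    <element id="1"/>
    <element id="3"/>
  </page>
</root>
"""

root = ET.fromstring(SAMPLE_XML)
removed = remove_duplicate_ids(root)
remaining = [el.get("id") for el in root.iter("element")]
print(removed, remaining)  # 1 ['1', '2', '3']
```

The first occurrence in document order is the one that is kept, so the `id="1"` element on the second page is dropped even though it sits at a different depth than the first.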