How To Remove Duplicate Nodes Xml Python

Question

I have a special-case XML file whose structure is something like:

Solution 1:

First of all, what you're doing is a hard problem in the library you're using; see this question: How to remove a node inside an iterator in python xml.etree.ElemenTree

The solution is to use lxml, which "implements the same API but with additional enhancements". With it, you can apply the following fix.

You seem to be traversing only the second level of nodes in your XML tree. You're getting root, then walking the children of its children. This would get you parent2 from the first page and the element from your second page. Furthermore, you wouldn't be comparing across pages here:

your comparison will only find second-level duplicates within the same page.
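To make the depth problem concrete, here is a minimal sketch using the standard library. The `SAMPLE` document is hypothetical (the original file is not shown here); it only assumes pages that hold `element` nodes at different depths, as described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the same id appears at different depths on two pages.
SAMPLE = """
<root>
  <page>
    <parent2>
      <element id="a"/>
    </parent2>
  </page>
  <page>
    <element id="a"/>
  </page>
</root>
"""

root = ET.fromstring(SAMPLE)

# Two nested loops stop at the grandchildren of root, so the
# <element> nested inside <parent2> is never reached.
second_level = [child.tag for page in root for child in page]

# iter() walks the whole tree, at any depth and across pages.
all_ids = [el.get("id") for el in root.iter("element")]

print(second_level)
print(all_ids)
```

The nested loops see one `parent2` and one `element`, while `iter` reaches both `element` nodes regardless of where they sit.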

Select the right set of elements using a proper traversal function such as iter:

# Assumes `root` is an lxml element; getparent() does not exist in xml.etree.ElementTree.
# Use a `set` to keep track of "visited" ids with good lookup time.
visited = set()
# The iter method does a recursive traversal; list() snapshots it
# so that removing elements cannot disturb the loop.
for el in list(root.iter('element')):
    # Since the id is what defines a duplicate for you
    if 'id' in el.attrib:
        current = el.get('id')
        # In visited already means it's a duplicate, remove it
        if current in visited:
            el.getparent().remove(el)
        # Otherwise mark this ID as "visited"
        else:
            visited.add(current)
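If installing lxml is not an option, a similar result can be sketched with the standard library alone. `xml.etree.ElementTree` has no `getparent()`, so this variant builds a child-to-parent map first. The function name `remove_duplicates` and the `SAMPLE` document are illustrative assumptions, not part of the original question.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: id "1" appears on both pages; one element has no id.
SAMPLE = """<root>
  <page><parent2><element id="1"/><element id="2"/></parent2></page>
  <page><element id="1"/><element id="3"/><element/></page>
</root>"""

def remove_duplicates(root, tag="element", key="id"):
    """Keep the first element (in document order) for each key value."""
    # ElementTree elements do not know their parent, so record it up front.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    seen = set()
    # list() snapshots the traversal so removals do not affect the loop.
    for el in list(root.iter(tag)):
        value = el.get(key)
        if value is None:
            continue  # no id, so it cannot be a duplicate
        if value in seen:
            parent_of[el].remove(el)
        else:
            seen.add(value)
    return root

root = remove_duplicates(ET.fromstring(SAMPLE))
remaining = [el.get("id") for el in root.iter("element")]
print(remaining)
```

Note that `remove()` also drops the removed element's tail text, which only matters if the document has mixed content.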
