Create A Pandas Dataframe From A Nested Xml File

Here is a small section of an XML file. I would like to create a database from this, with each unique tag as a column name and no duplicated data. Tried using lxml and the best I hav

Solution 1:

Consider nested XPath loops: the outer loop visits every <SRCSGT> node, and the inner loop collects that node's descendants into a dictionary keyed by tag name. Each dictionary is appended to a list, which is then passed to the DataFrame call:

from lxml import etree as et
import pandas as pd

trees = et.parse('test.xml')

d = []
for srcsgt in trees.xpath('//SRCSGT'):     # ITERATE THROUGH EVERY <SRCSGT> NODE
    inner = {}
    for elem in srcsgt.xpath('.//*'):      # ITERATE THROUGH THIS NODE'S DESCENDANTS ONLY ('//*' WOULD SEARCH THE WHOLE DOCUMENT)
        if elem.text and elem.text.strip():    # KEEP ONLY NODES WITH NON-EMPTY TEXT (elem.text CAN BE None)
            inner[elem.tag] = elem.text

    d.append(inner)

df = pd.DataFrame(d)

Output

print(df)

#             ADDRESS                          AGENCY  ARCHDATE CLASSCOD  \
# 0  Jigjhgjas@va.gov  Department of Veterans Affairs  12172017        H
#                                              CONTACT      DATE  \
# 0  COiyiyS, JUhhiuN<a href="mailto:Juggyui@va.gov...  11112017
#                   DESC                                               LINK  \
# 0  CONTRACT SPECIALIST  https://www.fbo.gov/spg/VA/CaVAMC532/CaVAMC532...
#                                         LOCATION   NAICS  \
# 0  Department of Veterans Affairs Medical Center  238210
#                                               OFFADD            OFFICE  \
# 0  Department of Veterans Affairs;400 Fort Hill A...  Canandaigua VAMC
#       PACKAGE RECOVERY_ACT  RESPDATE SETASIDE SOLNBR  \
# 0  Attachment            N  11172017      N/A   9069
#                                              SUBJECT    ZIP
# 0  H--3 YEAR TESTING/MAINTENANCE OF ELECTRICAL EQ...  14424
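
To see how the same idea behaves with more than one record, here is a minimal self-contained sketch. It is an illustration under assumptions, not the original answer's code: the sample XML, its values, the <ROOT> wrapper, and the helper name records_to_rows are all made up, and the standard library's xml.etree.ElementTree is swapped in for lxml so the snippet runs without extra parsing dependencies. Unlike the code above, it also stores the stripped text.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample: two <SRCSGT> records with made-up values.
SAMPLE_XML = """
<ROOT>
  <SRCSGT>
    <AGENCY>Agency A</AGENCY>
    <OFFICE>Office 1</OFFICE>
    <CONTACT>
      <DESC>Specialist</DESC>
    </CONTACT>
  </SRCSGT>
  <SRCSGT>
    <AGENCY>Agency B</AGENCY>
    <ZIP>14424</ZIP>
  </SRCSGT>
</ROOT>
"""


def records_to_rows(xml_text, record_tag="SRCSGT"):
    """Return one dict per record element, keyed by descendant tag name."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(record_tag):
        row = {}
        # './/*' is relative: it selects only this record's descendants.
        for elem in record.findall(".//*"):
            text = (elem.text or "").strip()  # elem.text may be None
            if text:  # skip container elements that hold only whitespace
                row[elem.tag] = text
        rows.append(row)
    return rows


rows = records_to_rows(SAMPLE_XML)
df = pd.DataFrame(rows)  # one row per record; missing tags become NaN
print(df)
```

Each record becomes one row, and a tag that is missing from a record shows up as NaN in that row. Note that if a tag repeats inside a single record, the later value overwrites the earlier one, because the dictionary is keyed by tag name alone.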
