
How To Extract Multiple Grandchildren/children From Xml Where One Child Is A Specific Value?

I'm working with an XML file that stores all 'versions' of the chatbot we create. Currently we have 18 versions and I only care about the most recent one. I'm trying to find a way

Solution 1:

This code assumes that your XML follows the pattern version, dialog1, dialog2, dialog3, version2, dialog1, dialog2, and so on. If that is not the case, let me know and I will reevaluate the code. The idea is to loop over the elements, group each run of dialogs with the version that precedes it, and then sort the groups by version number. The latest group is then flattened into a nested list to create the pandas DataFrame.

import xml.etree.ElementTree as ET
import pandas as pd

cols = ["BotVersion", "DialogGroup", "Dialog"]
rows = []

tree = ET.parse('test.xml')
root = tree.getroot()


for fullName in root.findall(".//botVersions"):
    versions = list(fullName)

# creating the many to one relation between the versions and bot dialogs
grouping = []
relations = []
for i, tag in enumerate(versions):
    if i == 0:
        relations.append(tag)
    elif tag.tag == 'fullName':
        grouping.append(relations)
        relations = []
        relations.append(tag)
    else:
        relations.append(tag)
    # edge case for end of list: close the final group, whichever branch ran
    if i == len(versions) - 1:
        grouping.append(relations)

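# sorting by the number in the fullName text (assumed to look like 'v18') so the
# last group is the latest version; a plain string sort would put 'v9' after 'v18'
grouping.sort(key=lambda x: int(''.join(ch for ch in x[0].text if ch.isdigit())))
rows = grouping[-1]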

# flattening the text into rows for the pandas DataFrame
version_number = rows[0].text
pandas_rows = []
for r in rows[1:]:
    pandas_row = [version_number]
    for child in r.iter():
        if child.tag in ['botDialogGroup', 'label']:
            pandas_row.append(child.text)
    pandas_rows.append(pandas_row)

df = pd.DataFrame(pandas_rows, columns=cols)
print(df)
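
One caveat when picking the "latest" version by sorting text: string comparison orders 'v9' after 'v18'. The following is a minimal, self-contained sketch of the same grouping idea with a numeric sort key. The sample XML, the `botDialogs` wrapper tag, and the helper names `version_key` and `latest_version_rows` are assumptions made for illustration, not taken from the original file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample following the assumed flat pattern:
# fullName, dialog, dialog, fullName, dialog, ...
SAMPLE = """<Bot>
  <botVersions>
    <fullName>v9</fullName>
    <botDialogs><botDialogGroup>Old_Group</botDialogGroup><label>Old Dialog</label></botDialogs>
    <fullName>v18</fullName>
    <botDialogs><botDialogGroup>Greetings</botDialogGroup><label>Welcome</label></botDialogs>
    <botDialogs><botDialogGroup>Support</botDialogGroup><label>Reset Password</label></botDialogs>
  </botVersions>
</Bot>"""


def version_key(name):
    """Return the digits of a name such as 'v18' as an int (-1 if none)."""
    digits = "".join(ch for ch in name if ch.isdigit())
    return int(digits) if digits else -1


def latest_version_rows(xml_text):
    root = ET.fromstring(xml_text)
    groups = []
    # Each fullName starts a new group; following siblings join that group.
    for element in root.find(".//botVersions"):
        if element.tag == "fullName":
            groups.append([element])
        elif groups:
            groups[-1].append(element)
    # Pick the group with the highest version number, not the "largest" string.
    latest = max(groups, key=lambda g: version_key(g[0].text))
    version = latest[0].text
    return [
        [version, d.findtext("botDialogGroup"), d.findtext("label")]
        for d in latest[1:]
    ]


rows = latest_version_rows(SAMPLE)
print(rows)
```

The resulting `rows` list can be passed to `pd.DataFrame(rows, columns=cols)` in the same way as `pandas_rows` above.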

Solution 2:

import pandas as pd
from lxml import etree

bots = """your xml above"""
cols = ["BotVersion", "DialogGroup", "Dialog"]
rows = []
ver = 'v18'

root = etree.XML(bots)

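# './/' keeps each path relative to the current element; a leading '//' inside a
# predicate or on a sub-element would search the whole document instead
for entry in root.xpath(f"//botVersions[.//fullName[.='{ver}']]"):
    rows.append([ver, entry.xpath('.//botDialogGroup/text()')[0], entry.xpath('.//label/text()')[0]])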
df = pd.DataFrame(rows, columns=cols)
df

Output should be your expected df.
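
If each version has several dialogs, one row per dialog is probably what you want. Here is a minimal sketch of that, using the standard library's `xml.etree.ElementTree` rather than lxml so it runs without extra installs. It assumes each `botVersions` element holds one `fullName` plus repeated dialog elements; the `botDialogs` tag name, the sample XML, and the helper `rows_for_version` are hypothetical and should be adapted to the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one botVersions element per version.
SAMPLE = """<Bot>
  <botVersions>
    <fullName>v17</fullName>
    <botDialogs><botDialogGroup>Old_Group</botDialogGroup><label>Old Dialog</label></botDialogs>
  </botVersions>
  <botVersions>
    <fullName>v18</fullName>
    <botDialogs><botDialogGroup>Greetings</botDialogGroup><label>Welcome</label></botDialogs>
    <botDialogs><botDialogGroup>Support</botDialogGroup><label>Reset Password</label></botDialogs>
  </botVersions>
</Bot>"""


def rows_for_version(xml_text, ver):
    root = ET.fromstring(xml_text)
    rows = []
    # The predicate selects only botVersions whose own fullName child equals ver.
    for version in root.findall(f".//botVersions[fullName='{ver}']"):
        # Paths below are relative to this version, so other versions are ignored.
        for dialog in version.findall("botDialogs"):
            rows.append(
                [ver, dialog.findtext("botDialogGroup"), dialog.findtext("label")]
            )
    return rows


rows = rows_for_version(SAMPLE, "v18")
print(rows)
```

As before, `pd.DataFrame(rows, columns=cols)` turns the list into the DataFrame.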
