
Parse Xml To Pandas Data Frame In Python

I am trying to read the XML file and convert it to pandas. However, it returns empty data. This is the sample of xml structure:

                  .format(instance.tag, ikey, ivalue))
        # Loop inside every instance
        instance_dict = get_children_info(list(instance), instance_dict)
        #consolidator_dict.update({ivalue: instance_dict.copy()})
        consolidator_dict[ivalue] = instance_dict.copy()
    df = pd.DataFrame(consolidator_dict).T
    df = df[df_cols]
    return df

Run the following to generate the desired output.

xml_source = r'grade_data.xml'
df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", "Question", "Answer",
           "ContextRequired", "ExtraInfoInAnswer", "Comments", "Watch", 'ReferenceAnswers']

df = xml2df(xml_source, df_cols, source_is_file = True)
df

Method: 2

Given that you have the xml_string, you could convert XML >> dict >> DataFrame. Run the following to get the desired output.

Note: You will need to install xmltodict to use Method 2. This method is inspired by the solution suggested by @martin-blech at "How to convert XML to JSON in Python? [duplicate]". Kudos to @martin-blech for making it.

pip install -U xmltodict

Solution

def read_recursively(x, instance_dict):
    #print(x)
    txt = ''
    for key in x.keys():
        k = key.replace("@","")
        if k in df_cols:
            if isinstance(x.get(key), dict):
                instance_dict, txt = read_recursively(x.get(key), instance_dict)
            #else:
            instance_dict.update({k: x.get(key)})
            #print('{}: {}'.format(k, x.get(key)))
        else:
            #print('else: {}: {}'.format(k, x.get(key)))
            # dig deeper if value is another dict
            if isinstance(x.get(key), dict):
                instance_dict, txt = read_recursively(x.get(key), instance_dict)
            # add simple text associated with element
            if k=='#text':
                txt = x.get(key)
        # update text to corresponding parent element
        if (k!='#text') and (txt!=''):
            instance_dict.update({k: txt})
    return (instance_dict, txt)

You will need the function read_recursively() given above. Now run the following.

import json

import pandas as pd
import xmltodict

o = xmltodict.parse(xml_string)  # INPUT: XML_STRING
#print(json.dumps(o))  # uncomment to see xml to json converted string

consolidated_dict = dict()
oi = o['Instances']['Instance']

for x in oi:
    instance_dict = dict()
    instance_dict, _ = read_recursively(x, instance_dict)
    consolidated_dict.update({x.get("@ID"): instance_dict.copy()})
df = pd.DataFrame(consolidated_dict).T
df = df[df_cols]
df
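To make the dict-flattening step more concrete, here is a minimal sketch of the kind of nested dict that xmltodict produces, where attribute names are prefixed with `@` and the text of an element that also carries attributes is stored under `#text`. The sample record, its element names, and the helper `flatten_instance` are all hypothetical: they are written by hand for illustration and are not taken from grade_data.xml. The helper is a simplified stand-in for read_recursively(), not a replacement for it.

```python
# Hand-written stand-in for one parsed <Instance>; names and values are made up.
# xmltodict stores attributes under "@name" and mixed text under "#text".
sample_instance = {
    "@ID": "1",
    "MetaInfo": {"@TaskID": "LP03", "@DataSource": "DeepTutor"},
    "Question": "What is force?",
    "Annotation": {"@Label": "correct", "Comments": {"@Watch": "1", "#text": "ok"}},
}


def flatten_instance(node, wanted, parent_key=None, out=None):
    """Collect the wanted keys from a nested xmltodict-style dict into one flat dict."""
    if out is None:
        out = {}
    for key, value in node.items():
        name = key.replace("@", "")
        if isinstance(value, dict):
            # Descend into the child element, remembering its name
            flatten_instance(value, wanted, parent_key=name, out=out)
        elif name == "#text":
            # Text of an element that also has attributes belongs to that element
            if parent_key in wanted:
                out[parent_key] = value
        elif name in wanted:
            out[name] = value
    return out


wanted_cols = ["ID", "TaskID", "DataSource", "Question", "Comments", "Watch"]
row = flatten_instance(sample_instance, wanted_cols)
print(row)
```

A list of such rows, one per instance, could then be passed to pd.DataFrame(rows, columns=df_cols).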

Solution 2:

Several issues:

  • Calling .find on the loop variable, node, expects a child node to exist: current_node.find('child_of_current_node'). However, since all the nodes are the children of root they do not maintain their own children, so no loop is required;
  • Not checking NoneType that can result from missing nodes with find() and prevents retrieving .tag or .text or other attributes;
  • Not retrieving node content with .text, otherwise the <Element... object is returned;
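The last two issues can be reproduced on a small sample. The sketch below assumes a hypothetical one-record XML string (the real file layout may differ) and only shows what find() returns for an existing child, for a missing child, and what happens when .text is read from the missing one.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-record sample; the real file layout may differ.
xml_string = """
<Instance ID="1">
    <Question>What is force?</Question>
    <Answer>Mass times acceleration</Answer>
</Instance>
"""
xroot = ET.fromstring(xml_string)

question_node = xroot.find("Question")   # an Element object, not a string
question_text = question_node.text       # the string inside the element
missing_node = xroot.find("Comments")    # None, because no such child exists

# Reading .text from a missing node raises AttributeError
try:
    missing_node.text
    raised = False
except AttributeError:
    raised = True

print(type(question_node).__name__, repr(question_text), missing_node, raised)
```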

Consider this adjustment using the ternary conditional expression, a if condition else b, to ensure each variable has a value regardless:

rows = []

s_name = xroot.attrib.get("ID")
s_student = xroot.find("StudentID").text if xroot.find("StudentID") is not None else None
s_task = xroot.find("TaskID").text if xroot.find("TaskID") is not None else None
s_source = xroot.find("DataSource").text if xroot.find("DataSource") is not None else None
s_desc = xroot.find("ProblemDescription").text if xroot.find("ProblemDescription") is not None else None
s_question = xroot.find("Question").text if xroot.find("Question") is not None else None
s_ans = xroot.find("Answer").text if xroot.find("Answer") is not None else None
s_label = xroot.find("Label").text if xroot.find("Label") is not None else None
s_contextrequired = xroot.find("ContextRequired").text if xroot.find("ContextRequired") is not None else None
s_extraInfoinAnswer = xroot.find("ExtraInfoInAnswer").text if xroot.find("ExtraInfoInAnswer") is not None else None
s_comments = xroot.find("Comments").text if xroot.find("Comments") is not None else None
s_watch = xroot.find("Watch").text if xroot.find("Watch") is not None else None
s_referenceAnswers = xroot.find("ReferenceAnswers").text if xroot.find("ReferenceAnswers") is not None else None

# Keys must match the names in df_cols, otherwise the column comes out empty
rows.append({"ID": s_name, "StudentID": s_student, "TaskID": s_task,
             "DataSource": s_source, "ProblemDescription": s_desc,
             "Question": s_question, "Answer": s_ans, "Label": s_label,
             "ContextRequired": s_contextrequired, "ExtraInfoInAnswer": s_extraInfoinAnswer,
             "Comments": s_comments, "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers
            })

out_df = pd.DataFrame(rows, columns = df_cols)
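The repeated find(...) checks can also be written more compactly with ElementTree's findtext(), which returns a default (None unless specified) when the child is missing. This is a minimal sketch on a hypothetical sample record with a shortened, made-up column list, not the layout of the original file. One difference to be aware of: for a child that exists but is empty, findtext() returns an empty string, whereas .text would give None.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record and column list, for illustration only
xml_string = """
<Instance ID="7">
    <TaskID>T1</TaskID>
    <Question>What is force?</Question>
</Instance>
"""
columns = ["ID", "TaskID", "Question", "Answer"]

xroot = ET.fromstring(xml_string)

# findtext() returns the default (None here) when the child is missing
record = {col: xroot.findtext(col) for col in columns if col != "ID"}
record["ID"] = xroot.attrib.get("ID")   # ID is an attribute, not a child element

rows = [record]
print(rows)
```

The resulting rows list can be handed to pd.DataFrame(rows, columns=columns) in the same way as the version written out field by field.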

Alternatively, run a more dynamic version assigning to an inner dictionary using the iterator variable:

rows = []
for node in xroot: 
    inner = {}
    inner[node.tag] = node.text

    rows.append(inner)

out_df = pd.DataFrame(rows, columns = df_cols)

Or list/dict comprehension:

rows = [{node.tag: node.text} for node in xroot]
out_df = pd.DataFrame(rows, columns = df_cols)
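If each record is instead a child element of the root (the xmltodict code earlier indexes o['Instances']['Instance'], which suggests that layout), one row per record can be built by merging the record's attributes with its direct children. The sketch below assumes a hypothetical two-record sample and only handles one level of nesting.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; assumes records are <Instance> children of the root
xml_string = """
<Instances>
    <Instance ID="1"><Question>Q1</Question><Answer>A1</Answer></Instance>
    <Instance ID="2"><Question>Q2</Question></Instance>
</Instances>
"""
xroot = ET.fromstring(xml_string)

rows = []
for instance in xroot.findall("Instance"):
    row = dict(instance.attrib)                                 # attributes, such as ID
    row.update({child.tag: child.text for child in instance})   # direct children only
    rows.append(row)

print(rows)
```

Passing rows to pd.DataFrame(rows, columns=df_cols) fills columns that are absent from a record with NaN, so records with missing children do not need special handling here.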
