Parse Xml To Pandas Data Frame In Python
I am trying to read an XML file and convert it to a pandas DataFrame. However, it returns empty data. This is a sample of the XML structure:
        .format(instance.tag, ikey, ivalue))
        # Loop inside every instance
        instance_dict = get_children_info(list(instance),
                                          instance_dict)
        # consolidator_dict.update({ivalue: instance_dict.copy()})
        consolidator_dict[ivalue] = instance_dict.copy()
    df = pd.DataFrame(consolidator_dict).T
    df = df[df_cols]
    return df
Run the following to generate the desired output.
xml_source = r'grade_data.xml'
df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", "Question", "Answer",
           "ContextRequired", "ExtraInfoInAnswer", "Comments", "Watch", "ReferenceAnswers"]
df = xml2df(xml_source, df_cols, source_is_file=True)
df
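As a rough, self-contained illustration of the same idea (one row per instance, with attributes and element text flattened into columns), the sketch below uses only the standard library. The sample XML, the shortened column list, and the helper name instance_to_row are assumptions made for illustration; the real grade_data.xml layout and the full xml2df function may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real grade_data.xml layout may differ.
xml_string = """
<Instances>
  <Instance ID="1">
    <MetaInfo TaskID="T1" DataSource="demo"/>
    <Question>What is 2 + 2?</Question>
    <Answer>4</Answer>
  </Instance>
  <Instance ID="2">
    <MetaInfo TaskID="T2" DataSource="demo"/>
    <Question>What is 3 + 3?</Question>
  </Instance>
</Instances>
"""

# Shortened column list, for illustration only.
df_cols = ["ID", "TaskID", "DataSource", "Question", "Answer"]


def instance_to_row(instance, columns):
    """Flatten one element: collect attributes and text of all descendants."""
    row = dict.fromkeys(columns)            # missing fields stay None
    for elem in instance.iter():            # iter() includes the instance itself
        for key, value in elem.attrib.items():
            if key in row:
                row[key] = value
        text = (elem.text or "").strip()
        if elem.tag in row and text:
            row[elem.tag] = text
    return row


root = ET.fromstring(xml_string)
rows = [instance_to_row(inst, df_cols) for inst in root.findall("Instance")]
```

The resulting list of dictionaries can then be handed to pd.DataFrame(rows, columns=df_cols).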
Method: 2
Given that you have the xml_string, you could convert xml >> dict >> dataframe. Run the following to get the desired output.
Note: You will need to install xmltodict to use Method 2. This method is inspired by the solution suggested by @martin-blech at "How to convert XML to JSON in Python?". Kudos to @martin-blech for making it.
pip install -U xmltodict
Solution
def read_recursively(x, instance_dict):
    # print(x)
    txt = ''
    for key in x.keys():
        k = key.replace("@", "")
        if k in df_cols:
            if isinstance(x.get(key), dict):
                instance_dict, txt = read_recursively(x.get(key), instance_dict)
            # else:
            instance_dict.update({k: x.get(key)})
            # print('{}: {}'.format(k, x.get(key)))
        else:
            # print('else: {}: {}'.format(k, x.get(key)))
            # dig deeper if value is another dict
            if isinstance(x.get(key), dict):
                instance_dict, txt = read_recursively(x.get(key), instance_dict)
            # add simple text associated with element
            if k == '#text':
                txt = x.get(key)
        # update text to corresponding parent element
        if (k != '#text') and (txt != ''):
            instance_dict.update({k: txt})
    return (instance_dict, txt)
You will need the function read_recursively() given above. Now run the following.
import json

import pandas as pd
import xmltodict

o = xmltodict.parse(xml_string)  # INPUT: XML_STRING
# print(json.dumps(o))  # uncomment to see xml to json converted string
consolidated_dict = dict()
oi = o['Instances']['Instance']
for x in oi:
    instance_dict = dict()
    instance_dict, _ = read_recursively(x, instance_dict)
    consolidated_dict.update({x.get("@ID"): instance_dict.copy()})
df = pd.DataFrame(consolidated_dict).T
df = df[df_cols]
df
Solution 2:
Several issues:
- Calling .find on the loop variable, node, expects a child node to exist: current_node.find('child_of_current_node'). However, since all the nodes are children of the root, they have no children of their own, so no loop is required;
- Not checking for the NoneType that can result from missing nodes with find(), which prevents retrieving .tag, .text, or other attributes;
- Not retrieving node content with .text; otherwise the <Element... object is returned.
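The last two points can be seen in a small standalone sketch. The one-record document below is a made-up example, not the asker's data; it only shows what find() returns for a present and an absent child.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-record document, for illustration only.
xroot = ET.fromstring('<Instance ID="1"><Question>Why?</Question></Instance>')

question = xroot.find("Question")   # an Element object, not a string
missing = xroot.find("Answer")      # None, because there is no such child

question_text = question.text       # the node's text content

# missing.text would raise AttributeError on None, so guard the access:
answer_text = missing.text if missing is not None else None
```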
Consider this adjustment using the ternary conditional expression, a if condition else b, to ensure each variable has a value regardless:
rows = []
s_name = xroot.attrib.get("ID")
s_student = xroot.find("StudentID").text if xroot.find("StudentID") is not None else None
s_task = xroot.find("TaskID").text if xroot.find("TaskID") is not None else None
s_source = xroot.find("DataSource").text if xroot.find("DataSource") is not None else None
s_desc = xroot.find("ProblemDescription").text if xroot.find("ProblemDescription") is not None else None
s_question = xroot.find("Question").text if xroot.find("Question") is not None else None
s_ans = xroot.find("Answer").text if xroot.find("Answer") is not None else None
s_label = xroot.find("Label").text if xroot.find("Label") is not None else None
s_contextrequired = xroot.find("ContextRequired").text if xroot.find("ContextRequired") is not None else None
s_extraInfoinAnswer = xroot.find("ExtraInfoInAnswer").text if xroot.find("ExtraInfoInAnswer") is not None else None
s_comments = xroot.find("Comments").text if xroot.find("Comments") is not None else None
s_watch = xroot.find("Watch").text if xroot.find("Watch") is not None else None
s_referenceAnswers = xroot.find("ReferenceAnswers").text if xroot.find("ReferenceAnswers") is not None else None
# Keys must match the names in df_cols, otherwise the column comes out empty
rows.append({"ID": s_name, "StudentID": s_student, "TaskID": s_task,
             "DataSource": s_source, "ProblemDescription": s_desc,
             "Question": s_question, "Answer": s_ans, "Label": s_label,
             "ContextRequired": s_contextrequired, "ExtraInfoInAnswer": s_extraInfoinAnswer,
             "Comments": s_comments, "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers
             })
out_df = pd.DataFrame(rows, columns = df_cols)
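The repeated conditional can also be written more compactly with Element.findtext(), which returns None by default when the child is missing. This is only a sketch of that variation: the sample element and the shortened column list are assumptions, and it is not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical record and shortened column list, for illustration only.
xroot = ET.fromstring(
    '<Instance ID="7"><TaskID>T1</TaskID><Question>Why?</Question></Instance>'
)
df_cols = ["ID", "TaskID", "Question", "Answer"]

row = {"ID": xroot.attrib.get("ID")}
for col in df_cols[1:]:
    # findtext() returns the child's text, or None when the child is absent
    row[col] = xroot.findtext(col)
rows = [row]
```

As before, rows can be passed to pd.DataFrame(rows, columns=df_cols).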
Alternatively, run a more dynamic version assigning to an inner dictionary using the iterator variable:
rows = []
inner = {}
for node in xroot:
    # every child of the root is one field of the same record
    inner[node.tag] = node.text
rows.append(inner)
out_df = pd.DataFrame(rows, columns = df_cols)
Or list/dict comprehension:
rows = [{node.tag: node.text for node in xroot}]
out_df = pd.DataFrame(rows, columns = df_cols)
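If the root instead wraps several record elements, the comprehension can be nested one level deeper so that each record becomes one dictionary. The document below, with an Instances root and Instance children carrying an ID attribute, is an assumed layout used only to illustrate this.

```python
import xml.etree.ElementTree as ET

# Hypothetical multi-record document, for illustration only.
xroot = ET.fromstring(
    "<Instances>"
    '<Instance ID="1"><Question>Q1</Question><Answer>A1</Answer></Instance>'
    '<Instance ID="2"><Question>Q2</Question></Instance>'
    "</Instances>"
)

# One dict per record: the ID attribute plus tag -> text for each child.
rows = [
    {"ID": inst.attrib.get("ID"), **{child.tag: child.text for child in inst}}
    for inst in xroot
]
```

Fields missing from a record are simply absent from its dictionary; pd.DataFrame(rows, columns=df_cols) fills them with NaN.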