Python/R: Generate Dataframe From XML When Not All Nodes Contain All Variables?

July 29, 2023

Consider the following XML example, read into R with xml2's read_xml() as myxml (the full document is reproduced under "Data" in Solution 1).

Solution 1:

A general R solution that does not require hardcoding the variable names. Using xml2 and tidyverse's purrr:

library(xml2)
library(purrr)

myxml %>%
  xml_find_all('obs') %>%
  # Enter each obs and return a df
  map_df(~{
    # Scan names
    node_names <- .x %>%
      xml_children() %>%
      xml_name() %>%
      unique()
    # Remember ob
    ob <- .x

    # Enter each node
    map(node_names, ~{
      # Find similar nodes
      node <- xml_find_all(ob, .x) %>%
        xml_text(trim = TRUE) %>%
        paste0(collapse = '|') %>%
        'names<-'(.x)
      # ^ we need to name the element to
      #   overwrite it with its 'siblings'
    }) %>%
      # Return an 'ob' vector
      flatten()
  })
#> # A tibble: 2 × 3
#>     name       hobby  skill
#>    <chr>       <chr>  <chr>
#> 1   John tennis|golf python
#> 2 Robert        <NA>      R

What it does:

It 'enters' each obs, then finds and stores the node names in that obs.
For each node name, it finds all the nodes with that name in the obs, collapses their text with '|' and stores the result in a list.
It flattens the list, overwriting elements with the same name.
It rbinds (implicit in map_df()) each flattened list into the resulting data.frame.
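
For readers following along in Python, the four steps above can be sketched with only the standard library. This is a rough sketch, not a port of the xml2/purrr code: the helper names obs_to_row and rows_to_table are made up for illustration, and missing values are represented as None rather than NA.

```python
import xml.etree.ElementTree as ET

XML_TXT = """<data>
  <obs ID="a">
    <name> John </name>
    <hobby> tennis </hobby>
    <hobby> golf </hobby>
    <skill> python  </skill>
  </obs>
  <obs ID="b">
    <name> Robert </name>
    <skill> R </skill>
  </obs>
</data>"""


def obs_to_row(ob):
    """Collapse same-named children of one obs into a single dict entry."""
    # Scan the child names, keeping first-seen order without duplicates.
    node_names = list(dict.fromkeys(child.tag for child in ob))
    row = {}
    for name in node_names:
        # Find every child with this name and join the trimmed texts.
        texts = [(node.text or "").strip() for node in ob.findall(name)]
        row[name] = "|".join(texts)
    return row


def rows_to_table(rows):
    """Align rows on the union of their keys, using None for missing values."""
    columns = list(dict.fromkeys(key for row in rows for key in row))
    return columns, [[row.get(col) for col in columns] for row in rows]


root = ET.fromstring(XML_TXT)
columns, table = rows_to_table([obs_to_row(ob) for ob in root.findall("obs")])
print(columns)
for record in table:
    print(record)
```

No variable name is hardcoded here either: the columns are whatever child tags show up across the obs elements.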

Data:

myxml <- read_xml('
                  <data>
                  <obs ID="a">
                  <name> John </name>
                  <hobby> tennis </hobby>
                  <hobby> golf </hobby>
                  <skill> python  </skill>
                  </obs>
                  <obs ID="b">
                  <name> Robert </name>
                  <skill> R </skill>
                  </obs>
                  </data>
                  ')

Solution 2:

pandas

import pandas as pd
from collections import defaultdict
import xml.etree.ElementTree as ET


xml_txt = """<data>
  <obs ID="a">
  <name> John </name>
  <hobby> tennis </hobby>
  <hobby> golf </hobby>
  <skill> python  </skill>
  </obs>
  <obs ID="b">
  <name> Robert </name>
  <skill> R </skill>
  </obs>
  </data>"""

etree = ET.fromstring(xml_txt)

def obs2series(o):
    d = defaultdict(list)
    # Group the stripped text of each child under its tag name
    for c in o:  # Element.getchildren() was removed in Python 3.9
        d[c.tag].append(c.text.strip())
    return pd.Series(d).str.join('|')

pd.DataFrame([obs2series(o) for o in etree.findall('obs')])

         hobby    name   skill
0  tennis|golf    John  python
1          NaN  Robert       R

How It Works

Build an element tree from the string. To read from a file instead, do something like et = ET.parse('my_data.xml').
etree.findall('obs') returns a list of the elements within the XML structure that are 'obs' tags.
I pass each of these to obs2series, which builds a pd.Series.
Within obs2series I loop through all child nodes in one 'obs' element.
defaultdict defaults to a list, meaning I can append to a value even if the key hasn't been seen before.
I end up with a dictionary of lists. I pass this to pd.Series to get a series of lists.
Using pd.Series.str.join('|') I convert this to a series of strings, as I wanted.
The list comprehension that loops over the observations is now a list of series, ready to be passed to the pd.DataFrame constructor.
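
To see the defaultdict step in isolation, here is a minimal standard-library sketch that groups one obs by tag and then joins each group. The empty <skill/> element is an addition for illustration, not part of the data above. It is there because the text of an empty element is None, so calling .strip() on it directly would raise an AttributeError, which the `or ""` guard avoids.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# One obs element; <skill/> is a hypothetical empty node added for this sketch.
obs = ET.fromstring(
    '<obs ID="a"><name> John </name><hobby> tennis </hobby>'
    "<hobby> golf </hobby><skill/></obs>"
)

grouped = defaultdict(list)
for child in obs:
    # A missing key starts out as an empty list, so append always works.
    grouped[child.tag].append((child.text or "").strip())

# Collapse each list of texts into a single '|'-separated string.
joined = {tag: "|".join(values) for tag, values in grouped.items()}
print(dict(grouped))
print(joined)
```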

Solution 3:

XML

Create a function that can handle missing or multiple nodes, and then apply that to the obs nodes. I added the id column so you can see how to use xmlGetAttr too (use "." for the obs node itself and a leading "." on the other paths so they are relative to the current node in the set).

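library(XML)

# Assumes doc holds the parsed XML, e.g. doc <- xmlParse("my_data.xml")
xpath2 <- function(x, ...){
    y <- xpathSApply(x, ...)
    # NA when nothing matches, otherwise the trimmed values joined by ", "
    ifelse(length(y) == 0, NA,  paste(trimws(y), collapse=", "))
}
obs <- getNodeSet(doc, "//obs")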
data.frame( id = sapply(obs, xpath2, ".", xmlGetAttr, "ID"),
          name = sapply(obs, xpath2, ".//name", xmlValue),
       hobbies = sapply(obs, xpath2, ".//hobby", xmlValue),
         skill = sapply(obs, xpath2, ".//skill", xmlValue))

  id   name      hobbies  skill
1  a   John tennis, golf python
2  b Robert         <NA>      R
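
The same idea carries over to Python: a small helper that returns a missing value when nothing matches and a joined string when several nodes match. A minimal sketch using xml.etree.ElementTree, where the helper name collapse is hypothetical and None stands in for NA:

```python
import xml.etree.ElementTree as ET

XML_TXT = """<data>
  <obs ID="a">
    <name> John </name>
    <hobby> tennis </hobby>
    <hobby> golf </hobby>
    <skill> python  </skill>
  </obs>
  <obs ID="b">
    <name> Robert </name>
    <skill> R </skill>
  </obs>
</data>"""


def collapse(node, path, sep=", "):
    """Return None if `path` matches nothing, else the joined trimmed texts."""
    texts = [(match.text or "").strip() for match in node.findall(path)]
    return sep.join(texts) if texts else None


root = ET.fromstring(XML_TXT)
records = [
    {
        "id": ob.get("ID"),  # attribute lookup, the counterpart of xmlGetAttr
        "name": collapse(ob, ".//name"),
        "hobbies": collapse(ob, ".//hobby"),
        "skill": collapse(ob, ".//skill"),
    }
    for ob in root.iter("obs")  # every obs in the document, like "//obs"
]
for record in records:
    print(record)
```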

xml2

I don't use xml2 very often, but one option is to get the obs nodes and then, for tags that may be duplicated, apply xml_find_all instead of xml_find_first.

obs <-  xml_find_all(myxml, "//obs")  
lapply(obs, xml_find_all, ".//hobby")

data_frame(
     name = xml_find_first(obs, ".//name") %>% xml_text(trim=TRUE),
  hobbies = sapply(obs, function(x)  paste(xml_text( xml_find_all(x, ".//hobby"), trim=TRUE), collapse=", " ) ),
    skill = xml_find_first(obs, ".//skill") %>% xml_text(trim=TRUE)
)

# A tibble: 2 x 3
    name      hobbies  skill
   <chr>        <chr>  <chr>
1   John tennis, golf python
2 Robert                   R

I tested both methods using the medline17n0853.xml file at the NCBI ftp. This is a 280 MB file with 30,000 PubmedArticle nodes, and the XML package took 102 seconds to parse pubmed ids, journals and combine multiple publication types. The xml2 code ran for 30 minutes and then I killed it, so that may not be the best solution.
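
If file size is the concern, one option on the Python side is to stream the document instead of loading it whole. The sketch below uses xml.etree.ElementTree.iterparse and clears each obs once it has been turned into a row. It is an assumption that this helps for a file of that size: it has not been run against the Medline file mentioned above, and stream_obs is a made-up name.

```python
import io
import xml.etree.ElementTree as ET

XML_BYTES = b"""<data>
  <obs ID="a">
    <name> John </name>
    <hobby> tennis </hobby>
    <hobby> golf </hobby>
    <skill> python  </skill>
  </obs>
  <obs ID="b">
    <name> Robert </name>
    <skill> R </skill>
  </obs>
</data>"""


def stream_obs(source, tag="obs"):
    """Yield one dict per `tag` element, clearing each element afterwards."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != tag:
            continue
        grouped = {}
        for child in elem:
            grouped.setdefault(child.tag, []).append((child.text or "").strip())
        yield {name: "|".join(texts) for name, texts in grouped.items()}
        # Drop the children and text of the finished element; the emptied
        # element itself stays attached to the root.
        elem.clear()


# A file path or an open binary file works in place of the BytesIO object.
rows = list(stream_obs(io.BytesIO(XML_BYTES)))
print(rows)
```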

Solution 4:

In R, I'd probably use

library(XML)
lst <- xmlToList(xmlParse(myxml)[['/data']])
(df <- data.frame(t(sapply(lst, function(x) {
  c(x['name'], hobby=paste0(x[which(names(x)=='hobby')], collapse="|"))
}))) )
#       name           hobby
# 1    John   tennis | golf 
# 2  Robert

Then maybe polish the result: df[df==""] <- NA turns empty strings into NA, and trimws() removes the surrounding whitespace.

Or:

library(xml2)
library(dplyr)
`%|||%` <- function(x, y) if (length(x) == 0) y else x
(df <- data_frame(names = myxml %>%
    xml_find_all("/data/obs/name") %>%
    xml_text(trim=TRUE),
  hobbies = myxml %>%
    xml_find_all("/data/obs") %>%
    lapply(function(x) xml_text(xml_find_all(x, "hobby"), TRUE) %|||% NA_character_)))
# # A tibble: 2 × 2
#    names   hobbies
#    <chr>    <list>
# 1   John <chr [2]>
# 2 Robert <chr [1]>
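
A rough Python counterpart of this list-column result keeps the hobbies of each obs as a list and falls back to [None] when there are none, which is the role %|||% plays above. A sketch under that assumption, standard library only:

```python
import xml.etree.ElementTree as ET

XML_TXT = """<data>
  <obs ID="a">
    <name> John </name>
    <hobby> tennis </hobby>
    <hobby> golf </hobby>
    <skill> python  </skill>
  </obs>
  <obs ID="b">
    <name> Robert </name>
    <skill> R </skill>
  </obs>
</data>"""

root = ET.fromstring(XML_TXT)
records = []
for ob in root.findall("obs"):
    hobbies = [(h.text or "").strip() for h in ob.findall("hobby")]
    records.append({
        "names": ob.findtext("name", default="").strip(),
        # An empty list is falsy, so `or` supplies the fallback value.
        "hobbies": hobbies or [None],
    })
for record in records:
    print(record)
```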
