Parse All Xml Files In A Directory Python

Hi, I'm trying to parse all XML files in a given directory using Python. I am able to parse one file at a time, but that would be 'impossible' for me to do due to the large number of

Solution 1:

@Kevin was correct in his comment: this error means that ElementTree was not able to parse one of the documents. Something in that file is not well-formed XML, and the cause could be as simple as an odd, non-Unicode character.

What you can try to do to help debug is:

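import xml.etree.ElementTree as ET
import os

directory = "C:/Users/danie/Desktop/NLP/blogs/"


def clean_dir(directory):
    filenames = os.listdir(directory)
    print(filenames)
    for filename in filenames:
        # os.listdir() returns bare names, so join each one with the directory.
        filepath = os.path.join(directory, filename)
        try:
            tree = ET.parse(filepath)
            root = tree.getroot()
            doc_parser(root)
        except Exception as err:
            # Report the offending file and move on to the next one.
            print("ERROR ON FILE: {} ({})".format(filename, err))


post_list = []


def doc_parser(root):
    # Collect the text of every <post> element directly under the root.
    for child in root.findall('post'):
        post_list.append(child.text)


clean_dir(directory)
if post_list:
    print(post_list[0])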

The try...except statement attempts to parse each file in turn and, if parsing fails, prints the name of the file that caused the error instead of stopping the whole run.

I don't have any data to test with, but this should at least show which file is causing the error.
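Once you know which file fails, it may help to see where inside the file the parser gave up. The sketch below is one possible way to do that, not part of the original answer: it catches `ET.ParseError` specifically and reads its `position` attribute (line, column). The function name `find_bad_xml`, the `.xml` extension filter, the `<Blog>` root element and the demo files are all assumptions made for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def find_bad_xml(directory):
    """Return {filename: (line, column, message)} for files that fail to parse.

    Hypothetical helper: only files ending in .xml are checked.
    """
    problems = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.lower().endswith(".xml"):
            continue
        filepath = os.path.join(directory, filename)
        try:
            ET.parse(filepath)
        except ET.ParseError as err:
            # ParseError.position is a (line, column) tuple.
            line, column = err.position
            problems[filename] = (line, column, str(err))
    return problems


# Demo on made-up files: one well-formed, one with an unescaped ampersand.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "good.xml"), "w", encoding="utf-8") as f:
        f.write("<Blog><post>hello</post></Blog>")
    with open(os.path.join(demo_dir, "bad.xml"), "w", encoding="utf-8") as f:
        f.write("<Blog><post>fish & chips</post></Blog>")
    demo_problems = find_bad_xml(demo_dir)

for name, (line, column, message) in demo_problems.items():
    print("{}: line {}, column {}: {}".format(name, line, column, message))
```

With the real data you would call `find_bad_xml(directory)` on the blogs folder and open each reported file at the given line to look for the offending character.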
