
Iterate Through Table Of Contents In Docx Using Python-docx

I have a document with an auto-generated table of contents at the beginning, and I would like to parse that table of contents. Is this possible using python-docx?

Solution 1:

Most of the solution is hidden in the comment section, and it took me a while to figure out exactly what the OP did and how scanny's answer changed it, so I'll post my solution here. It contains only what is written in the comments on scanny's answer. I don't fully understand how the code works, so if somebody wants to edit my answer, please feel free to do so.

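import docx

# Open the .docx file with python-docx (raw string, so the backslashes are not read as escapes)
document = docx.Document(r"path\to\file.docx")
# Get the underlying <w:body> XML element
body_elements = document._body._body
# Collect every run (<w:r>) anywhere below the body
rs = body_elements.xpath('.//w:r')
# Keep the text of runs whose character style is "Hyperlink" (the TOC entries)
table_of_content = [r.text for r in rs if r.style == "Hyperlink"]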

table_of_content will be a flat list in which each TOC entry contributes its numbering as one item, followed by its title as the next item.
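If you want (numbering, title) pairs rather than the flat list, something like the sketch below may help. The helper name pair_toc_entries is hypothetical, and the code assumes the list strictly alternates between numbering and title; if your document also styles page numbers or other runs as "Hyperlink", the pairing will need adjusting.

```python
def pair_toc_entries(items):
    """Group a flat [number, title, number, title, ...] list into tuples.

    Assumption: entries strictly alternate between numbering and title.
    Empty or whitespace-only items (and None) are dropped first.
    """
    cleaned = [item.strip() for item in items if item and item.strip()]
    # Even positions are taken as numbering, odd positions as titles.
    return list(zip(cleaned[0::2], cleaned[1::2]))


# Illustrative data only, not taken from a real document.
example = ["1", "Introduction", "1.1", "Scope", "2", "Methods"]
pairs = pair_toc_entries(example)
for number, title in pairs:
    print(number, title)
```

With the list from the code above, you would call pair_toc_entries(table_of_content).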

Solution 2:

I believe you'll find that the actual generated contents of the TOC are "wrapped" in a non-paragraph element. python-docx won't get you there directly, as it only finds paragraphs that are direct children of the w:document/w:body element.

To get at these you'll need to go down to the lxml level, using python-docx to get you as close as possible. You can get to (and print) the body element with this:

from docx import Document

document = Document('my-doc.docx')
body_element = document._body._body
print(body_element.xml)  # this will be big if your document is

From there you can identify the specific XML location of the parts you want and use lxml/XPath to access them. Then you can wrap them in python-docx Paragraph objects for ready access:

from docx.text.paragraph import Paragraph

ps = body_element.xpath('./w:something/w:something_child/w:p')
paragraphs = [Paragraph(p, None) for p in ps]

This is not an exact recipe and will require some research on your part to work out what w:something etc. are, but if you want it badly enough to surmount those hurdles, this approach will work.

Once you get it working, posting your exact solution may be of help to others on search.
