Iterate Through Table Of Contents In Docx Using Python-docx
Solution 1:
Most of the solution is buried in the comments under scanny's answer, and it took me a while to work out exactly what the OP did and how scanny's answer changed his approach. So I'm posting my solution here; it is nothing more than what is written in those comments. I don't fully understand how the code works, so if somebody wants to edit my answer, please feel free to do so.
import docx
# open the docx file with python-docx (raw string, so "\t" and "\f" are not read as escape sequences)
document = docx.Document(r"path\to\file.docx")
# extract the body element
body_elements = document._body._body
# extract every run, i.e. everything wrapped in a <w:r> tag
rs = body_elements.xpath('.//w:r')
# keep the text of the runs whose style is "Hyperlink" (the TOC entries)
table_of_content = [r.text for r in rs if r.style == "Hyperlink"]
table_of_content will be a flat list in which each entry's numbering comes first as its own item, followed by that entry's title as the next item.
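If you would rather have one item per TOC entry, the flat list can be regrouped into pairs. This is only a minimal sketch: pair_toc_entries is a name I made up, the sample list is invented data, and it assumes the list alternates strictly between numbering and title, which will not hold if some headings are unnumbered.

```python
def pair_toc_entries(items):
    """Group a flat [number, title, number, title, ...] list into (number, title) tuples."""
    # zip the even-indexed items (numbering) with the odd-indexed items (titles);
    # a trailing unpaired item, if there is one, is silently dropped
    return list(zip(items[0::2], items[1::2]))


# hypothetical sample data in the shape described above
table_of_content = ["1", "Introduction", "1.1", "Background", "2", "Methods"]
for number, title in pair_toc_entries(table_of_content):
    print(number, title)
```

If your document mixes numbered and unnumbered headings, inspect the list first instead of trusting the pairing.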
Solution 2:
I believe you'll find that the actual generated content of the TOC is "wrapped" in a non-paragraph element. python-docx won't get you there directly, as it only finds paragraphs that are direct children of the w:document/w:body element.
To get at these you'll need to go down to the lxml level, using python-docx to get you as close as possible. You can get to (and print) the body element with this:
from docx import Document
document = Document('my-doc.docx')
body_element = document._body._body
print(body_element.xml) # this will be big if your document is
From there you can identify the specific XML location of the parts you want and use lxml/XPath to access them. Then you can wrap them in python-docx Paragraph objects for ready access:
from docx.text.paragraph import Paragraph
ps = body_element.xpath('./w:something/w:something_child/w:p')
paragraphs = [Paragraph(p, None) for p in ps]
This is not an exact recipe and will require some research on your part to work out what w:something etc. are, but if you want it bad enough to surmount those hurdles, this approach will work.
Once you get it working, posting your exact solution may be of help to others on search.