Python: Searching Text Inside PDF

October 07, 2024

I want to write a function like this: input: a PDF file and a string (the PDF is searchable; it was created by MS Word, for example); output: page and position (coordinate: x and y) o

Solution 1:

It helps to read sections 7.7 (Document Structure) and 9 (Text) of the PDF specification to get at least a rough idea of how text is stored in a PDF.

Approach:

Traverse every page via the Page Tree, which contains the Page Objects, and look up each page's Contents entry. That entry holds the page's content stream, which describes the page elements using PostScript-like operators.

Example:

The text ABC is placed 10 inches from the bottom of the page and 4 inches from the left edge, using 12-point Helvetica.

BT
    /F13 12 Tf
    288 720 Td
    (ABC) Tj
ET
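The operands of Td in this example follow from the default unit of PDF user space, which is 1/72 inch. A minimal sketch of that arithmetic (the helper name inches_to_points is made up for illustration, and it assumes no scaling has been applied to the page):

```python
POINTS_PER_INCH = 72  # default PDF user space unit is 1/72 inch


def inches_to_points(inches):
    """Convert a distance in inches to default user space units."""
    return inches * POINTS_PER_INCH


x = inches_to_points(4)   # distance from the left edge
y = inches_to_points(10)  # distance from the bottom of the page
print(x, y)  # 288 720
```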

Strings inside a content stream can be represented as:

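Literal string (7.3.4.2) - this is fairly straightforward: you walk the data looking for text enclosed in parentheses, e.g. with a pattern like "\((.*?)\)", keeping in mind that escaped and nested parentheses need extra care.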

Hexadecimal string (7.3.4.3) - this one is trickier, because the data has to be decoded before it can be compared with the string we are searching for.
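Both representations can be sketched with the standard library alone. This is a simplified illustration, not a full parser: it assumes the content stream is already decompressed, it does not handle nested unescaped parentheses inside literal strings, and it may wrongly match a "<...>" sequence that sits inside a literal string. All function names are made up for this example. Note also that the decoded bytes are character codes, which still have to be mapped to text through the font's encoding.

```python
import re

# Literal string: handles backslash escapes, but NOT nested unescaped parentheses.
LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)
# Hexadecimal string: "<...>" that is not part of a "<<" dictionary delimiter.
HEXADECIMAL = re.compile(rb"(?<!<)<([0-9A-Fa-f\s]*)>")
# A backslash followed by 1-3 octal digits, or by any single byte.
ESCAPE = re.compile(rb"\\(?:([0-7]{1,3})|(.))", re.DOTALL)
SIMPLE_ESCAPES = {
    b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
    b"\n": b"",  # backslash + newline is a line continuation
}


def decode_literal(body):
    """Resolve backslash escapes in the body of a literal string."""
    def replace(match):
        octal, char = match.groups()
        if octal is not None:
            return bytes([int(octal.decode("ascii"), 8) & 0xFF])
        # Unknown escapes: the backslash is simply dropped.
        return SIMPLE_ESCAPES.get(char, char)
    return ESCAPE.sub(replace, body)


def decode_hex(body):
    """Decode the body of a hexadecimal string into raw bytes."""
    digits = b"".join(body.split())  # whitespace between digits is ignored
    if len(digits) % 2:
        digits += b"0"  # an odd final digit is treated as if followed by 0
    return bytes.fromhex(digits.decode("ascii"))


def extract_strings(stream):
    """Return all decoded strings in the stream, in order of appearance."""
    found = [(m.start(), decode_literal(m.group(1))) for m in LITERAL.finditer(stream)]
    found += [(m.start(), decode_hex(m.group(1))) for m in HEXADECIMAL.finditer(stream)]
    return [data for _, data in sorted(found)]


print(extract_strings(b"(ABC) Tj <414243> Tj"))  # [b'ABC', b'ABC']
```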

After we have matched the string, the last remaining step is to figure out its position. This basically requires interpreting the PostScript-like operators of the content stream that precede the string.
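To give a feel for what that involves, here is a deliberately tiny sketch that understands only three operators: BT (begin text), Td (move to the next line by an offset) and Tj (show a string). It is an assumption-laden toy, not a real interpreter: it ignores Tm, TD, T*, TJ, cm and every other operator, does not advance the position by glyph widths after a Tj, and does not resolve escapes or font encodings. The names TOKEN and locate are invented for the example.

```python
import re

TOKEN = re.compile(
    rb"(?P<string>\((?:\\.|[^\\()])*\))"        # literal string, no nested parens
    rb"|(?P<number>[+-]?(?:\d+\.?\d*|\.\d+))"   # integer or real number
    rb"|(?P<word>/?[^\s()<>\[\]{}/%]+)",        # name (/F13) or operator (Td)
    re.DOTALL,
)


def locate(stream, needle):
    """Return (x, y) of the line start for every Tj string containing needle."""
    operands = []
    x = y = 0.0
    hits = []
    for match in TOKEN.finditer(stream):
        kind, token = match.lastgroup, match.group()
        if kind == "string":
            operands.append(token[1:-1])
        elif kind == "number":
            operands.append(float(token.decode("ascii")))
        elif token.startswith(b"/"):
            continue  # names are operands we do not need here
        else:
            if token == b"BT":
                x = y = 0.0  # a text object starts at the origin
            elif token == b"Td" and len(operands) >= 2 and all(
                isinstance(value, float) for value in operands[-2:]
            ):
                x += operands[-2]  # Td is relative to the current line start
                y += operands[-1]
            elif token == b"Tj" and operands and isinstance(operands[-1], bytes):
                if needle in operands[-1]:
                    hits.append((x, y))
            operands.clear()  # every operator consumes its operands
    return hits


stream = b"BT\n/F13 12 Tf\n288 720 Td\n(ABC) Tj\nET"
print(locate(stream, b"ABC"))  # [(288.0, 720.0)]
```

The coordinates come out in text space, which matches page coordinates only when no Tm or cm operator has changed the matrices, so a real implementation has to track those as well.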

Most of the things I have mentioned are already implemented in many products (iText, Ghostscript, ...), whose source you can read as a reference implementation.

I personally do not have any experience with Python-based PDF libraries, so you will have to figure that part out on your own.
