PDF streams questions from a newcomer #97

sherrellbc · 2025-01-27T18:35:14Z

As a mostly academic effort I am working to better understand the PDF specification and format. I stumbled on PDFIO and thought it might help facilitate that effort.

I've written a bit of code that attempts to extract some text from the included testpdfio.pdf PDF file in the repo. With all the rendering effects aside, at some level you fundamentally must be able to reproduce the text content - and all the accessory information telling you where to put it on the page and in what font and font size.

From the specification I found that, at the simplest level, the Tj and TJ operators instruct the renderer to show a single string or an array of strings, respectively. According to the specification, these strings may be either string literals (e.g. "(hello, world)") or hex strings (e.g. "<424344>").
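
To make the hex string form concrete, here is a minimal sketch in C that turns the body of a hex string such as `<424344>` into raw bytes. The name `hex_to_bytes` is hypothetical and not part of PDFio, and the sketch assumes the surrounding `<` and `>` have already been stripped. Per the PDF specification, whitespace inside a hex string is ignored and a missing final digit is treated as 0.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Convert one hex digit to its value, or return -1 if it is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper (not a PDFio function): decode the digits of a hex
 * string, without the surrounding < >, into out.
 * Returns the number of bytes written, or -1 on an invalid digit or when
 * out is too small.
 */
static long hex_to_bytes(const char *hex, unsigned char *out, size_t outsize)
{
    size_t n = 0;
    int high = -1; /* pending high nibble, or -1 if none */

    assert(hex != NULL && out != NULL);

    for (; *hex != '\0'; hex++) {
        int v;

        if (isspace((unsigned char)*hex))
            continue; /* whitespace inside a hex string is ignored */

        v = hex_value((unsigned char)*hex);
        if (v < 0)
            return -1;

        if (high < 0) {
            high = v;
        } else {
            if (n >= outsize)
                return -1;
            out[n++] = (unsigned char)((high << 4) | v);
            high = -1;
        }
    }

    /* An odd number of digits: the missing final digit is assumed to be 0. */
    if (high >= 0) {
        if (n >= outsize)
            return -1;
        out[n++] = (unsigned char)(high << 4);
    }

    return (long)n;
}
```

Calling `hex_to_bytes("424344", buf, sizeof(buf))` fills `buf` with the three bytes 0x42, 0x43, 0x44. What those bytes mean as text still depends on the font in effect.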

However, while the Tj's and TJ's I see within testpdfio.pdf follow this format - the strings appear to be further obscured by something that I must be missing from the specification. Here is what I see for the first stream on the first page of this PDF doc printing only the string literals or hex-encoded strings:

 <parse>  handle_TJ
 <parse>        ^N               -> [ 0e 00 00 00 ]
 <parse>        #                -> [ 23 00 00 00 ]
 <parse>        &                -> [ 26 00 00 00 ]
 <parse>        ^Z               -> [ 1a 00 00 00 ]
 <parse>        !                -> [ 21 00 00 00 ]
 <parse>        ^^               -> [ 1e 00 00 00 ]
 <parse>        $                -> [ 24 00 00 00 ]
 <parse>        '                -> [ 27 00 00 00 ]
 <parse>        )                -> [ 29 00 00 00 ]
 <parse>        !                -> [ 21 00 00 00 ]
 <parse>        ^Y               -> [ 19 00 00 00 ]
 <parse>        #                -> [ 23 00 00 00 ]
 <parse>                         -> [ 20 00 00 00 ]
 <parse>        #                -> [ 23 00 00 00 ]
 <parse>        &                -> [ 26 00 00 00 ]
 <parse>        '                -> [ 27 00 00 00 ]
 <parse>        ^^               -> [ 1e 00 00 00 ]
 <parse>        (                -> [ 28 00 00 00 ]
 <parse>        ^V               -> [ 16 00 00 00 ]
 <parse>        !                -> [ 21 00 00 00 ]
 <parse>        ^Z               -> [ 1a 00 00 00 ]
 <parse>        (                -> [ 28 00 00 00 ]
 <parse>        00               -> [ 30 30 00 00 ]
 <parse>        ^X               -> [ 18 00 00 00 ]
 <parse>        #                -> [ 23 00 00 00 ]
 <parse>        "                -> [ 22 00 00 00 ]
 <parse>        '                -> [ 27 00 00 00 ]
 <parse>        ^Z               -> [ 1a 00 00 00 ]
 <parse>        ^X               -> [ 18 00 00 00 ]
 <parse>        (                -> [ 28 00 00 00 ]
 <parse>        ^Z               -> [ 1a 00 00 00 ]
 <parse>        (                -> [ 28 00 00 00 ]
 <parse>        )                -> [ 29 00 00 00 ]
 <parse>        &                -> [ 26 00 00 00 ]
 <parse>        ^V               -> [ 16 00 00 00 ]
 <parse>        ^Y               -> [ 19 00 00 00 ]
 <parse>        ^^               -> [ 1e 00 00 00 ]
 <parse>        $                -> [ 24 00 00 00 ]
 <parse>        ^^               -> [ 1e 00 00 00 ]
 <parse>        '                -> [ 27 00 00 00 ]
 <parse>        ^X               -> [ 18 00 00 00 ]
 <parse>        ^^               -> [ 1e 00 00 00 ]
 <parse>        "                -> [ 22 00 00 00 ]
 <parse>        ^\               -> [ 1c 00 00 00 ]

So, for more complex scenarios, how might these string contents be encoded? The stream itself was Flate-compressed, but passing the relevant argument to pdfioPageOpenStream lets PDFIO decompress the stream itself. So... what sort of data is this?

Each string holds only a single byte, and combining adjacent bytes into 16-bit characters does not appear to produce meaningful Unicode.

What am I missing?

michaelrsweet · 2025-01-27T19:25:26Z

Maintainer

Can you show the code that is displaying the returned tokens? A regular string will be returned as a token starting with "(" while a binary (hex) string starts with "<".

sherrellbc Jan 28, 2025
Author

I should have replied here, but see my other comments below. Am I doing something overtly wrong here?

sherrellbc · 2025-01-27T19:29:33Z

Author

Fairly simple, actually. I've collected the operands into a fifo, then invoke an operator handler where it will pop each operand and just print them as shown. The printed chars above are from token + 1 that is received from the PDFIO tokenizer.

    char *op;

    /* Pop each operand and dump string tokens; other token types are skipped. */
    while(!fifo_pop(ap->fifo, &op)){
        switch(op[0]){
        case '(': /* literal string token */
        case '<': /* hex string token */
            /* Prints four bytes after the marker regardless of the string's
               actual length, so the token buffer must hold at least 5 bytes. */
            pr("\t%s\t\t -> [ %02x %02x %02x %02x ]\n", op + 1,
                (uint8_t)op[1], (uint8_t)op[2], (uint8_t)op[3], (uint8_t)op[4]);
            break;
        default:
            break;
        }
    }
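
A bounds-safe variant of the dump above might look like the following sketch. It assumes, as the snippet does, that each token is a NUL-terminated C string whose first character is the `(` or `<` type marker. The name `format_token_bytes` is hypothetical, not a PDFio function. It formats only the bytes that are actually present, so a one-byte string yields one hex pair rather than four.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the bytes that follow the type marker of a
 * string token into out as space-separated hex pairs.
 * Returns the number of bytes formatted, or 0 for non-string tokens.
 * Limitation: strlen() stops at the first NUL, so a string containing an
 * embedded NUL byte would be cut short.
 */
static size_t format_token_bytes(const char *token, char *out, size_t outsize)
{
    size_t i, len, used = 0;

    assert(token != NULL && out != NULL && outsize > 0);
    out[0] = '\0';

    if (token[0] != '(' && token[0] != '<')
        return 0;

    len = strlen(token + 1);
    for (i = 0; i < len; i++) {
        int n = snprintf(out + used, outsize - used, "%s%02x",
                         i > 0 ? " " : "", (unsigned char)token[i + 1]);

        if (n < 0 || (size_t)n >= outsize - used) {
            out[used] = '\0'; /* drop the partially written pair */
            break;
        }
        used += (size_t)n;
    }

    return i;
}
```

With this helper, the handler could call `format_token_bytes(op, buf, sizeof(buf))` and print `buf`, instead of indexing `op[1]` through `op[4]` unconditionally.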

sherrellbc · 2025-01-27T19:32:21Z

Author

What gets literally received from the PDFIO tokenizer is this:

 <parse>  handle_TJ
 <parse>        [
 <parse>        (^N
 <parse>        -35.1522
 <parse>        (#
 <parse>        -37.1526
 <parse>        (&
 <parse>        13.9923
 <parse>        (^Z
 <parse>        -41.1525
 <parse>        (!
 <parse>        -341.84
 <parse>        (^^
 <parse>        -68.1681
 <parse>        ($
 <parse>        -40.152
 <parse>        ('
 <parse>        -39
 <parse>        ()
 <parse>        -72.1527
 <parse>        (!
 <parse>        -341.84
 ...
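
The token list above shows the shape of a TJ operand: strings alternating with numbers. The numbers are position adjustments in thousandths of a text space unit and are subtracted from the current position, so in horizontal writing a large negative value such as -341.84 moves the next glyph to the right and often marks a word gap. A minimal sketch of walking such an array follows. The name `collect_tj` and the -200 threshold are assumptions for illustration, not values from the spec or from PDFio. The sketch copies string bytes verbatim; with a custom-encoded font, each byte would first have to be mapped through the font's encoding.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed heuristic: adjustments below this value are treated as word gaps. */
#define WORD_GAP_THRESHOLD (-200.0)

/*
 * Hypothetical helper: collect the literal string operands of one TJ array.
 * tokens[] holds C strings shaped like the dump above: "(" followed by the
 * bytes of a literal string, plain text for numbers, and "[" / "]" for the
 * array delimiters. Hex string tokens are ignored in this sketch.
 * Returns the number of bytes written to out, excluding the trailing NUL.
 */
static size_t collect_tj(const char *const *tokens, size_t num_tokens,
                         char *out, size_t outsize)
{
    size_t i, used = 0;

    assert(tokens != NULL && out != NULL && outsize > 0);

    for (i = 0; i < num_tokens; i++) {
        const char *tok = tokens[i];

        if (tok[0] == '(') {
            /* Literal string: append its bytes, truncating if out is full. */
            size_t len = strlen(tok + 1);

            if (len > outsize - 1 - used)
                len = outsize - 1 - used;
            memcpy(out + used, tok + 1, len);
            used += len;
        } else if (tok[0] != '[' && tok[0] != ']') {
            /* Possibly a number: insert a space for a large negative shift. */
            char *end;
            double adjust = strtod(tok, &end);

            if (end != tok && adjust < WORD_GAP_THRESHOLD && used < outsize - 1)
                out[used++] = ' ';
        }
    }

    out[used] = '\0';
    return used;
}
```

The threshold is font and document dependent, so a real extractor would likely scale it by the font size or the width of the space glyph.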

michaelrsweet Jan 28, 2025
Maintainer

OK, I did some digging and this is because the first and third pages are copied from the file "testfiles/testpdfio.pdf", which is a CUPS test document that was generated using Ghostscript. The text uses a custom encoding for an embedded subset font, which is why you see (apparently) random garbage characters. The text on other pages is generated by PDFio (no subsets) and is correctly seen.

I can investigate what I might do to support different font encodings in the example program - essentially you need to take those strings and run them through a mapping table to get the correct text.
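
As a rough illustration of the mapping-table idea, the sketch below runs each byte of a string through a 256-entry table of Unicode code points and emits UTF-8. The names `utf8_encode` and `decode_simple` are hypothetical, and the table contents are left to the caller: in a real extractor they would be built from the font dictionary. Treating a 0 entry as "unmapped" is a convention chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode one Unicode code point as UTF-8 and return the number of bytes
 * written (1 to 4). Assumes cp is a valid code point.
 */
static size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/*
 * Hypothetical helper: map each input byte through a 256-entry table of
 * Unicode code points and write the UTF-8 result to out.
 * Entries of 0 are treated as unmapped and replaced with U+FFFD.
 * Returns the number of bytes written, excluding the trailing NUL.
 */
static size_t decode_simple(const unsigned char *in, size_t inlen,
                            const unsigned long map[256],
                            char *out, size_t outsize)
{
    size_t i, used = 0;

    assert(in != NULL && map != NULL && out != NULL && outsize > 0);

    /* Stop while there is still room for a 4-byte sequence plus the NUL. */
    for (i = 0; i < inlen && used + 4 < outsize; i++) {
        unsigned long cp = map[in[i]];

        used += utf8_encode(cp != 0 ? cp : 0xFFFD, out + used);
    }

    out[used] = '\0';
    return used;
}
```

The table has only 256 entries because each code in the string is a single byte. The values stored in the table can still be any Unicode code point.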

sherrellbc Jan 28, 2025
Author

Interesting. Is there any sort of metadata available that would indicate how the text was encoded? There must be, because any PDF reader I've tried seems to render this document correctly.

For what it's worth I was not able to extract any printable characters from that PDF file. Each stream on every page prints these encoded strings. And it must be a common encoding, because other PDFs I've tested with also give me the same results.

michaelrsweet Jan 28, 2025
Maintainer

Yes, the font dictionary associated with the named font provides the encoding information. I have updated the pdf2txt.c example code to use it:

[master 5b5de3a] Update pdf2txt example to support font encodings.

"There be dragons" and all that... :)

Answer selected by sherrellbc

sherrellbc Jan 28, 2025
Author

Interesting. Thanks for this, really. The PDF spec is explicit in that it has all this information, but I think it may be lacking in terms of putting it all together in some useful form. Or at least, for a new reader of the spec, it's not immediately obvious that pages have dictionaries with encodings defined by name, selected with Tf, and all that. Dragons, certainly.

Thanks for updating the example. It's a great help.

sherrellbc Jan 29, 2025
Author

@michaelrsweet If you wouldn't mind, could you point me to some documentation that would help to explain what you've done in pdf2text.c? You appear to have taken a few common encodings and implemented a simple decoding scheme by using a look-up table.

However, what confuses me is that your code generates the decoding map with only 256 entries. So, despite Unicode allowing tens of thousands of encoded characters, the encoding scheme actually reduces the code space back to 256?

Can you point me to some documentation on this byte stream / encoding map scheme? It seems strange to encode Unicode (2 or more bytes per character) as single bytes, only for the renderer to translate them back to Unicode.

michaelrsweet Jan 29, 2025
Maintainer

I'll be updating the documentation on the example in the coming weeks (sorry, I can't be more specific than that due to life/work), but in short:

PDF supports so-called "simple" fonts that use an 8-bit character mapping to font glyph names. Those glyphs can also be mapped to Unicode, which is what the current pdf2text.c code does.
PDF also supports so-called "composite" fonts that map 16-bit characters to font glyph indices, which can also be mapped to Unicode (often they are Unicode).
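
For the simple-font case, the code-to-glyph-name-to-Unicode chain can be sketched as follows. In a font's encoding dictionary, a /Differences array lists a starting code followed by glyph names, and each name is assigned to the next consecutive code. All names in the sketch (`glyph_to_unicode`, `diff_item`, `apply_differences`) are hypothetical, and the five-entry glyph table is purely illustrative: a real implementation would need a much larger name list such as the Adobe Glyph List. This is not a description of what pdf2text.c actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tiny illustrative glyph-name table (a real one would be far larger). */
struct glyph {
    const char *name;
    unsigned long unicode;
};

static const struct glyph glyphs[] = {
    { "A",      0x0041 },
    { "B",      0x0042 },
    { "space",  0x0020 },
    { "eacute", 0x00E9 },
    { "bullet", 0x2022 }
};

/* Look up a glyph name; returns 0 when the name is unknown. */
static unsigned long glyph_to_unicode(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(glyphs) / sizeof(glyphs[0]); i++) {
        if (strcmp(glyphs[i].name, name) == 0)
            return glyphs[i].unicode;
    }
    return 0;
}

/* One /Differences element: a starting code (name == NULL) or a glyph name. */
struct diff_item {
    int code;
    const char *name;
};

/*
 * Hypothetical helper: apply a /Differences-style list to a 256-entry
 * code-to-Unicode table. A number sets the current code; each following
 * name is stored at the current code, which then advances by one.
 */
static void apply_differences(unsigned long map[256],
                              const struct diff_item *items, size_t count)
{
    size_t i;
    int code = 0;

    assert(map != NULL && items != NULL);

    for (i = 0; i < count; i++) {
        if (items[i].name == NULL)
            code = items[i].code;
        else if (code >= 0 && code < 256)
            map[code++] = glyph_to_unicode(items[i].name);
    }
}
```

A list equivalent to `[ 14 /A /B 35 /eacute ]` would set entries 14, 15 and 35 of the table and leave every other entry untouched.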

The kicker is that 8-bit/simple fonts only have to deal with 3 relatively small base encodings (WinANSI/CP-1252, MacRoman, and MacExpert) while 16-bit/composite fonts have to deal with several hundred, and those encodings can be quite large depending on the locale. I have plans for adding support for basic Unicode composite fonts (using the "Identity" encoding) to pdf2text.c so that you can extract common Unicode text as well, although it isn't clear how well this will work beyond Plane 0 characters...
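
For the composite-font case with an "Identity" encoding, the string bytes are read as 2-byte codes, high-order byte first. The sketch below only performs that split; `split_codes16` is a hypothetical name. Whether the resulting codes can be read as Unicode depends on the font, which is an assumption the caller has to verify. A single 16-bit code can only address Plane 0 directly, which is why characters beyond it are harder.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: split a byte string into big-endian 16-bit codes.
 * A trailing odd byte is ignored. Returns the number of codes written.
 */
static size_t split_codes16(const unsigned char *in, size_t inlen,
                            unsigned short *codes, size_t maxcodes)
{
    size_t n = 0;

    assert(in != NULL && codes != NULL);

    while (inlen >= 2 && n < maxcodes) {
        codes[n++] = (unsigned short)((in[0] << 8) | in[1]);
        in += 2;
        inlen -= 2;
    }

    return n;
}
```

If the font's codes do correspond to Unicode, each value could then be converted to UTF-8 for output; otherwise a per-font lookup is still required.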

Unicode text isn't simple and the way that PDF implements it just makes things more complicated... :/

sherrellbc Jan 29, 2025
Author

Thanks for the info. One can't help but appreciate the level of effort it takes to write a PDF renderer. Graphics aside, which is a whole other massive concept to unpack regarding the interpretation of font programs, just getting basic text out of a PDF turns out to be more complex than you'd think going into it.

I suppose one must give credit where credit is due for existing solutions like Poppler's pdftotext. Its source is reasonably small, as the work is mostly handled internally inside the libraries.

I am not yet too discouraged to continue trying to learn, but the stack of turtles keeps growing...
Looking forward to your updated documentation on this.
