PDF streams questions from a newcomer #97
-
As a mostly academic effort I am working to better understand the PDF specification and format. I stumbled on PDFIO and thought it might help facilitate that effort. I've written a bit of code that attempts to extract some text from the included From the specification I found that most simply we have However, while the Tj's and TJ's I see within
So, for more complex scenarios, how might these string contents be encoded? The stream itself was Flate-compressed, but using the relevant argument of Each char is only a single character, and combining adjacent chars to make 16-bit-wide characters does not appear to produce meaningful Unicode. What am I missing?
-
Can you show the code that is displaying the returned tokens? A regular string will be returned as a token starting with "(" while a binary (hex) string starts with "<".
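To make the distinction concrete, here is a minimal sketch of classifying a token by its first character and decoding a hex string into raw bytes. It does not call PDFIO at all: `token_kind` and `hex_string_decode` are hypothetical helper names, and the sketch assumes the token text is a NUL-terminated C string whose payload follows the leading "(" or "<". A closing ">" is tolerated but not required. The "<<" check is there because a dictionary opener also begins with "<".

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical token categories for this sketch (not PDFIO types).
typedef enum
{
  TOKEN_LITERAL_STRING,   // token text starts with "("
  TOKEN_HEX_STRING,       // token text starts with "<" but not "<<"
  TOKEN_OTHER             // names, numbers, operators, "<<", ...
} token_kind_t;

// Classify a token by its leading character.
token_kind_t token_kind(const char *token)
{
  assert(token != NULL);

  if (token[0] == '(')
    return (TOKEN_LITERAL_STRING);
  if (token[0] == '<' && token[1] != '<')
    return (TOKEN_HEX_STRING);

  return (TOKEN_OTHER);
}

// Decode the hex digits after the leading "<" into bytes.
// Non-hex characters (such as whitespace) are skipped, and an odd
// trailing digit is padded with 0 as the PDF specification requires.
// Output beyond outsize bytes is dropped.  Returns the byte count.
size_t hex_string_decode(const char *token, unsigned char *out, size_t outsize)
{
  size_t     count = 0;   // bytes written so far
  int        high  = -1;  // pending high nibble, or -1 if none
  const char *p;

  assert(token != NULL && token[0] == '<');

  for (p = token + 1; *p && *p != '>'; p ++)
  {
    int digit;

    if (*p >= '0' && *p <= '9')
      digit = *p - '0';
    else if (*p >= 'a' && *p <= 'f')
      digit = *p - 'a' + 10;
    else if (*p >= 'A' && *p <= 'F')
      digit = *p - 'A' + 10;
    else
      continue;

    if (high < 0)
    {
      high = digit;
    }
    else
    {
      if (count < outsize)
        out[count ++] = (unsigned char)((high << 4) | digit);
      high = -1;
    }
  }

  if (high >= 0 && count < outsize)
    out[count ++] = (unsigned char)(high << 4);

  return (count);
}
```

Note that the decoded bytes are still character codes in the current font's encoding, not necessarily ASCII or Unicode.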
-
Fairly simple, actually. I've collected the operands into a FIFO, then invoke an operator handler that pops each operand and just prints it as shown. The printed chars above are from
-
What gets literally received from the PDFIO tokenizer is this:
Yes, the font dictionary associated with the named font provides the encoding information. I have updated the pdf2txt.c example code to use it:
[master 5b5de3a] Update pdf2txt example to support font encodings.
"There be dragons" and all that... :)
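For anyone following along, here is a rough sketch of the general idea for simple single-byte fonts. It is not the code from pdf2txt.c. Each byte in a string operand is a character code that is looked up in a 256-entry table built from the font's encoding, and the resulting Unicode value is written out as UTF-8. The names `encoding_t`, `encoding_init`, `encoding_set`, and `encoding_to_utf8` are hypothetical. The identity default is an assumption made for brevity, since real base encodings such as WinAnsiEncoding or MacRomanEncoding differ above 127. `encoding_set` stands in for applying an entry from a /Differences array.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical 256-entry table: character code -> Unicode code point.
typedef struct
{
  unsigned unicode[256];
} encoding_t;

// Start from an identity mapping (an assumption for this sketch only).
void encoding_init(encoding_t *enc)
{
  unsigned i;

  assert(enc != NULL);

  for (i = 0; i < 256; i ++)
    enc->unicode[i] = i;
}

// Override one code, as an entry from a /Differences array would.
void encoding_set(encoding_t *enc, unsigned char code, unsigned unicode)
{
  assert(enc != NULL);
  enc->unicode[code] = unicode;
}

// Write one code point as UTF-8; returns the number of bytes (1 to 4).
static size_t utf8_put(unsigned ch, char *out)
{
  if (ch < 0x80)
  {
    out[0] = (char)ch;
    return (1);
  }
  else if (ch < 0x800)
  {
    out[0] = (char)(0xC0 | (ch >> 6));
    out[1] = (char)(0x80 | (ch & 0x3F));
    return (2);
  }
  else if (ch < 0x10000)
  {
    out[0] = (char)(0xE0 | (ch >> 12));
    out[1] = (char)(0x80 | ((ch >> 6) & 0x3F));
    out[2] = (char)(0x80 | (ch & 0x3F));
    return (3);
  }

  out[0] = (char)(0xF0 | (ch >> 18));
  out[1] = (char)(0x80 | ((ch >> 12) & 0x3F));
  out[2] = (char)(0x80 | ((ch >> 6) & 0x3F));
  out[3] = (char)(0x80 | (ch & 0x3F));
  return (4);
}

// Map character codes through the table and emit NUL-terminated UTF-8.
// Stops early if the output buffer cannot hold another full character.
// Returns the number of bytes written, not counting the NUL.
size_t encoding_to_utf8(const encoding_t *enc, const unsigned char *codes,
                        size_t count, char *out, size_t outsize)
{
  size_t i, outlen = 0;

  assert(enc != NULL && codes != NULL && out != NULL);

  for (i = 0; i < count && outlen + 4 < outsize; i ++)
    outlen += utf8_put(enc->unicode[codes[i]], out + outlen);

  if (outsize > 0)
    out[outlen] = '\0';

  return (outlen);
}
```

Composite (Type 0) fonts with multi-byte codes and ToUnicode CMaps need more than a flat 256-entry table, which this sketch does not attempt.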