PDF/UA 7.1 test 3 false negative (real content vs artifacts)? #230

Open
jzuidweg opened this issue Jul 15, 2022 · 5 comments

Comments

@jzuidweg

The author of test document 19 told me that this document has content which is not tagged. However, it does not raise an error for PDF/UA 7.1 test 3 ('Content is neither marked as Artifact nor tagged as real content').

Is this a false negative?

@MaximPlusov
Collaborator

Could you clarify what content the author is referring to?

@MaximPlusov
Collaborator

Acrobat and PAC don't show this error either.

@jzuidweg
Author

jzuidweg commented Jul 28, 2022

I asked the author about this. He says that according to Acrobat Pro and PAC3, the content of test document 19 is tagged starting from the line "Het college van B&W vergadert op dinsdag." The first three lines of the document starting with "Nieuws van het college van B&W" through "Documentnummer: ZD2203686_3" do not appear to be tagged as content.

See the screenshots from Acrobat Pro and PAC3: (screenshot from Acrobat Pro) (screenshot from PAC3)

@MaximPlusov
Collaborator

MaximPlusov commented Jul 28, 2022

OK. It looks like this is the content in question:
(screenshot: artifact)
It is not actually tagged, but it is marked as Artifact, so this is not a 7.1-3 false negative. The error falls under the rule "Real content is marked as artifact.", which is not currently checked.

@bdoubrov
Collaborator

bdoubrov commented Jul 28, 2022

Normally only the author of the document can say whether some content is an artifact or not. In this particular example, I'd say it is obviously real content, so marking it as an artifact is wrong.

There are several approaches that can be taken to detect such cases:

  • (naïve) Report any text marked as an artifact as a warning and let the user verify it. The problem is that, for example, all page labels will be reported as well.
  • (complex) Try to recognize typical cases where artifacts are legitimately used, such as page headers and footers, repeated table headers in a multipage table, and line numbers, and exclude them from the naïve approach described above.
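
To make the two approaches concrete, here is a rough Python sketch. It is not taken from any validator: the `ArtifactText` model, the `PAGE_LABEL` pattern, and the `min_repeat_pages` threshold are all hypothetical, and it assumes the artifact text has already been extracted per page. It starts from the naïve rule (every artifact text is a warning) and then drops items that look like page labels, line numbers, or text repeated across pages.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactText:
    """A run of text that the PDF marks as Artifact (hypothetical model)."""
    page: int
    text: str


# Assumed shapes of page labels and line numbers: "3", "Page 3", "3 / 12", "3 of 12".
PAGE_LABEL = re.compile(r"^(page\s+)?\d+(\s*(/|of)\s*\d+)?$", re.IGNORECASE)


def suspicious_artifacts(items, min_repeat_pages=2):
    """Return artifact text that may be real content and should be reviewed."""
    # Collect the set of pages on which each distinct text occurs.
    pages_by_text = {}
    for item in items:
        pages_by_text.setdefault(item.text.strip(), set()).add(item.page)

    warnings = []
    for item in items:
        text = item.text.strip()
        if not text:
            continue  # whitespace only, nothing to review
        if PAGE_LABEL.match(text):
            continue  # looks like a page label or line number
        if len(pages_by_text[text]) >= min_repeat_pages:
            continue  # repeats across pages: likely header, footer, or table header
        warnings.append(item)  # naïve rule: everything else is a warning
    return warnings


items = [
    ArtifactText(1, "Nieuws van het college van B&W"),
    ArtifactText(1, "Running header"),
    ArtifactText(2, "Running header"),
    ArtifactText(1, "Page 1"),
    ArtifactText(2, "Page 2"),
]
flagged = [item.text for item in suspicious_artifacts(items)]
```

With this input only the one-off line "Nieuws van het college van B&W" remains flagged. A real check would probably also need position and font information, since a heuristic based on text alone will miss headers whose wording changes from page to page.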

@bdoubrov bdoubrov changed the title PDF/UA 7.1 test 3 false negative? PDF/UA 7.1 test 3 false negative (real content vs artifacts)? Dec 14, 2022