Ignored text before tables #1251

Open
panagiotis-tsolakis opened this issue Feb 24, 2025 · 3 comments

Comments

@panagiotis-tsolakis

In a sample of two PDFs that I converted into TEI with Grobid, I noticed that text lines preceding a table were sometimes dropped and did not appear in the final TEI file. The following screenshots come from two consecutive pages of one of the PDF files:

Image

Image

The text underlined in yellow was ignored by Grobid, as you can see in the TEI file. The text of the table was ignored as well.

Image
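
For anyone who wants to check for this kind of loss without comparing screenshots, here is a minimal sketch that looks for expected phrases in the text of a TEI file. It uses only the Python standard library and does not call Grobid itself. The helper names (`normalize`, `missing_snippets`) and the sample TEI string are assumptions made up for illustration, not part of Grobid, and the snippets have to be copied from the PDF by hand.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI documents
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def normalize(text):
    """Collapse runs of whitespace so line breaks do not affect matching."""
    return " ".join(text.split())


def missing_snippets(tei_xml, snippets):
    """Return the snippets that do not occur in the TEI <body> text.

    Hypothetical helper: falls back to the whole document if no <body> exists.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{TEI_NS}body")
    scope = body if body is not None else root
    full_text = normalize(" ".join(scope.itertext()))
    return [s for s in snippets if normalize(s) not in full_text]


# Made-up miniature TEI document standing in for real Grobid output
sample_tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><p>Results are summarised below.</p></div>
  </body></text>
</TEI>"""

print(missing_snippets(
    sample_tei,
    ["Results are summarised", "Table 2 lists the features"],
))
```

With real output, you would read the TEI file from disk and pass its contents as `tei_xml`. Exact substring matching is deliberately simple: words hyphenated across line breaks in the PDF, or text split across elements, can produce false alarms, so treat a reported snippet as a hint to inspect the TEI rather than as proof of a bug.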

@lfoppiano
Collaborator

Hi @panagiotis-tsolakis! Thanks for reporting this. Could you attach the PDF files here and let me know which version of Grobid you used?

If you haven't already, I suggest trying the current master.

@panagiotis-tsolakis
Author

I used Grobid 0.8.1.

Here's the PDF file:

zwitter-vitez-etal-2022-extracting.pdf

@lfoppiano
Collaborator

@panagiotis-tsolakis the missing text was caused by a bug in 0.8.1 that has been fixed in the current master, so the text is no longer lost. Unfortunately, the table is still blended into the surrounding text; recognition should hopefully improve with #963.
