Fix some segmentation training data. #1254

Open
wants to merge 6 commits into
base: master
from


haydn-jones

Found some training data that was incorrectly assigning listBibl to body elements; this might be contributing to the issues I'm experiencing.

@lfoppiano
Collaborator

lfoppiano commented Feb 25, 2025

Hi @haydn-jones, thanks for the PR.

These training data are for the lightweight models (article-light). The references are indeed supposed to be considered <body>; however, this is done in the XML parser 😅, so there is actually no need to fix this training data.

See

@haydn-jones
Author

Ah I see. Half of the files are in the standard segmentation training set though, right?

@lfoppiano
Collaborator

Yes, I prefer to annotate them with the full segmentation approach, especially with regard to headnotes and footnotes, because we might change the lightweight model in the future. I suggest you focus only on the data in grobid-trainer/resources/dataset/segmentation/corpus

@haydn-jones
Author

@lfoppiano Sounds good, I reverted the changes for the lightweight models.


2 participants