Fix the problems with PDF processing in Harmony that were identified in the Kaggle competition, and improve PDF handling so that it integrates more seamlessly. Add support for Excel and Word documents as well.
Harmony lets users upload a PDF document, which the system then processes to identify and extract the text of questionnaire questions. You can try Harmony at harmonydata.ac.uk.
Demo of Harmony's Functionality: For a better understanding of what we aim to achieve, participants can view a demo of Harmony's current PDF processing functionality on YouTube.
The objective is to build upon Harmony's existing technology to create a more efficient, accurate, and robust tool for extracting questionnaire questions from a variety of documents. Participants are encouraged to innovate and develop solutions that can handle a wide range of document formats and structures.
We have lots of example PDFs, together with the ground truths (what questions should be extracted), here:
https://github.com/harmonydata/pdf-questionnaire-extraction/tree/main/data
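To check an extractor against the ground truths, one option is to compare the set of extracted questions with the set of expected questions. The sketch below is illustrative only: the function names `normalise` and `score` are hypothetical, it assumes both sides are already available as plain lists of strings (the actual file layout in the data folder is not described here), and it uses exact matching after light normalisation, which is not necessarily the metric the Kaggle competition uses.

```python
import re


def normalise(text: str) -> str:
    """Drop a leading item number such as '1.' or '2)', collapse whitespace, lowercase."""
    text = re.sub(r"^\s*\d+\s*[.)]\s*", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def score(predicted, expected):
    """Return (precision, recall, f1) for extracted questions against ground truth.

    Assumption: a prediction counts as correct only if it equals a ground-truth
    question exactly after normalisation.
    """
    pred = {normalise(q) for q in predicted if q.strip()}
    gold = {normalise(q) for q in expected if q.strip()}
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Made-up example data, not taken from the repository.
predicted = [
    "1. Feeling nervous, anxious or on edge",
    "Page 2 of 3",  # a page footer wrongly extracted as a question
    "Not being able to stop  worrying",
]
expected = [
    "Feeling nervous, anxious or on edge",
    "Not being able to stop worrying",
    "Trouble relaxing",
]
precision, recall, f1 = score(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a question that is extracted with one wrong character counts as a miss, so a fuzzier comparison may be more appropriate for noisy PDF text.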
Issue: harmonydata/harmony#11
Try our Kaggle competition: https://www.kaggle.com/competitions/harmony-pdf-and-word-questionnaires-extract-v2
GitHub repo for PDF parsing: https://github.com/harmonydata/pdf-questionnaire-extraction
Code Repository: Participants may find it beneficial to explore Harmony's existing code repository related to PDF processing. This can serve as a starting point or reference for developing their solutions. See the Harmony GitHub repository and https://github.com/harmonydata/pdf-questionnaire-extraction
You might also find lists like this useful: https://ipip.ori.org/AlphabeticalItemList.htm
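One possible use of such a list is as a lookup: a line of extracted text that closely matches a known questionnaire item is probably a question. The sketch below is a hypothetical illustration using Python's standard `difflib` module. The function name `match_known_items`, the `cutoff` value of 0.85, and the two sample items are assumptions chosen for the example; loading the real list from the page above is left out.

```python
from difflib import get_close_matches


def match_known_items(lines, known_items, cutoff=0.85):
    """Map each extracted line to the known item it most closely resembles.

    Lines with no sufficiently close match are left out of the result.
    The cutoff is an assumed value and would need tuning on real data.
    """
    lookup = {item.lower(): item for item in known_items}
    candidates = list(lookup)
    matches = {}
    for line in lines:
        key = " ".join(line.split()).lower()  # collapse whitespace, lowercase
        close = get_close_matches(key, candidates, n=1, cutoff=cutoff)
        if close:
            matches[line] = lookup[close[0]]
    return matches


# Small illustrative sample; a real run would load the full item list.
known_items = [
    "Am the life of the party.",
    "Feel little concern for others.",
]
lines = [
    "Am the life of the party",          # trailing full stop lost
    "Page 1",                            # not a question
    "Feel little concern for  others.",  # stray double space
]
for line, item in match_known_items(lines, known_items).items():
    print(repr(line), "->", repr(item))
```

A lookup like this can only recognise items that are already in the list, so it would complement a general extractor rather than replace it.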