PDF table extraction tools #18

Rainiefantasy · 2024-10-21T14:53:50Z

Question/topic for discussion:
Can the manual table creation step (PDF -> CSV) be simplified? This would be valuable because it would make the code more generalisable. For example, it may help with creating workflows for CPRD Gold, not just Aurum (#12).

Tasks:

Look into PDF table extraction tools that could assist this process.
Assess how flexible and accurate these tools are. If they are error-prone and their output takes a long time to validate, it may be better to leave the process as is.
If a tool works, consider testing it on Gold data.
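
As a rough starting point for the tasks above, here is a minimal sketch of what a PDF -> CSV step with a quick validation check might look like. It assumes the third-party `pdfplumber` library is chosen (this thread has not settled on a tool), and the helper names `extract_tables`, `clean_cell`, `table_to_csv` and `suspicious_rows` are made up for illustration. The check only flags rows that are empty or whose column count differs from the header row, so it gives a cheap estimate of how much manual validation is needed rather than a measure of accuracy.

```python
import csv
import io


def extract_tables(pdf_path):
    """Return every table found in the PDF as a list of rows (lists of cells)."""
    # Assumption: pdfplumber is the tool under test. It is imported lazily so
    # the helpers below can be used without it being installed.
    import pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


def clean_cell(cell):
    """Collapse line breaks and extra whitespace; empty cells may arrive as None."""
    return " ".join(cell.split()) if cell else ""


def table_to_csv(rows):
    """Serialise one extracted table to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([clean_cell(cell) for cell in row])
    return buffer.getvalue()


def suspicious_rows(rows):
    """Indices of rows that are entirely empty or whose width differs from the header row."""
    if not rows:
        return []
    width = len(rows[0])
    return [
        index
        for index, row in enumerate(rows)
        if len(row) != width or not any(clean_cell(cell) for cell in row)
    ]


# Made-up sample rows standing in for one extracted table.
sample = [["Field name", "Type"], ["field_a", "TEXT"], ["field_b\n", None]]
print(table_to_csv(sample))
print(suspicious_rows(sample))
```

If most tables in a specification come back with no flagged rows, the tool may be worth pursuing. If many rows are flagged, the time spent validating could outweigh the benefit over the current manual approach.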

Other information

Ideas doc attached:
PDFtocsv-ideas.docx

RayStick · 2025-02-06T10:58:29Z

Some context for this task

CPRD's data specifications (example) contain all the information we want for this pipeline, but they are in PDF. If it is easy to extract the metadata tables into a machine-readable format, that's great, but don't do it if it turns out to be a big, messy task. We can consider requesting the specifications in a different format instead.
HDRUK allows you to download a structural metadata CSV for CPRD Aurum (https://healthdatagateway.org/en/dataset/692). However, the file only contains 'column name', not 'field name' (field name is the one used in the data files), and we do not know whether it is kept up to date.

Rainiefantasy added the "question" (Further information is requested) label Oct 21, 2024

github-actions bot assigned Rainiefantasy Oct 21, 2024

Rainiefantasy added this to the Improvements to existing resources milestone Oct 21, 2024
