PDF table extraction tools #18

Open
3 tasks
Rainiefantasy opened this issue Oct 21, 2024 · 1 comment
Assignees
Labels
question Further information is requested

Comments

Rainiefantasy (Contributor) commented Oct 21, 2024

Question/topic for discussion:
Can the manual table creation step (PDF -> CSV) be simplified? That would be valuable because it would make the code more generalisable. For example, it may help with creating potential workflows for CPRD Gold, not just Aurum (#12).

Tasks:

  • Look into PDF table extraction tools that could assist this process
  • Assess how flexible and accurate these tools are. If they are error-prone and their output takes a long time to validate, it may be better to leave the process as is
  • If it works, consider testing on Gold data
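
One way to approach the accuracy question in the second task is to score a tool's output against a table that has already been created by hand. The sketch below is only an illustration using the Python standard library: `compare_tables` and `normalise` are hypothetical helper names, and the example rows are made-up values, not taken from the CPRD specification. It assumes the extraction tool returns each table as a list of rows of cell strings (pdfplumber's `page.extract_tables()` returns data in roughly this shape, with `None` for empty cells).

```python
import csv
import io


def normalise(cell):
    """Collapse whitespace so that line-wrapping differences do not count as errors."""
    return " ".join((cell or "").split())


def compare_tables(reference_rows, extracted_rows):
    """Return (agreement, mismatches) for two tables given as lists of rows.

    agreement is the fraction of cell positions that match after normalisation;
    mismatches is a list of (row, column, reference_value, extracted_value).
    Missing rows or cells are compared as empty strings.
    """
    n_rows = max(len(reference_rows), len(extracted_rows))
    total = 0
    mismatches = []
    for r in range(n_rows):
        ref = reference_rows[r] if r < len(reference_rows) else []
        ext = extracted_rows[r] if r < len(extracted_rows) else []
        for c in range(max(len(ref), len(ext))):
            ref_cell = normalise(ref[c] if c < len(ref) else "")
            ext_cell = normalise(ext[c] if c < len(ext) else "")
            total += 1
            if ref_cell != ext_cell:
                mismatches.append((r, c, ref_cell, ext_cell))
    agreement = 1.0 if total == 0 else (total - len(mismatches)) / total
    return agreement, mismatches


# Hypothetical example data: a manually created CSV and a tool's raw output.
manual_csv = "field name,type\npatid,integer\npracid,integer\n"
reference = list(csv.reader(io.StringIO(manual_csv)))
extracted = [["field name", "type"], ["patid", "integer"], ["pracid", "integ er"]]

agreement, mismatches = compare_tables(reference, extracted)
print(f"{agreement:.2f}", mismatches)
```

If the agreement score stays low across several specification tables, or the mismatch list takes longer to review than creating the CSV by hand, that would support leaving the process as is.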

Other information

Ideas doc attached:
PDFtocsv-ideas.docx

@Rainiefantasy Rainiefantasy added the question Further information is requested label Oct 21, 2024
RayStick (Member) commented Feb 6, 2025

Some context for this task

  1. CPRD's data specifications (example) contain all the information we want for this pipeline, but they are in PDF. If it is easy to extract the metadata tables into a machine-readable format, that's great, but don't do it if it turns into a big, messy task. We can consider requesting them in a different format instead.
  2. HDRUK lets you download a structural metadata CSV for CPRD Aurum (https://healthdatagateway.org/en/dataset/692). However, the file only contains 'column name', not 'field name' (field name is the one that appears in the data files), and we do not know whether it is kept up to date.
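
To illustrate the mismatch described in point 2, here is a minimal sketch of how the names in a downloaded metadata CSV could be checked against the header row of a data file. Everything in it is an assumption for illustration: `compare_names` is a hypothetical helper, the CSV layout and the `column name` header are guesses at what the download looks like, the data file is assumed to be tab-separated, and the example values are made up.

```python
import csv
import io


def compare_names(metadata_names, data_header):
    """Compare names listed in a metadata file with a data file's header row.

    Matching is case-insensitive and ignores surrounding whitespace.
    Returns (only_in_metadata, only_in_data) as sorted lists.
    """
    meta = {name.strip().lower() for name in metadata_names}
    data = {name.strip().lower() for name in data_header}
    return sorted(meta - data), sorted(data - meta)


# Hypothetical inputs: a metadata CSV with a "column name" column, and the
# tab-separated header line of a data file. Real files may be laid out differently.
metadata_csv = "table,column name\nPatient,Patient identifier\nPatient,Year of birth\n"
data_header_line = "patid\tyob\n"

metadata_names = [row["column name"] for row in csv.DictReader(io.StringIO(metadata_csv))]
data_header = next(csv.reader(io.StringIO(data_header_line), delimiter="\t"))

only_in_metadata, only_in_data = compare_names(metadata_names, data_header)
print(only_in_metadata)
print(only_in_data)
```

If every name ends up in one of the two lists, as in this made-up example, the metadata file cannot be joined to the data files without a separate mapping from column name to field name.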

2 participants