Code for scraping the corpus "Studies on Water" issued by the OECD.

disaster-capitalism/scrape-corpus

Scrape the Corpus "Studies on Water"

This repository contains the code for extracting the text of the OECD corpus "Studies on Water" from PDF files, as part of the "Markets for Resilience or Disaster Capitalism" project. The extraction is done in the notebook scrape_studies_on_water_pdf.ipynb. To run the notebook, the PDF documents of the corpus (which can be purchased from the OECD website) must be stored in a pdfs/ folder. The extracted text is stored in JSON format for further processing. The notebook also requires the PyMuPDF Python package (version 1.19.6), which can be installed using pip:

pip install pymupdf==1.19.6

PyMuPDF requires Python 3.7 or later.
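
For orientation, the workflow described above (read every PDF in pdfs/, extract its text with PyMuPDF, write the result as JSON) could look roughly like the sketch below. This is not the notebook's code. The function names, the per-page list layout, and the output file name corpus.json are assumptions made for illustration.

```python
import json
from pathlib import Path


def extract_pdf_text(pdf_path):
    """Return the plain text of each page of one PDF as a list of strings."""
    # PyMuPDF is imported under the name "fitz"; it is imported here so the
    # JSON helper below can be used even where PyMuPDF is not installed.
    import fitz

    with fitz.open(str(pdf_path)) as doc:
        return [page.get_text() for page in doc]


def scrape_folder(pdf_dir="pdfs"):
    """Map each PDF file name in pdf_dir to its list of page texts."""
    return {
        path.name: extract_pdf_text(path)
        for path in sorted(Path(pdf_dir).glob("*.pdf"))
    }


def save_corpus(corpus, out_path="corpus.json"):
    """Write the corpus dictionary to a UTF-8 encoded JSON file."""
    # out_path is a hypothetical default, not a name taken from the notebook.
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(corpus, fh, ensure_ascii=False, indent=2)
```

Under these assumptions, calling save_corpus(scrape_folder("pdfs")) would produce one JSON file that maps each PDF file name to a list of page texts. Keeping the text per page is one possible design choice; joining the pages into a single string per document would work just as well for downstream processing.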

References

OECD (2022). OECD Studies on Water. https://doi.org/10.1787/22245081
