soilwise-he/harvesters

SWR Harvesters

A component to fetch metadata from remote sources as documented at https://soilwise-he.github.io/SoilWise-documentation/technical_components/ingestion/.

Harvesting tasks are best triggered from a task runner, such as a CI-CD pipeline. Configuration scripts for running the various harvesting tasks in a GitLab CI-CD environment are available in CI. Tasks are configured using environment variables. The results of the harvester are ingested into PostgreSQL storage, where follow-up processes pick them up.
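As a minimal sketch of configuration through environment variables, a task could collect its database settings as shown below. Only `POSTGRES_HOST` appears in this README; the other variable names, the default values and the helper `load_db_config` are assumptions for illustration and may not match what the harvesters actually read.

```python
import os


def load_db_config(env=None):
    """Collect PostgreSQL connection settings from environment variables.

    POSTGRES_HOST is taken from this README; every other variable name
    and all default values are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("POSTGRES_HOST", "localhost"),
        "port": int(env.get("POSTGRES_PORT", "5432")),   # assumed name
        "dbname": env.get("POSTGRES_DB", "postgres"),     # assumed name
        "user": env.get("POSTGRES_USER", "postgres"),     # assumed name
        "password": env.get("POSTGRES_PASSWORD", ""),     # assumed name
    }


if __name__ == "__main__":
    config = load_db_config()
    # Avoid printing the password when inspecting the configuration.
    print({key: value for key, value in config.items() if key != "password"})
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in a test without touching the real environment.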

flowchart LR
    c[CI-CD] -->|task| q[/Queue\]
    r[Runner] --> q
    r -->|deploys| hc[Harvest container]
    hc -->|harvests| db[(temporary storage)]
    hc -->|data cleaning| db[(storage)]
    db -->|triplify| TS[(Triple store)]
    db -->|indexing| CT[Catalogue] 

This component is closely related to the triple store component and the catalogue component. Harvested records are stored in the triple store as well as in the catalogue storage.

The following harvesting tasks are available.

Fetch records

  • CSW (for example Bonares, EJP Soil, islandr, inspire)
  • ESDAC: a dedicated API
  • Cordis/OpenAire: a combination of SPARQL and APIs
  • Prepsoil: a dedicated API
  • Newsfeeds: imports newsfeeds from soil mission websites

Process records

  • iso-triplify: exports iso19139 records to GeoDCAT-AP for inclusion in the SWR triplestore
  • record-to-pycsw: exports records to the catalogue (as iso19139 or Dublin Core)
  • translate: triggers a translation of non-English records

Docker

Run a script as a Docker container

docker build -t soilwise/harvesters .
docker run -e POSTGRES_HOST=localhost soilwise/harvesters python csw/metadata.py
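When a task runner has to start several harvesting tasks, the `docker run` command above can be assembled programmatically. The sketch below is one possible approach, not part of this repository: the helper `docker_run_args` is hypothetical, and it uses only the `-e` flag and the image and script names already shown above.

```python
import shlex


def docker_run_args(image, script, env):
    """Build the argument list for `docker run`, passing each setting with -e.

    Hypothetical helper for illustration; it only assembles the command
    and does not start a container.
    """
    args = ["docker", "run"]
    # Sort the variables so the generated command is reproducible.
    for name, value in sorted(env.items()):
        args += ["-e", f"{name}={value}"]
    # The script is run with the Python interpreter inside the container.
    args += [image, "python", script]
    return args


cmd = docker_run_args(
    "soilwise/harvesters",
    "csw/metadata.py",
    {"POSTGRES_HOST": "localhost"},
)
print(shlex.join(cmd))
# prints: docker run -e POSTGRES_HOST=localhost soilwise/harvesters python csw/metadata.py
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a host where Docker is installed; keeping the arguments as a list avoids shell quoting problems.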