A component to fetch metadata from remote sources as documented at https://soilwise-he.github.io/SoilWise-documentation/technical_components/ingestion/.
Harvesting tasks are best triggered from a task runner, such as a CI-CD pipeline. Configuration scripts for running various harvesting tasks in a GitLab CI-CD environment are available in CI. Tasks are configured using environment variables. The results of the harvester are ingested into PostgreSQL storage, where follow-up processes pick them up.
flowchart LR
c[CI-CD] -->|task| q[/Queue\]
r[Runner] --> q
r -->|deploys| hc[Harvest container]
hc -->|harvests| db[(temporary storage)]
hc -->|data cleaning| db[(storage)]
db -->|triplify| TS[(Triple store)]
db -->|indexing| CT[Catalogue]
This component is closely related to the triple store and catalogue components. Harvested records are stored in the triple store as well as in the catalogue storage.
The following harvesting tasks are available:
- CSW (for example Bonares, EJP Soil, islandr, INSPIRE)
- ESDAC: a dedicated API
- Cordis/OpenAire: a combination of SPARQL and APIs
- Prepsoil: a dedicated API
- Newsfeeds: imports newsfeeds from soil mission websites
- iso-triplify: exports iso19139 records to GeoDCAT-AP for inclusion in the SWR triple store
- record-to-pycsw: exports records to the catalogue (as iso19139 or Dublin Core)
- translate: triggers a translation of non-English records
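As an illustration of what a CSW harvesting task involves, the sketch below builds paged `GetRecords` request URLs following the CSW 2.0.2 key-value-pair convention. It is not taken from this repository: the function names, the `https://example.org/csw` endpoint, and the page size are assumptions, and some servers may require additional parameters (for example a namespace declaration).

```python
from urllib.parse import urlencode


def csw_getrecords_url(endpoint, start_position=1, max_records=50):
    """Build one CSW 2.0.2 GetRecords URL (KVP encoding).

    Hypothetical helper for illustration; the harvester in this
    repository may build its requests differently.
    """
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "full",
        "resultType": "results",
        # Ask the server to return ISO 19139 (gmd) records.
        "outputSchema": "http://www.isotc211.org/2005/gmd",
        "startPosition": start_position,  # CSW positions are 1-based
        "maxRecords": max_records,
    }
    return f"{endpoint}?{urlencode(params)}"


def page_urls(endpoint, total_records, page_size=50):
    """Return the URLs needed to page through total_records records."""
    return [
        csw_getrecords_url(endpoint, start, page_size)
        for start in range(1, total_records + 1, page_size)
    ]


if __name__ == "__main__":
    # Placeholder endpoint, not a real catalogue.
    for url in page_urls("https://example.org/csw", total_records=120):
        print(url)
```

In practice the total number of matching records is read from the first response, after which the remaining pages can be requested.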
Run a script as a Docker container:
docker build -t soilwise/harvesters .
docker run -e POSTGRES_HOST=localhost soilwise/harvesters python csw/metadata.py
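Since tasks are configured through environment variables, a harvesting script could collect its database settings along the following lines. This is a minimal sketch: only `POSTGRES_HOST` appears in the command above, so the other variable names (`POSTGRES_PORT`, `POSTGRES_DB`, `POSTGRES_USER`, `POSTGRES_PASSWORD`), the defaults, and the `PostgresConfig` class are assumptions rather than the repository's actual configuration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PostgresConfig:
    host: str
    port: int
    database: str
    user: str
    password: str


def load_postgres_config(env=None):
    """Build connection settings from environment variables.

    Only POSTGRES_HOST is documented; every other variable name and
    all default values here are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    return PostgresConfig(
        host=env.get("POSTGRES_HOST", "localhost"),
        port=int(env.get("POSTGRES_PORT", "5432")),
        database=env.get("POSTGRES_DB", "postgres"),
        user=env.get("POSTGRES_USER", "postgres"),
        password=env.get("POSTGRES_PASSWORD", ""),
    )


if __name__ == "__main__":
    config = load_postgres_config()
    # Print the target without exposing the password.
    print(f"Would connect to {config.host}:{config.port}/{config.database}")
```

Additional settings can be passed to the container with further `-e NAME=value` flags on `docker run`.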