OpenCitations Meta contains bibliographic metadata associated with the documents involved in the citations stored in the OpenCitations infrastructure. The OpenCitations Meta Software performs several key functions:
- Data curation of provided CSV files
- Generation of RDF files compliant with the OpenCitations Data Model
- Provenance tracking and management
- Data validation and fixing utilities
An example of a raw CSV input file can be found in example.csv.
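For reference, a minimal sketch of such a file is shown below. It assumes the column layout used by Meta's input CSVs (id, title, author, pub_date, venue, volume, issue, page, type, publisher, editor); all values are placeholders, not real records:

```csv
"id","title","author","pub_date","venue","volume","issue","page","type","publisher","editor"
"doi:10.1000/example","An Example Article","Doe, Jane [orcid:0000-0000-0000-0000]","2020","An Example Journal [issn:0000-0000]","1","2","1-10","journal article","Example Press [crossref:0000]",""
```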
The Meta process is launched through the meta_process.py file via the following command:
python -m oc_meta.run.meta_process -c <PATH>
Where:
- -c --config : path to the configuration file.
The configuration file is a YAML file with the following keys (an example can be found in config/meta_config.yaml).
Setting | Mandatory | Description |
---|---|---|
triplestore_url | ✓ | Endpoint URL to load the output RDF |
input_csv_dir | ✓ | Directory where raw CSV files are stored |
base_output_dir | ✓ | The path to the base directory to save all output files |
resp_agent | ✓ | A URI string representing the provenance agent which is considered responsible for the RDF graph manipulation |
base_iri | ☓ | The base URI of entities on Meta. This setting can be safely left as is |
context_path | ☓ | URL where the namespaces and prefixes used in the OpenCitations Data Model are defined |
dir_split_number | ☓ | Number of files per folder. Must be multiple of items_per_file |
items_per_file | ☓ | Number of items per file |
supplier_prefix | ☓ | A prefix for the sequential number in entities' URIs |
rdf_output_in_chunks | ☓ | If True, save all the graphset and provset in one file. If False, use the OpenCitations folder hierarchy |
zip_output_rdf | ☓ | If True, output will be zipped |
source | ☓ | Data source URL |
use_doi_api_service | ☓ | If True, use the DOI API service to check if DOIs are valid |
workers_number | ☓ | Number of cores to use for processing |
blazegraph_full_text_search | ☓ | Enable Blazegraph text index for faster queries |
fuseki_full_text_search | ☓ | Enable Fuseki text index for faster queries |
virtuoso_full_text_search | ☓ | Enable Virtuoso text index for faster queries |
graphdb_connector_name | ☓ | Name of the Lucene connector for GraphDB text search |
cache_endpoint | ☓ | Provenance triplestore URL for caching queries |
cache_update_endpoint | ☓ | Write endpoint URL for cache triplestore |
redis_host | ☓ | Redis host address (default: localhost) |
redis_port | ☓ | Redis port number (default: 6379) |
redis_db | ☓ | Redis database number (default: 0) |
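As an illustration, a minimal configuration could look like the sketch below. The endpoint URL, directories, and agent URI are placeholders, and only a subset of the optional keys is shown; refer to config/meta_config.yaml for the full template:

```yaml
triplestore_url: "http://localhost:9999/blazegraph/sparql"  # endpoint to load the output RDF into
input_csv_dir: "./input_csvs"                               # directory with the raw CSV files
base_output_dir: "./meta_output"                            # base directory for all output files
resp_agent: "https://orcid.org/0000-0000-0000-0000"         # provenance agent (placeholder URI)
base_iri: "https://w3id.org/oc/meta/"                       # can be safely left as is
dir_split_number: 10000                                     # must be a multiple of items_per_file
items_per_file: 1000
rdf_output_in_chunks: False
zip_output_rdf: True
workers_number: 1
redis_host: "localhost"
redis_port: 6379
redis_db: 0
```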
orcid_process.py generates an index between DOIs and authors' ORCIDs using the ORCID Summaries Dump (e.g. ORCID_2019_summaries). The output is a folder containing CSV files with two columns, 'id' and 'value', where 'id' is a DOI or None, and 'value' is an ORCID. This process can be run via the following command:
python -m oc_meta.run.orcid_process -s <PATH> -out <PATH> -t <INTEGER> -lm -v
Where:
- -s --summaries: ORCID summaries dump path; subfolders will be considered too.
- -out --output: a directory where the output CSV files will be stored, that is, the ORCID-DOI index.
- -t --threshold: threshold after which to update the output; not mandatory. A new file will be generated each time the threshold is reached.
- -lm --low-memory: specify this argument if the available RAM is insufficient to accomplish the task. Warning: the processing time will increase.
- -v --verbose: show a loading bar, elapsed time and estimated time, not mandatory.
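For instance, assuming the dump has been extracted to ./ORCID_summaries (paths and threshold are illustrative):

python -m oc_meta.run.orcid_process -s ./ORCID_summaries -out ./orcid_doi_index -t 1000 -v

Each output CSV then pairs DOIs (or None) with ORCIDs, e.g. a row such as 10.1000/example,0000-0000-0000-0000 (placeholder values).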
crossref_publishers_extractor.py generates an index between Crossref members' ids, names, and DOI prefixes. The output is a CSV file with three columns, 'id', 'name', and 'prefix'.
This process can be run via the following command:
python -m oc_meta.run.crossref_publishers_extractor -o <PATH>
Where:
- -o --output: The output CSV file where to store relevant information.
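A sketch of the resulting file, with placeholder values rather than real Crossref members:

```csv
id,name,prefix
1234,Example Publisher,10.1000
```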
The csv_generator.py plugin generates CSVs from the Meta triplestore. It can be run in the following way:
python -m oc_meta.run.csv_generator -c <PATH>
Where:
- -c --config : path to the configuration file.
The configuration file is a YAML file with the following keys (an example can be found in config/csv_generator_config.yaml).
Setting | Mandatory | Description |
---|---|---|
triplestore_url | ✓ | URL of the endpoint where the data are located |
output_csv_dir | ✓ | Directory where the output CSV files will be stored |
info_dir | ✓ | The folder where the counters of the various types of entities are stored. |
base_iri | ☓ | The base IRI of entities on the triplestore. This setting can be safely left as is |
supplier_prefix | ☓ | A prefix for the sequential number in entities' URIs. This setting can be safely left as is |
dir_split_number | ☓ | Number of files per folder. Must be a multiple of items_per_file's value. This setting can be safely left as is |
items_per_file | ☓ | Number of items per file. This setting can be safely left as is |
verbose | ☓ | Show a loading bar, elapsed time and estimated time. This setting can be safely left as is |
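As an illustration, a minimal configuration could look like this sketch (endpoint and directories are placeholders; see config/csv_generator_config.yaml for the full template):

```yaml
triplestore_url: "http://localhost:9999/blazegraph/sparql"
output_csv_dir: "./csv_output"
info_dir: "./info_dir"
base_iri: "https://w3id.org/oc/meta/"   # can be safely left as is
dir_split_number: 10000                 # must be a multiple of items_per_file
items_per_file: 1000
verbose: True
```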
Before running Meta in multiprocess mode, it is necessary to prepare the input files. In particular, the CSV files must be divided by publisher, while venues and authors having an identifier must be loaded onto the triplestore, so as not to generate duplicates during multiprocessing. These operations can be done by simply running the following script:
python -m oc_meta.run.prepare_multiprocess -c <PATH>
Where:
- -c --config : Path to the same configuration file you want to use for Meta.
Afterwards, launch Meta in multi-process by specifying the same configuration file. All the required modifications are done automatically.
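Putting the two steps together, a run could look like this (the configuration path is a placeholder):

python -m oc_meta.run.prepare_multiprocess -c ./config/meta_config.yaml
python -m oc_meta.run.meta_process -c ./config/meta_config.yaml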
The provenance fixing utility can be launched as follows:
python -m oc_meta.run.fixer.prov.fix <input_dir> [--processes <num>] [--log-dir <path>]
Parameters:
- input_dir: Directory containing provenance files
- --processes: Number of parallel processes (default: CPU count)
- --log-dir: Directory for log files (default: logs)
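For example, to fix the provenance files under ./rdf/prov with four parallel processes (paths are illustrative):

python -m oc_meta.run.fixer.prov.fix ./rdf/prov --processes 4 --log-dir ./logs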
The info_dir check utility can be launched as follows:
python -m oc_meta.run.check.info_dir <directory> [--redis-host <host>] [--redis-port <port>] [--redis-db <db>]
Parameters:
- directory: Directory to explore
- --redis-host: Redis host (default: localhost)
- --redis-port: Redis port (default: 6379)
- --redis-db: Redis database number (default: 6)
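For example, making the documented defaults explicit (the directory is a placeholder):

python -m oc_meta.run.check.info_dir ./rdf --redis-host localhost --redis-port 6379 --redis-db 6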
The check_results utility verifies the produced output against the input CSVs; it can be launched as follows:
python -m oc_meta.run.meta.check_results <directory> --root <path> --endpoint <url> [--show-missing]
Parameters:
- directory: Directory containing input CSV files
- --root: Root directory containing JSON-LD ZIP files
- --endpoint: SPARQL endpoint URL
- --show-missing: Show details of identifiers without associated OMIDs
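For example, checking the CSVs in ./input_csvs against a local endpoint (paths and URL are placeholders):

python -m oc_meta.run.meta.check_results ./input_csvs --root ./meta_output/rdf --endpoint http://localhost:9999/blazegraph/sparql --show-missing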
The gen_info_dir utility can be launched as follows:
python -m oc_meta.run.gen_info_dir <directory> [--redis-host <host>] [--redis-port <port>] [--redis-db <db>]
Parameters:
- directory: Directory to explore
- --redis-host: Redis host (default: localhost)
- --redis-port: Redis port (default: 6379)
- --redis-db: Redis database number (default: 6)
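For example (the directory is a placeholder; Redis options follow the documented defaults):

python -m oc_meta.run.gen_info_dir ./rdf --redis-db 6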