Data Integration
This page describes the specific background tasks necessary to support data integration in BONSAI.
Contents
- Principles of data input, quality management and editing
- Classification and correspondence tables
- Version control
The database will be open for additions and annual updates of national Supply Use Tables, trade data, as well as individual flow/activity data.
The requirements on data will be kept as low as possible to stimulate data supply. Quality of the data will instead be scored and reviewed subsequently; see Data review. Conflicting data sources will be allowed to co-exist, allowing algorithms for quality scoring and consolidation to provide thresholds for temporary inclusion/exclusion and to derive preferred datapoints.
Version control using timestamps will allow editing both at the level of individual datapoints and as batch edits of larger selections of datapoints.
- Priority: High
- Estimated person-hours: 100
- Volunteer(s)/Candidate(s): Stefano Merciai, Romain Sacchi
Functional specifications: The integration of object/activity flow data at many different levels of (dis-)aggregation, within the framework of national supply-use tables, each of which may have its own native classification, requires the use and maintenance of correspondence tables that can translate between the different names and classifications of flow-objects, activities and properties.
To allow continuous development, further disaggregation of the initial classification shall be possible whenever needed to accommodate data that represent a specific part of a lowest-level class. In this way, the classification will develop over time towards increasing detail. A “not classified elsewhere” class should be available whenever relevant (i.e. when the existing classes are not logically exhaustive). This will also allow establishing correspondence between classifications with overlapping definitions. For example, a class X in one classification covering items A and B and a class Y in another classification covering items B and C can be matched by defining separate classes for A, B and C, so that class X can be expressed as an aggregated class of A and B and class Y as an aggregated class of B and C.
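As an illustration of this correspondence mechanism, the following is a minimal sketch (purely hypothetical data structures; the class names A, B, C, X and Y are those of the example above) in which aggregated classes are stored as sets of lowest-level classes, so that the overlap between X and Y becomes explicit.

```python
# Minimal sketch: overlapping classes X and Y from two classifications,
# expressed as aggregates of shared lowest-level classes A, B and C.
# All names are the hypothetical ones from the example above.

lowest_level = {"A", "B", "C"}        # disaggregated base classes

aggregated = {
    "X": {"A", "B"},                  # class X in classification 1
    "Y": {"B", "C"},                  # class Y in classification 2
}

def overlap(class_1, class_2):
    """Lowest-level classes shared by two aggregated classes."""
    return aggregated[class_1] & aggregated[class_2]

print(overlap("X", "Y"))              # {'B'}
```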
A well-functioning interface for the classification shall make it possible to identify where in the existing database any newly supplied datapoint belongs, in terms of the three core data identifiers (flow-objects, activities and properties). This can occur either as an automated matching of each of these identifiers with an existing lowest-level or aggregated class or, when this matching fails, as a manual addition of the missing identifier as a new lowest-level or aggregated class, with an indication of the relationship between the new class and the pre-existing classes.
As far as possible, freedom shall be preserved in meanings and definitions. Example: "Waste" can be defined (and tagged) in different ways by different users. Best practice will be to avoid using such terms in classifications and instead apply a flexible definition that relates to properties, which is then used as an identifier, e.g. using price as the identifier for waste: a flow-object is a waste when its price is negative and a by-product when its price is positive.
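A minimal sketch of such a property-based definition (purely illustrative; the function and field names are hypothetical and only encode the price rule described above):

```python
# Minimal sketch: deriving the role of a flow-object from its price property
# instead of a fixed "waste" class, as described above. Purely illustrative.

def flow_role(price_per_unit):
    if price_per_unit < 0:
        return "waste"          # the supplier pays to have the flow taken away
    if price_per_unit > 0:
        return "by-product"     # the flow is sold at a positive price
    return "zero-priced flow"   # boundary case, left for review

print(flow_role(-12.5))   # waste
print(flow_role(40.0))    # by-product
```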
Product classifications: The GS1 Global Product Classification (GPC) is now available as an open standard. This is the most detailed hierarchical product classification available and would be a natural starting point for BONSAI. Many databases also use the UN Central Product Classification (CPC), and classes from classifications in current use by national statistical agencies should also be added. A detailed list of all products listed on Wikipedia, although without any hierarchy, has been developed by productontology.org; this could possibly be used as a starting point for a bottom-up approach to classification (if needed), using text mining tools for discovering class relations (for example, advanced text mining of the Wikipedia page on “hammer” may allow classifying hammers under either “hand tool” or “power tool”). Product classifications should not be used to imply the extent of complementarity and substitutability between products (example: a sofa in the couch market). Such cross-elasticity information should be stored as product-by-product relations, separately from product names.
Classification of flow-objects beyond products (i.e. for flows outside the technosphere): A possible starting point for ecosystem services could be the CICES classification. For substance flows, the ecoinvent classification may be a starting point. The SDMX Registry contains classifications for some economic flow-objects. For social flow-objects and relations to the UN Sustainable Development Goals, some work is ongoing to provide classifications or ontologies [links to be added].
Activity classifications: For human activities, it would be natural to start from ISIC Rev. 4 and add classes reflecting production and consumption activities for the products in the product classification, as well as classes from classifications in current use by national statistical agencies. Market activity datasets should also be added (in the ecospold format, the name of such datasets will begin with “market for …”, referring to the commodity name), together with import, export and re-export activities (in the ecospold format, the name of such datasets will be identical to the commodity name but with the additions “, import[ from X]”, “, export[ to X]” and “, re-export[ to X]”, where the content of the square brackets is optional and X signifies the geography of origin or destination, respectively). For activities beyond the technosphere, the environmental mechanisms modelled by current impact assessment methods should be covered, including those in currently applied fate models, exposure models and effect models.
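The naming conventions described above can be sketched as simple string construction; this is only an illustration of the stated pattern, not an implementation of the ecospold format itself, and the commodity names and geographies are hypothetical.

```python
# Minimal sketch of the dataset naming conventions described above.
# Commodity names and geographies are hypothetical.

def market_name(commodity):
    return "market for " + commodity

def trade_name(commodity, kind, geography=None):
    """kind is 'import', 'export' or 're-export'; geography is optional."""
    preposition = "from" if kind == "import" else "to"
    suffix = " {} {}".format(preposition, geography) if geography else ""
    return "{}, {}{}".format(commodity, kind, suffix)

print(market_name("cement"))                  # market for cement
print(trade_name("cement", "import", "DE"))   # cement, import from DE
print(trade_name("cement", "re-export"))      # cement, re-export
```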
Obviously, when data are supplied from many different sources, the same name may be used with different semantic meanings. Thus, even a direct name match between the identifiers of a new datapoint and an existing class does not guarantee that the datapoint is indeed correctly identified. The automatic matches, both exact and fuzzy, should therefore be part of the issues manually reviewed in the subsequent review procedures.
Technical specifications:
Managing classifications: For each of the three core data identifiers (flow-objects, activities and properties) one classification shall be maintained as the time-stamped BONSAI base classification.
Each base classification develops continuously when a class instance is further disaggregated, and the original class instance then becomes an aggregated class. In addition to the base classification, synonyms for class instances and aggregated classes shall be stored and be searchable, allowing the identifiers of any newly supplied datapoint to be matched to either a base class, an aggregated class, or a synonym for either of these. Language variants may be treated as synonyms with a language tag.
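A minimal sketch of such a base classification with aggregated classes and language-tagged synonyms (the structure, field names and class names are hypothetical):

```python
# Minimal sketch: one time-stamped base classification with aggregated classes
# and language-tagged synonyms. All names and fields are hypothetical.
from datetime import datetime, timezone

classification = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "base_classes": {"portland cement", "blended cement"},
    "aggregated_classes": {"cement": {"portland cement", "blended cement"}},
    # synonym -> (canonical class, language tag)
    "synonyms": {"Portlandzement": ("portland cement", "de"),
                 "OPC": ("portland cement", "en")},
}

def resolve(term):
    """Match a term to a base class, an aggregated class, or a synonym."""
    if term in classification["base_classes"] or term in classification["aggregated_classes"]:
        return term
    if term in classification["synonyms"]:
        return classification["synonyms"][term][0]
    return None                       # no match -> hand over to the matching feature

print(resolve("Portlandzement"))      # portland cement
```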
Matching feature: For each new datapoint provided by a user, a matching feature shall match each of the datapoint's core identifiers - either as a direct match or as a fuzzy match - to either a base class, an aggregated class, or a synonym for either of these.
Whenever a new user-suggested identifier instance does not have a direct match to an existing class, but does have one or more fuzzy matches, the user shall be asked to select a fuzzy match and indicate whether the new instance shall be seen as a replacement for the existing class, as a synonym, as a sub-class of an existing class, or as an aggregate of existing classes. For the last two options, the manual matching of the new instance to existing classes can be facilitated e.g. as a drag-and-drop or add-to/remove-from list feature. When the manual matching is not immediately performed by the user, the datapoint may be stored as "to-be-classified" data. If the user does not find the suggested fuzzy matches adequate, the user shall be allowed to override the fuzzy matches as a starting point and instead use free-text search and hierarchical search starting from an aggregated class until a satisfactory match is found.
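A minimal sketch of the direct-then-fuzzy matching flow described above (difflib is used here only as a stand-in for whatever fuzzy matcher is eventually chosen; the class names are hypothetical):

```python
# Minimal sketch: try a direct match against known classes and synonyms first,
# then fall back to fuzzy matching; unmatched identifiers are left
# "to-be-classified". difflib is only a placeholder fuzzy matcher.
from difflib import get_close_matches

known_classes = ["portland cement", "blended cement", "cement", "lime"]

def match_identifier(term):
    if term in known_classes:
        return ("direct", term)
    candidates = get_close_matches(term, known_classes, n=3, cutoff=0.6)
    if candidates:
        return ("fuzzy", candidates)   # ask the user to pick and classify
    return ("to-be-classified", [])

print(match_identifier("portland cement"))   # ('direct', 'portland cement')
print(match_identifier("portland cment"))    # ('fuzzy', ['portland cement', ...])
```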
Correspondence table feature: Users should be allowed to add their own time-stamped classifications to be stored with indication of correspondence to base classes or aggregated classes. This is relevant for classifications that are shared by more than one user, e.g. software- or database-specific or official classifications. Note that some existing classification systems are hierarchically organised, so that names and compositions of aggregate classes are already known. In reality, each level in such a hierarchy is a classification in its own right, and can be stored as such (e.g. ISIC level 1 - ISIC level 2 - etc.). Assistance for automatic creation of correspondence to the existing classification should be developed. Check if we can integrate features from existing reconciliation tools, e.g. OpenRefine (particularly the clustering tools, see especially Clustering in depth). This feature may also be applied to assist users in situations when the user does not find the suggested fuzzy matches from the matching feature to be adequate.
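As an illustration of the kind of clustering assistance referred to above, here is a minimal sketch of a simplified fingerprint key-collision clustering (in the spirit of OpenRefine's clustering methods, but not using OpenRefine itself; the class names are hypothetical):

```python
# Minimal sketch: group likely-equivalent class names by a normalised
# "fingerprint" key (lowercase, punctuation stripped, tokens sorted and
# de-duplicated), so that candidate correspondences can be reviewed together.
import re
from collections import defaultdict

def fingerprint(name):
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster(names):
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]

print(cluster(["Cement, Portland", "portland cement", "Portland cement.", "lime"]))
# [['Cement, Portland', 'portland cement', 'Portland cement.']]
```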
Property classifications: Since there is no property that is common to all flow-objects or flows, the Direct Requirements Table that is used for producing the Leontief inverse needs to be produced in hybrid units. This requires that for each flow-object, a property must be defined as the natural property to be used for the hybrid Direct Requirements Table.
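A minimal sketch of such an assignment of a natural property per flow-object (the flow-objects, properties and units shown are hypothetical examples only):

```python
# Minimal sketch: each flow-object is assigned one natural property (and unit)
# to be used in the hybrid-unit Direct Requirements Table. Entries are
# hypothetical examples only.

natural_property = {
    "cement":      ("mass", "kg"),
    "electricity": ("energy", "MJ"),
    "transport":   ("mass-distance", "tkm"),
    "banking":     ("monetary value", "EUR"),
}

def check_natural_unit(flow_object, property_name, unit):
    """Verify that a supplied amount is expressed in the flow-object's natural property."""
    expected = natural_property[flow_object]
    if (property_name, unit) != expected:
        raise ValueError("{}: expected {} in {}, got {} in {}".format(
            flow_object, expected[0], expected[1], property_name, unit))

check_natural_unit("electricity", "energy", "MJ")   # passes silently
```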
(Example of correspondence between classifications)
- Priority: Intermediate
- Estimated person-hours:
- Volunteer(s)/Candidate(s):
Functional specifications: The aim of versioning is to allow identification, citation and retrieval of arbitrary views of data, from a single data point to an entire dataset, as it existed at a certain point in time, in a precise, machine-actionable manner that is stable across different technologies and technological changes. The RDA WG on “Data Citation of Evolving Data” recommendation document contains 14 specific recommendations, the basic idea of which is that all that is needed to reach this aim is to ensure timestamping of each change to individual data points and of each query (and query output).
Technical specifications: (numbers in brackets refer to the RDA recommendations; a small sketch combining several of them follows after the list)
● Ensure that any additions, edits and deletions on individual data points and datasets are marked with a timestamp, which at the same time works as version information (R1&2)
● Ensure that data points in a dataset can be sorted unambiguously and reproducibly (R5)
● Normalise queries and compute a checksum of the normalised queries so that identical queries can be detected efficiently (R4)
● Assign a new PID to new queries and store queries and the associated metadata in order to re-execute them in the future (R3, R8&9)
● Provide checksum of query results to enable verification of the correctness of a result upon re-execution (R6)
● Mark query results with a timestamp based on the query execution time or the last update to the database prior to the query execution time. This allows retrieving the data as it existed at the time a user issued a query (R7)
● Assign a new PID to each new query result, and assign the existing identical PID to identical query results, i.e. query results with identical checksums, with the same query PID, performed on an unchanged database (R8)
● Lower the barrier for citing the data by generating citation text snippets for query results, including the query PID and query result PID (R10)
● Make the PIDs resolve to a human readable and machine actionable landing page that provides the data (via query re-execution) and metadata, including citation text snippet and the PID of the database (R11&12)
● Add to the data management policy that when data is migrated to a new representation (e.g. a new database system, a new schema or a completely different technology), the queries and associated checksums are also migrated, and the migration is verified, ensuring that queries can be re-executed correctly (R13&14).
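The following is a minimal sketch combining several of the recommendations above (time-stamped edits, query normalisation, and checksums over the normalised query and its sorted result set); PID minting and landing pages are out of scope here, and all structures and functions are hypothetical simplifications, not the RDA reference implementation.

```python
# Minimal sketch of R1-R7: time-stamped edits to individual data points,
# normalised queries, and checksums over the query and its sorted result set.
# Everything here is a hypothetical simplification.
import hashlib
from datetime import datetime, timezone

history = []                                   # append-only, time-stamped edit log (R1&2)

def record_edit(datapoint_id, value):
    history.append({"id": datapoint_id, "value": value,
                    "valid_from": datetime.now(timezone.utc).isoformat()})

def normalise_query(query):
    return " ".join(query.lower().split())     # stand-in for real query normalisation (R4)

def checksum(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def run_query(query, as_of):
    norm = normalise_query(query)
    # unambiguous, reproducible sorting of the result set (R5) keeps the checksum stable
    rows = sorted(row["id"] for row in history if row["valid_from"] <= as_of)
    return {"query_checksum": checksum(norm),        # detects identical queries (R4)
            "result_checksum": checksum(repr(rows)), # verifies re-execution (R6)
            "as_of": as_of,                          # query timestamp (R7)
            "rows": rows}

record_edit("flow-001", 42.0)
print(run_query("SELECT  value FROM flows", datetime.now(timezone.utc).isoformat()))
```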