From db10517f0386c90623e854bd9726e6d2d00b6003 Mon Sep 17 00:00:00 2001
From: Damion Dooley
Date: Wed, 13 Nov 2024 09:55:22 -0800
Subject: [PATCH] Update index.md

---
 docs/Data_Standardization/index.md | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/docs/Data_Standardization/index.md b/docs/Data_Standardization/index.md
index 300a0a0..aabd11f 100644
--- a/docs/Data_Standardization/index.md
+++ b/docs/Data_Standardization/index.md
@@ -55,15 +55,14 @@ A project is composed of many workflow-generated files and folders which need to
 
 It's important to stress that researchers shouldn't have to take on the bulk of standardization work, since it involves a number of technical skills and a general awareness of controlled vocabularies and standardized data formats that take time to acquire. Ideally, project data management/science/analyst staff are available to help with standardizing data schemas or exported data products. This is where the DCC team can help!
 
-If anticipated early in the project design cycle, data schema components can be standardized as copies of existing 3rd party standards components, as we are encouraging in the Schema Library section. Often however, whether because of project momentum or a commitment to existing information technology infrastructure, a data schema is non-standardised, and so data products require a mapping process to transform existing project data into standardised format for external use. This mapping process is performed either by specialised conversion scripts that often have to be tweaked over time, or ideally by using a more generic programmatic template to convert between project schema components and target standardised data products (e.g. specific data formats supported by [storage repositories](https://github.com/ClimateSmartAgCollab/Documentation-en/blob/main/docs/storage/index.md)). One can also expect iteration of specification work as data schemas evolve for surveillance or longitudinal research.
+If anticipated early in the project design cycle, data schema components can be standardized as copies of existing 3rd party standards components, as we are encouraging in the [Schema Library](https://climatesmartagcollab.github.io/HUB-Harmonization/) section. Often, however, whether because of project momentum or a commitment to existing information technology infrastructure, a data schema is non-standardised, so data products require a mapping process to transform existing project data into a standardised format for external use. This mapping process is performed either by specialised conversion scripts that often have to be tweaked over time, or ideally by a more generic programmatic template that converts between project schema components and target standardised data products (e.g. specific data formats supported by [storage repositories](https://github.com/ClimateSmartAgCollab/Documentation-en/blob/main/docs/storage/index.md)). One can also expect iteration of specification work as data schemas evolve for surveillance or longitudinal research.
 
 * **Standardised data components**:
- * **Data schema attribute level**: If anticipated early in the project design cycle, data attributes can be standardized (or customized) as copies of existing 3rd party standards elements, sharing attribute specification elements described above. The [NCBI BioSample](https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/) pooled specification is one good source of attributes used to describe the environmental/farm or human/animal/plant context of biosamples for genomics research.
+ * **Data schema attribute level**: Data attributes can be standardized (or customized) as copies of existing 3rd party standards elements. The [NCBI BioSample](https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/) pooled specification is one good source of attributes used to describe the environmental/farm or human/animal/plant context of biosamples for genomics research. The [Phenopacket](https://phenopacket-schema.readthedocs.io/en/latest/index.html) JSON-based standard for health science/clinical datasets, which offers a collection of "building blocks" including [Biosample attributes](https://phenopacket-schema.readthedocs.io/en/latest/biosample.html), is another source of ready-made standardized fields, covering not only field names but also attribute picklist vocabularies that reference [recommended ontologies](https://phenopacket-schema.readthedocs.io/en/latest/recommended-ontologies.html) and ISO standards. One complexity here is the presence of one-to-many or "multiplicity" relations between a biosample and its components, such as measurements, which a data schema needs to capture for automated machine readability.
+ **Ontologies**: A data schema is the practical tree that needs to be built on which ontology terms hang. In the [ontology](https://github.com/ClimateSmartAgCollab/Documentation-en/blob/main/docs/Data_Standardization/ontology.md) section there is a list of recommended ontologies and ways of implementing them as a layer upon a data schema and project metadata.
  * **Data schema record level**: Usually producing standardised data products is accomplished by copying or mapping selected parts of target specifications into a project's [**data schema**](https://github.com/ClimateSmartAgCollab/Documentation-en/blob/main/docs/Data_Documentation/schemas.md). The NCBI BioSample specification is organized into more topical "[packages](https://www.ncbi.nlm.nih.gov/biosample/docs/packages/)" for easier comprehension and copying of parts or wholes. Work has been done to convert the majority of these specifications into a machine-readable [MIxS LinkML](https://github.com/turbomam/mixs-subset-examples-first) GitHub repository format, with a browsable [documentation version](https://turbomam.github.io/mixs-subset-examples-first/).
-
-* **Data naming convention**: Regardless of whether a data schema is reusing elements from other schemas, it is important to impose data naming conventions (described above) on its home-grown components. This is done mainly to avoid issues in applying or developing software scripts for validation, transformation, and/or database interaction.
-
-* **Ontologies**: Much standardization work can be done in advance of introducing ontology id’s to a schema. In a way ontologies provide the comparable fruits of interoperability, but a data schema is the practical tree that needs to be built on which ontology terms hang. In the [ontology](https://github.com/ClimateSmartAgCollab/Documentation-en/blob/main/docs/Data_Standardization/ontology.md) section there is a list of recommended ontologies and ways of implementing them as a layer upon a data schema and project metadata.
+
+* **Data naming convention**: Project file names and folder structure should be standardized, and regardless of whether a data schema is reusing elements from other schemas, it is important to impose data naming conventions (described above) on its home-grown components. This is done mainly to avoid issues in applying or developing software scripts for validation, transformation, and/or database interaction.
 
 ## Training resources
 ### TBD
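The "programmatic template" approach to the mapping process described in the patch can be sketched in a few lines of Python: a declarative crosswalk renames project attributes to standardized target attributes, so the conversion logic lives in data rather than in per-field script code. This is a minimal sketch, not project code; the project attribute names are hypothetical and the target names only loosely follow NCBI BioSample attribute naming.

```python
import csv
import io

# Hypothetical crosswalk: project schema attribute -> standardized attribute.
# Target names loosely follow NCBI BioSample attribute naming conventions.
CROSSWALK = {
    "farm_location": "geo_loc_name",
    "sample_date": "collection_date",
    "animal_species": "host",
}

def standardize_row(row: dict) -> dict:
    """Rename attributes via the crosswalk; attributes missing from the
    crosswalk are kept as-is so unmapped project-specific fields can be
    reviewed rather than silently dropped."""
    return {CROSSWALK.get(key, key): value for key, value in row.items()}

# Stand-in for a real project data export.
project_csv = io.StringIO(
    "farm_location,sample_date,animal_species\n"
    "Canada: British Columbia,2024-06-01,Bos taurus\n"
)
standardized = [standardize_row(row) for row in csv.DictReader(project_csv)]
print(standardized)
# [{'geo_loc_name': 'Canada: British Columbia',
#   'collection_date': '2024-06-01', 'host': 'Bos taurus'}]
```

When the target specification changes, only the crosswalk (plus any value-level transforms added alongside it) needs updating, which is the maintenance advantage over hand-written conversion scripts.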
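The one-to-many "multiplicity" relation between a biosample and its measurements can likewise be shown as a nested record. The field names below loosely follow the Phenopacket Biosample building block, but this is an illustrative sketch, not a validated Phenopacket instance; identifiers and values are examples only.

```python
import json

# One biosample carrying a list of measurement sub-records: the
# "multiplicity" relation is captured structurally, so downstream software
# can iterate over measurements without parsing free text.
biosample = {
    "id": "biosample.1",
    "sampledTissue": {"id": "UBERON:0002106", "label": "spleen"},
    "measurements": [
        {
            "assay": {"id": "LOINC:26474-7",
                      "label": "Lymphocytes [#/volume] in Blood"},
            "value": {"quantity": {"value": 1.4,
                                   "unit": {"id": "UCUM:10*9/L"}}},
        },
        {
            "assay": {"id": "LOINC:26515-7",
                      "label": "Platelets [#/volume] in Blood"},
            "value": {"quantity": {"value": 240.0,
                                   "unit": {"id": "UCUM:10*9/L"}}},
        },
    ],
}
print(json.dumps(biosample, indent=2))
```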
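A data naming convention also only pays off when it is checked mechanically. The sketch below assumes a snake_case convention for home-grown attribute names (a project may legitimately choose another); the point is that a single regular expression can police every field, which becomes impossible when names are ad hoc.

```python
import re

# snake_case: lowercase words separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Example home-grown attribute names; the last two break the convention.
attributes = ["farm_location", "collection_date", "Sample Date!", "herdID"]
nonconforming = [a for a in attributes if not SNAKE_CASE.fullmatch(a)]
print("Non-conforming attribute names:", nonconforming)
# Non-conforming attribute names: ['Sample Date!', 'herdID']
```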