Update index.md
ddooley authored Nov 9, 2024
1 parent f44b7ce commit 6b8401b
Showing 1 changed file with 1 addition and 3 deletions.
4 changes: 1 addition & 3 deletions docs/Data_Standardization/index.md
@@ -29,7 +29,7 @@ Geographic information — Metadata](https://www.iso.org/standard/53798.html) fo
* **Attributes**: A table record, spreadsheet, computational object or class, ontological entity, or user interface form may have some number of required and/or optional **attributes**, aka **fields, properties, variables, or slots**. An attribute's schema specification allows at least one value datatype, but in some schemas it may allow more than one, such as a birthdate integer plus a "null value list" (a picklist of data collection statuses such as missing, not collected, etc.). Standards such as NCBI's [missing value reporting](https://www.ddbj.nig.ac.jp/biosample/submission-e.html#missing-value-reporting) cover this. Each attribute specification should also include a definition that distinguishes the attribute from others having similar names or semantics.
* **Attribute naming**: It is important to distinguish the two kinds of name an attribute can have, a computational coding one which ideally enables lookup of its semantics (definition, synonyms etc.), and a default plain name which may vary among applications/user interfaces that apply its coding name (and semantics).
* **Plain name**: A default plain language user interface **label or title** (including spreadsheet column labels) for human readability, such as "Birth Date", "Birthdate", "date of birth", "born", etc. Enabling the title or label to have language variants also paves the way for multilingual interfaces. In software it should not be used as the key for looking up attribute information.
-* **Coding name**: A computer software/script/analytic/database/serialization-level attribute "coding" name, such as "birth_date". This name is the key to machine readability, and its standard format should align with popular programming variable naming conventions to avoid errors in parsing data files and to enable code generation (e.g. alphanumeric + underscore only; no spaces, dashes, slashes, brackets, parentheses or dots etc. allowed in a name). Data schema frameworks like LinkML have been guided by [Python](https://peps.python.org/pep-0008/#naming-conventions) / [R and SQL compatible](https://bookdown.org/content/d1e53ac9-28ce-472f-bc2c-f499f18264a3/names.html) field names, and standardized table / object names, in particular:
+* **Coding name**: A computer software/script/analytic/database/serialization-level attribute "coding" name, such as "birth_date". This name is the key to machine readability, and its standard format should align with popular programming variable naming conventions to avoid errors in parsing data files and to enable code generation (e.g. alphanumeric + underscore only; no spaces, dashes, slashes, brackets, parentheses or dots etc. allowed in a name). Data schema frameworks like LinkML have been guided by [Python](https://peps.python.org/pep-0008/#naming-conventions) / [R and SQL compatible](https://bookdown.org/content/d1e53ac9-28ce-472f-bc2c-f499f18264a3/names.html) attribute names, and standardized table / object names, in particular:
* **PascalCase**: **for table, object and picklist names, use an alphanumeric string beginning with a capital letter.**
* **snake_case**: **for attribute coding names, use lowercase alphanumeric words separated by underscores.**
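
The two naming rules above can be checked mechanically. The following is a minimal sketch, assuming the rules translate into the regular expressions shown; the patterns and the helper names (`is_table_name`, `is_coding_name`, `to_coding_name`) are illustrative assumptions and are not taken from LinkML or any other framework.

```python
import re

# Assumed patterns derived from the two rules above (not LinkML's own checks).
# Table / object / picklist names: alphanumeric, starting with a capital letter.
PASCAL_CASE = re.compile(r"[A-Z][A-Za-z0-9]*")
# Attribute coding names: lowercase alphanumeric words joined by underscores.
LOWER_UNDERSCORE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def is_table_name(name: str) -> bool:
    """Return True if name fits the table/object/picklist naming rule."""
    return PASCAL_CASE.fullmatch(name) is not None


def is_coding_name(name: str) -> bool:
    """Return True if name fits the attribute coding-name rule."""
    return LOWER_UNDERSCORE.fullmatch(name) is not None


def to_coding_name(label: str) -> str:
    """Suggest a coding name for a plain label (hypothetical helper).

    Keeps only alphanumeric runs, lowercases them, and joins them with
    underscores. The result should still be checked with is_coding_name,
    since a label that starts with a digit yields an invalid name.
    """
    words = re.findall(r"[A-Za-z0-9]+", label)
    return "_".join(word.lower() for word in words)
```

For instance, `to_coding_name("Birth Date")` yields `birth_date`, which passes `is_coding_name`, while the plain label `"Birth Date"` itself does not.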

@@ -69,7 +69,6 @@ Going beyond subject areas, this metadata enables researchers to judge pertinenc
* More generally, the [OBI](http://purl.obolibrary.org/obo/OBI_0500000) ontology and [NCIT Thesaurus](http://purl.obolibrary.org/obo/NCIT_C15320) provide a sizeable list of study design terms which can be referenced from across life science research domains.
* [Protocols.io](https://www.protocols.io/) is a popular system for detailing and publishing protocol information.


### Provenance
The story of where datasets are hosted and the projects, people and agencies responsible for their creation and management. Common language covers:
* Authorship: [ORCID](https://orcid.org/) identifiers are now the standard way of referencing authors
@@ -93,7 +92,6 @@ It's important to stress that researchers shouldn't have to take on the bulk of s
Geographic information — Metadata](https://www.iso.org/standard/53798.html) and Cell Ontology [cell type](http://purl.obolibrary.org/obo/CL_0000000) expressions and references may be incorporated directly into a project's field specifications.
* **Record level**: Usually producing standardised data products such as [NCBI BioSample](https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/) records is accomplished by mapping selected fields of target specifications into a project's [**data schema**](https://github.com/ClimateSmartAgCollab/Documentation-en/blob/main/docs/Data_Documentation/schemas.md) fields.


* **data schema mapping**: Often however, whether because of project momentum or a commitment to existing information technology infrastructure, a data schema is non-standardised, and so data products require a mapping process to transform existing project data into standardised format for external use. This mapping process is performed either by specialised conversion scripts that often have to be tweaked over time, or ideally by using data schemas established (e.g. by [storage repositories](https://github.com/ClimateSmartAgCollab/Documentation-en/blob/main/docs/storage/index.md) that store specific formats of data) for the target standardised data products, combined with a more generic programmatic template to do the conversion from one schema to another. One can also expect iteration of specification work as data schemas evolve for surveillance or longitudinal research.

* **Data naming convention**: Regardless of whether a data schema reuses elements from other schemas, it is important to impose data naming conventions on its home-grown components. This is done mainly to avoid issues in applying or developing software scripts for validation, transformation, and/or database interaction. Data "names" are often ambiguous: do we mean a column display name, a 3rd party standardized field name, or a programmatic name for use in scripts or databases? Ensure that a separation of concerns is established in the schema between the title or label that a field or variable might have in some context, and its coding name.
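
The separation between labels and coding names, and the generic mapping step described above, can be sketched as follows. This is only an illustration under stated assumptions: the `Attribute` class, the `PROJECT_SCHEMA` and `TARGET_MAPPING` contents, and the target names `date_of_birth` and `collection_site` are all hypothetical and do not come from NCBI BioSample or any real specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    coding_name: str         # stable key used in scripts and databases
    titles: dict[str, str]   # language code -> display label


# Hypothetical project schema, keyed by coding name (never by label).
PROJECT_SCHEMA = {
    "birth_date": Attribute(
        "birth_date", {"en": "Birth Date", "fr": "Date de naissance"}
    ),
    "sample_site": Attribute("sample_site", {"en": "Sampling site"}),
}

# Hypothetical mapping: project coding name -> target schema coding name.
TARGET_MAPPING = {
    "birth_date": "date_of_birth",
    "sample_site": "collection_site",
}


def label_for(coding_name: str, lang: str = "en") -> str:
    """Look up a display label by coding name, falling back to English."""
    attribute = PROJECT_SCHEMA[coding_name]
    return attribute.titles.get(lang, attribute.titles["en"])


def map_record(record: dict, mapping: dict[str, str]) -> dict:
    """Rename the mapped fields of a record; unmapped fields are dropped."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}
```

Because the lookup and the mapping both key on coding names, a label can be reworded or translated without touching conversion scripts, and a new target format only needs a new mapping table rather than a new script.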