Commit

Update index.md
ddooley authored Nov 13, 2024
1 parent 0e9198e commit 64db4fc
Showing 1 changed file with 4 additions and 3 deletions.
7 changes: 4 additions & 3 deletions docs/Data_Standardization/index.md
@@ -19,8 +19,6 @@ Ultimately, a key requirement for success is a **well-coordinated technical lang

* **Values**: A form field, record field, table row field, spreadsheet cell, computational object/class attribute or property or slot, or variable can hold a **value** (aka data item or datum).

* **Character sets**: Data "serialized" into a text file will be encoded as strings of characters from a character set, which may include accented characters and so on. The popular [UTF-8](https://en.wikipedia.org/wiki/UTF-8) standard (used to encode most web pages) includes character encodings that cover many languages, and [dingbats](https://en.wikipedia.org/wiki/Dingbat) to boot! Sadly, software often has to guess what encoding an input file has, and some versions of programs like [MS Excel](https://support.guidebook.com/hc/en-us/articles/360016372414) have their own encoding, leading to confusion in translation.

* **Fundamental datatypes**: Crucial to machine readability, a value can be of a certain fundamental "literal" or syntactic **datatype**, like a string, date, time, integer or decimal number, boolean, categorical value or URL reference type. A few common standard "data-interchange languages" exist that express these: [XML](https://www.w3.org/TR/xmlschema11-2/#built-in-datatypes), [JSON](https://json-schema.org/understanding-json-schema/reference/type) and [SQL](https://www.digitalocean.com/community/tutorials/sql-data-types). (There one can see translation issues where one standard allows a less atomic "number" datatype while another only has "decimal" or "integer" - so a conversion from a schema using one datatype standard to another schema with a different data type requires "sniffing" or parsing what kind of number the former contains.)
* **Units**: Numeric values may be accompanied by units (e.g. "1m" for a meter, or "2d" for 2 days). Whether a unit is bundled with a number as a single string datatype value, or whether they are separated out into separate datatype values is a matter for the schema developers to settle. By themselves, units need a string or coding representation, such as provided by [UCUM codes](https://units-of-measurement.org/) or an ontology of units (e.g. [QUDT](http://qudt.org/), [OM](http://www.ontology-of-units-of-measure.org/), [UO](https://obofoundry.org/ontology/uo)).
* A data schema can also provide more complex string data type extensions by imposing further constraints on their syntax in order to express for example the [ISO 19115-1:2014
@@ -54,15 +52,18 @@ Encountering a value that has a syntactic structure beyond random characters sug
* **Compound attributes**: These are object or data structure specifications - which may be summarized as a single attribute - made out of several attributes, so for example a "location" might be a combination of latitude and longitude in one string. Another example is the variety of address kinds (postal box, street, legal, head office, home), which could be serialized in a single string value; instead, we prefer to break them up into a general "address" object containing street or post office box, city, postal code, region etc. Particular address kinds inherit the attributes of the general address kind and add their own, such as for the postal box kind.

* **Permanent identifiers**: Given the need to reference vocabulary and other data resources on the web, standardization work often involves a kind of value called a **permanent identifier** reference that points to a resource like a dataset, document, or vocabulary term detail page. For a web reference this is called a **permanent URL or "purl"**, such as [http://purl.obolibrary.org/obo/OBI_0001167](http://purl.obolibrary.org/obo/OBI_0001167). Once a purl goes into circulation on the web, it is expected to remain there so it can always retrieve the resource, or, if what it points to becomes archaic or discontinued, a "deprecated" code response. Additionally, if a newer vocabulary or resource replaces it, a replacement identifier is indicated. This way data content can be updated to harmonize and simplify federation and querying.

There are registries of purl-endowed resources which include databases and ontologies, such as the W3C Permanent Identifier Community Group's [purl registry](https://w3id.org/), and [bioregistry.io](https://bioregistry.io/) which has more of a life science research focus and is an excellent place for projects to add their own resource links (since there are many vocabularies referenced in databases that are not yet represented on the web).
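
As a rough illustration of how purls behave in practice, the following Python sketch converts between an OBO-style purl and its compact `PREFIX:ID` form, and follows "deprecated" and "replaced by" links to find a current identifier. The function names, the `TermRecord` class, and the `example.org` entries are hypothetical and exist only for this example; a real resolver would consult the vocabulary's own metadata rather than an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Base of OBO Foundry purls, as in http://purl.obolibrary.org/obo/OBI_0001167
OBO_PREFIX = "http://purl.obolibrary.org/obo/"


def purl_to_curie(purl: str) -> str:
    """Turn an OBO-style purl into a compact 'PREFIX:ID' string."""
    if not purl.startswith(OBO_PREFIX):
        raise ValueError(f"not an OBO purl: {purl}")
    local = purl[len(OBO_PREFIX):]
    # Simplifying assumption: the last underscore separates prefix from local id.
    prefix, sep, local_id = local.rpartition("_")
    if not sep or not prefix or not local_id:
        raise ValueError(f"unexpected purl shape: {purl}")
    return f"{prefix}:{local_id}"


def curie_to_purl(curie: str) -> str:
    """Turn a compact 'PREFIX:ID' string back into an OBO-style purl."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"unexpected identifier shape: {curie}")
    return f"{OBO_PREFIX}{prefix}_{local_id}"


@dataclass
class TermRecord:
    """Hypothetical status record for one identifier."""
    deprecated: bool = False
    replaced_by: Optional[str] = None


def resolve_current(purl: str, records: Dict[str, TermRecord], max_hops: int = 10) -> str:
    """Follow replacement links until a non-deprecated identifier is reached."""
    current = purl
    for _ in range(max_hops):
        record = records[current]  # raises KeyError for unknown identifiers
        if not record.deprecated:
            return current
        if record.replaced_by is None:
            raise LookupError(f"{current} is deprecated and has no replacement")
        current = record.replaced_by
    raise LookupError(f"too many replacement hops starting from {purl}")


# Made-up records, for illustration only.
records = {
    "http://example.org/term/OLD_1": TermRecord(
        deprecated=True, replaced_by="http://example.org/term/NEW_1"
    ),
    "http://example.org/term/NEW_1": TermRecord(),
}

print(purl_to_curie("http://purl.obolibrary.org/obo/OBI_0001167"))
print(resolve_current("http://example.org/term/OLD_1", records))
```

Keeping the replacement link alongside the deprecation flag is what lets existing data be updated automatically when a term is retired.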

* **Structured vocabulary**: We use the term "structured vocabulary" to describe a file of vocabulary terms such as a taxonomy or ontology that includes attribute details to some extent - such as plain English or other language names, coding names, purls, definitions, and attribute semantics such as hierarchies of terms. There are many structured vocabulary catalogues as lists or searchable portals, including:

* The international CGIAR agricultural research agency has published a resource of common [Ontologies for agriculture](https://bigdata.cgiar.org/ontologies-for-agriculture/).
* [AgroPortal](https://agroportal.lirmm.fr/) supported by a number of leading French research agencies is another source of agriculture research vocabulary.
* [OBO Foundry](https://obofoundry.org/) is a multi-agency collaborative effort of life science ontologies related to agriculture, biology, climate, and ecology research, all operating within an aligned curational methodology.
    * [OLS](https://www.ebi.ac.uk/ols4), the European Molecular Biology Laboratory (EMBL) European Bioinformatics Institute ontology search interface, reflects the agency's commitment to developing ontologies in the life science area.
* [Fairsharing.org](https://fairsharing.org/search?fairsharingRegistry=Standard) has a standards registry dedicated to workflow and ontology resources.
* Wikidata has an extensive vocabulary for **geographical names**, etc.

As well
More details on applying suitable ontologies to dataset standardization are provided in the [ontology](https://github.com/ClimateSmartAgCollab/Documentation-en/blob/main/docs/Data_Standardization/ontology.md) section.
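
Returning to the compound attributes described above, the idea of a general "address" object with more specific kinds inheriting from it can be sketched with Python dataclasses. This is only a minimal illustration under stated assumptions: the class names, field names, and the comma-separated "latitude,longitude" string format are all assumptions made for the example, not part of any particular schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Location:
    """A compound attribute split into its parts (hypothetical field names)."""
    latitude: float
    longitude: float

    @classmethod
    def from_string(cls, text: str) -> "Location":
        """Parse a 'latitude,longitude' string such as '45.42, -75.70'."""
        lat_text, lon_text = text.split(",")
        latitude, longitude = float(lat_text), float(lon_text)
        if not (-90 <= latitude <= 90 and -180 <= longitude <= 180):
            raise ValueError(f"coordinates out of range: {text}")
        return cls(latitude, longitude)


@dataclass
class Address:
    """General address kind holding the attributes shared by all address kinds."""
    city: str
    region: str
    postal_code: str
    street: Optional[str] = None


@dataclass
class PostalBoxAddress(Address):
    """Postal box kind: inherits the general attributes and adds its own."""
    box_number: Optional[str] = None


location = Location.from_string("45.42, -75.70")
box = PostalBoxAddress(
    city="Guelph", region="ON", postal_code="N1G 2W1", box_number="123"
)

print(location)
print(asdict(box))
```

Splitting the string into typed fields makes range checks and per-field validation possible, which a single serialized string does not allow without parsing first.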

## Standardisation layers