2.1 Architecture of FAIR Data Cube
The architecture of the FAIR Data Cube is shown in Fig 3, which illustrates its main components and their interactions. The numbered arrows form a pipeline for creating and publishing metadata.
Fig 3. FAIR Data Cube architecture.
Below is a walk-through of each step in the pipeline.
- Storage of raw and processed omics data:
1&2: We employ existing data standards developed by omics communities, including the [HUPO Proteomics Standards Initiative](https://www.psidev.info/specifications), the Metabolomics Standards Initiative (MSI), and the Global Alliance for Genomics and Health. Raw and processed data of different omics types (genomics, proteomics, metabolomics, etc.) are typically kept in flat files in community standard formats (e.g. VCF, mzML, MAF), or stored in a suitable EMBL-EBI data repository for archiving.
- Creation of metadata
For analysis procedures, the ISA schema is used as the standard metadata schema, capturing metadata across multiple omics data types; it is serialized into an intermediate ISA-JSON file using the ISA tools. For phenotype data, the Phenopacket schema is used to encode the phenotype metadata, which is serialized as a phenopacket JSON file.
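As an illustration, a minimal phenopacket-style document can be assembled and serialized with the standard library. The top-level field names follow the GA4GH Phenopacket schema; all identifiers and values below are made up for illustration and are not taken from the actual pipeline.

```python
import json

# A minimal, illustrative phenopacket-like structure. Field names follow the
# GA4GH Phenopacket schema (id, subject, phenotypicFeatures); the identifiers
# and values are purely illustrative.
phenopacket = {
    "id": "example-phenopacket-1",
    "subject": {
        "id": "patient-1",
        "sex": "FEMALE",
    },
    "phenotypicFeatures": [
        {
            "type": {
                "id": "HP:0001250",   # HPO term for "Seizure"
                "label": "Seizure",
            }
        }
    ],
}

# Serialize to a phenopacket JSON file, as done in the pipeline above.
serialized = json.dumps(phenopacket, indent=2)
print(serialized)
```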
- Import/registration of metadata
3: Both the ISA-JSON file and the phenopacket JSON file are converted into linked data using YARRRML mappings.
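The effect of such a mapping can be sketched in plain Python: the function below turns a hypothetical JSON record into RDF triples in N-Triples syntax, mimicking what a YARRRML/RML mapping engine would emit. The vocabulary (example.org IRIs, `dct:title`, `dcat:Dataset`) is illustrative, not the project's actual mapping.

```python
# Sketch of what a YARRRML/RML mapping engine produces: RDF triples generated
# from a JSON record. The subject/predicate IRIs here are illustrative; a real
# mapping would use the project's own ontology terms.

def record_to_ntriples(record: dict) -> list:
    subject = f"<http://example.org/dataset/{record['identifier']}>"
    triples = [
        f'{subject} <http://purl.org/dc/terms/title> "{record["title"]}" .',
        f'{subject} <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
        f'<http://www.w3.org/ns/dcat#Dataset> .',
    ]
    return triples

for line in record_to_ntriples({"identifier": "ds1", "title": "Example study"}):
    print(line)
```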
4: The FAIRified linked data from the previous step are imported into the knowledge graph, an embedded component of the FAIR Data Point (FDP).
- Querying of metadata
5: The FAIR Data Point serves as a metadata hub for publishing dataset metadata. For user-friendliness, the FAIR Data Point may display complete or partial metadata in a human-readable portal for browsing, searching, and querying.
6: A user can search, browse, and query the metadata on the FDP; after finding an interesting dataset, the user sends computation requests to the Vantage6 server and retrieves the results.
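A FAIR Data Point publishes its metadata as RDF over plain HTTP, so a client can retrieve it with content negotiation. The sketch below only constructs the request (the FDP URL is a placeholder); a real client would then call `urlopen(req)` and parse the returned Turtle.

```python
from urllib.request import Request

# Build an HTTP request for machine-readable metadata from a FAIR Data Point.
# The URL is a placeholder; no network call is made in this sketch.
fdp_url = "https://fdp.example.org/catalog/my-catalog"
req = Request(fdp_url, headers={"Accept": "text/turtle"})

print(req.get_full_url())
print(req.get_header("Accept"))
```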
- Analysis of data
We use Vantage6 to deliver the user's computation request to the actual dataset and run the analysis there.
Vantage6 Server
Vantage6 works as a relay, passing computation requests and results between the researcher (the potential dataset user) and the datasets. This implements the idea of bringing the computation to the data, which complies with privacy and legal regulations and addresses intellectual-property concerns, because the script is run on the dataset on-site, under the supervision of the dataset guardian.
A computation request consists of:
- A reference to a Docker image, which contains the code (computation) that the researcher would like to run on the target dataset
- A list describing the dataset of interest and the purpose of use
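The two parts of a computation request can be modeled as a small data structure. The field names below are our own illustration, not the actual Vantage6 client API.

```python
from dataclasses import dataclass

# Illustrative model of a Vantage6-style computation request: a Docker image
# reference plus a description of the target data and the purpose of use.
# Field names are our own; the real Vantage6 API differs.
@dataclass
class ComputationRequest:
    docker_image: str   # image containing the researcher's code
    datasets: list      # datasets of interest
    purpose: str        # stated purpose of use

request = ComputationRequest(
    docker_image="registry.example.org/research/mean-age:1.0",
    datasets=["cohort-a"],
    purpose="Compute aggregate statistics for study X",
)
print(request.docker_image)
```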
The Vantage6 server handles authentication, keeps track of all computation requests, assigns them to nodes, and stores their results.
This server could also host a private Docker registry.
Vantage6 Node
A Vantage6 node has access to its site's data. It listens (via WebSockets) for work, i.e. computation requests. Once it receives a request, it executes it by:
- Downloading the corresponding Docker image.
- Running the image with the input parameters.
The code that runs in the image has access to the local data through the node.
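The node's two execution steps translate to ordinary Docker commands. The helper below only constructs the command lines (image name and input path are placeholders) rather than invoking Docker; a real node would pass them to something like `subprocess.run`.

```python
# Sketch of the node's two steps as Docker command lines. The commands are
# only constructed here; the image reference and input path are placeholders.

def build_node_commands(image: str, input_file: str) -> list:
    pull = ["docker", "pull", image]
    # Mount only the task input read-only; the node mediates access to the
    # local data, so no patient records are baked into the image.
    run = ["docker", "run", "--rm",
           "-v", f"{input_file}:/mnt/input.json:ro",
           image]
    return [pull, run]

for cmd in build_node_commands("registry.example.org/research/mean-age:1.0",
                               "/tmp/task-42-input.json"):
    print(" ".join(cmd))
```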
The results should never contain any identifiable (patient) information, only aggregated statistics.
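This aggregates-only rule means the algorithm inside the image returns summary statistics rather than records. A toy example with hypothetical patient records:

```python
# Toy illustration of the aggregates-only rule: the function returns only
# summary statistics (count and mean), never the patient-level records.
def aggregate_age(records: list) -> dict:
    ages = [r["age"] for r in records]
    return {"n": len(ages), "mean_age": sum(ages) / len(ages)}

records = [
    {"patient_id": "p1", "age": 40},   # identifiable fields stay on-site
    {"patient_id": "p2", "age": 50},
]
print(aggregate_age(records))  # -> {'n': 2, 'mean_age': 45.0}
```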