---
title: "Using the dccvalidator app in 1kD"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{using-the-dccvalidator-1kD}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
# Introduction
We've built the dccvalidator tool to streamline data validation and quality assurance/quality control. The application performs many of the routine data quality checks that were previously done by hand, in the hope that it will help you, the data contributor, get your data checked, validated, and shared quickly and easily.
## Application Requirements
To use this application you must:

1. Be logged in to Synapse in your browser
2. Be a [Synapse certified user](https://docs.synapse.org/articles/accounts_certified_users_and_profile_validation.html){target="_blank"}
3. Be a member of at least one of the following teams: [1kD_Assembloids](https://www.synapse.org/#!Team:3436716), [1kD_BRAINRISE](https://www.synapse.org/#!Team:3436720), [1kD_Connectome](https://www.synapse.org/#!Team:3436722), [1kD_DyadicSociometrics_NTU](https://www.synapse.org/#!Team:3436713), [1kD_DyadicSociometrics_Cambridge](https://www.synapse.org/#!Team:3466183), [1kD_First1000Daysdatabase](https://www.synapse.org/#!Team:3436714), [1kD_InfantNaturalStatistics](https://www.synapse.org/#!Team:3436721), [1kD_KHULA](https://www.synapse.org/#!Team:3436718), [1kD_M4EFaD_LABS](https://www.synapse.org/#!Team:3436717), [1kD_M4EFaD_BMT](https://www.synapse.org/#!Team:3458847), [1kD_MicrobiomeBrainDevelopment](https://www.synapse.org/#!Team:3436509), [1kD_M4EFaD_Auckland](https://www.synapse.org/#!Team:3460645), [1kD_Computer Vision Analysis of PCI](https://www.synapse.org/#!Team:3464137){target="_blank"}

Some portions of the app submit data to Synapse. This allows curators at Sage to troubleshoot issues if needed. No one outside the [1kD admins](https://www.synapse.org/#!Team:3433360) will be able to download the data.
## Terminology
#### What is a biospecimen?
A biospecimen is a sample of material such as tissue, cells, DNA, RNA or protein that has a unique identifier associated with it. The same biospecimen may be characterized in multiple assay types.
#### What is a manifest? How is a manifest different from a metadata file?
A manifest is a .tsv or .txt file in which each row describes one data file to be uploaded to Synapse. Details of a manifest are described in the [Uploading and Downloading Data in Bulk](https://docs.synapse.org/articles/uploading_in_bulk.html){target="_blank"} Synapse User Guide. A metadata file is stored on Synapse as a flat file, with only selected variables added as file annotations. In a manifest, by contrast, every variable becomes an annotation on the file in that row. To upload a file successfully, each row must give the local path to the file in the `path` column and the Synapse ID of the folder in the `parent` column.
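
As an illustration of the layout described above, here is a minimal sketch that writes a tab-separated manifest using only the Python standard library. The file paths, the `syn00000000` folder ID, and the `assay` annotation column are hypothetical placeholders, not values from the 1kD templates; `write_manifest` is a helper defined here and is not part of any Synapse client.

```python
import csv
import io

# Hypothetical rows: the paths, the folder ID, and the `assay` column are placeholders.
rows = [
    {"path": "/data/study/sample_01.csv", "parent": "syn00000000", "assay": "metabolomics"},
    {"path": "/data/study/sample_02.csv", "parent": "syn00000000", "assay": "metabolomics"},
]


def write_manifest(rows, handle):
    """Write rows as a tab-separated manifest with `path` and `parent` first."""
    # Any other keys become additional columns (annotations on the file in that row).
    extra = sorted({key for row in rows for key in row} - {"path", "parent"})
    fieldnames = ["path", "parent"] + extra
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Both columns are required for every file.
        if not row.get("path") or not row.get("parent"):
            raise ValueError("every row needs both `path` and `parent`")
        writer.writerow(row)
    return fieldnames


buffer = io.StringIO()
columns = write_manifest(rows, buffer)
manifest_text = buffer.getvalue()
print(manifest_text)
```

To produce a file on disk, pass a handle from `open("manifest.tsv", "w", newline="")` instead of the in-memory buffer.
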
## Data validation
### Metadata requirements
Each study should include metadata that would help a new researcher understand and reuse the data. This consists of:

1. A **[1kD_minimum_ChildParent_metadata](https://www.synapse.org/#!Synapse:syn29358381)** file describing each child, parent, or other caregiver in the study. Each row corresponds to a unique individual.
2. An **[UploadManifest](https://www.synapse.org/#!Synapse:syn27832372)** listing each file that will be uploaded, along with its assay-specific information. Remember to include your metadata files in the manifest.
3. A **[Biospecimen metadata](https://www.synapse.org/#!Synapse:syn29358373)** file describing the specimens that were collected. **Only applicable to metabolomics and metagenomics data.**

Metadata file templates are [available in the 1kD Synapse project](https://www.synapse.org/#!Synapse:syn27817580){target="_blank"}.

If you don't see a template for the assay(s) in your study, please send a request for a new schema to [1kD_SageAdmin@synapse.org](mailto:1kD_SageAdmin@synapse.org){target="_blank"}. We depend on your expertise to develop schemas that capture the most pertinent metadata!
### Validating metadata
The data validation portion of the app allows you to upload metadata files (as .csv) and the manifest (as .tsv or .txt) and view the results of a series of automated checks.
Examples of the types of checks we perform are:
* All required columns from the templates are present
* Individuals and specimens have unique identifiers
* Metadata terms conform to a controlled vocabulary if the column uses controlled values (for example, TRUE is used instead of Yes or Y)
* If both 1kD_standardized_demographic_data and biospecimen templates are used, each biospecimen is linked to an individual
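
To make the checks listed above concrete, here is a rough sketch of how each one could be expressed in Python. It is not the dccvalidator implementation; the column names (`individualID`, `specimenID`, `sex`, `hasSibling`) and the sample rows are hypothetical and chosen only for illustration.

```python
from collections import Counter


def check_required_columns(rows, required):
    """Return the required columns that are missing from the table."""
    present = set(rows[0].keys()) if rows else set()
    return sorted(set(required) - present)


def check_unique_ids(rows, id_column):
    """Return identifiers that appear more than once."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def check_controlled_values(rows, column, allowed):
    """Return values in `column` that fall outside the allowed vocabulary."""
    return sorted({row[column] for row in rows} - set(allowed))


def check_specimens_linked(biospecimens, individuals, id_column):
    """Return individual IDs used by specimens but absent from the individuals table."""
    known = {row[id_column] for row in individuals}
    return sorted({row[id_column] for row in biospecimens} - known)


# Hypothetical metadata with one problem of each kind.
individuals = [
    {"individualID": "A1", "hasSibling": "TRUE"},
    {"individualID": "A2", "hasSibling": "Yes"},    # not in the vocabulary
    {"individualID": "A2", "hasSibling": "FALSE"},  # duplicate identifier
]
biospecimens = [
    {"specimenID": "S1", "individualID": "A1"},
    {"specimenID": "S2", "individualID": "B9"},     # no matching individual
]

missing = check_required_columns(individuals, ["individualID", "sex"])
duplicates = check_unique_ids(individuals, "individualID")
bad_values = check_controlled_values(individuals, "hasSibling", ["TRUE", "FALSE"])
unlinked = check_specimens_linked(biospecimens, individuals, "individualID")
```

Each function returns the offending values, so an empty list means the check passed.
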
### Viewing data summary
We also provide a summary of the files you have uploaded, showing the number of individuals, specimens, and files. We visualize the data in each column by its data type to help spot unexpected missing values.
## Uploading data to Synapse after validation
Once your data has passed validation and the 1kD data curators have granted you edit permissions on the **Staging** folder, use your newly created manifest file to upload the data with `syncToSynapse`. `syncToSynapse` is available in both the [Python client](https://python-docs.synapse.org/build/html/synapseutils.html#synapseutils.sync.syncToSynapse){target="_blank"} and the [R client](https://github.com/Sage-Bionetworks/synapserutils#upload-data-in-bulk){target="_blank"}. The Synapse Python client supports multithreaded upload and will provide faster upload speeds than the Synapse R client. To get started with the Synapse programmatic clients, please visit our [Synapse docs](https://docs.synapse.org/articles/api_documentation.html){target="_blank"}.
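
The upload step might look like the following sketch with the Python client. It assumes the `synapseclient` package is installed and that Synapse credentials are already configured locally; `manifest_problems` and `upload` are hypothetical helpers written for this example, and the `syn` plus digits pattern is an assumption about what a folder ID looks like. The local check runs without the client installed because the client is imported only inside `upload`.

```python
import csv
import os
import re

# Assumed shape of a Synapse ID: "syn" followed by digits.
SYNAPSE_ID = re.compile(r"^syn\d+$")


def manifest_problems(manifest_path):
    """Return a list of problems found in a tab-separated manifest."""
    problems = []
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # The header is line 1, so data rows start at line 2.
        for line_number, row in enumerate(reader, start=2):
            path = row.get("path") or ""
            parent = row.get("parent") or ""
            if not os.path.isfile(path):
                problems.append(f"line {line_number}: file not found: {path!r}")
            if not SYNAPSE_ID.match(parent):
                problems.append(
                    f"line {line_number}: parent is not a Synapse ID: {parent!r}"
                )
    return problems


def upload(manifest_path, dry_run=True):
    """Check the manifest locally, then hand it to syncToSynapse."""
    # Imported here so manifest_problems() works without the client installed.
    import synapseclient
    import synapseutils

    problems = manifest_problems(manifest_path)
    if problems:
        raise ValueError("\n".join(problems))

    syn = synapseclient.Synapse()
    syn.login()  # assumes credentials are already configured
    # With dryRun=True the manifest is validated but nothing is uploaded.
    synapseutils.syncToSynapse(syn, manifestFile=manifest_path, dryRun=dry_run)
```

A cautious workflow is to call `upload("manifest.tsv")` first, which performs a dry run, and then repeat with `dry_run=False` once no problems are reported.
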
## Data Release
Data is uploaded to a **Staging** folder that is private to each study group. Once curated, the data is moved to a **Data** folder, where team members can access it through the associated 1kD Team. All data uploads take place in the [1kD Synapse project](https://www.synapse.org/#!Synapse:syn26133760/wiki/613444){target="_blank"}. Because access is team-specific, the data remains private to the study team.
## Get support
Please send questions to [1kD_SageAdmin@synapse.org](mailto:1kD_SageAdmin@synapse.org){target="_blank"}.