-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathusing-the-dccvalidator-app-amp-ad.Rmd
132 lines (99 loc) · 7.2 KB
/
using-the-dccvalidator-app-amp-ad.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
---
title: "Using the dccvalidator app in AMP-AD"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{using-the-dccvalidator-app-amp-ad}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Introduction
As the AMP-AD Knowledge Portal has grown to 50+ studies and over
70,000 data files, we've realized a need to be more standardized in our
approaches to data curation. Thus, we built an application that performs many of
the routine data quality checks we previously conducted by hand, with the hopes
that it will help you, the data contributor, get your data checked, validated,
and shared more easily and quickly.
The dccvalidator tool checks your metadata for, among other things:
- The correct structure (are all of the columns there?)
- The values in the cells are correct if the column uses controlled values (for example, TRUE is used instead of Yes or Y)
- If both individuals and biospecimen templates are used, each biospecimen is linked to an individual
# Requirements
To use this application you must:
1. Be logged in to Synapse in your browser
2. Be a [Synapse certified user](https://docs.synapse.org/articles/accounts_certified_users_and_profile_validation.html){target="_blank"}
3. Be a member of the AD_PortalContributor Synapse team
Some portions of the app submit data to Synapse. This allows curators at Sage to
troubleshoot issues if needed; no one outside the Sage curation team will be
able to download the data.
# Instructions
This topic has a general overview of the data contribution process and
detailed instructions for each step, including uploading documentation, metadata
requirements, validating and reviewing the metadata, and uploading the dataset.
## General Process Overview
1. Contact the [AD Data Liaison](mailto:AMPAD_SageAdmin@synapse.org){target="_blank"} to discuss the study and the expected data. Receive
staging folder synIDs for each expected dataset.
2. Using the templates located at the bottom of the [Onboarding page](https://www.synapse.org/#!Synapse:syn24986618){target="_blank"}, create a study description, methods description(s), and acknowledgement statement.
3. Upload the [study description](https://www.synapse.org/#!Synapse:syn25014531){target="_blank"}, [methods description](https://www.synapse.org/#!Synapse:syn25018065){target="_blank"} and [acknowledgement statement](https://www.synapse.org/#!Synapse:syn25014532){target="_blank"} using the [Submission Form](https://www.synapse.org/#!Synapse:syn25051271){target="_blank"}. Instructions on writing this documentation is located on our [Data Contributor Information](https://www.synapse.org/#!Synapse:syn24986618){target="_blank"} page.
4. Validate metadata + manifest files in [dccvalidator](https://shinypro.synapse.org/users/kwoo/dccvalidator-app/){target="_blank"}.
5. Contact the [AD Data Liaison](mailto:AMPAD_SageAdmin@synapse.org){target="_blank"} when all files pass validation. The team will verify
items not checked by the dccvalidator. Receive permissions to upload data to the
staging folder.
6. Use the validated manifest to upload the data with `syncToSynapse` (see
[Synapse documentation](https://docs.synapse.org/articles/uploading_in_bulk.html#validate-the-manifest-and-upload-files) for uploading data in bulk){target="_blank"}.
7. Contact the [AD Data Liaison](mailto:AMPAD_SageAdmin@synapse.org){target="_blank"}. The AD Curation team will do the final verifications before
releasing the data.
## Documentation Upload
Each study in AMP-AD has the accompanying [documentation in the portal](https://adknowledgeportal.synapse.org/#/Explore/Studies?Study=syn8391648){target="_blank"}:
1. **Study Description:** use template located [here](https://www.synapse.org/#!Synapse:syn25014531){target="_blank"} to write your study description
2. **Methods Description(s):** use template located [here](https://www.synapse.org/#!Synapse:syn25018065){target="_blank"} to write a methods description for **each** of your assay types
3. **Acknowledgement statement:** use template located [here](https://www.synapse.org/#!Synapse:syn25014532){target="_blank"} to write an acknowledgement statement for the use of those that cite your data
You can submit your documentation through the [Submission Form](https://www.synapse.org/#!Synapse:syn25051271){target="_blank"} located on our [Onboarding Documentation](https://www.synapse.org/#!Synapse:syn24986618){target="_blank"}
## Data Validation
### Metadata Requirements
Each study should include metadata that would help a new researcher understand
and reuse the data. In most cases, we will expect 4 files:
1. **Individual metadata** a csv file describing each individual in the study.
2. **Biospecimen metadata** a csv file describing the specimens that were collected.
3. **Assay metadata** a csv file describing the assay that was performed. If multiple
assays were part of the study, there will be one assay file for each.
4. A **manifest** listing each file that will be uploaded. You will use this
file to upload your data after it has been validated and approved. The manifest
should be in tsv (tab-delimited text) format.
We provide templates for all of the metadata files within the portal:
https://www.synapse.org/#!Synapse:syn18512044
You can download these files, fill out the first tab, and save it as a .csv or .tsv
file. The other tabs exist to describe the variables and allowed values in the
template. If you do not have any data for some of the columns, you can leave
them blank (but do not remove the column header).
If you don't see a template for the assay(s) in your study, or if not all of the
metadata types above seem relevant to your study, please get in touch with us at
AMPAD_SageAdmin@synapse.org.
### Validating the Metadata and Manifest
The data validation portion of the app allows you to upload metadata files (as
.csv) and the manifest (as .tsv or .txt) and view the results of a series of
automated checks.
Examples of the types of checks we perform are:
- All required columns from the templates are present
- Individuals and specimens have unique identifiers
- Metadata terms conform to a controlled vocabulary where applicable
### Viewing Data Summary
We also provide a summary of the files you have uploaded, showing the number of
individuals, specimens, and files. We visualize the data in each column by its
data type to help spot unexpected missing values.
## Uploading Data
Once data has passed validation, and the AMP-AD data curators permit edit
permissions to the staging folder for your study, you will use your newly
created manifest file to upload your data and metadata using `syncToSynapse`.
You can execute `syncToSynapse` in the
[Python client](https://python-docs.synapse.org/build/html/synapseutils.html#synapseutils.sync.syncToSynapse){target="_blank"}
or
[R client](https://github.com/Sage-Bionetworks/synapserutils#upload-data-in-bulk){target="_blank"}.
For uploads with more than 100 files or large file sizes, the Python client or command line client will upload substantially faster than the R client.
For getting started with the Synapse programmatic clients, please visit our
[Synapse docs](https://docs.synapse.org/articles/api_documentation.html){target="_blank"}.