# SV detection pipeline
Talkowski Lab structural variant detection pipeline. Documentation in progress.
## Installation and Usage
As a `snakemake` workflow, the pipeline is intended to be cloned for each
project, e.g.
```
$ git clone git@github.com:talkowski-lab/sv-pipeline.git MySVDiscoveryProject
```
Create and activate a new Anaconda environment with all the necessary
dependencies.
```
$ cd MySVDiscoveryProject
$ conda env create -f environment.yaml
$ source activate sv_pipeline
```
After cloning the pipeline, edit `config.yaml` to update the configuration as
necessary for the project, then link or copy raw data into the `data/` or
`ref/` directories. (More detail is provided below; further documentation is in
progress. For current testing purposes, symlink the `data/` and `ref/`
directories from
`/data/talkowski/Samples/SFARI/deep_sv/asc_540/sv-pipeline-devel/`.)
Optionally, perform a `snakemake` dry run to validate the configuration, then
run the pipeline with `snakemake`.
```
$ vim config.yaml
$ ln -s ${raw_SV_calls} data/
$ cp ${reference_data} ref/
$ snakemake -np
$ snakemake
```
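Snakemake can also run independent jobs in parallel. The core count below is an
arbitrary example, and the flag is a generic `snakemake` option rather than
anything specific to this pipeline.
```
$ snakemake -j 8
```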
The pipeline can remove all files it has generated without affecting
configuration or data files.
```
$ snakemake clean
```
## Explicit Dependencies
If you would prefer to use your own Python environment, the following packages
are necessary to run the pipeline.
### Python 3
The pipeline requires Python 3. If you already have an established Python 3
environment, you can ignore this subsection. Otherwise, the recommended way to
create a Python 3 environment with the required packages is with
[Anaconda](https://www.continuum.io/downloads).
```
$ conda create -n $environment -c bioconda python=3.5 numpy scipy pysam snakemake
$ source activate $environment
```
### Snakemake
The pipeline is built with `snakemake`, which can be installed through `pip` or
`conda` with one of the two following commands.
```
$ pip install snakemake
$ conda install -c bioconda snakemake
```
`snakemake` is an excellent and recommended tool for building bioinformatics
pipelines, but a comprehensive understanding of the tool is not necessary to
run this pipeline. If you are interested in learning more, extended `snakemake`
documentation can be found on its [Read the Docs
page](https://snakemake.readthedocs.io/en/stable/). A
[tutorial](https://snakemake.bitbucket.io/snakemake-tutorial.html) and
[demonstrative slides](http://slides.com/johanneskoester/deck-1#/) are also
available.
### SVtools
The pipeline requires the `svtools` Python package, which is currently
available only on GitHub.
```
$ git clone git@github.com:talkowski-lab/svtools.git
$ cd svtools
$ pip install -e .
```
### Bedtools
The pipeline requires bedtools 2.26 or later. Earlier versions may throw an
error when `bedtools merge` is passed an empty file.
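If you are unsure which version you have, one way to check it and, if needed,
install a recent release is through the bioconda channel (a sketch; any other
installation route works equally well).
```
$ bedtools --version
$ conda install -c bioconda "bedtools>=2.26"
```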
### pybedtools
Per-chromosome parallelization requires the master branch of `pybedtools` (or
at least commit `b1e0ce0`), which can be installed with either of the following
commands.
```
$ pip install git+https://github.com/daler/pybedtools.git@master
$ pip install git+https://github.com/daler/pybedtools.git@b1e0ce0
```
# Running the pipeline
The pipeline consists of multiple independent modules. Documentation of each
module is provided in the respective subdirectory.
0. [Preprocessing](preprocessing/README.md)
(optional, not intended for all users)
1. [Algorithm integration](algorithm_integration/README.md)
2. [RD-test](rdtest/README.md)
3. SR-test
## Rolling your own preprocessing
The preprocessing module here is provided for reproducibility and as an
example implementation of SV algorithm standardization, but is not intended to
be generalizable to all use cases.
If you would like to implement your own preprocessing before beginning
integration and filtering, the pipeline can be bootstrapped to begin with the
integration module by providing the following input files:
* `algorithm_integration/input_vcfs/{source}.{group}.vcf.gz`
  Tabix-indexed PE/SR VCFs per algorithm.
  - `{source}` corresponds to the name of the source algorithm.
  - `{group}` corresponds to an ID for a subgroup of samples which were called
    jointly. (Historically a quad in the SSC cohort.)
  - VCF records are required to include INFO fields for SVTYPE, CHR2, END,
    STRANDS [++,+-,-+,--], SVLEN, and SOURCES. Currently, DEL, DUP, INV, and
    BND are supported SVTYPEs; INS has not been tested. BND and INV
    breakpoints must be segregated and annotated with strand.
* `algorithm_integration/input_beds/{batch}.{svtype}.bed.gz`
  All per-sample depth calls in a given batch, merged across algorithms.
  - Depth calls must be segregated by `{svtype}`, which may be DEL or DUP.
  - Each line corresponds to a single call in a single sample.
  - The first six columns must be [chrom, start, end, name, sample, svtype].
    Any following columns are discarded during integration.
  - In our preprocessing module, these files are generated by taking all
    algorithms' calls for a given sample and performing a `bedtools merge`,
    then concatenating and sorting all samples' calls (see the sketch below).
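For illustration only, the following sketch shows one way to prepare these
inputs by hand. The algorithm, sample, group, and batch names (`delly`,
`NA12878`, `quad1`, `batch1`) are hypothetical placeholders, and `bgzip` and
`tabix` are assumed to be available from htslib.
```
# Compress and index a standardized PE/SR VCF for one algorithm and one group.
$ bgzip delly.quad1.vcf
$ tabix -p vcf delly.quad1.vcf.gz
$ mv delly.quad1.vcf.gz* algorithm_integration/input_vcfs/
# Merge one sample's DEL calls across algorithms, then add the required
# name/sample/svtype columns.
$ cat NA12878.*.DEL.bed \
    | sort -k1,1 -k2,2n \
    | bedtools merge -i stdin \
    | awk -v OFS='\t' '{print $1, $2, $3, "NA12878_DEL_"NR, "NA12878", "DEL"}' \
    > NA12878.DEL.merged.bed
# Concatenate and sort all samples in the batch, then compress.
$ cat *.DEL.merged.bed | sort -k1,1 -k2,2n \
    | bgzip -c > algorithm_integration/input_beds/batch1.DEL.bed.gz
```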
## Pipeline configuration
All variables controlling pipeline operation can be modified in `config.yaml`;
an illustrative example follows the list below. This documentation is
incomplete and in progress.
* `quads` : filepath
  Path to a list of quads (or other subgroup identifiers) on which to run the
  pipeline. (Currently required. If only one subgroup exists, simply provide
  one ID for it.)
* `samples` : filepath
  Path to a list of all samples represented in the union of quads (or other
  subgroups).
* `pesr_sources` : list of strings
  List of all PE/SR algorithms to merge.
* `svtypes` : list of strings
  List of SV classes to consider when combining PE/SR variants. (Default:
  DEL,DUP,INV,BND)
* `cnv_types` : list of strings
  List of SV classes to consider when combining depth variants. (Default:
  DEL,DUP)
* `chroms` : filepath
  Path to a list of chromosomes to consider. Integration and filtering will be
  parallelized by chromosome.
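For illustration, a minimal `config.yaml` using the keys above might look like
the following. The file paths and algorithm names (`delly`, `lumpy`, `manta`)
are placeholders for this sketch, not defaults shipped with the pipeline.
```
$ cat config.yaml
quads: ref/quads.list          # one quad/subgroup ID per line
samples: ref/samples.list      # one sample ID per line
pesr_sources: [delly, lumpy, manta]
svtypes: [DEL, DUP, INV, BND]
cnv_types: [DEL, DUP]
chroms: ref/chroms.list        # one chromosome per line
```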