-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathllama.html
269 lines (204 loc) · 13.8 KB
/
llama.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
---
layout: default
---
<h1>Llama</h1>
<center><h3>Local Lineage And Monophyly Assessment</h3></center>
<br><br>
<div id="pangolin_grid">
<div>
<div class="contrib">
Llama pulls out local trees from a large tree (e.g. the global SARS-CoV-2 phylogeny) from a set of queries. Will also optionally add in new sequences to the local trees on the fly<br><br>
Llama is available as a command line tool. It is open source software available under the GNU General Public License v3.0.
</div>
<br>
<a href="https://github.com/cov-lineages/llama" class="llama-btn" style="color: white !important">Download Llama</a>
</div>
<div>
<img src="{{ 'assets/images/llama.svg' | absolute_url }}"/>
</div>
</div>
<br><br>
<h2>llama</h2>
<h3>Requirements</h3>
<p>Llama runs on MacOS and Linux. The conda environment recipe may not build on Windows and is not supported but can be run using the Windows subsystem for Linux.</p>
<ol class="listWithCode">
<li>Some version of conda, we use Miniconda3. Can be downloaded from <a href="https://docs.conda.io/en/latest/miniconda.html">here</a></li>
<li>Your input query file with a row for each sequence name you want to analyse/ create local trees for. These can be present in the big tree already or in the fasta file you supply</li>
<li>Optional fasta file if there are sequences you want to add into the tree</li>
<li>A directory of data containing the following files:
<ul class="listWithCode">
<li><strong>global.tree</strong>: a large tree that you want to place your sequences in</li>
<li><strong>alignment.fasta</strong>: an alignment file with the fasta sequences used to make the tree</li>
<li><strong>metadata.csv</strong>: associated metadata with minimally the name of the sequences in the tree alignment and a lineage designation.
The names of the tips of the tree, the sequence ids in the alignment, and the column you select as <span class="code">--data-column</span> in the metadata must match</li>
</ul>
</li>
</ol>
<h3>Install Llama</h3>
<ol class="listWithCode">
<li>Clone this repository and <span class="code">cd llama</span></li>
<li><span class="code">conda env create -f environment.yml</span></li>
<li><span class="code">conda activate llama</span></li>
<li><span class="code">python setup.py install</span></li>
</ol>
<div class="contrib">Note: we recommend using llama in the conda environment specified in the <span class="code">environment.yml</span> file as per the instructions above. If you can't use conda for some reason, dependency details can be found in the <span class="code">environment.yml</span> file.</div>
<h3>Check the Install Worked</h3>
<p>Type (in the llama environment): <span class="code">llama</span> and you should see the help menu of Llama printed.</p>
<h3>Updating llama</h3>
<div class="contrib"> Note: Even if you have previously installed Llama, as it is being worked on intensively, we recommend you check for updates before running.</div>
<p>To update, first update the conda environmemt, and then activate it.</p>
<o> To update the conda environment:</o>
<ol class="listWithCode">
<li><span class="code">conda activate llama</span></li>
<li><span class="code">git pull</span> to pull the latest changes from github</li>
<li><span class="code">python setup.py install</span> to re-install llama</li>
<li><span class="code">conda env update -f environment.yml</span></li>
</ol>
<p>To activate the environemnt:</p>
<ol class="listWithCode">
<li>Activate the environment <span class="code">conda activate llama</span></li>
<li>Run <span class="code">llama</span></li>
</ol>
<h3>Example Usage</h3>
<br>
<p class="code">llama -i <input.csv> -f <input.fasta> -d <path/to/data> [options]</p>
<br>
<h3>Full Usage</h3>
<pre>
usage: llama -i <input.csv> -d <path/to/data> [options]
llama: Local Lineage And Monophyly Assessment
optional arguments:
-h, --help show this help message and exit
-i QUERY, --input QUERY
Input csv file with minimally `name` as a column
header. Alternatively, `--input-column` can specifiy a
column name other than `name`
-fm [FROM_METADATA [FROM_METADATA ...]], --from-metadata [FROM_METADATA [FROM_METADATA ...]]
Generate a query from the metadata file supplied.
Define a search that will be used to pull out
sequences of interest from the large phylogeny. E.g.
-fm country=Ireland sample_date=2020-03-01:2020-04-01
-f FASTA, --fasta FASTA
Optional fasta query. Fasta sequence names must
exactly match those in your input query.
-a, --align-sequences
Just align sequences.
-s SEQS, --seqs SEQS Sequence file containing sequences used to create the
tree. For use in combination with the `--align-
sequences` option.
-ns, --no-seqs Alignment not available. Note, to work, all queries
must already be in global tree.
-r, --report Generate markdown report of input queries and their
local trees
--colour-fields COLOUR_FIELDS
Comma separated string of fields to colour by in the
report.
--label-fields LABEL_FIELDS
Comma separated string of fields to add to tree report
labels.
--id-string Indicates the input is a comma-separated id string
with one or more query ids. Example:
`EDB3588,EDB3589`.
-o OUTDIR, --outdir OUTDIR
Output directory. Default: current working directory
-d DATADIR, --datadir DATADIR
Local directory that contains the data files
--tempdir TEMPDIR Specify where you want the temp stuff to go. Default:
$TMPDIR
--no-temp Output all intermediate files, for dev purposes.
--input-column INPUT_COLUMN
Column in input csv file to match with database.
Default: name
--data-column DATA_COLUMN
Column in database to match with input csv file.
Default: sequence_name
--distance DISTANCE Extraction from large tree radius. Default: 2
--collapse-threshold THRESHOLD
Minimum number of nodes to collapse on. Default: 1
--lineage-representatives
Include a selection of representative sequences from
lineages present in the local tree. Default: False
--number-of-representatives NUMBER_OF_REPRESENTATIVES
How many representative sequeneces per lineage to keep
in the collapsed tree. Default: 5
--max-ambig MAXAMBIG Maximum proportion of Ns allowed to attempt analysis.
Default: 0.5
--min-length MINLEN Minimum query length allowed to attempt analysis.
Default: 10000
-n, --dry-run Go through the motions but don't actually run
-t THREADS, --threads THREADS
Number of threads
--verbose Print lots of stuff to screen
--outgroup OUTGROUP Optional outgroup sequence to root local subtrees.
Default an anonymised sequence that is at the base of
the global SARS-CoV-2 phylogeny.
-v, --version show program's version number and exit
</pre>
<h3>Options</h3>
<p>Curate the input sequences into an alignment padded against an early lineage A reference:</p>
<p class="code">
llama -a -s your_input_sequences.fasta
</p><br>
<p>Generate a report with your sequences summarised:</p>
<p class="code">
llama -r -i test.csv -f test.fasta -d <path/to/data>
</p><br>
<p>Generate a report with a custom set of sequences defined by the metadata file supplied. After the `-fm` or `--from-metadata` argument, one or more columns in the metadata and search pattern to match can be described. A special case exists if a date range is detected (colon separated dates). The required date format is YYYY-MM-DD. </p>
<p>The format of this search is as follows:</p>
<p class="code">
--from-metadata column1=value1 column2=YYYY-MM-DD:YYYY-MM-DD
</p><br>
<p>For example, the following command will pull out sequences from Ireland with samples between 2020-03-01 and 2020-04-1, provided that information exists metadata.csv file found in the data directory.</p>
<p class="code">
llama -r -fm country=Ireland sample_date=2020-03-01:2020-04-01 -d <path/to/data>
</p><br>
<p>Include a selection of representative sequences for each lineage present in the local tree:</p>
<p class="code">
llama -i test.csv --fasta test.fasta --datadir <path/to/data> --lineage-representatives --number-of-representatives 5<br>
</p><br>
<h3>Analysis Pipeline</h3>
<h4>Overview</h4>
<ul class="listWithCode">
<li>From the input csv <span class="code"><query></span> Llama attempts to match the ids with ids in the metadata.csv.</li>
<li>If the id matches with a record, the corresponding metadata is pulled out.</li>
<li>If the id doesn't match with a record and a fasta sequence with that query id has been provided, it's passed into a workflow (<span class="code">find_closest_in_db.smk</span>) to identify the closest sequence. In brief, this search consists of quality control steps that maps the sequence against a reference (an early, anonymised sequence from lineage A at the root of the global tree), pads any indels relative to the reference and masks non-coding regions. llama then runs a <span class="code">minimap2</span> search against the alignment.fasta file and finds the best hit to the query sequence.</li>
<li>The metadata for the closest sequences are then also pulled out of the large metadata.csv.</li>
<li>Combining the metadata from the records of the closest hit and the exact matching records found in the csv, Llama queries the large global.tree phylogeny. The local trees around the relevant tips are pruned out of the large phylogeny, merging overlapping local phylogenys as needed. By default, Llama pulls out a local tree two above the query id tips, but this can be customised with the `--distance` argument if larger or smaller trees are desired. </li>
<li>If these local trees contain "closest-matching" tips that have been found based on the input fasta file, the sequence records for the tips on the tree and the sequences of the relevant queries are added into an alignment. Llama then checks what lineages are present in the local tree and flags a maximum of 10 sequences per lineage to retain the surrounding context of the tree. Any peripheral sequences coming off of a polytomy that are not flagged and are not the query sequences are collapsed to a single node and summaries of the tip's contents are output. An outgroup sequence from the base of the tree at lineage A is added into the alignment.</li>
<li>After collapsing the nodes, Llama runs <span class="code">iqtree</span> on the new alignment, with the outgroup and query sequences in, and then prunes off the outgroup sequence.</li>
<li>Llama then annotates this new phylogeny with lineage assignments and can produce a report (COMING SOON!).</li>
</ul>
<h3>Output</h3>
<ul>
<li>Catchment trees around the query sequences (uncollapsed)</li>
<li>Collapsed local trees (containing query sequences if optional fasta file supplied) with a representative set of sequences from surrounding lineages and query tips uncollapsed</li>
</ul>
<h3>Acknowledgements</h3>
<p>Llama makes use of <a href="https://github.com/cov-ert/datafunk">datafunk</a> and <a href="https://github.com/cov-ert/clusterfunk">clusterfunk</a>, functions which have been written by members of the Rambaut Lab, specificially Rachel Colquhoun, JT McCrone, Ben Jackson and Shawn Yu.</p>
<p>Llama runs a java implementation <a href="https://github.com/cov-ert/clusterfunk">jclusterfunk</a> written by Andrew Rambaut.</p>
<p><a href="https://github.com/evogytis/baltic/tree/master/baltic">baltic</a> by Gytis Dudas is used to visualize the trees.</p>
<h3>References</h3>
<div class="contrib">
<a href="https://github.com/lh3/minimap2">minimap2</a><br><br>
Heng Li, Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, Volume 34, Issue 18, 15 September 2018, Pages 3094–3100, https://doi.org/10.1093/bioinformatics/bty191
</div>
<div class="contrib">
<a href="http://www.iqtree.org/#download">iqtree</a><br><br>
L.-T. Nguyen, H.A. Schmidt, A. von Haeseler, B.Q. Minh (2015) IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.. Mol. Biol. Evol., 32:268-274. https://doi.org/10.1093/molbev/msu300<br><br>
D.T. Hoang, O. Chernomor, A. von Haeseler, B.Q. Minh, L.S. Vinh (2018) UFBoot2: Improving the ultrafast bootstrap approximation. Mol. Biol. Evol., 35:518–522. https://doi.org/10.1093/molbev/msx281<br><br>
Stéphane Guindon, Jean-François Dufayard, Vincent Lefort, Maria Anisimova, Wim Hordijk, Olivier Gascuel, New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0, Systematic Biology, Volume 59, Issue 3, May 2010, Pages 307–321, https://doi.org/10.1093/sysbio/syq010
</div>
<div class="contrib">
<a href="https://snakemake.readthedocs.io/en/stable/index.html">snakemake</a><br><br>
Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.
</div>
<h3>Software versions</h3>
<ul>
<li>python=3.6</li>
<li>snakemake-minimal=5.13</li>
<li>iqtree=1.6.12</li>
<li>minimap2=2.17-r941</li>
<li>pandas==1.0.1</li>
<li>pytools=2020.1</li>
<li>dendropy=4.4.0</li>
</ul>