-
Notifications
You must be signed in to change notification settings - Fork 16
/
Copy pathpaper.Rmd
175 lines (158 loc) · 9.43 KB
/
paper.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
```{r paper_setup, cache=FALSE, include=FALSE}
library(knitcitations)
# create bibtex entries
write.bibtex(c(knitr=citation('knitr'),
knitcitations=citation('knitcitations'), plyr=citation('plyr'),
ggplot2=citation('ggplot2'), ape=citation('ape'),
directlabels=citation('directlabels'),
plyr=citation('plyr'), foreach=citation('foreach')),
file='citations.bib')
bib = read.bibtex('citations.bib')
```
# PrimerTree: Visually Assessing the Specificity and Informativeness of Primer Pairs#
James Hester and David Serre
Genomic Medicine Institute, Cleveland Clinic Lerner Research Institute, Cleveland, OH 44195, USA
# Abstract #
## Summary ##
Designing PCR primers is a critical step in most genetic studies. Primers need
to be sensitive and specific, as well as yield informative DNA sequences. For
example, in metagenomic studies, primers need to amplify only DNA from the
targeted community and amplify DNA sequences that enable differentiating all
members of this community. Several informatics tools exist for designing
primers based on a template DNA sequence and for identifying potential target
sequences in a database. However, there are no tools to systematically analyze
and visualize the specificity of PCR primers and the informativeness of the
amplified sequence. PrimerTree is an R package that enables identifying
potential target sequences for a set of primers and generates
taxonomically annotated phylogenetic trees with the predicted amplification
products.
## Availability ##
PrimerTree is an R package released under the GPL-2 license and is available through
[CRAN](http://cran.rproject.org/web/packages/primerTree/index.html). Source
code and developmental versions are available at <http://www.github.com/jimhester/primerTree>.
## Contact ##
[hesterj@ccf.org](mailto:hesterj@ccf.org)
## Supplemental Information ##
Supplementary data are available at Bioinformatics online.
# Introduction #
Designing primers for PCR is the initial step in many biomedical, forensic and
metagenomic studies. There are a number of important factors to consider in
primer design. First one must consider the chemical properties of the primers,
including their length, melting temperature, GC content, secondary structure
and the likelihood of primer dimer formation. The primer pair must also
amplify specifically DNA from the target of interest and must not produce
offtarget products. In addition, for some studies, the DNA sequence amplified
must provide enough information to identify the source of the DNA. In
metagenomic studies, the primers need to amplify DNA from the taxon of interest
(e.g., bacteria, fungus, birds), not amplify unrelated taxon and provide enough
information after sequencing to characterize which members of the communities
are present in a given sample. Some clinical studies need to amplify only a
specific pathogen species and enable identifying which strains are present.
While there are efficient online tools for designing primers `r citep('10.1093/nar/gks596')`
and testing their specificity `r citep(c(primer_BLAST='10.1186/1471-2105-13-134'))`, there is no simple
tool to systematically assess the informativeness of the primer products.
PrimerTree is an R package that, given a set of primer pairs, provides visual
assessment of their specificity and informativeness by constructing
phylogenetic trees of the predicted PCR products along with their taxonomic
annotation. PrimerTree can run on a wide variety of hardware and only requires
two commands, making it accessible to a large audience.
# Methods #
PrimerTree successively performs primer search, retrieval of DNA sequences
predicted to be amplified, taxonomic identification of these sequences,
multiple sequence alignment, reconstruction of a phylogenetic tree, and
visualization of the tree with taxonomic annotation. Note that multiple primer
pairs can be queried simultaneously using PrimerTree.
PrimerTree utilizes the primer search implemented in Primer-BLAST
`r citep(c(primer_BLAST='10.1186/1471-2105-13-134'))`
by directly querying the NCBI Primer-BLAST search page. This allows using
all the options available on the NCBI site. By default PrimerTree searches the
NCBI (nt) nucleotide database but alternative NCBI databases can be chosen
through the function options. Note that when the primers are degenerated,
PrimerTree automatically tests all possible combinations of primer sequences
in Primer-BLAST and merges the results. The primer alignment results are then
processed using the NCBI Eutilities `r citep('http://www.ncbi.nlm.nih.gov/books/NBK25500/')`
to i) retrieve DNA sequences located between the primers (i.e., “amplified”)
and ii) obtain taxonomic information related to each DNA sequence using the
NCBI’s taxonomy database `r citep('http://www.ncbi.nlm.nih.gov/books/NBK21100/')`.
PrimerTree next aligns all amplified sequences using Clustal Omega
`r citep('10.1038/msb.2011.75')` and reconstructs a Neighbor-Joining tree using the ape package
`r citep('10.1093/bioinformatics/btg412')`. Finally, PrimerTree displays the resulting phylogenetic
tree using the ggplot2 package, labeling each taxon in a different color and
adding the names of the main taxa using the directlabels package
`r citep(c(bib[['ggplot2']], bib[['directlabels']]))`.
# Results #
Figure 1 shows a subset of PrimerTree results for universal non-vascular plant (Bryophyte) primers
`r citep('10.1111/j.1365-294X.2012.05537.x')` targeting the chloroplast trnL gene (Figure 1A)
and mammal mitochondrial 16S ribosomal rna gene (Figure1B)
`r citep('10.1093/oxfordjournals.molbev.a025566')`. The tree display enables
rapid evaluation of the specificity of the primer pairs (e.g., offtarget
amplification of amphibians and ray-finned fishes on Fig. 1B). In addition, the
information encoded in the amplified DNA sequence can also be easily assessed
by the length of the branches leading to different sequences (scaled in number
of nucleotide differences). By default, PrimerTree displays the annotated
phylogenetic tree for all taxonomic levels (e.g., kindom, phylum, class)
enabling the user to determine the level of specificity of each primers (see
e.g., Supplemental Figure 1).
For nondegenerated primers and using a single thread, PrimerTree usually runs
in less than 240 seconds. However, the average runtime varies greatly depending
on the primer specificity (i.e. how many DNA sequences are “amplified”), the
search parameters chosen, and the current load on the NCBI servers and the
internet connection. Highly degenerated primers result in large numbers of
possible primer pairs, which can increase the runtime considerably. To limit
maximum runtime in this situation, PrimerTree randomly samples only a portion
of the total primer permutations. This provides a representative sample for
most cases. Changing the number of sampled permutations or turning off sampling
completely is possible. PrimerTree uses the plyr package extensively and has
full support for any of the parallel backends compatible with the foreach
package `r citep(c(bib[['plyr']], bib[['foreach']]))`. In particular, parallel retrieval of
the primer sequences from NCBI speeds up the total runtime considerably. Note
that parallel queries to Primer-BLAST are queued by NCBI’s servers and only
processed once there is free compute time.
# Conclusions #
Designing primers can be challenging, especially for metagenomic study, and
dramatically impact the study results. We have developed the PrimerTree
package to help assessing large numbers of designed primers and choosing the
best primer pair for a given experiment. PrimerTree is an R package available
on a wide variety of platforms and is very simple to use and install.
PrimerTree is released under an opensource license and can be downloaded from
<http://github.com/jimhester/primerTree>, which also provides additional
documentation.
## Funding ##
XXX
## Conflict of Interest ##
None declared
# References #
```{r paper_bibliography, results='asis', echo=F}
bibliography(style='markdown')
```
```{r paper_get_data, echo=F, eval=F}
library(doMC)
registerDoMC(8)
mammals_16S = search_primer_pair(name='Mammals 16S', 'CGGTTGGGGTGACCTCGGA', 'GCTGTTATCCCTAGGGTAACT', num_aligns=1000, .parallel=T)
bryophytes_trnL = search_primer_pair(name='Bryophyte trnL', 'GATTCAGGGAAACTTAGGTTG', 'CCATTGAGTCTCTGCACC', num_aligns=1000, .parallel=T)
save(file='data/mammals_16S.RData', mammals_16S)
save(file='data/bryophytes_trnL.RData', bryophytes_trnL)
```
```{r paper_figure1, dev='png', fig.height=4.5, fig.width=6, warning=FALSE}
change_mapping = function(plot){
plot$layers[[1]]$geom_params$colour = 'lightgrey'
plot$layers[[4]]$mapping = aes(x=x, y=y, color=class, shape=class) #add shape to points mapping
plot$layers[[5]]$mapping = aes(x=x, y=y, label=class) #remove color from direct labels mapping
plot
}
p1 = change_mapping(plot(mammals_16S, ranks='class', main='Mammal 16S', rotate=45, size=4) + scale_colour_grey(start=0, end=.8)) #scale_color_brewer(palette="Set1"))
p2 = change_mapping(plot(bryophytes_trnL, ranks='class', main='Bryophyte trnL', size=4) + scale_colour_grey(start=0, end=.8)) #scale_color_brewer(palette="Set1"))
library(gridExtra)
p3 = arrangeGrob(p2,p1, ncol=2, clip=F)
grid.draw(p3)
```
```{r paper_figure1_big, dev='svg', fig.height=12, fig.width=12, depth=300, dev='postscript', warning=FALSE}
setEPS()
grid.draw(p3)
```
```{r paper_supplemental, fig.height=12, fig.width=12, depth=300, dev='postscript', warning=FALSE}
setEPS()
rm(scale_colour_discrete)
plot(mammals_16S)
plot(bryophytes_trnL)
```