README.md (+66 -49)
@@ -1,22 +1,28 @@
1
1
# Progenomics: toolkit for prokaryotic comparative genomics
2
2
3
-
Progenomics is a toolkit-under-construction for comparative genomics of prokaryotes. It should be able to handle large genome datasets of small to medium sequence divergence (i.e., genomes from the same species, genus, family and possibly order). Its most useful feature at the moment is probably that it is able to quickly infer the core genome of a large set of genomes without having to infer the pangenome as an intermediate step.
4
-
5
-
Progenomics depends on [OrthoFinder](https://github.com/davidemms/OrthoFinder) for some of its functionalities.
3
+
Progenomics is a toolkit for fast and scalable comparative genomics. It has been designed for prokaryotes but should work for eukaryotic genomes as well. Progenomics can handle large genome datasets at a range of taxonomic levels; it has been tested on datasets up to the order level. Its most useful features are fast pangenome inference, sensitive search of query sequences in a pangenome, rapid core genome inference using a heuristic strategy, and the construction of concatenated core gene alignments ("supermatrices").
6
4
7
5
## Dependencies
8
6
9
-
Dependencies with suggested installation instructions:
7
+
Essential dependencies:
10
8
11
9
* [Python3](https://www.python.org/) version >= 3.6.7
12
10
* Python libraries:
13
-
*[Biopython](https://biopython.org/) version >= 1.67 (`pip3 install biopython`)
14
-
*[pandas](https://pandas.pydata.org/) version >= 0.24.1 (`pip3 install pandas`)
15
-
*[blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) version >= 2.6.0 (`conda install -c bioconda blast`)
16
-
*[MCL](https://www.micans.org/mcl/index.html?sec_software) version >= 14-137 (`conda install -c bioconda mcl`)
17
-
*[OrthoFinder](https://github.com/davidemms/OrthoFinder) version >= 2.1.2 (`conda install -c bioconda orthofinder`)
18
-
*[mafft](https://mafft.cbrc.jp/alignment/software/) version >= 7.407 (`conda install -c bioconda mafft`)
19
-
*[HMMER](http://hmmer.org/) version >= 3.1b2 (`conda install -c bioconda hmmer`)
11
+
    * [biopython](https://biopython.org/) version >= 1.67
12
+
    * [ete3](http://etetoolkit.org/) version >= 3.1.1
13
+
    * [scipy](https://www.scipy.org/) version >= 1.4.1
14
+
* [MAFFT](https://mafft.cbrc.jp/alignment/software/) version >= 7.407
15
+
* [MMseqs2](https://github.com/soedinglab/MMseqs2) version >= d36dea2
16
+
17
+
Dependencies for the search module and core pipeline:
18
+
19
+
* [HMMER](http://hmmer.org/) version >= 3.1b2
20
+
21
+
Dependencies when using [OrthoFinder](https://github.com/davidemms/OrthoFinder) for pangenome inference:
22
+
23
+
* [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) version >= 2.6.0
24
+
* [MCL](https://www.micans.org/mcl/index.html?sec_software) version >= 14-137
25
+
* [OrthoFinder](https://github.com/davidemms/OrthoFinder) version >= 2.1.2
20
26
21
27
## Usage
22
28
@@ -35,9 +41,54 @@ A full core and pangenome pipeline are also implemented:
35
41
* `pan-pipeline`: infer a pangenome, build a profile HMM database and train score cutoffs from a set of faa files
36
42
* `core-pipeline`: infer a core genome, build a profile HMM database and train score cutoffs from a set of faa files
37
43
38
-
### Core genome pipeline
44
+
### Inferring a pangenome
45
+
46
+
If you want to infer the pangenome of a set of genomes, you only need their faa files (fasta files with protein sequences) as input. If the faa files are stored in a folder `faas`, you can infer the pangenome using 16 threads by running:
47
+
48
+
ls faas/*.faa > faapaths.txt
49
+
progenomics pan faapaths.txt pan -t 16
50
+
51
+
The pangenome will be stored in `pan/pangenome.tsv`.
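The `ls faas/*.faa > faapaths.txt` step can hit the shell's argument-length limit on very large genome sets, and it writes an empty list without complaint when the folder contains no faa files. As a rough sketch (the function name `make_faapaths` is hypothetical and not part of progenomics), the path list could instead be built with `find` and checked before starting a long run:

```shell
#!/bin/sh
# Hypothetical helper, not part of progenomics: write the paths of all
# .faa files directly inside a folder to a list file, one path per line.
make_faapaths() {
    faa_dir=$1
    out_file=$2
    # find does not expand a glob on the command line, so it is not subject
    # to the "argument list too long" error; sort keeps the order stable.
    find "$faa_dir" -maxdepth 1 -type f -name '*.faa' | sort > "$out_file"
    # Fail early if no faa files were found.
    if [ ! -s "$out_file" ]; then
        echo "no .faa files found in $faa_dir" >&2
        return 1
    fi
}
```

With this helper defined, `make_faapaths faas faapaths.txt && progenomics pan faapaths.txt pan -t 16` would only start the pangenome inference when the list is non-empty.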
52
+
53
+
The above example uses the built-in "FH" strategy to infer the pangenome; it is fast and scales more or less linearly with the number of input genomes. If you prefer to use OrthoFinder for pangenome inference, you can run:
54
+
55
+
ls faas/*.faa > faapaths.txt
56
+
progenomics pan faapaths.txt pan -d O-B -t 16
57
+
58
+
This will be a bit slower though.
59
+
60
+
### Searching a pangenome
61
+
62
+
Disclaimer: many aspects of this pipeline can still change, especially the way that hmmer score cutoffs are trained.
63
+
64
+
**Quick version**
65
+
66
+
If you want to infer a pangenome of your genomes as well as build a pangenome database that you can later query with one or more genes of interest, you can run:
67
+
68
+
ls faas/*.faa > faapaths.txt
69
+
progenomics pan-pipeline faapaths.txt pan -t 16
70
+
71
+
If you then want to identify whether some genes of interest (let's say in a file called `querygenes.fasta`) are present in the pangenome database, you can run:
This will produce a `hits` output folder with the file `hits.tsv`.
77
+
78
+
**Detailed version**
79
+
80
+
The pangenome pipeline can also be performed by running individual tasks:
81
+
82
+
ls faas/*.faa > faapaths.txt
83
+
progenomics pan faapaths.txt pan
84
+
progenomics build faapaths.txt pan/pangenome.tsv db
85
+
progenomics search faapaths.txt db pan2 -s pan -p pan/pangenome.tsv
86
+
87
+
The final `search` step is required because it will train a hmmer score cutoff for each profile HMM in the pangenome database and add these cutoffs to the database. In addition, it produces an orthogroup assignment for each protein in the set of input genomes (`pan2/hits.tsv`). Importantly, these assignments are not always the same as the orthogroup assignments listed in `pan/pangenome.tsv` because they are produced by a hmmer search with orthogroup-specific cutoffs, while the original orthogroup assignments have been produced by the pangenome inference process. A comparison between these two strategies of orthogroup assignment could be interesting.
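One way such a comparison might be started is sketched below. It assumes, and this is an assumption rather than something the README specifies, that each assignment table has first been reduced to two tab-separated columns (gene identifier, orthogroup) without a header line, for example with `cut`. The function name `compare_assignments` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of progenomics. Both inputs are assumed to be
# two-column, tab-separated files: gene identifier, then orthogroup.
# Prints "<shared genes><TAB><genes with differing orthogroup>".
compare_assignments() {
    awk -F '\t' '
        NR == FNR { og[$1] = $2; next }   # first file: remember gene -> orthogroup
        ($1 in og) {                      # second file: only genes present in both
            shared++
            if (og[$1] != $2) differing++
        }
        END { printf "%d\t%d\n", shared, differing }
    ' "$1" "$2"
}
```

Genes that occur in only one of the two files are ignored, so the second number counts genes that both strategies assigned, but to different orthogroups.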
88
+
89
+
### Inferring a core genome only
39
90
40
-
Let's say we want to infer the core genome for a set of prokaryotic genomes and we have one faa file (amino acid sequences of predicted genes) per genome in the folder `faas`.
91
+
Let's say we want to infer the core genome for a set of genomes and we have one faa file (amino acid sequences of predicted genes) per genome in the folder `faas`. This can be done using the core pipeline of progenomics, which can be a lot faster than full pangenome inference.
41
92
42
93
**Quick version**
43
94
@@ -84,38 +135,9 @@ If we want more fine-grained control, we could achieve the same result by runnin
84
135
85
136
The output folder `core` will now contain the file `coregenome.tsv`.
86
137
87
-
### Pangenome pipeline
88
-
89
-
Disclaimer: this pipeline is still in active development. Parts of it can still change drastically, especially the way that hmmer score cutoffs are trained.
90
-
91
-
**Quick version**
92
-
93
-
If you want to infer a pangenome of your genomes as well as build a pangenome database that you can later query with one or more genes of interest, you can run:
94
-
95
-
ls faas/*.faa > faapaths.txt
96
-
progenomics pan-pipeline faapaths.txt pan -t 16
97
-
98
-
If you then want to identify whether some genes of interest (let's say in a file called `querygenes.fasta`) are present in the pangenome database, you can run:
This will produce a `hits` output folder with the file `hits.tsv`.
104
-
105
-
**Detailed version**
106
-
107
-
The pangenome pipeline can also be performed by running individual tasks:
108
-
109
-
ls faas/*.faa > faapaths.txt
110
-
progenomics pan faapaths.txt pan
111
-
progenomics build faapaths.txt pan/pangenome.tsv db
112
-
progenomics search faapaths.txt db pan2 -s pan -p pan/pangenome.tsv
113
-
114
-
The final `search` step is required because it will train a hmmer score cutoff for each profile HMM in the pangenome database and add these cutoffs to the database. In addition, it produces an orthogroup assignment for each protein in the set of input genomes (`pan2/hits.tsv`). Importantly, these assignments are not always the same as the orthogroup assignments listed in `pan/pangenome.tsv` because they are produced by a hmmer search with orthogroup-specific cutoffs, while the original orthogroup assignments have been produced by the pangenome inference process. A comparison between these two strategies of orthogroup assignment could be interesting.
115
-
116
138
## License
117
139
118
-
Progenomics is free software, licensed under [GPLv3](https://github.com/sanger-pathogens/Roary/blob/master/GPL-LICENSE).
140
+
Progenomics is free software, licensed under [GPLv3](https://github.com/SWittouck/progenomics/blob/master/LICENSE).
119
141
120
142
## Feedback
121
143
@@ -125,9 +147,4 @@ All feedback and suggestions very welcome at stijn.wittouck[at]uantwerpen.be. Yo
125
147
126
148
When you use progenomics for your publication, please cite:
127
149
128
-
[Wittouck, Stijn, Sander Wuyts, Conor J Meehan, Vera van Noort, and Sarah Lebeer. 2019. “A Genome-Based Species Taxonomy of the Lactobacillus Genus Complex.” Edited by Sean M
[Emms, D.M., Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 20, 238 (2019)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1832-y)
150
+
[Wittouck, S., Wuyts, S., Meehan, C. J., van Noort, V., & Lebeer, S. (2019). A Genome-Based Species Taxonomy of the Lactobacillus Genus Complex. mSystems, 4(5), e00264–19. https://doi.org/10.1128/mSystems.00264-19.](https://doi.org/10.1128/mSystems.00264-19)