Commit 7ebdb7f

Update README for version 0.2.0
1 parent 3878442 commit 7ebdb7f

1 file changed

+94
-65
lines changed

README.md

@@ -1,104 +1,133 @@
11
# Progenomics: toolkit for prokaryotic comparative genomics
22

3-
Progenomics is a general toolkit-under-construction for comparative genomics of prokaryotes. It should be able to handle large genome datasets of small to medium sequence divergence (i.e., genomes from the same species, genus, family and possibly order). What is currently implemented is a pipeline to get the __core genome__ for up to thousands of genomes overnight on a decent desktop computer. A __pangenome pipeline__ is planned for the near future.
3+
Progenomics is a toolkit-under-construction for comparative genomics of prokaryotes. It should be able to handle large genome datasets of small to medium sequence divergence (i.e., genomes from the same species, genus, family and possibly order). Its most useful feature at the moment is probably its ability to quickly infer the core genome of a large set of genomes without inferring the pangenome as an intermediate step.
44

5-
Progenomics depends on [OrthoFinder](https://github.com/davidemms/OrthoFinder) for gene family inference.
5+
Progenomics depends on [OrthoFinder](https://github.com/davidemms/OrthoFinder) for some of its functionality.
66

77
## Dependencies
88

9-
Tools for the (hopefully) painless installation of the dependencies:
10-
11-
* [anaconda](https://www.anaconda.com/distribution/#download-section) version >= 2019.3
12-
* pip3 (sudo apt install pip3)
13-
14-
The actual dependencies, with suggested installation instructions:
9+
Dependencies with suggested installation instructions:
1510

1611
* [Python3](https://www.python.org/) version >= 3.6.7
1712
* Python libraries:
18-
* [Biopython](https://biopython.org/) version >= 1.67 (pip3 install biopython)
19-
* [pandas](https://pandas.pydata.org/) version >= 0.24.1 (pip3 install pandas)
20-
* [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) version >= 2.6.0 (conda install -c bioconda blast)
21-
* [MCL](https://www.micans.org/mcl/index.html?sec_software) version >= 14-137 (conda install -c bioconda mcl)
22-
* [OrthoFinder](https://github.com/davidemms/OrthoFinder) version >= 2.1.2 (conda install -c bioconda orthofinder)
23-
* [mafft](https://mafft.cbrc.jp/alignment/software/) version >= 7.407 (conda install -c bioconda mafft)
24-
* [HMMER](http://hmmer.org/) version >= 3.1b2 (conda install -c bioconda hmmer)
25-
* [R](https://www.r-project.org/) version >= 3.5.1
26-
* R packages:
27-
* ROCR version >= 1.0.7 (install.packages("ROCR"))
28-
* tidyverse version >= 1.2.1 (install.packages("tidyverse"))
13+
* [Biopython](https://biopython.org/) version >= 1.67 (`pip3 install biopython`)
14+
* [pandas](https://pandas.pydata.org/) version >= 0.24.1 (`pip3 install pandas`)
15+
* [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) version >= 2.6.0 (`conda install -c bioconda blast`)
16+
* [MCL](https://www.micans.org/mcl/index.html?sec_software) version >= 14-137 (`conda install -c bioconda mcl`)
17+
* [OrthoFinder](https://github.com/davidemms/OrthoFinder) version >= 2.1.2 (`conda install -c bioconda orthofinder`)
18+
* [mafft](https://mafft.cbrc.jp/alignment/software/) version >= 7.407 (`conda install -c bioconda mafft`)
19+
* [HMMER](http://hmmer.org/) version >= 3.1b2 (`conda install -c bioconda hmmer`)
20+
21+
## Usage
22+
23+
Progenomics is able to perform a number of specific tasks related to prokaryotic core and pangenomes (see also `progenomics -h`):
24+
25+
* `pan`: infer a pangenome from a set of faa files
26+
* `build`: build a profile HMM database for a core/pangenome
27+
* `search`: search query genes in a core/pangenome database
28+
* `checkgenomes`: assess the quality of genomes in a core genome
29+
* `checkgroups`: assess the quality of orthogroups in a core genome
30+
* `filter`: filter the genomes/orthogroups in a pangenome
31+
* `supermatrix`: construct a concatenated core orthogroup alignment from a core genome
32+
33+
Full core genome and pangenome pipelines are also implemented:
34+
35+
* `pan-pipeline`: infer a pangenome, build a profile HMM database and train score cutoffs from a set of faa files
36+
* `core-pipeline`: infer a core genome, build a profile HMM database and train score cutoffs from a set of faa files
37+
38+
### Core genome pipeline
39+
40+
Let's say we want to infer the core genome for a set of prokaryotic genomes and we have one faa file (amino acid sequences of predicted genes) per genome in the folder `faas`.
41+
42+
**Quick version**
43+
44+
To get the core genome, we can simply run the following commands:
45+
46+
ls faas/*.faa > faapaths.txt
47+
progenomics core-pipeline faapaths.txt core -t 16
2948

30-
## Core genome pipeline
49+
This will create the output folder `core`, run progenomics with 16 threads and produce the file `coregenome.tsv`. This output tsv file contains the core orthogroups and has the columns gene, genome and orthogroup.
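A quick sanity check of the resulting table can be done with standard tools. The sketch below is not part of progenomics: `summarise_coregenome` is a hypothetical helper, and it assumes that `coregenome.tsv` is tab-separated, has no header row and lists the columns in the order gene, genome, orthogroup. If the file does have a header, the counts will be off by one.

```shell
# summarise_coregenome FILE: count genes, genomes and orthogroups in a table.
# Assumption (not confirmed by the README): FILE is tab-separated, has no
# header row, and its columns are gene, genome, orthogroup, in that order.
summarise_coregenome() {
  awk -F '\t' '
    { genomes[$2] = 1; groups[$3] = 1 }
    END {
      ng = 0; for (g in genomes) ng++
      no = 0; for (o in groups) no++
      printf "genes\t%d\ngenomes\t%d\northogroups\t%d\n", NR, ng, no
    }
  ' "$1"
}
```

For the example above, this would be called as `summarise_coregenome core/coregenome.tsv`.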
3150

32-
**How it works**
51+
If we now want to construct a supermatrix (concatenated alignment) of these core orthogroups, we could do it as follows:
3352

34-
Progenomics works in four stages to be able to rapidly determine single-copy core genes (SCGs) for large genome datasets:
53+
progenomics supermatrix faapaths.txt core/coregenome.tsv supermatrix
3554

36-
1. Gene family inference on a small, random subset of N (e.g. 30) seed genomes, using OrthoFinder.
37-
2. Selection of candidate SCGs by requiring single-copy presence in at least K (e.g. 25) out of N seed genomes.
38-
3. Search for the candidate SCGs in all genomes, using HMMER, and training of SCG-specific score cutoffs.
39-
4. Selection of the definitive SCGs by enforcing that a candidate SCG is present in a single copy in P% (e.g. 95%) of all genomes.
55+
This will create a `supermatrix` output folder containing the file `supermatrix.fasta`.
4056

41-
**Tutorial**
57+
And that's it! Three lines of code to get from the faa files to the supermatrix fasta file, ready to start constructing your phylogenetic tree.
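Before passing the supermatrix to a tree-building program, it can be worth confirming that it is a rectangular alignment, i.e. that every sequence has the same length. The helper below is a sketch under the assumption that `supermatrix.fasta` is an ordinary (possibly multi-line) FASTA file; `check_alignment` is a hypothetical name and not a progenomics command.

```shell
# check_alignment FILE: print the number of sequences and alignment columns if
# all FASTA records have the same length; otherwise report each mismatch and
# return a non-zero status.
check_alignment() {
  awk '
    # Close the record that was being read and compare its length to the first.
    function finish() {
      if (!started) return
      n++
      if (n == 1) width = len
      else if (len != width) {
        printf "%s has length %d, expected %d\n", name, len, width
        bad = 1
      }
    }
    /^>/ { finish(); name = substr($0, 2); len = 0; started = 1; next }
    { gsub(/[ \t\r]/, ""); len += length($0) }
    END {
      finish()
      if (bad || n == 0) exit 1
      printf "%d sequences, %d columns\n", n, width
    }
  ' "$1"
}
```

With the folder layout used above, the call would be `check_alignment supermatrix/supermatrix.fasta`.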
4258

43-
In this workflow, we start from a set of genomes (up to ~ 3000 if you want to run overnight on a decent desktop computer) and we want to extract a complete set of single-copy core genes (SCGs) of those genomes. As input data, we need to have a set of predicted protein sequences (.faa file) for each genome. We need to supply the __paths to these .faa files__ to progenomics as a single txt file (files should be uncompressed). If all .faa files are in the same folder, we can create such a file as follows:
59+
**Detailed version**
4460

45-
ls genomes/*.faa > genomepaths.txt
61+
If we want more fine-grained control, we could achieve the same result by running individual progenomics tasks. These individual tasks also give insight into how the core genome pipeline actually works.
4662

47-
Next, we extract candidate SCGs from the genomes and search for them in the complete genome dataset (__steps 1 - 3__). Candidate SCGs are gene families that are present in a single copy in at least K genomes out of N randomly chosen seed genomes. We could set K to 25 and N to 30, for example:
63+
**Step 1:** infer the pangenome of a random subset of seed genomes (e.g. 30).
4864

49-
progenomics prepare_candidate_scgs \
50-
--fin_genomepaths genomepaths.txt \
51-
--n_seed_genomes 30 \
52-
--min_presence_in_seeds 25 \
53-
--dout cand_scgs \
54-
--threads 8
65+
mkdir seeds cands
66+
ls faas/*.faa > faapaths.txt
67+
shuf -n 30 faapaths.txt > seeds/faapaths.txt
68+
progenomics pan seeds/faapaths.txt seeds/pan
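Because `shuf -n 30` outputs at most 30 lines and does not complain when the list is shorter, it can be worth validating `faapaths.txt` before sampling seed genomes from it. The function below is a sketch only: `check_faapaths` is a hypothetical helper, not a progenomics task, and the minimum count is whatever number of seed genomes you intend to sample.

```shell
# check_faapaths FILE MIN: verify that FILE lists at least MIN paths and that
# every listed path points to an existing, non-empty file.
# Problems are reported on stderr; the return status is non-zero if any occur.
check_faapaths() {
  list=$1
  min=$2
  status=0
  n=0
  while IFS= read -r path; do
    # Skip blank lines in the list.
    [ -n "$path" ] || continue
    n=$((n + 1))
    if [ ! -s "$path" ]; then
      echo "missing or empty: $path" >&2
      status=1
    fi
  done < "$list"
  if [ "$n" -lt "$min" ]; then
    echo "only $n genomes listed, need at least $min" >&2
    status=1
  fi
  return "$status"
}
```

For the example above, `check_faapaths faapaths.txt 30` would be run right after creating `faapaths.txt`.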
5569

56-
We now select the definitive SCGs from the candidates by requiring that they are present in a single copy in P% of the total number of genomes (__step 4__). For P = 95, this gives us:
70+
**Step 2:** build a profile HMM database of "candidate core orthogroups" that are present in at least M seed genomes (e.g. 25).
5771

58-
progenomics select_scgs \
59-
--fin_score_table cand_scgs/score_table.csv \
60-
--fin_candidate_scg_table cand_scgs/candidate_scg_table.csv \
61-
--candidate_scg_cutoff 0.95 \
62-
--fout_scg_list scg_list.txt \
63-
--fout_genome_table genome_table.csv
72+
progenomics build seeds/faapaths.txt seeds/pan/pangenome.tsv cands/db -m 25
6473

65-
As a result, we get two output files: a list with SCG names and a table with for each genome, the percentage "completeness" and "redundancy"; those two measures can be used for __genome quality filtering__. For this demonstration, we keep all of our genomes and save their names in a txt file:
74+
**Step 3:** identify the candidate core genes in the full set of genomes by searching all proteins of all genomes against the database of candidate core genes.
6675

67-
cut -d , -f 1 genome_table.csv | tail -n +2 > selected_genomes.txt
76+
progenomics search faapaths.txt cands/db cands/core -y core
6877

69-
Finally, we can construct a __SCG matrix__ where the rows are genomes, the columns are SCGs and the cells contain the actual names of individual genes:
78+
**Step 4:** identify the core genes from the candidates by imposing a minimum percentage presence cutoff (e.g. 98%) in the full set of genomes.
7079

71-
progenomics construct_scg_matrix \
72-
--fin_score_table cand_scgs/score_table.csv \
73-
--fin_candidate_scg_table cand_scgs/candidate_scg_table.csv \
74-
--fin_genome_list selected_genomes.txt \
75-
--fin_scg_list scg_list.txt \
76-
--fout_scg_matrix scg_matrix.csv
80+
progenomics checkgroups cands/core/coregenome.tsv cands/groups
81+
awk '{ if ($2 > 0.98) { print $1 } }' cands/groups/orthogroups.tsv \
82+
> orthogroups.txt
83+
progenomics filter cands/core/coregenome.tsv core -o orthogroups.txt
7784

78-
Progenomics is also capable of producing a concatenated nucleotide alignment of the SCGs by aligning the amino acid sequences, backtranslating the alignments to alignments of nucleic acid sequences and concatenating these. For this purpose, we do of course need the nucleotide sequences of the genes (ffn files).
79-
80-
progenomics nucleotide_supermatrix_from_scg_matrix \
81-
--fin_scg_matrix scg_matrix.csv \
82-
--din_ffns dir_with_ffns \
83-
--din_faas dir_with_faas \
84-
--dout ./
85+
The output folder `core` will now contain the file coregenome.tsv.
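To make the presence cutoff of step 4 concrete, the sketch below computes, for each orthogroup, the fraction of genomes in which it occurs at least once. It illustrates the idea and is not necessarily what `checkgroups` reports: `orthogroup_presence` is a hypothetical helper, the input is assumed to be a tab-separated table without header (gene, genome, orthogroup), copy number is ignored, and the total number of genomes is taken to be the number of distinct genomes in the table.

```shell
# orthogroup_presence FILE: print "orthogroup<TAB>fraction of genomes" per
# orthogroup, sorted by orthogroup name.
# Assumption: FILE is tab-separated, no header: gene, genome, orthogroup.
orthogroup_presence() {
  awk -F '\t' '
    {
      genomes[$2] = 1
      # Count each (orthogroup, genome) pair once, whatever the copy number.
      if (!(($3, $2) in seen)) { seen[$3, $2] = 1; present[$3]++ }
    }
    END {
      total = 0; for (g in genomes) total++
      for (og in present) printf "%s\t%.4f\n", og, present[og] / total
    }
  ' "$1" | sort
}
```

Piping the output through `awk '$2 > 0.98 { print $1 }'` would then give a list of orthogroups above the cutoff, analogous to `orthogroups.txt` above.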
8586

86-
## Pangenome pipeline
87+
### Pangenome pipeline
8788

88-
Coming soon!
89+
Disclaimer: this pipeline is still in active development. Parts of it can still change drastically, especially the way that HMMER score cutoffs are trained.
90+
91+
**Quick version**
92+
93+
If you want to infer a pangenome of your genomes as well as build a pangenome database that you can later query with one or more genes of interest, you can run:
94+
95+
ls faas/*.faa > faapaths.txt
96+
progenomics pan-pipeline faapaths.txt pan -t 16
97+
98+
If you then want to identify whether some genes of interest (let's say in a file called `querygenes.fasta`) are present in the pangenome database, you can run:
99+
100+
echo querygenes.fasta > querypath.txt
101+
progenomics search querypath.txt pangenome/db hits
102+
103+
This will produce a `hits` output folder with the file `hits.tsv`.
104+
105+
**Detailed version**
106+
107+
The pangenome pipeline can also be performed by running individual tasks:
108+
109+
ls faas/*.faa > faapaths.txt
110+
progenomics pan faapaths.txt pan
111+
progenomics build faapaths.txt pan/pangenome.tsv db
112+
progenomics search faapaths.txt db pan2 -s pan -p pan/pangenome.tsv
113+
114+
The final `search` step is required because it will train an HMMER score cutoff for each profile HMM in the pangenome database and add these cutoffs to the database. In addition, it produces an orthogroup assignment for each protein in the set of input genomes (`pan2/hits.tsv`). Importantly, these assignments are not always the same as the orthogroup assignments listed in `pan/pangenome.tsv` because they are produced by an HMMER search with orthogroup-specific cutoffs, while the original orthogroup assignments have been produced by the pangenome inference process. A comparison between these two strategies of orthogroup assignment could be interesting.
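One way to start such a comparison is to count, per gene, whether the two tables agree. The sketch below rests on assumptions that the README does not confirm: that both `pan/pangenome.tsv` and `pan2/hits.tsv` are tab-separated tables without header, with the columns gene, genome, orthogroup, and that both use the same orthogroup names. `compare_assignments` is a hypothetical helper, not a progenomics task, and it expects the first table to be non-empty.

```shell
# compare_assignments A B: count genes whose orthogroup is the same or
# different in tables A and B, plus genes that occur in only one of them.
# Assumption: both tables are tab-separated, no header: gene, genome, orthogroup.
compare_assignments() {
  awk -F '\t' '
    # First file: remember the orthogroup of every gene.
    NR == FNR { first[$1] = $3; next }
    # Second file: compare against the remembered assignment.
    ($1 in first) { if (first[$1] == $3) same++; else diff++; seen[$1] = 1 }
    !($1 in first) { only_b++ }
    END {
      only_a = 0
      for (g in first) if (!(g in seen)) only_a++
      printf "same\t%d\ndifferent\t%d\nonly_first\t%d\nonly_second\t%d\n", same, diff, only_a, only_b
    }
  ' "$1" "$2"
}
```

With the folders used above, the call would be `compare_assignments pan/pangenome.tsv pan2/hits.tsv`.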
89115

90116
## License
91117

92118
Progenomics is free software, licensed under [GPLv3](https://github.com/sanger-pathogens/Roary/blob/master/GPL-LICENSE).
93119

94120
## Feedback
95121

96-
All feedback and suggestions are very welcome at stijn.wittouck[at]uantwerpen.be. You are of course also welcome to file [issues](https://github.com/SWittouck/progenomics/issues).
122+
All feedback and suggestions are very welcome at stijn.wittouck[at]uantwerpen.be. You are of course also welcome to file [issues](https://github.com/SWittouck/progenomics/issues).
97123

98124
## Citation
99125

100-
If you use progenomics in your publication, please try to cite the following preprint:
126+
When you use progenomics for your publication, please cite:
127+
128+
[Wittouck, Stijn, Sander Wuyts, Conor J Meehan, Vera van Noort, and Sarah Lebeer. 2019. “A Genome-Based Species Taxonomy of the Lactobacillus Genus Complex.” Edited by Sean M
129+
Gibbons. MSystems 4 (5): e00264-19. https://doi.org/10.1128/mSystems.00264-19.](https://doi.org/10.1128/mSystems.00264-19)
101130

102-
[Wittouck, Stijn, Sander Wuyts, Conor J Meehan, Vera van Noort, and Sarah Lebeer. 2019. “A Genome-Based Species Taxonomy of the Lactobacillus Genus Complex.” BioRxiv, January, 537084. doi:10.1101/537084.](https://www.biorxiv.org/content/10.1101/537084v1)
131+
Please also cite OrthoFinder:
103132

104-
If citing preprints is not allowed for your journal, don't worry about it. Hopefully, a peer-reviewed publication will be available soon!
133+
[Emms, D.M., Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 20, 238 (2019)](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1832-y)
