Important
PADLOC >v2.0.0
is only compatible with PADLOC-DB >v2.0.0
and vice-versa. After you update PADLOC, make sure to update your database by running: padloc --db-update
.
PADLOC is a software tool for identifying antiviral defence systems in prokaryotic genomes. PADLOC screens genomes against a database of HMMs and system classifications to find and annotate defence systems based on sequence homology and genetic architecture.
If you use PADLOC or PADLOC-DB please cite:
Payne, L. J., Todeschini, T. C., Wu, Y., Perry, B. J., Ronson, C. W., Fineran, P. C., Nobrega, F. L., Jackson, S. A. (2021) Identification and classification of antiviral defence systems in bacteria and archaea with PADLOC reveals new system types. Nucleic Acids Research, 49, 10868-10878. doi: https://doi.org/10.1093/nar/gkab883
If you use the PADLOC web server please additionally cite:
Payne, L. J., Meaden S., Mestre M. R., Palmer C., Toro N., Fineran P. C. and Jackson S. A. (2022) PADLOC: a web server for the identification of antiviral defence systems in microbial genomes. Nucleic Acids Research, 50, W541-W550. doi: https://doi.org/10.1093/nar/gkac400
The HMMs and system models in PADLOC-DB were built and curated using the data and conclusions from many different sources, we encourage you to also give credit to these groups by reading their work and citing them where appropriate. References to relevant literature can be found at the PADLOC-DB repository.
It is recommended that PADLOC be installed via conda.
# Install PADLOC into a new conda environment
conda create -n padloc -c conda-forge -c bioconda -c padlocbio padloc=2.0.0
# Activate the environment
conda activate padloc
# Download the latest database
padloc --db-update
If you're having installation issues, refer to Issue #35.
# BASIC: Search an amino acid fasta file with accompanying GFF annotations
padloc --faa genome.faa --gff features.gff
# INTERMEDIATE: Use multiple cpus and save output to a different directory
padloc --faa genome.faa --gff features.gff --outdir path_to_output --cpu 4
# ADVANCED: Supply ncRNA and CRISPR array data
padloc --faa genome.faa --gff features.gff --ncrna genome.ncrna --crispr genome.crispr
Note
Refer to padloc/etc/README.md
for instructions on pre-computing ncRNA and CRISPR array data.
# Try running PADLOC on the test data provided
padloc --faa padloc/test/GCF_001688665.2.faa --gff padloc/test/GCF_001688665.2.gff
padloc --fna padloc/test/GCF_004358345.1.fna
General:
--help Print this help message
--version Print version information
--citation Print citation information
--check-deps Check that dependencies are installed
--debug Run with debug messages
Database:
--db-list List all PADLOC-DB releases
--db-install [n] Install specific PADLOC-DB release [n]
--db-update Install latest PADLOC-DB release
--db-version Print database version information
Input:
--faa [f] Amino acid FASTA file (only valid with [--gff])
--gff [f] GFF file (only valid with [--faa])
--fna [f] Nucleic acid FASTA file
--crispr [f] CRISPRDetect output file containing array data
--ncrna [f] Infernal output file containing ncRNA data
Output:
--outdir [d] Output directory
Optional:
--data [d] Data directory
--cpu [n] Use [n] CPUs (default '1')
--fix-prodigal Set this flag when providing an FAA and GFF file
generated with prodigal to force fixing of sequence IDs
Extension | Description |
---|---|
.domtblout | Domain table file generated by HMMER. |
_prodigal.faa | Amino acid FASTA file generated by prodigal. |
_prodigal.gff | GFF annotation file generated by prodigal. |
_padloc.csv | PADLOC output file for identified defence systems. |
_padloc.gff | GFF annotation file for identified defence systems. |
Column | Description |
---|---|
system.number |
Distinct system number. |
seqid |
Sequence ID of the contig. |
system |
Name of the system identified. |
target.name |
Protein ID. |
hmm.accession |
PADLOC HMM accession number. |
hmm.name |
PADLOC HMM name. |
protein.name |
Defence system protein name. |
full.seq.E.value |
Full sequence E-value. From the HMMER Documentation: "The E-value is a measure of statistical significance. The lower the E-value, the more significant the hit." |
domain.iE.value |
Domain E-value. From the HMMER Documentation: "If the full sequence E-value is significant but the single best domain E-value is not, the target sequence is probably a multidomain remote homolog". This is the primary value used to filter the HMMER output for putative hits. |
target.coverage |
Fraction of the target sequence aligning to the HMM. |
hmm.coverage |
Fraction of the HMM aligning to the target sequence. |
start |
Start position of the target sequence in the contig. |
end |
End position of the target sequence in the contig. |
strand |
Strand; forward (+) or reverse (-) |
target.description |
Target sequence descrition taken from the input file. |
relative.position |
Relative position of the target sequence in the contig. |
contig.end |
Relative position of the last sequence in the contig. |
all.domains |
Concatenated list of all domains identified with HMMER. |
best.hits |
Top 5 hits identified with HMMER. |
The HMMs and defence system models used by PADLOC are available from the PADLOC-DB repository. The latest version of the database can be downloaded by running padloc --db-update
. Alternatively, a custom database can be specified with --data
, refer to PADLOC-DB for more information about the database.
-
What are the requirements for an FAA/GFF file pair as input?
The GFF file should conform to the GFF3 specification. Each sequence in the FAA file is matched to an entry in the GFF file based on its ID attribute e.g. for the following sequence:
>WP_000000001.1 molybdopterin-dependent oxidoreductase, partial [Escherichia coli] AAAAAAAGLSVPGVARAVLVSRKPSNGIKAPCRFCGTGCGVLVGTQQGRVVACQGDPDAPVNRGLNCIKG YFLPKIMYGKDRLTQPLLRMKNGKYDKEGEFTPITWDQAFDVMEEKFKTALKEKGPESIGMFGSGQWTIW EGYAASKLFKAGFRSNNIDPNARHCMASAVVGFMRTFGMDEPMGCYDDIEQADAFVLWGANMAEMHPILW SRITNRRLSN
The corresponding entry in the GFF file should contain an ID attribute of the form:
ID=WP_000000001.1
orID=cds-WP_000000001.1
FAA/GFF combinations that are known to work 'out-of-the-box' are from genomes annotated with:
- NCBI's prokaryotic genome annotation pipeline (i.e. genomes from RefSeq and GenBank)
- JGI's IMG annotation pipeline (i.e. genomes from IMG)
- Prokka
-
Why are there parsing failures when using a GFF file from prokka?
The following warning may be thrown when using a GFF file generated by prokka:
Warning: 46324 parsing failures. row col expected actual file 2612 -- 9 columns 1 columns 'prokka.gff' 2613 -- 9 columns 1 columns 'prokka.gff' 2614 -- 9 columns 1 columns 'prokka.gff' 2615 -- 9 columns 1 columns 'prokka.gff' 2616 -- 9 columns 1 columns 'prokka.gff' .... ... ......... ......... ............ See problems(...) for more details.
This is because these GFF files are appended with the contig sequences of the annotated genome. This warning can be avoided by removing the contig sequences from the GFF file with:
sed '/^##FASTA/Q' prokka.gff > nosequence.gff
-
Why can't I use a nucleotide FASTA file with < 100 kbp?
According to Prodigal's own documentation, sequences < 100 kbp are "too short to gather enough statistics to predict genes well". To avoid issues arising from this, PADLOC won't try to run prodigal over anything < 100 kbp.
If you know what you're doing then you can use Prodigal or another gene prediction program to generate your own FAA and GFF files to then use with PADLOC.
-
Is the
full.seq.E.value
ordomain.iE.value
used to filter HMM hits?The primary method of filtering hits is via the domain iE-value, based on the thresholds specified in
hmm_meta.txt
, under the columne.val.threshold
. -
How do I use
--ncrna
and--crispr
to identify ncRNAs and CRISPR arrays?With
--ncrna
and--crispr
, pre-computed files from Infernal and CRISPRDetect respectively can be supplied to PADLOC to be included in the detection of Retrons and CRISPR-Cas systems. Infernal and CRISPRDetect are run automatically when using the PADLOC web server, but can also be run and supplied to the command line version.
Note
Refer to padloc/etc/README.md
for instructions on pre-computing ncRNA and CRISPR array data.
Bugs and feature requests can be submitted to the Issues tab (see Sample bug report).
These dependencies are automatically installed when installing PADLOC via conda
.
- R == 4.3.1
R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- tidyverse == 2.0.0 Wickham et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686.
- yaml == 2.3.7 Stephens, J., et al. (2020). yaml: Methods to Convert R Data to YAML and Back. https://CRAN.R-project.org/package=yaml.
- getopt == 1.20.3 Davis, T., et al. (2019). getopt: C-Like 'getopt' Behavior. https://CRAN.R-project.org/package=getopt.
- HMMER == 3.3.2 Finn, R.D., Clements, J., and Eddy, S.R. (2011). HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 39, W29–W37.
- Prodigal == 2.6.3 Hyatt, D., Chen, GL., Locascio, P.F., Land, M.L., Larimer, F.W., and Hauser, L.J. (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119.
This software and data is available as open source under the terms of the MIT License.