SABAT is toolkit for identifying gene homologs in other species using a gene coding sequence (CDS) and BLAST.
Homologs can be identified using the following steps:
- Make a BLASTdb of your genome of interest (with parameters: -parse_seqids, -blast_dbversion 5)
- BLAST your gene's CDS against the BLASTdb. (we suggest using dc-megaBLAST)
- Convert the BLAST output to a BED file and predict potential gene loci (SABAT blast2bed).
- Visualise your BED file with the genome of interest using IGV (Optional).
- Assemble the CDS and protein sequence from a predicted locus or set of exons (SABAT assemble-locus or assemble-exons).
The following dependencies are required for SABAT:
- Python (v3.12.1)
- Biopython (v1.78)
- Pandas (v2.1.4)
- BLAST+ (2.12.0+)
- IGV (optional for visualising BED files)
The following commands can be used to setup an environment with all the dependencies using anaconda:
conda create -n SABAT python=3.12.1
conda install -c bioconda biopython blast
conda install pandas
All SABAT commands can be run as follows:
source activate SABAT
python3 cli.py <command> ...
blast2bed analyses a BLAST output to identify potential gene homolog loci using information about the gene provided (see table below). All BLAST results and predicted loci are output as a single BED file that can be visualised with IGV. Manual curation can facilitate the identification of loci that are missed by blast2bed. This occurs when homolog gene architecture varies significantly from your gene of interest. E.g. differences in intron/exon size and structure. The commands can be used as follows:
python3 cli.py blast2bed \
--input results.blastn \
--exons 1 \
--coverage 1.1 \
--locus_size 1000 \
--refseq False \
> output.bed
Parameter | Description |
---|---|
-i, --input | BLAST output in tabular format (outfmt6) |
-e, --exons | Expected number of exons in the gene. Default: 1 |
-c, --coverage | A float that represents the proportion of gene CDS that must be covered by a group of BLAST hits for a locus to be defined (0.0-1.0). Default: 1.1 |
-l, --locus_size | Expected size of the gene locus in base pairs. Default: 1000 |
-r, --refseq | Boolean values that indicates whether the BLASTDB provided uses a refseq sequence (RefSeq headers are different from the standard headers in a BLASTDB). Default: False |
The BED file generated is directed to the stdout.
Assemble-locus generates a CDS and protein sequence for a gene locus predicted by blast2bed. Assembly will fail if the CDS doesn't end in a stop codon. The flank option can be used to extend the last exon of the CDS and include a stop codon that may not have been captured when BLASTing the CDS of your gene of interest. The command can be used as follows:
python3 cli.py assemble-locus \
--input output.bed \
--database path/to/blastdb/ \
--locus locus_0 \
--output locus_prediction \
--flank 100
Parameter | Description |
---|---|
-i, --input | Path to a Bed file (generated by blast2bed) |
-db, --database | Path to BLASTDB used to generate the blast output used by blast2bed |
-l, --locus | The name of the locus to assemble. |
-o, --output | Base name for output files |
-f, --flank | The number of nucleotides to add to the last exon. Default: 0. |
assemble-exons generates a CDS and protein sequence for a set of exons identified by the user from the output of blast2bed. Assembly will fail if the CDS doesn't end in a stop codon. The flank option can be used to extend the last exon of the CDS and include a stop codon that may not have been captured when BLASTing the CDS of your gene of interest. The commands can be used as follows:
python3 cli.py assemble-exons \
--input output.bed \
--database path/to/blastdb/ \
--exons exon_1 exon_2 exon_3 \
--output exon_prediction \
--flank 100
Parameter | Description |
---|---|
-i, --input | Path to a Bed file (generated by blast2bed) |
-db, --database | Path to BLASTDB used to generate the blast output used for blast2bed |
-e, --exons | A list of exons |
-o, --output | Base name for output files |
-f, --flank | The number of nucleotides to add to the last exon. Default: 0. |
SABAT's assemble commands makes the following assumptions when predicting homologs CDS:
- The first exon starts with start codon (ATG).
- The last exon ends with a stop codon (TGA, TTA or TAG).
- Intermediate exons don’t have to start with ATG, or end with TGA, TTA or TAG.
- If intermediate exons contain a stop codon that is the end of the exon.
- The longest exon is the correct exon (providing all other criteria are satisfied).