S.A.B.A.T - Semi-Automatic BLAST Annotation Toolkit

SABAT is toolkit for identifying gene homologs in other species using a gene coding sequence (CDS) and BLAST.

Homologs can be identified using the following steps:

Make a BLASTdb of your genome of interest (with parameters: -parse_seqids, -blast_dbversion 5)
BLAST your gene's CDS against the BLASTdb. (we suggest using dc-megaBLAST)
Convert the BLAST output to a BED file and predict potential gene loci (SABAT blast2bed).
Visualise your BED file with the genome of interest using IGV (Optional).
Assemble the CDS and protein sequence from a predicted locus or set of exons (SABAT assemble-locus or assemble-exons).

Dependencies

The following dependencies are required for SABAT:

Python (v3.12.1)
Biopython (v1.78)
Pandas (v2.1.4)
BLAST+ (2.12.0+)
IGV (optional for visualising BED files)

The following commands can be used to setup an environment with all the dependencies using anaconda:

conda create -n SABAT python=3.12.1
conda install -c bioconda biopython blast
conda install pandas

Running SABAT

All SABAT commands can be run as follows:

source activate SABAT
python3 cli.py <command> ...

Commands

blast2bed

blast2bed analyses a BLAST output to identify potential gene homolog loci using information about the gene provided (see table below). All BLAST results and predicted loci are output as a single BED file that can be visualised with IGV. Manual curation can facilitate the identification of loci that are missed by blast2bed. This occurs when homolog gene architecture varies significantly from your gene of interest. E.g. differences in intron/exon size and structure. The commands can be used as follows:

python3 cli.py blast2bed \
--input results.blastn \
--exons 1 \
--coverage 1.1 \
--locus_size 1000 \
--refseq False \
> output.bed

Parameters

Parameter	Description
-i, --input	BLAST output in tabular format (outfmt6)
-e, --exons	Expected number of exons in the gene. Default: 1
-c, --coverage	A float that represents the proportion of gene CDS that must be covered by a group of BLAST hits for a locus to be defined (0.0-1.0). Default: 1.1
-l, --locus_size	Expected size of the gene locus in base pairs. Default: 1000
-r, --refseq	Boolean values that indicates whether the BLASTDB provided uses a refseq sequence (RefSeq headers are different from the standard headers in a BLASTDB). Default: False

The BED file generated is directed to the stdout.

assemble-locus

Assemble-locus generates a CDS and protein sequence for a gene locus predicted by blast2bed. Assembly will fail if the CDS doesn't end in a stop codon. The flank option can be used to extend the last exon of the CDS and include a stop codon that may not have been captured when BLASTing the CDS of your gene of interest. The command can be used as follows:

python3 cli.py assemble-locus \
--input output.bed \
--database path/to/blastdb/ \
--locus locus_0 \
--output locus_prediction \
--flank 100

Parameters

Parameter	Description
-i, --input	Path to a Bed file (generated by blast2bed)
-db, --database	Path to BLASTDB used to generate the blast output used by blast2bed
-l, --locus	The name of the locus to assemble.
-o, --output	Base name for output files
-f, --flank	The number of nucleotides to add to the last exon. Default: 0.

assemble-exons

assemble-exons generates a CDS and protein sequence for a set of exons identified by the user from the output of blast2bed. Assembly will fail if the CDS doesn't end in a stop codon. The flank option can be used to extend the last exon of the CDS and include a stop codon that may not have been captured when BLASTing the CDS of your gene of interest. The commands can be used as follows:

python3 cli.py assemble-exons \
--input output.bed \
--database path/to/blastdb/ \
--exons exon_1 exon_2 exon_3 \
--output exon_prediction \
--flank 100

Parameters

Parameter	Description
-i, --input	Path to a Bed file (generated by blast2bed)
-db, --database	Path to BLASTDB used to generate the blast output used for blast2bed
-e, --exons	A list of exons
-o, --output	Base name for output files
-f, --flank	The number of nucleotides to add to the last exon. Default: 0.

Assumptions made by SABAT during Assembly

SABAT's assemble commands makes the following assumptions when predicting homologs CDS:

The first exon starts with start codon (ATG).
The last exon ends with a stop codon (TGA, TTA or TAG).
Intermediate exons don’t have to start with ATG, or end with TGA, TTA or TAG.
If intermediate exons contain a stop codon that is the end of the exon.
The longest exon is the correct exon (providing all other criteria are satisfied).

Name		Name	Last commit message	Last commit date
Latest commit History 115 Commits
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

S.A.B.A.T - Semi-Automatic BLAST Annotation Toolkit

Dependencies

Running SABAT

Commands

blast2bed

Parameters

assemble-locus

Parameters

assemble-exons

Parameters

Assumptions made by SABAT during Assembly

About

Releases 1

Packages

Languages

Orpowell/SABAT

Folders and files

Latest commit

History

Repository files navigation

S.A.B.A.T - Semi-Automatic BLAST Annotation Toolkit

Dependencies

Running SABAT

Commands

blast2bed

Parameters

assemble-locus

Parameters

assemble-exons

Parameters

Assumptions made by SABAT during Assembly

About

Topics

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages