Frameshift-Aware-Alignment

Sequence Bioinformatics Group Project WS 21/22 Samira Breitling, Jules Kreuer, Sebastian Meyer and Catalina Schlotterer

Introduction

Sequence alignment is the basis of many bioinformatic problems. The Needleman-Wunsch algorithm as well as the Smith-Waterman algorithm are dynamic programming approaches to compare two sequences and compute a global/local aligment. These algorithms perform reliable alignments for short as well as high quality reads. However, when comparing a DNA sequence to an amino acid sequence erronous insertions and deletions result in so called frameshifts. The following example illustrates this issue:

When inserting three bases in an error-free read:

Then translate this read to its aminoacid sequence and compare it with the actual protein:

The green part of the alignment shows perfect matching parts of the alignment. However between the first and the third insertion, where the alignment is again in the original frame, the alignment is of poor quality. A frameshift-aware alignment deals with this frameshifts and reports forward frameshifts (inserting two bases) and backward frameshifts (inserting one base) in the alignment, which leads to an alignment of significantly higher alignment quality.

In this group project we constructed a dynamic programm which reports an alignment and the corresponding score considering frameshifts.

Affine gap penalty

In an alignment the gap penalty is an important parameter to score alignments. In our frameshift-aware alignment program we first considered the linear gap penalty, which penalizes each gap equally. However, this is not realistic when it comes to biological events. Each mutation (insertion or deletion) is an indivitiual evolutionary event. As a consequence, we consider an affine gap model, which penalizes a gap opening higher than a gap extending. This gap model is called affine gap penalty.

In addition to the plain frameshift aware alignment, we also implemented a frameshift-aware alignment program, which also considers the affine gap penalty model.

Teamwork

In the beginning of the group project we first implemented the different helper functions: translating the DNA sequence to an amino acid sequence, reading in the fasta files and reading in the blosum matrix. We tested these functions and added different blossum matrices. Additionally we added the main function and discussed the input paramteters and the output format. Next we started with the actual frameshift aware alginment. Therefore we decided to implement two algorithms in pairs of two and compared both implementations. This allowed us to have two different approaches, where we could discuss difficulties and pair programming enabled efficient implementing of the algorithm. After the implementation of the frameshift aware alignment, we considered integrating the affine gap penalty. After some research we came up with a paper, that seemed to solve this problem ("Frameshift alignment: statistics and post-genomic applications" by Sergey L. Sheetlin, Yonil Park, Martin C. Frith, and John L. Spouge). However, they used a different approach doing the frameshift alignment than we did. Consequently, one group decided to implement the affine gap penalty by adding the affine gap penalty restrictions to our linear gap penalty implementation. The other group tried to implement as described in the paper. Sadly, the preconditions and main ideas of the paper were too different compared to ours, so we decided to go with our own recursion . Therefore working in pairs to resolve the tasks was a good idea.

Algorithm

As already mentioned in the introduction, the frameshift-aware algorithm is a dynamic porgramm. Hence, a matrix which scores is filled to compute the alignment and the score. In the picture below it is illustrated how to compute the score for a a cell. The black errors indicate the same cases as in a "basic" alignment, while the blue errors show which cell we need to consider for frameshifts. In this case we either insert one base or two bases into the DNA query sequence q.

where d is the gap penalty, and k the frameshift penalty.

Note that the algorithm only works if the frameshift penalty is larger than the gap penalty. Hence, the occurance of an insertion or deletion is more probable than the occurance of a frameshift.

Affine gap penalty

Several papers have shown that an implementation of a frameshift-aware aligner with affine gap penalty is possible. All of them differ a lot in their objective and methodology. For this project, attempts were made to implement this on or own. However, it quickly became apparent that this was a very large field to cover, so we reduced it to the assumptions we could make for sure.

The following section shows the incomplete attempt. The code can be found in the branch 'affine_test_jules'.

For the affine gap penalty we need two additional matrices: The top matrix should provide a minimal score if the extension of an amino-acid insertion is optimal, whereas the bottom matrix is responsible for the dna insertions.

We initialized the three matrices as follows:

Iniitialization columns

basePenalty = -3*gap_open - (i*gap_extend)

score_matrix[i][0] = basePenalty
top_matrix[i][0] = top_matrix[i][1] = top_matrix[i][2] = - infinity
score_matrix[i][1] = score_matrix[i][2] = basePenalty - k

Initialization rows

basePenalty = -3*gap_open - (j*gap_extend)

score_matrix[0][j] = basePenalty
score_matrix[0][j+1] = score_matrix[0][j+2] = basePenalty - k
bottom_matrix[0][j] = bottom_matrix[0][j+1] = bottom_matrix[0][j+2] = - infinity

Recursion:

All figures except for the affine gap penalty recursion were taken out of the Sequence Bioinformatics lecture 2021/2022 script by Daniel Huson

Runtime

The frameshift-aware global aligner has a theoretical runtime O(AA*DNA) with AA=length of the amino-acid sequence and DNA=length of the dna sequence.

To determine the actual runtime, random DNA and AA sequences with different length were generated and aligned. In order to obtain more accurate results, each length-combination was repeated 20 times, the median was taken. The results can be viewed in the previous heatmap.

To predict the running time, the following formula determined by a simple linear model, can be used:

time_in_ms = 0.70030 + 0.00778 * AA * DNA

Installation

This project was build as a python module and can be installed with pip:

# Clone repo
cd Frame-Shift-Aware-Alignment
pip install .

Usage

To align two sequences, the function align can be called. Some sequences for testing can be found within the tests folder. Several arguments should be provided:

    Parameters:
        dnaSeq:  string, dna sequence.
        aaSeq:   string, amino-acid sequence.
        gap:     int, score for gap.
        shift:   int, score for frameshift.
        bm:      int, string, One of [45, 50, 62, 80, 90] or path to blosum matrix.
        verbose: bool, print score and alignment.
    Returns:
        score: Int, Score of aligment
        dnaSeq_align: String, alignment of the DNA sequence.
            With \\ denoting a backward-framshift.
            With / denoting a forward-framshift.
            With - denoting a gap.
        aaSeq_align: String, alignment of the AA sequence.
            With - denoting a gap.

Example

from frameshift_aware_alignment import align

score, dna, aa = align("AGTCAGT", "MM", 6, 15, 62, false)

Standalone version

A usage without installation is also possible. After downloading the archive, execute top-level file align.py.

python3 align.py [-h] -d DNASEQ -aa AASEQ -gp GAP-Penalty -sp SHIFT-Penalty [-b BLOSUM] [-bp BLOSUM_PATH] [-o OUT]

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
.github/workflows		.github/workflows
Meeting		Meeting
Research Papers		Research Papers
runtime		runtime
src/frameshift_aware_alignment		src/frameshift_aware_alignment
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
align.py		align.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
requirements_dev.txt		requirements_dev.txt
setup.cfg		setup.cfg
setup.py		setup.py
test.txt		test.txt
tox.ini		tox.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Frameshift-Aware-Alignment

Introduction

Affine gap penalty

Teamwork

Algorithm

Affine gap penalty

Runtime

Installation

Usage

Standalone version

About

Releases

Packages

Contributors 4

Languages

License

meyer-sp/Frame-Shift-Aware-Alignment

Folders and files

Latest commit

History

Repository files navigation

Frameshift-Aware-Alignment

Introduction

Affine gap penalty

Teamwork

Algorithm

Affine gap penalty

Runtime

Installation

Usage

Standalone version

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages