1.4 Alignments 2

16.05.2017 | Slides | Lecture Recording

1. Recap

3D Comparison: There is a way to look at proteins in 3 dimensions

1D - secondary structure prediction
2D - distance map
3D - real 3D coordinates

Dynamic Programming

Global (Needleman-Wunsch) or Local (Smith-Waterman)
With or without Gaps
- Linear Gap Penalty: Opening and extending a gap have the same cost
- Affine Gap Penalty: Opening a new gap is more expensive than extending an existing one
Scoring Matrices
- For each pair of residues you can read off, how much you gain by aligning these 2 residues

BLAST: Basic Local Alignment Search Tool

Dynamic Programming is slow for large scale comparisons
Speed search up by hashing words (seed = 3 amino acids / residues)
After matching a word, try to extend the match by dynamic programming
Major Challenge: Get the statistics right
- How significant is a match against the background probability entire database

Question: What is the major challenge of BLAST?

Getting the statistics right: BLAST needs to know, h_ow significant a match is_, by comparing it against the background probability of the entire database.

2. Pairwise Alignment Accuracy

PSI: Percentage Sequence Identity

Zones:

Daylight-Zone: PSI, where it can be assumed that from a similar sequence follows similar structure
Twilight-Zone: PSI, where it is not possible to infer similar structure from similar sequence (the signal fades)
Midnight-Zone: PSI, where sequence similarity does not tell anything about structure similarity (down to random changes)

# Note: The Midnight-Zone is, where most proteins of similar structure sit

Going deeper into the Twilight-Zone, the following results are to be expected:

True Positives go up in absolute numbers
False positive increase (drastically) in absolute numbers

2.1 HSSP Curve

HSSP: Homology-derived Secondary Structure of Proteins

How to get the curve?

Get all 3D structures from PDB
Remove bias (sequence unique subset)
And compare 'all vs redundancy reduced set'
1. compare 3D structure (e.g. RSMD)
2. compare sequence

Result: Sequence Conservation of protein structur

![](/assets/Screen Shot 2017-07-03 at 09.08.35.png)🕵🏻 Observations

Over the curve (daylight-zone), many true-positives (sequence identity -> similar 3D structure)
Under the curve (upon entering the twilight zone)
- Explosion of false positives
- but also **significant increase **(x10) of true positives (This is why we want to go into the Twilight Zone!!)

Question: Why is it interesting to find similar proteins out of the Twilight / Midnight Zone?

The Midnight-Zone is, where most proteins of similar structure sit.

Question: Why is it that even with only 40% PSI, we can still assume similar structure? Could we randomly change 60% of the residues in the lab and get a new protein with similar structure?

These 60% of changed residues happened under evolutionary pressure and are not random

mutations that did not change structure & function survived (we can observe them today)

mutations that did change structure & function most likely did not survive

Thus randomly changing 60% of residues in a proteins, would not result in a similar protein

Question: Why are certain proteins / structure multiple times in the PDB?

different resolution of 3D structure

different goals of publication produced (new) 3D structures

folding sites

binding partners

etc ...

3. Multiple Sequence Alignment

Dynamic Programming cannot be done efficiently with multiple sequences.

3.1 Multiple Alignment Hack 1: Iterative Pairwise Dynamic Programming

Progressive 1

Align sequences pairwise int consensus sequences
Align resulting consensus

Progressive 2

Align all sequences pairwise
Start with highest matching alignment
Iteratively align sequences into consensus sequence

How to find consensus? Different methods, e.g. the first, the more meaningful aa, ...

3.2 Multiple Alignment Hack 1: Map to Tree / Pairwise

ClustalW/ClustalX

all against all (pairs) by dynamic programming (varying substitution matrices)
build phylogenetic tree
slow, dynamic programming, for experts

4. Profiles

_Profiles profit from relation of 'families'. _

Building up a profile, we can see certain amino acids that are more conserved than others. Computationally we can identify sequence motifs that describe such a profile as a regular expression.

PSSM (Position Specific Scoring Matrix)

You could also write a **profile **into a substitution matrix: A matrix of numbers with scores for each residue or nucleotide at each position.

Building a PSSM

Absolute Frequencies
Add pseudo-counts if necessary
relative frequency
log likelihoods

❓ Question: How are profiles built up? How are the normal noted down? Do we have to know a specific algorithm?

Build up algorithm:

Take all proteins of PSI over a certain threshold ...

❓

Profile Formats:

Regular Expression

PSSM (Position Specific Scoring Matrix) ❓

Question: What is a PSSM (Position Specific Scoring Matrix)?

A matrix of numbers with scores for each residue or nucleotide at each position. Built, e.g. by PSI-BLAST.

5. PSI-BLAST

Position-Specific Iterative Basic Local Alignment Tool

PSI-BLAST Steps

1) Fast Hashing: Like BLAST, match 'word'
2) Dynamic Programming Extension between matches: BLAST + Smith-Waterman
3) Compile Statistics: EVAL - Expectation Values
4) Collect all pairs and build profile
5) ... compare sequences (profile-sequence) and iterate

![](/assets/Screen Shot 2017-07-03 at 10.39.37.png)

Question: Which steps are involved in building up a profile with PSI-BLAST?

1) Fast Hashing: Like BLAST, match 'word'
2) Dynamic Programming Extension between matches: BLAST + Smith-Waterman
3) Compile Statistics: EVAL - Expectation Values
4) Collect all pairs and build profile
5) ... compare sequences (profile-sequence) and iterate

Question: Why is PSI-BLAST so fast?

Because it drastically reduces the length of the comparisons with dynamic programming.

6. Hidden Markov Models (HMM)

Hidden Markov Model are another method for creating Machine Learning Models. They are a good choice, if the structure of the problem is known beforehand.

different states
each state has transition probabilities to the neighboring states / itself

7. HMM for Alignment

Generic Profile HMM for alignment

Captures matches, insertions, deletions
Transition and emmission probabilities
gap penalty handled by variation of transition probabilities
calculation of probability by multiplying path variables

![](/assets/Screen Shot 2017-07-03 at 10.57.05.png)**Entropy in alignment: **Consider the residue at position i

BEFORE any amino acid is aligned, we expect a particular amino acid to have some prior background probability $$P_0$$ with entropy $$H_0$$
AFTER the alignment we consider the same column with a _posterior probability _$$P_i + priors \rightarrow H_i$$.
**We expect **$$H_i = \begin{cases} 0, & if\ conserved \ H_0, & if\ varied \end{cases}$$
$$H_i - H_0$$ reflects the "bits_saved" by the alignment

With small families (few members, little divergence) the entropy is dominated by priors (= the background noise dominates)

8. Genetic Algorithm for alignment

❗️❗️❗️ Independence Assumption NOT needed for genetic algorithm

# All algorithms so far assumed *Independence between residues*: 
# What happens at position i is independent of what happens at position i+x.

# The genetic algorithm does not make this indepence assumption!!

The genetic algorithm works on segments
through mutations it creates new alignments

T-Coffee: much slower, requires pre-processing, Genetic algorithm

9. Profile-Profile Alignments

Compare Profiles to go even deeper into the Twilight- / Midnight-Zone

![](/assets/Screen Shot 2017-07-03 at 11.19.32.png)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

14-alignments-2.md

14-alignments-2.md

1.4 Alignments 2

16.05.2017 | Slides | Lecture Recording

1. Recap

2. Pairwise Alignment Accuracy

2.1 HSSP Curve

How to get the curve?

Result: Sequence Conservation of protein structur

3. Multiple Sequence Alignment

3.1 Multiple Alignment Hack 1: Iterative Pairwise Dynamic Programming

Progressive 1

Progressive 2

3.2 Multiple Alignment Hack 1: Map to Tree / Pairwise

ClustalW/ClustalX

4. Profiles

PSSM (Position Specific Scoring Matrix)

Building a PSSM

5. PSI-BLAST

PSI-BLAST Steps

6. Hidden Markov Models (HMM)

7. HMM for Alignment

8. Genetic Algorithm for alignment

9. Profile-Profile Alignments

Files

14-alignments-2.md

Latest commit

History

14-alignments-2.md

File metadata and controls

1.4 Alignments 2

16.05.2017 | Slides | Lecture Recording

1. Recap

2. Pairwise Alignment Accuracy

2.1 HSSP Curve

How to get the curve?

Result: Sequence Conservation of protein structur

3. Multiple Sequence Alignment

3.1 Multiple Alignment Hack 1: Iterative Pairwise Dynamic Programming

Progressive 1

Progressive 2

3.2 Multiple Alignment Hack 1: Map to Tree / Pairwise

ClustalW/ClustalX

4. Profiles

PSSM (Position Specific Scoring Matrix)

Building a PSSM

5. PSI-BLAST

PSI-BLAST Steps

6. Hidden Markov Models (HMM)

7. HMM for Alignment

8. Genetic Algorithm for alignment

9. Profile-Profile Alignments