Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods".

I am Ge Gao from the Center for Bioinformatics, Peking University.

Let's start this week's topic: Markov Model.

First, let's review some basic concepts.

As mentioned in Unit 1, Week2, since an event of insertion or deletion

often involve multiple residues,

a gap often has a length of more than one residue.

This is different from substitutions.

Thus,Gap penalty is often implemented as a linear combination of gap opening penalty

and gap extending penalty which were usually given different penalty scores.

To distinguish between opening and extending a gap, we need to introduce the concept of "state".

In other words, we need to "remember" whether a gap has been opened before.

Specifically, we denote by three different states

the three possible alignments between a pair of residues

from sequence X and Y, respectively.

"M" denotes an alignment of two residues, which are not necessarily the same.

"X" means that the residue in sequence X is aligned to a gap,

or that there is an insertion in sequence X.

"Y" means that the residue in sequence Y is aligned to a gap,

or that there is an insertion in sequence Y.

Based on such a definition of state,

we gave the optimal solution for sequence alignment in the week 2 supplementary learning material.

with affine gap penalty using the FSA(Finite State Automation) model developed in computer sciences.

Specifically, sequence alignment can be formulated as transitions between different statuses.

The "M(i,j)" stands for the score of the best alignment

between the subsequence of sequence X from the first base to the i-th base

and the subsequence of sequence Y from the first base to the j-th base,

with Xi aligned to Yj.

The "X(i,j)" or "Y(i,j)" stands for the score of the best alignment

the same subsequences when Xi or Yj is aligned to a gap.

These scoring functions are thus used in the recursive formula for dynamic programming.

Similar to the F function in previous units this week,

we can also represent these three recursive formulae as filling in the cells on three planes

using the dynamic programming matrix.

Please note that the backtracking path might cross between planes.

In fact, we have just created a Markov Chain.

The Markov Chain was introduced by the Russian mathematician Andrei Andreyevich Markov in 1906.

This probabilistic model for stochastic process is used to

depict a series of interdependent random events.

Specifically, the Markov Chain describes how to transit among a set of discrete states

at each different time.

Please note that for each transition it is NOT a must

to transit to a uniquely defined state;

rather, a probability distribution is used to describe how likely it will be to transit to each state.

The only thing needed in Markov Model is that the probability distribution

of state at time t depends and only depends on its pevious finite m states.

Such a Markov Chain is also called an m-order Markov Chain.

We often only consider a simplified case where m is one.

The current state thus depends and only depends on its immediate previous state.

Specifically, we'd like to introduce the concept of transition probability here

Denoting by α{k,l} the probability of state k at time t transiting to state l,

a transition probability matrix can thus be constructed.

This definition does not require alpha_{k,l} and alpha_{l,k} to be equal.

Thus, the matrix is not symmetric with respect to the main diagonal.

We often assume that the transition probability is independent of t.

In this case, the Markov Chain will become the Stationary Markov Chain,

As mentioned before, we only need to "remember" the previous state

to distinguish between gap opening and gap extending.

If the previous state is M, then we can open a new gap and transit to state X or Y.

If the previous state is X or Y, then we can either extend the gap, or close it.

Let's take a closer look.

We have three different states: M, X, and Y.

They denote an align of a pair of residues,

of a residue from sequence X to a gap,

and of a residue from sequence Y to a gap, respectively.

From state M we can transit to state X or Y to open a new gap.

From state X (or Y) we can also transit to state X (or Y) itself to extend a gap.

Of course, from state M we can also transit to state M itself

to extend the alignment with no gap added.

On the other hand, from state X (or Y) we can transit back to M to close a gap.

Further, we can assign to each possible transition a probability.

We denote the probability of transition from M to X or Y (i.e. gap open)

and the probability of transition from X (or Y) to itself (i.e. gap extending)

Then we can derive other transition probabilities by the law of total probability by delta and epsilon, respectively.

Then we can derive probabilities of any alignments by the law of multiplication theorem.

For example, the states for the alignment in the previous example is XMMY.

The probability of this alignment can thus be worked out

by multiply the corresponding probabilities together.

In fact, we have used Markov Chain to give a probabilistic interpretation of sequence alignment.

In next unit,

we will introduce the Hidden Markov Model, and discuss further on this topic.

Here are some summary questions.

You are encouraged to think about them and discuss them

with other students and TAs in the online forum.

Let's meet at the next unit!