6.095/ Computational Biology: Genomes, Networks, Evolution. Sequence Alignment and Dynamic Programming

Samankaltaiset tiedostot
Efficiency change over time

Capacity Utilization

Bounds on non-surjective cellular automata

Genome 373: Genomic Informatics. Professors Elhanan Borenstein and Jay Shendure

Gap-filling methods for CH 4 data

Uusi Ajatus Löytyy Luonnosta 4 (käsikirja) (Finnish Edition)

Tietorakenteet ja algoritmit

16. Allocation Models

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

LYTH-CONS CONSISTENCY TRANSMITTER

The CCR Model and Production Correspondence

Alternative DEA Models

Information on preparing Presentation

Eukaryotic Comparative Genomics

Chapter 7. Motif finding (week 11) Chapter 8. Sequence binning (week 11)

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

Functional Genomics & Proteomics

1. SIT. The handler and dog stop with the dog sitting at heel. When the dog is sitting, the handler cues the dog to heel forward.

7. Product-line architectures

anna minun kertoa let me tell you

Manolis Kellis Piotr Indyk

FinFamily PostgreSQL installation ( ) FinFamily PostgreSQL

BLOCKCHAINS AND ODR: SMART CONTRACTS AS AN ALTERNATIVE TO ENFORCEMENT

Miksi Suomi on Suomi (Finnish Edition)

make and make and make ThinkMath 2017

Kvanttilaskenta - 1. tehtävät

Use of spatial data in the new production environment and in a data warehouse

Network to Get Work. Tehtäviä opiskelijoille Assignments for students.

Voice Over LTE (VoLTE) By Miikka Poikselkä;Harri Holma;Jukka Hongisto

AYYE 9/ HOUSING POLICY

Travel Getting Around

State of the Union... Functional Genomics Research Stream. Molecular Biology. Genomics. Computational Biology

1. Liikkuvat määreet

Choose Finland-Helsinki Valitse Finland-Helsinki

Results on the new polydrug use questions in the Finnish TDI data

The Viking Battle - Part Version: Finnish

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

Returns to Scale II. S ysteemianalyysin. Laboratorio. Esitelmä 8 Timo Salminen. Teknillinen korkeakoulu

FinFamily Installation and importing data ( ) FinFamily Asennus / Installation

Innovative and responsible public procurement Urban Agenda kumppanuusryhmä. public-procurement

Plasmid Name: pmm290. Aliases: none known. Length: bp. Constructed by: Mike Moser/Cristina Swanson. Last updated: 17 August 2009

Searching (Sub-)Strings. Ulf Leser

Other approaches to restrict multipliers

Operatioanalyysi 2011, Harjoitus 4, viikko 40

Nuku hyvin, pieni susi -????????????,?????????????????. Kaksikielinen satukirja (suomi - venäjä) ( (Finnish Edition)

BDD (behavior-driven development) suunnittelumenetelmän käyttö open source projektissa, case: SpecFlow/.NET.

MEETING PEOPLE COMMUNICATIVE QUESTIONS

7.4 Variability management

Operatioanalyysi 2011, Harjoitus 2, viikko 38

1.3Lohkorakenne muodostetaan käyttämällä a) puolipistettä b) aaltosulkeita c) BEGIN ja END lausekkeita d) sisennystä

Toppila/Kivistö Vastaa kaikkin neljään tehtävään, jotka kukin arvostellaan asteikolla 0-6 pistettä.

Information Retrieval

The role of 3dr sector in rural -community based- tourism - potentials, challenges

FIS IMATRAN KYLPYLÄHIIHDOT Team captains meeting

Uusi Ajatus Löytyy Luonnosta 3 (Finnish Edition)

Green Growth Sessio - Millaisilla kansainvälistymismalleilla kasvumarkkinoille?

MRI-sovellukset. Ryhmän 6 LH:t (8.22 & 9.25)

National Building Code of Finland, Part D1, Building Water Supply and Sewerage Systems, Regulations and guidelines 2007

11/17/11. Gene Regulation. Gene Regulation. Gene Regulation. Finding Regulatory Motifs in DNA Sequences. Regulatory Proteins

Counting quantities 1-3

Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. David R. Kelley

Strict singularity of a Volterra-type integral operator on H p

Lab SBS3.FARM_Hyper-V - Navigating a SharePoint site

Salasanan vaihto uuteen / How to change password

Chapter 9 Motif finding. Chaochun Wei Spring 2019

Salausmenetelmät 2015/Harjoitustehtävät

1.3 Lohkorakenne muodostetaan käyttämällä a) puolipistettä b) aaltosulkeita c) BEGIN ja END lausekkeita d) sisennystä

Statistical design. Tuomas Selander

Huom. tämä kulma on yhtä suuri kuin ohjauskulman muutos. lasketaan ajoneuvon keskipisteen ympyräkaaren jänteen pituus

KMTK lentoestetyöpaja - Osa 2

C470E9AC686C

S SÄHKÖTEKNIIKKA JA ELEKTRONIIKKA

A DEA Game I Chapters

MUSEOT KULTTUURIPALVELUINA

lpar1 IPB004065, IPB002277, and IPB Restriction Enyzme Differences from REBASE Gained in Variant Lost from Reference

Tarua vai totta: sähkön vähittäismarkkina ei toimi? Satu Viljainen Professori, sähkömarkkinat

Oma sininen meresi (Finnish Edition)

Exercise 1. (session: )

INNOVATION PROJECT2014 FLORIANE MASSÉ KIA KOPONEN RASMUS LÖNNQVIST MIRA HÖLTTÄ EMMI KAINULAINEN TONI GRÖNMARK

21~--~--~r--1~~--~--~~r--1~

Capacity utilization

Särmäystyökalut kuvasto Press brake tools catalogue

Kielenkäytön näkökulma oppimisvuorovaikutukseen

CASE POSTI: KEHITYKSEN KÄRJESSÄ TALOUDEN SUUNNITTELUSSA KETTERÄSTI PALA KERRALLAAN

Mitä mahdollisuuksia tuloksemme tarjoavat museoille?

Data quality points. ICAR, Berlin,

Immigration Studying. Studying - University. Stating that you want to enroll. Stating that you want to apply for a course.

Kvanttilaskenta - 2. tehtävät

TIETEEN PÄIVÄT OULUSSA

Strategiset kyvykkyydet kilpailukyvyn mahdollistajana Autokaupassa Paula Kilpinen, KTT, Tutkija, Aalto Biz Head of Solutions and Impact, Aalto EE

Mitä Master Class:ssa opittiin?

Keskeisiä näkökulmia RCE-verkoston rakentamisessa Central viewpoints to consider when constructing RCE

Arkkitehtuuritietoisku. eli mitä aina olet halunnut tietää arkkitehtuureista, muttet ole uskaltanut kysyä

Master's Programme in Life Science Technologies (LifeTech) Prof. Juho Rousu Director of the Life Science Technologies programme 3.1.

Kylänetti projektin sivustojen käyttöohjeita Dokumentin versio 2.10 Historia : 1.0, 1.2, 1.6 Tero Liljamo / Deserthouse, päivitetty 25.8.

Biojätteen keruu QuattroSelect - monilokerojärjestelmällä Tiila Korhonen SUEZ

PAINEILMALETKUKELA-AUTOMAATTI AUTOMATIC AIR HOSE REEL

Ostamisen muutos muutti myynnin. Technopolis Business Breakfast

Harjoitustehtävät. Laskarit: Ti KO148 Ke KO148. Tehtävät viikko. VIIKON 42 laskarit to ko salissa IT138

ECVETin soveltuvuus suomalaisiin tutkinnon perusteisiin. Case:Yrittäjyyskurssi matkailualan opiskelijoille englantilaisen opettajan toteuttamana

Transkriptio:

6.095/6.895 - Computational Biology: Genomes, Networks, Evolution Sequence lignment and Dynamic Programming Tue Sept 13, 2005

Challenges in Computational Biology 4 Genome ssembly 5 Regulatory motif discovery 1 Gene Finding DN 2 Sequence alignment 6 Comparative Genomics 7 Evolutionary Theory TCTGCTT TCGTGT TGGGTT TTTCTT TTTGTTT 3 Database lookup 8 Gene expression analysis 11 Protein network analysis RN transcript 9 Cluster discovery 10 Gibbs sampling 12 Regulatory network inference 13 Emerging network properties

Reminder: Last lecture / recitation Schedule for the term Foundations till midterm Frontiers lead to final project Duality: basic problems / fundamental techniques Biology introduction DN, RN, protein, transcription, translation Why computational biology First problem: Motif discovery Counting motif instances across the genome Counting conserved motif instances Problem set: discover motifs in actual yeast genome

GL10 Counting conserved motif instances Scer TTTTTGTTTTCTTCTTCTTTTTTTTTGGTGGCGCGGTTTTTCTTTCTGGCTTCCCCTTC Spar CTTGTTGTCTTTTCGTTTTT-CCTTTTGTGGGTGCGGTGTGTTTTTTTCTCGCTTTCCTTCTCC Smik GTTTTGTTTTTCGTTTTTTTTCCTTCTTCGGTTTGT-TGTCGTTTTCTTTCGTTCTTCTCC Sbay TTTTTTTGTTTCTTTGTTTTCTTTCTTTCTTCTTTGGTGTGTCCTCTGCTTCT-GTCCTTCCTT * * **** * * * ** ** * * ** ** ** * * * ** ** * * * ** * * * Scer TTCCTTCTTCTTCTTTTGTTGT-GGT-GTGGCCCCTTTCTTGCCTCC--TTCTCTTTGGCTTTCGTTCG Spar TTCCTTCTGTCTTCTTTTGTTGT-GGGT-GTTGTCCCCGTTCTTCCCGGCC--TT-TCTTGCTTGCTG-TCG Smik TCCGTGTCTGTCTTCTTTTGTTC-GGGTTGTTGGTTCCCGTCTCCCGTCGGT--CTTTCTTGGGCTTTG-CT-TTG Sbay TGTTTTCTGTCTTTCTTTTTTTGGGTGCCTCGTGCTCCTCGCGGGGGTTTTCTGTGGGCTTTCCCTTTTTG ** ** *** **** ******* ** * * * * * * * ** ** * *** * *** * * * Scer CTTCTGCTCTTGC-----TTTTGGTCGGTTGGCCGCCGGCGGGCGCGCCCTCCGCGGGCTCTCCTCCGTGCGTCCTCGTCT Spar CTCTGCTCTTGC-----TTTGGTCGGTCGGCCGCCGGCGGCGCGCCCTCCGCGGTTTCCCCTCCGTGCGTCGCCGTCT Smik TTTGCTGTTCG--------TTTGTCGGTGGGCCGCCGCGGCGCTTCCCCGCGGCTTCTCCTCCGCGCGGCGTCCTCT Sbay TCTTTTGTCCTTCTTCGCTGTTGTCGGTCGGCTGCCGCCGGTGCGTCTCCGGCGGCTGTCCTCCGTGCGGTCGTCT ** ** ** ***** ******* ****** ***** *** **** * *** ***** * * ****** *** * *** Scer TCCCGG-TCGCGTTCCTGCGCGTGTGCCTCGCGCCGCCTGCTCCGCTGTTCTC-----TCTGCTTTT--TGGTTTG Spar TCGTCGGGTTGTGTCCCTT-CTCGTGTCCTCGCGCCGCCCTGCTCCGCTGGTTCTCG-TCTTGTTTTTTTTGGTTTGC Smik CGTTGG-TCGCGTCCCTG-CTGGTCGGCTCGCCCCCGTGGTCCGCTTTCTGGCTGGGTCTTTTCT--CGGTGTGCC Sbay GTG-CGGTCCGTCCCTGT-TCTGGCGTCTCGCCCCGCCTCCCCGCTGCTGCGC-TGCCTGTGTG--GCGTTTGGT ** * ** *** * * ***** ** * * ****** ** * * ** * * ** *** Scer GGG-TTGGCGT----CCTGGCCCCCCCTT-CTTCGTCTTCCCT-GGTGTTGCG------TTG--T Spar GGCTGCGCCC----CTGCCCCTTCCTTTCCTTTGTCTTGGCCGCT-TGGTTGTCG------TTG--G Smik CCGCTCGTCC----CCCGGCCCCCTCCTT-CTCGTGCGTCTGGCTGCT-GTTTTGGTGC-TTTG--G Sbay GCGTGTGCTTCCTTGCCCCT-CCCCTTCTTTGTTCCGTGTCGCCCTGGTGCTGTGGGGTTGCGGTCGCCTCTCG **** * * ***** *** * * * * * * * * ** MIG1 TBP Gal10 Scer TTTTTGCCTTTTTCTGGGGTTTTCGCGGCG--TGTTTTT-GTCTTTCGTTTTGGGCTGCTCCC-----TT Spar GTTTT--TCTTTTCCTGGCTTCTCCGCTTGGTTTTT-GGTCTTTGCCTTTGCGTTGCTGCCC-----TT Smik TTCTC--CCTTTCTCTGTGTTTCTCCCGTG--TGGTTT--GGCTTTGCCTTTGCGTCGCGGTC-----T Sbay TTTTCCGTTTTCTTCTGTGTGGCTCT--GCGGTTGGTTTTCTGTTCCTTTTGCCTTTTGGTGTCGCCTCTTGT * * * *** * ** * * *** *** * * ** ** * ******** **** * Scer TCTTCTTTCCTTTTCGT--TTGTTTCTT-CTTTTCT----GTCTGTTCC-TTGTTTTCCTCTTCT Spar TTC-TTTGCTCCTCCGTT--TTTTTTCGT-TTTGTTTTTT----GTCTGGTTTC-CGTGTTTTCTCTTCT Smik TCTTCC-TTCGCCTTTGGCTTTTTTTGTCTGTTTTCTTTGGGTTTGTCC-TGTCGTTCTTCTC Sbay TGTTTTTCTTTTTCCGTTTGTCTTCTTGTTTGTTTTTCCGGTTTTCTTTGTCTCCTTTCCTCTCGTTTCCTTTGT * * * * * * ** *** * * * * ** ** ** * * * * * *** * Scer TT-CGTCGG---GCTT Spar TTT-CGTCGG-GCCTT Smik TCGTTCTCG----CT.. Sbay TTTCCCCCCCTT * * ** * ** ** ** MIG1 GL1 Gal4 Gal1 GL4 GL4 GL4 GL4 TBP Factor footprint Conservation island We can read evolution to reveal functional elements

Today s goal: How do we actually align two genes?

Genomes change over time begin C G T C T C C G T G T C G T G T C mutation deletion T G T G T C G T G T C insertion end T G T G T C

Goal of alignment: Infer edit operations begin C G T C T C? end T G T G T C

Question 1: ligning two (ungapped) strings Given two possibly related strings S1 and S2 What is the longest common substring? (no gaps) S1 S2 C G T C T C T G T G T C Scoring function: Match(x,x) = +1 Mismatch(,G)= -½ S1 offset: +1 C G T C T C Mismatch(C,T)= -½ Mismatch(x,y) = -1 S2 T G T G T C G T C offset: -2 +1 -½ -1-1 S1 S2 C G T C T C T G T G T C G T C -½ -1-1 +1-1 -1-1 +1 -½ -1 -½ +1

Q2: ligning two (possibly gapped) sequences Given two possibly related strings S1 and S2 What is the longest common subsequence? (gaps allowed) S1 S2 C G T C T C T G T G T C S1 S2 LCSS C G T C T C T G T G T C G T T C Edit distance: Number of changes needed for S1 S2 Uniform scoring function

How can we compute best alignment S1 S2 C G T C T C T G T G T C Given additive scoring function: Cost of mutation (G vs. CT) Cost of insertion / deletion Reward of match Need algorithm for inferring best alignment Enumeration? How would you do it? How many alignments are there?

Can we simply enumerate all possible alignments? Ways to align two sequences of length m, n n + m ( m + n)! 2 m+ n = m ( m!) 2 π m For two sequences of length n n Enumeration Today's lecture 10 184,756 100 20 1.40E+11 400 100 9.00E+58 10,000

Key insight: score is additive! S1 i C G T C T C S2 T G T G T C Compute best alignment recursively For a given aligned pair (i, j), the best alignment is: Best alignment of S1[1..i] and S2[1..j] + Best alignment of S1[ i..n] and S2[ j..m] j S1 C G i i T C T C S2 T G T G j j T C

Key insight: re-use computation S1 C G T C T C S1 C G T C T C S2 T G T G T C S2 T G T G T C S1 C G T C T C S1 C G T C T C S2 T G T G T C S2 T G T G T C S1 C G T S1 C G T C T C S2 T G T G S2 T G T C Identical sub-problems! We can reuse our work!

Solution #1 Memoization Create a big dictionary, indexed by aligned seqs When you encounter a new pair of sequences If it is in the dictionary: Look up the solution If it is not in the dictionary Compute the solution Insert the solution in the dictionary Ensures that there is no duplicated work Only need to compute each sub-alignment once! Top down approach

Solution #2 Dynamic programming Create a big table, indexed by (i,j) Fill it in from the beginning all the way till the end You know that you ll need every subpart Guaranteed to explore entire search space Ensures that there is no duplicated work Only need to compute each sub-alignment once! Very simple computationally! Bottom up approach

simple introduction to Dynamic Programming Fibonacci numbers 55 8 13 2 3 5 34 Figure by MIT OCW. 21

Fibonacci numbers are ubiquitous in nature Rabbits per generation Leaves per height 4 9 1 6 (2) 3 x 5 (7) 8 10 12 14 11 15 16 13 Romanesque spirals Nautilus size Coneflower spirals Leaf ordering Figure by MIT OCW. Figure by MIT OCW. Figure by MIT OCW. Figure by MIT OCW.

Computing Fibonacci numbers: Top down Goal: Compute nth Fibonacci number. F(0)=1, F(1)=1, F(n)=F(n-1)+F(n-2) 1,1,2,3,5,8,13,21,34,55,89,144,233,377, Top-down approach: Python code def fibonacci(n): if n==1 or n==2: return 1 return fibonacci(n-1) + fibonacci(n-2) nalysis: T(n) = T(n-1) + T(n-2) = ( ) = O(2 n )

Computing Fibonacci numbers: Bottom up Top-down approach Python code fib_table F[1] 1 F[2] 1 F[3] 2 F[4] 3 F[5] 5 F[6] 8 def fibonacci(n): fib_table[1] = 1 fib_table[2] = 1 for i in range(3,n+1): fib_table[i] = fib_table[i-1]+fib_table[i-2] return fib_table[n] nalysis: T(n) = O(n) F[7] F[8] F[9] F[10] F[11] F[12] 13 21 34 55 89?

fib_table F[1] 1 F[2] 1 F[3] 2 F[4] 3 F[5] 5 F[6] 8 F[7] 13 F[8] 21 F[9] 34 F[10] 55 F[11] 89 F[12]? What have we learned? Principles of dynamic programming Reveal identical sub-problems Order computation to enable result reuse Systematically fill in table of results Express larger problems from their subparts Ordering of computations matters Naïve top-down approach very slow results of smaller problems not available repeated work Systematic bottom-up approach successful Systematically solve each sub-problem Fill-in table of sub-problem results in order. Look up solutions instead of recomputing.

How do we apply dynamic programming to sequence alignment?

Key insight: score is additive! S1 i C G T C T C S2 T G T G T C Compute best alignment recursively For a given aligned pair (i, j), the best alignment is: Best alignment of S1[1..i] and S2[1..j] + Best alignment of S1[ i..n] and S2[ j..m] j S1 C G i i T C T C S2 T G T G j j T C

Store score of aligning (i,j) in matrix M(i,j) S[1..i] i S[i..n] T[1..j] j S T[ j..m] Best alignment Best path through the matrix

Filling in the dynamic programming matrix Local update rules: Compute next alignment based on previous alignment Just like Fibonacci numbers: F[i] = F[i-1] + F[i-2] Table lookup!

Duality: seq. alignment path through the matrix C G T C T C T G T G T C S2 T G T G T C S1 C G T C T C G T C/G T C Goal: Find best path through the matrix

- 0 0. Setting up the scoring matrix - G T Initialization: Top right: 0 Update Rule: (i,j)=max{ G } Termination: Bottom right C

- 0 1. llowing gaps in s - G T -2 Initialization: Top right: 0 Update Rule: (i,j)=max{ (i-1, j ) - 2 G C -4-6 -8 } Termination: Bottom right

G 2. llowing gaps in t - G T - 0-2 -4-6 -2-4 -6-8 -4-6 -8-10 -6-8 -10-12 Initialization: Top right: 0 Update Rule: (i,j)=max{ (i-1, j ) - 2 ( i, j-1) - 2 } Termination: Bottom right C -8-10 -12-14

G 3. llowing mismatches - G T - 0-2 -4-6 -1-2 -1-3 -5-1 -4-3 -2-4 -1-1 -1-6 -5-4 -3-1 -1-1 -1-1 -1-1 Initialization: Top right: 0 Update Rule: (i,j)=max{ (i-1, j ) - 2 ( i, j-1) - 2 (i-1, j-1) -1 } Termination: Bottom right C -8-7 -6-5

G 4. Choosing optimal paths - G T - 0-2 -4-6 -1-2 -1-3 -5-1 -4-3 -2-4 -1-1 -1-6 -5-4 -3-1 -1-1 -1-1 -1-1 Initialization: Top right: 0 Update Rule: (i,j)=max{ (i-1, j ) - 2 ( i, j-1) - 2 (i-1, j-1) -1 } Termination: Bottom right C -8-7 -6-5

G 5. Rewarding matches - G T - 0-2 -4-6 1-2 1-1 -3 1-1 -1-4 -1 0-2 1-1 -6-3 0-1 -1 Initialization: Top right: 0 Update Rule: (i,j)=max{ (i-1, j ) - 2 ( i, j-1) - 2 (i-1, j-1) ±1 } Termination: Bottom right C -8-5 -2-1

What is missing? We know how to compute the best score Simply the number at the bottom right entry But we need to remember where it came from Pointer to the choice we made at each step Retrace path through the matrix Need to remember all the pointers y 1 y N x 1 x M Time needed: O(m*n) Space needed: O(m*n) Can we do better than that?

Bounded Dynamic Programming y 1 y Nx1 xm Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1 M For j = max(1, i k) min(n, i+k) F(i 1, j 1)+ s(x i, y j ) F(i, j) = max F(i, j 1) d, if j > i k(n) F(i 1, j) d, if j < i + k(n) k(n) Termination: same

Linear space alignment It is easy to compute F(M, N) in linear space llocate ( column[1] ) llocate ( column[2] ) F(i,j) For i = 1.M If i > 1, then: Free( column[i 2] ) llocate( column[ i ] ) For j = 1 N F(i, j) = What about the pointers?

Finding the best back-pointer for current column Now, using 2 columns of space, we can compute for k = 1 M, F(M/2, k), F r (M/2, N-k) PLUS the backpointers

Best forward-pointer for current column Now, we can find k * maximizing F(M/2, k) + F r (M/2, N-k) lso, we can trace the path exiting column M/2 from k * k * k *

Recursively find midpoint for left & right Iterate this procedure to the left and right! k * N-k * M/2 M/2

Total time cost of linear-space alignment k * N-k * M/2 M/2 Total Time: cmn + cmn/2 + cmn/4 +.. = 2cMN = O(MN) Total Space: O(N) for computation, O(N+M) to store the optimal alignment

Dynamic programming Summary Reuse of computation Order sub-problems. Fill table of sub-problem results Read table instead of repeating work (ex: Fibonacci) Sequence alignment Edit distance and scoring functions Dynamic programming matrix Matrix traversal path Optimal alignment Thursday: Variations on sequence alignment Local and global alignment ffine gap penalties lgorithmic speed-ups Recitation: Dynamic programming applications Probabilistic derivations of alignment scores