11/17/11. Gene Regulation: Finding Regulatory Motifs in DNA Sequences


Gene Regulation

Finding Regulatory Motifs in DNA Sequences

An experiment shows that when gene X is knocked out, 20 other genes are not expressed.
How can one gene have such drastic effects?

Regulatory Proteins

Gene X encodes a regulatory protein: a transcription factor (TF).
The 20 unexpressed genes rely on the TF to induce transcription.

Gene Regulation

[Figure labels: Transcription Factor (Protein), RNA polymerase, DNA, Regulatory Element, Gene]

Gene Regulation

[Figure labels: Transcription Factor, DNA, Regulatory Element, New RNA, Gene, RNA polymerase]

[Slide: a full page of raw, unannotated genomic DNA sequence, a few thousand bases long]

[Slide: the same page of DNA sequence, with regions labeled Exons, Promoter motifs, 3' UTR motifs, and Intron]

Regulatory networks

[Figure: a network over genes A, B, C, and D, with nodes "Make D" and "Make B" and the following rules]
If C then D
If B then NOT D
If A and B then D
If D then B

Regulatory Regions

Every gene contains a regulatory region, typically stretching 100-1000 bp upstream of the transcription start site (TSS).
TFs influence gene expression by binding to specific locations in the gene's regulatory region: transcription factor binding sites (TFBS).
TFBS are characterized by sequence patterns (motifs) that are specific to a given transcription factor.

Transcription Factor Binding Sites

A TFBS can be located anywhere within the regulatory region.
TFBS may vary slightly across different regulatory regions, since non-essential bases can mutate.

TFBS example: ATCCCG

Transcription Factors and Motifs

ATCCCG
TTCCGG
ATCCCG
ATGCCG
ATGCCC

Motif Logo

Motifs can mutate at non-important bases.
The five motif occurrences below have mutations in positions 3 and 5.
Representations called motif logos illustrate the conserved and variable regions of a motif.

TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA

Identifying Motifs

Genes are turned on or off by regulatory proteins (TFs).
These proteins bind to upstream regulatory regions of genes to either attract or block an RNA polymerase.
TFs bind to short DNA sequences. We use motifs to represent TFBS.
Finding the same motif in the regulatory regions of multiple genes suggests those genes are similarly regulated.
We can also use the motif to identify new genes that are regulated by the TF.

Identifying Motifs: Why Is It Difficult?

We do not know the motif sequence.
We do not know where it is located relative to the gene's TSS.
Occurrences of the motif can differ.
The motifs are typically short, so the signal is very weak.

A Motif Finding Analogy

The Motif Finding Problem is similar to the problem posed by Edgar Allan Poe (1809-1849) in his story "The Gold Bug".

The Gold Bug Problem

Given a secret message:

53++305))6*;4826)4+.)4+);806*;488`60))85;]8*:+*8 83(88)5*; 46(;88*96*?;8)*+(;485);5*2:*+(;4956*2(5*-4)8`8*; 4069285);)6 8)4++;1(+9;48081;8:8+1;4885;4)485528806*81(+9;48; (88;4(+?3 4;48)4+;161;:188;+?;

Decipher the message encrypted in the fragment.

Hints for The Gold Bug Problem

Additional hints:
The encrypted message is in English.
Each symbol corresponds to one letter in the English alphabet.
No punctuation marks are encoded.

The Gold Bug Problem: Symbol Counts

Naive approach to solving the problem:
Count the frequency of each symbol in the encrypted message.
Find the frequency of each letter of the alphabet in the English language.
Map the symbols to letters of the alphabet according to their frequency.

Symbol Frequencies in the Gold Bug Message

Gold Bug message (most frequent first):
Symbol: 8 ; 4 ) + * 5 6 ( 1 0 2 9 3 : ? ` - ] .
Frequency: 34 25 19 16 15 14 12 11 9 8 7 6 5 5 4 4 3 2 1 1 1

English language (most frequent to least frequent):
e t a o i n s r h l d c u m f p g w y b v k x j q z

The Gold Bug Message Decoding: First Attempt

Mapping by frequency:

sfiilfcsoorntaeuroaikoaiotecrntaeleyrcooestvenpinelefheeosnlt arhteenmrnwteonihtaesotsnlupnihtamsrnuhsnbaoeyentacrmuesotorl eoaiitdhimtaecedtepeidtaelestaoaeslsueecrnedhimtaetheetahiwfa taeoaitdrdtpdeetiwt

The result does not make sense.

The Gold Bug Problem: l-mer counts

A better approach: examine the frequencies of combinations of 2 symbols, 3 symbols, etc.
"the" is the most frequent 3-mer in English, and ";48" is the most frequent 3-mer in the encrypted text.
Make inferences about unknown symbols by examining other frequent l-mers.

The Gold Bug Problem: the ;48 clue

Mapping "the" to ";48" and substituting all occurrences of the symbols:

53++305))6*the26)h+.)h+)te06*thee`60))e5t]e*:+*ee3(ee)5*t h6(tee*96*?te)*+(the5)t5*2:*+(th956*2(5*h)e`e*th0692e5)t)6e )h++t1(+9the0e1te:e+1thee5th)he552ee06*e1(+9thet(eeth(+?3ht he)h+t161t:1eet+?t

Decoding The Gold Bug Message: Second Attempt

Make inferences from the partially decoded text:
"thet(ee" most likely means "the tree".
Infer ( = r.
"th(+?3h" becomes "thr+?3h".
Can we guess +?
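The l-mer counting step can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the helper names `count_lmers` and `substitute` and the short `toy` string are assumptions made for the example, and the toy string is not the Gold Bug message.

```python
from collections import Counter


def count_lmers(text: str, l: int) -> Counter:
    """Count every overlapping substring of length l in text."""
    return Counter(text[i:i + l] for i in range(len(text) - l + 1))


def substitute(text: str, mapping: dict[str, str]) -> str:
    """Replace each symbol that has a known mapping; leave the rest unchanged."""
    return "".join(mapping.get(symbol, symbol) for symbol in text)


# Hypothetical toy ciphertext in which ";48" is the most frequent 3-mer.
toy = ";48(88;4(+?3;48"

# Find the most frequent 3-mer and assume it encodes "the".
top_lmer, top_count = count_lmers(toy, 3).most_common(1)[0]
partial = substitute(toy, dict(zip(top_lmer, "the")))

print(top_lmer, top_count)  # the winning 3-mer and how often it occurs
print(partial)              # partially decoded text; unknown symbols are kept
```

The same two helpers apply to any symbol string; only the guess that the top 3-mer stands for "the" is specific to English text.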

The Gold Bug Problem: The Solution

After figuring out all the mappings, the final message is:

AGOODGLASSINTHEBISHOPSHOSTELINTHEDEVILSSEATWENYONEDEGRE
ESANDTHIRTEENMINUTESNORTHEASTANDBYNORTHMAINBRANCHSEVENT
HLIMBEASTSIDESHOOTFROMTHELEFTEYEOFTHEDEATHSHEADABEELINE
FROMTHETREETHROUGHTHESHOTFIFTYFEETOUT

The Solution (cont'd)

Punctuation is important:

A GOOD GLASS IN THE BISHOP'S HOSTEL IN THE DEVIL'S SEAT, TWENTY ONE DEGREES AND THIRTEEN MINUTES NORTHEAST AND BY NORTH, MAIN BRANCH SEVENTH LIMB, EAST SIDE, SHOOT FROM THE LEFT EYE OF THE DEATH'S HEAD A BEE LINE FROM THE TREE THROUGH THE SHOT, FIFTY FEET OUT.

Solving the Gold Bug Problem

Prerequisites to solve the problem:
The frequencies of single letters, and of combinations of two and three letters, in English.
Knowledge of the words in the English dictionary.

Similarities between the two problems

Motif Finding:
Nucleotides in motifs encode a message in the genetic language.
To solve the problem, we analyze the frequencies of patterns in the nucleotide sequences.

Gold Bug Problem:
Symbols in "The Gold Bug" encode a message in English.
To solve the problem, we analyze the frequencies of patterns in text written in English.

Similarities (cont'd)

Motif Finding: knowledge of established motifs is helpful.
Gold Bug Problem: knowledge of the words in the dictionary is highly desirable.

Motif Finding and The Gold Bug Problem

Motif Finding is harder than the Gold Bug problem:
We have the dictionary of English words, but not the dictionary of regulatory signals.
The genetic language does not have a standard grammar.
Only a small fraction of the nucleotide sequence encodes motifs, and the size of the data is enormous.

The Motif Finding Problem

Given a sample of DNA sequences:

cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc

Find the pattern that is common to each of the individual sequences, namely, the motif.

The Motif Finding Problem (cont'd)

Additional information:
The hidden sequence is of length around 8.
Each occurrence of the pattern is not exactly the same, because random point mutations may occur.

The Motif Finding Problem (cont'd)

The pattern revealed with no mutations (each of the five sequences above contains one exact copy):

acgtacgt

The Motif Finding Problem (cont'd)

The patterns with 2 point mutations:

cctgatagacgctatctggctatccaggtacttaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatccatacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgttagtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtccatataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaccgtacggc

Defining Motifs

To define a motif, suppose we know where the motif starts in each sequence.
The motif start positions in their sequences can be represented as s = (s_1, s_2, s_3, ..., s_t).

Motifs: Representation

Aligned l-mers (capital letters mark mutated positions):

    a G g t a c T t
    C c A t a c g t
    a c g t T A g t
    a c g t C c A t
    C c g t a c g G

Counts:

    A   3 0 1 0 3 1 1 0
    C   2 4 0 0 1 4 0 0
    G   0 1 4 0 0 0 3 1
    T   0 0 0 5 1 0 1 4

Consensus:

    A C G T A C G T

Line up the patterns by their start indexes s = (s_1, s_2, ..., s_t).
Construct a matrix with the counts (or frequencies) of each nucleotide in each column. This is known as a Position Weight Matrix (PWM) or Position Specific Scoring Matrix (PSSM).
The consensus nucleotide in each position is the one with the highest score in the column.
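The construction of the count matrix and consensus can be sketched as follows. This is an illustrative sketch only: the function names `count_matrix` and `consensus` are assumptions, the five instances are the aligned l-mers from the slide (lower-cased), and ties are broken in the fixed order a, c, g, t, which the slides do not specify.

```python
# Aligned motif instances from the slide, lower-cased.
INSTANCES = ["aggtactt", "ccatacgt", "acgttagt", "acgtccat", "ccgtacgg"]


def count_matrix(instances: list[str]) -> dict[str, list[int]]:
    """Return the per-column nucleotide counts, keyed by nucleotide."""
    length = len(instances[0])
    counts = {base: [0] * length for base in "acgt"}
    for instance in instances:
        for column, base in enumerate(instance):
            counts[base][column] += 1
    return counts


def consensus(counts: dict[str, list[int]]) -> str:
    """Pick the most frequent nucleotide per column (ties: first of a, c, g, t)."""
    length = len(counts["a"])
    return "".join(
        max("acgt", key=lambda base: counts[base][column])
        for column in range(length)
    )


counts = count_matrix(INSTANCES)
for base in "acgt":
    print(base.upper(), *counts[base])  # one row of the count matrix
print("consensus:", consensus(counts))
```

Dividing each column by the number of instances would turn the counts into the frequency form of the same matrix.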

Consensus

Think of the consensus as the ancestor motif, from which current-day instances evolved.
The distance between an instance of a motif and the consensus sequence is generally less than the distance between two instances.

Motif finding and MSA

Find the best multiple local alignment.

Evaluating Motifs

We have a guess about a motif. How good is it?
We need a scoring function to compare different guesses and choose the best one.

Some Notation

DNA - sample of DNA sequences (t x n array)
t - number of DNA sequences
n - length of each DNA sequence
l - length of the motif
s_i - starting position of the motif in sequence i
s = (s_1, s_2, ..., s_t) - array of motif starting positions

Parameters

l = 8, t = 5, n = 69
s = (s_1 = 26, s_2 = 21, s_3 = 3, s_4 = 56, s_5 = 60)

DNA:

cctgatagacgctatctggctatccaggtacttaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatccatacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgttagtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtccatataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaccgtacggc
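The notation can be made concrete with a short sketch that pulls the l-mers out of the DNA array at the starting positions s and compares distances. The helper names `extract_lmers` and `hamming` are assumptions for illustration, and Hamming distance (number of mismatched positions) is assumed as the distance measure. The slides use 1-based positions, so the code subtracts 1 before slicing.

```python
from itertools import combinations

# The five sequences with mutated motif instances, as given on the slide.
DNA = [
    "cctgatagacgctatctggctatccaggtacttaggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatccatacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
    "aaacgttagtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
    "agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtccatataca",
    "ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaccgtacggc",
]
S = (26, 21, 3, 56, 60)  # 1-based starting positions s_1..s_5 from the slide
L = 8                    # motif length l
CONSENSUS = "acgtacgt"   # consensus string from the slide


def extract_lmers(dna, starts, l):
    """Return the l-mer at each 1-based starting position."""
    return [seq[s - 1:s - 1 + l] for seq, s in zip(dna, starts)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


lmers = extract_lmers(DNA, S, L)
to_consensus = [hamming(m, CONSENSUS) for m in lmers]
pairwise = [hamming(a, b) for a, b in combinations(lmers, 2)]

print(lmers)
print("instance vs consensus:", to_consensus)
print("mean instance vs instance:", sum(pairwise) / len(pairwise))
```

On this data every instance differs from the consensus in exactly two positions, while most pairs of instances differ in three or four, which is the point the Consensus slide makes.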

Scoring motif strength: the consensus score

Given motif starting positions s:

Score(s) = \sum_{i=1}^{l} \max_{k \in \{A, T, C, G\}} count(k, i)

Many scoring schemes exist.

Aligned l-mers (t rows, l columns):

    a G g t a c T t
    C c A t a c g t
    a c g t T A g t
    a c g t C c A t
    C c g t a c g G

Counts:

    A   3 0 1 0 3 1 1 0
    C   2 4 0 0 1 4 0 0
    G   0 1 4 0 0 0 3 1
    T   0 0 0 5 1 0 1 4

Consensus:

    a c g t a c g t

Score = 3 + 4 + 4 + 5 + 3 + 4 + 3 + 4 = 30

Motif Finding: Problem Formulation

Goal: Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the motif score.
Input: t DNA sequences and l, the length of the pattern to find.
Output: An array of t starting positions s = (s_1, s_2, ..., s_t) maximizing Score(s).

Motif Finding: A Brute Force Approach

Compute the scores for each possible combination of starting positions s, where s_i = 1, ..., n-l+1 for i = 1, ..., t.
Find the s that maximizes Score(s).

BruteForceMotifSearch

    BruteForceMotifSearch(DNA, t, n, l)
        bestScore ← 0
        for each s = (s_1, s_2, ..., s_t) from (1, 1, ..., 1) to (n-l+1, ..., n-l+1)
            if Score(s) > bestScore
                bestScore ← Score(s)
                bestMotif ← (s_1, s_2, ..., s_t)
        return bestMotif

Running Time of BruteForceMotifSearch

Varying (n - l + 1) positions in each of t sequences, we are looking at (n - l + 1)^t sets of starting positions.
For each set of starting positions, the scoring function makes l operations, so the complexity is l (n - l + 1)^t = O(l n^t).
That means that for t = 8, n = 1000, l = 10 we must perform 10 x 991^8, or approximately 10^25, computations.

CONSENSUS: Greedy Motif Search

Find the two closest l-mers in sequences 1 and 2 and form their 2 x l alignment matrix.
At each of the following t-2 iterations, CONSENSUS finds a best l-mer in sequence i with respect to the already constructed (i-1) x l alignment matrix for the first (i-1) sequences.
In other words, it finds an l-mer in sequence i maximizing Score(s) computed over the first i sequences, where only the starting position s_i is allowed to vary.
CONSENSUS sacrifices optimality for speed. In fact, the bulk of the time is spent locating the first 2 l-mers.
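The consensus score, the brute-force search, and a CONSENSUS-style greedy search can be sketched together in Python. This is a simplified illustration under stated assumptions, not the published CONSENSUS program: the function names and the tiny `TOY` data set are hypothetical, start positions are 0-based (the slides use 1-based), and the first greedy step simply picks the best-scoring pair from sequences 1 and 2.

```python
from itertools import product


def score(dna, starts, l):
    """Consensus score: for each column, count the most common base, then sum."""
    lmers = [seq[s:s + l] for seq, s in zip(dna, starts)]
    return sum(max(column.count(base) for base in "acgt")
               for column in zip(*lmers))


def brute_force_motif_search(dna, l):
    """Score every combination of 0-based start positions and keep the best."""
    ranges = [range(len(seq) - l + 1) for seq in dna]
    return max(product(*ranges), key=lambda starts: score(dna, starts, l))


def greedy_motif_search(dna, l):
    """Greedy sketch: fix the best pair in sequences 1 and 2, then add one
    sequence at a time, varying only the new start position."""
    pair_ranges = [range(len(seq) - l + 1) for seq in dna[:2]]
    starts = list(max(product(*pair_ranges),
                      key=lambda pair: score(dna[:2], pair, l)))
    for i in range(2, len(dna)):
        best = max(range(len(dna[i]) - l + 1),
                   key=lambda s: score(dna[:i + 1], starts + [s], l))
        starts.append(best)
    return tuple(starts)


# Hypothetical toy input: "acgt" is planted once in each short sequence.
TOY = ["ttacgtgg", "gacgtcca", "ccgacgtt"]

brute = brute_force_motif_search(TOY, 4)
greedy = greedy_motif_search(TOY, 4)
print(brute, score(TOY, brute, 4))
print(greedy, score(TOY, greedy, 4))
```

The brute-force version is only practical for toy inputs like this one, because the number of combinations grows as (n - l + 1)^t. The greedy version scores far fewer candidates but can miss the optimum when the best pair in the first two sequences is not part of the best overall motif.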

Some Motif Finding Programs

CONSENSUS: Hertz, Stormo (1989)
GibbsDNA: Lawrence et al (1993)
MEME: Bailey, Elkan (1995)
RandomProjections: Buhler, Tompa (2002)
MULTIPROFILER: Keich, Pevzner (2002)
MITRA: Eskin, Pevzner (2002)
Pattern Branching: Price, Pevzner (2003)