Large Scale Sequence Analysis with Applications to Genomics

Large Scale Sequence Analysis with Applications to Genomics
Gunnar Rätsch, Max Planck Society, Tübingen, Germany
Talk at CWI, Amsterdam, October 23, 2009
Gunnar Rätsch (FML, Tübingen), Large Scale Sequence Analysis, October 23, 2009

Discovery of Nuclein (Friedrich Miescher, 1869)
Tübingen, around 1869
Discovery of nuclein: from lymphocytes and salmon; a multi-basic acid (≥ 4)
"If one ... wants to assume that a single substance ... is the specific cause of fertilization, then one should undoubtedly first and foremost consider nuclein" (Miescher, 1874)

Research Topics
Machine Learning
1. Statistical inference methods for structured data: develop fast and accurate learning methods
2. Convergence properties of iterative algorithms: boosting-like algorithms and semi-infinite LPs
3. Genome annotation: predict features encoded on DNA
Molecular Biology
4. Biological networks: understand interactions between gene products
5. Analysis of polymorphisms: discover polymorphisms and associate them with phenotypes


Machine Learning
Given: observations of some complex phenomenon
Goal: learn from data and build predictive models
Example: two different classes of observations and an inferred classification rule
Recent contributions:
1. Large scale sequence classification
2. Analysis and explanation of learning results
3. Sequence segmentation and structure prediction
[Figures: k-mer length (1 to 8) versus position (-30 to 30); log-intensity along a transcript]
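The "two classes, inferred rule" example can be made concrete with a toy sketch. The nearest-centroid rule, the 2-D data points, and the names `fit_centroids` and `predict` below are all assumptions chosen for illustration; they are not the methods presented in the talk.

```python
from statistics import mean

def fit_centroids(points, labels):
    """Return {label: (mean_x, mean_y)} for each class in the training data."""
    centroids = {}
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        centroids[label] = (mean(p[0] for p in members),
                            mean(p[1] for p in members))
    return centroids

def predict(centroids, point):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    def dist2(c):
        return (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
    return min(centroids, key=lambda label: dist2(centroids[label]))

# Hypothetical observations from two classes.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]

rule = fit_centroids(points, labels)   # the "inferred classification rule"
print(predict(rule, (0.5, 0.5)), predict(rule, (4.5, 4.5)))
```

The learned rule here is just the pair of class centroids; any new observation is assigned to the nearer one.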

Learning about the Transcriptome
Machine Learning View
How to learn to predict what these processes accomplish?
How well can we predict it from the available information?
Biological View
1. What can we not predict yet? What is missing?
2. Can we derive a deeper understanding of these processes?

Computational Genome Annotation
Simplest formulation:
Given a DNA sequence $x \in \{A, C, G, T\}^L$,
find the correct label sequence $y = y_1 y_2 \ldots y_L$, with $y_i \in Y = \{\text{intergenic}, \text{5' UTR}, \text{coding}, \text{intron}, \ldots\}$.
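To illustrate the formulation, the sketch below builds a per-position label sequence y of the same length L as the DNA string x. The helper `label_sequence`, the segment coordinates, and the short example sequence are hypothetical; the label set contains only the labels named on the slide, and a real annotation would come from a predictor rather than hand-written segments.

```python
ALPHABET = set("ACGT")
LABELS = ("intergenic", "5' UTR", "coding", "intron")

def label_sequence(x, segments, default="intergenic"):
    """Expand (start, end, label) segments into per-position labels y_1..y_L.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    Positions not covered by any segment receive the default label.
    """
    if not set(x) <= ALPHABET:
        raise ValueError("x must be a string over {A, C, G, T}")
    y = [default] * len(x)
    for start, end, label in segments:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        if not 0 <= start <= end <= len(x):
            raise ValueError("segment out of range")
        y[start:end] = [label] * (end - start)
    return y

# Hypothetical toy input: a 14-base sequence and three annotated segments.
x = "GATTACAGGTAAGT"
segments = [(2, 4, "5' UTR"), (4, 8, "coding"), (8, 12, "intron")]
y = label_sequence(x, segments)

for base, label in zip(x, y):
    print(base, label)
```

Each position of x gets exactly one label from Y, so x and y always have the same length.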

Example: C. elegans (I: 43,500-52,050) GAAGAAATGGAGCATTTGCGCTCCATCACACTCTCAGACAATTTCATTTTCCACATCCTATATATATTTTGGTTTTTCTGTCGTATTTTGTTTTAATTTATTGGTATTTCGTTCAAAAATAATTATTTTGACTGTATTTTTGGTTGCATA CATGTAGAACTGCTGTTTTTTAAGATATTCTGCCCATTCAAGTTTTTCAGTGTAAAATTGATATATTTCATTCCAACTGAAAATGAGATCGAAACGATGGAAAACCTCGGATATTACTGATTATGGAAAGAAGAGAAAAGAATCGGAAAG TTGTGGATCAAGTTCACCGATTCTCGAAACACAGTCATCTGGCGGTGCGGAACTTGACGAAGTTACTGAGGATGAATATTCTAGTAATTCGAGCAGTAATGAAACTAGCGACGAAGAGGAAAACTCAGAAGTACCAAATGTCTTATCTAT AACAGAAAGAGGTAAGAATTGCGTCTTCTAGTGATCATACTTTTCGCCAGATTCCCTAATGTAATATATTTTGTTGTAGAGAAAAGTTGGCAAAAGTTAACGGAAAACGATTTGGGACGAATTCGTTTCATCTTGAAGTACACTAGCAAT ACTAAAAAATGCGTGAACGAGTATTTTCAATATAATCATGGGCAAAACAATGAAATTATGAAAAGTCTATTATTGGATACCGATGGAACTATGACTGCAAAGGCTTGTTCGGAATGTGCCTACGATTTGAATCAGTAAGTTACTCTCTCG ATTTATTCCCAAAATTAATATGTGCTTCAGGTGCCACTGCAAAAAACCGCTTCGCTTCATCAATGCTCCGTGTGGTTGGTTTGCTATTCAAAACTATAAATAGTTCACTGTTTCCGTTCAGAGGTCATCAACCAAGTTCTTCATGTTGAA AATGCGGAGCCCACCAGGATCAACCATGTAATCGCAACACTCTTCCGGAATCACATTGGCGAGATTTTGTTGGTCCACTCTATTTCTGTGCGAGAACTGTGATAAAACTAGTATTTTCAGCACAAAGGCTCGAACTGCGGAAGCTCGCGC ATCTGAAGAAGCTCAAATCAGGATTCAAATCCAAGACAACTCGAACGCATTCCAAAGATCGTATCATAACGATCCACAACCTTCATCAGCCGAAGAACATGAGGAAGATATCGTGGTGGATGGCTGAGTACGGAGCTCAAATGCCTTAAG GCGAAACAATTGGTTTTTTAATTTGCTGGTTATCATGTTAGATTTTGAACGTGTTAGGTCTTTCAATTGTTTTTTTTTTTCGAAATGTTGTTGTTCTAATAAATTTGTTTTATTTAATCAAACGTTTTTTAGTCTACTACGGGCGTGAAG CCAGATATCAGTGGTATCTTCTTATCAGAAGCTGAATCATTTCCGGTTGACAATGTTTGAAGGACATAAGAAAGGCTGTGTTACTGATTTCGACCATTGATTTGTTTATATATGGATATGTTCCACTGCCTTTTGGAAAGGCAGTATTCC CGGTATATATGGGCCTAATACGGAATCTAAAATAACCTGACACAAACCTGACGTTGACCTGTTGCCGGCCCGCGGCGGCTTAGTGTCAACTTGACAGCGGGTCGCGATTTCACCTGCCAGTTGTTCTCCATTCAGCAGCCAGCGACCTGC TGGCAGGTTGCCACTAACCTGACGCGGTTTACCTGTGTTATCGGCGCGTGCATAGCTTAGTGGTTTCAGGAAATGATGCTAGTAATCAGAAGATCGGGGTTCGGGAAACGGCAGGGGCTTGAAGGTTAGGTTCTATGAAGCAGGGCGAAG 
GGTTGACAAGGAGAGGCAATAAGCAAGTAGTAGGGGTTCTCTAGAAAACATTTTTGTCTTTAATATGCGTTTCCTACTGATTTATTATTGATATTTGGATCCCCTTTTCTAGAAAAAAAAATCAGAATCAGCAGAAAAATTTGAGAAAAA GTCATAGCAAATCAGAGTTGGTCAGAGTAAATCAGAGCTAGTCATAGTAAATCATAGCTAGTCAGAGAATATCAGAGTTAATCAGGGTAATAAGTAGACCTAGTCATAGTAAATCAGAGCTAGGCATAGTAAAGCGTGGTTACTCCGAGT AAAACCACACTTGCACCGAACTGCGGTTAGTGTGCTTTACCATTATGTAACTCCGCTTTTTACTCTGAGTTAGTATGATATGGTTTGTCTGAGCTGTGGTTGGGCTTCGCGGGAAACTTGAATAATTCGAGACAAAATCTAATTTTAGCG AATTTTCTTTAATTTCTTTGAGGTTTCTACGACAGAACTCGAAAAATTTCGGGTTTTAATGTTTACACATTTTATTTAAAATTGAATAATCAACTGCGGGACTCCTCGAAAATCACATGCTCATTTAAATTTTGAAGTTCAAACCTCAAA AAACGCGCAAAAACCAAATTCAGCTAGGATATCAAATTTATGATTGAAATCTATATTTTGATGCGGTGTTTCTGAAGTTTTCGCGATAAAATCCGAATAATAATTCCACGTACCGTATATTCTCTATCTAATTTCCAGGTCATTTTTTAA TGCAGCACTATTAGAGACTGTCGTACTACTGGAGACTGCAGCATTAATTTTCGAACGGCTACTGTCAATTATAGATCACTAGTATTTAGTCACAAAAGCTAATTTTTTAAGCAGAAATTCATAAAAATGTTTTCAATATTGCGAACTTTT GTAACAAAAAGACCCAGTAATTCAATTACTTTCGTAAATTATCAAAAAATCATCAAAAATATACAAAAAAATACCAAAAAATATTGAAACTTTCAAGTGACTCTTTCAATAGAAAATGGGGTGCAGCACTAATAGAGACTGCTGCACTAT TTTTCGGACCCTTTTTGAATGCAGCACTATTAGAGACTGCAGTATTTACTACTGGAGATGCAGCACTAATAGAGAATATACGGTATATACGTAATATATTCTTGCAGAAAAAAGTACGATTATCAATGAAAAATAGCTGATAAGAGGCTT TTGTTTGAACTAACAGACGGAACGACTCCGGTTTAGTTCAAAAAATTCTAAAAACACGTTGTGTCAGGCTGTCTCATTGCGGTTTGATCTACGAAAAATGCGGGAATATTTTTCCAGAAAAATTGTGACGTCAGCACGCTCTTAACCATG CGAAACGAGATGAGATGTCTGCGTCTCTTTTCCCGCATTTTTCGAAGATCAAAACGAATGGGACTTTCTGACTCCACGTGTAAAAAGGGGTTACGACGGACCCTGGCCTAGAAATTAGGCGTGAAAATTCTCGGGCACTGGATGTAGTGA ACGCCCGCGATGAAAAATTGGGGGAAAATTAGGCTTTCTTTGCGAGAAAGATTAATTAAAAATGTTTTCCTTTGTCGAAAATAATTTTTAAAAAACACACCACGTGTATTCAGCTCGACCAACGCCTCGAAAATTTTCAAAAAAGGCGGG AAAAATTAGTTGAATTCGCCAAGAGGAATTTCACCGCAGCGCGTGCAAAAATTTCAGCATTTGCGCGTGACGGTGTTTGCACAAATTACACCGAATGGTCGAGCTGAAAACACGTGCACACTTTTAAATAAAACTAGAAAATAAATCCCA GGCCTGCAAATATTGCACACAAAACCGTAATCCCCTTCGCGCTAAACAACACGCGCAACGATGCTCCGCTTGGGGACAAGGAAAAATTAATTTAACTCGGGATTTTCATTAAAAAATTAGGTTTTTAGTTAATTTTTCGATGTTTTCACT 
GCGAAAAAGTGTTAAAATAACGATTTTTCAACCTATTTTCAATTAATCCGTGCAAAAAATCGTGTATTTCTCGAGTTTTGAAAGAAATTTATGAAAATCGGCATTTTTAATAATGGTTTTTCAAATAAAAATATAATTTTTCGGTGCAGA AAAGTCGTTGCTCGTACAGTTTTTTTAAAGCATTTTCACATCAAAATCCTCCATTTTTCCAGTAAATCGATATGGAGTGCGACGAGACAAAGCTGAGCGACGGCGCAAGCGGCTGGGTGCCGAGTATCCCGACAGATATCGATTCAAAAG ACACACCGTTGCTCGATATATCTTCTCAGGCGATTTGGGCGCTTTCCAGTTGTAAAAGCGGTAAATTTTCCGACTTTCAAGGGAGAAAAGTGTAGAAAAATCGAAATTACTTCTTAAAAATCTCGTAAAAATCGAATTCTTTCAGGATTC GGCATCGACGAGCTCCTATCCGACAGTGTTGAGAAATATTGGCAAAGCGATGGCCCGCAGCCGCACACGATTCTTCTAGAATTCCAGAAAAAGACCGACGTGGCTATGATGATGTTCTATTTGGATTTTAAAAACGACGAGTCTTATACA CCGTCAAAGTTAGCATTTTTGGCTTTTTCAAACGAAAAAATACAATGAAACACTGAATATCTAGTTTTTTTCTCAATTTTTGCCTAAAAAACGGCGATTTTTCACTAGCTTTTCAATTAAAATTTGAACAAAAAGTTTTTTAAAGGAAAA ACATGAATTTCTAGCTTTTTCAGAGGTTTTCTATTAAAAAATAGAGATTTTTGTGATATCTGACTGAAAAATTACCAAACTGTCGATTTTTTTAAACTATTTTTCACTTAAAATCTGCAATTTTTTTTTTCGAGGAAACATGTGAATTTC AAGCTTTTTCAGAGATTTTCTATGAAAAAGGTTCGTGCCGAGACCCATGTGCTTTTAAACTTCAGAATTTTCCCAATTTTGAAATTAAAAAGAGAATGAAAATTGATTTTCATGGAAAAATGCGTTTTTGGCCCAAAACCTCCAAAAAGT ACAAATATAGGTCGACTTTCAACTGTTTTAGATCAATTTTTTTGCAGAATTCAAGTAAAAATGGGTTCATCTCACCAGGATATATTTTTCCGTCAAACACAAACATTCAACGAGCCCCAGGGATGGACATTTATCGATTTACGCGACAAA AATGGGAAACCGAATCGCGTTTTTTGGCTTCAAGTACAAGTTATTCAGAATCATCAAAATGGGAGAGATACTCATATAAGGTAGAGGAATTGAGAATTTCAGAACGAAAATTGCCGAAAAAATGAAATTTTAGCGAATTTGAGTCGGAAA TTTCGAAATTTGATTGATTTTAAGCAAATTTCCAACTAAAATCTTGAAAATTTGATCTTTTTAGATAAATTTTTTTTTAATTTTGTGCTTTTCAAAAAACCTCAAAAAACAATTAAAAATTGAAGTAAAATTAATTTTTCAACAATTTTT GAAAGGCCGAATTTTTGATTGAAAATTTTCACAATTTGTCCATTTTGTGGTGGGGCTTATTCCGAAAAATCGTTGTTTTTTTTTTCAAAAAAGTTATAAAAACTTTAAAATTGCCATGTAAAATATGTTTATTCTCAGACCTCGTAGGCA CGAAGCAGGCGTAGGTCGCCTCGCAATAAATTTGAAAATCTCAAGAAAAATCAATAAATTTGTGATTAATCAAAAAAATTTAATTTCCTGGTCCCAGCACGAATGCTATTTTTCGAAAAAAAAAAAGAGGCGAGCCTAATATAGACCACG CCCACAAAATGGGCAAAAGTTTGATTTTTCAAAAAATCGAAACAAAAATTTTTCCAATTTTGTGAGATTTTAAAATTTCCGGTTTTTGGAAAATCGAAAAAAAATTTCTCGTTTTTTAATTTTCAAAAAAAATTGTGCCTAAAATTCAAA 
AAAAAAATCAATACTTTCTCAAAATTTCCAGAAAACAGTCCATTTTCCAGGCACGTTCGAGTCCTTGGACCCCAGCGATCTCGTGTCTCCACAACGAATCGAATATTCACCGGAGAACCACACGGACCGATTCCCGATAAAAATATCACT AATTTCGACGACGAGGATTTTGCCAATTTTATCGATCACTCACTTGTTCACTTATCACTTCGTTAAATTTACCTCCAGTGATTCCAGATAATGAGCCAGTTTTGCATTGAAATTTAGTGCCAAAATATAGAAAATCGCATGATTTAACAT AAAATAGCGTTTCGAATTGAAACAATGGAAAAAAAGTGCTATGATGATTTTTTAACACTTTTAATTGTTCCAATTTGAAGTAAAATCTATTTTCAGATAAATCAACTGATTTTCTATATTCTGCCACTAAAGCTTAAAAACTTGCCCTGC TGTCCTAACCTTCAAATTGTTCCCTGCAAATTTTATTATTCTTGTTTCATATTTTTGCGATTGCTTCGCGAGACCCAAACTCACACATTTACCTGTAAAATATAATCGAATAATTATTTATATATTTTCTGTAAATTTCCTTAGTATACT ATAAATTTTCTGATCTCTCTTCAAAAATCGCTAGAAAAAATAAACAAATGTCGGTTTAAAAATTCCTGGTAATTTACCTTCTATAGAAAATTTTTCGAAAAAAAAACCGAAGAAATTCAGATGGAAATTCCCGATCCCGAACTGCCGGGA ATACCGATTGATCCGCAAGATTTGGAGATTCTAGACACGCCCACACGGTTTTACGAGAAGCTTTTAGTGCGTTTTTCGTGTCGGGACCCGGAAATTTGACATTTTTGGCGCGCGGCTTGTTAGACTCCAAACCTTTTCAAAGATTTTTTT TTCGAATTAAATAACATTCGTGCTTGGGCCCGGAAATTGAATTTTTGATTTGAAAACAATTTTTTTTGAGTCCAAAATTTTCAAAGTTTGTCCATTTTTGGCGCGTGGCCTAGTAGGATCCGCCCCTTCTAAATTTTTTTTGAGCAAGTT TTCTGAAGCATTGATTTCAAAAATTTTTTTTGGAAATTTCTGGTTTATTTTTCCGGTTTTTTTCCGAGTTGCTGTTTAAGTTTGGAGAAATTCCAGAATTTGTCAATTTTTGGGGCGTGGCTTTTTCAGTAAGCACAGTTTTTTTTTTTT GAAAAATTGAAATTTTCGCGGTGCGGTTCAAGAAAAACCACAAAAACTCAATGATTTTTTAACGAAAATTTCAAATTTCTTGCAAGACCTACTGCAATTTCGATTTTTAGAAACTTTTTGAAAAAAATCCGAATTTTCTGATTTAGCCCC GCCCCAAAAATGGAAAGATTTCCGAAAATTCGAACCAAAAGTTCGCAAAAACTTGAATTTCTCTCACACAGATTGACGCGCTAATTTGAATTTTTCCAAAAATAAGCCCCGCCCCAAAAATGGACAAATTTTAAAAATTTTGAACCAAAT AAATTCAATTTTTTTTCGCTTTTTTCCGTTTTCGAACAAAAAATTCTAAAAATATATGGTTCTAGGCGGGGCTCAGGCACCCATCTACCTACTTAAAAATGCGTTAAATTTCAGGAATTAACTGCATCAACCGAACGGCGTCTCGCATTG TGTAGTCTGTATTTGGGCGAAGGAGATCTCGAAAAAAATCTGATCGCTGCGATCCGAGAAAGATCCGAAAAATCCGAGATTGAAGTGACGATTCTGTTGGATTTTTTGCGCGGAACACGGACCAATTCAAGCGGCGAAAGTAGTGTAACA GTGCTGAAACCTATTTCGGAAAAGTCAAAAGTTGGTTTTTTTTGCAAAAAAAAATCGATAAATCGATAAAAACCGACAATTTTGAGAATTTTCATTTCAAATTTGAGTCCCACATGCGCCTTTAAATATGGTGTACTGTAGTTTTAGCTC 
GAATGTTGAATTTCAAAAATTGAGAATAAAGAAATGTCGTGACGAGACCCACAAATGTTTTGAAAAAAATTTTCAATTTCAAAAAAATGTAAAAAATTGGGAATTTCCCTCCAAAAGTTAAATTGGTTTAGTCACAAACTTTGAAATTTT GAAATAAAATTTTTTTCGGCTAAAAATAAGTATTTTTTAAAAACTATTTTGAAGAAAAAAAGTTAGGTCTCGCCACGATGTATCTTGTATATGTGTATCTAAATTGCCATGTCGTGACGAGACCCTCTCATATTTTACACTGCAACTTTT TCCTCACGAGGGACGAGGAAAAGTGGTTTCTAGGCCATGGCCGAGGGGCCGACAAGTTTCATCGGCCATTTATCTTGCTTTGTTTTCCGCCTGTTTTCTTTCGTTTTTCACAGCTTTTTCCCATTTTTTCTTATTAAAACTGATAAATAA ATATTTTTGCAGATGCCAAAACGATTTTCAAGTAAAAAAATCATGTATTCAGTGGGCAAGCAGCGGTGAAAGTGGGCATTGTAATATGATGGATTACGGGAATACAAAACCTAAACTTTTTCTGAAACATGATACATATGATGCTTAAAT GCTGAGACTACCTGATTTTCATAACGAGACCGCTGAAAAAGTTTTGAGGTTTTCAAAATTCAACTTTTTGTGCGAAAATCTCGACTTTTTCACCGAAAAAGTTGAATTTTGGAAACCTCAAAACTTTTTCAGCGGTCTTGATATGAAAAT CAGGTAGCTTCAGCATCTAAGCAGCATATGTATCATGTTAAAGAAAAAGTTTAGGTTTTGTATTCCTGTAATCCATCATATTACATTGCCCACTTTCACCGCTGCTTGCCCACTGAATACATAATTTTTTCACTTGGAAATTGTTTTAGC Gunnar Rätsch (FML, Tübingen) Large Scale Sequence Analysis October, 23, 2009 7

GAATGTTGAATTTCAAAAATTGAGAATAAAGAAATGTCGTGACGAGACCCACAAATGTTTTGAAAAAAATTTTCAATTTCAAAAAAATGTAAAAAATTGGGAATTTCCCTCCAAAAGTTAAATTGGTTTAGTCACAAACTTTGAAATTTT GAAATAAAATTTTTTTCGGCTAAAAATAAGTATTTTTTAAAAACTATTTTGAAGAAAAAAAGTTAGGTCTCGCCACGATGTATCTTGTATATGTGTATCTAAATTGCCATGTCGTGACGAGACCCTCTCATATTTTACACTGCAACTTTT TCCTCACGAGGGACGAGGAAAAGTGGTTTCTAGGCCATGGCCGAGGGGCCGACAAGTTTCATCGGCCATTTATCTTGCTTTGTTTTCCGCCTGTTTTCTTTCGTTTTTCACAGCTTTTTCCCATTTTTTCTTATTAAAACTGATAAATAA ATATTTTTGCAGATGCCAAAACGATTTTCAAGTAAAAAAATCATGTATTCAGTGGGCAAGCAGCGGTGAAAGTGGGCATTGTAATATGATGGATTACGGGAATACAAAACCTAAACTTTTTCTGAAACATGATACATATGATGCTTAAAT GCTGAGACTACCTGATTTTCATAACGAGACCGCTGAAAAAGTTTTGAGGTTTTCAAAATTCAACTTTTTGTGCGAAAATCTCGACTTTTTCACCGAAAAAGTTGAATTTTGGAAACCTCAAAACTTTTTCAGCGGTCTTGATATGAAAAT CAGGTAGCTTCAGCATCTAAGCAGCATATGTATCATGTTAAAGAAAAAGTTTAGGTTTTGTATTCCTGTAATCCATCATATTACATTGCCCACTTTCACCGCTGCTTGCCCACTGAATACATAATTTTTTCACTTGGAAATTGTTTTAGC Gunnar Rätsch (FML, Tübingen) Large Scale Sequence Analysis October, 23, 2009 7

Quasi-Standard: Hidden Markov Models
Model sequence content:
- One state per segment type
- Allow only plausible transitions
- Content statistics at each state, derived from known genes
Prediction: given DNA, find the most likely state sequence.
\[ p(x, y) = \prod_{i=1}^{L-1} p(x_i \mid y_i)\, p(y_{i+1} \mid y_i) \]
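The factorization above can be sketched in a few lines of Python. This is only an illustration: the three states and all emission and transition probabilities below are made up, not taken from the talk. Transitions missing from the table count as implausible and get probability zero, so any state sequence that uses them scores minus infinity.

```python
import math

# Hypothetical emission probabilities p(x_i | y_i), one row per segment type.
EMIT = {
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon":       {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "intron":     {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}

# Hypothetical transition probabilities p(y_{i+1} | y_i).
# Entries that are absent (e.g. intergenic -> intron) are forbidden.
TRANS = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon":       {"exon": 0.8, "intron": 0.1, "intergenic": 0.1},
    "intron":     {"intron": 0.9, "exon": 0.1},
}


def log_joint(x, y):
    """log p(x, y) = sum_{i=1}^{L-1} [log p(x_i|y_i) + log p(y_{i+1}|y_i)].

    Follows the indexing of the formula above: the product stops at L-1,
    so the last emission and an initial-state term are not included.
    """
    if len(x) != len(y):
        raise ValueError("sequence and state path must have equal length")
    total = 0.0
    for i in range(len(x) - 1):
        p_emit = EMIT[y[i]][x[i]]
        p_trans = TRANS[y[i]].get(y[i + 1], 0.0)
        if p_emit == 0.0 or p_trans == 0.0:
            return -math.inf  # implausible path
        total += math.log(p_emit) + math.log(p_trans)
    return total
```

Working in log space avoids numerical underflow on long sequences, which matters once L reaches thousands of nucleotides.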

Generative Models
Hidden Markov Models [Rabiner, 1989]
- State sequence treated as a (first-order) Markov chain
- No direct dependencies between observations
\[ p(x, y) = \prod_{i} p(x_i \mid y_i)\, p(y_i \mid y_{i-1}) \]
[Figure: graphical model with hidden states Y_1, Y_2, ..., Y_n and observations X_1, X_2, ..., X_n]
- Maximum likelihood estimation: \( \max_{\theta} \sum_{n=1}^{N} \log p(x_n, y_n) \)
- Decoding: Viterbi/dynamic programming (DP) algorithms
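When every training sequence comes with its state path, the maximum likelihood problem above has a closed-form solution for categorical emissions and transitions: the estimates are relative frequencies. A minimal sketch under that assumption, with single-character state names and no smoothing (both choices are illustrative, not from the talk):

```python
from collections import Counter, defaultdict


def estimate_hmm(pairs):
    """Maximum likelihood estimates from fully labeled pairs (x_n, y_n).

    Returns (emit, trans) where emit[state][symbol] and trans[state][next]
    are relative frequencies of the observed counts.
    """
    emit_counts = defaultdict(Counter)
    trans_counts = defaultdict(Counter)
    for x, y in pairs:
        for i, (symbol, state) in enumerate(zip(x, y)):
            emit_counts[state][symbol] += 1
            if i + 1 < len(y):
                trans_counts[state][y[i + 1]] += 1

    def normalize(counts):
        # Divide each row by its total so that every row sums to one.
        return {
            state: {key: c / sum(row.values()) for key, c in row.items()}
            for state, row in counts.items()
        }

    return normalize(emit_counts), normalize(trans_counts)
```

For example, `estimate_hmm([("ACGT", "EEII"), ("AA", "EI")])` uses "E" and "I" as hypothetical exon and intron states. In practice one would add pseudocounts so that unseen events do not get probability zero.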

Discriminative Models
Conditional Random Fields [Lafferty et al., 2001]
- Model the conditional probability p(y | x) instead of the joint probability p(x, y)
\[ p_w(y \mid x) = \frac{1}{Z(x, w)} \exp\big(f_w(y \mid x)\big) \]
[Figure: graphical model with states Y_1, Y_2, ..., Y_n and input X = X_1, X_2, ..., X_n]
- Can handle non-independent input features
- Maximum likelihood estimation: \( \max_{w} \sum_{n=1}^{N} \log p_w(y_n \mid x_n) \)
- Decoding: Viterbi or MEA algorithms [Gross et al., 2006]
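The normalizer Z(x, w) sums over all labelings, yet for a linear chain it can be computed by a forward recursion. A minimal sketch, assuming that f_w(y | x) decomposes into per-position and per-transition weights; the two labels and all weights below are made up for illustration, and the brute-force function is only there to check the recursion on short inputs.

```python
import itertools
import math

LABELS = ("E", "I")  # hypothetical label set, e.g. exon and intron


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def score(x, y, node_w, edge_w):
    """f_w(y|x) for a linear chain: position weights plus transition weights."""
    total = sum(node_w[(lab, sym)] for sym, lab in zip(x, y))
    total += sum(edge_w[(a, b)] for a, b in zip(y, y[1:]))
    return total


def log_partition(x, node_w, edge_w):
    """log Z(x, w) by the forward recursion, O(len(x) * |LABELS|^2)."""
    alpha = {lab: node_w[(lab, x[0])] for lab in LABELS}
    for sym in x[1:]:
        alpha = {
            b: node_w[(b, sym)]
            + logsumexp([alpha[a] + edge_w[(a, b)] for a in LABELS])
            for b in LABELS
        }
    return logsumexp(list(alpha.values()))


def log_partition_brute_force(x, node_w, edge_w):
    """Same quantity by enumerating all |LABELS|^len(x) labelings."""
    return logsumexp([
        score(x, y, node_w, edge_w)
        for y in itertools.product(LABELS, repeat=len(x))
    ])


def log_prob(x, y, node_w, edge_w):
    """log p_w(y|x) = f_w(y|x) - log Z(x, w)."""
    return score(x, y, node_w, edge_w) - log_partition(x, node_w, edge_w)


# Toy weights, made up for illustration.
NODE_W = {(lab, sym): 0.0 for lab in LABELS for sym in "ACGT"}
NODE_W[("E", "C")] = NODE_W[("E", "G")] = 1.0
NODE_W[("I", "A")] = NODE_W[("I", "T")] = 0.5
EDGE_W = {("E", "E"): 0.8, ("E", "I"): -0.5, ("I", "E"): -0.5, ("I", "I"): 0.8}
```

Unlike the HMM, the weights here are unconstrained real numbers; only the division by Z(x, w) turns the scores into probabilities.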

Max-Margin Structured Output Learning
- Learn a function f(y | x) that scores segmentations y for x
- For prediction, maximize f(y | x) w.r.t. y: \( \operatorname{argmax}_{y \in \Upsilon} f(y \mid x) \)
- Idea: \( f(y \mid x) \gg f(\hat{y} \mid x) \) for wrong labels \( \hat{y} \neq y \)
Max-margin parameter estimation [Altun et al., 2003]:
- Given N sequence pairs \( (x_1, y_1), \ldots, (x_N, y_N) \) for training
- Solve the optimization problem
\[ \max_{f,\ \rho \in \mathbb{R},\ \xi \in \mathbb{R}^N_+} \ \rho - C \sum_{n=1}^{N} \xi_n \]
\[ \text{subject to} \quad f(y_n \mid x_n) - f(y \mid x_n) \ge \rho - \xi_n \quad \text{for all } y \in \Upsilon,\ y \ne y_n,\ n = 1, \ldots, N \]

Structured Output Learning with Φ(x, y)
Assume \( f(y \mid x) = \langle w, \Phi(x, y) \rangle \), where \( w, \Phi(x, y) \in \mathcal{F} \) and \( \|w\|_2 = 1 \):
\[ \max_{\|w\| = 1,\ \rho \in \mathbb{R},\ \xi \in \mathbb{R}^N_+} \ \rho - C \sum_{n=1}^{N} \xi_n \]
\[ \text{subject to} \quad \langle w, \Phi(x_n, y_n) - \Phi(x_n, y) \rangle \ge \rho - \xi_n \quad \text{for all } y \in \Upsilon,\ y \ne y_n,\ n = 1, \ldots, N \qquad (1) \]
- Linear classifier that separates true from wrong labelings
- Representer Theorem: \( w = \sum_{n=1}^{N} \sum_{y \in \Upsilon} \alpha_{n,y}\, \Phi(x_n, y) \)
- Corollary: (1) depends only on inner products of the Φ's
- Define the kernel function \( k((x, y), (x', y')) = \langle \Phi(x, y), \Phi(x', y') \rangle \)
- Kernel trick: everything depends only on k
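The corollary can be checked numerically on a toy joint feature map. In this sketch Φ(x, y) simply counts (label, symbol) pairs and label transitions; that choice, and the coefficients in the usage example, are assumptions made for illustration and not the feature map used in the talk.

```python
from collections import Counter, defaultdict


def phi(x, y):
    """Toy joint feature map: counts of (label, symbol) and (label, label)."""
    feats = Counter()
    for sym, lab in zip(x, y):
        feats[("emit", lab, sym)] += 1
    for a, b in zip(y, y[1:]):
        feats[("trans", a, b)] += 1
    return feats


def dot(u, v):
    """Inner product of two sparse feature vectors."""
    return sum(val * v[key] for key, val in u.items() if key in v)


def kernel(a, b):
    """k((x, y), (x', y')) = <Phi(x, y), Phi(x', y')>."""
    (x1, y1), (x2, y2) = a, b
    return dot(phi(x1, y1), phi(x2, y2))


def expand_w(alphas):
    """w = sum over (alpha, x_n, y) of alpha * Phi(x_n, y)."""
    w = defaultdict(float)
    for alpha, x_n, y in alphas:
        for key, val in phi(x_n, y).items():
            w[key] += alpha * val
    return w


def score_primal(alphas, x, y):
    """f(y|x) = <w, Phi(x, y)> with w expanded explicitly."""
    return dot(expand_w(alphas), phi(x, y))


def score_dual(alphas, x, y):
    """Same score using kernel evaluations only; w is never formed."""
    return sum(alpha * kernel((x_n, y_n), (x, y)) for alpha, x_n, y_n in alphas)
```

With `alphas = [(0.5, "ACG", "EEI"), (-0.25, "ACG", "EII")]`, both `score_primal(alphas, "ACG", "EEI")` and `score_dual(alphas, "ACG", "EEI")` give the same value, which is the point of the kernel trick: the dual form works even when Φ is too large to write down.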

Optimization Strategy
\[ \max_{\|w\| = 1,\ \rho,\ \xi} \ \rho - C \sum_{n=1}^{N} \xi_n \]
\[ \text{subject to} \quad \langle w, \Phi(x_n, y_n) - \Phi(x_n, y) \rangle \ge \rho - \xi_n \quad \text{for all } y \in \Upsilon,\ y \ne y_n,\ n = 1, \ldots, N \]
Big: one constraint per example and wrong labeling.
Iterative solution:
- Begin with a small set of wrong labelings
- Solve the reduced optimization problem
- Find labelings that violate constraints
- Add constraints, re-solve
Guaranteed convergence:
- \( \|\cdot\|_2 \): see Tsochantaridis et al. [2005]
- \( \|\cdot\|_1 \): see Rätsch and Warmuth [2005] (\( \log(n)/\epsilon^2 \) iterations)

How to find violated constraints?
- Constraint: \( \langle w, \Phi(x_n, y_n) - \Phi(x_n, y) \rangle \ge \rho - \xi_n \)
- Find the labeling y that maximizes \( \langle w, \Phi(x_n, y) \rangle \)
- Use dynamic programming decoding: \( y^* = \operatorname{argmax}_{y \in \Upsilon} \langle w, \Phi(x_n, y) \rangle \)
  (DP only works if Φ has a certain decomposition structure)
- If \( y^* = y_n \), compute the second-best labeling as well
- If the constraint is violated, add it to the optimization problem
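The decoding step, including the second-best labeling, can be sketched as a two-best Viterbi recursion. The sketch assumes the simplest decomposition structure, in which the score ⟨w, Φ(x, y)⟩ splits into per-position scores `node[i][s]` and per-transition scores `edge[a][b]`; these array names and the function names are illustrative, not from the talk.

```python
import heapq


def two_best_viterbi(node, edge):
    """Return up to two (score, labeling) pairs, best first.

    node[i][s]: score of label s at position i.
    edge[a][b]: score of moving from label a to label b.
    """
    n_states = len(node[0])
    # best[i][s] holds up to two entries (score, prev_state, prev_rank),
    # sorted by decreasing score, for paths ending in state s at position i.
    best = [[[(node[0][s], None, None)] for s in range(n_states)]]
    for i in range(1, len(node)):
        layer = []
        for s in range(n_states):
            candidates = [
                (score + edge[a][s] + node[i][s], a, rank)
                for a in range(n_states)
                for rank, (score, _, _) in enumerate(best[-1][a])
            ]
            layer.append(heapq.nlargest(2, candidates, key=lambda c: c[0]))
        best.append(layer)

    finals = [
        (score, s, rank)
        for s in range(n_states)
        for rank, (score, _, _) in enumerate(best[-1][s])
    ]
    results = []
    for score, s, rank in heapq.nlargest(2, finals, key=lambda c: c[0]):
        labels, i = [], len(node) - 1
        while s is not None:  # follow back-pointers to the first position
            labels.append(s)
            _, prev_s, prev_rank = best[i][s][rank]
            s, rank, i = prev_s, prev_rank, i - 1
        results.append((score, labels[::-1]))
    return results


def best_wrong_labeling(node, edge, y_true):
    """Highest-scoring labeling that differs from the true labeling y_true."""
    for score, labels in two_best_viterbi(node, edge):
        if labels != list(y_true):
            return score, labels
    return None  # only one labeling exists
```

If the returned score exceeds the score of the true labeling minus the current margin ρ − ξ_n, the corresponding constraint is violated and would be added to the working set.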

Column-Generation Algorithm
1. \( \Upsilon_n^1 = \emptyset \) for n = 1, ..., N
2. Solve
   \[ (w^t, \rho, \xi^t) = \operatorname{argmax}_{\|w\| = 1,\ \rho,\ \xi} \ \rho - C \sum_{n=1}^{N} \xi_n \]
   \[ \text{subject to} \quad \langle w, \Phi(x_n, y_n) - \Phi(x_n, y) \rangle \ge \rho - \xi_n \quad \text{for all } y \in \Upsilon_n^t,\ y \ne y_n,\ n = 1, \ldots, N \]
3. Find violated constraints (n = 1, ..., N):
   \[ y_n^t = \operatorname{argmax}_{y \in \Upsilon,\ y \ne y_n} \langle w^t, \Phi(x_n, y) \rangle \]
   If \( \langle w^t, \Phi(x_n, y_n) - \Phi(x_n, y_n^t) \rangle < \rho - \xi_n^t \), set \( \Upsilon_n^{t+1} = \Upsilon_n^t \cup \{ y_n^t \} \)
4. If a violated constraint exists, go to 2
5. Otherwise terminate (optimal solution)

Problems
- Optimization may require many iterations
- Number of variables increases linearly
- When using kernels, solving the optimization problems can easily become infeasible
- Evaluation of \( \langle w, \Phi(x, y) \rangle \) in dynamic programming can be very expensive
Optimization and decoding become too expensive; approximation algorithms are needed.
Idea: decompose the problem
- First part uses kernels and can be precomputed
- Second part works without kernels and only combines the ingredients
- Solve problems with 10,000 examples instead of just 500

Back to Computational Gene Finding
- Given a piece of DNA sequence
- Predict gene products, including the intermediate processing steps
[Figure: processing from DNA via pre-mRNA and mRNA to protein, annotated with the signals TSS, splice donor and acceptor sites, polyA/cleavage site, TIS, Stop, cap and polyA]
- Predict signals used during processing
- Predict the correct corresponding label sequence with labels "intergenic", "exon", "intron", "5' UTR", etc.

Example: Splice Site Recognition
[Figure: sequence CT...GTCGTA...GAAGCTAGGAGCGC...ACGCGT...GA with true and potential splice sites marked; a 150-nucleotide window is taken around each dimer]
- True sites: fixed window around a true splice site
- Decoy sites: all other consensus sites
- Millions of examples from EST databases

Example: Splice Site Recognition
[Figure: potential splice sites in CT...GTCGTA...GAAGCTAGGAGCGC...ACGCGT...GA; 150-nucleotide window around the dimer]
Basic idea: for instance, exploit that exons have higher GC content, or that specific motifs appear near splice sites. [Sonnenburg et al., 2007b]

Example: Splice Site Recognition
[Figure: potential splice sites in CT...GTCGTA...GAAGCTAGGAGCGC...ACGCGT...GA; 150-nucleotide window around the dimer]
In practice:
- Use one feature per possible substring (e.g., up to length 20) at all positions
- \( 150 \cdot (4^1 + \ldots + 4^{20}) \approx 2 \cdot 10^{14} \) features
- Needs efficient algorithms
- Leads to the most accurate predictors that currently exist
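The feature count quoted above is quick to verify. The sketch below follows the slide's arithmetic, which multiplies the number of substrings of length 1 to 20 by all 150 window positions (it does not subtract positions near the window end where a long substring would not fit); the function name is illustrative.

```python
def n_substring_features(window=150, max_len=20, alphabet=4):
    """Number of (position, substring) features for substrings of length 1..max_len."""
    per_position = sum(alphabet ** k for k in range(1, max_len + 1))
    return window * per_position


count = n_substring_features()
print(f"{count:.2e}")  # on the order of 2 * 10^14
```

A feature space of this size cannot be stored densely per example, which is why the implicit (kernel) and sparse representations on the following slides matter.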

Kernels and String Data Structures
- Good news: use the WD (weighted degree) kernel to consider these features implicitly [Rätsch et al., 2007, Rätsch and Sonnenburg, 2004]
- Drawback: training was too expensive for millions of examples
- Needs more sophisticated algorithms:
  - Exploit that features are sparse and can be explicitly represented: string indexing data structures [Sonnenburg et al., 2007a]
  - Use novel optimization techniques: subgradient-based optimization [Franc and Sonnenburg, 2009]
- Implemented in a freely available software package: www.shogun-toolbox.org [Sonnenburg et al., 2007a]
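The slides do not spell out the WD kernel, so the following is a sketch under stated assumptions rather than the talk's implementation: it counts substrings of length 1 to D that match at the same position in both sequences and weights length d by β_d = 2(D − d + 1)/(D(D + 1)), the weighting commonly given for the weighted degree kernel in the cited literature. It is a naive reference version, not the fast string-index implementation in the toolbox.

```python
def wd_kernel(s, t, degree=3):
    """Naive weighted degree kernel between two equal-length sequences.

    Sums, over substring lengths d = 1..degree, the number of positions at
    which s and t share the same length-d substring, weighted by beta_d.
    """
    if len(s) != len(t):
        raise ValueError("the WD kernel compares sequences of equal length")
    total = 0.0
    for d in range(1, degree + 1):
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        matches = sum(
            1 for i in range(len(s) - d + 1) if s[i:i + d] == t[i:i + d]
        )
        total += beta * matches
    return total
```

For example, `wd_kernel("ACGT", "ACCT", degree=2)` compares three matching single nucleotides and one matching dimer. The cost is O(L · D) per kernel evaluation, which explains why training on millions of windows needs the more sophisticated algorithms listed above.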

Results on Splice Site Recognition
auPRC (%) for acceptor (Acc) and donor (Don) splice sites:

                 Worm          Fly           Cress         Fish          Human
                 Acc    Don    Acc    Don    Acc    Don    Acc    Don    Acc    Don
Markov Chain     92.1   90.0   80.3   78.5   87.4   88.2   63.6   62.9   16.2   26.0
SVM              95.9   95.3   86.7   87.5   92.2   92.9   86.6   86.9   54.4   56.9

[Sonnenburg, Schweikert, Philips, Behr, Rätsch, 2007]
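The table reports the area under the precision-recall curve (auPRC). The talk does not say how the area was computed; a minimal sketch, assuming average precision as the estimator (one common choice) and ignoring tied scores:

```python
def average_precision(scores, labels):
    """Average precision: mean of the precision values at each true positive.

    scores: classifier outputs, higher means more likely a true site.
    labels: 1 for true sites, 0 for decoys.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / n_pos
```

Precision-recall measures suit this task because decoy sites vastly outnumber true sites, so accuracy or ROC-based scores would look good even for weak predictors.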

Human Splice Sites with WD Kernel
- Performance (= area under the precision-recall curve) keeps increasing.
- It pays off to use all available data!
[Sonnenburg et al., 2007b]

Example: Predictions in UCSC Browser
- Signals have to appear in the right order
[Figure: signal order diagram with TSS, TIS, Stop, cleave, Don, Acc]
- Based on known genes, learn how to combine predictions for accurate gene prediction
- This is a problem of predicting structured outputs
[Rätsch et al., 2007, Schweikert et al., 2009c]

Discriminative Gene Prediction (simplified)
[Rätsch, Sonnenburg, Srinivasan, Witte, Müller, Sommer, Schölkopf, 2007]
Simplified model: score for a splice form \( y = \{(p_j, q_j)\}_{j=1}^{J} \):
\[ F(y) := \underbrace{\sum_{j=1}^{J-1} S_{GT}(f_j^{GT}) + \sum_{j=2}^{J} S_{AG}(f_j^{AG})}_{\text{splice signals}} + \underbrace{\sum_{j=1}^{J-1} S_{L_I}(p_{j+1} - q_j) + \sum_{j=1}^{J} S_{L_E}(q_j - p_j)}_{\text{segment lengths}} \]
Tune free parameters by solving an optimization problem.
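The simplified score F(y) can be written out directly. In this sketch a splice form is a list of exons (p_j, q_j); `f_gt[j]` and `f_ag[j]` stand for the donor and acceptor signal features of exon j, and the four scoring functions are passed in as plain callables. How the features are obtained and what shape the scoring functions take are not specified on the slide, so everything beyond the four sums is an assumption for illustration.

```python
def splice_form_score(exons, f_gt, f_ag, S_GT, S_AG, S_LI, S_LE):
    """F(y) for a splice form y = [(p_1, q_1), ..., (p_J, q_J)].

    exons: list of (start, end) pairs, ordered along the sequence.
    f_gt[j], f_ag[j]: donor / acceptor features for exon j (0-based).
    S_GT, S_AG, S_LI, S_LE: scoring functions for donor signal, acceptor
    signal, intron length and exon length.
    """
    J = len(exons)
    # Splice signals: donors follow exons 1..J-1, acceptors precede exons 2..J.
    signals = sum(S_GT(f_gt[j]) for j in range(J - 1))
    signals += sum(S_AG(f_ag[j]) for j in range(1, J))
    # Segment lengths: introns between consecutive exons, then the exons.
    lengths = sum(S_LI(exons[j + 1][0] - exons[j][1]) for j in range(J - 1))
    lengths += sum(S_LE(q - p) for p, q in exons)
    return signals + lengths
```

Because F(y) is a sum over adjacent segments, the best-scoring splice form can be found by dynamic programming, and the free parameters of the scoring functions are what the max-margin training tunes.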

Discriminative Gene Prediction (simplified)
Take-home messages:
- Avoid solving a huge optimization problem by splitting it into two parts
- Exploit state-of-the-art signal predictions
- Predict more accurately than generative models

Results using mGene
- Most accurate ab initio method in the nGASP genome annotation challenge for C. elegans [Coghlan et al., 2008]
- Validation of gene predictions for C. elegans [Schweikert et al., 2009c]:

                          No. of genes   No. of genes analyzed   Frac. of genes w/ expression
  New genes               2,197          57                      42%
  Missing unconf. genes   205            24                      8%

- Annotation of other nematode genomes [Schweikert et al., 2009c]:

  Genome        Genome size [Mbp]   No. of genes   No. exons/gene (mean)   mGene accuracy   Best other accuracy
  C. remanei    235.94              31503          5.7                     96.6%            93.8%
  C. japonica   266.90              20121          5.3                     93.3%            88.7%
  C. brenneri   453.09              41129          5.4                     93.1%            87.8%
  C. briggsae   108.48              22542          6.0                     87.0%            82.0%

- Web service for ab initio gene predictions [Schweikert et al., 2009b]

mGene.web: Gene Finding for Everybody ;-)
mgene.org/web [Schweikert et al., 2009b]

Limitations/Extensions
- Gene finding is (by far) still not perfect
  - New ab initio techniques are more accurate
  - Many genes are mis-predicted (C. elegans: 50%; human: 80%)
- What is missing?
  - Not enough examples for training?
  - Model complexity?
  - Other information?
- The transcriptome is the result of many complex processes
- Current methods also ignore other important information:
  - Chromatin structure and methylation patterns
  - RNA structure and processing regulation
  - ...
- Ab initio methods cannot predict alternative transcripts

Gene Finding + Auxiliary Measurements Is there other information that may be used by cellular processes that can improve prediction results? Preliminary study: (A. thaliana) transcript level (SN + SP)/2 1 mgene (ab initio)... 73.3% 2... with DNA methylation (1 tissue) 76.1% 3... with Nucleosome position predictions 78.0% 4... with RNA secondary structure predictions 76.7% In progress: Study effect of other information sources for human gene prediction Ideally, consider measurements from the same sample [Behr et al., 2009, in prep.] Gunnar Rätsch (FML, Tübingen) Large Scale Sequence Analysis October, 23, 2009 28

Alternative Transcripts: First Steps
- Predictions of alternative splicing
  - Predict novel alternative splicing as independent events
  - Only use sequence information [Rätsch et al., 2005]
- Requires known gene structures: use ab initio predictions

Alternative Transcripts: More Steps
- Combine gene finding with prediction of alternative splicing
- Machine learning challenge: [Zien et al., 2006]
  - Input: DNA sequence
  - Output: Splicing graph
- The mGene approach can be extended to predict simple splicing graphs
[Figure: splicing graph; labels: TSS, TIS, Stop, cleave, Don, Acc; events: exon skip, A-Don, A-Acc]
- But without clean data we cannot train the model...
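As an illustration of the output structure named above, a splicing graph can be sketched as a directed graph whose nodes are exons and whose edges are introns, so that every path corresponds to one transcript. This is only a minimal data-structure sketch under that assumption; it is not the mGene model, and all function names and coordinates are hypothetical.

```python
# Sketch of a splicing graph: nodes are exons, edges are introns.
# Names and coordinates are hypothetical, for illustration only.
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]  # (start, end)


def build_splicing_graph(transcripts: List[List[Exon]]) -> Dict[Exon, Set[Exon]]:
    """Merge transcripts (non-empty, position-sorted exon lists) into one graph."""
    graph: Dict[Exon, Set[Exon]] = defaultdict(set)
    for exons in transcripts:
        for upstream, downstream in zip(exons, exons[1:]):
            graph[upstream].add(downstream)
        graph[exons[-1]]  # register the last exon as a node without successors
    return graph


def enumerate_paths(graph: Dict[Exon, Set[Exon]], start: Exon, end: Exon) -> List[List[Exon]]:
    """List every exon chain from start to end; each chain is one transcript."""
    if start == end:
        return [[end]]
    paths = []
    for nxt in sorted(graph.get(start, ())):
        for tail in enumerate_paths(graph, nxt, end):
            paths.append([start] + tail)
    return paths


def skipped_exons(graph: Dict[Exon, Set[Exon]]) -> Set[Exon]:
    """Exons b with edges a->b, b->c and a->c, i.e. an exon-skip event."""
    skipped = set()
    for a, successors in graph.items():
        for b in successors:
            for c in graph.get(b, ()):
                if c in successors:
                    skipped.add(b)
    return skipped


e1, e2, e3 = (100, 200), (300, 400), (500, 600)
graph = build_splicing_graph([[e1, e2, e3], [e1, e3]])
print(enumerate_paths(graph, e1, e3))
print(skipped_exons(graph))
```

In this toy case the graph encodes two transcripts, one including the middle exon and one skipping it. A compact graph of this kind can represent many isoforms at once, which is why it is a natural prediction target, but the path enumeration grows quickly when several alternative events combine.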

Reconstructing the Transcriptome
[Figure: DNA, mRNA (cap, polyA), protein]
- Directly measure the transcriptome
  - Whole genome tiling arrays
  - Transcriptome sequencing (Sanger or Next Generation Sequencing)

mRNA Deep Sequencing
[Figure from Wikipedia]