Manolis Kellis Piotr Indyk

Samankaltaiset tiedostot
CS284A Representations & Algorithms for Molecular Biology. Xiaohui S. Xie University of California, Irvine

Genome 373: Genomic Informatics. Professors Elhanan Borenstein and Jay Shendure

State of the Union... Functional Genomics Research Stream. Molecular Biology. Genomics. Computational Biology

Information on preparing Presentation

B3206 Microbial Genetics

7.4 Variability management

Chapter 7. Motif finding (week 11) Chapter 8. Sequence binning (week 11)

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

Capacity Utilization

Network to Get Work. Tehtäviä opiskelijoille Assignments for students.

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

Eukaryotic Comparative Genomics

Methods S1. Sequences relevant to the constructed strains, Related to Figures 1-6.

Master's Programme in Life Science Technologies (LifeTech) Prof. Juho Rousu Director of the Life Science Technologies programme 3.1.

Functional Genomics & Proteomics

The CCR Model and Production Correspondence

Constructive Alignment in Specialisation Studies in Industrial Pharmacy in Finland

Uusi Ajatus Löytyy Luonnosta 4 (käsikirja) (Finnish Edition)

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

Gap-filling methods for CH 4 data

6.095/ Computational Biology: Genomes, Networks, Evolution. Sequence Alignment and Dynamic Programming

Results on the new polydrug use questions in the Finnish TDI data

Statistical design. Tuomas Selander

Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. David R. Kelley

7. Product-line architectures

National Building Code of Finland, Part D1, Building Water Supply and Sewerage Systems, Regulations and guidelines 2007

Other approaches to restrict multipliers

Plasmid Name: pmm290. Aliases: none known. Length: bp. Constructed by: Mike Moser/Cristina Swanson. Last updated: 17 August 2009

16. Allocation Models

Bounds on non-surjective cellular automata

Efficiency change over time

FETAL FIBROBLASTS, PASSAGE 10

ECVETin soveltuvuus suomalaisiin tutkinnon perusteisiin. Case:Yrittäjyyskurssi matkailualan opiskelijoille englantilaisen opettajan toteuttamana

MALE ADULT FIBROBLAST LINE (82-6hTERT)

Windows Phone. Module Descriptions. Opiframe Oy puh Espoo

TU-C2030 Operations Management Project. Introduction lecture November 2nd, 2016 Lotta Lundell, Rinna Toikka, Timo Seppälä

Kysymys 5 Compared to the workload, the number of credits awarded was (1 credits equals 27 working hours): (4)

Viral DNA as a model for coil to globule transition

Alternative DEA Models

tgg agg Supplementary Figure S1.

Chapter 9 Motif finding. Chaochun Wei Spring 2019

AYYE 9/ HOUSING POLICY

1. SIT. The handler and dog stop with the dog sitting at heel. When the dog is sitting, the handler cues the dog to heel forward.

Bioinformatics. Sequence Analysis: Part III. Pattern Searching and Gene Finding. Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute

EUROOPAN PARLAMENTTI

Experimental Identification and Computational Characterization of a Novel. Extracellular Metalloproteinase Produced by Clostridium sordellii

Innovative and responsible public procurement Urban Agenda kumppanuusryhmä. public-procurement

BLOCKCHAINS AND ODR: SMART CONTRACTS AS AN ALTERNATIVE TO ENFORCEMENT

Nuku hyvin, pieni susi -????????????,?????????????????. Kaksikielinen satukirja (suomi - venäjä) ( (Finnish Edition)

C++11 seminaari, kevät Johannes Koskinen

Tarua vai totta: sähkön vähittäismarkkina ei toimi? Satu Viljainen Professori, sähkömarkkinat

Bioinformatics in Laboratory of Computer and Information Science

Curriculum. Gym card

Miksi Suomi on Suomi (Finnish Edition)

ELEC-C5230 Digitaalisen signaalinkäsittelyn perusteet

toukokuu 2011: Lukion kokeiden kehittämistyöryhmien suunnittelukokous

Choose Finland-Helsinki Valitse Finland-Helsinki

Supplementary information: Biocatalysis on the surface of Escherichia coli: melanin pigmentation of the cell. exterior

Use of spatial data in the new production environment and in a data warehouse

812336A C++ -kielen perusteet,

RANTALA SARI: Sairaanhoitajan eettisten ohjeiden tunnettavuus ja niiden käyttö hoitotyön tukena sisätautien vuodeosastolla

LX 70. Ominaisuuksien mittaustulokset 1-kerroksinen 2-kerroksinen. Fyysiset ominaisuudet, nimellisarvot. Kalvon ominaisuudet

Information on Finnish Courses Autumn Semester 2017 Jenni Laine & Päivi Paukku Centre for Language and Communication Studies

Collaborative & Co-Creative Design in the Semogen -projects

Paikkatiedon semanttinen mallinnus, integrointi ja julkaiseminen Case Suomalainen ajallinen paikkaontologia SAPO

ProAgria. Opportunities For Success

VAASAN YLIOPISTO Humanististen tieteiden kandidaatin tutkinto / Filosofian maisterin tutkinto

The role of 3dr sector in rural -community based- tourism - potentials, challenges

anna minun kertoa let me tell you

Valuation of Asian Quanto- Basket Options

Vaisala s New Global L ightning Lightning Dataset GLD360

Capacity utilization

Infrastruktuurin asemoituminen kansalliseen ja kansainväliseen kenttään Outi Ala-Honkola Tiedeasiantuntija

NBE-E4510 Special Assignment in Biophysics and Biomedical Engineering AND NBE-E4500 Special Assignment in Human. NBE-E4225 Cognitive Neuroscience

Voice Over LTE (VoLTE) By Miikka Poikselkä;Harri Holma;Jukka Hongisto

Computational personal genomics: selection, regulation, epigenomics, disease

LYTH-CONS CONSISTENCY TRANSMITTER

Hankkeen toiminnot työsuunnitelman laatiminen

Uusia kokeellisia töitä opiskelijoiden tutkimustaitojen kehittämiseen

1. Liikkuvat määreet

TIETEEN PÄIVÄT OULUSSA

BIOENV Laitoskokous Departmental Meeting

Oma sininen meresi (Finnish Edition)

11/17/11. Gene Regulation. Gene Regulation. Gene Regulation. Finding Regulatory Motifs in DNA Sequences. Regulatory Proteins

Särmäystyökalut kuvasto Press brake tools catalogue

Salasanan vaihto uuteen / How to change password

Basic Flute Technique

Telecommunication Software

indexhan wen Club Ambulant -play together

Smart specialisation for regions and international collaboration Smart Pilots Seminar

Information on Finnish Language Courses Spring Semester 2018 Päivi Paukku & Jenni Laine Centre for Language and Communication Studies

IFAGG WORLD CUP I, CHALLENGE CUP I and GIRLS OPEN INTERNATIONAL COMPETITION 1 st 2 nd April 2011, Vantaa Finland

RINNAKKAINEN OHJELMOINTI A,

Informaatioteknologia vaikuttaa ihmisten käyttäytymiseen ja asenteisiin

Pricing policy: The Finnish experience

EVALUATION FOR THE ERASMUS+-PROJECT, STUDENTSE

Teacher's Professional Role in the Finnish Education System Katriina Maaranen Ph.D. Faculty of Educational Sciences University of Helsinki, Finland

Land-Use Model for the Helsinki Metropolitan Area

Green Growth Sessio - Millaisilla kansainvälistymismalleilla kasvumarkkinoille?

JA CHALLENGE Anna-Mari Sopenlehto Central Administration The City Development Group Business Developement and Competence

Transkriptio:

6.095 / 6.895 Computational Biology: Genomes, Networks, Evolution Rapid database search Courtsey of CCRNP, The National Cancer Institute. Manolis Kellis Piotr Indyk Protein interaction network Courtesy of GTL Center for Molecular and Cellular Systems. Genome duplication Courtesy of Talking Glossary of Genetics.

Administrivia Course information Lecturers: Manolis Kellis and Piotr Indyk Grading: Problem sets 50% Final Project 25% Midterm 20% Part. 5% 5 problem sets: Each problem set: covers 4 lectures, contains 4 problems. Algorithmic problems and programming assignments Graduate version includes 5th problem on current research Exams In-class midterm, no final exam Collaboration policy Collaboration allowed, but you must: Work independently on each problem before discussing it Write solutions on your own Acknowledge sources and collaborators. No outsourcing.

Goals for the term Introduction to computational biology Fundamental problems in computational biology Algorithmic/machine learning techniques for data analysis Research directions for active participation in the field Ability to tackle research Problem set questions: algorithmic rigorous thinking Programming assignments: hands-on experience w/ real datasets Final project: Research initiative to propose an innovative project Ability to carry out project s goals, produce deliverables Write-up goals, approach, and findings in conference format Present your project to your peers in conference setting

Course outline Organization Duality: Computation and Biology Important biological problems Fundamental computational techniques Foundations and Frontiers First half: well-defined problems and general methodologies Second half: in-depth look at complex problems, combine techniques learned, opens to projects, research directions Topics covered First half: the foundations String matching, genome analysis, expression clustering, regulatory motifs, biological networks, evolutionary theory Second half: the frontiers genome assembly, gene finding, RNA folding, micrornas, Bayesian networks, generative models, genome evolution

Books used in the course Image removed due to copyright restrictions. Image removed due to copyright restriction. Durbin, Richard, Graeme Mitchison, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press, 1997. ISBN: 0521629713. Jones, Neil, and PavelPevzner. An Introduction to Bioinformatics Algorithms. Cambridge, MA: MIT Press, 2004. ISBN: 0262101068. D J Price: ~$40 Availability: Quantum Books (J), amazon.com (J&D) MIT libraries: Both books have several copies on reserve

Today s Goals Computational challenges in modern biology The basic questions of molecular biology The computational problems that arise How we can address them algorithmically Overview of topics String matching, genome analysis, expression clustering, regulatory motifs, biological networks, evolutionary theory RNA folding, genome assembly, gene finding, micrornas, Bayesian networks, generative models, genome evolution First problem: regulatory motif discovery Problem introduction, biol. intuition, computational formulation Brute-force searching and algorithmic improvements Machine learning approach and research directions

ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT GTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT

ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT GTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT Genes Encode proteins Regulatory motifs Control gene expression Figures by MIT OCW.

ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT GTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT

ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT Extracting signal from noise CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT GTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT

Challenges in Computational Biology 4 Genome Assembly 5 Regulatory motif discovery 1 Gene Finding DNA 2 Sequence alignment 6 Comparative Genomics 7 Evolutionary Theory TCATGCTAT TCGTGATAA TGAGGATAT TTATCATAT TTATGATTT 3 Database lookup 8 Gene expression analysis 11 Protein network analysis RNA transcript 9 Cluster discovery 10 Gibbs sampling 12 Regulatory network inference 13 Emerging network properties Image removed due to copyright restrictions.

Molecular Biology Primer 20 mins

DNA: The double helix The most noble molecule of our time In fact, the two DNA strands are twisted around each other to make a double helix. Traditional Fancy Chemical Atomic Figure by MIT OCW.

DNA: the molecule of heredity Self-complementarity sets molecular basis of heredity Knowing one strand, creates a template for the other It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material. Watson & Crick, 1953 Deoxyribose sugar molecule C Nitrogenous bases A G T Phosphate molecule DNA REPLICATING ITSELF TA AT TA CG TA TA GC { T G A Weak bonds between bases C { GC GC AT TA C C A T T G OLD NEW TA A C G G T A OLD NEW AT TA CG G G Sugar-phosphate backbone GC TATA GC TA GC AT Figures by MIT OCW.

DNA: chemical details 5 4 3 1 2 5 A 4 3 1 2 G 5 T 2 1 C A 3 4 5 2 3 1 T 4 5 Bases hidden on the inside Phosphate backbone outside 2 1 3 4 5 Weak hydrogen bonds hold the two strands together This allows low-energy opening and re-closing of two strands Anti-parallel strands Extension 5 3 triphosphate coming from newly added nucleotide 4 3 1 2 5 G C 2 1 3 4 5 The only parings are: A with T C with G 4 3 1 2

DNA: deoxyribose sugar 5 5 4 1 4 1 3 2 3 2

DNA: the four bases Purine Purine Pyrimidine Pyrimidine Weak Weak Strong Strong Amino Amino Keto Keto

DNA: base pairs 5 5 5 5 4 3 1 2 4 3 1 2 4 3 1 2 4 3 1 2 5 2 1 3 4 5 5 2 1 3 4 5 4 3 1 2 4 3 1 2

A T DNA: sequences 5 A G A T C T 3 C 3 G C 5 G A T 5 3 A G A G TCTC 3 5 G C AGAG or CTCT

DNA packaging Why packaging DNA is very long Cell is very small Compression Chromosome is 50,000 times shorter than extended DNA Using the DNA Before a piece of DNA is used for anything, this compact structure must open locally Image removed due to copyright restrictions. Please see: Figure 8-10 from Alberts, Bruce, and Martin Raff. Essential Cell Biology. New York, NY: Garland Publishing Inc., 1997. ISBN: 0815320450.

Chromosomes inside the cell Prokaryote DNA Eukaryote DNA organized in a single chromosome. No nucleus. No mitosis. Nucleus DNA organized in multiple chromosomes inside a nucleus. Mitotic division. Figure by MIT OCW.

Central dogma of Molecular Biology DNA makes RNA makes Protein

Genes control the making of cell parts The gene is a fundamental unit of inheritance DNA molecule contains tens of thousands of genes Each gene governs the making of one functional element, one part of the cell machine Every time a part must be made, a piece of the genome is copied, transported, and used as a blueprint RNA is a temporary copy The medium for transporting genetic information from the DNA information repository to the protein-making machinery is an RNA molecule The more parts are needed, the more copies are made Each mrna only lasts a limited time before degradation

RNA: The messenger DNA RNA Transcription Information changes medium single strand vs. double strand ribose vs. deoxyribose sugar A T T A C G G T A C C G T U A A U G C C A U G G C A Compatible base-pairing in hybrid Translation Protein Figure by MIT OCW.

From DNA to RNA: Transcription Image removed due to copyright restrictions. Please see: Figure 7-9 from Alberts, Bruce, and Martin Raff. Essential Cell Biology. New York, NY: Garland Publishing Inc., 1997. ISBN: 0815320450.

From pre-mrna to mrna: Splicing In Eukaryotes, not every part of a gene is coding Functional exons interrupted by non-translated introns During pre-mrna maturation, introns are spliced out In humans, primary transcript can be 10 6 bp long Image removed due to copyright restrictions. Please see: Figure 7-16 from Alberts, Bruce, and Martin Raff. Essential Cell Biology. New York, NY: Garland Publishing Inc., 1997. ISBN: 0815320450. Alternative splicing can yield different exon subsets for the same gene, and hence different protein products

RNA can be functional Single Strand allows complex structure Self-complementary regions form helical stems Three-dimensional structure allows functionality of RNA Four types of RNA mrna: messenger of genetic information trna: codon-to-amino acid specificity rrna: core of the ribosome snrna: splicing reactions To be continued We ll learn more in a dedicated lecture on RNA world Once upon a time, before DNA and protein, RNA did all

RNA structure: 2 nd ary and 3 rd ary Courtesy of Wikimedia Commons. Courtesy of SStructView.

Splicing machinery made of RNA Image removed due to copyright restrictions. Please see: Figure 7-16 from Alberts, Bruce, and Martin Raff. Essential Cell Biology. New York, NY: Garland Publishing Inc., 1997. ISBN: 0815320450.

Central dogma of Molecular Biology DNA makes RNA makes Protein

Proteins carry out the cell s chemistry DNA RNA Protein Transcription Translation More complex polymer Nucleic Acids have 4 building blocks Proteins have 20. Greater versatility Each amino acid has specific properties Sequence Structure Function The amino acid sequence determines the three-dimensional fold of protein The protein s function largely depends on the features of the 3D structure Proteins play diverse roles Catalysis, binding, cell structure, signaling, transport, metabolism Figure by MIT OCW.

Protein structure DNA Sugar phosphate backbone 2 1 3 Figure by MIT OCW. Base pair Figure by MIT OCW. Helix-turn-helix Common motif for DNAbinding proteins that often play a regulatory role as mrna level transcription factors Beta-barrel Figure by MIT OCW. Some antiparallel b-sheet domains are better described as b-barrels rather than b-sandwiches, for example streptavadin and porin. Note that some structures are intermediate between the extreme barrel and sandwich arrangements. Alpha-beta horseshoe this placental ribonuclease inhibitor is a cytosolic protein that binds extremely strongly to any ribonuclease that may leak into the cytosol. 17-stranded parallel b sheet curved into an open horseshoe shape, with 16 a-helices packed against the outer surface. It doesn't form a barrel although it looks as though it should. The strands are only very slightly slanted, being nearly parallel to the central `axis'.

Protein building blocks Amino Acids

Ribosome From RNA to protein: Translation trna Figure by MIT OCW.

The Genetic Code

The genetic code Degeneracy of the genetic code To encode 20 amino acids, two nucleotides are not enough (4 2 =16). Three nucleotides are too many (4 3 =64) The genetic code is degenerate. Same amino acid can be represented by more than one codon. Room for innovation Moreover, amino acids with similar properties can be substituted for each other without changing the structure of the protein Six possible translation frames for every nucleotide stretch GCU.UGU.UUA.CGA.AUU.A Ala Cys Leu Arg Ile - G.CUU.GUU.UAC.GAA.UUA - Leu Val Tyr Glu - Leu Stop codon every 3/64. Long ORFs are unlikely, probably genes In some viruses as many as four overlapping frames are functional

Summary: The Central Dogma DNA makes RNA makes Protein DNA Inheritance Replication Transcription RNA Messages Translation Reactions Protein Figure by MIT OCW.

Why Computational Biology?

Why Computational Biology: Student answers Lots of data (* lots of data) Pattern finding It s all about data Ability to visualize Simulations Guess + verify (generate hypotheses for testing) Propose mechanisms / theory to explain observations Networks / combinations of variables Efficiency (reduce experimental space to cover) Informatics infrastructure (ability to combine datasets) Correlations Life itself is digital. Understand cellular instruction set

Challenges in Computational Biology 4 Genome Assembly 5 Regulatory motif discovery 1 Gene Finding DNA 2 Sequence alignment 6 Comparative Genomics 7 Evolutionary Theory TCATGCTAT TCGTGATAA TGAGGATAT TTATCATAT TTATGATTT 3 Database lookup 8 Gene expression analysis 11 Protein network analysis RNA transcript 9 Cluster discovery 10 Gibbs sampling 12 Regulatory network inference 13 Emerging network properties Image removed due to copyright restrictions.

Lecture Date Topic Reading Lecture 1 Thursday Sept 8 Algorithms + Machine Learning + Biology J3.1-3.7;J2.8-2.10;D11 Lecture 2 Tuesday Sept 13 Evolutionary models + seq alignment + Dynamic programming D2.1-2.3;J6.4-6.9 Lecture 3 Thursday Sept 15 Local/Global alignments + Variations on dynamic programming Lecture 4 Tuesday Sept 20 Linear time string searching + Suffix trees + String preprocessing J9.1-9.8 Lecture 5 Thursday Sept 22 Database search + Hashing + random projections Lecture 6 Tuesday Sept 27 Biological signals + HMMs D3.1-3.2 Lecture 7 Thursday Sept 29 CpG islands / simple ORFs + Learning with HMMs D3.3 Lecture 8 Tuesday Oct 4 Expression analysis + clustering J10.1-10.3 Course Outline Lecture 9 No lecture Lecture 10 Lecture 11 Lecture 12 Lecture 13 Lecture 14 Lecture 15 Lecture 16 Lecture 17 Thursday Tuesday Thursday Tuesday Thursday Tuesday Thursday Tuesday Thursday Tuesday Oct Oct Oct Oct Oct Oct Oct Nov Nov Nov 6 11 13 18 20 25 27 1 3 8 Multi-dimensional clustering + feature selection Holiday - Happy Columbus day Regulatory Motifs + Gibbs Sampling + Expectation Maximization Biological networks + graph algorithms + network dynamics Phylogenetic trees + Greedy algorithms + parsimony + EM Multiple alignment + profile alignment + iterative alignment Midterm RNA folding + context-free grammars + Phylo-CFGs Combine alignment and feature finding + Pair HMMs Gene Finding + Generalized HMMs J4.4-4.9;J5.5;J12.2 J8.1-8.2;P1 D7.1-7.5 J6.10 D9 D4 Burge Lecture 18 Thursday Nov 10 Comparative gene finding + Phylogenetic HMMs Siepel Lecture 19 Tuesday Nov 15 microrna regulation + target prediction Bartel Lecture 20 Thursday Nov 17 Regulatory relationships + Bayesian networks Hartemink Lecture 21 Tuesday Nov 22 Generative models of regulation + Bayesian graphs Segal No lecture Thursday Nov 24 Holiday - Happy Thanksgiving break Lecture 22 Tuesday Nov 29 Genome assembly + Euler graphs J8.4 Lecture 23 Thursday Dec 1 Genome rearrangements + Genome duplication J5.1-5.4 Lecture 24 Tuesday Dec 6 Genome-scale comparative genomics Kellis Lecture 25 Thursday Dec 8 Final presentations

Today: Regulatory Motif Discovery Gene regulation: The process by which genes are turned on or off, in response to environmental stimuli Regulatory motifs: sequences that control gene usage; short sequence patterns, ~6-12 letters long, possibly degenerate DNA Replication Transcription RNA Translation Protein Figure by MIT OCW.

Regulatory motif discovery Gal4 Gal4 Mig1 GAL1 CGG CCG CGG CCG CCCCW ATGACTAAATCTCATTCAGAAGAAGTGA Regulatory motifs (summary) Genes are turned on / off in response to changing environments No direct addressing: subroutines (genes) contain sequence tags (motifs) Specialized proteins (transcription factors) recognize these tags What makes motif discovery hard? Motifs are short (6-8 bp), sometimes degenerate Can contain any set of nucleotides (no ATG or other rules) Act at variable distances upstream (or downstream) of target gene How can we discover them?

Motifs are preferentially conserved across evolution Gal10 Gal4 Gal1 GAL10 Scer TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACA Spar CTATGTTGATCTTTTCAGAATTTTT-CACTATATTAAGATGGGTGCAAAGAAGTGTGATTATTATATTACATCGCTTTCCTATCATACACA Smik GTATATTGAATTTTTCAGTTTTTTTTCACTATCTTCAAGGTTATGTAAAAAA-TGTCAAGATAATATTACATTTCGTTACTATCATACACA Sbay TTTTTTTGATTTCTTTAGTTTTCTTTCTTTAACTTCAAAATTATAAAAGAAAGTGTAGTCACATCATGCTATCT-GTCACTATCACATATA * * **** * * * ** ** * * ** ** ** * * * ** ** * * * ** * * * Scer TATCCATATCTAATCTTACTTATATGTTGT-GGAAAT-GTAAAGAGCCCCATTATCTTAGCCTAAAAAAACC--TTCTCTTTGGAACTTTCAGTAATACG Spar TATCCATATCTAGTCTTACTTATATGTTGT-GAGAGT-GTTGATAACCCCAGTATCTTAACCCAAGAAAGCC--TT-TCTATGAAACTTGAACTG-TACG Smik TACCGATGTCTAGTCTTACTTATATGTTAC-GGGAATTGTTGGTAATCCCAGTCTCCCAGATCAAAAAAGGT--CTTTCTATGGAGCTTTG-CTA-TATG Sbay TAGATATTTCTGATCTTTCTTATATATTATAGAGAGATGCCAATAAACGTGCTACCTCGAACAAAAGAAGGGGATTTTCTGTAGGGCTTTCCCTATTTTG ** ** *** **** ******* ** * * * * * * * ** ** * *** * *** * * * Scer CTTAACTGCTCATTGC-----TATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCT Spar CTAAACTGCTCATTGC-----AATATTGAAGTACGGATCAGAAGCCGCCGAGCGGACGACAGCCCTCCGACGGAATATTCCCCTCCGTGCGTCGCCGTCT Smik TTTAGCTGTTCAAG--------ATATTGAAATACGGATGAGAAGCCGCCGAACGGACGACAATTCCCCGACGGAACATTCTCCTCCGCGCGGCGTCCTCT Sbay TCTTATTGTCCATTACTTCGCAATGTTGAAATACGGATCAGAAGCTGCCGACCGGATGACAGTACTCCGGCGGAAAACTGTCCTCCGTGCGAAGTCGTCT ** ** ** ***** ******* ****** ***** *** **** * *** ***** * * ****** *** * *** Scer TCACCGG-TCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAA-----TACTAGCTTTT--ATGGTTATGAA Spar TCGTCGGGTTGTGTCCCTTAA-CATCGATGTACCTCGCGCCGCCCTGCTCCGAACAATAAGGATTCTACAAGAAA-TACTTGTTTTTTTATGGTTATGAC Smik ACGTTGG-TCGCGTCCCTGAA-CATAGGTACGGCTCGCACCACCGTGGTCCGAACTATAATACTGGCATAAAGAGGTACTAATTTCT--ACGGTGATGCC Sbay GTG-CGGATCACGTCCCTGAT-TACTGAAGCGTCTCGCCCCGCCATACCCCGAACAATGCAAATGCAAGAACAAA-TGCCTGTAGTG--GCAGTTATGGT ** * ** *** * * ***** ** * * ****** ** * * ** * * ** *** Scer GAGGA-AAAATTGGCAGTAA----CCTGGCCCCACAAACCTT-CAAATTAACGAATCAAATTAACAACCATA-GGATGATAATGCGA------TTAG--T Spar AGGAACAAAATAAGCAGCCC----ACTGACCCCATATACCTTTCAAACTATTGAATCAAATTGGCCAGCATA-TGGTAATAGTACAG------TTAG--G Smik CAACGCAAAATAAACAGTCC----CCCGGCCCCACATACCTT-CAAATCGATGCGTAAAACTGGCTAGCATA-GAATTTTGGTAGCAA-AATATTAG--G Sbay GAACGTGAAATGACAATTCCTTGCCCCT-CCCCAATATACTTTGTTCCGTGTACAGCACACTGGATAGAACAATGATGGGGTTGCGGTCAAGCCTACTCG **** * * ***** *** * * * * * * * * ** MIG1 TBP Scer TTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG--ATGATTTTT-GATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCAC-----TT Spar GTTTT--TCTTATTCCTGAGACAATTCATCCGCAAAAAATAATGGTTTTT-GGTCTATTAGCAAACATATAAATGCAAAAGTTGCATAGCCAC-----TT Smik TTCTCA--CCTTTCTCTGTGATAATTCATCACCGAAATG--ATGGTTTA--GGACTATTAGCAAACATATAAATGCAAAAGTCGCAGAGATCA-----AT Sbay TTTTCCGTTTTACTTCTGTAGTGGCTCAT--GCAGAAAGTAATGGTTTTCTGTTCCTTTTGCAAACATATAAATATGAAAGTAAGATCGCCTCAATTGTA * * * *** * ** * * *** *** * * ** ** * ******** **** * Scer TAACTAATACTTTCAACATTTTCAGT--TTGTATTACTT-CTTATTCAAAT----GTCATAAAAGTATCAACA-AAAAATTGTTAATATACCTCTATACT Spar TAAATAC-ATTTGCTCCTCCAAGATT--TTTAATTTCGT-TTTGTTTTATT----GTCATGGAAATATTAACA-ACAAGTAGTTAATATACATCTATACT Smik TCATTCC-ATTCGAACCTTTGAGACTAATTATATTTAGTACTAGTTTTCTTTGGAGTTATAGAAATACCAAAA-AAAAATAGTCAGTATCTATACATACA Sbay TAGTTTTTCTTTATTCCGTTTGTACTTCTTAGATTTGTTATTTCCGGTTTTACTTTGTCTCCAATTATCAAAACATCAATAACAAGTATTCAACATTTGT * * * * * * ** *** * * * * ** ** ** * * * * * *** * Scer TTAA-CGTCAAGGA---GAAAAAACTATA Spar TTAT-CGTCAAGGAAA-GAACAAACTATA Smik TCGTTCATCAAGAA----AAAAAACTA.. Sbay TTATCCCAAAAAAACAACAACAACATATA * * ** * ** ** ** MIG1 GAL4 GAL4 GAL4 GAL4 GAL1 Increase power by testing conservation Factor in many footprint regions TBP Conservation island

Framing the problem computationally How do we find all instances of a motif in a genome? Naïve algorithm: Search every position How do we count all instances of every 6-mer in a genome Naïve algorithm: Scan the genome for each motif Improvement: Scan genome once, filling a table How do we count all instances of every 50- mer in a genome Table is no longer feasible, most entries empty Use a hash table How do we search a new motif in a known genome Pre-processing of the database

Computational approaches for motif discovery Method #1: Enumerate all motifs Combinatorial search Method #2: Randomly sample the genome Statistical approach Method #3: Enumerate motif seeds + refinement Hill-climbing Method #4: Content-based addressing Hashing

Recitation tomorrow! Introduction to algorithms / running time Searching for motifs Basic table indexing Hints to hashing Introduction to molecular biology Central dogma, splicing, genomes Introduction to probability Searching motifs with ambiguity codes Modeling background distribution Likelihood ratios and hypothesis testing