Chapter 7. Motif finding (week 11) Chapter 8. Sequence binning (week 11)

Transcript:

Course organization
Introduction (Week 1)
Part I: Algorithms for Sequence Analysis (Week 1-11)
  Chapter 1-3, Models and theories
    » Probability theory and Statistics (Week 2)
    » Algorithm complexity analysis (Week 3)
    » Classic algorithms (Week 4)
    » Lab: Linux and Perl
  Chapter 4, Sequence alignment (Week 6)
  Chapter 5, Hidden Markov Models (Week 8)
  Chapter 6, Multiple sequence alignment (Week 10)
  Chapter 7, Motif finding (Week 11)
  Chapter 8, Sequence binning (Week 11)
Part II: Algorithms for Network Biology (Week 12-16)

Chapter 7: Motif Finding
Chaochun Wei, Fall 2013

Contents
1. Reading materials
2. Motif finding
  » Motif
  » WMM
  » Motif finding methods

Reading materials
Tompa et al. (2005), Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotechnology, 23(1):137-144

Sequence Motifs
Motif: a subsequence with some specific function
May be in DNA, RNA, or protein
Function may be context dependent
  » A ribosome binding site must be transcribed
  » RNA and protein motifs may depend on structure
May be gapped or ungapped
Use a model to search for (predict) new sites
Models may be simple sequences (regular expressions) or probabilistic patterns
Modeling approach depends on the data available
  » Quantitative/qualitative

Motifs: DNA binding sites
Gene regulation is often controlled by proteins (transcription factors) that bind to specific DNA sequences to activate or repress transcription
Binding sites are usually short (6-30 bp) and typically ungapped, probabilistic patterns
Would like to have a quantitative model that predicts binding affinity and/or occupancy

Weight Matrix Model
Position:    1    2    3    4    5    6
A:          -8   10   -1    2    1   -8
C:         -10   -9   -3   -2   -1  -12
G:          -7   -9   -1   -1   -4   -9
T:          10   -6    9    0   -1   11

Scoring the sequence A C T A T A A T G T by sliding the matrix along it and summing one weight per position:
Window CTATAA (positions 2-7): -10 - 6 - 1 + 0 + 1 - 8 = -24
Window TATAAT (positions 3-8): 10 + 10 + 9 + 2 + 1 + 11 = 43
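The sliding-window scoring on these slides can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names `score_window` and `scan` are made up here, and the matrix values are the ones on the slide.

```python
# Weight matrix from the slide: one row per base, one column per motif position (1-6).
WEIGHTS = {
    "A": [-8, 10, -1, 2, 1, -8],
    "C": [-10, -9, -3, -2, -1, -12],
    "G": [-7, -9, -1, -1, -4, -9],
    "T": [10, -6, 9, 0, -1, 11],
}


def score_window(window, weights=WEIGHTS):
    """Sum the weight of the observed base at each motif position."""
    return sum(weights[base][i] for i, base in enumerate(window))


def scan(sequence, weights=WEIGHTS):
    """Score every window of motif width; returns (1-based start, window, score)."""
    width = len(weights["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        hits.append((start + 1, window, score_window(window, weights)))
    return hits


for start, window, score in scan("ACTATAATGT"):
    print(start, window, score)
```

The windows starting at positions 2 and 3 give the two scores shown on the slides (-24 and 43); the best-scoring window is the predicted site.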

How to get optimal models
Quantitative binding data
  » Can do regression to get the best fit
  » Can test different models
  » Must take into account that a more complex model can always give a better fit, but is it useful?
Qualitative binding data (set of sites)
  » Log-odds methods
  » Can use minimum volume method (QP)

Counts $N(b,i)$ → frequencies $F(b,i)$
$S(b,i) = \log\left[F(b,i)/p(b)\right]$
$I(i) = \sum_b F(b,i)\,S(b,i)$

Likelihood Ratio Statistics Primer
Given two probability distributions $P_i$ and $Q_i$, with $\sum_i P_i = \sum_i Q_i = 1$
And some data $D_i$, the number of times each type $i$ is observed in $N$ total observations
The likelihood ratio of the data being from distribution $Q_i$ versus $P_i$ is: $LR = \prod_i (Q_i/P_i)^{D_i}$
And the log-likelihood ratio is: $LLR = \sum_i D_i \ln(Q_i/P_i)$

$LLR = \sum_i D_i \ln(Q_i/P_i)$
Maximum likelihood distribution is $Q_i = D_i/N$
So $\max LLR = N \sum_i Q_i \ln(Q_i/P_i)$
$\sum_i Q_i \ln(Q_i/P_i) \ge 0$ is known as:
  » Information Content
  » Relative Entropy
  » Kullback-Leibler Distance
Related to the G-statistic and $\chi^2$
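As a quick numerical check of the identity above, the sketch below computes the LLR for some counts and confirms that the maximum-likelihood choice Q_i = D_i/N gives N times the relative entropy. The counts and the uniform background are hypothetical values chosen for illustration, and terms with D_i = 0 are taken to contribute zero.

```python
import math


def llr(counts, q, p):
    """Log-likelihood ratio: sum_i D_i * ln(Q_i / P_i); zero counts contribute 0."""
    return sum(d * math.log(qi / pi) for d, qi, pi in zip(counts, q, p) if d > 0)


def relative_entropy(q, p):
    """Kullback-Leibler distance: sum_i Q_i * ln(Q_i / P_i), with 0 * ln 0 taken as 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


counts = [10, 2, 3, 5]                  # hypothetical observations D_i
background = [0.25, 0.25, 0.25, 0.25]   # assumed distribution P_i
n = sum(counts)

q_ml = [d / n for d in counts]          # maximum-likelihood Q_i = D_i / N
max_llr = llr(counts, q_ml, background)

print(max_llr, n * relative_entropy(q_ml, background))  # the two values agree
print(llr(counts, [0.4, 0.2, 0.2, 0.2], background))    # any other Q gives a smaller LLR
```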

Position:   1   2   3   4   5   6   7   8   9  10
A           0   3  79  40  66  48  65  11  65   0
C          94  75   4   3   1   2   5   2   3   3
G           1   0   3   4   1   0   5   3  28  88
T           2  19  11  50  29  47  22  81   1   6
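A count matrix like this one can be turned into frequencies F(b,i), log-odds scores S(b,i), and per-position information content I(i). The sketch below makes several assumptions that are not on the slides: a uniform background p(b) = 0.25, base-2 logarithms (so I(i) is in bits), and a pseudocount of 1 when computing log-odds scores, so that zero counts do not give log 0.

```python
import math

# Count matrix from the slide: COUNTS[base][i] is N(b, i) for position i + 1.
COUNTS = {
    "A": [0, 3, 79, 40, 66, 48, 65, 11, 65, 0],
    "C": [94, 75, 4, 3, 1, 2, 5, 2, 3, 3],
    "G": [1, 0, 3, 4, 1, 0, 5, 3, 28, 88],
    "T": [2, 19, 11, 50, 29, 47, 22, 81, 1, 6],
}
BACKGROUND = {b: 0.25 for b in "ACGT"}  # assumption: uniform background


def frequencies(counts, pseudocount=0.0):
    """F(b, i) = (N(b, i) + pseudocount) / column total."""
    width = len(counts["A"])
    totals = [sum(counts[b][i] + pseudocount for b in counts) for i in range(width)]
    return {b: [(counts[b][i] + pseudocount) / totals[i] for i in range(width)]
            for b in counts}


def log_odds(freqs, background):
    """S(b, i) = log2[F(b, i) / p(b)]; needs non-zero frequencies."""
    return {b: [math.log2(f / background[b]) for f in freqs[b]] for b in freqs}


def information_content(freqs, background):
    """I(i) = sum_b F(b, i) * log2[F(b, i) / p(b)], with 0 * log 0 taken as 0."""
    width = len(freqs["A"])
    return [sum(freqs[b][i] * math.log2(freqs[b][i] / background[b])
                for b in freqs if freqs[b][i] > 0)
            for i in range(width)]


info = information_content(frequencies(COUNTS), BACKGROUND)
scores = log_odds(frequencies(COUNTS, pseudocount=1.0), BACKGROUND)
for i, bits in enumerate(info, start=1):
    print(i, round(bits, 2))
```

Strongly conserved positions, such as position 1 (94 of 97 sites are C), carry more information than mixed positions such as position 4.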

Motif Discovery
Find significant motifs in long sequences
Types of data
  » Co-regulated promoters
  » Segments bound to specific proteins
  » Phylogenetically conserved segments
Algorithm classes (search space)
  » Pattern searches: consensus motifs
  » Alignment searches: PWM motifs
Objective functions

Example (small) dataset
CE1CG \TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGACAG\
ECOARABOP \GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAG\
ECOBGLR1 \ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATATAACTTTATAAATTCCTAAAATTACACAAAGTTAATAACTGTGAGCATGGTCATATTTTTATCAAT\
ECOCRP \CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTACAGTAATACATTGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTACAGTTGATAGC\
ECOCYA \ACGGTGCTACACTTGTATGTAGCGCATCTTTCTTTACGGTCAATCAGCAAGGTGTTAAATTGATCACGTTTTAGACCATTTTTTCGTCGTGAAACTAAAAAAACC\
ECODEOP2 \AGTGAATTATTTGAACCAGATCGCATTACAGTGATGCAAACTTGTAAGTAGATTTCCTTAATTGTGATGTGTATCGAAGTGTGTTGCGGAGTAGATGTTAGAATA\
ECOGALE \GCGCATAAAAAACGGCTAAATTCTTGTGTAAACGATTCCACTAATTTATTCCATGTCACACTTTTCGCATCTTTGTTATGCTATGGTTATTTCATACCATAAGCC\
ECOILVBPR \GCTCCGGCGGGGTTTTTTGTTATCTGCAATTCAGTACAAAACGTGATCAACCCCTCAATTTTCCCTTTGCTGAAAAATTTTCCATTGTCTCCCCTGTAAAGCTGT\
ECOLAC \AACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCAC\
ECOMALBA \ACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGGCGTAGGGGCAAGGAGGATGGAAAGAGGTTGCCGTATAAAGAAACTAGAGTCCGTTTA\
ECOMALBA \GGAGGAGGCGGGAGGATGAGAACACGGCTTCTGTGAACTAAACCGAGGTCATGTAAGGAATTTCGTGATGTTGCTTGCAAAAATCGTGGCGATTTTATGTGCGCA\
ECOMALT \GATCAGCGTCGTTTTAGGTGAGTTGTTAATAAAGATTTGGAATTGTGACACAGTGCAAATTCAGACACATAAAAAAACGTCATCGCTTGCATTAGAAAGGTTTCT\
ECOOMPA \GCTGACAAAAAAGATTAAACATACCTTATACAAGACTTTTTTTTCATATGCCTGACGGAGTTCACACTTGTAAGTTTTCAACTACGTTGTAGACTTTACATCGCC\
ECOTNAA \TTTTTTAAACATTAAAATTCTTACGTAATTTATAATCTTTAAAAAAAGCATTTAATATTGCTCCCCGAACGATTGTGATTCGATTCACATTTAAACAATTTCAGA\
ECOUXU1 \CCCATGAGAGTGAAATTGTTGTGATGTGGTTAACCCAATTAGAATTCGGGATTGACATGTCTTACCAAAAGGTAGAACTTATACGCCATCTCATCCGATGCAAGC\
PBR322 \CTGGCTTAACTATGCGGCATCAGAGCAGATTGTACTGAGAGTGCACCATATGCGGTGTGAAATACCGCACAGATGCGTAAGGAGAAAATACCGCATCAGGCGCTC\
TRN9CAT \CTGTGACGGAAGATCACTTCGCAGAATAAATAAATCCTGGTGTCCCTGTTGATACCGGGAAGCCCTGGGCCAACTTTTGGCGAAAATGAGACGTTGATCGGCACG\
TDC \GATTTTTATACTTTAACTTGTTGATATTTAAAGGTATTTAATTGTAATAACGATACTCTGGAAAGTATTGAAAGTTAATTTGTGAGTGGTCGCACATATCCTGTT\

Example Output

Types of Motifs
1. Consensus sequence pattern
  » May include degenerate bases and allow for mismatches
  » Search space is over possible patterns
2. Weight matrix (PWM, profile, PSSM)
  » Might go to higher-order models
  » Search space is over possible alignments

Pattern based algorithms
Motif length l, mismatches m; N seqs, each L long
4^l patterns; search for the most common (or most significant), allowing up to m mismatches
  » P-value from background distribution
Can allow degenerate positions
Can just search using existing l-mers
Can use a suffix tree for efficient search of patterns allowing mismatches
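A brute-force version of the pattern search can be sketched as follows. It restricts candidates to l-mers that actually occur in the input and counts, for each, how many sequences contain it with at most m mismatches. The function names and the three toy sequences are hypothetical, and no P-value is computed; a suffix tree would replace the inner loops in a serious implementation.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def sequences_matching(pattern, sequences, max_mismatches):
    """Count sequences containing the pattern with at most max_mismatches mismatches."""
    length = len(pattern)
    count = 0
    for seq in sequences:
        if any(hamming(pattern, seq[i:i + length]) <= max_mismatches
               for i in range(len(seq) - length + 1)):
            count += 1
    return count


def most_common_pattern(sequences, length, max_mismatches):
    """Search only l-mers present in the data; ties go to the alphabetically first."""
    candidates = sorted({seq[i:i + length]
                         for seq in sequences
                         for i in range(len(seq) - length + 1)})
    return max(candidates,
               key=lambda p: sequences_matching(p, sequences, max_mismatches))


# Hypothetical toy input: TGTGA planted exactly once and with one mismatch twice.
toy = ["ACCTGTGAGG", "GGTGTCATTC", "CATTGAGACG"]
best = most_common_pattern(toy, length=5, max_mismatches=1)
print(best, sequences_matching(best, toy, 1))
```

Several patterns can tie on raw counts, which is why the slides rank candidates by significance against a background distribution.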

Suffix Trees
All sequences included in O(NL) time and space
Each node can be labeled with an N-bit string to indicate which sequences contain a specific l-mer
Can allow m mismatches:
  » At each string, keep track of the number of mismatches to the pattern being considered and continue down branches until that is exceeded
  » Can speed up by requiring the mismatches to be spaced, i.e. not all at the beginning of the string
Each pattern can be searched for in linear time; can find all patterns that match the criterion of occurring in at least k of the N seqs
Significance of each matching pattern can be found from an expectation based on background
WEEDER program: good performance on benchmark tests

Alignment (Profile) based methods
Greedy algorithm (Consensus)
Expectation Maximization (MEME)
Gibbs Sampler
Regression (MatrixReduce)
Can use phylogenetic conservation

Greedy Algorithm (Consensus)
Simple version: assume every sequence contains at least one true binding site
Using each l-mer, find the best match to generate 2-seq alignments
Use the top K PWMs to search the remaining sequences and include a new sequence
Repeat until all seqs contribute
  » Or until the objective function is maximized (IC, p-value)

Expectation Maximization (MEME)
Initial PWM (at random or from an average over all potential sites)
Using the current PWM, determine the probability of every position being a site
Re-estimate the PWM based on those probabilities
Continue until convergence
  » EM always converges (possibly to a local optimum)
Objective is LLR
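The EM loop can be sketched as below. This is a simplified illustration, not MEME itself: it assumes exactly one site per sequence, a uniform prior over start positions, a fixed uniform background, and no pseudocounts. The function names and toy sequences are hypothetical.

```python
import math

BASES = "ACGT"
UNIFORM = {b: 0.25 for b in BASES}  # assumption: fixed uniform background


def windows(seq, width):
    return [seq[z:z + width] for z in range(len(seq) - width + 1)]


def initial_pwm(sequences, width):
    """Initial PWM: average base frequencies over all potential sites."""
    pwm = [{b: 0.0 for b in BASES} for _ in range(width)]
    total = 0
    for seq in sequences:
        for w in windows(seq, width):
            total += 1
            for i, b in enumerate(w):
                pwm[i][b] += 1.0
    return [{b: col[b] / total for b in BASES} for col in pwm]


def e_step(sequences, pwm, background):
    """Posterior probability of each start position, plus the LLR objective."""
    width = len(pwm)
    posteriors, llr = [], 0.0
    for seq in sequences:
        ratios = [math.prod(pwm[i][b] / background[b] for i, b in enumerate(w))
                  for w in windows(seq, width)]
        total = sum(ratios)
        posteriors.append([r / total for r in ratios])
        llr += math.log(total / len(ratios))
    return posteriors, llr


def m_step(sequences, posteriors, width):
    """Re-estimate the PWM from posterior-weighted base counts."""
    pwm = [{b: 0.0 for b in BASES} for _ in range(width)]
    for seq, post in zip(sequences, posteriors):
        for z, weight in enumerate(post):
            for i in range(width):
                pwm[i][seq[z + i]] += weight
    n = len(sequences)
    return [{b: col[b] / n for b in BASES} for col in pwm]


def em(sequences, width, background=UNIFORM, max_iter=200, tol=1e-9):
    pwm = initial_pwm(sequences, width)
    history = []
    for _ in range(max_iter):
        posteriors, llr = e_step(sequences, pwm, background)
        history.append(llr)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        pwm = m_step(sequences, posteriors, width)
    return pwm, posteriors, history


toy = ["ACCTGTGAGG", "GGTGTCATTC", "CATTGAGACG"]  # hypothetical sequences
pwm, posteriors, history = em(toy, width=5)
print([max(range(len(p)), key=p.__getitem__) for p in posteriors])  # likeliest starts
```

The recorded LLR never decreases from one iteration to the next, but the end point depends on the starting PWM, so practical tools try many starting points.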

Gibbs Sampling
Similar to EM, but with some important differences
  » At each iteration pick one site on each seq, chosen by its probability, to update the PWM
  » Not guaranteed to converge, but tends to increase the objective (IC) and plateau
  » Can escape local optima
Other MCMC algorithms
  » Metropolis
  » Simulated annealing
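A minimal site sampler might look like the sketch below. It uses the common hold-one-out variant, which is an assumption here, as the slides do not spell out the update order. The PWM is built from the current sites of all other sequences, and a new site for the held-out sequence is drawn in proportion to its likelihood ratio. The pseudocount of 1, the uniform background, the function names, and the toy sequences are likewise illustrative.

```python
import random

BASES = "ACGT"
UNIFORM = {b: 0.25 for b in BASES}  # assumption: uniform background


def build_pwm(sites, pseudocount=1.0):
    """Column-wise base frequencies of the given sites, with pseudocounts."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm


def site_weights(seq, pwm, background):
    """Likelihood ratio (motif vs background) for every start position in seq."""
    width = len(pwm)
    weights = []
    for z in range(len(seq) - width + 1):
        ratio = 1.0
        for i in range(width):
            base = seq[z + i]
            ratio *= pwm[i][base] / background[base]
        weights.append(ratio)
    return weights


def gibbs_sample(sequences, width, background=UNIFORM, sweeps=100, seed=0):
    rng = random.Random(seed)
    # Start from one random site per sequence.
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(sweeps):
        for k, seq in enumerate(sequences):
            # Hold out sequence k and build the PWM from the other current sites.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(sequences, positions)) if j != k]
            pwm = build_pwm(others)
            weights = site_weights(seq, pwm, background)
            # Sample (rather than maximize) the new site: this is what lets the
            # sampler leave a local optimum.
            positions[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


toy = ["ACCTGTGAGG", "GGTGTCATTC", "CATTGAGACG"]  # hypothetical sequences
print(gibbs_sample(toy, width=5, sweeps=50, seed=1))
```

Because sites are sampled, different seeds can give different final alignments; in practice one would run several chains and keep the alignment with the best objective (for example, information content).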

Acknowledgement
Most of the slides in this chapter were provided by Prof. Gary Stormo.