Chapter 9 Motif finding. Chaochun Wei Spring 2019

Samankaltaiset tiedostot
Chapter 7. Motif finding (week 11) Chapter 8. Sequence binning (week 11)

Capacity Utilization

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

LYTH-CONS CONSISTENCY TRANSMITTER

Bounds on non-surjective cellular automata

The CCR Model and Production Correspondence

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

Alternative DEA Models

16. Allocation Models

Information on preparing Presentation

Uusi Ajatus Löytyy Luonnosta 4 (käsikirja) (Finnish Edition)

Efficiency change over time

Results on the new polydrug use questions in the Finnish TDI data

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

Returns to Scale II. S ysteemianalyysin. Laboratorio. Esitelmä 8 Timo Salminen. Teknillinen korkeakoulu

Statistical design. Tuomas Selander

Other approaches to restrict multipliers

T Statistical Natural Language Processing Answers 6 Collocations Version 1.0

Gap-filling methods for CH 4 data

Valuation of Asian Quanto- Basket Options

National Building Code of Finland, Part D1, Building Water Supply and Sewerage Systems, Regulations and guidelines 2007

Network to Get Work. Tehtäviä opiskelijoille Assignments for students.

FETAL FIBROBLASTS, PASSAGE 10

Miksi Suomi on Suomi (Finnish Edition)

Characterization of clay using x-ray and neutron scattering at the University of Helsinki and ILL

Capacity utilization

Operatioanalyysi 2011, Harjoitus 4, viikko 40

Methods S1. Sequences relevant to the constructed strains, Related to Figures 1-6.

Choose Finland-Helsinki Valitse Finland-Helsinki

Nuku hyvin, pieni susi -????????????,?????????????????. Kaksikielinen satukirja (suomi - venäjä) ( (Finnish Edition)

MALE ADULT FIBROBLAST LINE (82-6hTERT)

Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. David R. Kelley

anna minun kertoa let me tell you

7.4 Variability management

11/17/11. Gene Regulation. Gene Regulation. Gene Regulation. Finding Regulatory Motifs in DNA Sequences. Regulatory Proteins

HARJOITUS- PAKETTI A

FinFamily PostgreSQL installation ( ) FinFamily PostgreSQL

Voice Over LTE (VoLTE) By Miikka Poikselkä;Harri Holma;Jukka Hongisto

1. Liikkuvat määreet

The Viking Battle - Part Version: Finnish

812336A C++ -kielen perusteet,

Alueellinen yhteistoiminta

How to handle uncertainty in future projections?

Hankkeen toiminnot työsuunnitelman laatiminen

Returns to Scale Chapters

Land-Use Model for the Helsinki Metropolitan Area

Information on Finnish Language Courses Spring Semester 2018 Päivi Paukku & Jenni Laine Centre for Language and Communication Studies

Tietorakenteet ja algoritmit

Eukaryotic Comparative Genomics

1. SIT. The handler and dog stop with the dog sitting at heel. When the dog is sitting, the handler cues the dog to heel forward.

Green Growth Sessio - Millaisilla kansainvälistymismalleilla kasvumarkkinoille?

ECVETin soveltuvuus suomalaisiin tutkinnon perusteisiin. Case:Yrittäjyyskurssi matkailualan opiskelijoille englantilaisen opettajan toteuttamana

C++11 seminaari, kevät Johannes Koskinen

Uusi Ajatus Löytyy Luonnosta 3 (Finnish Edition)

7. Product-line architectures

Graph. COMPUTE x=rv.normal(0,0.04). COMPUTE y=rv.normal(0,0.04). execute.

Oma sininen meresi (Finnish Edition)

Information on Finnish Language Courses Spring Semester 2017 Jenni Laine

Functional Genomics & Proteomics

Lab SBS3.FARM_Hyper-V - Navigating a SharePoint site

TM ETRS-TM35FIN-ETRS89 WTG

LX 70. Ominaisuuksien mittaustulokset 1-kerroksinen 2-kerroksinen. Fyysiset ominaisuudet, nimellisarvot. Kalvon ominaisuudet

The role of 3dr sector in rural -community based- tourism - potentials, challenges

Constructive Alignment in Specialisation Studies in Industrial Pharmacy in Finland

Kvanttilaskenta - 1. tehtävät

Kaivostoiminnan eri vaiheiden kumulatiivisten vaikutusten huomioimisen kehittäminen suomalaisessa luonnonsuojelulainsäädännössä

Metsälamminkankaan tuulivoimapuiston osayleiskaava

Tarua vai totta: sähkön vähittäismarkkina ei toimi? Satu Viljainen Professori, sähkömarkkinat

Operatioanalyysi 2011, Harjoitus 2, viikko 38

Miehittämätön meriliikenne

I. Principles of Pointer Year Analysis

toukokuu 2011: Lukion kokeiden kehittämistyöryhmien suunnittelukokous

Use of spatial data in the new production environment and in a data warehouse

Curriculum. Gym card

Karkaavatko ylläpitokustannukset miten kustannukset ja tuotot johdetaan hallitusti?

KONEISTUSKOKOONPANON TEKEMINEN NX10-YMPÄRISTÖSSÄ

BLOCKCHAINS AND ODR: SMART CONTRACTS AS AN ALTERNATIVE TO ENFORCEMENT

ELEMET- MOCASTRO. Effect of grain size on A 3 temperatures in C-Mn and low alloyed steels - Gleeble tests and predictions. Period

MEETING PEOPLE COMMUNICATIVE QUESTIONS

Searching (Sub-)Strings. Ulf Leser

KMTK lentoestetyöpaja - Osa 2

Viral DNA as a model for coil to globule transition

Travel Getting Around

Plasmid Name: pmm290. Aliases: none known. Length: bp. Constructed by: Mike Moser/Cristina Swanson. Last updated: 17 August 2009

FinFamily Installation and importing data ( ) FinFamily Asennus / Installation

Tynnyrivaara, OX2 Tuulivoimahanke. ( Layout 9 x N131 x HH145. Rakennukset Asuinrakennus Lomarakennus 9 x N131 x HH145 Varjostus 1 h/a 8 h/a 20 h/a

Information on Finnish Courses Autumn Semester 2017 Jenni Laine & Päivi Paukku Centre for Language and Communication Studies

HOITAJAN ROOLI TEKNOLOGIAVÄLITTEISESSÄ POTILASOHJAUKSESSA VÄITÖSKIRJATUTKIJA JENNI HUHTASALO

ReFuel 70 % Emission Reduction Using Renewable High Cetane Number Paraffinic Diesel Fuel. Kalle Lehto, Aalto-yliopisto 5.5.

Tork Paperipyyhe. etu. tuotteen ominaisuudet. kuvaus. Väri: Valkoinen Malli: Vetopyyhe

Pojan Sydan: Loytoretki Isan Rakkauteen (Finnish Edition)

Toimintamallit happamuuden ennakoimiseksi ja riskien hallitsemiseksi turvetuotantoalueilla (Sulfa II)

Toppila/Kivistö Vastaa kaikkin neljään tehtävään, jotka kukin arvostellaan asteikolla 0-6 pistettä.

TM ETRS-TM35FIN-ETRS89 WTG

Mat Seminar on Optimization. Data Envelopment Analysis. Economies of Scope S ysteemianalyysin. Laboratorio. Teknillinen korkeakoulu

TM ETRS-TM35FIN-ETRS89 WTG

( ( OX2 Perkkiö. Rakennuskanta. Varjostus. 9 x N131 x HH145

Experimental Identification and Computational Characterization of a Novel. Extracellular Metalloproteinase Produced by Clostridium sordellii

Guidebook for Multicultural TUT Users

Integration of Finnish web services in WebLicht Presentation in Freudenstadt by Jussi Piitulainen

Transkriptio:

1896 1920 1987 2006 Chapter 9 Motif finding Chaochun Wei Spring 2019

Contents 1. Reading materials 2. Sequence structure modeling Motif finding Regulatory module finding 2

Reading materials Tompa et al (2005), Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotechnology, 23(2):137-144 3

Sequence Motifs Motif: subsequence with some specific function May be in DNA, RNA, protein Function may be context dependent Ribosome binding site must be transcribed RNA, protein motifs may depend on structure May be gapped or ungapped Use model to search for (predict) new sites Models may be simple sequences (regular expressions) or probabilistic patterns Modeling approach depends on data available Quantitative/qualitative

Motifs: DNA binding sites Gene regulation often controlled by proteins (transcription factors) that bind to specific DNA sequences to activate or repress transcription Binding sites are usually short (6-30bp) and typically ungapped, probabilistic patterns Would like to have a quanitative model that predicts binding affinity and/or occupancy

Binding energy to all possible 4-long sites, and a simple additive model that gives the same energy predictions

Weight Matrix Model A: -8 10-1 2 1-8 C: -10-9 -3-2 -1-12 G: -7-9 -1-1 -4-9 T: 10-6 9 0-1 11

Weight Matrix Model -24.A C T A T A A T G T A: -8 10-1 2 1-8 C: -10-9 -3-2 -1-12 G: -7-9 -1-1 -4-9 T: 10-6 9 0-1 11

Weight Matrix Model 43.A C T A T A A T G T A: -8 10-1 2 1-8 C: -10-9 -3-2 -1-12 G: -7-9 -1-1 -4-9 T: 10-6 9 0-1 11

Non-additivity in binding Non-specific binding sets lower limit on affinity Typically much higher than if all bases considered Non-additive contributions to specific binding Binding energy of a base may depend on context (neighboring bases) Can use more complex models

If simple additive model is inadequate, can use, di-nucleotide or higher-order models. Some form of a matrix model must be correct because binding the binding data itself is a matrix (vector). 1 2 3 AA 0.83 0.42 1.38 AC 0-0.42 1.38 AG -1.21 0.42 1.38 AT -0.13 0-0.79 TT 1.25 0-0.79 AAA AAC AAG AAT TTT 1 1.25 0.41 1.25 0.83 1.25 2 1.38 1.38 1.38-0.79-0.79

How to get optimal models Quantitative binding data Can do regression to get best fit Can test different models Must take into account that a more complex model can always give better fit, but is it useful? Qualitative binding data set of sites Log-odds methods, assume Boltzmann Can use minimum volume method (QP)

Key point: There are only 3L+1 independent parameters

Given some quantitative activity data, you can obtain the optimal solution using standard, multiple linear regression methods. Key point: If you assume the activity is related to affinity and model parameters are related to energy, need to regress vs log(affinity)...

You get the optimal model, and you can Assess how well that model fits the data.

Here is a different example, the context of induced mutations, where the simple additive model doesn t work well but a di-nucleotide model gives a much better fit.

Example binding sites Most often one has example binding sites without quantitative binding data Can find the model that maximizes the probability of observing those sites using a maximum likelihood approach Can use maximum a posterior instead, especially if one has small sample size

N(b,i) F(b,i) S(b,i) = log[f(b,i)/p(b)] I(i) = F(b,i)S(b,i) b

Likelihood Ratio Statistics Primer Given two probability distributions P i and Q I P i = Q i = 1 And some data, D i, which is number of times each type i is observed in N total observations The Likelihood Ratio of the data being from distribution Q i versus P i is: LR = (Q i /P i ) Di And the log-likelihood Ratio is LLR = D i ln (Q i /P i )

LLR = D i ln (Q i /P i ) Maximum likelihood distribution is Q i = D i /N So max LLR = N Q i ln (Q i /P i ) Q i ln (Q i /P i ) 0 Information Content Relative Entropy Kullbach-Liebler Distance Related to G-statistic and χ 2

A 0 3 79 40 66 48 65 11 65 0 C 94 75 4 3 1 2 5 2 3 3 G 1 0 3 4 1 0 5 3 28 88 T 2 19 11 50 29 47 22 81 1 6

Motif Discovery Find significant motifs in long sequences Types of data Co-regulated promoters Segments bound to specific proteins Phylogenetically conserved segments Algorithm classes search space Pattern searches consensus motifs Alignment searches PWM motifs Objective Functions

Example (small) dataset CE1CG \TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGACAG\ ECOARABOP \GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAG\ ECOBGLR1 \ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATATAACTTTATAAATTCCTAAAATTACACAAAGTTAATAACTGTGAGCATGGTCATATTTTTATCAAT\ ECOCRP \CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTACAGTAATACATTGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTACAGTTGATAGC\ ECOCYA \ACGGTGCTACACTTGTATGTAGCGCATCTTTCTTTACGGTCAATCAGCAAGGTGTTAAATTGATCACGTTTTAGACCATTTTTTCGTCGTGAAACTAAAAAAACC\ ECODEOP2 \AGTGAATTATTTGAACCAGATCGCATTACAGTGATGCAAACTTGTAAGTAGATTTCCTTAATTGTGATGTGTATCGAAGTGTGTTGCGGAGTAGATGTTAGAATA\ ECOGALE \GCGCATAAAAAACGGCTAAATTCTTGTGTAAACGATTCCACTAATTTATTCCATGTCACACTTTTCGCATCTTTGTTATGCTATGGTTATTTCATACCATAAGCC\ ECOILVBPR \GCTCCGGCGGGGTTTTTTGTTATCTGCAATTCAGTACAAAACGTGATCAACCCCTCAATTTTCCCTTTGCTGAAAAATTTTCCATTGTCTCCCCTGTAAAGCTGT\ ECOLAC \AACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCAC\ ECOMALBA \ACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGGCGTAGGGGCAAGGAGGATGGAAAGAGGTTGCCGTATAAAGAAACTAGAGTCCGTTTA\ ECOMALBA \GGAGGAGGCGGGAGGATGAGAACACGGCTTCTGTGAACTAAACCGAGGTCATGTAAGGAATTTCGTGATGTTGCTTGCAAAAATCGTGGCGATTTTATGTGCGCA\ ECOMALT \GATCAGCGTCGTTTTAGGTGAGTTGTTAATAAAGATTTGGAATTGTGACACAGTGCAAATTCAGACACATAAAAAAACGTCATCGCTTGCATTAGAAAGGTTTCT\ ECOOMPA \GCTGACAAAAAAGATTAAACATACCTTATACAAGACTTTTTTTTCATATGCCTGACGGAGTTCACACTTGTAAGTTTTCAACTACGTTGTAGACTTTACATCGCC\ ECOTNAA \TTTTTTAAACATTAAAATTCTTACGTAATTTATAATCTTTAAAAAAAGCATTTAATATTGCTCCCCGAACGATTGTGATTCGATTCACATTTAAACAATTTCAGA\ ECOUXU1 \CCCATGAGAGTGAAATTGTTGTGATGTGGTTAACCCAATTAGAATTCGGGATTGACATGTCTTACCAAAAGGTAGAACTTATACGCCATCTCATCCGATGCAAGC\ PBR322 \CTGGCTTAACTATGCGGCATCAGAGCAGATTGTACTGAGAGTGCACCATATGCGGTGTGAAATACCGCACAGATGCGTAAGGAGAAAATACCGCATCAGGCGCTC\ TRN9CAT \CTGTGACGGAAGATCACTTCGCAGAATAAATAAATCCTGGTGTCCCTGTTGATACCGGGAAGCCCTGGGCCAACTTTTGGCGAAAATGAGACGTTGATCGGCACG\ TDC \GATTTTTATACTTTAACTTGTTGATATTTAAAGGTATTTAATTGTAATAACGATACTCTGGAAAGTATTGAAAGTTAATTTGTGAGTGGTCGCACATATCCTGTT\

Example Output

Example Output

Types of Motifs 1. Motif: Consensus Sequence pattern May include degenerate bases and allow for mismatches Search space is over possible patterns 2. Weight Matrix (PWM, Profile, PSSM) Might go to higher order models Search space is over possible alignments

Can use suffix tree for efficient search of patterns Pattern based algorithms Motif length l, mismatches m; N seqs, L long 4^l patterns, search for most common (or most significant) allowing up to m mismatches P-value from background distribution Can allow for m mismatches Can allow degenerate positions: 15^l patterns Can just search using existing l-mers

WEEDER program: good performance on benchmark tests. Suffix Trees All sequneces included in O(NL) time and space Each node can be labeled with N-bit string to indicate which sequences contain specific l-mer Can allow m mismatches: At each string keep track of number of mismatches to pattern being considered and continue down branches until that is exceeded. Can speed up by requiring the mismatches to be spaced; i.e. not all at beginning of string Each pattern can be searched for in linear time, can find all patterns that match criteria of occurring in at least k/n seqs Significance of each matching Pattern can be found from an Expectation based on background

Alignment (Profile) based methods Greedy algorithm (Consensus) Expectation Maximization (MEME) Gibbs Sampler Regression (MatrixReduce) Can use phylogenetic conservation

Greedy Algorithm (Consensus) Simple version: assume every sequence contains at least one true binding site Using each l-mer find best match to generate 2- seq alignments Using top K PWMs to search remaining sequences to include a new sequence Repeat until all seqs contribute Or objective function is maximized (IC, p-value)

Expectation Maximization (MEME) Initial PWM (at random or from average over all potential sites) Using current PWM determine probability of all positions being sites Re-estimate PWM based on those probabilities Continue until convergence always convergences Objective is LLR

Gibbs Sampling Similar to EM, but some important differences At each iteration pick one site on each seq, chosen by its probability, to update PWM Not guaranteed to converge, but tends to increase objective (IC) and plauteau Can escape local optima Other MCMC algorithms Metropolis Simulated annealing

From Lawrence et al, (1993) Science 1 262:208-14.

Sequence Motifs (II) Everything you learned about sequence motifs is wrong! But that is OK because:

Sequence Motifs (II) Everything you learned about sequence motifs is wrong! But that is OK because: All models are wrong. Some models are useful. George Box

Motifs: how are they wrong? Non-specific binding sets lower limit on affinity Non-additivity is an approximation; sometimes it is a good one, sometimes not Probabilistic models, such as log-odds PWMs, make an implicit assumption that sites come from a Boltzmann distribution

Log-Odds method assumes binding site probabilities from Boltzmann distribution: Z e S P S F i E i i / ) ( ) ( ) ( ) ( ln ln ' i i i i S P S F Z E E In Bayesian notation: P(S i bound) = P(bound S i ) P(S i )/P(bound)

Log-odds methods assumes sites are selected from a Boltzmann distribution where probability of binding is proportional to e -E

Problem: the simple probabilistic model doesn t model doesn t account for the protein concentration and saturation effects, or the biophysics of the interaction

Boltzmann is an approximation to the real (Fermi-Dirac) distribution: 1 P( S i bound) 1 e E i for Probabilities =.03,.33,.67,.99 for the consensus site with energy=0

True vs Predicted Binding Probabilities

Does it matter if we assume the wrong distribution? 1. Are binding sites often near saturation? 2. If so how accurate are the log-odds models? 3. How does it affect false-positive/falsenegative rates when predicting sites? 4. How can we get models using the correct distribution?

Djordjevic et al Genome Res. 2003 13: 2381-2390

Quadratic Programming finds solution with minimum number of predicted binding sites

Compare performance of: Log-odds (Red) QPMEME (Blue) MATCH (YELLOW) Simulated data with known Energy matrix T P rate 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 0.02 FP rate Sample sites with assuming: Step function (top) Boltzmann (bottom) For all score cutoffs, Vertical axis = TP rate Horizontal axis = FP rate T P rate 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 0.02 FP rate

Can we do better than either QP or Log-odds? Want to fit the whole distribution and estimate both the binding energy matrix and µ Sample sequences based on their probability, Estimate model from sampled sites

Binding Energy Estimates using Maximum Likelihood (BEEML)

High-throughput SELEX Components: Purified protein with GST-tag attached to beads Randomized synthetic library Method: 1. Illumina sequence library to ascertain any biases (to get P(S i )) 2. Extract DNA bound to protein, Illumina sequence (to get P(S i bound)) 3. (Repeat step 2 if desired) 4. BEEML analysis of sequences to get PWM

Protein Binding Microarray (PBM)Technology Berger et al. Nature Biotechnology, Vol. 24, pp. 1429-1435 (November 2006)

Accuracy of BEEML PWM from PBM data

UniProbe PWM

Conclusions Log-odds methods for estimating energy matrices and/or binding probabilities work fine in some cases, poorly in others Can take into account protein concentration and get fit to the whole distribution using a maximum likelihood method that gives good results under a wider variety of conditions Possible to take into account non-specific binding and multiple modes of binding too

Acknowledgement Most of the slides in this chapter were provided by Prof. Gary Stormo. 60