Chapter 9 Motif finding. Chaochun Wei Spring 2019

1896 1920 1987 2006 Chapter 9 Motif finding Chaochun Wei Spring 2019

Contents 1. Reading materials 2. Sequence structure modeling Motif finding Regulatory module finding 2

Reading materials Tompa et al (2005), Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotechnology, 23(2):137-144 3

Sequence Motifs Motif: subsequence with some specific function May be in DNA, RNA, protein Function may be context dependent Ribosome binding site must be transcribed RNA, protein motifs may depend on structure May be gapped or ungapped Use model to search for (predict) new sites Models may be simple sequences (regular expressions) or probabilistic patterns Modeling approach depends on data available Quantitative/qualitative

Motifs: DNA binding sites Gene regulation often controlled by proteins (transcription factors) that bind to specific DNA sequences to activate or repress transcription Binding sites are usually short (6-30bp) and typically ungapped, probabilistic patterns Would like to have a quanitative model that predicts binding affinity and/or occupancy

Binding energy to all possible 4-long sites, and a simple additive model that gives the same energy predictions

Weight Matrix Model A: -8 10-1 2 1-8 C: -10-9 -3-2 -1-12 G: -7-9 -1-1 -4-9 T: 10-6 9 0-1 11

Weight Matrix Model -24.A C T A T A A T G T A: -8 10-1 2 1-8 C: -10-9 -3-2 -1-12 G: -7-9 -1-1 -4-9 T: 10-6 9 0-1 11

Weight Matrix Model 43.A C T A T A A T G T A: -8 10-1 2 1-8 C: -10-9 -3-2 -1-12 G: -7-9 -1-1 -4-9 T: 10-6 9 0-1 11

Non-additivity in binding Non-specific binding sets lower limit on affinity Typically much higher than if all bases considered Non-additive contributions to specific binding Binding energy of a base may depend on context (neighboring bases) Can use more complex models

If simple additive model is inadequate, can use, di-nucleotide or higher-order models. Some form of a matrix model must be correct because binding the binding data itself is a matrix (vector). 1 2 3 AA 0.83 0.42 1.38 AC 0-0.42 1.38 AG -1.21 0.42 1.38 AT -0.13 0-0.79 TT 1.25 0-0.79 AAA AAC AAG AAT TTT 1 1.25 0.41 1.25 0.83 1.25 2 1.38 1.38 1.38-0.79-0.79

How to get optimal models Quantitative binding data Can do regression to get best fit Can test different models Must take into account that a more complex model can always give better fit, but is it useful? Qualitative binding data set of sites Log-odds methods, assume Boltzmann Can use minimum volume method (QP)

Key point: There are only 3L+1 independent parameters

Given some quantitative activity data, you can obtain the optimal solution using standard, multiple linear regression methods. Key point: If you assume the activity is related to affinity and model parameters are related to energy, need to regress vs log(affinity)...

You get the optimal model, and you can Assess how well that model fits the data.

Here is a different example, the context of induced mutations, where the simple additive model doesn t work well but a di-nucleotide model gives a much better fit.

Example binding sites Most often one has example binding sites without quantitative binding data Can find the model that maximizes the probability of observing those sites using a maximum likelihood approach Can use maximum a posterior instead, especially if one has small sample size

N(b,i) F(b,i) S(b,i) = log[f(b,i)/p(b)] I(i) = F(b,i)S(b,i) b

Likelihood Ratio Statistics Primer Given two probability distributions P i and Q I P i = Q i = 1 And some data, D i, which is number of times each type i is observed in N total observations The Likelihood Ratio of the data being from distribution Q i versus P i is: LR = (Q i /P i ) Di And the log-likelihood Ratio is LLR = D i ln (Q i /P i )

LLR = D i ln (Q i /P i ) Maximum likelihood distribution is Q i = D i /N So max LLR = N Q i ln (Q i /P i ) Q i ln (Q i /P i ) 0 Information Content Relative Entropy Kullbach-Liebler Distance Related to G-statistic and χ 2

A 0 3 79 40 66 48 65 11 65 0 C 94 75 4 3 1 2 5 2 3 3 G 1 0 3 4 1 0 5 3 28 88 T 2 19 11 50 29 47 22 81 1 6

Motif Discovery Find significant motifs in long sequences Types of data Co-regulated promoters Segments bound to specific proteins Phylogenetically conserved segments Algorithm classes search space Pattern searches consensus motifs Alignment searches PWM motifs Objective Functions

Example (small) dataset CE1CG \TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGACAG\ ECOARABOP \GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAG\ ECOBGLR1 \ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATATAACTTTATAAATTCCTAAAATTACACAAAGTTAATAACTGTGAGCATGGTCATATTTTTATCAAT\ ECOCRP \CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTACAGTAATACATTGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTACAGTTGATAGC\ ECOCYA \ACGGTGCTACACTTGTATGTAGCGCATCTTTCTTTACGGTCAATCAGCAAGGTGTTAAATTGATCACGTTTTAGACCATTTTTTCGTCGTGAAACTAAAAAAACC\ ECODEOP2 \AGTGAATTATTTGAACCAGATCGCATTACAGTGATGCAAACTTGTAAGTAGATTTCCTTAATTGTGATGTGTATCGAAGTGTGTTGCGGAGTAGATGTTAGAATA\ ECOGALE \GCGCATAAAAAACGGCTAAATTCTTGTGTAAACGATTCCACTAATTTATTCCATGTCACACTTTTCGCATCTTTGTTATGCTATGGTTATTTCATACCATAAGCC\ ECOILVBPR \GCTCCGGCGGGGTTTTTTGTTATCTGCAATTCAGTACAAAACGTGATCAACCCCTCAATTTTCCCTTTGCTGAAAAATTTTCCATTGTCTCCCCTGTAAAGCTGT\ ECOLAC \AACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCAC\ ECOMALBA \ACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGGCGTAGGGGCAAGGAGGATGGAAAGAGGTTGCCGTATAAAGAAACTAGAGTCCGTTTA\ ECOMALBA \GGAGGAGGCGGGAGGATGAGAACACGGCTTCTGTGAACTAAACCGAGGTCATGTAAGGAATTTCGTGATGTTGCTTGCAAAAATCGTGGCGATTTTATGTGCGCA\ ECOMALT \GATCAGCGTCGTTTTAGGTGAGTTGTTAATAAAGATTTGGAATTGTGACACAGTGCAAATTCAGACACATAAAAAAACGTCATCGCTTGCATTAGAAAGGTTTCT\ ECOOMPA \GCTGACAAAAAAGATTAAACATACCTTATACAAGACTTTTTTTTCATATGCCTGACGGAGTTCACACTTGTAAGTTTTCAACTACGTTGTAGACTTTACATCGCC\ ECOTNAA \TTTTTTAAACATTAAAATTCTTACGTAATTTATAATCTTTAAAAAAAGCATTTAATATTGCTCCCCGAACGATTGTGATTCGATTCACATTTAAACAATTTCAGA\ ECOUXU1 \CCCATGAGAGTGAAATTGTTGTGATGTGGTTAACCCAATTAGAATTCGGGATTGACATGTCTTACCAAAAGGTAGAACTTATACGCCATCTCATCCGATGCAAGC\ PBR322 \CTGGCTTAACTATGCGGCATCAGAGCAGATTGTACTGAGAGTGCACCATATGCGGTGTGAAATACCGCACAGATGCGTAAGGAGAAAATACCGCATCAGGCGCTC\ TRN9CAT \CTGTGACGGAAGATCACTTCGCAGAATAAATAAATCCTGGTGTCCCTGTTGATACCGGGAAGCCCTGGGCCAACTTTTGGCGAAAATGAGACGTTGATCGGCACG\ TDC \GATTTTTATACTTTAACTTGTTGATATTTAAAGGTATTTAATTGTAATAACGATACTCTGGAAAGTATTGAAAGTTAATTTGTGAGTGGTCGCACATATCCTGTT\

Example Output

Types of Motifs 1. Motif: Consensus Sequence pattern May include degenerate bases and allow for mismatches Search space is over possible patterns 2. Weight Matrix (PWM, Profile, PSSM) Might go to higher order models Search space is over possible alignments

Can use suffix tree for efficient search of patterns Pattern based algorithms Motif length l, mismatches m; N seqs, L long 4^l patterns, search for most common (or most significant) allowing up to m mismatches P-value from background distribution Can allow for m mismatches Can allow degenerate positions: 15^l patterns Can just search using existing l-mers

WEEDER program: good performance on benchmark tests. Suffix Trees All sequneces included in O(NL) time and space Each node can be labeled with N-bit string to indicate which sequences contain specific l-mer Can allow m mismatches: At each string keep track of number of mismatches to pattern being considered and continue down branches until that is exceeded. Can speed up by requiring the mismatches to be spaced; i.e. not all at beginning of string Each pattern can be searched for in linear time, can find all patterns that match criteria of occurring in at least k/n seqs Significance of each matching Pattern can be found from an Expectation based on background

Alignment (Profile) based methods Greedy algorithm (Consensus) Expectation Maximization (MEME) Gibbs Sampler Regression (MatrixReduce) Can use phylogenetic conservation

Greedy Algorithm (Consensus) Simple version: assume every sequence contains at least one true binding site Using each l-mer find best match to generate 2- seq alignments Using top K PWMs to search remaining sequences to include a new sequence Repeat until all seqs contribute Or objective function is maximized (IC, p-value)

Expectation Maximization (MEME) Initial PWM (at random or from average over all potential sites) Using current PWM determine probability of all positions being sites Re-estimate PWM based on those probabilities Continue until convergence always convergences Objective is LLR

Gibbs Sampling Similar to EM, but some important differences At each iteration pick one site on each seq, chosen by its probability, to update PWM Not guaranteed to converge, but tends to increase objective (IC) and plauteau Can escape local optima Other MCMC algorithms Metropolis Simulated annealing

From Lawrence et al, (1993) Science 1 262:208-14.

Sequence Motifs (II) Everything you learned about sequence motifs is wrong! But that is OK because:

Sequence Motifs (II) Everything you learned about sequence motifs is wrong! But that is OK because: All models are wrong. Some models are useful. George Box

Motifs: how are they wrong? Non-specific binding sets lower limit on affinity Non-additivity is an approximation; sometimes it is a good one, sometimes not Probabilistic models, such as log-odds PWMs, make an implicit assumption that sites come from a Boltzmann distribution

Log-Odds method assumes binding site probabilities from Boltzmann distribution: Z e S P S F i E i i / ) ( ) ( ) ( ) ( ln ln ' i i i i S P S F Z E E In Bayesian notation: P(S i bound) = P(bound S i ) P(S i )/P(bound)

Log-odds methods assumes sites are selected from a Boltzmann distribution where probability of binding is proportional to e -E

Problem: the simple probabilistic model doesn t model doesn t account for the protein concentration and saturation effects, or the biophysics of the interaction

Boltzmann is an approximation to the real (Fermi-Dirac) distribution: 1 P( S i bound) 1 e E i for Probabilities =.03,.33,.67,.99 for the consensus site with energy=0

True vs Predicted Binding Probabilities

Does it matter if we assume the wrong distribution? 1. Are binding sites often near saturation? 2. If so how accurate are the log-odds models? 3. How does it affect false-positive/falsenegative rates when predicting sites? 4. How can we get models using the correct distribution?

Djordjevic et al Genome Res. 2003 13: 2381-2390

Quadratic Programming finds solution with minimum number of predicted binding sites

Compare performance of: Log-odds (Red) QPMEME (Blue) MATCH (YELLOW) Simulated data with known Energy matrix T P rate 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 0.02 FP rate Sample sites with assuming: Step function (top) Boltzmann (bottom) For all score cutoffs, Vertical axis = TP rate Horizontal axis = FP rate T P rate 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 0.02 FP rate

Can we do better than either QP or Log-odds? Want to fit the whole distribution and estimate both the binding energy matrix and µ Sample sequences based on their probability, Estimate model from sampled sites

Binding Energy Estimates using Maximum Likelihood (BEEML)

High-throughput SELEX Components: Purified protein with GST-tag attached to beads Randomized synthetic library Method: 1. Illumina sequence library to ascertain any biases (to get P(S i )) 2. Extract DNA bound to protein, Illumina sequence (to get P(S i bound)) 3. (Repeat step 2 if desired) 4. BEEML analysis of sequences to get PWM

Protein Binding Microarray (PBM)Technology Berger et al. Nature Biotechnology, Vol. 24, pp. 1429-1435 (November 2006)

Accuracy of BEEML PWM from PBM data

UniProbe PWM

Conclusions Log-odds methods for estimating energy matrices and/or binding probabilities work fine in some cases, poorly in others Can take into account protein concentration and get fit to the whole distribution using a maximum likelihood method that gives good results under a wider variety of conditions Possible to take into account non-specific binding and multiple modes of binding too

Acknowledgement Most of the slides in this chapter were provided by Prof. Gary Stormo. 60