Genomic and epigenomic signatures for interpreting complex disease Manolis Kellis Broad Institute of MIT and Harvard MIT Computer Science & Artificial Intelligence Laboratory
TATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA ATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC GCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT TTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG ATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA CTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT GATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA CGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG ACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA GGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT CAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA AGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC AGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG TCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG CTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA ATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG GAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC TGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA AAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA TTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT GCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA TTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG CCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG TTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC AACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA ATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT GCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA TATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA CTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA AGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC ACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT
TATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA ATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC GCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT TTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG ATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA CTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT GATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA CGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG ACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA Genes GGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT Regulatory motifs CAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA Encode AGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC Control AGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG proteins TCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG gene expression CTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA ATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG GAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC TGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA AAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA TTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT GCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA TTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG CCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG TTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC AACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA ATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT GCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA TATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA CTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA AGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC ACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT
Building systems-level views of genomes and disease Goal: A systems-level understanding of genomes and gene regulation: The regulators: Transcription factors, micrornas, sequence specificities The regions: enhancers, promoters, and their tissue-specificity The targets: TFs targets, regulators enhancers, enhancers genes The grammars: Interplay of multiple TFs prediction of gene expression The parts list = Building blocks of gene regulatory networks Our tools: Comparative genomics & large-scale experimental datasets. Evolutionary signatures for coding/non-coding genes, micrornas, motifs Chromatin signatures for regulatory regions and their tissue specificity Activity signatures for linking regulators enhancers target genes Predictive models for gene function, gene expression, chromatin state Integrative models = Define roles in development, health, disease
Challenge: interpreting disease-associated variants Gene annotation (Coding, 5 /3 UTR, RNAs) Evolutionary signatures Non-coding annotation Chromatin signatures CATGACTG CATGCCTG Disease-associated variant (SNP/CNV/ ) Roles in gene/chromatin regulation Activator/repressor signatures Other evidence of function Signatures of selection (sp/pop) GWAS, case-control, reveal disease-associated variants Molecular mechanism, cell-type specificity, drug targets Challenges towards interpreting disease variants Find true causative SNP among many candidates in LD Use causal variant: predict function, pathway, drug targets Non-coding variant: type of function, cell type of activity Regulatory variant: upstream regulators, downstream targets This talk: genomics tools for addressing these challenges
Goal: Towards personal systems genomics Family Inheritance Me vs. my brother Recombination breakpoints Dad s mom My dad Mom s dad Disease risk Human ancestry Genomics: Regions mechanisms drugs Systems: genes combinations pathways
Systems-level views of disease epigenomics Evolutionary signatures gene/genome annotation High-resolution annotation: genes, RNAs, motif instances Measuring selection within the human population Chromatin states for interpreting disease association Annotate dynamic regulatory elements in multiple cell types Activity-based linking of regulators enhancers targets Interpreting disease-associated sequence variants Mechanistic predictions for individual top-scoring SNPs Functional roles of 1000s of disease-associated SNPs Systematic manipulation of 2000+ human enhancers Test effect of single-motif and single-nucleotide disruptions Role of activator/repressor motifs, disease-associated SNPs Personal genomes/epigenomes in health and disease Allele-specific activity.alzheimer s brain methylation SNP Global repression of distal enhancers. NRSF, ELK1, CTCF
Large-scale comparative genomics datasets 29 mammals 12 flies 17 fungi Post-duplication 9 Yeasts 8 Candida Kellis Nature 2003 Nature 2004; Stark Nature 2007; Clark Nature 2007; Butler Nature 2009; Lindblad-Toh Nature 2011 P P N P P N P P Haploid Diploid Pre-dup
Comparative genomics and evolutionary signatures Comparative genomics can reveal functional elements For example: exons are deeply conserved to mouse, chicken, fish Many other elements are also strongly conserved: exons / regulatory? Can we also pinpoint specific functions of each region? Yes! Patterns of change distinguish different types of functional elements Specific function Selective pressures Patterns of mutation/inse/del Develop evolutionary signatures characteristic of each function Kellis Nature 2003 Nature 2004; Stark Nature 2007; Clark Nature 2007; Butler Nature 2009; Lindblad-Toh Nature 2011
Evolutionary signatures for diverse functions Protein-coding genes - Codon Substitution Frequencies - Reading Frame Conservation RNA structures - Compensatory changes - Silent G-U substitutions micrornas - Shape of conservation profile - Structural features: loops, pairs - Relationship with 3 UTR motifs Regulatory motifs - Mutations preserve consensus - Increased Branch Length Score - Genome-wide conservation Stark et al, Nature 2007
Implications for genome annotation / regulation Novel protein-coding genes Revised gene annotations Unusual gene structures Novel structural families Targeting, editing, stability Riboswitches in mammals Novel/expanded mir families mir/mir* arm cooperation Sense/anti-sense mir switches Novel regulatory motifs Regulatory motif instances TF/miRNA regulatory networks Single binding site resolution Stark et al, Nature 2007
Translational read-through in human & fly Overlapping selection in human exons Protein-coding conservation Jungreis, Genome Research 2011 Stop codon read through Continued proteincoding conservation 2 nd stop codon No more conserv Synonym. Substitut. Rate Lin, Genome Research 2011 Reveal splicing signals, RNA structures, enhancer motifs, dual-coding genes RNA structure families: ortholog/paralog cons Regions of codon-level positive selection distributed localized Parker Gen. Res. 2011 Ex:MAT2A S-adeosyl-methionic level detection Structure/loop sequence deep conservation Lindblad-Toh Nature 2011 Distributed vs. localized positive selection Immunity/taste vs. retinal/bone/secretion
NRSF motif Measuring constraint at individual nucleotides Reveal individual transcription factor binding sites Within motif instances reveal position-specific bias More species: motif consensus directly revealed
Detect SNPs that disrupt conserved regulatory motifs Functionally-associated SNPs enriched in states, constraint Prioritize candidates, increase resolution, disrupted motifs
Measuring selection within the human lineage
Human constraint outside conserved regions Active regions Average diversity (heterozygosity) Aggregate over the genome Non-conserved regions: ENCODE-active regions show reduced diversity Lineage-specific constraint in biochemically-active regions Conserved regions: Non-ENCODE regions show increased diversity Loss of constraint in human when biochemically-inactive
Strongest: motifs, short RNA, Dnase, ChIP, lncrna Significant derived allele depletion in active features
Bound motifs show increased human constraint Position-specific reduction in bound motif heterozygosity Aggregate across thousands of CTCF motif instances
Most constrained human-specific enhancer functions Transcription initiation from Pol2 promoter Transcription coactivator activity Transcription factor binding Chromatin binding Negative regulation of transcription, DNA-dependent Transcription factor complex Protein complex Protein kinase activity Nerve growth factor receptor signaling pathway Signal transducer activity Protein serine/threonine kinase activity Negative regulation of transcription from Pol2 prom Protein tyrosine kinase activity In utero embryonic development Regulatory genes: Transcription, Chromatin, Signaling. Developmental enhancers: embryo, nerve growth
Systems-level views of disease epigenomics Evolutionary signatures gene/genome annotation High-resolution annotation: genes, RNAs, motif instances Measuring selection within the human population Chromatin states for interpreting disease association Annotate dynamic regulatory elements in multiple cell types Activity-based linking of regulators enhancers targets Interpreting disease-associated sequence variants Mechanistic predictions for individual top-scoring SNPs Functional roles of 1000s of disease-associated SNPs Systematic manipulation of 2000+ human enhancers Test effect of single-motif and single-nucleotide disruptions Role of activator/repressor motifs, disease-associated SNPs Personal genomes/epigenomes in health and disease Allele-specific activity.alzheimer s brain methylation SNP Global repression of distal enhancers. NRSF, ELK1, CTCF
Chromatin signatures for genome annotation Epigenomic maps 1. DNA methylation 2. Histone modifications Ernst et al Nature Biotech 2010 3. DNA accessibility See also: Amos Tanay, Bill Noble.
ENCODE: Study nine marks in nine human cell lines 9 marks H3K4me1 H3K4me2 H3K4me3 H3K27ac H3K9ac H3K27me3 H4K20me1 H3K36me3 CTCF +WCE +RNA x 9 human cell types HUVEC NHEK GM12878 K562 HepG2 NHLF HMEC HSMM Umbilical vein endothelial Keratinocytes Lymphoblastoid Myelogenous leukemia Liver carcinoma Normal human lung fibroblast Mammary epithelial cell Skeletal muscle myoblasts H1 Embryonic Brad Bernstein ENCODE Chromatin Group 81 Chromatin Mark Tracks (2 81 combinations) Learned jointly across cell types (virtual concatenation) State definitions are common State locations are dynamic Ernst et al, Nature 2011
Chromatin states dynamics across nine cell types Predicted linking Correlated activity Single annotation track for each cell type Summarize cell-type activity at a glance Can study 9-cell activity pattern across
Introducing multi-cell activity profiles Gene Chromatin Active TF motif TF regulator Dip-aligned expression States enrichment expression motif biases HUVEC NHEK GM12878 K562 HepG2 NHLF HMEC HSMM H1 Link enhancers to target genes ON OFF Active enhancer Repressed Motif enrichment Motif depletion TF On TF Off Motif aligned Flat profile
Enhancer-gene links supported by eqtl-gene links eqtl study Individuals Indiv. 1 Indiv. 2 Indiv. 3 Indiv. 4 Indiv. 5 Indiv. 6 Indiv. 7 Indiv. 8 Indiv. 9-0.5-1.5-1.8 3.1 1.1-1.8-1.4 3.2 4.4 15kb A A A C A A A C C Validation rationale: Expression Quantitative Trait Loci (eqtls) provide independent SNP-to-gene links Do they agree with activity-based links? Example: Lymphoblastoid (GM) cells study Expression/genotype across 60 individuals (Montgomery et al, Nature 2010) 120 eqtls are eligible for enhancer-gene linking based on our datasets 51 actually linked (43%) using predictions 4-fold enrichment (10% exp. by chance) Expression level of gene Sequence variant at distal position Independent validation of links. Relevance to disease datasets. 25
Visualizing 10,000s predicted enhancer-gene links Overlapping regulatory units, both few and many Both upstream and downstream elements linked Enhancers correlate with sequence constraint 26
Introducing multi-cell activity profiles Gene Chromatin Active TF motif TF regulator Dip-aligned expression States enrichment expression motif biases HUVEC NHEK GM12878 K562 HepG2 NHLF HMEC HSMM H1 Link TFs to target enhancers Predict activators vs. repressors ON OFF Active enhancer Repressed Motif enrichment Motif depletion TF On TF Off Motif aligned Flat profile
Coordinated activity reveals activators/repressors Enhancer activity Activity signatures for each TF Ex1: Oct4 predicted activator of embryonic stem (ES) cells Ex2: Gfi1 repressor of K562/GM cells Enhancer networks: Regulator enhancer target gene
Causal motifs supported by dips & enhancer assays Predicted causal HNF motifs (that also showed dips) in HepG2 enhancers Dip evidence of TF binding (nucleosome displacement) Tarjei Mikkelsen Enhancer activity halved by single-motif disruption Motifs bound by TF, contribute to enhancers 29
Systems-level views of disease epigenomics Evolutionary signatures gene/genome annotation High-resolution annotation: genes, RNAs, motif instances Measuring selection within the human population Chromatin states for interpreting disease association Annotate dynamic regulatory elements in multiple cell types Activity-based linking of regulators enhancers targets Interpreting disease-associated sequence variants Mechanistic predictions for individual top-scoring SNPs Functional roles of 1000s of disease-associated SNPs Systematic manipulation of 2000+ human enhancers Test effect of single-motif and single-nucleotide disruptions Role of activator/repressor motifs, disease-associated SNPs Personal genomes/epigenomes in health and disease Allele-specific activity.alzheimer s brain methylation SNP Global repression of distal enhancers. NRSF, ELK1, CTCF
Interpreting disease-association signals Interpret variants using Epigenomics - Chromatin states: Enhancers, promoters, motifs - Enrichment in individual loci, across 1000s of SNPs in T1D CATGACTG CATGCCTG Genotype GWAS Disease Epigenome changes in disease
Revisiting diseaseassociated variants xx Disease-associated SNPs enriched for enhancers in relevant cell types E.g. lupus SNP in GM enhancer disrupts Ets1 predicted activator
Disrupt activator Ets-1 motif Loss of GM-specific activation Loss of enhancer function Loss of HLA-DRB1 expression Creation of repressor Gfi1 motif Gain K562-specific repression Loss of enhancer function Loss of CCDC162 expression Mechanistic predictions for top disease-associated SNPs Lupus erythromatosus in GM lymphoblastoid Erythrocyte phenotypes in K562 leukemia cells `
Allele-specific chromatin marks: cis-vs-trans effects Maternal and paternal GM12878 genomes sequenced Map reads to phased genome, handle SNPs indels Correlate activity changes with sequence differences
HaploReg: systematic ENCODE mining of variants (compbio.mit.edu/haploreg) Start with any list of SNPs or select a GWA study Mine publically available ENCODE data for significant hits Hundreds of assays, dozens of cells, conservation, motifs Report significant overlaps and link to info/browser
Functional enrichment for 1000s of SNPs
Full T1D association spectrum 1000s of causal SNPs Rank all SNPs by P-value Find chromatin states with enrichment in high ranks Signal spans 1000s of SNPs GM12878 K562 Lymphoblastoid Myelogenous leukemia GM12878 enhancer enrichment now seen Could bias in array design contribute to these enrichments? Evaluate all 1000 genomes SNPs by imputing those in LD Cell type specific: GM and K562 enhancers Chromatin state specific: Enhancers/promoters
Imputing SNPs in LD stronger cell/state separation Enhancers across cell types Chromatin states in GM12878 Promoters: 462 (excess 81) Enhancers: 2049 (excess 392) 1940 distinct loci (R^2<.8) Transcribed: 4740 (excess 522) Insulator: 240 (excess 23) Repressed: 1351 (excess 76) Other: 21k (deplete 1093) Excess of 30,000 SNPs 2049 enhancers (excess 392) Mostly found in independent loci (1730 with R 2 <0.2) Systematically measure their regulatory contributions
Systems-level views of disease epigenomics Evolutionary signatures gene/genome annotation High-resolution annotation: genes, RNAs, motif instances Measuring selection within the human population Chromatin states for interpreting disease association Annotate dynamic regulatory elements in multiple cell types Activity-based linking of regulators enhancers targets Interpreting disease-associated sequence variants Mechanistic predictions for individual top-scoring SNPs Functional roles of 1000s of disease-associated SNPs Systematic manipulation of 2000+ human enhancers Test effect of single-motif and single-nucleotide disruptions Role of activator/repressor motifs, disease-associated SNPs Personal genomes/epigenomes in health and disease Allele-specific activity.alzheimer s brain methylation SNP Global repression of distal enhancers. NRSF, ELK1, CTCF
High-throughput experiments: 10,000s enhancers Melnikov, Nature Biotech 2012 Experiment features: Multiplexed enhancer assays 10,000s of elements Each w/ unique barcode Multiple human cell types Repeat experiments on same array / diff barcodes Applied to: Test enhancer offsets Test causal motifs With: Tarjei Mikkelse Broad Institute, ARRA funds See also: Barak Cohen, Jay Shendure, Eran Segal
Systematic motif disruption for 5 activators and 2 repressors in 2 human cell lines 54000+ measurements (x2 cells, 2x repl)
Example activator: conserved HNF4 motif match WT expression specific to HepG2 Motif match disruptions reduce expression to background Non-disruptive changes maintain expression Random changes depend on effect to motif match
Results hold across 2000+ enhancers Scramble abolishes reporter expression Neutral mutations show no change Increasing mutations show more expression However, only 40% show wild-type expression: context?
Features of functional wildtype enhancers Nucleosome exclusion, motif conservation, other TFs Each of these features is encoded in primary sequence
Repressors of HepG2 enhancer act in K562 Repressor disruption aberrant expression in opposite cell types
Testing effect of SNP change in enhancer constructs SNPs in enhancer regions can lead to expression changes in downstream reporter genes Currently testing all T1D-associated enhancer SNPs
Systems-level views of disease epigenomics Evolutionary signatures gene/genome annotation High-resolution annotation: genes, RNAs, motif instances Measuring selection within the human population Chromatin states for interpreting disease association Annotate dynamic regulatory elements in multiple cell types Activity-based linking of regulators enhancers targets Interpreting disease-associated sequence variants Mechanistic predictions for individual top-scoring SNPs Functional roles of 1000s of disease-associated SNPs Systematic manipulation of 2000+ human enhancers Test effect of single-motif and single-nucleotide disruptions Role of activator/repressor motifs, disease-associated SNPs Personal genomes/epigenomes in health and disease Allele-specific activity.alzheimer s brain methylation SNP Global repression of distal enhancers. NRSF, ELK1, CTCF
Interpreting disease-association signals (1) Interpret variants using Epigenomics - Chromatin states: Enhancers, promoters, motifs - Enrichment in individual loci, across 1000s of SNPs in T1D CATGACTG CATGCCTG Genotype GWAS Disease mqtls MWAS Epigenome (2) Epigenome changes in disease - Intermediate molecular phenotypes associated with disease - Variation in brain methylomes of Alzheimer s patients
Phil de Jager: Methylation in 750 Alzheimer patients 750 individuals 500,000 methylation probes Phil de Jager, Roadmap disease epigenomics Genome Epigenome Phenotype Brad Bernstein REMC mapping meqtl Epigenome Classification MWAS Patients followed for 10+ years with cognitive evaluations Brain samples donated post-mortem methylation/genotype Seek predictive features: SNPs, QTLs, mqtls, regulation 1 2
P-value exponent (-log 10 P) 2,500 mqtls for neighboring SNPs at 10-14 Chromosome and genomic position Distance from CpG (MB) -1 1 Overlay Manhattan plots of 450,000 methylation probes Cutoff of 10-14 (10-8 after Bonferroni correction) Use to pinpoint disrupted motifs, predict epigenome 50
Focusing on 2831 most variable probes Probe intensity distribution Inter-individual variability Promoters: Stable low (active) Gene bodies: Stable high (active) Enhancers/poised: Most variable Hemi-methylated probes are also the most variable Tiny fraction (0.6%) of all probes
Multimodal SNP-associated Promoter-depleted Multimodal probes (~3Κ) 184 2,647 SNP-associated probes (29% of all) 138,731 All probes SNP-associated 1 Active promoter 2 Promoter flanking 3 Active enhancer 4 Weak enhancer 5 Gene bodies 93.5% of multimodal probes are SNP-associated Importance of distinguishing contribution of genotype to disease associations % of CpG probes 6 Active gene bodies 7 Repetitive 8 Heterochromatin 9 Low signal SNP-associated probes depleted in promoters (driven epigenetically>genetically, open chrom)
Phil de Jager: Methylation in 750 Alzheimer patients 750 individuals 500,000 methylation probes Phil de Jager, Roadmap disease epigenomics Genome Epigenome Phenotype Brad Bernstein REMC mapping meqtl Epigenome Classification MWAS Patients followed for 10+ years with cognitive evaluations Brain samples donated post-mortem methylation/genotype Seek predictive features: SNPs, QTLs, mqtls, regulation 1 2
Global hyper-methylation trend in AD-associated probes P-value Top 7000 probes Hypomethylated probes (active) Alzheimer s Normal 480,000 probes, ranked by Alzheimer s association Methylation Alzheimer s-associated probes are hypermethylated Global effect across 1000s of probes Rank all probes by Alzheimer s association Observe functional changes down ranklist 7000 probes show shift in methylation Hypermethylated probes (repressed) Alzheimer s Normal Complex disease: genome-wide effects, 1000s of loci
Chromatin state breakdown reveals activity Red: More methylated in Alhzeimer s Blue: Less methylated in Alzheimer s % probes Significant probes are in enhancers Not promoters 1 Active promoter 2 Promoter flanking 3 Active enhancer 4 Weak enhancer 5 Gene bodies 6 Active gene bodies 7 Repetitive 8 Heterochromatin 9 Low signal * => fisher exact test, p-value <= 0.001
Alzheimer s prediction vs. likely biological pathways NRSF All probes, ranked by AD assoc. P-value ELK1 All probes, ranked by AD assoc. P-value CTCF Predictive power: 6k probes + APOE Regulatory motifs associated with Alzheimer-associated probes suggest potential pathways We have not solved Alzheimer s, but new insights gained
Systems-level views of disease epigenomics Evolutionary signatures gene/genome annotation High-resolution annotation: genes, RNAs, motif instances Measuring selection within the human population Chromatin states for interpreting disease association Annotate dynamic regulatory elements in multiple cell types Activity-based linking of regulators enhancers targets Interpreting disease-associated sequence variants Mechanistic predictions for individual top-scoring SNPs Functional roles of 1000s of disease-associated SNPs Systematic manipulation of 2000+ human enhancers Test effect of single-motif and single-nucleotide disruptions Role of activator/repressor motifs, disease-associated SNPs Personal genomes/epigenomes in health and disease Allele-specific activity.alzheimer s brain methylation SNP Global repression of distal enhancers. NRSF, ELK1, CTCF
Understanding human variation and human disease Gene annotation (Coding, 5 /3 UTR, RNAs) Evolutionary signatures Non-coding annotation Chromatin signatures CATGACTG CATGCCTG Disease-associated variant (SNP/CNV/ ) Roles in gene/chromatin regulation Activator/repressor signatures Other evidence of function Signatures of selection (sp/pop) Challenge: from loci to mechanism, pathways, drug targets Goal: A systems-level understanding of genomes and gene regulation: The regulators: Transcription factors, micrornas, sequence specificities The regions: enhancers, promoters, and their tissue-specificity The targets: TFs targets, regulators enhancers, enhancers genes The grammars: Interplay of multiple TFs prediction of gene expression The parts list = Building blocks of gene regulatory networks
Collaborators and Acknowledgements ENCODE Brad Bernstein, Tarjei Mikkelsen, Noam Shoresh, David Epstein Massively parallel enhancer reporter assays Tarjei Mikkelsen, Broad Institute Epigenome Roadmap Bing Ren, Brad Bernstein, John Stam, Joe Costello 2X mammals Kerstin Lindblad-Toh, Eric Lander, Manuel Garber, Or Zuk Funding NHGRI, NIH, NSF Sloan Foundation
Daniel Marbach Mike Lin Jason Ernst Jessica Wu Rachel Sealfon Pouya Kheradpour (#187) Manolis Kellis Chris Bristow Loyal Goff Irwin Jungreis MIT Computational Biology group Compbio.mit.edu Sushmita Roy #331: Luke Ward Stata4 Stata3 Louisa DiStefano Dave Hendrix Angela Yen Ben Holmes Soheil Feizi Mukul Bansal #19:Bob Altshuler Stefan Washietl Matt Eaton