Computational personal genomics: selection, regulation, epigenomics, disease Manolis Kellis Broad Institute of MIT and Harvard MIT Computer Science & Artificial Intelligence Laboratory
TATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA ATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC GCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT TTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG ATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA CTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT GATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA CGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG ACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA GGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT CAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA AGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC AGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG TCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG CTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA ATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG GAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC TGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA AAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA TTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT GCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA TTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG CCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG TTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC AACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA ATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT GCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA TATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA CTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA AGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC ACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT
TATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA ATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC GCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT TTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG ATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA CTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT GATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA CGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG ACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA Genes GGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT Regulatory motifs CAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA Encode AGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC Control AGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG proteins TCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG gene expression CTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA ATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG GAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC TGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA AAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA TTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT GCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA TTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG CCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG TTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC AACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA AAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA ATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT CCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT GCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA TATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA CTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA AGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC ACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT
Family Inheritance Personal genomics today: 23 and We Me vs. my brother Recombination breakpoints Dad s mom My dad Mom s dad Disease risk Human ancestry Genomics: Regions mechanisms drugs Systems: genes combinations pathways
Understanding human variation and human disease Gene annotation (Coding, 5 /3 UTR, RNAs) Evolutionary signatures Non-coding annotation Chromatin signatures CATGACTG CATGCCTG Disease-associated variant (SNP/CNV/ ) Roles in gene/chromatin regulation Activator/repressor signatures Other evidence of function Signatures of selection (sp/pop) Challenge: from loci to mechanism, pathways, drug targets Goal: A systems-level understanding of genomes and gene regulation: The regulators: Transcription factors, micrornas, sequence specificities The regions: enhancers, promoters, and their tissue-specificity The targets: TFs targets, regulators enhancers, enhancers genes The grammars: Interplay of multiple TFs prediction of gene expression The parts list = Building blocks of gene regulatory networks
Tools for interpreting the human genome Evolutionary signatures Genome annotation Distinct signatures for proteins, ncrnas, mirnas, motifs Read-through, excess-constraint, networks,human selection Chromatin signatures Dynamic regulatory regions Define chromatin states from combinations of histone marks Distinct classes of promoter/enhancer/transcribed/repressed Activity signatures Link enhancer networks Activity-based linking of regulators enhancers targets Testing of 1000s of enhancer activator / repressor motifs Personal genomics Interpret disease mechanism Disease-associations: Mechanistic predictions for variants Beyond top hits: 2000+ GWAS T1D-associated enhancers Global methylation changes in Alzheimer s: NRSF, ELK1, CTCF
Evolutionary signatures reveal genes, RNAs, motifs Compare 29 mammals Increased conservation pinpoints functional regions Lindblad-Toh Nature 2011 Stark Nature 2007 Distinct patterns of change distinguish diff. functions Protein-coding genes - Codon Substitution Frequencies - Reading Frame Conservation RNA structures - Compensatory changes - Silent G-U substitutions micrornas - Shape of conservation profile - Structural features: loops, pairs - Relationship with 3 UTR motifs Regulatory motifs - Mutations preserve consensus - Increased Branch Length Score - Genome-wide conservation
Compare 29 mammals: Reveal constrained positions NRSF motif Reveal individual transcription factor binding sites Within motif instances reveal position-specific bias More species: motif consensus directly revealed
Human constraint outside conserved regions Active regions Average diversity (heterozygosity) Aggregate over the genome Ward and Kellis Science 2012 Non-conserved regions: ENCODE-active regions show reduced diversity Lineage-specific constraint in biochemically-active regions Conserved regions: Non-ENCODE regions show increased diversity Loss of constraint in human when biochemically-inactive
Human-specific enhancer functions play regulatory roles Transcription initiation from Pol2 promoter Transcription coactivator activity Transcription factor binding Chromatin binding Negative regulation of transcription, DNA-dependent Transcription factor complex Protein complex Protein kinase activity Nerve growth factor receptor signaling pathway Signal transducer activity Protein serine/threonine kinase activity Negative regulation of transcription from Pol2 prom Protein tyrosine kinase activity In utero embryonic development Regulatory genes: Transcription, Chromatin, Signaling. Developmental enhancers: embryo, nerve growth
Tools for interpreting the human genome Evolutionary signatures Genome annotation Distinct signatures for proteins, ncrnas, mirnas, motifs Read-through, excess-constraint, networks,human selection Chromatin signatures Dynamic regulatory regions Define chromatin states from combinations of histone marks Distinct classes of promoter/enhancer/transcribed/repressed Activity signatures Link enhancer networks Activity-based linking of regulators enhancers targets Testing of 1000s of enhancer activator / repressor motifs Personal genomics Interpret disease mechanism Disease-associations: Mechanistic predictions for variants Beyond top hits: 2000+ GWAS T1D-associated enhancers Global methylation changes in Alzheimer s: NRSF, ELK1, CTCF
Integrate epigenomics datasets in multiple cell types 2. DNA methylation 3. DNA accessibility Epigenomic maps 1. Histone modifications Epigenetic modifications DNA/histone/nucleosome Encode epigenetic state Histone code hypothesis Distinct function for distinct combinations of marks? Hundreds of histone marks Astronomical number of histone mark combinations How do we find biologically relevant ones? Unsupervised approach Probabilistic model Explicit combinatorics
Chromatin state dynamics across nine cell types Predicted linking Correlated activity Single annotation track for each cell type Summarize cell-type activity at a glance Can study 9-cell activity pattern across
Epigenomics Roadmap: 90 complete epigenomes Interpret GWAS, global effects, reveal relevant cell types
Enhancer modules associated with tissue identity Clustering of 500,000 distal enhancers Tissue-specific disease-relevant tissues and processes
Dissect motifs in 10,000s of human enhancers 54000+ measurements (x2 cells, 2x repl) Predict activators/repressors based on activity correlations Validate by engineering enhancers disrupting causal motifs
Example activator: conserved HNF4 motif match WT expression specific to HepG2 Motif match disruptions reduce expression to background Non-disruptive changes maintain expression Random changes depend on effect to motif match
Tools for interpreting the human genome Evolutionary signatures Genome annotation Distinct signatures for proteins, ncrnas, mirnas, motifs Read-through, excess-constraint, networks,human selection Chromatin signatures Dynamic regulatory regions Define chromatin states from combinations of histone marks Distinct classes of promoter/enhancer/transcribed/repressed Activity signatures Link enhancer networks Activity-based linking of regulators enhancers targets Testing of 1000s of enhancer activator / repressor motifs Personal genomics Interpret disease mechanism Disease-associations: Mechanistic predictions for variants Beyond top hits: 2000+ GWAS T1D-associated enhancers Global methylation changes in Alzheimer s: NRSF, ELK1, CTCF
Revisiting diseaseassociated variants xx Disease-associated SNPs enriched for enhancers in relevant cell types E.g. lupus SNP in GM enhancer disrupts Ets1 predicted activator
Variation in methylation patterns largely genotype driven Global signature of repression in 1000s regulatory regions: hypermethylation, enhancer states, brain regulator targets G E D: Global repression of brain enhancers in AD MAP Memory and Aging Project + ROS Religious Order Study Dorsolateral PFC Genotype (1M SNPs x700 ind.) Reference Chromatin states Methylation (450k probes x 700 ind)
Global hyper-methylation in 1000s of AD-associated loci P-value Top 7000 probes 480,000 probes, ranked by Alzheimer s association Methylation Alzheimer s-associated probes are hypermethylated Global effect across 1000s of probes Rank all probes by Alzheimer s association 7000 probes increase methylation (repressed) Enriched in brain-specific enhancers Near motifs of brain-specific regulators Complex disease: genome-wide effects
Covers computational challenges associated with personal genomics: - genotype phasing and haplotype reconstruction resolve mom/dad chromosomes - exploiting linkage for variant imputation co-inheritance patterns in human population - ancestry painting for admixed genomes result of human migration patterns - predicting likely causal variants using functional genomics from regions to mechanism - comparative genomics annotation of coding/non-coding elements gene regulation - relating regulatory variation to gene expression or chromatin quantitative trait loci - measuring recent evolution and human selection selective pressure shaped our genome - using systems/network information to decipher weak contributions combinatorics - challenge of complex multi-genic traits: height, diabetes, Alzheimer's 1000s of genes
MIT Computational Biology group Compbio.mit.edu Mike Lin Matt Eaton Stata3 Stata4 Angela Yen Manolis Kellis Ben Holmes Soheil Feizi Jason Ernst Luke Ward Jessica Wu Bob Altshuler Louisa DiStefano Irwin Jungreis Mukul Bansal Daniel Marbach Sushmita Roy Chris Bristow Dave Hendrix Pouya Kheradpour Rachel Sealfon Loyal Goff Stefan Washietl