Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks David R. Kelley
DNA codes for complex life. How? Kundaje et al. Integrative analysis of 111 reference human epigenomes. Nature, 2015.
Noncoding DNA determines gene expression Transcription Factors. Kelvin Song. CC BY 3.0
We can't read noncoding DNA TAGTAAAAAAACAACAAAAGACTTTTTCTGACAGTATGATTTACATAACTA CATTTTCATTACTTTTATTTTTCTACATAACTAGTGTCTTTTCAGTTGCAA ATGTTTGACATTCTGACAAGTAGTGTGGCAGAGTCTTAGATTATAGGTTGC ATTTAGCCAAAAGAAAGACTTCGAATGGAATTTTTTTCTATTGACACACTT TCTAACAACATACTTATTTTCTAAAAAGGTTTTTATAACTTAGTGTTGATA ATATCAAAATGCTAAGCAATTTTGCTTAAAAAGCGTAGAACACCAATATTT AATGAAGATTAATTAAATAGCACACATTGATTACTTGTTTAAAAATATTCG GAAAAGTTTTGACACATGCTAAAGTGCTGAAGTAGGATTTTGGCCTTCCAT AAAAATAATATATTGTGCATAAATGGATGCAGAATGAAGAAAGCAATGGGG
Reading noncoding DNA would transform variant interpretation
DNaseI hypersensitivity [Figure: DNaseI hypersensitivity at the KLF4 locus; scale bar 20 kb.] Boyle et al. High-resolution mapping and characterization of open chromatin across the genome. Cell, 2008.
Is this sequence accessible? TAGTAAAAAAACAACAAAAGACTTTTTCTGACAGTATGATTTACATAACTA CATTTTCATTACTTTTATTTTTCTACATAACTAGTGTCTTTTCAGTTGCAA ATGTTTGACATTCTGACAAGTAGTGTGGCAGAGTCTTAGATTATAGGTTGC ATTTAGCCAAAAGAAAGACTTCGAATGGAATTTTTTTCTATTGACACACTT TCTAACAACATACTTATTTTCTAAAAAGGTTTTTATAACTTAGTGTTGATA ATATCAAAATGCTAAGCAATTTTGCTTAAAAAGCGTAGAACACCAATATTT AATGAAGATTAATTAAATAGCACACATTGATTACTTGTTTAAAAATATTCG GAAAAGTTTTGACACATGCTAAAGTGCTGAAGTAGGATTTTGGCCTTCCAT AAAAATAATATATTGTGCATAAATGGATGCAGAATGAAGAAAGCAATGGGG
Learning DNA sequence activity
Sequence         Activity
ACGTGATTACAACGT  1
ATTTAGCCAAAAGAA  1
AGACTTCGAATGGAA  0
TTTTTTTCTATTGAC  0
ACACTTTCTAACAAC  0
ATACTTATTTTCTAA  1
AAAGGTTTTTATAAC  0
TTAGTGTTGATAATA  0
Sequence representation
ACGTGATTACAACGT, counted as 4-mers and passed to a learning algorithm:
AACG 1
ACAA 1
ACGT 2
ATTA 1
CAAC 1
CGTG 1
GATT 1
GTGA 1
TACA 1
TGAT 1
TTAC 1
With big data & big computers, can we learn better representations?
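The k-mer table on this slide can be reproduced with a few lines of Python. This is a minimal sketch of overlapping k-mer counting, not code from Basset; the function name `kmer_counts` is made up for illustration.

```python
from collections import Counter


def kmer_counts(seq, k=4):
    """Count every overlapping k-mer in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# The example sequence from the slide.
counts = kmer_counts("ACGTGATTACAACGT", k=4)
for kmer in sorted(counts):
    print(kmer, counts[kmer])
```

A 15-nucleotide sequence has 12 overlapping 4-mers; here ACGT occurs twice, so the table has 11 distinct entries.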
Artificial neural networks
Convolutional neural network http://deeplearning.net/tutorial/lenet.html
Convolutional neural network Zeiler, Fergus. Visualizing and understanding convolutional networks. ECCV 2014.
DNA convolutional neural network A C G T
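The A, C, G, T rows on this slide correspond to a one-hot encoding of the input sequence. A minimal sketch follows, assuming one row per nucleotide and columns in A, C, G, T order; the all-zero treatment of unknown bases such as N is an assumption made here, not something the slides specify.

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Return one 4-element row per nucleotide, columns ordered A, C, G, T."""
    rows = []
    for base in seq.upper():
        row = [0.0] * len(ALPHABET)
        if base in ALPHABET:
            row[ALPHABET.index(base)] = 1.0
        # Assumption: bases outside ACGT (for example N) stay all-zero.
        rows.append(row)
    return rows


encoded = one_hot("ACGT")
for base, row in zip("ACGT", encoded):
    print(base, row)
```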
DNA convolutional neural network A C G T [Figure: heat map over the A, C, G, T rows; color scale from 2 to -2.]
DNA convolutional neural network
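To make the scanning step concrete, here is a rough sketch of one convolution filter sliding along a DNA sequence. The filter weights are hand-set and hypothetical (+2 for the preferred base, -1 otherwise), the ReLU activation is an assumption, and a trained network would learn many such filters from data.

```python
ALPHABET = "ACGT"


def scan_filter(seq, weights, bias=0.0):
    """Slide one filter along seq.

    weights[j][b] is the score for base ALPHABET[b] at filter offset j.
    This equals cross-correlating the filter with the one-hot encoded sequence.
    """
    width = len(weights)
    scores = []
    for start in range(len(seq) - width + 1):
        total = bias
        for j in range(width):
            base = seq[start + j]
            if base in ALPHABET:
                total += weights[j][ALPHABET.index(base)]
        scores.append(max(0.0, total))  # ReLU activation (assumed)
    return scores


# Hypothetical filter that prefers the 4-mer TGAC.
motif = "TGAC"
weights = [[2.0 if b == m else -1.0 for b in ALPHABET] for m in motif]

seq = "ACGTGACTCA"
scores = scan_filter(seq, weights)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, seq[best:best + len(motif)], scores[best])
```

The position with the highest activation marks where the sequence best matches the filter, which is why a learned filter can be read as a motif.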
Basset
Learning the accessibility code. DNaseI-seq from 164 cell types from ENCODE and Epigenomics Roadmap. 2 million DNaseI hypersensitivity sites broken into training, validation, and test sets. [Figure: matrix of DNaseI hypersensitivity sites by cells.]
Convolutional nets accurately predict accessibility. [Figure, left: ROC curves (true positive rate vs. false positive rate) for PanIslets (AUC 0.839), HUVEC (AUC 0.896), CLL (AUC 0.907), HRE (AUC 0.917), and HPF (AUC 0.929). Figure, right: Basset AUC vs. gkm-svm AUC per cell type; mean AUC 0.900 for Basset and 0.780 for gkm-svm.] Ghandi, Lee et al. Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. PLoS Comp Bio, 2014.
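For readers unfamiliar with the metric, AUC is the probability that a randomly chosen accessible site receives a higher score than a randomly chosen inaccessible one. The sketch below computes it directly from that definition. The labels are the 1/0 activity labels from the earlier example slide; the scores are invented for illustration and are not Basset output.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.8, 0.6, 0.3, 0.1]  # hypothetical predictions
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of examples; it is meant to show the definition, and a rank-based implementation would be preferable at the scale of millions of sites.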
Filters recapitulate known protein binding motifs
Multiple filters capture motif variants
Filter influence reflects cell specificity
Annotate nucleotide influence [Figure: genome browser view spanning 118.434 mb to 118.437 mb with JUN ChIP-seq, DNaseI-seq, and JUND ChIP-seq tracks, above a per-nucleotide influence map and PhyloP conservation for the sequence CAGCCTTTGTTAATGGGGACACAATCCTGGAAATTTTGCCTGTGTGTAAACCTCTAGGGGCTTTTTCTTTCATCGTTTTACATCAGCCAGACTCTGACTCACAGCTGGAGAATCAGCTTCCTTATTATGTAGCGAATTCCATGAACACAC.]
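One common way to obtain a per-nucleotide influence map is in silico saturation mutagenesis: mutate every position to each alternative base and record how the model's prediction changes. The slide does not say how its map was computed, so treat this as an assumed approach. `toy_predict` is a hypothetical stand-in for a trained model that scores the best match to the motif TGACTCA, which occurs in the sequence shown on the slide.

```python
import math

ALPHABET = "ACGT"


def toy_predict(seq, motif="TGACTCA"):
    """Hypothetical model: logistic function of the best motif match count.

    Assumes len(seq) >= len(motif).
    """
    best = max(
        sum(a == b for a, b in zip(seq[i:i + len(motif)], motif))
        for i in range(len(seq) - len(motif) + 1)
    )
    return 1.0 / (1.0 + math.exp(-(best - len(motif) + 1)))


def saturation_mutagenesis(seq, predict):
    """For each position, map each alternative base to the prediction change."""
    reference = predict(seq)
    effects = []
    for i, ref_base in enumerate(seq):
        row = {}
        for alt in ALPHABET:
            if alt != ref_base:
                mutant = seq[:i] + alt + seq[i + 1:]
                row[alt] = predict(mutant) - reference
        effects.append(row)
    return effects


# A 20-nt excerpt of the sequence on the slide, containing TGACTCA.
seq = "AGACTCTGACTCACAGCTGG"
effects = saturation_mutagenesis(seq, toy_predict)
influence = [max(abs(delta) for delta in row.values()) for row in effects]
for base, value in zip(seq, influence):
    print(base, round(value, 3))
```

With this toy model only the seven motif positions show nonzero influence, which mirrors the way such maps highlight the nucleotides a model depends on.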
Case study: vitiligo - rs4409785 Jin et al. Genome-wide association analyses identify 13 new susceptibility loci for generalized vitiligo. Nat Genetics, 2012.
Predictive of causal disease SNPs
CTCF ChIP-seq shows allele-specific rs4409785 binding
Can we easily add new datasets?
SNP interpretation requires relevant cell types.
1. Seed a new model with pretrained parameters.
2. Chop off the final model layer.
3. Train one pass (to avoid overfitting).
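The three steps above can be sketched with a toy two-layer network in plain Python. This is an illustration under stated assumptions, not Basset's training code: the dictionary layout, the function names, the learning rate, and the choice to update all layers during the single pass are all hypothetical.

```python
import math
import random


def hidden_layer(features, w_hidden):
    """ReLU hidden activations."""
    return [max(0.0, sum(w * x for w, x in zip(row, features))) for row in w_hidden]


def fine_tune_one_pass(pretrained, examples, n_new_targets, lr=0.1, seed=0):
    rng = random.Random(seed)
    # 1. Seed the new model with a copy of the pretrained hidden parameters.
    w_hidden = [row[:] for row in pretrained["hidden"]]
    # 2. Chop off the final layer: ignore pretrained["output"] and start a
    #    fresh output layer sized for the new dataset's targets.
    w_out = [[rng.uniform(-0.1, 0.1) for _ in w_hidden] for _ in range(n_new_targets)]
    # 3. Train for exactly one pass over the new examples (plain SGD).
    for features, targets in examples:
        h = hidden_layer(features, w_hidden)
        preds = [1.0 / (1.0 + math.exp(-sum(w * a for w, a in zip(row, h))))
                 for row in w_out]
        g_out = [p - t for p, t in zip(preds, targets)]  # sigmoid + cross-entropy
        g_hid = [sum(g_out[k] * w_out[k][j] for k in range(n_new_targets))
                 if h[j] > 0 else 0.0 for j in range(len(h))]
        for k in range(n_new_targets):
            for j in range(len(h)):
                w_out[k][j] -= lr * g_out[k] * h[j]
        for j in range(len(h)):
            for i in range(len(features)):
                w_hidden[j][i] -= lr * g_hid[j] * features[i]
    return {"hidden": w_hidden, "output": w_out}


# Hypothetical pretrained model with three outputs; the new dataset has one.
pretrained = {"hidden": [[0.5, -0.2], [0.1, 0.3]],
              "output": [[1.0, 1.0], [0.5, 0.5], [0.2, 0.1]]}
examples = [([1.0, 0.0], [1.0]), ([0.0, 1.0], [0.0])]
model = fine_tune_one_pass(pretrained, examples, n_new_targets=1)
print(len(model["output"]), "output unit(s) after fine-tuning")
```

Starting from pretrained lower layers means the new dataset only has to adjust existing features rather than learn them from scratch, which is what makes a single training pass plausible.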
Large-scale public data informs new dataset learning
Basset https://github.com/davek44/basset Published in Advance May 3, 2016, doi: 10.1101/gr.200535.115
Summary
Deep convolutional neural networks predict DNaseI hypersensitivity far beyond previous algorithms.
ChIP-seq peaks work great, too.
Accurate models enable nucleotide-resolution genome annotation.
Such models show great promise for interpreting noncoding variants.
Acknowledgments: Jasper Snoek, John Rinn. Research was supported by NIEHS of the National Institutes of Health under K25 award ES022984.