Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. David R. Kelley


DNA codes for complex life. How? Kundaje et al. Integrative analysis of 111 reference human epigenomes. Nature, 2015.

Noncoding DNA determines gene expression [Image: Transcription Factors. Kelvin Song. CC BY 3.0]

We can't read noncoding DNA
TAGTAAAAAAACAACAAAAGACTTTTTCTGACAGTATGATTTACATAACTA
CATTTTCATTACTTTTATTTTTCTACATAACTAGTGTCTTTTCAGTTGCAA
ATGTTTGACATTCTGACAAGTAGTGTGGCAGAGTCTTAGATTATAGGTTGC
ATTTAGCCAAAAGAAAGACTTCGAATGGAATTTTTTTCTATTGACACACTT
TCTAACAACATACTTATTTTCTAAAAAGGTTTTTATAACTTAGTGTTGATA
ATATCAAAATGCTAAGCAATTTTGCTTAAAAAGCGTAGAACACCAATATTT
AATGAAGATTAATTAAATAGCACACATTGATTACTTGTTTAAAAATATTCG
GAAAAGTTTTGACACATGCTAAAGTGCTGAAGTAGGATTTTGGCCTTCCAT
AAAAATAATATATTGTGCATAAATGGATGCAGAATGAAGAAAGCAATGGGG

Reading noncoding DNA would transform variant interpretation

DNaseI hypersensitivity [Figure: locus around KLF4, scale bar 20 kb] Boyle et al. High-resolution mapping and characterization of open chromatin across the genome. Cell, 2008.

Is this sequence accessible?
TAGTAAAAAAACAACAAAAGACTTTTTCTGACAGTATGATTTACATAACTA
CATTTTCATTACTTTTATTTTTCTACATAACTAGTGTCTTTTCAGTTGCAA
ATGTTTGACATTCTGACAAGTAGTGTGGCAGAGTCTTAGATTATAGGTTGC
ATTTAGCCAAAAGAAAGACTTCGAATGGAATTTTTTTCTATTGACACACTT
TCTAACAACATACTTATTTTCTAAAAAGGTTTTTATAACTTAGTGTTGATA
ATATCAAAATGCTAAGCAATTTTGCTTAAAAAGCGTAGAACACCAATATTT
AATGAAGATTAATTAAATAGCACACATTGATTACTTGTTTAAAAATATTCG
GAAAAGTTTTGACACATGCTAAAGTGCTGAAGTAGGATTTTGGCCTTCCAT
AAAAATAATATATTGTGCATAAATGGATGCAGAATGAAGAAAGCAATGGGG

Learning DNA sequence activity
ACGTGATTACAACGT 1
ATTTAGCCAAAAGAA 1
AGACTTCGAATGGAA 0
TTTTTTTCTATTGAC 0
ACACTTTCTAACAAC 0
ATACTTATTTTCTAA 1
AAAGGTTTTTATAAC 0
TTAGTGTTGATAATA 0

Sequence representation
ACGTGATTACAACGT
4-mer counts: AACG 1, ACAA 1, ACGT 2, ATTA 1, CAAC 1, CGTG 1, GATT 1, GTGA 1, TACA 1, TGAT 1, TTAC 1
[Figure: the counts are passed to a Learning Algorithm]
With big data & big computers, can we learn better representations?
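The k-mer representation on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `count_kmers` is a hypothetical helper, not part of Basset, and it simply counts every overlapping window of length k in the example sequence.

```python
from collections import Counter


def count_kmers(seq, k=4):
    """Count every overlapping k-mer in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Example sequence taken from the slide.
counts = count_kmers("ACGTGATTACAACGT", k=4)
for kmer in sorted(counts):
    print(kmer, counts[kmer])
```

A 15-nucleotide sequence has 12 overlapping 4-mer windows; ACGT occurs at both ends, so it is the only 4-mer counted twice.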

Artificial neural networks

Convolutional neural network http://deeplearning.net/tutorial/lenet.html

Convolutional neural network Zeiler, Fergus. Visualizing and understanding convolutional networks. ECCV 2014.


DNA convolutional neural network
[Figure: matrix with rows A, C, G, T; color scale from 2 to -2]
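To make the idea of a DNA convolution concrete, here is a small pure-Python sketch. It assumes the usual setup in which the sequence is one-hot encoded into four rows (A, C, G, T) and a filter is a 4 x w weight matrix slid along the sequence. The filter values, the function names, and the rectifier (ReLU) applied to each window are illustrative assumptions, not parameters taken from the talk.

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Return a 4 x len(seq) matrix; row order follows ALPHABET."""
    return [[1.0 if base == nt else 0.0 for base in seq] for nt in ALPHABET]


def scan_filter(seq, weights):
    """Slide a 4 x w filter along seq; return one rectified score per window."""
    x = one_hot(seq)
    width = len(weights[0])
    activations = []
    for start in range(len(seq) - width + 1):
        total = sum(
            weights[row][col] * x[row][start + col]
            for row in range(4)
            for col in range(width)
        )
        activations.append(max(0.0, total))  # ReLU, assumed for illustration
    return activations


# Hypothetical filter that prefers the 3-mer "ACG".
toy_filter = [
    [2, -1, -1],   # A
    [-1, 2, -1],   # C
    [-1, -1, 2],   # G
    [-1, -1, -1],  # T
]

activations = scan_filter("ACGTGATTACAACGT", toy_filter)
print(activations)
```

The filter fires most strongly at the two windows that read ACG, which is the sense in which a convolution filter behaves like a learned motif scanner.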

Basset

Learning the accessibility code
DNaseI-seq from 164 cell types from ENCODE and Epigenomics Roadmap.
2 million sites broken into training, validation, and test sets.
[Figure axes: DNaseI hypersensitivity sites, Cells]

Convolutional nets accurately predict accessibility
[Figure: ROC curves, true positive rate vs. false positive rate] PanIslets AUC: 0.839; HUVEC AUC: 0.896; CLL AUC: 0.907; HRE AUC: 0.917; HPF AUC: 0.929
[Figure: Basset AUC vs. gkm-svm AUC; Basset mean AUC 0.900, gkm-svm mean AUC 0.780]
Ghandi, Lee et al. Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. PLoS Comp Bio, 2014.
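For readers unfamiliar with the AUC values quoted here, the following sketch computes the area under the ROC curve directly from its probabilistic definition: the chance that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as half. The labels reuse the 1/0 pattern from the "Learning DNA sequence activity" slide; the scores are made up for illustration and are not model output.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


toy_labels = [1, 1, 0, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.2, 0.8, 0.6, 0.1, 0.3]  # hypothetical predictions
auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 3))
```

This pairwise form is quadratic in the number of examples, which is fine for a toy case; at the scale of millions of sites a rank-based implementation from a standard library would be the practical choice.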

Filters recapitulate known protein binding motifs


Multiple filters capture motif variants

Filter influence reflects cell specificity

Annotate nucleotide influence
[Figure: genome browser view, 118.434 mb to 118.436 mb, with JUN ChIP-seq, DNaseI-seq, and JUND ChIP-seq tracks, per-nucleotide influence scores over the sequence below, and PhyloP conservation]
CAGCCTTTGTTAATGGGGACACAATCCTGGAAATTTTGCCTGTGTGTAAACCTCTAGGGGCTTTTTCTTTCATCGTTTTACATCAGCCAGACTCTGACTCACAGCTGGAGAATCAGCTTCCTTATTATGTAGCGAATTCCATGAACACAC
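One common way to obtain a per-nucleotide influence profile from a sequence model is in silico mutagenesis: change each position to every alternative nucleotide and record how much the prediction moves. The sketch below illustrates that idea only. The scoring function `toy_score` is a hypothetical stand-in for a trained network (it just counts occurrences of TGACTCA, a motif bound by AP-1 factors such as JUN), and the region is a short excerpt of the sequence on this slide.

```python
def mutation_map(seq, score):
    """For each position, store the score change for each alternative base."""
    reference = score(seq)
    deltas = []
    for i, ref_nt in enumerate(seq):
        row = {}
        for alt in "ACGT":
            if alt != ref_nt:
                mutated = seq[:i] + alt + seq[i + 1:]
                row[alt] = score(mutated) - reference
        deltas.append(row)
    return deltas


def toy_score(seq):
    """Hypothetical model stand-in: number of TGACTCA occurrences."""
    return float(seq.count("TGACTCA"))


region = "CAGACTCTGACTCACAGCTGG"  # excerpt of the slide's sequence
deltas = mutation_map(region, toy_score)

# Influence of a position: largest absolute change over its three mutations.
influence = [max(abs(d) for d in row.values()) for row in deltas]
print(influence)
```

With this toy scorer, only the seven motif positions carry any influence, which mirrors how a real model's profile highlights the nucleotides it relies on.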

Case study: vitiligo - rs4409785 Jin et al. Genome-wide association analyses identify 13 new susceptibility loci for generalized vitiligo. Nat Genetics, 2012.

Predictive of causal disease SNPs

CTCF ChIP-seq shows allele-specific rs4409785 binding

Can we easily add new datasets?
SNP interpretation requires relevant cell types.
1. Seed a new model with pretrained parameters.
2. Chop off the final model layer.
3. Train one pass (to avoid overfitting).
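The three steps above can be sketched in plain Python. Everything here is illustrative: the model is a dictionary with one hypothetical shared layer and a final layer, the helper names are invented for this example, and for simplicity the single training pass updates only the new final layer. The slide does not say whether the shared layers are frozen, so treat that as an assumption.

```python
import copy
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def seed_new_model(pretrained, n_new_targets, rng):
    """Steps 1 and 2: copy pretrained layers, then replace the final layer."""
    model = copy.deepcopy(pretrained)
    n_hidden = len(model["hidden"])
    model["final"] = [
        [rng.uniform(-0.01, 0.01) for _ in range(n_hidden)]
        for _ in range(n_new_targets)
    ]
    return model


def hidden_features(model, x):
    """Hypothetical shared layer: one rectified linear layer."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in model["hidden"]]


def train_one_pass(model, data, lr=0.1):
    """Step 3: a single pass of SGD (final layer only in this sketch)."""
    for x, targets in data:
        h = hidden_features(model, x)
        for t, y in enumerate(targets):
            p = sigmoid(sum(w * hj for w, hj in zip(model["final"][t], h)))
            for j, hj in enumerate(h):
                model["final"][t][j] -= lr * (p - y) * hj
    return model


rng = random.Random(0)
pretrained = {
    "hidden": [[0.5, -0.2], [-0.3, 0.8]],
    "final": [[0.1, 0.2], [0.3, -0.1], [0.0, 0.4]],  # three original targets
}
new_model = seed_new_model(pretrained, n_new_targets=1, rng=rng)
new_data = [([1.0, 0.0], [1]), ([0.0, 1.0], [0])]  # toy inputs and labels
train_one_pass(new_model, new_data)
```

Copying with `copy.deepcopy` keeps the pretrained parameters untouched, so the same starting point can seed several new datasets.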

Large-scale public data informs new dataset learning

Basset https://github.com/davek44/basset Published in Advance May 3, 2016, doi: 10.1101/gr.200535.115

Summary
Deep convolutional neural networks predict DNaseI hypersensitivity far beyond previous algorithms.
ChIP-seq peaks work great, too.
Accurate models enable nucleotide-resolution genome annotation.
Such models show great promise for interpreting noncoding variants.

Acknowledgments
Jasper Snoek, John Rinn
Research was supported by NIEHS of the National Institutes of Health under K25 award ES022984.