CS284A Representations & Algorithms for Molecular Biology. Xiaohui S. Xie University of California, Irvine

Samankaltaiset tiedostot
State of the Union... Functional Genomics Research Stream. Molecular Biology. Genomics. Computational Biology

Genome 373: Genomic Informatics. Professors Elhanan Borenstein and Jay Shendure

Manolis Kellis Piotr Indyk

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

Efficiency change over time

Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. David R. Kelley

MALE ADULT FIBROBLAST LINE (82-6hTERT)

Uusi Ajatus Löytyy Luonnosta 4 (käsikirja) (Finnish Edition)

tgg agg Supplementary Figure S1.

Functional Genomics & Proteomics

Master's Programme in Life Science Technologies (LifeTech) Prof. Juho Rousu Director of the Life Science Technologies programme 3.1.

Capacity Utilization

Bioinformatics in Laboratory of Computer and Information Science

Windows Phone. Module Descriptions. Opiframe Oy puh Espoo

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

NBE-E4510 Special Assignment in Biophysics and Biomedical Engineering AND NBE-E4500 Special Assignment in Human. NBE-E4225 Cognitive Neuroscience

Information on preparing Presentation

Viral DNA as a model for coil to globule transition

Information on Finnish Courses Autumn Semester 2017 Jenni Laine & Päivi Paukku Centre for Language and Communication Studies

Other approaches to restrict multipliers

Constructive Alignment in Specialisation Studies in Industrial Pharmacy in Finland

FETAL FIBROBLASTS, PASSAGE 10

arxiv: v1 [stat.ml] 27 Feb 2011

Enterprise Architecture TJTSE Yrityksen kokonaisarkkitehtuuri

Experimental Identification and Computational Characterization of a Novel. Extracellular Metalloproteinase Produced by Clostridium sordellii

Bioinformatics. Sequence Analysis: Part III. Pattern Searching and Gene Finding. Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute

Co-Design Yhteissuunnittelu

Eukaryotic Comparative Genomics

Information on Finnish Language Courses Spring Semester 2017 Jenni Laine

Information on Finnish Language Courses Spring Semester 2018 Päivi Paukku & Jenni Laine Centre for Language and Communication Studies

Alternative DEA Models

Nuku hyvin, pieni susi -????????????,?????????????????. Kaksikielinen satukirja (suomi - venäjä) ( (Finnish Edition)

Returns to Scale II. S ysteemianalyysin. Laboratorio. Esitelmä 8 Timo Salminen. Teknillinen korkeakoulu

The CCR Model and Production Correspondence

Supporting information


Results on the new polydrug use questions in the Finnish TDI data

Chapter 7. Motif finding (week 11) Chapter 8. Sequence binning (week 11)

toukokuu 2011: Lukion kokeiden kehittämistyöryhmien suunnittelukokous

7.4 Variability management

7. Product-line architectures

Supplementary information: Biocatalysis on the surface of Escherichia coli: melanin pigmentation of the cell. exterior

make and make and make ThinkMath 2017

Plasmid Name: pmm290. Aliases: none known. Length: bp. Constructed by: Mike Moser/Cristina Swanson. Last updated: 17 August 2009

Keskeisiä näkökulmia RCE-verkoston rakentamisessa Central viewpoints to consider when constructing RCE

BIOENV Laitoskokous Departmental Meeting

3 9-VUOTIAIDEN LASTEN SUORIUTUMINEN BOSTONIN NIMENTÄTESTISTÄ

KONEISTUSKOKOONPANON TEKEMINEN NX10-YMPÄRISTÖSSÄ

Supporting Information for

Ohjelmien kehittämisstudiot varmistavat laadukkaat ja linjakkaat maisteriohjelmat Maire Syrjäkari ja Riikka Rissanen

16. Allocation Models

Innovative and responsible public procurement Urban Agenda kumppanuusryhmä. public-procurement

Network to Get Work. Tehtäviä opiskelijoille Assignments for students.

Infrastruktuurin asemoituminen kansalliseen ja kansainväliseen kenttään Outi Ala-Honkola Tiedeasiantuntija

Paikkatiedon semanttinen mallinnus, integrointi ja julkaiseminen Case Suomalainen ajallinen paikkaontologia SAPO

Bounds on non-surjective cellular automata

Voice Over LTE (VoLTE) By Miikka Poikselkä;Harri Holma;Jukka Hongisto

Security server v6 installation requirements

A new model of regional development work in habilitation of children - Good habilitation in functional networks

ECVETin soveltuvuus suomalaisiin tutkinnon perusteisiin. Case:Yrittäjyyskurssi matkailualan opiskelijoille englantilaisen opettajan toteuttamana

Miksi Suomi on Suomi (Finnish Edition)

Oma sininen meresi (Finnish Edition)

Collaborative & Co-Creative Design in the Semogen -projects

EUROOPAN PARLAMENTTI

1.3 Lohkorakenne muodostetaan käyttämällä a) puolipistettä b) aaltosulkeita c) BEGIN ja END lausekkeita d) sisennystä

Choose Finland-Helsinki Valitse Finland-Helsinki

Lausuntopyyntöluettelo HUOM. Komiteoiden ja seurantaryhmien kokoonpanot on esitetty SESKOn komitealuettelossa

FinFamily PostgreSQL installation ( ) FinFamily PostgreSQL

VAASAN YLIOPISTO Humanististen tieteiden kandidaatin tutkinto / Filosofian maisterin tutkinto

1. SIT. The handler and dog stop with the dog sitting at heel. When the dog is sitting, the handler cues the dog to heel forward.

Jussi Klemola 3D- KEITTIÖSUUNNITTELUOHJELMAN KÄYTTÖÖNOTTO

Statistical design. Tuomas Selander

Large Scale Sequence Analysis with Applications to Genomics

Uusi Ajatus Löytyy Luonnosta 3 (Finnish Edition)

TU-C2030 Operations Management Project. Introduction lecture November 2nd, 2016 Lotta Lundell, Rinna Toikka, Timo Seppälä

National Building Code of Finland, Part D1, Building Water Supply and Sewerage Systems, Regulations and guidelines 2007

VIIKKI BIOCENTER University of Helsinki

Kurssin koodi ja nimi Ryhmä Päivä Aika Sali Viikot Henkilöt Course code and name Group Day Time Lecture Weeks Course staff

TIETOJENKÄSITTELYTIEDE

Curriculum. Gym card

Python Libraries 1 / 14

Additions, deletions and changes to courses for the academic year Mitä vanhoja kursseja uusi korvaa / kommentit

Security server v6 installation requirements

Increase of opioid use in Finland when is there enough key indicator data to state a trend?

Kysymys 5 Compared to the workload, the number of credits awarded was (1 credits equals 27 working hours): (4)

Page 1 of 9. Ryhmä/group: L = luento, lecture H = harjoitus, exercises A, ATK = atk-harjoitukset, computer exercises

Sisällysluettelo Table of contents

OP1. PreDP StudyPlan

Oct. 21, h2p://

anna minun kertoa let me tell you

Asiantuntijoiden osaamisen kehittäminen ja sen arviointi. Anne Sundelin Capgemini Finland Oy

Use of spatial data in the new production environment and in a data warehouse

AYYE 9/ HOUSING POLICY

Data Quality Master Data Management

PYÖRÄILY OSANA HELSINGIN SEUDUN KESTÄVÄÄ KAUPUNKILIIKENNETTÄ

Business Opening. Arvoisa Herra Presidentti Very formal, recipient has a special title that must be used in place of their name

TIEKE Verkottaja Service Tools for electronic data interchange utilizers. Heikki Laaksamo

AJATUKSIA KÄSITYÖTIETEEN ONTOLOGIASTA

Stormwater filtration unit

Transkriptio:

CS284A Representations & Algorithms for Molecular Biology Xiaohui S. Xie University of California, Irvine

Today s Goals Course information Challenges in computational biology Introduction to molecular biology

Course Information Lecture: TT 3:30-4:50pm in PSCB 220 Grading 30% Homework 20% Scribe note 50% Final project Exams no final exams Course Prerequisites: Programming skill (Perl/Python, Matlab/R) Statistics, Calculus, basic knowledge of Biology

Course Goals Introduction to computational biology Fundamental problems in computational biology Statistical, algorithmic and machine learning techniques Directions for future research in the field Final project: Propose an innovative project Design novel or implement previous algorithms to carry out the project Write-up goals, approach and findings in a conference format Present your project to your peers in a conference setting

References Recommended Textbooks: R. Durbin, S. Eddy, A. Krogh and G. Mitchison. Biological Sequence Analysis P. Baldi and S. Brunak. Bioinformatics: the Machine Learning Approach Course Website: http://www.ics.uci.edu/~xhx/courses/cs284a/

Why computational biology? Computational biology/bioinformatics is the application of computational tools and techniques to biology (mostly molecular biology). Lots of data Pattern finding, rule discovery Allowing analytic and predictive methodologies that support and enhance lab work Informatics infrastructure (data storage, retrieval) Data visualization Life itself is a computer!

Biology What s the problem? Algorithm Four Aspects How to solve the problem efficiently? Learning How to model biology systems and learn from observed data? Statistics How to differentiate true phenomena from artifacts?

Topics to be covered DNA/RNA/Protein sequence analysis Pattern finding (motif discovery, EM-algorithm) Sequence alignment (Smith-Waterman, BLAST) Models of sequences (HMM) Gene discovery (HMM) RNA folding (Stochastic context-free grammar SCFG) Algorithms for large-scale data analysis Clustering algorithms (Hierarchical clustering, K-means) Inference of networks (Regression, Bayesian networks) Systems biology (Model and simulation) Evolutionary models Phylogenetic trees Comparative Genomics Protein world (if time allows) Secondary & tertiary structure prediction

Introduction to Molecular Biology and Genomics

Different Life Forms Share a Common Genetic Framework

Deoxyribonucleic acid (DNA) can be thought of as the blueprint for an organism composed of small molecules called nucleotides four different nucleotides distinguished by the four bases: adenine (A), cytosine (C), guanine (G) and thymine (T) is a polymer: large molecule consisting of similar units (nucleotides in this case) DNA is digital information a single strand of DNA can be thought of as a string composed of the four letters: A, C, G, T AGCGGTTAAGGCTGATATGCGCTTTAA TCGCCAATTCCGACTATACGCGAAATT

The Double Helix DNA molecules usually consist of two strands arranged in the famous double helix

Genomes The term genome refers to the complete complement of DNA for a given species The human genome consists of 46 chromosomes Male: 22 pairs of autosomes + XY Female: 22 pairs of autosomes + XX Every cell (except sex cells and mature red blood cells) contains the complete genome of an organism

Human Genome (Male) 22 pairs of autosomes + sex chromosomes (XY)

Human Genome (Female) 22 pairs of autosomes + sex chromosomes (XX)

Human Chromosomes Karyogram

Chromosomes DNA is packaged into individual chromosomes (along with proteins) prokaryotes (single-celled organisms lacking nuclei) have a single circular chromosome eukaryotes (organisms with nuclei) have a species-specific number of linear chromosomes DNA + associated chromosomal proteins = chromatin

Proteins Proteins are molecules composed of one or more polypeptides A polypeptide is a polymer composed of amino acids Cells build their proteins from 20 different amino acids A polypeptide can be thought of as a string composed from a 20-character alphabet

Protein Functions structural support storage of amino acids transport of other substances coordination of an organism s activities response of cell to chemical stimuli movement protection against disease selective acceleration of chemical reactions

Amino Acids

Amino Acid Sequence of Hexokinase

Protein Structure Proteins are poly-peptides of 70-3000 amino-acids This structure is (mostly) determined by the sequence of amino-acids that make up the protein

Genes Genes are the basic units of heredity A gene is a sequence of bases that carries the information required for constructing a particular protein (polypeptide really) Such a gene is said to encode a protein The human genome comprises ~22,000 genes Those genes encode >100,000 polypeptides RNA genes: micrornas and other small RNAs

The Central Dogma

Genetic code: DNA -> mrna -> protein

RNA is like DNA except: RNA backbone is a little different usually single stranded the base uracil (U) is used in place of thymine (T) A strand of RNA can be thought of as a string composed of the four letters: A, C, G, U

The Genetic Code 64 combinations: 20 amino acids + stop codon

Translation Ribosomes are the machines that synthesize proteins from mrna The grouping of codons is called the reading frame Translation begins with the start codon Translation ends with the stop codon

Codons and Reading Frames

Comparison of genome size Organisms Genomes Haemophilus Methannococcus Saccharomyces Caenorhabditis Drosophila Mus Homo influenzae jannaschii cerevisiae elegans Melanogaster musculus sapiens (baker s yeast) (nematode worm) (fruit fly) (laboratory mouse) (man) Genome (MB) Number of genes 1.83 1.66 13 97 180 3200 3500 1709 1682 6241 18,424 13,500 ~30,000 ~30,000

AGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATA CATGCATGCTTCAATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATG TTTTCAATGTAAGAGATTTCGATTATCCTTATGATTGTATGATAATGTTTTCTCCTTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAA TTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTGAGATTTCGATTATCCTTATAGTTCATAC ATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTACTTAATAAATGATTGTATGATAATGTTTTCA ATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATA GTTCATACATGCATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCTCCTTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGAT AATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATC CTTATAGTTCATACATGCATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACAT GCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAAT GTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATATGCTTCAACTACTTAATAAATGAT CAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGAATTTC GATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACACATGCTTCAA CTACTTAATAAATGCAGATGCTGTTGGACTTCATGTCCCCAACCTAGCTTGGTGCACAGCATTTATTGTATGAAGAGATTTCGATTATCCTTATAGTTCATACATGCAT AGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAAT GATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCATGCTTCAACTACTTAATAAATGATT GTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGTATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTA ATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCT TTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCC TCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATT GGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGA AGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCA TAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCT GGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGA TTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAG ACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCATTCAGGTTGGTACGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTT GTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCC CTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAG TCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTAT AGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATA TGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGAAAATTTGC GAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAG ATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTA TTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAG GGAATATGCAGGAGAACGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAA ATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA CAGCTCATTCTGGAAGAAATAGTGTTTCTTGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGT TTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATG GCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGA CGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGA TTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCT TGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCC TTATAGTTCATACATGCTTCAACTACTTAATAATGCACTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAAATAAAGCTTCAACTACTTAATAAAT GATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTGTATGATTTATA GTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGATTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGA TTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTCATACATGCTTCAACTACTGTAAATAATTAATAAATGATTGTATGATAATGTTTTCAATGTAAG AGATTTCGATTATCCTTATAGTTCATACATGCATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATTAT AGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAAATAAAATGTAAGAGATTTCGATTATCCTTATAGTTCATACEEEEEEECATGCGTTG ACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTTGATTG TATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTGTATGATTTATATCG

Readout from the genome