CSC user accounts - also for students

Apply for an account at http://www.csc.fi/asiakkaaksi/korkeakoulut/kayttolupahakemukset/index_html
Enter Petri Törönen as the supervisor.
As the justification, give studies and the name of the lecture course (Geneettisen Bioinformatiikan luennot).
Also request access to Chipster (tick the box) if you are going to attend the Bioinfon työt course!

Sequence searches and comparisons: BLAST 2014

Sequence databases
Many different types of databases:
  Nucleotide data banks: EMBL, GenBank, DDBJ
  EST databases (dbEST: http://www.ncbi.nlm.nih.gov/dbest/ )
  Genome builds with browsers: ENSEMBL (www.ensembl.org), UCSC (http://www.genome.ucsc.edu/cgibin/hggateway)
  Protein databases (UniProt, etc.)

Searching databases
Different types of queries:
  Find DNA sequences that are homologous to your sequence (share the same evolutionary origin)
  Find gene families across species
  Align an mRNA sequence to a genome assembly (which gene does my mRNA come from?)
  Does my RNA sequence resemble any protein?
  Design primers for PCR
  Search a database for a specific motif

Database search
The most common application in bioinformatics: searching a sequence database for matches to a query sequence.
Proteins are frequently composed of functional domains that recur in many different proteins. These parts are the most likely to be conserved, so look for shared patterns.
Two families of programs:
  FASTA (older and seldom used today, but may offer advantages that BLAST lacks)
  BLAST

Database search tools: BLAST vs. other tools
Speed versus sensitivity, from the fastest tools to the most sensitive:
  Fast tools: BLAT, MEGABLAST
  Standard BLAST
  Sensitive tools: SSEARCH, PSI-BLAST

What is BLAST?
BLAST (Basic Local Alignment Search Tool) finds regions of local similarity between sequences.
The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of the matches.
It is a set of several computer programs (blastn, blastx, etc.).
It is optimized for finding local alignments between two sequences.
It requires the user to choose some parameters.
http://blast.ncbi.nlm.nih.gov/blast.cgi

BLAST
Since DNA databases can be very large, searching for the optimal alignment against every sequence would take too long. BLAST therefore uses a heuristic algorithm.
Heuristic algorithms find a match reasonably close to the optimal one in much less time than full dynamic programming.
The alignment found can then be verified or refined separately using slower but more accurate dynamic programming.
Understanding how BLAST works helps in understanding when it fails to find hits!
http://en.wikipedia.org/wiki/blast

How does BLAST work?
All BLAST programs work more or less similarly. Computationally, a BLAST search consists of three phases:
  Seeding
  Extension
  Evaluation

How BLAST works
The query sequence is divided into subsequences (words) of a given length. The word size is typically 3 for proteins and 11 for nucleotides.
These words are used to look for exact or nearly exact matches in the sequence database. This step is fast, i.e. computationally inexpensive.
When a match is found, it is extended further.
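The word-splitting step can be illustrated with a minimal Python sketch. The function names `split_into_words` and `find_exact_hits` are invented for this illustration and are not part of BLAST; real implementations use indexed lookup tables rather than the simple scan shown here.

```python
def split_into_words(query: str, word_size: int) -> list[str]:
    """Return all overlapping words of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def find_exact_hits(query: str, subject: str, word_size: int) -> list[tuple[int, int]]:
    """Return (query_position, subject_position) pairs of identical words."""
    hits = []
    for q_pos, word in enumerate(split_into_words(query, word_size)):
        s_pos = subject.find(word)
        while s_pos != -1:
            hits.append((q_pos, s_pos))
            s_pos = subject.find(word, s_pos + 1)
    return hits


words = split_into_words("KRISTIAN", 3)
print(words)  # ['KRI', 'RIS', 'IST', 'STI', 'TIA', 'IAN']
print(find_exact_hits("KRISTIAN", "RISTISANA", 3))  # [(1, 0), (2, 1), (3, 2)]
```

The three hits lie on the same diagonal (query position minus subject position is always 1), which is what makes them a good starting point for extension.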

Seeding
[Figure: the query KRISTIAN is split into overlapping words of size W=3, which are compared against the database (the search space) to find word hits. Word hits lead to an alignment and then to a gapped alignment. Remember the dot plot!]

Threshold in seeding
Word hit:
  In blastn, a hit is two identical words, one in the database and the other in the query sequence.
  In protein-related searches, a hit is a word from the neighborhood of the query word.
The neighborhood of a word contains the word itself and all other words whose score against it is at least T (the threshold) when compared using the scoring matrix.
For example, if T=13, word=PQG and matrix=BLOSUM62, only words scoring at least 13 count as hits: PQG-PEG (15) is accepted, but PQG-PQA (12) is not.
Setting T higher removes more word hits, which makes BLAST run faster but increases the chance of missing an interesting alignment.
Setting W (the word size) higher decreases sensitivity (the chance of finding the alignment) but increases the speed of the search.
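The PQG example can be checked with a small Python sketch. It hard-codes only the handful of BLOSUM62 entries needed for these two comparisons; the helper names are assumptions made for illustration, and a real search would use the full matrix.

```python
# Only the BLOSUM62 entries needed for this example (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("Q", "E"): 2,
    ("G", "A"): 0,
}


def pair_score(a: str, b: str) -> int:
    """Look up a residue pair in either orientation."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(word1: str, word2: str) -> int:
    """Sum the substitution scores position by position."""
    return sum(pair_score(a, b) for a, b in zip(word1, word2))


def is_word_hit(query_word: str, db_word: str, threshold: int) -> bool:
    """A database word is a hit if it scores at least the threshold T."""
    return word_score(query_word, db_word) >= threshold


print(word_score("PQG", "PEG"), is_word_hit("PQG", "PEG", 13))  # 15 True
print(word_score("PQG", "PQA"), is_word_hit("PQG", "PQA", 13))  # 12 False
```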

Extension
Word hits found during seeding are extended from their ends.
Extension stops when the alignment score drops or, in newer implementations, when the alignment score has dropped far enough (the drop-off score) below its previous maximum.
[Figure: alignment, word hit, extension]

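Extension, example
KRISTIAN aligned with -RISTISANA; gap score = 0, X = 2, matrix BLOSUM62

Query:            K   R   I   S   T   I   A   N   -   -
Database:         -   R   I   S   T   I   S   A   N   A
BLOSUM62 value:   0   5   4   4   5   4   1  -2   0   0
Score:            0   5   9  13  18  22  23  21  21  21
Drop-off score:   0   0   0   0   0   0   0   2

The drop-off score is the difference between the highest score reached so far and the current score. In this example, extension terminates when the drop-off score reaches X.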
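The extension example can be replayed with a short Python sketch. It takes the per-column BLOSUM62 values of the KRISTIAN example as input and stops once the running score has fallen X below its best value. The function name and the exact stopping rule (drop-off greater than or equal to X) are assumptions chosen to match this example; actual BLAST implementations differ in detail.

```python
def extend_with_drop_off(column_scores: list[int], x: int) -> tuple[int, int]:
    """Extend column by column; return (best score, number of columns kept)."""
    best_score = 0
    best_length = 0
    running = 0
    for length, value in enumerate(column_scores, start=1):
        running += value
        if running > best_score:
            best_score, best_length = running, length
        elif best_score - running >= x:
            # The score has dropped X below its maximum: stop extending.
            break
    return best_score, best_length


# Per-column values for KRISTIAN vs. -RISTISANA (gap score 0, BLOSUM62).
values = [0, 5, 4, 4, 5, 4, 1, -2, 0, 0]
print(extend_with_drop_off(values, x=2))  # (23, 7)
```

The extension is trimmed back to the column where the maximum score of 23 was reached, so the final N/A mismatch is not part of the reported segment.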

Evaluation
Once the extension stage has produced the alignments, they are evaluated to determine whether they are statistically significant.
Statistical significance is determined using Karlin-Altschul statistics (the E-score).
Some simplifying assumptions are made (such as infinitely long sequences and no gaps), but in practice Karlin-Altschul statistics generalize well.

E-score
The lower the E-score, the more significant the alignment.
The E-score depends on both the database size and the scoring system (substitution matrix, gap penalties). If these are changed, the E-score for a specific alignment will also change.

Karlin-Altschul statistics: the E-value
$E = K m n e^{-\lambda S}$
  E = number of alignments reaching score S just by chance
  K = a small constant
  m = the length of the query sequence
  n = the size of the database (DB)
  e = Euler's number, approximately 2.718
  \lambda S = normalized alignment score (S is the score, \lambda is a normalization factor)
The E-value estimates the number of equally good (or better) hits expected from the DB at random.

Karlin-Altschul, example
When two equally long (250 residues) amino acid sequences are aligned using the PAM250 matrix, how many alignments with a score of 75 are expected by chance? With K = 0.1 and \lambda = 0.229:
$E = K m n e^{-\lambda S} = 0.1 \cdot 250 \cdot 250 \cdot e^{-(0.229 \cdot 75)} \approx 0.000217$
http://www.ncbi.nlm.nih.gov/blast/tutorial/altschul-1.html
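The same arithmetic can be reproduced in a few lines of Python. This is only a sketch of the formula as given here; the function name `expected_hits` is made up for illustration, and the values of K and lambda are the ones used in the example, not general constants.

```python
import math


def expected_hits(k: float, m: int, n: int, lam: float, score: float) -> float:
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


e_value = expected_hits(k=0.1, m=250, n=250, lam=0.229, score=75)
print(f"{e_value:.6f}")  # 0.000217

# The E-value scales linearly with the size of the search space:
# doubling n doubles E for the same alignment score.
doubled = expected_hits(k=0.1, m=250, n=500, lam=0.229, score=75)
print(doubled / e_value)
```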

Filtering out repeats
The human genome (like most others) contains large amounts of repetitive DNA (LINE, SINE, Alu, etc.).
If the query sequence contains repeats, many of the homologies identified will be to other sequences containing the same repeats.
In most cases repeats should be masked out.
Masked nucleotides are usually represented with N, e.g. AATAGNNNNCGC; masked amino acids are represented with X.

Disadvantages of BLAST
When the expected sequence similarity drops below 80%, nucleotide-nucleotide BLAST no longer performs well.
Many significant homologies are missed because of the initial word size requirement.
If the initial words are allowed to be discontinuous, matching improves.

Discontinuous initial words
For instance, require 11 positions out of 21 consecutive nucleotides to match.
Description of BLAST Services: http://www.ncbi.nlm.nih.gov/blast/html/blasthomehelp.html
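A discontinuous word can be sketched in Python as a template of "care" (1) and "don't care" (0) positions. The 21-position template below, with 11 positions marked 1, is a hypothetical pattern chosen only for illustration; it is not claimed to be the template any BLAST program actually uses.

```python
# Hypothetical template: 21 positions, 11 of which (marked 1) must match.
TEMPLATE = "101101100101100100101"


def spaced_word(seq: str, start: int, template: str = TEMPLATE) -> str:
    """Extract only the bases at the positions the template cares about."""
    window = seq[start:start + len(template)]
    return "".join(base for base, flag in zip(window, template) if flag == "1")


def seeds_match(seq_a: str, seq_b: str) -> bool:
    """Compare the discontinuous words at the start of two sequences."""
    return spaced_word(seq_a, 0) == spaced_word(seq_b, 0)


original = "ACGTACGTACGTACGTACGTA"
# Two substitutions at "don't care" positions (indices 1 and 4).
diverged = "ATGTGCGTACGTACGTACGTA"

print(seeds_match(original, diverged))   # True: the spaced seed still matches
print(original[:11] == diverged[:11])    # False: a contiguous 11-mer does not
```

The example shows why the discontinuous word tolerates divergence: substitutions that fall on ignored positions do not destroy the seed, whereas they break a contiguous word of the same weight.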

Different varieties of BLAST
BLASTN: DNA query against a database of DNA sequences (blastn).
BLASTP: Protein query against protein sequences (blastp).
BLASTX: DNA query translated in six reading frames against a protein database (blastx).
TBLASTX: DNA query against a DNA database, with both translated into protein (tblastx).

Blastn and Megablast
Typically used for identifying your sequence.
Megablast is a fast alternative for finding nearly exact matches.
Blastn is better at finding somewhat diverged sequences (e.g. from a related species).
Blastn is more sensitive but slower than megablast.

Blastx and tblastx
Blastx translates the query sequence in all reading frames and compares it to a protein database. Aggregate statistics are provided for all reading frames.
Tblastx queries a translated DNA sequence against a database of translated DNA sequences. It also produces aggregate statistics for all reading frames.
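Translating in all reading frames can be sketched in Python as three forward frames plus three frames of the reverse complement. The sketch assumes the standard genetic code and plain A/C/G/T input; the function names are invented for illustration, and this is not how blastx is implemented internally.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become X, stops become *."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict[str, str]:
    """Translate the three forward and three reverse reading frames."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for name, protein in six_frame_translation("ATGAAATAG").items():
    print(name, protein)
```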

BLAST programs

Query | Database | Program | Typical uses
DNA | DNA | blastn | Annotation, mapping oligonucleotides to a genome
protein | protein | blastp | Identifying common regions in proteins
translated DNA | protein | blastx | Finding protein-coding genes in genomic DNA
protein | translated DNA | tblastn | Identifying transcripts, possibly from multiple organisms
translated DNA | translated DNA | tblastx | Cross-species gene prediction, searching for genes not yet in protein databases
DNA | DNA | megablast | Large and closely related sequences

Extensions to BLAST
Primer-BLAST: finds primers specific to your PCR template (http://www.ncbi.nlm.nih.gov/tools/primerblast/index.cgi?link_loc=blasthome)
PSI-BLAST: a protein sequence search method in which:
  The best matches are aligned with the query sequence
  The alignment is used to create a profile that emphasises conserved regions
  The search is repeated with the profile
  The whole cycle is iterated
CS-BLAST: a BLAST version that uses information about the neighboring amino acids to estimate substitutions

BLAST from the command line
Why use the command line?
  Runs with 100 to 10 000 query sequences
  Runs against a specific database (say, all sequences from human, chimp and gorilla)
Applications:
  BLAST all human genes vs. all mouse genes
  Run BLAST between all sequences in an analyzed set
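A batch run of this kind could be scripted roughly as follows. The sketch only assembles a command line for the NCBI BLAST+ programs using their `-query`, `-db`, `-evalue`, `-outfmt` and `-out` options; the file and database names are placeholders, and the actual call is left commented out because it assumes BLAST+ is installed and a database has already been built with makeblastdb.

```python
import shlex
import subprocess


def build_blast_command(program: str, query_fasta: str, database: str,
                        output_path: str, evalue: float = 1e-5) -> list[str]:
    """Assemble a BLAST+ command line producing tabular output (format 6)."""
    return [
        program,
        "-query", query_fasta,    # FASTA file; may hold thousands of sequences
        "-db", database,          # name of a pre-built BLAST database
        "-evalue", str(evalue),   # E-value cutoff for reported hits
        "-outfmt", "6",           # tab-separated hit table
        "-out", output_path,
    ]


# Placeholder names for an "all human proteins vs. all mouse proteins" run.
cmd = build_blast_command("blastp", "human_proteins.fasta",
                          "mouse_proteins", "human_vs_mouse.tsv")
print(shlex.join(cmd))

# Uncomment to run the search (requires BLAST+ and the database):
# subprocess.run(cmd, check=True)
```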

...how do you choose the BLAST version best suited to your purpose?!
Help with choosing a program: http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#pstab

Choosing a scoring matrix
PAM matrices do not necessarily work very well in database searches, so the BLOSUM series should be preferred where possible.
BLOSUM62 for the BLAST programs. Gap penalties: -8/-2.
For DNA, the most commonly used matrix has match = 5 and mismatch = -4.
Based on the original slides by J. Tuimala

Decision chart for finding homologous sequences
Does the sequence code for a protein, or can it be translated into a protein?
  Yes: search protein databases or a translated DNA sequence database.
  No: search DNA sequence databases.
Is the search result meaningful and statistically significant?
  Yes: repeat the database search using the sequences found as query sequences.
  No: first adjust the BLAST parameters, then look for motifs and blocks.
Make a multiple sequence alignment of the sequences obtained and construct a possible phylogenetic tree.
Chart after J. Tuimala, Bioinformatiikan perusteet

Low-complexity filtering
Low-complexity regions can be removed from the sequences before running the search.
Low-complexity regions consist of, e.g., several consecutive copies of the same nucleotide or amino acid.
Low-complexity regions are spotted using a sliding window method (complexity is calculated repeatedly inside every window). In BLAST, the window size is 12 nucleotides or amino acids.
Human repeats, such as SINE and LINE regions, can also be masked.
In the BLAST report, the masked (filtered) areas are marked with Ns (DNA) or Xs (amino acids).
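The sliding-window idea can be sketched in Python. Note the assumptions: Shannon entropy is used here as a simple stand-in for the complexity measure, and the cutoff of 1.0 bit is an arbitrary illustrative value. BLAST's own filters use different, more refined scoring, so this shows the principle only.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy in bits of the symbol composition of a window."""
    total = len(window)
    counts = Counter(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12,
                        min_entropy: float = 1.0, mask_char: str = "N") -> str:
    """Mask every window whose entropy falls below min_entropy."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


sequence = "ACGTACGTAGCT" + "A" * 14 + "TGCATGCAGTCA"
print(mask_low_complexity(sequence))
```

Because every window that overlaps the poly-A run mostly consists of A, a few flanking bases are masked as well; for amino acid sequences, `mask_char="X"` would follow the convention described above.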

Let's try it:
tttcggctca ctactaggag catgcctaat tacccagatc ctaacagggt tatttctagc catacattat acacctgaca caataactgc cattttcatc tatatcccat atctgccgag atgtcaacta cggttgaatt attcgacaac tacactcaaa cggagcatct attttcttcc

Practical tips for searches
Work with the same care as you would in laboratory experiments!
Always start with the default search parameters (gap penalties, scoring matrices). If the results are unsatisfactory, adjust the word size, the scoring matrix and/or possibly the E-value cutoff in a more suitable direction.
General-purpose scoring matrices for amino acids: BLOSUM62 (with gap penalties -8 and -2) and BLOSUM50 (-12/-2 or -14/-2).
Restrict the search to the database (and/or database division) you are interested in; this can speed up your search substantially!
For example, if you do not want many different forms of the same sequence among the results, and you know which organism's sequence matches you are after, run your search with Genomic BLAST! (A BLAST that searches the gene bank, i.e. the nucleotide database, directly returns ALL matching sequences, even if they are only different versions of the same genomic sequence.)

Practical tips:
The search servers are most heavily loaded in the middle of the working day, 10:00-16:00 local time.
If your sequence codes for a protein, use the amino acid sequence rather than the DNA for comparisons. Organisms differ in, among other things, codon usage, which can cause problems in database searches!
Remove low-complexity (simple and repetitive) regions by filtering (available as an option in BLAST). This reduces the number of biologically irrelevant similarities found.
Very short sequences, about 20 bp: the default search parameters of basic BLAST do not work well for these! So decrease the word size and increase the E-value.
When studying the genome specificity of PCR primers, use the new Primer-BLAST!
Short amino acid sequences: for closely related sequences, use a suitable scoring matrix, e.g. PAM30, BLOSUM80 or BLOSUM90. For distant relatives: PAM250, BLOSUM62.

Interpreting the results
Hits to ESTs and hypothetical proteins (especially very short ones): treat these with caution!
Poor hits are easy to recognize from the large number of gaps in the alignment (in that case, raise the values of your gap penalty parameters!).

Example sequence 2: BLAST it!
gactgtgagc aaagctttag taccaggcaa cattttcaaa tcaagtggac ttacagatgg tattgcttat gagttccggg tgattgcaga aaacatggca ggcaaaagta agccaagcaa gccatcagaa cctatgttgg ctctggatcc cattgaccca cctggaaaac cagtacctct aaatattaca agacacacag taacacttaa atgggctaag cctgaatata ctgggggctt