CSC user accounts, also for students: http://www.csc.fi/asiakkaaksi/korkeakoulut/kayttolupahakemukset/index_html As supervisor, enter Petri Törönen. As justification, give your studies and the name of the lecture course (Geneettisen Bioinformatiikan luennot). Also request access to Chipster (a checkbox) if you are coming to the Bioinfon työt course!
Sequence searches and comparisons: BLAST 2014
Sequence databases
Many different types of databases:
- Nucleotide data banks: EMBL, GenBank, DDBJ
- EST databases (dbEST: http://www.ncbi.nlm.nih.gov/dbest/ )
- Genome builds with browsers: ENSEMBL (www.ensembl.org), UCSC (http://www.genome.ucsc.edu/cgibin/hggateway)
- Protein databases (UniProt, etc.)
Searching databases
Different types of queries:
- Find DNA sequences that are homologous to your sequence (share the same evolutionary origin)
- Find gene families across species
- Align an mRNA sequence to a genome assembly (which gene does my mRNA come from?)
- Does my RNA sequence resemble any protein?
- Design primers for PCR
- Search a database for a specific motif
Database search
- The most common application in bioinformatics
- Used for searching a sequence database for matches to the query sequence
- Proteins are frequently composed of functional domains that recur in many different proteins
- These parts are the most likely to be conserved -> look for shared patterns
- Two computer program families: FastA (older and seldom used anymore, BUT may offer advantages that BLAST does not have) and BLAST
Database search tools: BLAST vs. other tools. SPEED versus SENSITIVITY. Fast tools: BLAT, MEGABLAST. In between: standard BLAST. Sensitive tools: SSEARCH, PSI-BLAST.
What is BLAST? BLAST (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. It is a set of several computer programs (blastn, blastx, etc.), optimized for finding local alignments between two sequences, and it requires the user to choose some parameters. http://blast.ncbi.nlm.nih.gov/blast.cgi
BLAST Since DNA databases can be very large, searching for the optimal alignment to all sequences would take too long. Thus, BLAST is a heuristic algorithm. Heuristic algorithms find a match reasonably close to the optimal one in a much shorter time than the full dynamic programming. The alignment found can then be separately verified / refined using slower but more accurate dynamic programming Understanding BLAST function is relevant for understanding when it fails to find hits! http://en.wikipedia.org/wiki/blast
How does BLAST work? All BLAST programs work more or less similarly, and computationally a BLAST search consists of three phases: seeding, extension, evaluation.
How BLAST works: The query sequence is divided into subsequences (words) of a given length. The word size is 3 for proteins and 11 for nucleotides. These words are used to look for exact or nearly exact matches in the sequence database. This is fast to do (computationally inexpensive). When a match is found, it is extended further.
Seeding, word size W=3. [Figure: the query KRISTIAN repeated six times, once per overlapping word; the search space between query and database with the word hits marked (remember the dot plot!); panels: alignment, gapped alignment]
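The seeding step can be sketched in a few lines of Python. This is a minimal illustration, not how BLAST is actually implemented: the function names and the subject sequence ARISTISANA are made up for the example, and only exact word matches are considered.

```python
def query_words(query, w=3):
    """Return (offset, word) for every overlapping word of length w."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def exact_word_hits(query, subject, w=3):
    """Find exact word matches as (query_offset, subject_offset, word)."""
    # Index the subject words once so each query word is a dictionary lookup.
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    hits = []
    for i, word in query_words(query, w):
        for j in index.get(word, []):
            hits.append((i, j, word))
    return hits


words = [word for _, word in query_words("KRISTIAN")]
print(words)  # ['KRI', 'RIS', 'IST', 'STI', 'TIA', 'IAN']

# ARISTISANA is a hypothetical database sequence used only for illustration.
print(exact_word_hits("KRISTIAN", "ARISTISANA"))
```

The six words correspond to the six copies of KRISTIAN on the slide. Hits that share the same diagonal (query offset minus subject offset) are the ones that line up in a dot plot.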
Threshold in seeding
- Word hit in blastn: two matching, identical words, one in the database, the other in the query sequence.
- Word hit in protein-related searches: a match to the neighborhood of a query word. The neighborhood of a word contains the word itself and all other words whose score is at least T (threshold) when compared via the scoring matrix.
- For example, if T=13, word=PQG, matrix=BLOSUM62, only words scoring at least 13 count as hits: PQG-PEG (15) is accepted, but PQG-PQA (12) is not.
- Setting T higher removes more word hits, making BLAST run faster, but increases the chance of missing an interesting alignment.
- Setting W (word size) higher decreases sensitivity (the chance of finding the alignment), but increases the speed of the search.
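The PQG example can be checked with a small sketch. Only the five BLOSUM62 entries needed here are included, and the helper names are illustrative, not part of any BLAST library.

```python
# Subset of BLOSUM62: only the pairs needed for the PQG example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("G", "A"): 0,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def word_score(word, candidate):
    """Sum of per-position substitution scores for two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(word, candidate))


def in_neighborhood(word, candidate, t):
    """A candidate belongs to the neighborhood if its score is at least T."""
    return word_score(word, candidate) >= t


for candidate in ("PQG", "PEG", "PQA"):
    score = word_score("PQG", candidate)
    print(candidate, score, in_neighborhood("PQG", candidate, t=13))
```

With T=13 the word itself (18) and PEG (15) pass, while PQA (12) is rejected.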
Extension: Word hits found during seeding are extended from their ends. Extension is stopped when the alignment score drops or, in newer implementations, when the alignment score has dropped enough (the drop-off score) compared to its previous maximum. [Figure labels: alignment, word hit, extension]
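Extension, example (gap = 0, X = 2, BLOSUM62)
Query:          K   R   I   S   T   I   A   N   -   -
Database:       -   R   I   S   T   I   S   A   N   A
BLOSUM62:       0   5   4   4   5   4   1  -2   0   0
Score:          0   5   9  13  18  22  23  21  21  21
Drop-off score: 0   0   0   0   0   0   0   2   2   2
The drop-off score is the previous maximum score minus the current score. Extension terminates when the drop-off score reaches X.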
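A minimal sketch of drop-off (X-drop) termination, using the per-column scores of the KRISTIAN example: 0, 5, 4, 4, 5, 4, 1, -2, 0, 0. The -2 assumes that the minus sign of the N/A pair was lost on the slide, and stopping when the drop reaches X (rather than exceeds it) is also an assumption made to match the example. The function name is illustrative.

```python
def xdrop_extend(column_scores, x):
    """Extend column by column; stop once the score has dropped by x or more.

    Returns (best_score, index_of_best_column, index_where_extension_stopped).
    The stop index is None if the end of the sequence was reached first.
    """
    best = running = 0
    best_end = -1
    for i, s in enumerate(column_scores):
        running += s
        if running > best:
            best, best_end = running, i
        elif best - running >= x:
            return best, best_end, i
    return best, best_end, None


scores = [0, 5, 4, 4, 5, 4, 1, -2, 0, 0]
print(xdrop_extend(scores, x=2))  # (23, 6, 7)
```

The alignment reported is the one ending at the best-scoring column (score 23), not at the column where extension stopped.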
Evaluation: When the extension stage has produced the alignments, they are evaluated to determine whether they are statistically significant. Statistical significance is determined using Karlin-Altschul statistics (the E-score). Some simplifying assumptions are made (such as infinitely long sequences and no gaps), but in practice K-A statistics generalize well.
E-score The lower the E-score, the more significant the alignment The E-score is dependent on both the database size and the scoring system (substitution matrix, gap penalties). If these are changed, the E-score for a specific alignment will also change.
Karlin-Altschul statistics: the E-value
$E = K m n \, e^{-\lambda S}$
- E = number of alignments reaching score S just by chance
- K = minor constant
- m = the length of the query sequence
- n = the size of the database (DB)
- e = Euler's number, approximately 2.718
- S = the alignment score; $\lambda$ is a normalization factor, so $\lambda S$ is the normalized alignment score
The E-value estimates the number of equally good (or better) hits expected from the DB by chance.
Karlin-Altschul, example
How many alignments with a score of 75 or better are expected by chance when two equally long (250) amino acid sequences are aligned using the PAM250 matrix?
$E = K m n \, e^{-\lambda S} = 0.1 \times 250 \times 250 \times e^{-0.229 \times 75} \approx 0.000217$
http://www.ncbi.nlm.nih.gov/blast/tutorial/altschul-1.html
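The example calculation can be reproduced with a short sketch. The values K = 0.1 and lambda = 0.229 are taken from the example above; the function name is illustrative.

```python
import math


def expected_hits(score, m, n, k, lam):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


e = expected_hits(score=75, m=250, n=250, k=0.1, lam=0.229)
print(f"{e:.6f}")  # 0.000217

# E scales linearly with the search space: doubling n doubles E.
e_double = expected_hits(score=75, m=250, n=500, k=0.1, lam=0.229)
print(e_double / e)
```

This also shows why the same alignment gets a different E-value against a database of a different size.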
Filtering out repeats
- The human genome (like most others) contains large amounts of repetitive DNA (LINEs, SINEs such as Alu, etc.).
- If the query sequence contains repeats, many of the matches identified will be to other sequences containing the same repeats.
- Repeats should in most instances be masked out.
- Masked nucleotides are usually shown as N, e.g. AATAGNNNNCGC; masked amino acids are shown as X.
Disadvantages of BLAST When expected sequence similarity drops below 80%, nucleotide-nucleotide blast no longer performs that well. Many significant homologies are missed due to the initial word size requirement. If initial words are allowed to be discontinuous, matching is improved.
Discontinuous initial words: For instance, require 11 positions out of 21 consecutive nucleotides to match. Description of BLAST services: http://www.ncbi.nlm.nih.gov/blast/html/blasthomehelp.html
Different varieties of BLAST
- BLASTN: DNA query against a database of DNA sequences (blastn).
- BLASTP: protein query against protein sequences (blastp).
- BLASTX: DNA query translated in six reading frames against a protein database (blastx).
- TBLASTX: DNA query against a DNA database, both compared via their translations to protein (tblastx).
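A discontinuous word can be sketched as a template of "care" (1) and "don't care" (0) positions. The template below has 11 ones in 21 positions; it is a hypothetical choice for illustration and is not claimed to be the one any BLAST program uses.

```python
# Hypothetical template: 11 "must match" positions (1) out of 21.
TEMPLATE = "100101100101100101101"


def spaced_word(seq, start, template=TEMPLATE):
    """Extract only the bases at the template's '1' positions."""
    return "".join(seq[start + i] for i, bit in enumerate(template) if bit == "1")


def spaced_hits(query, subject, template=TEMPLATE):
    """Return (query_offset, subject_offset) pairs whose spaced words agree."""
    span = len(template)
    index = {}
    for j in range(len(subject) - span + 1):
        index.setdefault(spaced_word(subject, j, template), []).append(j)
    return [(i, j)
            for i in range(len(query) - span + 1)
            for j in index.get(spaced_word(query, i, template), [])]


query = "ACGTACGTACGTACGTACGTA"
# Mutate three "don't care" positions (4, 10 and 16) in the subject.
subject = list(query)
subject[4], subject[10], subject[16] = "G", "T", "C"
subject = "".join(subject)

print(spaced_hits(query, subject))  # [(0, 0)]
```

The two sequences share no run of 11 identical consecutive bases, so a contiguous word of size 11 would miss this match, while the discontinuous word still finds it.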
Blastn and Megablast Typically used for identifying your sequence. Megablast is a fast alternative for finding nearly exact matches. Blastn is better at finding somewhat diverged sequences (e.g. from a related species). Blastn is more sensitive but slower than megablast
Blastx and tblastx Blastx translates the query sequence in all reading frames and compares it to a protein database. Aggregate statistics are provided for all reading frames. Tblastx queries a translated DNA sequence against a database of translated DNA sequences. Also produces aggregate statistics for all reading frames.
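What "all reading frames" means can be illustrated with a small six-frame translation sketch using the standard genetic code. The function names are illustrative; BLAST does this internally.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frames(dna):
    """Translate three forward and three reverse-complement frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Blastx compares each of the six translations of the query to the protein database; tblastx additionally translates every database sequence in the same way, which is why it is much slower.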
BLAST programs
Query            Database         Program    Typical uses
DNA              DNA              blastn     Annotation, mapping oligonucleotides to genome
protein          protein          blastp     Identifying common regions in proteins
translated DNA   protein          blastx     Finding protein-coding genes in genomic DNA
protein          translated DNA   tblastn    Identifying transcripts, possibly from multiple organisms
translated DNA   translated DNA   tblastx    Cross-species gene prediction, searching for genes not yet in protein databases
DNA              DNA              megablast  Large and closely related sequences
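The table can be turned into a small lookup. This is only a sketch of the decision: the function name and the argument convention (a flag for comparing at the protein level) are assumptions made for the example.

```python
# (query type, database type, compare via protein translation) -> program
PROGRAMS = {
    ("dna", "dna", False): "blastn",
    ("protein", "protein", True): "blastp",
    ("dna", "protein", True): "blastx",
    ("protein", "dna", True): "tblastn",
    ("dna", "dna", True): "tblastx",
}


def choose_program(query_type, db_type, via_protein):
    """Pick the BLAST program for a query/database combination."""
    key = (query_type.lower(), db_type.lower(), via_protein)
    if key not in PROGRAMS:
        raise ValueError(f"no BLAST program for {key}")
    return PROGRAMS[key]


print(choose_program("DNA", "protein", True))   # blastx
print(choose_program("DNA", "DNA", False))      # blastn
```

Megablast is left out because it covers the same DNA versus DNA case as blastn and is chosen for speed on closely related sequences, not by sequence type.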
Extensions to BLAST
- Primer-BLAST: make primers specific to your PCR template (http://www.ncbi.nlm.nih.gov/tools/primerblast/index.cgi?link_loc=blasthome)
- PSI-BLAST: protein sequence search method where (1) the best matches are aligned with the query sequence, (2) the alignment creates a profile that emphasises conserved regions, (3) the search is repeated with the created profile, and (4) the whole procedure is repeated.
- CS-BLAST: BLAST version that uses information about the neighboring amino acids to estimate substitutions
BLAST from command line
Why command line usage?
- Runs with 100 to 10 000 query sequences
- Runs against a specific database (say, all sequences from human, chimp and gorilla)
Applications:
- BLAST all human genes vs. all mouse genes
- Running BLAST between all sequences in an analyzed set
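A batch run of this kind might be scripted as follows. This is a sketch assuming the NCBI BLAST+ `blastn` program is installed and a database has already been built; the file and database names are hypothetical. The column names are the default fields of BLAST+ tabular output (`-outfmt 6`).

```python
import shlex

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")


def blastn_command(query_fasta, database, out_path, evalue=1e-5, word_size=11):
    """Build the argument list for a tabular blastn run."""
    return ["blastn", "-query", query_fasta, "-db", database,
            "-evalue", str(evalue), "-word_size", str(word_size),
            "-outfmt", "6", "-out", out_path]


def parse_tabular(lines):
    """Yield one dict per hit line of -outfmt 6 output."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hit = dict(zip(FIELDS, line.split("\t")))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        yield hit


def significant_hits(lines, max_evalue=1e-5):
    """Keep only hits at or below the E-value cutoff."""
    return [h for h in parse_tabular(lines) if h["evalue"] <= max_evalue]


# Hypothetical file and database names.
cmd = blastn_command("queries.fasta", "primates_db", "hits.tsv")
print(shlex.join(cmd))
# To actually run it (requires BLAST+ and a formatted database):
#   import subprocess; subprocess.run(cmd, check=True)
#   with open("hits.tsv") as handle:
#       hits = significant_hits(handle)
```

Building the command as a list avoids shell quoting problems when the same search is repeated over thousands of query files.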
...how do you choose the BLAST version best suited to your purpose?! Help with choosing the program: http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#pstab
Choosing scoring matrices
- PAM matrices do not necessarily work very well in database searches. The BLOSUM series should therefore be preferred where possible.
- BLOSUM62 for the BLAST programs. Gap penalties: -8/-2
- For DNA, the matrix most often used has matches = 5 and mismatches = -4.
Based on the originals by J. Tuimala
Decision chart for finding homologous sequences
1. Does the sequence code for a protein, or can it be translated into protein?
   - Yes: search protein databases or a translated DNA sequence database.
   - No: search DNA sequence databases.
2. Is the search result meaningful and statistically significant?
   - Yes: repeat the database search using the sequences found as query sequences.
   - No: first adjust the BLAST parameters, then look for motifs and blocks.
3. Make a multiple sequence alignment of the sequences obtained and build a possible phylogenetic tree.
Chart after J. Tuimala, Bioinformatiikan perusteet
Low-complexity filtering
- Low-complexity regions can be removed from the sequences before running the search.
- Low-complexity regions consist of, e.g., several copies of the same nucleotide or amino acid in a row.
- Low-complexity regions are spotted using a sliding window method (complexity is calculated repeatedly inside every window).
- In BLAST, the window size is 12 nucleotides or amino acids.
- Human repeats, like SINE and LINE regions, can also be masked.
- In the BLAST report, the masked (filtered) areas are marked with Ns (DNA) or Xs (amino acids).
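The sliding-window idea can be sketched with Shannon entropy as the complexity measure. This is a simplified stand-in, not the filter BLAST itself uses (BLAST relies on dedicated programs such as DUST and SEG); the entropy threshold of 1.0 bit and the function names are assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the symbol composition of a window."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=1.0, mask_char="N"):
    """Mask every position covered by a window whose entropy is too low."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


seq = "ACGTGCATGCTA" + "A" * 15 + "CGTACGTAGCTA"
print(mask_low_complexity(seq))
# For a protein sequence, pass mask_char="X".
```

Because whole windows are masked, a few bases flanking the poly-A run are masked as well; the threshold controls how aggressive the filter is.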
Let's try it: tttcggctca ctactaggag catgcctaat tacccagatc ctaacagggt tatttctagc catacattat acacctgaca caataactgc cattttcatc tatatcccat atctgccgag atgtcaacta cggttgaatt attcgacaac tacactcaaa cggagcatct attttcttcc
Practical tips for searches
- Do them with the same care as laboratory experiments!
- Always start with the default search parameters (gap penalties, scoring matrices). If the results are not satisfactory, adjust the word size, the scoring matrix and/or possibly the E-value threshold in a more suitable direction.
- General-purpose scoring matrices for amino acids: BLOSUM62 (with gap penalties -8 and -2) and BLOSUM50 (-12/-2 or -14/-2)
- Restrict the search to the database of interest (and/or a division of it); this can speed up your search substantially!
- E.g., if you do not want many different forms of the same sequence as hits, and you know which organism's sequence matches you are after, run your search with Genomic BLAST! (A BLAST search run directly against GenBank, i.e. the nucleotide database, returns ALL matching sequences, even if they are only different versions of the same genomic sequence.)
Practical tips:
- The search servers are most heavily loaded in the middle of the working day, 10-16 local time.
- If your sequence is protein-coding, use the amino acid sequence, not the DNA, for comparisons. Organisms differ in, e.g., codon usage, which can cause problems in database searches!
- Remove low-complexity (simple and repeat) regions by filtering (available as an option in BLAST) -> this reduces the number of biologically irrelevant similarities found.
- Very short sequences, about 20 bp: the default search parameters of basic BLAST do not work well for these! Therefore: decrease the word size and increase the E-value threshold.
- When studying the genome specificity of PCR primers, use the new Primer-BLAST!
- Short amino acid sequences: use scoring matrices suited to closely related sequences, e.g. PAM30, BLOSUM80, BLOSUM90. For distant relatives: PAM250, BLOSUM62.
Interpreting the results
- Hits to ESTs and hypothetical proteins (especially very short ones): treat these with caution!
- Poor hits are easy to recognise from the large number of gaps in the alignment (in that case, raise your gap penalty values!)
Example sequence 2: BLAST it! gactgtgagc aaagctttag taccaggcaa cattttcaaa tcaagtggac ttacagatgg tattgcttat gagttccggg tgattgcaga aaacatggca ggcaaaagta agccaagcaa gccatcagaa cctatgttgg ctctggatcc cattgaccca cctggaaaac cagtacctct aaatattaca agacacacag taacacttaa atgggctaag cctgaatata ctgggggctt