Information Retrieval

Samankaltaiset tiedostot
Searching (Sub-)Strings. Ulf Leser

Capacity Utilization

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

Efficiency change over time

1. SIT. The handler and dog stop with the dog sitting at heel. When the dog is sitting, the handler cues the dog to heel forward.

Uusi Ajatus Löytyy Luonnosta 4 (käsikirja) (Finnish Edition)

Tietorakenteet ja algoritmit

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

anna minun kertoa let me tell you

Bounds on non-surjective cellular automata

Information on preparing Presentation

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

The CCR Model and Production Correspondence

Alternative DEA Models

Returns to Scale II. S ysteemianalyysin. Laboratorio. Esitelmä 8 Timo Salminen. Teknillinen korkeakoulu

LYTH-CONS CONSISTENCY TRANSMITTER

Results on the new polydrug use questions in the Finnish TDI data

Nuku hyvin, pieni susi -????????????,?????????????????. Kaksikielinen satukirja (suomi - venäjä) ( (Finnish Edition)

1. Liikkuvat määreet

Gap-filling methods for CH 4 data

Network to Get Work. Tehtäviä opiskelijoille Assignments for students.

T Statistical Natural Language Processing Answers 6 Collocations Version 1.0

The Viking Battle - Part Version: Finnish

16. Allocation Models

Choose Finland-Helsinki Valitse Finland-Helsinki

1.3Lohkorakenne muodostetaan käyttämällä a) puolipistettä b) aaltosulkeita c) BEGIN ja END lausekkeita d) sisennystä

Oma sininen meresi (Finnish Edition)

Other approaches to restrict multipliers

make and make and make ThinkMath 2017

Skene. Games Refueled. Muokkaa perustyyl. for Health, Kuopio

Integration of Finnish web services in WebLicht Presentation in Freudenstadt by Jussi Piitulainen

Salasanan vaihto uuteen / How to change password

Small Number Counts to 100. Story transcript: English and Blackfoot

S SÄHKÖTEKNIIKKA JA ELEKTRONIIKKA

Exercise 1. (session: )

C++11 seminaari, kevät Johannes Koskinen

Statistical design. Tuomas Selander

FinFamily PostgreSQL installation ( ) FinFamily PostgreSQL

Miksi Suomi on Suomi (Finnish Edition)

BDD (behavior-driven development) suunnittelumenetelmän käyttö open source projektissa, case: SpecFlow/.NET.

AYYE 9/ HOUSING POLICY

FinFamily Installation and importing data ( ) FinFamily Asennus / Installation

Käyttöliittymät II. Käyttöliittymät I Kertaus peruskurssilta. Keskeisin kälikurssilla opittu asia?

MEETING PEOPLE COMMUNICATIVE QUESTIONS

National Building Code of Finland, Part D1, Building Water Supply and Sewerage Systems, Regulations and guidelines 2007

KONEISTUSKOKOONPANON TEKEMINEN NX10-YMPÄRISTÖSSÄ

Operatioanalyysi 2011, Harjoitus 4, viikko 40

Sisällysluettelo Table of contents

Uusi Ajatus Löytyy Luonnosta 3 (Finnish Edition)

Travel Getting Around

Information on Finnish Language Courses Spring Semester 2018 Päivi Paukku & Jenni Laine Centre for Language and Communication Studies

BLOCKCHAINS AND ODR: SMART CONTRACTS AS AN ALTERNATIVE TO ENFORCEMENT

FIS IMATRAN KYLPYLÄHIIHDOT Team captains meeting

Information on Finnish Language Courses Spring Semester 2017 Jenni Laine

Huom. tämä kulma on yhtä suuri kuin ohjauskulman muutos. lasketaan ajoneuvon keskipisteen ympyräkaaren jänteen pituus

Alueellinen yhteistoiminta

1.3 Lohkorakenne muodostetaan käyttämällä a) puolipistettä b) aaltosulkeita c) BEGIN ja END lausekkeita d) sisennystä

Information on Finnish Courses Autumn Semester 2017 Jenni Laine & Päivi Paukku Centre for Language and Communication Studies

ECVETin soveltuvuus suomalaisiin tutkinnon perusteisiin. Case:Yrittäjyyskurssi matkailualan opiskelijoille englantilaisen opettajan toteuttamana

Metsälamminkankaan tuulivoimapuiston osayleiskaava

7.4 Variability management

Valuation of Asian Quanto- Basket Options

812336A C++ -kielen perusteet,

Capacity utilization

( ( OX2 Perkkiö. Rakennuskanta. Varjostus. 9 x N131 x HH145

ALOITUSKESKUSTELU / FIRST CONVERSATION

Tynnyrivaara, OX2 Tuulivoimahanke. ( Layout 9 x N131 x HH145. Rakennukset Asuinrakennus Lomarakennus 9 x N131 x HH145 Varjostus 1 h/a 8 h/a 20 h/a

TM ETRS-TM35FIN-ETRS89 WTG

MRI-sovellukset. Ryhmän 6 LH:t (8.22 & 9.25)

Use of spatial data in the new production environment and in a data warehouse

TM ETRS-TM35FIN-ETRS89 WTG

TM ETRS-TM35FIN-ETRS89 WTG

WindPRO version joulu 2012 Printed/Page :47 / 1. SHADOW - Main Result

TM ETRS-TM35FIN-ETRS89 WTG

4x4cup Rastikuvien tulkinta

MUSEOT KULTTUURIPALVELUINA

TM ETRS-TM35FIN-ETRS89 WTG

WindPRO version joulu 2012 Printed/Page :42 / 1. SHADOW - Main Result

Tarua vai totta: sähkön vähittäismarkkina ei toimi? Satu Viljainen Professori, sähkömarkkinat

TM ETRS-TM35FIN-ETRS89 WTG

ProAgria. Opportunities For Success

Lab SBS3.FARM_Hyper-V - Navigating a SharePoint site

EUROOPAN PARLAMENTTI

Basic Flute Technique

Infrastruktuurin asemoituminen kansalliseen ja kansainväliseen kenttään Outi Ala-Honkola Tiedeasiantuntija

RINNAKKAINEN OHJELMOINTI A,

TM ETRS-TM35FIN-ETRS89 WTG

Data Quality Master Data Management

Data protection template

TIETEEN PÄIVÄT OULUSSA

The role of 3dr sector in rural -community based- tourism - potentials, challenges

Metal 3D. manufacturing. Kimmo K. Mäkelä Post doctoral researcher

Ajettavat luokat: SM: S1 (25 aika-ajon nopeinta)

Uusia kokeellisia töitä opiskelijoiden tutkimustaitojen kehittämiseen

Green Growth Sessio - Millaisilla kansainvälistymismalleilla kasvumarkkinoille?

Suunnittelumallit (design patterns)

Returns to Scale Chapters

SELL Student Games kansainvälinen opiskelijaurheilutapahtuma

,0 Yes ,0 120, ,8

( ,5 1 1,5 2 km

Transkriptio:

Information Retrieval Searching erms atrick Schäfer Ulf Leser

Content of this Lecture Searching strings Naïve exact string matching Boyer-Moore BM-Variants and comparisons Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 2

Searching Strings in ext All IR models require finding occurrences of terms in documents Fundamental operation: find(k,d) -> (DD) Indexing: reprocess docs and use index for searching Apply tokenization; can only find entire words Classical IR technique (inverted files) Online searching: Consider docs and query as new No preprocessing - slower Usually without tokenization more searchable substrings Classical algorithmic problem: Substring search Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 3

roperties Advantages of substring search Does not require (erroneous, ad-hoc) tokenization U.S., 35,00=.000, alpha-type1 AML-3 protein, Search across tokens / sentences / paragraphs, that, happen., Searching prefixes, infixes, suffixes, stems compar, ver (German), Searching substrings is harder than searching terms Number of unique terms doesn t increase much with corpus size (from a certain point on) English: ~ 1 Million terms, but 200 Million potential substrings of size 6 Need to index all possible substrings Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 4

ypes of Substring Searching Exact search: Find all exact occurrences of a pattern (substring) p in D RegExp matching: Find all matches of a regular exp. p in D Approximate search: Find all substrings in D that are similar to a pattern p honetically similar (Soundex) Only one typo away (keyboard errors) Strings that can be produced from p by at most n operations of type insert a letter, delete a letter, change a letter Multiple strings: Searching >1 strings at once in D Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 5

Definition: Strings A String S is a sequential, ordered list of characters from a finite alphabet Σ S is the number of characters in S ositions in S are counted from 1,..., S S[i] denotes the character at position i in S S 123456789 dadfabzzb S[i..j] denotes the substring of S starting at position i and ending at position j (including both) S[1..i] or S[..i] is the prefix of S until position i S[i.. S ] or S[i..] is the suffix of S starting from position i S[..i] (S[i..]) is called a proper prefix (suffix) of S iff i 0 (not empty) and i S (not entire String). Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 6

Exact Substring Matching Given: attern to search for, text to search in We require ( is longer than ) We assume << ( is much longer than ) ask: Find all occurrences of in Where is GAAC tcagcttactaattaaaaattctttctagtaagtgctaagatcaagaaaataaattaaaaataatggaacatggcacattttcctaaactcttcacagattgctaatga ttattaattaaagaataaatgttataattttttatggtaacggaatttcctaaaatattaattcaagcaccatggaatgcaaataagaaggactctgttaattggtact attcaactcaatgcaagtggaactaagttggtattaatactcttttttacatatatatgtagttattttaggaagcgaaggacaatttcatctgctaataaagggattac atatttatttttgtgaatataaaaaatagaaagtatgttatcagattaaacttttgagaaaggtaagtatgaagtaaagctgtatactccagcaataagttcaaataggc gaaaaactttttaataacaaagttaaataatcattttgggaattgaaatgtcaaagataattacttcacgataagtagttgaagatagtttaaatttttctttttgtatt acttcaatgaaggtaacgcaacaagattagagtatatatggccaataaggtttgctgtaggaaaattattctaaggagatacgcgagagggcttctcaaatttattcaga gatggatgtttttagatggtggtttaagaaaagcagtattaaatccagcaaaactagaccttaggtttattaaagcgaggcaataagttaattggaattgtaaaagatat ctaattcttcttcatttgttggaggaaaactagttaacttcttaccccatgcagggccatagggtcgaatacgatctgtcactaagcaaaggaaaatgtgagtgtagact ttaaaccatttttattaatgactttagagaatcatgcatttgatgttactttcttaacaatgtgaacatatttatgcgattaagatgagttatgaaaaaggcgaatatat tattcagttacatagagattatagctggtctattcttagttataggacttttgacaagatagcttagaaaataagattatagagcttaataaaagagaacttcttggaat tagctgcctttggtgcagctgtaatggctattggtatggctccagcttactggttaggttttaatagaaaaattccccatgattgctaattatatctatcctattgagaa caacgtgcgaagatgagtggcaaattggttcattattaactgctggtgctatagtagttatccttagaaagatatataaatctgataaagcaaaatcctggggaaaatat tgctaactggtgctggtagggtttggggattggattatttcctctacaagaaatttggtgtttactgatatccttataaataatagagaaaaaattaataaagatgatat Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 7

How to do it? he straight-forward way (naïve algorithm) We use two counters: t, p One (outer, t) runs through One (inner, p) runs through Compare characters at position [t+p-1] and [p] for t = 1 to - + 1 p := 1; while (p <= and [t+p 1] == [p]) p := p + 1; end while; if (p == ) then REOR t end for; 123456789 ctgagatcgcgta gagatc gagatc gagatc gagatc gagatc gatatc gatatc gatatc Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 8

Examples ypical case Worst case ctgagatcgcgta gagatc gagatc gagatc gagatc gagatc gatatc gatatc gatatc aaaaaaaaaaaaaa aaaaat aaaaat aaaaat aaaaat... How many comparisons do we need in the worst case? t runs through p runs through the entire for every value of t hus: * comparisons Indeed: he algorithm has worst-case complexity O( * ) Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 9

Other Algorithms Exact substring search has been researched for decades Boyer-Moore, Z-Box, Knuth-Morris-ratt, Karp-Rabin, Shift-AND, All have WC complexity O( + ) For many, WC=AC, but empirical performance differs much Real performance depends much on the size of alphabet and the composition of strings (algs have their strength in certain settings) Better performance possible if is preprocessed (up to O( )) In practice, our naïve algorithm is quite competitive for non-trivial alphabets and biased letter frequencies E.g., English text But we can do better: Boyer-Moore We present a simplified form BM is among the fastest algorithms in practice Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 10

Content of this Lecture Searching strings Naïve exact string matching Boyer-Moore BM-Variants and comparisons Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 11

Boyer-Moore Algorithm R.S. Boyer /J.S. Moore. A Fast String Searching Algorithm, Communications of the ACM, 1977 Main idea As for the naïve alg, we use two counters (inner loop, outer loop) Inner loop runs from right-to-left If we reach a mismatch, we know he character in we just haven t seen his is captured by the bad character rule he suffix in we just have seen his is captured by the good suffix rule Use this knowledge to make longer shifts in Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 12

Boyer-Moore Main Idea Inner loop runs from right-to-left If we reach a mismatch, and this bad character does not appear in at all, we can shift the pattern my positions: 123456789 dadfabzzbwzzbzzb aaba aaba Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 13

Bad Character Rule Setting 1 We are at position t in and compare right-to-left Let i by the position of the first mismatch in, n= We saw n-i matches before Let x be the character at the corresponding pos (t-n+i) in Candidates for matching xx in Case 1: xx does not appear in at all we can move t such that t-n+i is not covered by anymore 123456789 dadfabzzbwzzbzzb abwzabzz 123456789 dadfabzzbwzzbzzb abwzabzz abwzabzz t-n+i t What next? Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 14

Bad Character Rule 2 j a 5 b 6 Setting 2 We are at position t in and compare right-to-left Let i by the position of the first mismatch in, n= Let x be the character at the corresponding pos (t-n+i) in Candidates for matching xx in Case 1: xx does not appear in at all w 3 z 8 Case 2: Let j be the right-most appearance of xx in and let j<i we can move t such that j and t align 123456789 dadfabzzbwzzbzzb abwzabzz 123456789 dadfabzzbwzzbzzb abwzabzz abwzabzz j i What next? Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 15

Bad Character Rule 3 Setting 3 We are at position t in and compare right-to-left Let i by the position of the first mismatch in, n= Let x be the character at the corresponding pos (t-n+i) in Candidates for matching xx in Case 1: x does not appear in at all Case 2: Let j be the right-most appearance of xx in and let j<i Case 3: As case 2, but j>i we need some more knowledge 123456789 dadfabzzbwzzbzzb abwzabzz i j Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 16

reprocessing 1 In case 3, there are some xx right from position i For small alphabets (DNA), this will almost always be the case In human languages, this is often the case (e.g. for vowels) hus, case 3 is a usual one hese xx are irrelevant we need the right-most xx left of i his can (and should!) be pre-computed Build a two-dimensional array A[, ] Run through from left-to-right (pointer i) If character c appears at position i, set all A[c,j]:=i for all j>=i Requested time (complexity?) negligible Because << and complexity independent from Array: Constant lookup, needs some space (lists ) Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 17

(Extended) Bad Character Rule 12345678 abwzabzz A 1 2 3 4 5 6 7 8 a 1 1 1 1 5 5 5 5 b 0 2 2 2 2 6 6 6 w 0 0 3 3 3 3 3 3 z 0 0 0 4 4 4 7 8 123456789 dadfabzzbwzzbzzb abwzabzz 123456789 dadfabzzbwzzbzzb abwzabzz abwzabzz i A[ z,5] Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 18

(Extended) Bad Character Rule EBCR: Shift t by i-a[x,i] positions Simple and effective for larger alphabets For random strings over, average shift-length is /2 hus, n# of comparisons down to *2/ Worst-Case complexity does not change Why? Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 19

(Extended) Bad Character Rule EBCR: Shift t by i-a[x,i] positions Simple and effective for larger alphabets For random strings over, average shift-length is /2 hus, n# of comparisons down to *2/ Worst-Case complexity does not change Why? Shift-length can be always 1: =g m ggggggggggggggggggggggggggggggggggggggggg =ag n O( * ) aggggggggggg aggggggggggg aggggggggggg aggggggggggg Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 20

Good-Suffix Rule Recall: If we reach a mismatch, we know he character in we just haven t seen he suffix in we just have seen Good suffix rule We have just seen some matches (let these be S) in Where else does S appear in? If we know the right-most appearance S of S in, we can immediately align S with the current match in If S does not appear anymore in, we can shift t by x S x S S y S S y S Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 21

Good-Suffix Rule One Improvement Actually, we can do a little better Not all S are of interest to us Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 22

Good-Suffix Rule One Improvement Actually, we can do a little better Not all S are of interest to us x S x S S y S y S y S We only need S whose next character to the left is not y Why don t we directly require that this character is x? Of course, this could be used for further optimization Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 23

Good-Suffix Rule Special case: Let S be a suffix of S and S be a prefix of : x S x S S y S S y S We have to align S with S. Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 24

Good-Suffix-Rule reprocessing 2 Use two arrays: osition of the longest suffix f: f[i] stores the starting position of prefix [i..] in the suffix of. Maximum shift s: s[i] stores for position i the maximum shift to the left. 1 2 3 4 5 6 7 8 9 10 11 f c a b a a b g b a a 11 10 8 9 10 11 11 11 10 11 12 s 0 0 0 0 0 0 0 5 0 1 2 Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 25

Concluding Remarks reprocessing 2 For the GSR, we need to find all occurrences of all suffixes of in his can be solved using our naïve algorithm for each suffix Or, more complicated, in linear time (not this lecture) WC complexity of Boyer-Moore is still O( * ) But average case is sub-linear: O( / ); especially when >, which causes many shifts by. WC complexity can be reduced to linear (not this lecture), but this usually doesn t pay-off on real data Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 26

Boyer-Moore - Algorithm Compare characters at position [t+ -p-1] and [p] t runs from left-to-right through ; p runs from right-to-left through ; Mismatch: shift by maximum of GSR and EBCR. Match found: shift using GSR. for t := 1 to - +1 do p := ; while (p > 0 and [t+ -p-1] == [p]) do p := p-1; end while if (p==0) then // match REOR t; shift t to largest prefix of, which is also a suffix of else // no match shift t by GSR, EBCR; end for Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 27

Example b b c g g b c b a a g g b b a a c a b a a b g b a a c g c a b a a b c a b c a b a a b g b a a b b c g g b c b a a g g b b a a c a b a a b g b a a c g c a b a a b c a b EBCR wins c a b a a b g b a a b b c g g b c b a a g g b b a a c a b a a b g b a a c g c a b a a b c a b GSR wins c a b a a b g b a a b b c g g b c b a a g g b b a a c a b a a b g b a a c g c a b a a b c a b GSR wins c a b a a b g b a a b b c g g b c b a a g g b b a a c a b a a b g b a a c g c a b a a b c a b Match Mismatch Good suffix Ext. Bad character c a b a a b g b a a Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 28

Content of this Lecture Searching strings Naïve exact string matching Boyer-Moore BM-Variants and comparisons Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 29

wo Faster Variants BM-Horspool Drop the good suffix rule GSR makes algorithm slower in practice Rarely shifts longer than EBCR Needs time to compute the shift Instead of looking at the mismatch character x, always look at the symbol in aligned to the last position of Generates longer shifts on average (i is maximal) BM-Sunday Instead of looking at the mismatch character x, always look at the symbol in after the symbol aligned to the last position of Generates even longer shifts on average Alternative: Always look at the least frequent (in the language of ) symbol of first Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 30

BM Variants 123456789 123456789 123456789 abcabdaacba bcaab bcaab abcabdaacba bcaab bcaab abcabdaacba bcaab bcaab (a) Boyer-Moore (b) BM-Horspool (c) BM-Sunday 123456789 abcabdaacba bcaab bcaab (d) BM-Sunday + Least Frequent Char Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 31

Empirical Comparison n Machine word Shift-OR: Using parallelization in CU (only small alphabets) BNDM: Backward nondeterministic Dawg Matching (automata-based) BOM: Backward Oracle Matching (automata-based) Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 32

Self Assessment Explain the Boyer-Moore algorithm Which rule is better GSR or EBCR? How can we efficiently implement EBCR? How does the Sunday algorithm deviate from BM? How can we use character frequencies to speed up BM? If we do so - which part of the algorithm is sped up? Schäfer, Leser: Searching Strings, Winter Semester 2016/2017 33