Evaluating data mining results. OUGF autumn seminar 2004, Hannu Toivonen

hannu.toivonen@cs.helsinki.fi

Kerry wins the election!
Data mining has produced the following result:
- If the Washington Redskins lose their last home game before the US presidential election, the incumbent president loses the election. If the Redskins win, the incumbent president wins.
- The observation has held since 1936, for the club's entire existence, which covers 17 elections.
- The Redskins lost at home to the Green Bay Packers 14-28 on 31 October 2004; the election was held on 2 November 2004.

Reliability of predictions
- A central problem in data mining: how do we obtain reliable predictions?
- The topic of this talk: how do we assess the reliability of predictions?
- Contents:
  - Accuracy measures
  - Taking class and cost distributions into account
  - ROC curve
  - Test and validation sets
  - Statistical significance
- Example application: predicting the carcinogenicity of a chemical compound

Identification of carcinogenic chemicals
- UK population: 33% get cancer, 25% of them die of cancer
- A large part is due to environmental chemicals
  - 100,000 chemicals not tested
  - US$ 250,000,000 spent on screening
  - 70,000 animals tested in bioassays
- Need for predictive toxicology models
  - identify hazardous chemical exposures more rapidly and at lower cost than current procedures
[Figure labels: Slow, Expensive, Fast, Cheap]

Predictive Toxicology Challenge (PTC)
- A scientific challenge/competition to build predictors
  - Training set: ~500 tested and known compounds
  - Test set: 185 new compounds
- 4 data sets: male/female rats/mice
- Skewed class distribution:
  - 29-52 positive (carcinogenic) compounds
  - 33-156 negative (non-carcinogenic) compounds

How to evaluate predictors?
- Blind test: 185 unknown cases
- Predictions for the unknown cases were submitted to a workshop at ECML/PKDD
- Only the organizers knew the true classes at the time of the workshop
- Around 30 submissions were received for each data set
- What can be said about the (statistical) quality of the submissions?

Performance characteristics
E.g., classifier Leu2 (id number 13) for male mice. Confusion matrix (0 = not carcinogenic, 1 = carcinogenic):

                       true class
                       0      1      total
  predicted class 0    97     13     110
  predicted class 1    56     16      72
  total               153     29     182

Various accuracy measures:
- "Prediction accuracy": (97+16)/182 = 0.62
- "Power", "recall", "true positive rate", "sensitivity": 16/29 = 0.55
- "Precision": 16/72 = 0.22
- "False positive rate": 56/153 = 0.37
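The four measures on this slide can be recomputed directly from the confusion matrix counts. The following is a minimal sketch; the function name `confusion_metrics` and its argument names are illustrative choices, not part of the original talk.

```python
def confusion_metrics(tn, fn, fp, tp):
    """Accuracy measures from confusion matrix counts.

    tn: true 0 predicted 0, fn: true 1 predicted 0,
    fp: true 0 predicted 1, tp: true 1 predicted 1.
    """
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall": tp / (tp + fn),            # true positive rate
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
    }


# Counts for Leu2 on male mice, taken from the slide.
leu2 = confusion_metrics(tn=97, fn=13, fp=56, tp=16)
for name, value in leu2.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.62
# recall: 0.55
# precision: 0.22
# false_positive_rate: 0.37
```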

Problems with predictive accuracy
- Examples:
  - always predict "negative": prediction accuracy is high, 153/182 = 0.84 (vs. 0.62 for Leu2)
  - a random classifier that predicts "negative" with probability 0.84 has, on average, an accuracy of 0.73
- Predictive accuracy has problems with
  - skewed class distributions
    - most examples are negative
    - any classifier that predicts mostly "negative" seems to perform well
  - different misclassification costs
    - what if a false negative (misclassifying a carcinogenic compound as non-carcinogenic) is expensive, but a false positive is cheap?
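The two baseline figures can be checked with a few lines of arithmetic. This is a sketch under the assumption that the random classifier guesses independently of the true class; the function names are illustrative.

```python
def always_negative_accuracy(n_neg, n_pos):
    """Accuracy of a classifier that predicts "negative" for every case."""
    return n_neg / (n_neg + n_pos)


def random_guess_accuracy(p_predict_neg, p_true_neg):
    """Expected accuracy when guesses are independent of the true class."""
    return (p_predict_neg * p_true_neg
            + (1 - p_predict_neg) * (1 - p_true_neg))


# Male mice test data from the slides: 153 negatives, 29 positives.
all_negative = always_negative_accuracy(153, 29)
random_guess = random_guess_accuracy(all_negative, all_negative)
print(f"{all_negative:.2f}")  # 0.84
print(f"{random_guess:.2f}")  # 0.73
```

Both baselines beat the 0.62 accuracy of Leu2 without using any information about the compounds, which is why accuracy alone is a poor yardstick here.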

Analysis and visualization of classifier performance
- ROC (receiver operating characteristic)
  - True positive rate ("benefit") as a function of false positive rate ("cost")
- The ROC convex hull gives the best available classifier for any given conditions, including
  - skewed class distributions
  - unequal error costs
- Also applicable to one method run with different parameters
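One way to compute the ROC convex hull is a monotone-chain upper hull over the classifier points, with the trivial classifiers (0, 0) ("always negative") and (1, 1) ("always positive") added. The sketch below assumes each classifier is given as an (FP rate, TP rate) pair; the example points are hypothetical and are not PTC submissions.

```python
def _cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive = left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of (fp_rate, tp_rate) points, left to right."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the
        # line from the point before it to the new point p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


# Hypothetical classifiers, for illustration only.
classifiers = [(0.1, 0.5), (0.3, 0.8), (0.4, 0.6), (0.6, 0.9)]
print(roc_convex_hull(classifiers))
# [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.9), (1.0, 1.0)]
```

The point (0.4, 0.6) falls below the hull: whatever the class and cost distributions, some other classifier on the hull does at least as well.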

Properties of ROC space
- Two points $(FP_1, TP_1)$ and $(FP_2, TP_2)$ have the same performance if
  $$\frac{TP_2 - TP_1}{FP_2 - FP_1} = \frac{p(\mathrm{neg})\,\mathrm{Cost}(fp)}{p(\mathrm{pos})\,\mathrm{Cost}(fn)}$$
  where $p(\mathrm{neg})$ and $p(\mathrm{pos})$ are the prior probabilities of negative and positive examples, and $\mathrm{Cost}(fp)$ and $\mathrm{Cost}(fn)$ are the costs of false positives and false negatives
- -> Optimal classifiers are on the convex hull
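The equal-performance condition follows from the expected misclassification cost, p(pos) (1 - TP) Cost(fn) + p(neg) FP Cost(fp): two points have the same expected cost exactly when the slope between them equals the ratio above. The sketch below picks the cheapest point for given priors and costs. The candidate points and the cost values are hypothetical; only the class prior 29/182 comes from the male mice data.

```python
def expected_cost(fp_rate, tp_rate, p_pos, cost_fp, cost_fn):
    """Expected misclassification cost per example at one ROC point."""
    p_neg = 1.0 - p_pos
    return p_pos * (1.0 - tp_rate) * cost_fn + p_neg * fp_rate * cost_fp


def iso_performance_slope(p_pos, cost_fp, cost_fn):
    """Slope of the lines of equal expected cost in ROC space."""
    return ((1.0 - p_pos) * cost_fp) / (p_pos * cost_fn)


def best_point(points, p_pos, cost_fp, cost_fn):
    """ROC point with the lowest expected cost."""
    return min(points,
               key=lambda pt: expected_cost(pt[0], pt[1], p_pos, cost_fp, cost_fn))


# Hypothetical hull points; class prior from the male mice data (29 of 182).
points = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.9), (1.0, 1.0)]
p_pos = 29 / 182
print(best_point(points, p_pos, cost_fp=1, cost_fn=1))   # (0.0, 0.0)
print(best_point(points, p_pos, cost_fp=1, cost_fn=10))  # (0.3, 0.8)
```

With equal costs and this skewed prior, "always negative" is the cheapest choice; once a false negative is assumed to cost ten times as much, a point further along the hull wins.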

Accuracy of predictions
- On the face of it, however, the Washington Redskins are a perfect predictor: accuracy has been 100% (and the same holds for the other corresponding measures)
- One problem: the Redskins rule has not been tested on an independent test set

Using a test set
The accuracy of a predictor can be assessed more reliably if a separate test set is available (as there was in PTC, but not in the Redskins case)
1. The predictor is mined from the training set
2. The predictor is applied to the test set: a prediction is made for each test case
3. The predictions are compared with the true outcomes
[Figure: training set -> 1. mining -> predictor; 2. predictor applied to the test set; 3. predictions compared with true outcomes]

A trap
What if a good predictor is sought as follows:
1. Generate 1000 predictors using the training set
2. Test each of them on the test set
3. Output the predictor that performed best on the test set
Problem: the predictor was selected using the test set, so its accuracy on the test set is an over-optimistic estimate of its true accuracy
Solution: three separate data sets
- training set (for learning the predictors)
- validation set (for selecting a predictor)
- test set (for testing the selected predictor)
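The trap can be demonstrated with a small simulation. The sketch below makes a deliberately extreme assumption: all 1000 "predictors" are pure coin flips and the labels are random too, so the true accuracy of every predictor is 0.5. The set sizes and the seed are arbitrary illustrative choices.

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def selection_bias_demo(n_predictors=1000, n_select=100, n_fresh=1000, seed=0):
    """Pick the best of many coin-flip predictors, then retest it."""
    rng = random.Random(seed)

    def coin_flips(n):
        return [rng.randint(0, 1) for _ in range(n)]

    # Steps 2-3 of the trap: score every predictor on one set, keep the best.
    select_labels = coin_flips(n_select)
    best_acc = max(accuracy(coin_flips(n_select), select_labels)
                   for _ in range(n_predictors))

    # The winner is still a coin flip, so on fresh cases it guesses again.
    fresh_labels = coin_flips(n_fresh)
    fresh_acc = accuracy(coin_flips(n_fresh), fresh_labels)
    return best_acc, fresh_acc


best_acc, fresh_acc = selection_bias_demo()
print(f"accuracy on the set used for selection: {best_acc:.2f}")
print(f"accuracy of the winner on fresh data:   {fresh_acc:.2f}")
```

The winner looks clearly better than chance on the set it was selected with, while its accuracy on untouched data stays near 0.5. This gap is the reason for keeping separate validation and test sets.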

Statistical significance
- The ROC graph and prediction accuracy (on a hold-out test set) describe how well a classifier performs, but not whether it has really learned anything
  - Could the submitted classifiers be trusted in real-life toxicology testing?
  - Can we assume they perform as well on other, yet unseen cases?
- Statistical significance tells whether the same result could be obtained just by chance, e.g., by flipping a weighted coin
- Example: a random classifier that predicts "positive" for 72 out of 182 cases (like Leu2) has, on average, a true positive rate of 0.40 (vs. 0.55 for Leu2)
  -> despite its low predictive accuracy, Leu2 seems to make informed predictions

Statistical significance in ROC space
- Points on the ROC curve of a classifier correspond to different numbers of positives predicted by the classifier (vary a threshold on the classifier's continuous output to obtain a curve)
  -> Natural references for a point in ROC space are other classifiers with the same number of predicted positives
- P value of a point (FP, TP) in ROC space, obtained with n predicted positives:
  $$p = P(TP^*(n) \geq TP),$$
  where $TP^*(n)$ is the true positive rate obtained by randomly selecting n examples (from the same data set)

P values for male mice, 72 predicted positives
Computation of p values by randomization:
- Generate a random prediction of n positives 10000 times
- Count the number c of times the result is at least as good as the original result
- P value = c/10000
Submission 13, Leu2: p = 518/10000 = 0.0518
(can be done analytically in this case, too)
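The randomization procedure can be sketched as follows, together with one analytic alternative. Two assumptions are made here: "at least as good" is read as "at least as many true positives", and the analytic version uses the hypergeometric tail, which the slide does not name explicitly. The seed is arbitrary, so the simulated value will not reproduce 518/10000 exactly.

```python
import math
import random


def randomization_p_value(n_cases, n_pos, n_predicted_pos, observed_tp,
                          n_rounds=10000, seed=0):
    """Share of random predictions with at least observed_tp true positives."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rounds):
        # Label n_predicted_pos random cases "positive"; by convention,
        # cases 0 .. n_pos-1 are the truly positive ones.
        chosen = rng.sample(range(n_cases), n_predicted_pos)
        true_positives = sum(1 for i in chosen if i < n_pos)
        if true_positives >= observed_tp:
            hits += 1
    return hits / n_rounds


def exact_p_value(n_cases, n_pos, n_predicted_pos, observed_tp):
    """Hypergeometric upper tail: P(at least observed_tp true positives)."""
    total = math.comb(n_cases, n_predicted_pos)
    upper = min(n_pos, n_predicted_pos)
    tail = sum(math.comb(n_pos, k)
               * math.comb(n_cases - n_pos, n_predicted_pos - k)
               for k in range(observed_tp, upper + 1))
    return tail / total


# Leu2 on male mice: 182 cases, 29 positives, 72 predicted positive, 16 hits.
p_random = randomization_p_value(182, 29, 72, 16)
p_exact = exact_p_value(182, 29, 72, 16)
print(f"randomization: {p_random:.4f}, exact: {p_exact:.4f}")
```

Both values should land close to the 0.0518 reported on the slide, with the randomized estimate varying slightly from seed to seed.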

Visual overview of p values of points in the ROC space
- Connecting similar p values for different n gives p value isolines
  - All points on the same line have the same p value
- P value isolines, like p values, depend on the size and class distribution of the data set
[Figure: p value isolines]

P value isolines and PTC submissions
Overall observations about the PTC submissions:
- 2 significant results (8, 26)
- 1 close to significant (13)
- the rest are close to or above p = 0.5!

Best and worst p values

  P value   Classifier   # predicted cases   FP rate   TP rate
                         neg      pos
  0.0025    8:Baun       156      29         0.109     0.345
  0.0055    26:Vini       97      14         0.031     0.286
  0.0518    13:Leu2      153      29         0.366     0.552
  0.2446    7:Baus       156      29         0.327     0.414
  ...
  0.8612    4:Anu1       142      25         0.324     0.240
  0.8637    19:Ple1      156      29         0.205     0.138
  0.9192    10:Gons      156      29         0.083     0.034
  0.9309    20:Ple2      156      29         0.090     0.034

What is the true significance of the best findings?
- 22 submissions -> some are likely to be good just by chance (cf. use of separate validation and test sets)
- Bonferroni adjustment for independent tests would give: adjusted p value = 22 × 0.0025 = 0.055
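The Bonferroni adjustment multiplies each p value by the number of tests and caps the result at 1. A minimal sketch, applied to the three best p values from the table with the 22 submissions stated on the slide:

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Three best male mice submissions, 22 submissions in total.
for name, p in [("8:Baun", 0.0025), ("26:Vini", 0.0055), ("13:Leu2", 0.0518)]:
    print(f"{name}: {bonferroni(p, 22):.3f}")
# 8:Baun: 0.055
# 26:Vini: 0.121
# 13:Leu2: 1.000
```

After the adjustment, even the best submission sits just above the conventional 0.05 level.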

Overall analysis (all 4 data sets, 96 submissions)
- Overall, based on the p values, had the submitted classifiers learned something?
- Does the distribution of p values differ from the uniform distribution?
- Visual analysis:
  - sort the p values and plot them in increasing order
  - compare to the diagonal
  - compare to lines produced by points from the uniform distribution
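The data behind such a plot can be prepared as sketched below: sorted p values, the diagonal they would follow under the uniform distribution, and an envelope from simulated uniform samples. The envelope quantiles (2.5% and 97.5%), the number of simulations, and the example p values are all assumptions made for illustration; the slides do not specify them, and the example values are not PTC results.

```python
import random


def uniform_envelope(m, n_sim=1000, lo=0.025, hi=0.975, seed=0):
    """Pointwise quantiles of sorted samples of m uniform random numbers."""
    rng = random.Random(seed)
    sims = [sorted(rng.random() for _ in range(m)) for _ in range(n_sim)]
    lower, upper = [], []
    for i in range(m):
        column = sorted(s[i] for s in sims)  # i-th smallest value per sample
        lower.append(column[int(lo * (n_sim - 1))])
        upper.append(column[int(hi * (n_sim - 1))])
    return lower, upper


def compare_to_uniform(p_values, **kwargs):
    """Rows of (sorted p value, diagonal, envelope low, envelope high)."""
    m = len(p_values)
    observed = sorted(p_values)
    diagonal = [(i + 1) / (m + 1) for i in range(m)]
    lower, upper = uniform_envelope(m, **kwargs)
    return list(zip(observed, diagonal, lower, upper))


# Hypothetical p values, for illustration only.
rows = compare_to_uniform([0.9, 0.1, 0.5, 0.3, 0.7])
for observed, diagonal, low, high in rows:
    print(f"{observed:.2f}  {diagonal:.2f}  [{low:.2f}, {high:.2f}]")
```

If the sorted p values stay inside the envelope around the diagonal, the submissions as a group are hard to distinguish from random guessing; a curve dipping well below the envelope at the low end would suggest that some classifiers had learned something.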

Critique of the PTC results
- The PTC predictions were poor when measured on the separate test set
  - The task is extremely difficult, if not impossible
  - ...but how many of the participants guessed that they were submitting poor predictions? Guess: hardly anyone
- How had the participants assessed their prediction accuracy?
  - Guess: by splitting the data into only two parts, not three, which gives an over-optimistic picture of the accuracy
  - Guess: without statistical testing, either

Summary
- Reliability of data mining results
  - with classification as the main example
- Choosing a suitable accuracy measure
  - skewed class distribution?
  - different costs for erroneous predictions?
- ROC
  - for visualizing results across different class and cost distributions
  - for choosing the optimal classifier for given distributions
- Use of a separate test set (cf. Redskins)
- Use of separate validation and test sets (cf. PTC)
- Assessing statistical significance
  - mainly suited to smallish data sets
  - analytically or with a permutation test