T-61.5020 Statistical Natural Language Processing
Answers 6, Collocations
Version 1.0

1. Let's start by calculating the results for the pair (valkoinen, talo) manually.

Frequency: the bigram (valkoinen, talo) occurred 710 times.

Normalized frequency: the word valkoinen occurred 3665 times and talo 10 767 times. We get

$$\frac{C(s_1, s_2)}{C(s_1)\,C(s_2)} = \frac{710}{3665 \cdot 10767} \approx 1.8 \cdot 10^{-5}.$$

All the results for the frequency method are in Table 1 and for the normalized method in Table 2. The results are quite good even for a method as simple as this.

2. For the collocation (valkoinen, talo):

$$\mathrm{Mean}(\text{valkoinen}, \text{talo}) = \frac{(-1) \cdot 710 + (-2) \cdot 2 + 1 \cdot 1 + 2 \cdot 6}{710 + 2 + 1 + 6} \approx -0.975$$

$$\mathrm{Var}(\text{valkoinen}, \text{talo}) = \frac{(-1 - (-0.975))^2 \cdot 710 + (-2 - (-0.975))^2 \cdot 2 + (1 - (-0.975))^2 \cdot 1 + (2 - (-0.975))^2 \cdot 6}{710 + 2 + 1 + 6} \approx 0.083$$

The rest of the results, sorted by variance, are in Table 3. In practice this method has found all the fixed collocations. However, the results are not as good with sparse data: vihainen mielenosoittaja is definitely not a collocation. The size of the window surely affects the results. If it is too large, pairs start to occur together at random too often; if it is too small, collocations that span a longer distance are not found. If the second word of the collocation can occur either before or after the first one, the method will clearly not work at all.
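The mean and variance computation of exercise 2 can be sketched in a few lines of Python. This is only an illustration: the function name `offset_stats` is hypothetical, and the count at offset -1 is assumed to equal the bigram count of the pair (710, the value implied by Table 5 as 10767 - 10057). The other counts are the ones visible in the formula above.

```python
def offset_stats(offset_counts):
    """Return (mean, variance) of signed word offsets, weighted by counts.

    The variance divides by the total count, as in the worked formula.
    """
    total = sum(offset_counts.values())
    mean = sum(d * c for d, c in offset_counts.items()) / total
    variance = sum(c * (d - mean) ** 2 for d, c in offset_counts.items()) / total
    return mean, variance


# Offset histogram for (valkoinen, talo).
# ASSUMPTION: the count at offset -1 is the bigram count 710.
counts = {-1: 710, -2: 2, 1: 1, 2: 6}
mean, var = offset_stats(counts)
print(f"mean={mean:.3f} variance={var:.3f}")
```

A small variance means the two words almost always appear at the same relative distance, which is what the method treats as evidence for a fixed collocation.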

Table 1: Results for the frequency method

s1           s2                 C(s1,s2)
ja           olla               7329
venäjä       presidentti        717
valkoinen    talo               710
kova         tuuli              279
aste         pakkanen           160
tuntematon   sotilas            154
sekä         myös               138
liukas       keli               106
hakea        työ                31
oppia        lukea              21
ottaa        onki               9
vihainen     mielenosoittaja    7
olla         ula                5
heittää      veivi              3
herne        nenä               3

Table 2: Results for the normalized frequency method

s1           s2                 Normalized frequency (10^-7)
liukas       keli               1981
aste         pakkanen           386
heittää      veivi              293
herne        nenä               268
valkoinen    talo               180
tuntematon   sotilas            163
vihainen     mielenosoittaja    68
kova         tuuli              35
ottaa        onki               21
venäjä       presidentti        10
oppia        lukea              8
hakea        työ                1
olla         ula                0
sekä         myös               0
ja           olla               0
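The valkoinen talo rows of Tables 1 and 2 can be checked with a short Python sketch. The function name `normalized_frequency` is made up for this example, and the pair count 710 is the value implied by Table 5 (10767 - 10057), not a number read directly from the corpus.

```python
def normalized_frequency(pair_count, count_first, count_second):
    """Bigram count divided by the product of the two word counts."""
    return pair_count / (count_first * count_second)


# Counts for (valkoinen, talo); 710 is assumed from Table 5.
value = normalized_frequency(710, 3665, 10767)
print(f"{value:.2e}")  # normalized frequency, about 1.8e-05
print(round(value * 1e7))  # same value in units of 10^-7
```

Dividing by the word counts is what pushes very frequent pairs such as (ja, olla) from the top of Table 1 to the bottom of Table 2.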

Table 3: Results sorted by the smallest variance

s1           s2                 Mean      Variance
herne        nenä               -1.000    0.000
vihainen     mielenosoittaja    -1.000    0.000
tuntematon   sotilas            -1.025    0.025
valkoinen    talo               -0.975    0.083
ottaa        onki               -1.250    0.188
venäjä       presidentti        -1.128    0.472
kova         tuuli              -0.880    0.492
liukas       keli               -0.788    0.608
oppia        lukea              -0.606    1.087
heittää      veivi              -0.500    1.250
aste         pakkanen           -0.465    1.347
hakea        työ                -0.433    2.046
olla         ula                -0.250    2.438
sekä         myös               0.252     2.981
ja           olla               -0.083    3.635

3. In a statistical test, one starts by forming a null hypothesis. In our case, we assume that the words in a pair are independent: P(s1, s2) = P(s1)P(s2). The test gives a level of confidence for the null hypothesis. The significance level, below which the null hypothesis is rejected, is usually at most 0.05.

In the t-test we assume that the probabilities are normally distributed, and check whether the expectation value of the observed data differs from the expectation value given by the null hypothesis. The t-value is given by

$$t = \frac{\hat{x} - \mu}{\sqrt{s^2 / N}},$$

where $\hat{x}$ is the sample mean, $s^2$ is the sample variance, $N$ is the number of samples and $\mu$ is the mean of the distribution. In our case

$$\mu = P(s_1)P(s_2) = \frac{C(s_1)}{N} \cdot \frac{C(s_2)}{N}$$

$$\hat{x} = \frac{C(s_1, s_2)}{N} = \hat{p}$$

$$s^2 = p(1 - p) = \hat{p}(1 - \hat{p}) \approx \hat{p}$$

For the pair valkoinen talo we get

$$t = \frac{\dfrac{710}{28181344} - \dfrac{3665 \cdot 10767}{28181344^2}}{\sqrt{\dfrac{710}{28181344^2}}} \approx 27.$$
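The t-value for valkoinen talo can be reproduced with the following Python sketch. It follows the formulas above, including the approximation of the sample variance by the observed pair probability; the function name `t_score` is hypothetical and the pair count 710 is assumed from Table 5 (10767 - 10057).

```python
import math


def t_score(pair_count, count_first, count_second, total):
    """t-value of an observed bigram against the independence hypothesis."""
    x_hat = pair_count / total                           # observed pair probability
    mu = (count_first / total) * (count_second / total)  # expected under independence
    s2 = x_hat                                           # approximation p(1 - p) ~ p
    return (x_hat - mu) / math.sqrt(s2 / total)


# Counts for (valkoinen, talo); 710 is assumed from Table 5.
t = t_score(710, 3665, 10767, 28181344)
print(round(t, 1))
```

The result is a little under 27, in line with the rounded value given for this pair.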

If the t-value is over 6.314, the probability that the sample came from the distribution given by the independence assumption is less than 5%. Consequently, we can mark valkoinen talo as a collocation. Table 4 has the values for all of the candidates. Note that the last pairs get negative values. This is because they occur together more rarely than the null hypothesis predicts.

Table 4: Results for the t-test

s1           s2                 t
valkoinen    talo               27
venäjä       presidentti        26
kova         tuuli              17
aste         pakkanen           13
tuntematon   sotilas            12
liukas       keli               10
oppia        lukea              4
hakea        työ                4
ottaa        onki               3
vihainen     mielenosoittaja    3
heittää      veivi              2
herne        nenä               2
olla         ula                0
sekä         myös               -9
ja           olla               -385

The χ² test is based on a simple assumption: we look at the separate probabilities and estimate how many times the words should occur together. This is compared to the observed co-occurrence count, and if they differ too much, the pair is likely to be a collocation. Let's start by collecting the quantities in Table 5. These values can be used in the two-variable χ² test:

$$\chi^2 = \frac{N (O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11} + O_{12})(O_{11} + O_{21})(O_{12} + O_{22})(O_{21} + O_{22})}$$

Substituting the numbers:

$$\chi^2 = \frac{28181344 \, (710 \cdot 28167622 - 10057 \cdot 2955)^2}{(710 + 10057)(710 + 2955)(10057 + 28167622)(2955 + 28167622)} \approx 358771$$

If the χ² value is over 3.841, the sample is drawn from an independent distribution with less than 5% probability. Valkoinen talo seems to be a collocation.
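The χ² value for valkoinen talo can be checked with a short Python sketch of the 2x2 formula above. The function name `chi_square_2x2` is made up for this example; the four cell counts are the ones in Table 5, with 710 as the bigram count implied there.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Chi-square statistic for a 2x2 contingency table of counts."""
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


# Cells: (valkoinen talo), (other talo), (valkoinen other), (other other).
chi2 = chi_square_2x2(710, 10057, 2955, 28167622)
print(round(chi2))
print(chi2 > 3.841)  # above the 5% critical value for one degree of freedom
```

Because the difference is squared, the statistic is large both for pairs that co-occur more often than expected and for pairs that co-occur less often, so the sign of the association has to be checked separately.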

Table 5: Quantities needed in the χ² test

             w1 = valkoinen        w1 ≠ valkoinen
w2 = talo    710                   10767 - 710 = 10057
             (valkoinen talo)      (punainen talo)
w2 ≠ talo    3665 - 710 = 2955     28181344 - 710 - 10057 - 2955 = 28167622
             (valkoinen mopo)      (punainen mopo)

However, if we look at Table 6, we notice that almost every pair would be a collocation according to this test. The reason is that the χ² test does not test whether a pair is a collocation, but whether its words are independent. For example, the pair (ja, olla) has a high correlation, but it is actually a negative one: the words occur together less often than they should according to the independence assumption.

4. Mutual information tells how much more information X gives for determining Y. If X and Y are independent, the mutual information is zero. For (valkoinen, talo),

$$I(X, Y) = \log_2 \frac{P(X, Y)}{P(X)P(Y)} = \log_2 \frac{\dfrac{710}{28181344}}{\dfrac{3665}{28181344} \cdot \dfrac{10767}{28181344}} \approx 9.0$$

The rest of the results are in Table 7.

The results seem to be good. The course book criticises this method for favouring less frequent words. The reason is the way the probabilities are estimated, i.e. maximum likelihood estimation. Better results can be obtained if we instead set a prior under which the words are independent, and let the data modify it.

In conclusion, we could say the following. The heuristic methods (ex. 1 and 2) are easy to apply and still give fair results. The statistical models applied in exercises 3 and 4 are justifiable as such, but they measure the correlation of the words, not whether they are collocations. If this is kept in mind, the results can be good. The statistical tests (ex. 3) may be harder to piece together, and the assumptions behind the methods must be kept in mind. In methods based directly on probability calculations (ex. 4), these assumptions are usually brought out more explicitly. In all methods based on probability estimates from data, one must choose how to approximate the probabilities. Maximum likelihood estimates can be too susceptible to sparse data. A better choice could be maximum a posteriori (MAP) estimation with a prior belief that the words are independent.

Table 6: Results for the χ² test

s1           s2                 χ²
liukas       keli               591591
valkoinen    talo               358771
aste         pakkanen           173726
tuntematon   sotilas            70409
ja           olla               29194
kova         tuuli              26644
venäjä       presidentti        18147
heittää      veivi              4120
herne        nenä               2258
vihainen     mielenosoittaja    1321
ottaa        onki               525
oppia        lukea              449
hakea        työ                47
sekä         myös               45
olla         ula                0

Table 7: Results for the mutual information method

s1           s2                 MI
liukas       keli               12.4
aste         pakkanen           10.1
heittää      veivi              9.7
herne        nenä               9.6
valkoinen    talo               9.0
tuntematon   sotilas            8.8
vihainen     mielenosoittaja    7.6
kova         tuuli              6.6
ottaa        onki               5.9
venäjä       presidentti        4.8
oppia        lukea              4.5
hakea        työ                1.7
olla         ula                0.5
sekä         myös               -0.8
ja           olla               -2.5
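The valkoinen talo entry of Table 7 can be reproduced with a minimal Python sketch of the mutual information formula from exercise 4, using maximum likelihood estimates for the probabilities. The function name `pointwise_mi` is hypothetical, and the pair count 710 is assumed from Table 5 (10767 - 10057).

```python
import math


def pointwise_mi(pair_count, count_first, count_second, total):
    """log2 of P(x, y) / (P(x) P(y)) with maximum likelihood estimates."""
    p_xy = pair_count / total
    p_x = count_first / total
    p_y = count_second / total
    return math.log2(p_xy / (p_x * p_y))


# Counts for (valkoinen, talo); 710 is assumed from Table 5.
mi = pointwise_mi(710, 3665, 10767, 28181344)
print(round(mi, 1))
```

With these estimates a pair seen only a handful of times can still score highly if its words are rare, which is the sparse-data sensitivity discussed in the conclusions.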