Data Mining - The Diary

Samankaltaiset tiedostot
Efficiency change over time

Capacity Utilization

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

Bounds on non-surjective cellular automata

Information on preparing Presentation

Information on Finnish Language Courses Spring Semester 2018 Päivi Paukku & Jenni Laine Centre for Language and Communication Studies

anna minun kertoa let me tell you

16. Allocation Models

Choose Finland-Helsinki Valitse Finland-Helsinki

LYTH-CONS CONSISTENCY TRANSMITTER

Information on Finnish Language Courses Spring Semester 2017 Jenni Laine

Curriculum. Gym card

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

The CCR Model and Production Correspondence

Uusi Ajatus Löytyy Luonnosta 4 (käsikirja) (Finnish Edition)

Other approaches to restrict multipliers

On instrument costs in decentralized macroeconomic decision making (Helsingin Kauppakorkeakoulun julkaisuja ; D-31)

1. SIT. The handler and dog stop with the dog sitting at heel. When the dog is sitting, the handler cues the dog to heel forward.

Alternative DEA Models

MEETING PEOPLE COMMUNICATIVE QUESTIONS

1. Liikkuvat määreet

1.3Lohkorakenne muodostetaan käyttämällä a) puolipistettä b) aaltosulkeita c) BEGIN ja END lausekkeita d) sisennystä

Kysymys 5 Compared to the workload, the number of credits awarded was (1 credits equals 27 working hours): (4)

The Viking Battle - Part Version: Finnish

Information on Finnish Courses Autumn Semester 2017 Jenni Laine & Päivi Paukku Centre for Language and Communication Studies

Results on the new polydrug use questions in the Finnish TDI data

AYYE 9/ HOUSING POLICY

ECVETin soveltuvuus suomalaisiin tutkinnon perusteisiin. Case:Yrittäjyyskurssi matkailualan opiskelijoille englantilaisen opettajan toteuttamana

Voice Over LTE (VoLTE) By Miikka Poikselkä;Harri Holma;Jukka Hongisto

Network to Get Work. Tehtäviä opiskelijoille Assignments for students.

FinFamily PostgreSQL installation ( ) FinFamily PostgreSQL

Data quality points. ICAR, Berlin,

1.3 Lohkorakenne muodostetaan käyttämällä a) puolipistettä b) aaltosulkeita c) BEGIN ja END lausekkeita d) sisennystä

Kvanttilaskenta - 2. tehtävät

Sisällysluettelo Table of contents

HARJOITUS- PAKETTI A

Returns to Scale II. S ysteemianalyysin. Laboratorio. Esitelmä 8 Timo Salminen. Teknillinen korkeakoulu

toukokuu 2011: Lukion kokeiden kehittämistyöryhmien suunnittelukokous

Constructive Alignment in Specialisation Studies in Industrial Pharmacy in Finland

Uusia kokeellisia töitä opiskelijoiden tutkimustaitojen kehittämiseen

VUOSI 2015 / YEAR 2015

Gap-filling methods for CH 4 data

Small Number Counts to 100. Story transcript: English and Blackfoot

FIS IMATRAN KYLPYLÄHIIHDOT Team captains meeting

National Building Code of Finland, Part D1, Building Water Supply and Sewerage Systems, Regulations and guidelines 2007

Guidebook for Multicultural TUT Users

make and make and make ThinkMath 2017

Mat Seminar on Optimization. Data Envelopment Analysis. Economies of Scope S ysteemianalyysin. Laboratorio. Teknillinen korkeakoulu

OP1. PreDP StudyPlan

E U R O O P P A L A I N E N

Käyttöliittymät II. Käyttöliittymät I Kertaus peruskurssilta. Keskeisin kälikurssilla opittu asia?

Kvanttilaskenta - 1. tehtävät

HOITAJAN ROOLI TEKNOLOGIAVÄLITTEISESSÄ POTILASOHJAUKSESSA VÄITÖSKIRJATUTKIJA JENNI HUHTASALO

Salasanan vaihto uuteen / How to change password

Matkustaminen Majoittuminen

Siirtymä maisteriohjelmiin tekniikan korkeakoulujen välillä Transfer to MSc programmes between engineering schools

Tietorakenteet ja algoritmit

Matkustaminen Majoittuminen

ESITTELY. Valitse oppilas jonka haluaisit esitellä luokallesi ja täytä alla oleva kysely. Age Grade Getting to school. School day.

3 9-VUOTIAIDEN LASTEN SUORIUTUMINEN BOSTONIN NIMENTÄTESTISTÄ

21~--~--~r--1~~--~--~~r--1~

Tutorial on Assignment 3 in Data Mining 2008 Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory

C++11 seminaari, kevät Johannes Koskinen

Uusi Ajatus Löytyy Luonnosta 3 (Finnish Edition)

EUROOPAN PARLAMENTTI

Sisustusarkkitehtuuri Kansavälinen Työpaja kauppankulttuuri ja ostoskeskuksen tilasuunnittelu Istanbulin Tekniillinen yliopisto Istanbul, Turkki

Tutkimusdata ja julkaiseminen Suomen Akatemian ja EU:n H2020 projekteissa

Rotarypiiri 1420 Piiriapurahoista myönnettävät stipendit

Ajettavat luokat: SM: S1 (25 aika-ajon nopeinta)

Characterization of clay using x-ray and neutron scattering at the University of Helsinki and ILL

Opiskelijat valtaan! TOPIC MASTER menetelmä lukion englannin opetuksessa. Tuija Kae, englannin kielen lehtori Sotungin lukio ja etälukio

7.4 Variability management

Operatioanalyysi 2011, Harjoitus 4, viikko 40

Travel Getting Around

Miten koulut voivat? Peruskoulujen eriytyminen ja tuki Helsingin metropolialueella

Toppila/Kivistö Vastaa kaikkin neljään tehtävään, jotka kukin arvostellaan asteikolla 0-6 pistettä.

Miksi Suomi on Suomi (Finnish Edition)

Infrastruktuurin asemoituminen kansalliseen ja kansainväliseen kenttään Outi Ala-Honkola Tiedeasiantuntija

Ohjelmistoarkkitehtuurit Kevät 2016 Johdantoa

Methods S1. Sequences relevant to the constructed strains, Related to Figures 1-6.

Expression of interest

Genome 373: Genomic Informatics. Professors Elhanan Borenstein and Jay Shendure

DS-tunnusten haku Outi Jäppinen CIMO

Vastuuvelan markkina-arvon määrittämisestä *

Exercise 1. (session: )

Vertaispalaute. Vertaispalaute, /9

You can check above like this: Start->Control Panel->Programs->find if Microsoft Lync or Microsoft Lync Attendeed is listed

Operatioanalyysi 2011, Harjoitus 2, viikko 38

Suunnittelumallit (design patterns)

NAO- ja ENO-osaamisohjelmien loppuunsaattaminen ajatuksia ja visioita

S Sähkön jakelu ja markkinat S Electricity Distribution and Markets

Karkaavatko ylläpitokustannukset miten kustannukset ja tuotot johdetaan hallitusti?

812336A C++ -kielen perusteet,

Digitalisoituminen, verkottuminen ja koulutuksen tulevaisuus. Teemu Leinonen Medialaboratorio Taideteollinen korkeakoulu

Telecommunication Software

Roolipeliharjoitus. - Opiskelijoiden suunni=elemat neuvo=eluvideot ja niiden vertaisarvioinnit

T Statistical Natural Language Processing Answers 6 Collocations Version 1.0

Strict singularity of a Volterra-type integral operator on H p

KMTK lentoestetyöpaja - Osa 2

Metsälamminkankaan tuulivoimapuiston osayleiskaava

Transkriptio:

Data Mining - The Diary Rodion rodde Efremov March 31, 2015 Introduction This document is my learning diary written on behalf of Data Mining course led at spring term 2015 at University of Helsinki. 1 Week 1 The support count σ(x) of an item set X is the amount of transactions containing X (X t i ). Basically, we were computing support counts for various itemsets with the exception of applying additional constraints to the queries (such as particular grade range). The support of an item set X is σ(x)/n, where N is the amount of all transactions. Support of X may be thought of as a classical probability of a random transaction containing X. An association rule is an implication of the form X Y, where X and Y are itemsets having no items in common. The interpretation of an association rule is that if a transaction contains X, it tends to contain Y as well. Note that tends depends on parameters we specify to a data mining system. Support of an association rule X Y is s(x Y ) σ(x Y ). N Support of the rule R may be thought of as a classical probability of R appearing in a random transaction. Rule confidence gives the probability of Y appearing in the same transactions with set X and is defined as 1.1 Reflection c(x Y ) σ(x Y ). σ(x) Getting the data from a file to internal representation was pretty challenging: the data seems a little bit dirty and I am sure there is room for improvement. What comes to accessing data, I have made an effort to make sure that it runs fast. Basically I have three model classes: Course holds the course name, the course code, grading mode and the amount of credits awarded, 1

Student holds only a unique student ID and enrollment year, CourseAttendanceEntry holds a course C, a student S, the year and month S attended C, and the grade S received. Basically, these entries implement a many-to-many relationship between courses and students. 2 Week 2 Task 5 The supports are as follow: E 0.684 O 0.632 P 0.526 W 0.158 EO 0.474 EP 0.316 EW 0.053 OP 0.263 OW 0.053 PW 0.105 EOP 0.221 EOW 0.053 EPW 0 OPW 0 EOPW 0 The only observation that I was able to come up with is that if s(x) is support of an itemset X, then s(x) min A X s(a). Task 10 We have around 23 million (N) different paperback books and we want to generate all 10-combinations of those. Suppose we are given an index tuple t (t 1, t 2,..., t 10 ) (1, 2,..., 10). Next generate a combination of books indexed by t and increment t 10. When t 10 N + 1, increment t 9 and set t 10 t 9 + 1. After t 9 N 1 (and thus t 10 N) has been generated, increase t 8 and set t 9 t 8 + 1, t 10 t 8 + 2. Continue this routine until t 1 N 9, t 2 N 8,..., t 9 N 1, t 10 N. Task 15 In this task we are supposed to measure time of generating k-combinations of courses for k {2, 3, 5}. The results are summarized in the following table: 2

k t 2 4 ms 3 40 ms 5 291 ms Increasing k from 2 to 3 increases the running time by a factor of 10; increasing k from 3 to 5 increases the running time by a factor of 7,3. Since n 213, and ( )( n n 3 2 ( )( n n 5 3 ) 1 ) 1 n!2!(n 2)! n!3!(n 3)!) (n 2)! 3(n 3)! n 2 3 70, n!3!(n 3)! n!5!(n 5)!) (n 3)! 20(n 5)! (n 4)(n 3) 20 2100, which does not quite go hand in hand with the measurements. Task 19 The objective of this task is to compare brute-force and Apriori algorithms for frequent itemset generation. k support Brute-force (ms) Apriori (ms) 2 0.3 379 154 3 0.175 9389 774 4 0.1 N/A 1845 5 0.1 N/A 1637 After Arto s counsel, I was able to speedup generation of 3-combinations by a factor of 20, but I was not able to make 4-combination generation feasible. Task 21 The largest size of itemsets with support at least 0.05 seems to be 11. I got 19 of such itemsets; one of them is TVT-ajokortti Ohjelmoinnin perusteet 3

Opiskelutekniikka Tietokantojen perusteet Ohjelmoinnin jatkokurssi Tietoliikenteen perusteet Tietorakenteet ja algoritmit Johdatus tietojenkäsittelytieteeseen Tietokone työvälineenä Ohjelmistotekniikan menetelmät Aineopintojen harjoitustyö: Tietokantasovellus 3 Week 3 Task 10 Given a set of events E {e 1, e 2,..., e d }, a sequence s over E is S 1, S 2,..., S n, where S i I for all i. The sequence t t 1,..., t k is said to be a subsequence of s if there exist integers 1 j 1 < j 2 < < j k n such that t i S ji for all i 1, 2,..., k. Task 11 We are given events A, B and C. All possible 1-sequences are: 1. {A} 2. {B} 3. {C} All possible 2-sequences are: 1. {A, B} 2. {A, C} 3. {B, C} 4. {A} {A} 5. {A} {B} 6. {A} {C} 7. {B} {A} 8. {B} {B} 9. {B} {C} 4

10. {C} {A} 11. {C} {B} 12. {C} {C} Above we have d 3, which produces 12 2-sequences. For general d > 1, there would be ( d 2) + d 2 2-sequences. Task 13 We are given the following sequences: {B} {C} {H} {B, P } {C} {C} {H} {P } {P } {C, H} {T } {B} {C} {T } {B, P } {T } {P } {C} where B is for bathroom, C is for computer, H is for homework, P is for phone and T is for TV. Now the possible 4-candidates are: {B} {C} {H} {P } {T } {B} {C} {H} {B, P } {C, H} {T } {B, P } {C} {T } {P } {C, H} {P } {C, H} {P } Task 15 The supports for maxspan of 1 is as follows: Sequence Support {courses} {courses} 0.2 {courses} {dm} 0.6 {dm} {courses} 0.2 {index} {courses} 0.0 {teaching} {dm} 0.0 The maxspan is a pruning parameter: let t 1 be the moment at which the very first event begins and t 2 the moment at which the very last event ends, then the sequence is pruned if t 2 t 1 > maxspan. 5

Task 16 The top five 2-sequences are: Sequence Support Lineaarialgebra ja matriisilaskenta I, Lineaarialgebra ja matriisilaskenta II 0.326 Lineaarialgebra ja matriisilaskenta I, Analyysi I 0.317 Turvallinen työskentely laboratoriassa, Yleinen kemia I 0.259 Analyysi I, Analyysi II 0.257 Yleinen kemia I, Yleinen kemia II 0.236 Task 18 The top 5 8-sequences are: 7. Atomien ja molekyylien rakenne 8. Kemian tietolähteet with support 0.0211, 6. Matematiikkaa kemisteille 7. Atomien ja molekyylien rakenne 8. Kemian tietolähteet with support 0.0204, 6

7. Orgaanisten yhdisteiden rakenteiden selvittäminen 8. Integroidut TVT-opinnot with support 0.0204, 7. Orgaanisten yhdisteiden rakenteiden selvittäminen 8. Kemian tietolähteet with support 0.0204, 7. Matematiikkaa kemisteille 8. Atomien ja molekyylien rakenne with support 0.0204. It is obvious that the above five sequences are very alike. Actually the four last sequences have exactly the same support. Task 19 What comes to the results in Task 18, doing the same with maxspan produces a result with less variation. This can be explained that those students that fit in maxspan of 36 months, tend to perform the same course permutation. On behalf of Task 17, applying the maxspan of 36 months produces the same sequences (with slightly smaller supports each). This can be explained by assuming that 36 months is enough for any student in the data to score 5 courses. 7