MS-A0504 First course in probability and statistics
Week 6: Statistical dependence and linear regression
Heikki Seppälä, Department of Mathematics and Systems Analysis, School of Science, Aalto University, Spring 2016
Contents Description of data set of two variables Least squares method Linear regression
Describing a data set of two variables
Collected data: n observed units, p variables. We choose two variables for analysis, which means that we analyse a data set (x, y) consisting of the pairs (x_1, y_1), ..., (x_n, y_n).
Example. Evaluation of the course
Can we predict exam points from exercise points?

id | exam (y) | report | exercises (x) | grade
 1 |     0    |    0   |       0       |   0
 2 |    17    |    5   |      20       |   5
 3 |    15    |    5   |       0       |   3
 4 |    12    |    6   |      16       |   4
 5 |    19    |    5   |      20       |   5
 6 |    21    |    6   |      17       |   5
 7 |     0    |    0   |       3       |   0
 8 |    13    |    6   |       9       |   4
 9 |    19    |    6   |      12       |   5
10 |     0    |    0   |       0       |   0
11 |    15    |    5   |      19       |   5
12 |    12    |    6   |       0       |   3
13 |    13    |    5   |      17       |   4

Input (explanatory) variable: x = (0, 20, 0, 16, 20, 17, 3, 9, 12, 0, 19, 0, 17)
Output (dependent) variable: y = (0, 17, 15, 12, 19, 21, 0, 13, 19, 0, 15, 12, 13)
Scatter plot
Data points: (x_1, y_1), ..., (x_n, y_n)
Sample covariance
The sample covariance of data vectors x and y is defined by

s(x, y) = 1/(n-1) Σ_{i=1}^{n} (x_i - m(x))(y_i - m(y)),

where m(x) and m(y) are the sample means of the data vectors.
Remark:
s(x, x) = s²(x) is the sample variance of x
s(y, y) = s²(y) is the sample variance of y
√s(x, x) = s(x) is the sample standard deviation of x
√s(y, y) = s(y) is the sample standard deviation of y
Example. Evaluation of the course
(Data as in the table above.)
The sample covariance: s(x, y) = cov(x,y) = 43.67.
We need to normalise this to be able to interpret it.
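As a quick check, the definition above can be sketched in Python (the course itself uses R's cov(x, y)); the data vectors come straight from the table:

```python
# Sample covariance s(x, y) = 1/(n-1) * sum (x_i - m(x)) * (y_i - m(y)),
# computed for the course-evaluation data: exercise points x, exam points y.
x = [0, 20, 0, 16, 20, 17, 3, 9, 12, 0, 19, 0, 17]
y = [0, 17, 15, 12, 19, 21, 0, 13, 19, 0, 15, 12, 13]

def sample_covariance(x, y):
    n = len(x)
    mx = sum(x) / n        # sample mean m(x)
    my = sum(y) / n        # sample mean m(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

print(round(sample_covariance(x, y), 2))  # 43.67, as on the slide
```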
Sample correlation
Pearson's sample correlation of data vectors x and y is defined by

r(x, y) = s(x, y) / (s(x) s(y)) ∈ [-1, +1]

[Portrait: Karl Pearson FRS, 1857-1936]
Pearson's correlation measures linear dependence:
If r(x, y) > 0, then x and y are positively correlated
If r(x, y) = 0, then x and y are uncorrelated
If r(x, y) < 0, then x and y are negatively correlated
Example. Evaluation of the course
(Data as in the table above.)
Pearson's sample correlation: r(x, y) = cor(x,y) = 0.694.
Exercise points and exam points appear to be positively correlated. Or is this caused by random variation?
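The normalisation step can be sketched in Python (the slides use R's cor(x, y)); it recomputes s(x, y), s(x) and s(y) from the data and divides:

```python
import math

# Pearson's sample correlation r(x, y) = s(x, y) / (s(x) s(y)) for the
# course-evaluation data: exercise points x, exam points y.
x = [0, 20, 0, 16, 20, 17, 3, 9, 12, 0, 19, 0, 17]
y = [0, 17, 15, 12, 19, 21, 0, 13, 19, 0, 15, 12, 13]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)   # covariance
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))          # std. dev of x
sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))          # std. dev of y

r = sxy / (sx * sy)
print(round(r, 3))  # 0.694, as on the slide
```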
Testing for correlations
Null hypothesis (stochastic model): the observed pairs (x_i, y_i) are realizations of independent random vectors (X_i, Y_i) ~ N_2(µ_X, µ_Y, σ²_X, σ²_Y, ρ_XY).

H_0: ρ_XY = 0 vs. H_1: ρ_XY ≠ 0

[Portrait: William S. Gosset (a.k.a. "Student"), 1876-1937]
If the initial hypothesis and the null hypothesis hold, the test statistic

t(X, Y) = r(X, Y) √(n - 2) / √(1 - r(X, Y)²)

is t-distributed with n - 2 degrees of freedom. If the absolute value of the test statistic is large, then it is unlikely that the null hypothesis is true.
Example. Evaluation of the course
(Data as in the table above.)
Is the joint distribution of exam and exercise points a bivariate normal distribution? No: both variables are discrete, and usually neither of them is symmetric. In this case we cannot test the correlation using the aforementioned test. However, there are non-parametric tests which can be used in this setting as well (course MS-C2104 Introduction to Statistical Inference).
Example. Heights of fathers and sons
[Scatter plot: father's height (x axis, 150-190 cm) vs. son's height (y axis, 150-200 cm)]
Are the heights of fathers and sons from a bivariate normal distribution?
Example. Heights of fathers and sons
[Plot of sons' heights against fathers' heights]
Example. Heights of fathers and sons
[Histograms of the fathers' and the sons' heights (density vs. height, 140-200 cm)]
Example. Heights of fathers and sons
Are the heights of fathers and sons from a bivariate normal distribution? Yes - or at least the bivariate normal distribution provides an accurate enough approximation, so we can test the correlation using the test above.
Sample correlation: cor(x,y) = 0.498
Test statistic calculated from the data: t(x, y) = 18.85
p-value: Pr(|t(X, Y)| ≥ 18.85) = 2*(1-pt(18.85, 1076)) ≈ 0
Since the p-value is less than 0.01, the null hypothesis (ρ_XY = 0) is rejected at the 1 % significance level. Conclusion: the heights of fathers and sons are linearly dependent.
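The computation can be reproduced from the summary numbers alone. Below is a Python sketch (the slide uses R's pt); since the t-distribution with 1076 degrees of freedom is practically standard normal, the two-sided p-value is approximated with the normal tail via math.erfc:

```python
import math

r, n = 0.498, 1078  # sample correlation and sample size; df = n - 2 = 1076

# Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 2))  # 18.84, matching the slide's 18.85 up to rounding

# Two-sided p-value, normal-tail approximation of the t(1076) distribution:
# Pr(|T| >= t) = erfc(t / sqrt(2)) for T ~ N(0, 1).
p = math.erfc(t / math.sqrt(2))
print(p < 0.01)     # True: reject H0 at the 1 % significance level
```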
Contents Description of data set of two variables Least squares method Linear regression
Example. Evaluation of the course
(Data as in the table above.)
Pearson's sample correlation: r(x, y) = 0.694.
The linear dependence between the variables is somewhat strong. What is the best line for illustrating this linear dependence?
Scatter plot
Data points: (x_1, y_1), ..., (x_n, y_n)
Fitting the line
Fitted values: ŷ_i = β_0 + β_1 x_i
Residuals
Residuals: e_i = y_i - ŷ_i
Minimization of residuals
How do we choose the optimal slope β_1 and intercept β_0?
Minimization of residuals
The sum of squared residuals of the line ŷ = β_0 + β_1 x is

SSE(β_0, β_1) = Σ_{i=1}^{n} (y_i - ŷ_i)² = Σ_{i=1}^{n} (y_i - β_0 - β_1 x_i)²

Least squares method: find (β_0, β_1) such that the sum of squared residuals is minimized.
Solution: differentiate SSE(β_0, β_1) with respect to β_0 and β_1, set both derivatives to zero, and solve the resulting equations.
Answer: (β_0, β_1) = (b_0, b_1), where

b_1 = r(x, y) s(y)/s(x),   b_0 = m(y) - b_1 m(x).
Example. Evaluation of the course
(Data as in the table above.)
Sample means: m(x) = 10.2, m(y) = 12.0
Sample standard deviations: s(x) = 8.51, s(y) = 7.39
Pearson's sample correlation: r(x, y) = 0.694
b_1 = r(x, y) s(y)/s(x) = 0.60
b_0 = m(y) - b_1 m(x) = 5.82
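A minimal Python sketch of the least squares formulas applied to the course data (the slides do this in R); it uses the equivalent form b_1 = s(x, y)/s²(x):

```python
# Least squares fit: b1 = r(x,y) * s(y)/s(x) = s(x,y)/s^2(x),
# b0 = m(y) - b1 * m(x), for the course-evaluation data.
x = [0, 20, 0, 16, 20, 17, 3, 9, 12, 0, 19, 0, 17]
y = [0, 17, 15, 12, 19, 21, 0, 13, 19, 0, 15, 12, 13]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxx = sum((a - mx) ** 2 for a in x)                      # (n-1) * s^2(x)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))     # (n-1) * s(x, y)

b1 = sxy / sxx        # slope
b0 = my - b1 * mx     # intercept
print(round(b1, 2), round(b0, 2))  # 0.6 5.83 (slide: 0.60 and 5.82; rounding)
```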
Example: Heights of fathers and sons
[Scatter plot: father's height vs. son's height]
Sample means: m(x) = 171.92, m(y) = 174.46
Sample standard deviations: s(x) = 6.98, s(y) = 7.14
Pearson's sample correlation: r(x, y) = 0.498
b_1 = r(x, y) s(y)/s(x) = 0.514
b_0 = m(y) - b_1 m(x) = 86.83
Example: Heights of fathers and sons
[Scatter plot: father's height vs. son's height, with the fitted line]
Contents Description of data set of two variables Least squares method Linear regression
Prediction interval of the fitted line
If we fit a line to a data set of two variables using the least squares method, how accurately does this line predict the values of the response variable? How likely is it that a fitted value is close to the observed value? We need a stochastic model of the statistical experiment.
Linear regression model
Suppose that the response variable Y depends on the input x as follows:

Y = β_0 + β_1 x + ε, where ε ~ N(0, σ²).

We take n independent measurements with input values x_1, ..., x_n and obtain the values

Y_k = β_0 + β_1 x_k + ε_k,   k = 1, ..., n,

where the random residuals ε_1, ..., ε_n of the stochastic model are independent and N(0, σ²)-distributed. There are 3 unknown parameters: (β_0, β_1, σ²).
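To make the model concrete, here is a small simulation sketch in Python with made-up parameter values (β_0 = 2.0, β_1 = 0.5, σ = 1.0 are assumptions for illustration only): data generated from the model should yield least squares estimates close to the true parameters.

```python
import random

random.seed(1)
beta0, beta1, sigma = 2.0, 0.5, 1.0   # made-up "true" parameters

# n independent measurements Y_k = beta0 + beta1 * x_k + eps_k,
# with random residuals eps_k ~ N(0, sigma^2).
xs = [k / 10 for k in range(200)]     # inputs 0.0, 0.1, ..., 19.9
ys = [beta0 + beta1 * xk + random.gauss(0, sigma) for xk in xs]

# Least squares estimates from the simulated data.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
      / sum((a - mx) ** 2 for a in xs))
b0 = my - b1 * mx
# b1 and b0 should be close to the true values 0.5 and 2.0.
```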
Estimation of the parameters of the linear regression model
The best estimators of the parameters β_0, β_1 in the sense of expected squared residuals are the least squares estimators

b_1 = r(x, y) s(y)/s(x),   b_0 = m(y) - b_1 m(x).

An unbiased estimator of the unknown variance parameter σ² is

S² = 1/(n-2) Σ_{j=1}^{n} (y_j - ŷ_j)² = 1/(n-2) Σ_{j=1}^{n} (y_j - b_0 - b_1 x_j)².
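Continuing the Python sketch on the course-evaluation data, the unbiased variance estimate S² is the sum of squared residuals divided by n - 2:

```python
# Unbiased variance estimate S^2 = 1/(n-2) * sum (y_j - b0 - b1*x_j)^2
# for the course-evaluation data: exercise points x, exam points y.
x = [0, 20, 0, 16, 20, 17, 3, 9, 12, 0, 19, 0, 17]
y = [0, 17, 15, 12, 19, 21, 0, 13, 19, 0, 15, 12, 13]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
      / sum((a - mx) ** 2 for a in x))
b0 = my - b1 * mx

s2 = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y)) / (n - 2)
print(round(s2, 1))  # 30.9
```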
Prediction interval for the fitted value of the response
We want to predict the value Y(x̃) of the response variable corresponding to the input value x̃, based on the observed data set (x_1, ..., x_n; y_1, ..., y_n). The predicted value is Ŷ(x̃) = b_0 + b_1 x̃, where b_0, b_1 are estimated from the data using the least squares method.
The end points of the (1 - α) prediction interval for the response are

b_0 + b_1 x̃ ± t_{α/2} S √(1 + 1/n + (x̃ - m(x))² / ((n-1) s²(x))),

where t_{α/2} is the number for which a t(n-2)-distributed random variable T satisfies Pr(-t_{α/2} ≤ T ≤ t_{α/2}) = 1 - α.
Remark: the prediction interval is wider when x̃ is far from the sample mean m(x) of the observed data.
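The formula can be sketched as a Python function (the function name is hypothetical). The t-quantile t_{α/2} is taken as an argument, since Python's standard library has no t-distribution; in practice it would come from R's qt or a statistics library:

```python
import math

def prediction_interval(x, y, x_new, t_quantile):
    """End points of the prediction interval
    b0 + b1*x_new +- t_quantile * S * sqrt(1 + 1/n + (x_new - m(x))^2 / Sxx),
    where Sxx = (n-1) * s^2(x) and t_quantile is the t(n-2) quantile t_{alpha/2}
    supplied by the caller."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    # Residual standard deviation S from the unbiased variance estimate.
    s = math.sqrt(sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y)) / (n - 2))
    half = t_quantile * s * math.sqrt(1 + 1 / n + (x_new - mx) ** 2 / sxx)
    centre = b0 + b1 * x_new
    return centre - half, centre + half
```

As the remark on the slide says, the interval returned by this sketch widens as x_new moves away from the sample mean of x.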
Example. Evaluation of the course
Can we predict the exam points from exercise points?
(Data as in the table above.)
Probably yes - but we cannot use the method above to quantify it, because the residuals are not normally distributed.
Example: Heights of fathers and sons
[Scatter plot: father's height vs. son's height; the x axis marks 165 cm]
Residuals of the regression model when the father is approximately 165 cm
[Histogram of residuals vs. the normal density]
Example: Heights of fathers and sons
[Scatter plot: father's height vs. son's height]
Residuals of the regression model when the father is approximately 170 cm
[Histogram of residuals vs. the normal density]
Example: Heights of fathers and sons
Can we predict the heights of sons from the heights of fathers? It seems that the residuals are normally distributed with equal variances, so we can use the regression model. (The normality assumption for the residuals is not necessary; see the course MS-C2128 Prediction and time series analysis.)
Height of the son when the father is approximately 165 cm
[Histogram: distribution of the heights of sons and the 90 % prediction interval, when the height of the father is 165 cm]
Height of the son when the father is approximately 170 cm
[Histogram: distribution of the heights of sons and the 90 % prediction interval, when the height of the father is 170 cm]
Example: Heights of fathers and sons
[Scatter plot: father's height vs. son's height, with the fitted line and the 90 % prediction interval]
What next?
Stochastics and Statistics Courses 2015-2016

MS-C2111 Stochastic processes (Stokastiset prosessit)
Period I, 5 cr, BSc. Lecturer: Lasse Leskelä.
Prerequisites: MS-A050X First course in probability and statistics, MS-A000X Matrix algebra, MS-A020X Differential and integral calculus 2.
Stochastic processes are used to model time-dependent random phenomena arising in applications in engineering, economics, and the natural sciences. On this course we learn to analyse stochastic population models using Markov processes and the occurrence of unpredictable events using Poisson processes. In addition, we learn to analyse betting strategies of simple games of chance using martingales. The material of this course is important on most further courses in stochastics and statistics.

MS-C2128 Prediction and time series analysis (Ennustaminen ja aikasarja-analyysi)
Period II, 5 cr, BSc. Lecturer: Heikki Seppälä.
Prerequisites: MS-A050X First course in probability and statistics, MS-A020X Differential and integral calculus 2, (MS-C2111 Stochastic processes).
"Prediction is difficult, especially about the future." - Niels Bohr
If certain mathematical assumptions hold, useful forecasts can be made from historical time series data. The goal of the course is to learn how time series are analysed and how forecasts are made from them. The course covers the most common models, such as ARIMA models and dynamic regression models, but also other topics essential for the results, such as diagnostics and model selection. The R software is used.
[Figure: tenor basis spread (bp) time series, 2007-2013]

MS-C2103 Design of experiments and statistical models (Koesuunnittelu ja tilastolliset mallit)
Periods III-IV, 5 cr, BSc/MSc. Lecturer: Heikki Seppälä.
Prerequisites: MS-A050X First course in probability and statistics.
The course introduces the most common experimental designs and methods for carrying out statistical analyses. The goal is to learn to choose a suitable experimental design for a statistical test, to perform the test, and to analyse the results. The course covers the basics of regression analysis, analysis of variance, and selected experimental designs such as block designs, factorial experiments, and the response surface method. The R software is used.

MS-C2104 Introduction to Statistical Inference (Tilastollisen analyysin perusteet)
Periods III-IV, 5 cr, BSc/MSc. Lecturer: Pauliina Ilmonen.
Prerequisites: MS-A050X First course in probability and statistics, MS-A000X Matrix algebra.
The course is an introduction to computer-aided statistical analysis and statistical inference. Topics include estimation and interval estimation, simple parametric and non-parametric tests, statistical dependence and correlation, linear regression analysis, and analysis of variance. The R software is used.

MS-E1600 Probability theory
Period III, 5 cr, MSc. Lecturer: Kalle Kytölä.
Prerequisites: MS-C1540 Euklidiset avaruudet.
This course is about the mathematical foundations of randomness. Most advanced topics in stochastics and statistics rely on probability theory. The basic constructions are identical to measure theory, but there are a number of distinctly probabilistic features such as independence, notions of convergence of random variables, information contained in a sigma-algebra, conditional expectation, characteristic functions and generating functions, laws of large numbers and central limit theorems, etc. These questions are discussed together with selected applications.

MS-E1601 Brownian motion and stochastic analysis
Period II, 5 cr, MSc. Lecturer: Lauri Viitasaari.
Prerequisites: MS-E1600 Probability theory, (MS-C2111 Stochastic processes).
This course introduces the foundations of stochastic analysis and stochastic integration with respect to a Brownian motion. The course starts with a construction of Brownian motion and an analysis of its basic properties, and continues with the construction of the Ito stochastic integral. We derive the Ito formula, which is the equivalent of the fundamental theorem of calculus for stochastic integrals, and discuss its applications to mathematical finance.

MS-E1602 Large random systems
Period IV, 5 cr, MSc. Lecturers: Lasse Leskelä and Kalle Kytölä.
Prerequisites: MS-E1600 Probability theory, (MS-C2111 Stochastic processes).
Many interesting random systems contain a large number of simpler constituents interacting with each other. This course covers both mathematical techniques for the study of such systems and important probabilistic models of a range of different phenomena. The theory focuses on tightness and weak convergence of probability measures. Examples include random walk and Brownian motion, percolation, the Curie-Weiss and Ising models, and the voter model and contact process.

MS-E1996 Multivariate location and scatter
Period II, 5 cr, MSc. Lecturer: Pauliina Ilmonen.
Prerequisites: at least one matrix algebra and one MSc level statistics/probability course.
When dealing with multivariate observations, the very first questions that come to mind are: Where is the data? How is it scattered? This is an advanced course in statistics for MSc and doctoral students. Only 10 students are admitted to this course, so email the lecturer as soon as possible to register. Topics include: M-estimates of location and scatter, MCD-estimates, spatial sign and rank based estimates, multivariate location tests, autocovariance matrices and applications, PCA using different location and scatter estimates, multivariate regression analysis based on spatial signs and ranks, scatter matrix based ICA, complex time series ICA, ICS, and skewness and kurtosis.

MS-E2112 Multivariate statistical analysis
Periods III-IV, 5 cr, MSc. Lecturer: Pauliina Ilmonen.
Prerequisites: at least one statistics/probability and one matrix algebra course.
This course is an introduction to multivariate statistical analysis. The goal is to learn the basics of common multivariate data analysis techniques and to use the methods in practice. The R software is used in the exercises of this course. The topics of the course are multivariate location and scatter, principal component analysis, bivariate correspondence analysis, multivariate correspondence analysis, canonical correlation analysis, discriminant analysis, classification, and clustering.
Stochastics & statistics @ Aalto
Bachelor courses:
First course in probability and statistics (E)
Stochastic processes
Introduction to Statistical Inference (E)
Design of experiments and statistical models
Prediction and time series analysis
Master's courses:
Probability theory (E)
Large random systems (E)
Multivariate statistical analysis (E)
Brownian motion and stochastic analysis (E)
Multivariate location and scatter (E)
The course ends here. Thanks for attending, and good luck with the exams!
References
These slides are partly based on previous lecture slides by Ilkka Mellin, Milla Kibble, Juuso Liesiö, Lasse Leskelä, and Kalle Kytölä.