MS-A0504 First course in probability and statistics
Week 6: Statistical dependence and linear regression
Heikki Seppälä
Department of Mathematics and Systems Analysis, School of Science, Aalto University
Spring 2016

Contents
- Description of a data set of two variables
- Least squares method
- Linear regression

Describing the data set of two variables
Collected data: $n$ observed units, $p$ variables.
Choose two variables for analysis; we then analyse the data set $(x, y)$ consisting of the pairs $(x_1, y_1), \dots, (x_n, y_n)$.

Example. Evaluation of the course
Can we predict exam points from exercise points?
Table columns: id, exam (y), report, exercises (x), grade.
Input (explanatory): x = (0, 20, 0, 16, 20, 17, 3, 9, 12, 0, 19, 0, 17)
Output (dependent): y = (0, 17, 15, 12, 19, 21, 0, 13, 19, 0, 15, 12, 13)

Scatter plot
Data points: $(x_1, y_1), \dots, (x_n, y_n)$

Sample covariance
The sample covariance of data vectors $x$ and $y$ is defined by
$$s(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - m(x))(y_i - m(y)),$$
where $m(x)$ and $m(y)$ are the sample means of the data vectors.
Remark:
- $s(x, x) = s^2(x)$ is the sample variance of $x$
- $s(y, y) = s^2(y)$ is the sample variance of $y$
- $\sqrt{s(x, x)} = s(x)$ is the sample standard deviation of $x$
- $\sqrt{s(y, y)} = s(y)$ is the sample standard deviation of $y$
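
As a minimal sketch of the definition above, the following Python computes the sample covariance and the sample standard deviations for the course data (exercise points x, exam points y). The function names `mean` and `sample_covariance` are illustrative only; the slides themselves use R's `cov(x,y)`.

```python
def mean(v):
    """Sample mean m(v)."""
    return sum(v) / len(v)


def sample_covariance(x, y):
    """s(x, y) = 1/(n-1) * sum over i of (x_i - m(x)) * (y_i - m(y))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length n >= 2")
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)


# Course data from the example: exercise points (x) and exam points (y).
x = [0, 20, 0, 16, 20, 17, 3, 9, 12, 0, 19, 0, 17]
y = [0, 17, 15, 12, 19, 21, 0, 13, 19, 0, 15, 12, 13]

s_xy = sample_covariance(x, y)
s_x = sample_covariance(x, x) ** 0.5  # sample standard deviation of x
s_y = sample_covariance(y, y) ** 0.5  # sample standard deviation of y
print(s_xy, s_x, s_y)
```

The standard deviations agree with the values s(x) = 8.51 and s(y) = 7.39 quoted later in the least squares example.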

Example. Evaluation of the course
The sample covariance: s(x, y) = cov(x,y) = …
We need to normalise this to be able to interpret it.

Sample correlation
Pearson's sample correlation of data vectors $x$ and $y$ is defined by
$$r(x, y) = \frac{s(x, y)}{s(x)\, s(y)} \in [-1, +1].$$
[Photo: Karl Pearson FRS]
Pearson's correlation measures linear dependence:
- If $r(x, y) > 0$, then $x$ and $y$ are positively correlated
- If $r(x, y) = 0$, then $x$ and $y$ are uncorrelated
- If $r(x, y) < 0$, then $x$ and $y$ are negatively correlated
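
The definition of Pearson's correlation can be sketched in Python as follows, again on the course data. The name `pearson_r` is a hypothetical helper, not something from the slides (which call R's `cor(x,y)`); the sketch uses the fact that the 1/(n-1) factors in s(x, y), s(x) and s(y) cancel in the ratio.

```python
import math


def pearson_r(x, y):
    """Pearson's sample correlation r(x, y) = s(x, y) / (s(x) * s(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # The common factor 1/(n-1) cancels between numerator and denominator.
    return sxy / math.sqrt(sxx * syy)


# Course data from the example: exercise points (x) and exam points (y).
x = [0, 20, 0, 16, 20, 17, 3, 9, 12, 0, 19, 0, 17]
y = [0, 17, 15, 12, 19, 21, 0, 13, 19, 0, 15, 12, 13]

r = pearson_r(x, y)
print(r)
```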

Example. Evaluation of the course
Pearson's sample correlation: r(x, y) = cor(x,y) = …
Exercise points and exam points appear to be positively correlated.
Or is this caused by random variation?

Testing for correlations
Initial hypothesis (stochastic model): the observed pairs $(x_i, y_i)$ are realizations of independent random vectors $(X_i, Y_i) \sim N_2(\mu_X, \mu_Y, \sigma_X^2, \sigma_Y^2, \rho_{XY})$.
$$H_0: \rho_{XY} = 0 \quad \text{vs.} \quad H_1: \rho_{XY} \neq 0$$
[Photo: William S. Gosset (a.k.a. Student)]
If the initial hypothesis and the null hypothesis hold, the test statistic
$$t(X, Y) = r(X, Y)\, \frac{\sqrt{n-2}}{\sqrt{1 - r(X, Y)^2}}$$
is t-distributed with $n - 2$ degrees of freedom.
If the absolute value of the test statistic is large, the observed data would be unlikely under the null hypothesis, which is evidence against it.
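
A rough sketch of the test in Python, using only the standard library. The slides compute the p-value with R's `pt`; since Python's standard library has no t-distribution, this sketch substitutes a numerical integration (Simpson's rule) of the t density, which is an assumption of the example and not the slides' method. The function names are hypothetical. The values 18.85 and 1076 are the ones quoted in the fathers-and-sons example later in these slides.

```python
import math


def t_statistic(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) for H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def t_density(u, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # Work on the log scale so that large df does not overflow the gamma function.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))


def two_sided_p_value(t, df, steps=2000):
    """Pr(|T| >= |t|) via Simpson's rule on the density over [0, |t|].

    steps must be even.
    """
    a = abs(t)
    h = a / steps
    total = t_density(0.0, df) + t_density(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    central = 2.0 * total * h / 3.0  # Pr(-|t| <= T <= |t|) by symmetry
    return max(0.0, 1.0 - central)


# Observed statistic and degrees of freedom from the fathers-and-sons example.
p = two_sided_p_value(18.85, 1076)
print(p)
```

Simpson's rule is accurate enough here because the t density is smooth; for routine work a statistics library would be the more natural choice.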

Example. Evaluation of the course
Is the joint distribution of exam and exercise points a bivariate normal distribution?
No. Both variables are discrete, and usually neither of them is symmetric.
In this case we cannot test the correlation using the aforementioned test. However, there are non-parametric tests that can also be used in this setting (course MS-C2104 Introduction to Statistical Inference).

Example. Heights of fathers and sons
[Figure: heights of sons and fathers]
Are the heights of fathers and sons from a bivariate normal distribution?

Example. Heights of fathers and sons
[Figure: plot of f against Son and Father]

Example. Heights of fathers and sons
[Figure: Histogram of Fathers and Histogram of Sons; axes: Height, Density]

Example. Heights of fathers and sons
Are the heights of fathers and sons from a bivariate normal distribution?
Yes, or at least the bivariate normal distribution provides an accurate enough approximation. We can therefore test the correlation using the t-test.
Sample correlation: r(x, y) = cor(x,y) = …
Test statistic calculated from the data: t(x, y) = 18.85
p-value: $\Pr(|t(X, Y)| \ge 18.85)$ = 2*(1-pt(18.85,1076)) ≈ 0
Since the p-value is less than 0.01, the null hypothesis ($\rho_{XY} = 0$) is rejected at the 1 % significance level.
Conclusion: the heights of fathers and sons are linearly dependent.

Example. Evaluation of the course
Pearson's sample correlation: r(x, y) = …
The linear dependence between the variables is fairly strong.
What is the best line for illustrating the linear dependence?

Scatter plot
Data points: $(x_1, y_1), \dots, (x_n, y_n)$

Fitting the line
Fitted values: $\hat{y}_i = \beta_0 + \beta_1 x_i$

Residuals
Residuals: $e_i = y_i - \hat{y}_i$

Minimization of residuals
How to choose the optimal slope $\beta_1$ and constant $\beta_0$?

Minimization of residuals
Sum of squares of residuals of the line $\hat{y} = \beta_0 + \beta_1 x$:
$$\mathrm{SSE}(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
Least squares method
Find $(\beta_0, \beta_1)$ such that the sum of squared residuals is minimized.
Solution: Differentiate $\mathrm{SSE}(\beta_0, \beta_1)$ with respect to $\beta_0$ and $\beta_1$, set both partial derivatives to zero and solve the resulting equations.
Answer: $(\beta_0, \beta_1) = (b_0, b_1)$, where
$$b_1 = r(x, y)\, \frac{s(y)}{s(x)}, \qquad b_0 = m(y) - b_1\, m(x).$$
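
The solution step can be written out as follows. This is a sketch of the standard derivation, using only the notation already defined on these slides ($m$, $s$, $r$); it is not taken from the slides themselves.

```latex
\begin{align*}
\frac{\partial\,\mathrm{SSE}}{\partial \beta_0}
  &= -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
  &&\Longrightarrow\quad \beta_0 = m(y) - \beta_1\, m(x), \\
\frac{\partial\,\mathrm{SSE}}{\partial \beta_1}
  &= -2 \sum_{i=1}^{n} x_i\, (y_i - \beta_0 - \beta_1 x_i) = 0 .
\end{align*}
% Substitute beta_0 into the second equation. Because the first equation says
% the residuals sum to zero, x_i may be replaced by x_i - m(x):
\begin{align*}
\sum_{i=1}^{n} (x_i - m(x)) \bigl( (y_i - m(y)) - \beta_1 (x_i - m(x)) \bigr) = 0
\quad\Longrightarrow\quad
\beta_1 = \frac{\sum_{i=1}^{n} (x_i - m(x))(y_i - m(y))}{\sum_{i=1}^{n} (x_i - m(x))^2}
        = \frac{s(x, y)}{s^2(x)}
        = r(x, y)\, \frac{s(y)}{s(x)} .
\end{align*}
```

The last equality uses $s(x, y) = r(x, y)\, s(x)\, s(y)$, which is the definition of Pearson's sample correlation rearranged.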

Example. Evaluation of the course
Sample means: m(x) = 10.2, m(y) = 12.0
Sample standard deviations: s(x) = 8.51, s(y) = 7.39
Pearson's sample correlation: r(x, y) = …
$$b_1 = r(x, y)\, \frac{s(y)}{s(x)} = 0.60, \qquad b_0 = m(y) - b_1\, m(x) = 5.82$$
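
The fit for the course data can be reproduced with a short Python sketch. The helper name `least_squares_line` is hypothetical; it computes the slope as the ratio of centred sums, which is algebraically the same as $r(x, y)\, s(y)/s(x)$.

```python
def least_squares_line(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx       # equals r(x, y) * s(y) / s(x)
    b0 = my - b1 * mx    # the fitted line passes through (m(x), m(y))
    return b0, b1


# Course data from the example: exercise points (x) and exam points (y).
x = [0, 20, 0, 16, 20, 17, 3, 9, 12, 0, 19, 0, 17]
y = [0, 17, 15, 12, 19, 21, 0, 13, 19, 0, 15, 12, 13]

b0, b1 = least_squares_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(b0, 2), round(b1, 2))
```

Up to rounding, this agrees with the slope 0.60 and constant 5.82 stated above. The residuals of a least squares line with a constant term always sum to zero, which is a quick sanity check on any implementation.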

Example: Heights of fathers and sons
Sample means: m(x) = …, m(y) = …
Sample standard deviations: s(x) = 6.98, s(y) = 7.14
Pearson's sample correlation: r(x, y) = …
$$b_1 = r(x, y)\, \frac{s(y)}{s(x)} = \dots, \qquad b_0 = m(y) - b_1\, m(x) = 86.83$$

Example: Heights of fathers and sons
[Figure: heights of sons and fathers]

Prediction interval of fitted line
If we fit a line to a data set of two variables using the least squares method, how accurately does this line predict the values of the response variable?
How likely is it that the fitted value is close to the observed value?
We need a stochastic model of the statistical experiment.

Linear regression model
Suppose that the response variable $Y$ depends on the input $x$ as follows:
$$Y = \beta_0 + \beta_1 x + \epsilon, \quad \text{where } \epsilon \sim N(0, \sigma^2).$$
We take $n$ independent measurements with input values $x_1, \dots, x_n$ and obtain the values
$$Y_k = \beta_0 + \beta_1 x_k + \epsilon_k, \quad k = 1, \dots, n,$$
where the random residuals $\epsilon_1, \dots, \epsilon_n$ of the stochastic model are independent and $N(0, \sigma^2)$-distributed.
There are 3 unknown parameters: $(\beta_0, \beta_1, \sigma^2)$.
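
To make the stochastic model concrete, here is a small simulation sketch in Python. All parameter values and input values below (`beta0`, `beta1`, `sigma`, `xs`) are hypothetical choices for illustration, not estimates from the slides, and `simulate_responses` is an illustrative name.

```python
import random


def simulate_responses(xs, beta0, beta1, sigma, rng):
    """Draw Y_k = beta0 + beta1 * x_k + eps_k with independent eps_k ~ N(0, sigma^2)."""
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]


rng = random.Random(2016)            # fixed seed so the run is repeatable
xs = [150 + 5 * k for k in range(10)]  # hypothetical input values
ys = simulate_responses(xs, beta0=80.0, beta1=0.5, sigma=6.0, rng=rng)

for x, y in zip(xs, ys):
    print(x, round(y, 1))
```

Each run with a different seed gives a different data set scattered around the same line, which is exactly the variation the prediction interval is meant to quantify.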

Estimation of parameters of linear regression model
The best estimators of the parameters $\beta_0, \beta_1$ in the sense of expected squared residuals are the least squares estimators
$$b_1 = r(x, y)\, \frac{s(y)}{s(x)}, \qquad b_0 = m(y) - b_1\, m(x).$$
An unbiased estimator of the unknown variance parameter $\sigma^2$ is
$$S^2 = \frac{1}{n-2} \sum_{j=1}^{n} (y_j - \hat{y}_j)^2 = \frac{1}{n-2} \sum_{j=1}^{n} (y_j - b_0 - b_1 x_j)^2.$$

Prediction interval of fitted value of response
We want to predict the value $Y(\tilde{x})$ of the response variable corresponding to the input value $\tilde{x}$, based on the observed data set $(x_1, \dots, x_n; y_1, \dots, y_n)$.
The predicted value is $\hat{Y}(\tilde{x}) = b_0 + b_1 \tilde{x}$, where $b_0, b_1$ are estimated from the data using the least squares method.
The end points of the $(1 - \alpha)$ prediction interval for the response are
$$b_0 + b_1 \tilde{x} \pm t_{\alpha/2}\, S \sqrt{1 + \frac{1}{n} + \frac{(\tilde{x} - m(x))^2}{(n-1)\, s^2(x)}},$$
where $t_{\alpha/2}$ is the number for which a $t(n-2)$-distributed random number $T$ satisfies $\Pr(-t_{\alpha/2} \le T \le t_{\alpha/2}) = 1 - \alpha$.
Remark: The prediction interval is wider if $\tilde{x}$ is far from the sample mean $m(x)$ of the observed data.
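
The interval can be sketched in Python as below. The sketch assumes the standard form of the half-width, $t_{\alpha/2}\, S \sqrt{1 + 1/n + (\tilde{x} - m(x))^2 / ((n-1) s^2(x))}$, and takes the quantile $t_{\alpha/2}$ as an argument looked up from a t-table, because the Python standard library has no t-distribution. The function name and the five-point data set are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math


def prediction_interval(x, y, x_new, t_quantile):
    """(1 - alpha) prediction interval for the response at input x_new.

    t_quantile is t_{alpha/2} for the t-distribution with n - 2 degrees
    of freedom, supplied by the caller.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)  # equals (n - 1) * s^2(x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))           # S, the estimate of sigma
    half = t_quantile * s * math.sqrt(1 + 1 / n + (x_new - mx) ** 2 / sxx)
    center = b0 + b1 * x_new
    return center - half, center + half


# Hypothetical toy data, not from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_q = 3.182  # t_{0.025} with n - 2 = 3 degrees of freedom (95 % interval)

low_mid, high_mid = prediction_interval(x, y, 3.0, t_q)  # at the sample mean
low_far, high_far = prediction_interval(x, y, 5.0, t_q)  # far from the mean
print(low_mid, high_mid)
print(low_far, high_far)
```

Comparing the two printed intervals illustrates the remark above: the interval at the sample mean of x is the narrowest, and it widens as the input moves away from it.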

Example. Evaluation of the course
Can we predict the exam points from exercise points?
Probably yes, but we can't test it using the method above, because the residuals are not normally distributed.

Example: Heights of fathers and sons
[Figure: heights of sons and fathers]

Residuals of the regression model when father is approximately 165cm
[Figure: histogram of residuals vs. normal distribution; axis: Density]

Example: Heights of fathers and sons
[Figure: heights of sons and fathers]

Residuals of the regression model when father is approximately 170cm
[Figure: histogram of residuals vs. normal distribution; axis: Density]

Example: Heights of fathers and sons
Can we predict the heights of sons from the heights of fathers?
The residuals appear to be normally distributed with equal variances, so we can use the regression model.
(The normality assumption for residuals is not necessary; see course MS-2128 Prediction and time series analysis.)

Height of son, when father is approximately 165cm
[Figure: Heights of sons; axes: Height, Density]
Distribution of the heights of sons and the 90% prediction interval, when the height of the father is 165cm.

Height of son, when father is approximately 170cm
[Figure: Heights of sons; axes: Height, Density]
Distribution of the heights of sons and the 90% prediction interval, when the height of the father is 170cm.

Example: Heights of fathers and sons (90 % prediction interval)
[Figure: heights of sons with the 90 % prediction interval]

What next?

Stochastics & Aalto
Bachelor courses:
- First course in probability and statistics (E)
- Stochastic processes
- Introduction to Statistical Inference (E)
- Design of experiments and statistical models
- Prediction and time series analysis
Masters courses:
- Probability theory (E)
- Large random systems (E)
- Multivariate statistical analysis (E)
- Brownian motion and stochastic analysis (E)
- Multivariate location and scatter (E)

The course ends here. Thanks for attending and good luck for exams!

References
The slides are partly based on the previous lecture slides (Ilkka Mellin, Milla Kibble, Juuso Liesiö, Lasse Leskelä, Kalle Kytölä).