Shalabh ([email protected]), Department of Mathematics & Statistics, Indian Institute of Technology Kanpur, Kanpur - 208016 (India)
MTH 416 : Regression Analysis
Syllabus: Simple and multiple linear regression, polynomial regression and orthogonal polynomials, tests of significance and confidence intervals for parameters. Residuals and their analysis for testing departures from the assumptions, such as fitness of the model, normality, homogeneity of variances, detection of outliers, influential observations, power transformations of dependent and independent variables. The problem of multicollinearity, ridge regression and principal component regression, subset selection of explanatory variables, Mallows' Cp statistic. Nonlinear regression, different methods of estimation (least squares and maximum likelihood), asymptotic properties of estimators. Generalised linear models (GLIM), analysis of binary and grouped data using logistic and log-linear models.
Grading Scheme : Quizzes: 20%, Mid semester exam: 30%, End semester exam: 50%
Books: 1. Introduction to Linear Regression Analysis by Douglas C. Montgomery, Elizabeth A. Peck, G. Geoffrey Vining (Wiley). A low-price Indian edition is available.
2. Applied Regression Analysis by Norman R. Draper, Harry Smith (Wiley). A low-price Indian edition is available.
3. Linear Models and Generalizations - Least Squares and Alternatives by C.R. Rao, H. Toutenburg, Shalabh, and C. Heumann (Springer, 2008)
4. A Primer on Linear Models by John F. Monahan (CRC Press, 2008)
5. Linear Model Methodology by Andre I. Khuri (CRC Press, 2010)
Assignments:
Assignment 1
Assignment 2
Assignment 3
Assignment 4
Assignment 5
Assignment 6
Assignment 7
Assignment 8
Lecture notes for your reference (if you find any typos, please let me know)
Lecture Notes 1 : Introduction
Lecture Notes 2 : Simple Linear Regression Analysis
Lecture Notes 3 : Multiple Linear Regression Model
Lecture Notes 4 : Model Adequacy Checking
Lecture Notes 5 : Transformation and Weighting to Correct Model Inadequacies
Lecture Notes 6 : Diagnostic for Leverage and Influence
Lecture Notes 7 : Generalized and Weighted Least Squares Estimation
Lecture Notes 8 : Indicator Variables
Lecture Notes 9 : Multicollinearity
Lecture Notes 10 : Heteroskedasticity
Lecture Notes 11 : Autocorrelation
Lecture Notes 12 : Polynomial Regression Models
Lecture Notes 13 : Variable Selection and Model Building
Lecture Notes 14 : Logistic Regression Models
Lecture Notes 15 : Poisson Regression Models
Lecture Notes 16 : Generalized Linear Models
Course Info
- Instructor: Prof. Dimitris Bertsimas
- Department: Sloan School of Management
- Topics: Operations Management; Probability and Statistics

The Analytics Edge: 2 Linear Regression
2.1 Welcome to Unit 2
- 2.1.1 Welcome to Unit 2
2.2 The Statistical Sommelier: An Introduction to Linear Regression
- 2.2.1 Video 1: Predicting the Quality of Wine
- 2.2.2 Quick Question
- 2.2.3 Video 2: One-Variable Linear Regression
- 2.2.4 Quick Question
- 2.2.5 Video 3: Multiple Linear Regression
- 2.2.6 Quick Question
- 2.2.7 Video 4: Linear Regression in R
- 2.2.8 Quick Question
- 2.2.9 Video 5: Understanding the Model
- 2.2.10 Quick Question
- 2.2.11 Video 6: Correlation and Multicollinearity
- 2.2.12 Quick Question
- 2.2.13 Video 7: Making Predictions
- 2.2.14 Quick Question
- 2.2.15 Video 8: Comparing the Model to the Experts
2.3 Moneyball: The Power of Sports Analytics
- 2.3.1 A Quick Introduction to Baseball
- 2.3.2 Video 1: The Story of Moneyball
- 2.3.3 Video 2: Making it to the Playoffs
- 2.3.4 Quick Question
- 2.3.5 Video 3: Predicting Runs
- 2.3.6 Quick Question
- 2.3.7 Video 4: Using the Models to Make Predictions
- 2.3.8 Quick Question
- 2.3.9 Video 5: Winning the World Series
- 2.3.10 Quick Question
- 2.3.11 Video 6: The Analytics Edge in Sports
- 2.3.12 Quick Question
2.4 Playing Moneyball in the NBA (Recitation)
- 2.4.1 Welcome to Recitation 2
- 2.4.2 Video 1: The Data
- 2.4.3 Video 2: Playoffs and Wins
- 2.4.4 Video 3: Points Scored
- 2.4.5 Video 4: Making Predictions
2.5 Assignment 2
- 2.5.1 Climate Change
- 2.5.2 Reading Test Scores
- 2.5.3 Detecting Flu Epidemics via Search Engine Query Data
- 2.5.4 State Data
Welcome to Unit 2
Video 1: Predicting the Quality of Wine
The slides from all videos in this Lecture Sequence can be downloaded here: Introduction to Linear Regression (PDF - 1.3MB) .
Introduction to Baseball Video
If you are unfamiliar with the game of baseball, please watch this short video clip for a quick introduction to the game. You don’t need to be a baseball expert to understand this lecture, but basic knowledge of the game will be helpful to you.
TruScribe. “Baseball Rules of Engagement.” March 27, 2012. YouTube. This video is from TrueScribeVideos and is not covered by our Creative Commons license .
Welcome to Recitation 2
Climate Change
There have been many studies documenting that the average global temperature has been increasing over the last century. The consequences of a continued rise in global temperature will be dire. Rising sea levels and an increased frequency of extreme weather events will affect billions of people.
In this problem, we will attempt to study the relationship between average global temperature and several other factors.
The file climate_change (CSV) contains climate data from May 1983 to December 2008. The available variables include:
- Year : the observation year.
- Month : the observation month.
- Temp : the difference in degrees Celsius between the average global temperature in that period and a reference value. This data comes from the Climatic Research Unit at the University of East Anglia .
- CO2 , N2O , CH4 , CFC.11 , CFC.12 : atmospheric concentrations of carbon dioxide (CO2), nitrous oxide (N2O), methane (CH4), trichlorofluoromethane (CCl3F; commonly referred to as CFC-11) and dichlorodifluoromethane (CCl2F2; commonly referred to as CFC-12), respectively. This data comes from the ESRL/NOAA Global Monitoring Division .
- CO2, N2O and CH4 are expressed in ppmv (parts per million by volume; i.e., 397 ppmv of CO2 means that CO2 constitutes 397 millionths of the total volume of the atmosphere).
- CFC.11 and CFC.12 are expressed in ppbv (parts per billion by volume).
- Aerosols : the mean stratospheric aerosol optical depth at 550 nm. This variable is linked to volcanoes, as volcanic eruptions result in new particles being added to the atmosphere, which affect how much of the sun’s energy is reflected back into space. This data is from the Goddard Institute for Space Studies at NASA .
- TSI : the total solar irradiance (TSI) in W/m2 (the rate at which the sun’s energy is deposited per unit area). Due to sunspots and other solar phenomena, the amount of energy that is given off by the sun varies substantially with time. This data is from the SOLARIS-HEPPA project website .
- MEI : multivariate El Niño Southern Oscillation index (MEI), a measure of the strength of the El Niño/La Niña-Southern Oscillation (a weather effect in the Pacific Ocean that affects global temperatures). This data comes from the ESRL/NOAA Physical Sciences Division .
Problem 1.1 - Creating Our First Model
We are interested in how changes in these variables affect future temperatures, as well as how well these variables explain temperature changes so far. To do this, first read the dataset climate_change.csv into R.
Then, split the data into a training set , consisting of all the observations up to and including 2006, and a testing set consisting of the remaining years (hint: use subset). A training set refers to the data that will be used to build the model (this is the data we give to the lm() function), and a testing set refers to the data we will use to test our predictive ability.
Next, build a linear regression model to predict the dependent variable Temp, using MEI, CO2, CH4, N2O, CFC.11, CFC.12, TSI, and Aerosols as independent variables ( Year and Month should NOT be used in the model). Use the training set to build the model.
Enter the model R2 (the “Multiple R-squared” value):
Explanation
First, read in the data and split it using the subset command:
climate = read.csv("climate_change.csv")
train = subset(climate, Year <= 2006)
test = subset(climate, Year > 2006)
Then, you can create the model using the command:
climatelm = lm(Temp ~ MEI + CO2 + CH4 + N2O + CFC.11 + CFC.12 + TSI + Aerosols, data=train)
Lastly, look at the model using summary(climatelm). The Multiple R-squared value is 0.7509.
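The point of the train/test split above is out-of-sample evaluation. A minimal sketch of computing test-set R² with predict() follows; it uses synthetic data as a stand-in, since the climate_change.csv file is not reproduced here (the column x and frame d are illustrative), but the same pattern applies directly to climatelm, train, and test:

```r
set.seed(1)
# Synthetic stand-in for the climate data (the real CSV is not included here)
n <- 300
d <- data.frame(Year = rep(1983:2008, length.out = n), x = rnorm(n))
d$Temp <- 0.1 + 0.4 * d$x + rnorm(n, sd = 0.1)

train <- subset(d, Year <= 2006)   # same split rule as the problem
test  <- subset(d, Year >  2006)

fit  <- lm(Temp ~ x, data = train)
pred <- predict(fit, newdata = test)

# Out-of-sample R^2: compare model errors to a baseline that always
# predicts the training-set mean of Temp
SSE <- sum((test$Temp - pred)^2)
SST <- sum((test$Temp - mean(train$Temp))^2)
R2  <- 1 - SSE / SST
```

Unlike the in-sample "Multiple R-squared" reported by summary(), this quantity can be negative if the model predicts worse than the baseline mean.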
Problem 1.2 - Creating Our First Model
Which variables are significant in the model? We will consider a variable significant only if the p-value is below 0.05. (Select all that apply.)
If you look at the model we created in the previous problem using summary(climatelm), all of the variables have at least one star except for CH4 and N2O. So MEI, CO2, CFC.11, CFC.12, TSI, and Aerosols are all significant.
Problem 2.1 - Understanding the Model
Current scientific opinion is that nitrous oxide and CFC-11 are greenhouse gases: gases that are able to trap heat from the sun and contribute to the heating of the Earth. However, the regression coefficients of both the N2O and CFC-11 variables are negative , indicating that increasing atmospheric concentrations of either of these two compounds is associated with lower global temperatures.
Which of the following is the simplest correct explanation for this contradiction?
- Climate scientists are wrong that N2O and CFC-11 are greenhouse gases - this regression analysis constitutes part of a disproof.
- There is not enough data, so the regression coefficients being estimated are not accurate.
- All of the gas concentration variables reflect human development - N2O and CFC.11 are correlated with other variables in the data set.
The linear correlations of N2O and CFC.11 with other variables in the data set are quite large. The first explanation does not seem correct, as the warming effects of nitrous oxide and CFC-11 are well documented, and our regression analysis is not enough to disprove them. The second explanation is unlikely, as we have estimated eight coefficients and the intercept from 284 observations.
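This instability can be demonstrated directly: when two predictors are nearly collinear, their individual coefficients become erratic even though their combined effect is well estimated. A minimal synthetic sketch (not the climate data; the names x1, x2, and y are illustrative):

```r
set.seed(42)
n  <- 284                      # same sample size as the climate training set
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05) # x2 is almost a copy of x1 (correlation ~0.999)
y  <- 1 + x1 + x2 + rnorm(n)   # both true effects are +1

coef(lm(y ~ x1))["x1"]         # marginal slope is near 2: x1 absorbs x2's effect
coef(lm(y ~ x1 + x2))          # individual slopes are unstable and can even
                               # differ in sign, while their sum stays near 2
```

This is the same mechanism behind the negative N2O and CFC.11 coefficients: the regression cannot cleanly attribute the shared signal to either correlated predictor.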
Problem 2.2 - Understanding the Model
Compute the correlations between all the variables in the training set. Which of the following independent variables is N2O highly correlated with (absolute correlation greater than 0.7)? Select all that apply.
Which of the following independent variables is CFC.11 highly correlated with? Select all that apply.
You can calculate all correlations at once using cor(train) where train is the name of the training data set.
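Scanning a full correlation matrix by eye is error-prone, so the threshold check can be automated. A sketch using R's built-in mtcars data as a stand-in (the identical pattern applies to cor(train) with target "N2O" or "CFC.11"):

```r
C <- cor(mtcars)                 # full correlation matrix, as with cor(train)
target <- "mpg"                  # stand-in for "N2O"
# Variables whose absolute correlation with the target exceeds 0.7,
# excluding the target's (trivial) correlation with itself
high <- names(which(abs(C[, target]) > 0.7 & rownames(C) != target))
high
```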
Problem 3 - Simplifying the Model
Given that the correlations are so high, let us focus on the N2O variable and build a model with only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.
Enter the coefficient of N2O in this reduced model:
(How does this compare to the coefficient in the previous model with all of the variables?)
Enter the model R2:
We can create this simplified model with the command:
LinReg = lm(Temp ~ MEI + N2O + TSI + Aerosols, data=train)
You can get the coefficient for N2O and the model R-squared by typing summary(LinReg).
We have observed that, for this problem, when we remove many variables the sign of N2O flips. The model has not lost a lot of explanatory power (the model R2 is 0.7261 compared to 0.7509 previously) despite removing many variables. As discussed in lecture, this type of behavior is typical when building a model where many of the independent variables are highly correlated with each other. In this particular problem many of the variables (CO2, CH4, N2O, CFC.11 and CFC.12) are highly correlated, since they are all driven by human industrial development.
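The small loss of explanatory power when correlated predictors are dropped can be reproduced on synthetic data: a near-duplicate predictor adds almost nothing once its twin is in the model. A hedged sketch (illustrative names, not the climate variables):

```r
set.seed(7)
n  <- 284
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # highly correlated with x1
y  <- 1 + x1 + x2 + rnorm(n)

full    <- summary(lm(y ~ x1 + x2))$r.squared
reduced <- summary(lm(y ~ x1))$r.squared
full - reduced                   # tiny drop in R^2, as with 0.7509 vs 0.7261
```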
- Mathematics
- Regression Analysis (Video)
- Coordinated by: IIT Kharagpur
- Available from: 2012-07-11
- Intro Video
- Simple Linear Regression
- Simple Linear Regression (Contd.)
- Simple Linear Regression (Contd.)
- Simple Linear Regression (Contd.)
- Simple Linear Regression (Contd.)
- Multiple Linear Regression
- Multiple Linear Regression (Contd.)
- Multiple Linear Regression (Contd.)
- Multiple Linear Regression (Contd.)
- Selecting the BEST Regression Model
- Selecting the BEST Regression Model (Contd.)
- Selecting the BEST Regression Model (Contd.)
- Selecting the BEST Regression Model (Contd.)
- Multicollinearity
- Multicollinearity (Contd.)
- Multicollinearity (Contd.)
- Model Adequacy Checking
- Model Adequacy Checking (Contd.)
- Model Adequacy Checking (Contd.)
- Test for Influential Observations
- Transformation and Weighting to correct model inadequacies
- Transformation and Weighting to correct model inadequacies (Contd.)
- Transformation and Weighting to correct model inadequacies (Contd.)
- Dummy Variables
- Dummy Variables (Contd.)
- Dummy Variables (Contd.)
- Polynomial Regression Models
- Polynomial Regression Models (Contd.)
- Polynomial Regression Models (Contd.)
- Generalized Linear Models
- Generalized Linear Models (Contd.)
- Non-Linear Estimation
- Regression Models with Autocorrelated Errors
- Regression Models with Autocorrelated Errors (Contd.)
- Measurement Errors and Calibration Problem
- Tutorial - I
- Tutorial - II
- Tutorial - III
- Tutorial - IV
- Tutorial - V
- Watch on YouTube
- Assignments
- Download Videos
- Transcripts
- Books
- Self Evaluation (3)
Module Name | Download | Description | Download Size
---|---|---|---
Simple Linear Regression | Self Evaluation | Please see all questions attached with the last module. | 24
Tutorial - V | Self Evaluation | This is a questionnaire with answers that covers all the modules and could be attempted after listening to the full course. | 140
Tutorial - V | Self Evaluation | This is a questionnaire with answers that covers all the modules and could be attempted after listening to the full course. | 5120
Sl.No | Chapter Name | MP4 Download
---|---|---
1 | Simple Linear Regression |
2 | Simple Linear Regression (Contd.) |
3 | Simple Linear Regression (Contd.) |
4 | Simple Linear Regression (Contd.) |
5 | Simple Linear Regression (Contd.) |
6 | Multiple Linear Regression |
7 | Multiple Linear Regression (Contd.) |
8 | Multiple Linear Regression (Contd.) |
9 | Multiple Linear Regression (Contd.) |
10 | Selecting the BEST Regression Model |
11 | Selecting the BEST Regression Model (Contd.) |
12 | Selecting the BEST Regression Model (Contd.) |
13 | Selecting the BEST Regression Model (Contd.) |
14 | Multicollinearity |
15 | Multicollinearity (Contd.) |
16 | Multicollinearity (Contd.) |
17 | Model Adequacy Checking |
18 | Model Adequacy Checking (Contd.) |
19 | Model Adequacy Checking (Contd.) |
20 | Test for Influential Observations |
21 | Transformation and Weighting to correct model inadequacies |
22 | Transformation and Weighting to correct model inadequacies (Contd.) |
23 | Transformation and Weighting to correct model inadequacies (Contd.) |
24 | Dummy Variables |
25 | Dummy Variables (Contd.) |
26 | Dummy Variables (Contd.) |
27 | Polynomial Regression Models |
28 | Polynomial Regression Models (Contd.) |
29 | Polynomial Regression Models (Contd.) |
30 | Generalized Linear Models |
31 | Generalized Linear Models (Contd.) |
32 | Non-Linear Estimation |
33 | Regression Models with Autocorrelated Errors |
34 | Regression Models with Autocorrelated Errors (Contd.) |
35 | Measurement Errors and Calibration Problem |
36 | Tutorial - I |
37 | Tutorial - II |
38 | Tutorial - III |
39 | Tutorial - IV |
40 | Tutorial - V |
Sl.No | Chapter Name | English
---|---|---
1 | Simple Linear Regression |
2 | Simple Linear Regression (Contd.) |
3 | Simple Linear Regression (Contd.) |
4 | Simple Linear Regression (Contd.) |
5 | Simple Linear Regression (Contd.) |
6 | Multiple Linear Regression |
7 | Multiple Linear Regression (Contd.) |
8 | Multiple Linear Regression (Contd.) |
9 | Multiple Linear Regression (Contd.) | PDF unavailable
10 | Selecting the BEST Regression Model | PDF unavailable
11 | Selecting the BEST Regression Model (Contd.) | PDF unavailable
12 | Selecting the BEST Regression Model (Contd.) | PDF unavailable
13 | Selecting the BEST Regression Model (Contd.) | PDF unavailable
14 | Multicollinearity | PDF unavailable
15 | Multicollinearity (Contd.) | PDF unavailable
16 | Multicollinearity (Contd.) | PDF unavailable
17 | Model Adequacy Checking | PDF unavailable
18 | Model Adequacy Checking (Contd.) | PDF unavailable
19 | Model Adequacy Checking (Contd.) | PDF unavailable
20 | Test for Influential Observations | PDF unavailable
21 | Transformation and Weighting to correct model inadequacies | PDF unavailable
22 | Transformation and Weighting to correct model inadequacies (Contd.) | PDF unavailable
23 | Transformation and Weighting to correct model inadequacies (Contd.) | PDF unavailable
24 | Dummy Variables | PDF unavailable
25 | Dummy Variables (Contd.) | PDF unavailable
26 | Dummy Variables (Contd.) | PDF unavailable
27 | Polynomial Regression Models | PDF unavailable
28 | Polynomial Regression Models (Contd.) | PDF unavailable
29 | Polynomial Regression Models (Contd.) | PDF unavailable
30 | Generalized Linear Models | PDF unavailable
31 | Generalized Linear Models (Contd.) | PDF unavailable
32 | Non-Linear Estimation | PDF unavailable
33 | Regression Models with Autocorrelated Errors | PDF unavailable
34 | Regression Models with Autocorrelated Errors (Contd.) | PDF unavailable
35 | Measurement Errors and Calibration Problem | PDF unavailable
36 | Tutorial - I | PDF unavailable
37 | Tutorial - II | PDF unavailable
38 | Tutorial - III | PDF unavailable
39 | Tutorial - IV | PDF unavailable
40 | Tutorial - V | PDF unavailable
Sl.No | Language | Book link
---|---|---
1 | English | Not Available
2 | Bengali | Not Available
3 | Gujarati | Not Available
4 | Hindi | Not Available
5 | Kannada | Not Available
6 | Malayalam | Not Available
7 | Marathi | Not Available
8 | Tamil | Not Available
9 | Telugu | Not Available