Module 4: Linear Regression

By the end of this module, you should be able to:

Standard Curves and Linear Regression

The mosaic package provides a data frame called SAT with information on educational expenditures and SAT scores in each state. After loading the mosaic package, we can display the first six cases in the SAT data frame, as shown below.

library(mosaic)
data(SAT)
head(SAT)
##        state expend ratio salary frac verbal math  sat
## 1    Alabama  4.405  17.2  31.14    8    491  538 1029
## 2     Alaska  8.963  17.6  47.95   47    445  489  934
## 3    Arizona  4.778  19.3  32.17   27    448  496  944
## 4   Arkansas  4.459  17.1  28.93    6    482  523 1005
## 5 California  4.992  24.0  41.08   45    417  485  902
## 6   Colorado  5.443  18.4  34.57   29    462  518  980

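Notice that math scores vary from state to state. Suppose we are interested in modeling this variability using students' verbal scores. In other words, can students' verbal scores be used to predict their math scores? In this relationship, there are two types of variables: the explanatory variable, which is used to explain or predict, and the response variable, which is the outcome being explained or predicted.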

In this example, we are using students' verbal scores to explain their math scores. Therefore, verbal scores are the explanatory variable, and math scores are the response variable.

Let's first load ggplot2 to graphically explore the relationship between the two variables.

library(ggplot2)

As a reminder, use a scatterplot to examine the relationship between two quantitative variables. There are three steps to construct a scatterplot:

Additionally, it is standard to plot the explanatory variable on the x-axis, and the response variable on the y-axis.

ggplot(SAT, aes(x=verbal, y=math)) + geom_point() + labs(title = "Plot of Math Scores By Verbal Scores", x = "Verbal", y = "Math")

[Figure: scatterplot of math scores by verbal scores]

Based on the plot, there appears to be a positive linear relationship between the two variables, meaning an increase in verbal scores corresponds to an increase in math scores. This can be summarized with a standard curve (i.e. best-fit line). We can add standard curves to scatterplots with the stat_smooth() command. Here it takes two arguments: method = "lm", which fits a straight line using a linear model, and se = FALSE, which hides the shaded confidence band around the line.

ggplot(SAT, aes(verbal, math)) + geom_point() + labs(title = "Plot of Math Scores By Verbal Scores", x = "Verbal", y = "Math") + stat_smooth(method = "lm", se=FALSE)

[Figure: scatterplot of math scores by verbal scores with the standard curve added]

We build linear models to find equations of standard curves. This requires the lm() function, which has the form lm(response ~ explanatory, data = data_frame).

The first linear model is built below. It is called model1.

model1 = lm(math ~ verbal, data = SAT)

We can use the regressionAnalysis() function to summarize model1 and find the standard curve's equation. This function takes one argument, which is the name of a linear model.

regressionAnalysis(model1)
## $Model.Values
##       terms coefficients p.values
## 1 Intercept        1.828    0.921
## 2    verbal        1.109    0.000
## 
## $R.Squared
## [1] 0.941
## 
## $Equation
## [1] "The equation of the linear model is: math = 1.828 + 1.109 * verbal"
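The regressionAnalysis() function does not appear to be part of base R, so it is presumably a helper supplied with this course. As a rough sketch, the same three pieces of information can be pulled out of any lm() fit with base R's coef() and summary(). The data frame sat_head below is an assumption made for illustration: it holds only the six rows printed by head(SAT), so its estimates will not match the model1 output above.

```r
# Only the six states printed by head(SAT); the full SAT data frame has more rows,
# so these estimates will NOT match the regressionAnalysis(model1) output.
sat_head <- data.frame(
  verbal = c(491, 445, 448, 482, 417, 462),
  math   = c(538, 489, 496, 523, 485, 518)
)

fit <- lm(math ~ verbal, data = sat_head)
fit_summary <- summary(fit)

coefs <- coef(fit)                                  # named vector: "(Intercept)", "verbal"
p_values <- fit_summary$coefficients[, "Pr(>|t|)"]  # one p-value per term
r_squared <- fit_summary$r.squared                  # proportion of variation explained

# Build an equation string in the same style as the output above
equation <- sprintf("math = %.3f + %.3f * verbal",
                    coefs["(Intercept)"], coefs["verbal"])

print(round(coefs, 3))
print(round(p_values, 3))
print(round(r_squared, 3))
print(equation)
```

Running the same calls on model1 itself (for example, coef(model1)) should give the coefficients for the full data set.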

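For now, just focus on the coefficients listed for the 'Intercept' and 'verbal' terms. How do we interpret these two numbers? We write equations of standard curves in the form y = a + bx, where 'a' is the y-intercept (the predicted value of y when x is zero) and 'b' is the slope (the average change in y for a one-unit increase in x).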

In this example, 'a' is the intercept coefficient, and 'b' is the verbal coefficient. Specifically,

math = 1.828 + 1.109*verbal

This equation tells us that for every one-point increase in students' verbal SAT scores, their math scores rise by about 1.1 points on average. The intercept says that math scores are 1.828 on average when verbal scores are zero (though it is impossible to get a 0 on the SAT, so we should not put much effort into interpreting the intercept).

The equation captures the relationship between the two variables, and also can be used to make predictions. For example, model1 predicts that a student with a verbal score of 500 will get a math score of:

math = 1.828 + 1.109*500 = 556.328
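The same arithmetic can be written as a small R function. This is a minimal sketch: the coefficients are copied from the regressionAnalysis(model1) output, and predict_math() is a hypothetical helper name introduced here, not a function from mosaic or the course code.

```r
# Coefficients copied (rounded) from the regressionAnalysis(model1) output
intercept <- 1.828
slope <- 1.109

# Hypothetical helper: evaluates the standard curve at one or more verbal scores
predict_math <- function(verbal) {
  intercept + slope * verbal
}

predict_math(500)               # 1.828 + 1.109 * 500 = 556.328
predict_math(c(450, 500, 550))  # works on a vector of verbal scores too
```

With the fitted model object available, predict(model1, newdata = data.frame(verbal = 500)) performs the same calculation using the unrounded coefficients, so its result may differ slightly in the later decimal places.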

You can visually confirm this by finding 500 on the Verbal axis, drawing up to the standard curve, then left to the Math axis. This is graphically demonstrated with the red lines below.

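[Figure: scatterplot of math scores by verbal scores with the standard curve; red lines mark a verbal score of 500 and its predicted math score]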

We can measure the strength of the relationship with a value called R-Squared (\( \mathbf{R^{2}} \)). \( R^{2} \) can be interpreted as the proportion of the variation in the response variable that is explained by the explanatory variable. \( R^{2} \) is between 0 and 1, with 0 meaning the explanatory variable explains none of the variation in the response values, and 1 meaning the explanatory variable explains all of the variation in the response values. The next three plots are scatterplots of variables with varying \( R^{2} \) values: 1, 0.5, and 0.1.

[Figures: three scatterplots with \( R^{2} \) values of 1, 0.5, and 0.1]
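To make the "proportion of variation explained" idea concrete, the sketch below computes R-squared by hand as one minus the ratio of unexplained variation to total variation, then checks it against the value reported by summary(). The data frame sat_head is an assumption for illustration: it contains only the six rows printed by head(SAT), so the resulting value is not the R-squared of the full-data model.

```r
# Six rows copied from the head(SAT) output; not the full data set
sat_head <- data.frame(
  verbal = c(491, 445, 448, 482, 417, 462),
  math   = c(538, 489, 496, 523, 485, 518)
)

fit <- lm(math ~ verbal, data = sat_head)

# Variation in math left unexplained by the line (sum of squared residuals)
ss_residual <- sum(residuals(fit)^2)
# Total variation in math around its mean
ss_total <- sum((sat_head$math - mean(sat_head$math))^2)

r_squared_manual <- 1 - ss_residual / ss_total
r_squared_builtin <- summary(fit)$r.squared

# With a single explanatory variable, R-squared also equals the squared correlation
r_squared_cor <- cor(sat_head$verbal, sat_head$math)^2

print(c(manual = r_squared_manual,
        builtin = r_squared_builtin,
        cor_squared = r_squared_cor))
```

All three printed values should agree, which is one way to confirm that R-squared is measuring the share of the response's variation captured by the line.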

We can find the \( R^{2} \) value of model1 using the regressionAnalysis() function. Its output is shown again below.

regressionAnalysis(model1)
## $Model.Values
##       terms coefficients p.values
## 1 Intercept        1.828    0.921
## 2    verbal        1.109    0.000
## 
## $R.Squared
## [1] 0.941
## 
## $Equation
## [1] "The equation of the linear model is: math = 1.828 + 1.109 * verbal"

Based on the regressionAnalysis() output, the \( R^{2} \) value is 0.941, so verbal scores explain about 94% of the variation in math scores. This indicates a strong linear relationship.

Regression Analysis: A Second Example

Now, let's look at the relationship between SAT scores and expenditure per pupil from the SAT data frame. First, let's create a scatterplot.

ggplot(SAT, aes(expend, sat)) + geom_point() + labs(title = "SAT Scores by Expenditure Per Pupil", x = "expend", y = "sat") 

[Figure: scatterplot of SAT scores by expenditure per pupil]

It appears that there is a negative relationship between the two variables. We can add a standard curve to summarize this relationship.

ggplot(SAT, aes(expend, sat)) + geom_point() + labs(title = "SAT Scores by Expenditure Per Pupil", x = "expend", y = "sat") + stat_smooth(method = "lm", se = FALSE)

[Figure: scatterplot of SAT scores by expenditure per pupil with the standard curve added]

The downward-sloping standard curve confirms our initial statement about the relationship between the two variables. We create a linear model, called model2, and use the regressionAnalysis() function to summarize the model.

model2 = lm(sat ~ expend, data = SAT)
regressionAnalysis(model2)
## $Model.Values
##       terms coefficients p.values
## 1 Intercept      1089.29    0.000
## 2    expend       -20.89    0.006
## 
## $R.Squared
## [1] 0.145
## 
## $Equation
## [1] "The equation of the linear model is: sat = 1089.2937 + -20.8922 * expend"

According to the regressionAnalysis() output, the standard curve's equation is:

sat = 1089.29 - 20.89*expend

This equation tells us that, on average, a one-thousand-dollar increase in expenditure per pupil is associated with a 20.89-point decrease in SAT scores (note: 'expend' is measured in thousands of US dollars, so one unit of 'expend' is $1,000). The intercept says that SAT scores are 1089 on average when expenditure per pupil is zero (again, we can ignore this figure because zero expenditure is not a realistic value).
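Because 'expend' is recorded in thousands of dollars, it helps to work through the units explicitly. The short sketch below uses the rounded coefficients from the regressionAnalysis(model2) output; the expenditure of $6,000 per pupil is an arbitrary value chosen only for illustration.

```r
# Coefficients copied (rounded) from the regressionAnalysis(model2) output
intercept <- 1089.29
slope_per_thousand <- -20.89   # change in sat per one unit of expend, i.e. per $1,000

# The same slope expressed per single dollar of expenditure
slope_per_dollar <- slope_per_thousand / 1000   # -0.02089 points per $1

# Predicted average SAT score at an illustrative $6,000 per pupil (expend = 6)
predicted_sat <- intercept + slope_per_thousand * 6   # 1089.29 - 125.34 = 963.95

print(slope_per_dollar)
print(predicted_sat)
```

Note that the value passed to the equation is 6, not 6000, since the model was fitted with 'expend' in thousands of dollars.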

Model2's \( R^{2} \) value is 0.145, meaning expenditure per pupil explains only about 14.5% of the variability in SAT scores. Hence, simply fitting a standard curve does not by itself imply a strong relationship between the two variables!

Error Bars

It is also important to remember that standard curves are only estimates. Using model2 from above, we can re-plot the standard curve with a shaded band that reflects the standard error of the fitted line:

ggplot(SAT, aes(expend, sat)) + geom_point() + labs(title = "SAT Scores by Expenditure Per Pupil (With Confidence Band)", x = "expend", y = "sat") + stat_smooth(method = "lm", se = TRUE)

[Figure: scatterplot of SAT scores by expenditure per pupil with the standard curve and its shaded band]

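At this point, you do not need to know the mathematical calculation for the size of the shaded band. Just remember that the standard curve is an estimate: by default, the shaded region is a 95% confidence interval for the average response, so the true average at a given x value could plausibly lie anywhere inside the region at that value. Individual observations can still fall outside it.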
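For readers who want numbers rather than a shaded picture, base R's predict() can report an interval around a fitted value. The sketch below compares the interval for the average response (interval = "confidence") with the wider interval for a single new observation (interval = "prediction"). The data frame sat_head is an assumption for illustration: it uses only the six rows printed by head(SAT), and the expenditure value of 6 is arbitrary, so the results will not match model2.

```r
# Six rows copied from the head(SAT) output; not the full data set
sat_head <- data.frame(
  expend = c(4.405, 8.963, 4.778, 4.459, 4.992, 5.443),
  sat    = c(1029, 934, 944, 1005, 902, 980)
)

fit <- lm(sat ~ expend, data = sat_head)
new_point <- data.frame(expend = 6)   # illustrative value: $6,000 per pupil

# Interval for the AVERAGE sat score at expend = 6 (uncertainty in the line itself)
conf_int <- predict(fit, newdata = new_point, interval = "confidence", level = 0.95)

# Interval for a SINGLE new observation at expend = 6 (line uncertainty plus scatter)
pred_int <- predict(fit, newdata = new_point, interval = "prediction", level = 0.95)

conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]

print(conf_int)
print(pred_int)
print(c(confidence_width = conf_width, prediction_width = pred_width))
```

Both intervals are centered on the same fitted value, but the prediction interval is wider because it also accounts for how far individual points scatter around the line.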

Module Recap

The process of creating scatterplots, drawing standard curves, finding their equations, and checking their \( R^{2} \) values is called regression analysis. Regression analysis derives linear equations between variables of interest. It allows us to predict the value of the response variable based on the value of the explanatory variable using a standard curve. It also provides us with an \( R^{2} \) value, which is used to measure the usefulness of standard curves.