Module 4: Linear Regression

By the end of this module, you should be able to:

Draw standard curves on scatterplots
Use the regressionAnalysis() function to find the equations of standard curves
Make predictions using linear models algebraically
Articulate the utility of R^2
Define regression analysis

Standard Curves and Linear Regression

The mosaic package provides a data frame called SAT with information on educational expenditures and SAT scores in each state. After loading the mosaic package, the first six cases in the SAT data frame are shown below.

library(mosaic)

data(SAT)
head(SAT)

##        state expend ratio salary frac verbal math  sat
## 1    Alabama  4.405  17.2  31.14    8    491  538 1029
## 2     Alaska  8.963  17.6  47.95   47    445  489  934
## 3    Arizona  4.778  19.3  32.17   27    448  496  944
## 4   Arkansas  4.459  17.1  28.93    6    482  523 1005
## 5 California  4.992  24.0  41.08   45    417  485  902
## 6   Colorado  5.443  18.4  34.57   29    462  518  980

Suppose we are interested in the relationship between students’ verbal and math scores. Let’s load ggplot2 and plot these two variables to see if there is a possible relationship between them.

library(ggplot2)

As a reminder, use a scatterplot when examining the relationship between two quantitative variables. There are three steps to make a scatterplot:

* Use the ggplot() command to set the data frame and variables of interest in the plot
* Use the geom_point() command to tell R to make a scatterplot
* Use the labs() command to label your graph. Remember to use the following three arguments:
    * title = ""
    * x = ""
    * y = ""

ggplot(SAT, aes(x=verbal, y=math)) + geom_point() + labs(title = "Plot of Math Scores By Verbal Scores", x = "Verbal", y = "Math")

[Figure: scatterplot of math scores against verbal scores]

Based on the plot, there appears to be an approximately linear relationship between the two variables, meaning we could draw a best-fit line to summarize the data. We can draw **least squares best-fit lines** on graphs by adding one more function, stat_smooth(), to the commands used to build our initial scatterplot. In biology, these are also called standard curves. We use two arguments to the stat_smooth() command:

* method = "lm" tells R to make the standard curve linear (which is always the case in this course)
* se = FALSE overrides a default: it turns off the shaded confidence band that stat_smooth() would otherwise draw around the line. Don't worry about the meaning of the band, but remember to include the argument!

ggplot(SAT, aes(verbal, math)) + geom_point() + labs(title = "Plot of Math Scores By Verbal Scores", x = "Verbal", y = "Math") + stat_smooth(method = "lm", se=FALSE)

[Figure: the same scatterplot with the least squares best-fit line added]
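To give a feel for what "least squares" means, here is a minimal sketch. It assumes only the six states printed by head(SAT) above, so its numbers will not match the full data set. The names sat6, a, b, and sse are made up for this illustration. The least squares line is the one that makes the sum of squared vertical distances between the points and the line as small as possible.

```r
# First six states from head(SAT) above (a small subset, for illustration only)
sat6 = data.frame(
  verbal = c(491, 445, 448, 482, 417, 462),
  math   = c(538, 489, 496, 523, 485, 518)
)

# Closed-form least squares estimates for the line y = a + b*x
b = cov(sat6$verbal, sat6$math) / var(sat6$verbal)   # slope
a = mean(sat6$math) - b * mean(sat6$verbal)          # intercept

# Sum of squared vertical distances from the points to a given line
sse = function(intercept, slope) {
  sum((sat6$math - (intercept + slope * sat6$verbal))^2)
}

sse(a, b)          # the least squares line
sse(a, b + 0.1)    # a slightly steeper line: larger sum
sse(a + 5, b)      # a line shifted upward: larger sum
```

Any change to the slope or the intercept makes the sum of squares larger, which is why this particular line is called the best fit.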

We build linear models to find the equations of standard curves. We create these models using the lm() function, which has the syntax:

model_name = lm(response_variable ~ explanatory_variable, data = data_name)

The first linear model is built below. It is called model1.

model1 = lm(math ~ verbal, data = SAT)

‘math’ is the response variable because it is on the y-axis in our graph, which means we are using verbal scores to explain students’ math scores. We can use the regressionAnalysis() function to find the standard curve’s equation. This function takes one argument, which is the name of a linear model.

regressionAnalysis(model1)

##      [,1]             [,2]                  
## [1,] "Intercept coef" "1.82796706977843"    
## [2,] "verbal coef"    "1.108964503063"      
## [3,] "R-Squared"      "0.941396774893267"   
## [4,] "verbal p-value" "3.17503562695812e-31"
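regressionAnalysis() is the function used in this course. As a cross-check, the same four numbers can be read from the model with base R functions. The sketch below assumes the mosaic package is installed and rebuilds model1 so that it runs on its own.

```r
library(mosaic)   # provides the SAT data frame, as earlier in this module
data(SAT)

model1 = lm(math ~ verbal, data = SAT)

coef(model1)                                          # intercept and verbal coefficients
summary(model1)$r.squared                             # R-squared
summary(model1)$coefficients["verbal", "Pr(>|t|)"]    # p-value for verbal
```

The values should agree with the regressionAnalysis() output above, which reports them as text rather than as numbers.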

For now, just focus on the 'Intercept coef' and 'verbal coef' values. How do we interpret these two numbers? We can go back to writing linear equations in the form y = a + bx. In this example, 'a' is the intercept coefficient and 'b' is the verbal coefficient. In other words, the verbal coefficient is the slope of the standard curve. After rounding to three decimal places, the equation of the standard curve is:

math = 1.828 + 1.109*verbal

This equation tells us that for every one-point increase in verbal SAT score, the model predicts an increase of approximately 1.1 points in math score.

For example, our model predicts that a student with a verbal score of 500 will get a math score of:

math = 1.828 + 1.109*500 = 556.328

You can visually confirm this by finding 500 on the Verbal axis, drawing up to the standard curve, then left to the Math axis. The red lines in the plot below show this.

[Figure: scatterplot with the standard curve; red lines run up from Verbal = 500 to the curve and across to the Math axis]
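The same prediction can be done in R. This is a minimal sketch that copies the coefficients from the regressionAnalysis(model1) output above. The function name predict_math is made up for this example.

```r
# Coefficients copied from the regressionAnalysis(model1) output above
intercept = 1.82796706977843
slope     = 1.108964503063

# Predicted math score for a given verbal score: math = a + b*verbal
predict_math = function(verbal) intercept + slope * verbal

predict_math(500)                 # about 556.31 with the unrounded coefficients
1.828 + 1.109 * 500               # 556.328 with the rounded coefficients
predict_math(c(450, 500, 550))    # several verbal scores at once
```

The two answers differ slightly only because the hand calculation rounds the coefficients first. If model1 is in your workspace, predict(model1, newdata = data.frame(verbal = 500)) should give the unrounded answer as well.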

Although the relationship between these two variables is linear, this might not be true for the relationships between other variables. Therefore, we have a value called R-Squared (R^2) that tells us the proportion of the variation of the response variable that is explained by the explanatory variable. R^2 is between 0 and 1, with 0 meaning the explanatory variable(s) explain none of the variation in the response values, and 1 meaning the explanatory variable(s) explain all of the variation in the response values.
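To make "proportion of the variation explained" concrete, R^2 can be computed by hand as 1 minus (unexplained variation / total variation). The sketch below assumes only the six states printed by head(SAT) earlier, so its R^2 will not equal the value for the full data set. The names sat6, fit, ss_total, ss_residual, and r2_by_hand are made up for this illustration.

```r
# First six states from head(SAT) (a small subset, for illustration only)
sat6 = data.frame(
  verbal = c(491, 445, 448, 482, 417, 462),
  math   = c(538, 489, 496, 523, 485, 518)
)

fit = lm(math ~ verbal, data = sat6)

ss_total    = sum((sat6$math - mean(sat6$math))^2)   # total variation in math
ss_residual = sum(residuals(fit)^2)                  # variation the line leaves unexplained

r2_by_hand = 1 - ss_residual / ss_total
r2_by_hand
summary(fit)$r.squared    # the same number, as reported by R
```

If the line passed through every point, ss_residual would be 0 and R^2 would be 1. If the line explained nothing, ss_residual would equal ss_total and R^2 would be 0.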

The regressionAnalysis() function also returns the R-squared value. Its output is shown again below.

regressionAnalysis(model1)

##      [,1]             [,2]                  
## [1,] "Intercept coef" "1.82796706977843"    
## [2,] "verbal coef"    "1.108964503063"      
## [3,] "R-Squared"      "0.941396774893267"   
## [4,] "verbal p-value" "3.17503562695812e-31"

Based on the regressionAnalysis() output, verbal scores explain about 94% of the variation in math scores, since the R^2 value is 0.94.

Regression Analysis: A Second Example

Now, let’s look at the relationship between SAT scores and expenditure per pupil from the SAT data frame. First, let’s create a plot with a best-fit line.

ggplot(SAT, aes(expend, sat)) + geom_point() + labs(title = "SAT Scores by Expenditure Per Pupil", x = "expend", y = "sat") + stat_smooth(method = "lm", se = FALSE)

[Figure: scatterplot of SAT scores against expenditure per pupil with a best-fit line]

Although we have drawn a standard curve on this plot, the line does not seem to accurately model the relationship between the two variables. We can build a linear model and find its R^2 value to confirm this.

model2 = lm(sat ~ expend, data = SAT)
regressionAnalysis(model2)

##      [,1]             [,2]                 
## [1,] "Intercept coef" "1089.29371775022"   
## [2,] "expend coef"    "-20.8921737146583"  
## [3,] "R-Squared"      "0.144808410884018"  
## [4,] "expend p-value" "0.00640796491636453"
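Reading the output the same way as for model1, the standard curve is approximately sat = 1089.294 - 20.892*expend. The short sketch below copies those coefficients from the output above and uses them for a prediction. The function name predict_sat and the value expend = 5 are chosen only for illustration.

```r
# Coefficients copied from the regressionAnalysis(model2) output above
intercept = 1089.29371775022
slope     = -20.8921737146583

# Predicted SAT score for a given expenditure value: sat = a + b*expend
predict_sat = function(expend) intercept + slope * expend

predict_sat(5)                    # about 984.8
predict_sat(6) - predict_sat(5)   # equals the slope, about -20.9
```

The negative slope means the line predicts a lower SAT score for each one-unit increase in expend. How far to trust that line is a separate question, which R^2 helps answer.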

model2’s R^2 value is 0.145, much lower than model1’s R^2 of 0.941. In other words, expenditure explains only about 14% of the variation in SAT scores. Hence, simply drawing a standard curve does not necessarily imply there is a strong linear relationship between the two variables!

Module Recap

The process of creating scatterplots, drawing standard curves, finding their equations, and checking their R^2 values is called regression analysis. Regression analysis derives linear equations between variables of interest. It allows us to predict the value of the response variable from the value of the explanatory variable using a standard curve. It also provides an R^2 value, which measures how much of the variation in the response variable the best-fit line explains.