The mosaic package includes a data frame called SAT with information on educational expenditures and SAT scores in each state. After loading the mosaic package, we can display the first six cases in the SAT data frame, as shown below.
library(mosaic)
data(SAT)
head(SAT)
##        state expend ratio salary frac verbal math  sat
## 1    Alabama  4.405  17.2  31.14    8    491  538 1029
## 2     Alaska  8.963  17.6  47.95   47    445  489  934
## 3    Arizona  4.778  19.3  32.17   27    448  496  944
## 4   Arkansas  4.459  17.1  28.93    6    482  523 1005
## 5 California  4.992  24.0  41.08   45    417  485  902
## 6   Colorado  5.443  18.4  34.57   29    462  518  980
Suppose we are interested in the relationship between students’ verbal and math scores. Let’s load ggplot2 and plot these two variables to see if there is a possible relationship between them.
library(ggplot2)
As a reminder, use a scatterplot when examining the relationship between two quantitative variables. There are three steps to make a scatterplot:
* Use the ggplot() command to set the data frame and variables of interest in the plot
* Use the geom_point() command to tell R to make a scatterplot
* Use the labs() command to label your graph. Remember to use the following three arguments:
  * title = ""
  * x = ""
  * y = ""
ggplot(SAT, aes(x=verbal, y=math)) + geom_point() + labs(title = "Plot of Math Scores By Verbal Scores", x = "Verbal", y = "Math")
Based on the plot, there appears to be an approximately linear relationship between the two variables, meaning we could draw a best-fit line to summarize the data. We can draw **least squares best-fit lines** on graphs by adding an additional function, stat_smooth(), to the commands used to build our initial scatterplot. In Biology, these are also called standard curves. There are two arguments to the stat_smooth() command:
* method = "lm" tells R to make the standard curve linear (which is always the case in this course)
* se = FALSE overrides a default (the shaded confidence band that is otherwise drawn around the line). Don’t worry about the details, but remember to include it!
ggplot(SAT, aes(verbal, math)) + geom_point() + labs(title = "Plot of Math Scores By Verbal Scores", x = "Verbal", y = "Math") + stat_smooth(method = "lm", se=FALSE)
We build linear models to find the equations of standard curves. We create these models using the lm() function, which has the syntax:
lm(response ~ explanatory, data = data_frame)
The first linear model is built below. It is called model1.
model1 = lm(math ~ verbal, data = SAT)
‘math’ is the response variable because it is on the y-axis in our graph, which means we are using verbal scores to explain students’ math scores. We can use the regressionAnalysis() function to find the standard curve’s equation. This function takes one argument, which is the name of a linear model.
regressionAnalysis(model1)
##      [,1]             [,2]
## [1,] "Intercept coef" "1.82796706977843"
## [2,] "verbal coef"    "1.108964503063"
## [3,] "R-Squared"      "0.941396774893267"
## [4,] "verbal p-value" "3.17503562695812e-31"
For now, just focus on the ‘Intercept coef’ and ‘verbal coef’ values. How do we interpret these two numbers? Recall that a linear equation can be written in the form y = a + bx. In this example, ‘a’ is the intercept coefficient and ‘b’ is the coefficient on the verbal variable. In other words, the verbal coefficient is the slope of the standard curve, whose equation is:
math = 1.828 + 1.109*verbal
This equation tells us that for every one-unit increase in verbal SAT score, the predicted math score rises by approximately 1.1 points.
For example, our model predicts that a student with a verbal score of 500 will get a math score of:
math = 1.828 + 1.109*500, which equals 556.328. You can visually confirm this by finding 500 on the Verbal axis, drawing up to the standard curve, then left to the Math axis. This is graphically demonstrated with the red lines below.
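The same arithmetic can be checked in R. This is a minimal sketch that uses the rounded coefficients from the regressionAnalysis(model1) output; predict_math() is a hypothetical helper introduced only for illustration and is not part of mosaic or ggplot2.

```r
# Rounded coefficients copied from the regressionAnalysis(model1) output
intercept <- 1.828
slope <- 1.109

# Hypothetical helper: predicted math score for a given verbal score,
# using the equation math = intercept + slope * verbal
predict_math <- function(verbal) {
  intercept + slope * verbal
}

# Prediction for a verbal score of 500
predicted <- predict_math(500)
print(predicted)  # 556.328

# Raising verbal by one unit raises the prediction by exactly the slope
one_unit_change <- predict_math(501) - predict_math(500)
print(one_unit_change)
```

If model1 is still in your session, predict(model1, newdata = data.frame(verbal = 500)) performs the same calculation with the unrounded coefficients, so its result will differ slightly from the rounded one.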
Although the relationship between these two variables is approximately linear, this might not be true for other pairs of variables. To measure how well a line fits, we use a value called R-Squared (R^2), which is the proportion of the variation in the response variable that is explained by the explanatory variable. R^2 is between 0 and 1: 0 means the explanatory variable(s) explain none of the variation in the response values, and 1 means the explanatory variable(s) explain all of it.
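To make the definition concrete, R^2 can be computed by hand as 1 minus the ratio of the unexplained (residual) variation to the total variation in the response. The sketch below is an illustration under assumptions: sat_sample contains only the six rows printed by head(SAT) earlier, so the value it produces is not the full-data R^2. With the full data frame loaded, the same steps apply with data = SAT.

```r
# Six rows copied from the head(SAT) output earlier in this tutorial
# (an illustrative subset, not the full SAT data frame)
sat_sample <- data.frame(
  verbal = c(491, 445, 448, 482, 417, 462),
  math   = c(538, 489, 496, 523, 485, 518)
)

# Fit the least squares line: math is the response, verbal is explanatory
fit <- lm(math ~ verbal, data = sat_sample)

# Unexplained variation: squared distances from the points to the line
ss_resid <- sum(residuals(fit)^2)

# Total variation: squared distances from the points to the mean of math
ss_total <- sum((sat_sample$math - mean(sat_sample$math))^2)

# R^2 = 1 - (unexplained variation / total variation)
r2_by_hand <- 1 - ss_resid / ss_total

# The value R stores in the model summary
r2_from_lm <- summary(fit)$r.squared

print(r2_by_hand)
print(r2_from_lm)
```

The two printed values agree, which shows that the R^2 reported for a linear model is exactly this proportion of explained variation.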
The regressionAnalysis() function also returns the R-squared value. Its output is shown again below.
regressionAnalysis(model1)
##      [,1]             [,2]
## [1,] "Intercept coef" "1.82796706977843"
## [2,] "verbal coef"    "1.108964503063"
## [3,] "R-Squared"      "0.941396774893267"
## [4,] "verbal p-value" "3.17503562695812e-31"
Based on the regressionAnalysis() output, verbal scores explain about 94% of the variation in math scores, since the R^2 value is 0.94.
Now, let’s look at the relationship between SAT scores and expenditure per pupil from the SAT data frame. First, let’s create a plot with a best-fit line.
ggplot(SAT, aes(expend, sat)) + geom_point() + labs(title = "SAT Scores by Expenditure Per Pupil", x = "expend", y = "sat") + stat_smooth(method = "lm", se = TRUE)
Although we have drawn a standard curve on this plot, the line does not seem to accurately model the relationship between the two variables. We can build a linear model and find its R^2 value to confirm this.
model2 = lm(sat ~ expend, data = SAT)
regressionAnalysis(model2)
##      [,1]             [,2]
## [1,] "Intercept coef" "1089.29371775022"
## [2,] "expend coef"    "-20.8921737146583"
## [3,] "R-Squared"      "0.144808410884018"
## [4,] "expend p-value" "0.00640796491636453"
model2’s R^2 value is 0.145, a much lower number than model1’s R^2: expenditure explains only about 14% of the variation in SAT scores. Hence, simply drawing a standard curve does not necessarily mean the line describes the relationship between the two variables well!
The process of creating scatterplots, drawing standard curves, finding their equations, and checking their R^2 values is called regression analysis. Regression analysis derives linear equations between variables of interest. It allows us to predict the value of the response variable based on the value of the explanatory variable using a standard curve. It also provides us with an R^2 value, which measures how much of the variation in the response variable the best-fit line explains.
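The numeric steps of a regression analysis can also be carried out with base R alone, which may help if regressionAnalysis() is not available in your session (it appears to be a helper supplied with this course, not a base R function). The sketch below is an illustration under assumptions: sat_sample contains only the six rows printed by head(SAT) earlier, so its numbers will differ from the full-data output shown in this tutorial.

```r
# Six rows copied from the head(SAT) output earlier in this tutorial
# (an illustrative subset, not the full SAT data frame)
sat_sample <- data.frame(
  state  = c("Alabama", "Alaska", "Arizona",
             "Arkansas", "California", "Colorado"),
  verbal = c(491, 445, 448, 482, 417, 462),
  math   = c(538, 489, 496, 523, 485, 518)
)

# Fit the least squares line with math as the response variable
fit <- lm(math ~ verbal, data = sat_sample)

# Intercept and slope of the best-fit line
coefs <- coef(fit)

# R^2 and the slope's p-value are stored in the model summary
fit_summary <- summary(fit)
r_squared <- fit_summary$r.squared
p_value <- fit_summary$coefficients["verbal", "Pr(>|t|)"]

print(coefs)
print(r_squared)
print(p_value)
```

With the full data loaded, replacing data = sat_sample with data = SAT should give the same four quantities that regressionAnalysis(model1) reports.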