Module: Regression and Logarithmic Relationships

By the end of this module, you should be able to:

Logarithmic Axes

Previously, you learned how to do regression analysis, which is the process of answering the question, “What is the relationship between the X and Y variables?” This process consists of graphing standard curves, finding their equations, and checking R-squared values.

But sometimes, the relationship between the response and explanatory variables is not a linear one.

Let us consider the 'lifedata' data frame, which contains information on countries' GNPs, birth rates, and death rates for different demographics. In this module, we will use two variables:

First, load the mosaic and ggplot2 packages:

library(mosaic)
library(ggplot2)

Now, load the data frame:

lifedata = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdE9sUmFmaVNjRzFmdGUzb3JWZDY4TlE&output=csv")
head(lifedata)
##   Birth Death InfantDeath MaleLife FemaleLife  GNP   Region
## 1  24.7   5.7        30.8     69.6       75.5  600 East.Eur
## 2  12.5  11.9        14.4     68.3       74.7 2250 East.Eur
## 3  13.4  11.7        11.3     71.8       77.7 2980 East.Eur
## 4  12.0  12.4         7.6     69.8       75.9   NA East.Eur
## 5  11.6  13.4        14.8     65.4       73.8 2780 East.Eur
## 6  14.3  10.2        16.0     67.2       75.7 1690 East.Eur
##             Country
## 1           Albania
## 2          Bulgaria
## 3    Czechoslovakia
## 4 Former_E._Germany
## 5           Hungary
## 6            Poland

To remind you of a linear relationship between variables, let's start with a plot of female life expectancy by male life expectancy.

ggplot(lifedata, aes(MaleLife, FemaleLife)) + geom_point() + labs(title = "Female Life Expectancy by Male Life Expectancy", x = "Male Life Expectancy", y = "Female Life Expectancy") + stat_smooth(method = "lm", se = FALSE)

[Plot: Female Life Expectancy by Male Life Expectancy, with a fitted regression line]

However, not all variables are linearly related. For example, let's look at a scatterplot of female life expectancy by GNP.

ggplot(lifedata, aes(GNP, FemaleLife)) + geom_point() + labs(title = "Female Life Expectancy by GNP", x = "GNP", y = "Female Life Expectancy") + stat_smooth(method = "lm", se = FALSE)
## Warning: Removed 6 rows containing missing values (stat_smooth).
## Warning: Removed 6 rows containing missing values (geom_point).

[Plot: Female Life Expectancy by GNP, with a fitted regression line]

Based on the above plot, there is not a linear relationship between countries' female life expectancies and their GNPs. Notice that the line plotted on the graph does not accurately model the variables' relationship. To fit a linear model, we can change the scale of one axis so that the relationship becomes linear.

How do we know what the scale should be? Again, look at the graph. It shows that as GNP rises, female life expectancy increases. However, the rate of increase seems to level off once GNP reaches approximately $7500. This pattern, a steep rise followed by flattening, suggests that the relationship between the two variables is logarithmic (if you don't remember what this means, search 'logarithmic relationship' in Google). Thus, if you convert the x-axis (GNP) to a logarithmic scale, you should observe a roughly linear relationship between the variables. The R code to do this is below; note that log() in R computes the natural logarithm.

ggplot(lifedata, aes(log(GNP), FemaleLife)) + geom_point() + labs(title = "Female Life Expectancy by log(GNP)", x = "log(GNP)", y = "Female Life Expectancy") + stat_smooth(method = "lm", se = FALSE)
## Warning: Removed 6 rows containing missing values (stat_smooth).
## Warning: Removed 6 rows containing missing values (geom_point).

[Plot: Female Life Expectancy by log(GNP), with a fitted regression line]
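
To see why taking the logarithm straightens out this kind of curve, here is a minimal sketch on made-up data (the numbers are synthetic and are not taken from lifedata). It generates a response that follows a + b * log(x) plus noise, then compares the R-squared of a straight-line fit on the original scale with one on the log scale.

```r
# Synthetic data, invented for illustration only (not the lifedata values)
set.seed(1)
x <- seq(100, 30000, length.out = 200)
y <- 25 + 5.5 * log(x) + rnorm(length(x), sd = 1)

fit_raw <- lm(y ~ x)        # straight line against x on its original scale
fit_log <- lm(y ~ log(x))   # straight line against log(x)

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
c(raw = r2_raw, logged = r2_log)
```

When the underlying relationship really is logarithmic, the fit on log(x) should have the clearly higher R-squared, which mirrors the difference between the GNP and log(GNP) plots.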

The relationship between FemaleLife and log(GNP) appears to be linear, since the standard curve now follows the data closely. Therefore, a straight-line standard curve is an appropriate fit for this scatterplot.

We can now build a linear model to find the equation of the standard curve. We will call our model 'lifemod1', and then use the regressionAnalysis() function to obtain the coefficients of the model equation.

lifemod1 = lm(FemaleLife ~ log(GNP), data = lifedata)
regressionAnalysis(lifemod1)
##      [,1]               [,2]                  
## [1,] "Intercept coef"   "24.170500732192"    
## [2,] "log(GNP) coef"    "5.57267327414255"    
## [3,] "R-Squared"        "0.67791975431029"    
## [4,] "log(GNP) p-value" "1.29598121770242e-23"
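
regressionAnalysis() appears to be a helper supplied with this course rather than a standard R function, so it may help to see where such numbers come from. The sketch below is an assumption about what it reports, not its actual implementation: it pulls the same four quantities out of an lm() fit using base R only. To stay self-contained it uses just the six rows printed by head(lifedata) above, so its values will not match the full-data output.

```r
# Only the six rows shown by head(lifedata); the full data set has more rows
toy <- data.frame(
  FemaleLife = c(75.5, 74.7, 77.7, 75.9, 73.8, 75.7),
  GNP        = c(600, 2250, 2980, NA, 2780, 1690)
)

# lm() drops rows with a missing GNP by default (na.action = na.omit)
toy_mod <- lm(FemaleLife ~ log(GNP), data = toy)
toy_sum <- summary(toy_mod)

intercept <- coef(toy_mod)[["(Intercept)"]]
slope     <- coef(toy_mod)[["log(GNP)"]]
r_squared <- toy_sum$r.squared
p_value   <- toy_sum$coefficients["log(GNP)", "Pr(>|t|)"]

nobs(toy_mod)  # number of rows actually used in the fit
```

The row with a missing GNP is left out of the fit, which is the same reason the plots above warn about removed rows.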

According to the regressionAnalysis() function, the model equation is:

FemaleLife = 24.2 + 5.6 * log(GNP)
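
As a quick check of how to read this equation, the sketch below plugs GNP values into it using the rounded coefficients. The function name predict_female_life is made up for this illustration. Because GNP enters through a logarithm, doubling GNP always adds the same amount, 5.6 * log(2), to the predicted life expectancy, no matter where you start.

```r
b0 <- 24.2  # intercept, rounded from the regressionAnalysis() output
b1 <- 5.6   # log(GNP) coefficient, rounded from the same output

# Hypothetical helper: predicted female life expectancy for a given GNP
predict_female_life <- function(gnp) b0 + b1 * log(gnp)

predict_female_life(1000)

# Gain from doubling GNP, starting from two different levels
gain_low  <- predict_female_life(2000)  - predict_female_life(1000)
gain_high <- predict_female_life(20000) - predict_female_life(10000)
c(gain_low, gain_high, b1 * log(2))  # all three are the same quantity
```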

The inversePredict() function

We can also use the inversePredict() function to predict the value of the explanatory variable from a value of the response variable (which is why it is called an inverse prediction). The function is called with the fitted model and the value of the response variable.

For example, to find the log(GNP) of a country with a female life expectancy of 70, we call inversePredict() and get about 8.22:

inversePredict(lifemod1, 70)
## $inversePrediction
## [1] 8.224
##
## $Graph
## Error: object 'mod' not found
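
The inverse prediction itself is just the model equation solved for log(GNP). The sketch below redoes that arithmetic by hand with the coefficients reported by regressionAnalysis(), and then uses exp() to convert the result back to the original GNP scale. It does not use or reproduce inversePredict() itself; the variable names are chosen for this illustration.

```r
b0 <- 24.170500732192    # intercept from the regressionAnalysis() output
b1 <- 5.57267327414255   # log(GNP) coefficient from the same output

target_life <- 70

# Solve FemaleLife = b0 + b1 * log(GNP) for log(GNP)
log_gnp <- (target_life - b0) / b1
round(log_gnp, 3)

# Undo the natural log to get GNP in its original units
gnp <- exp(log_gnp)
gnp
```

The first value should agree with the 8.224 reported by inversePredict(); the second is the GNP that corresponds to it, roughly 3,700.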