Let's look at the dataset first:
gasdata = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0AnFamthOzwySdEotQWVNREZQeFVXZEItS2JtYVQzTmc&output=csv")
head(gasdata)
## month year price therms hdd address renovated
## 1 6 2005 35.21 21 0 a no
## 2 7 2005 37.37 21 0 a no
## 3 8 2005 36.93 21 3 a no
## 4 9 2005 62.36 39 61 a no
## 5 10 2005 184.15 120 416 a no
## 6 11 2005 433.35 286 845 a no
This dataset is the Macalester natural gas data. A few years back, Macalester invested in insulating three campus-owned houses (addresses a, b, and c). Monthly data on energy use and other variables are available for each location.
For now we will look at only two variables: price and renovated.
price is monthly money spent on heating
renovated is a categorical variable indicating whether the location had already been renovated/insulated or not (yes/no)
We want to see whether the mean prices are different between the renovated group and the non-renovated group.
We can first draw a boxplot to get a rough picture of whether prices differ between the two groups. (A boxplot marks each group's median rather than its mean, so this is only a visual check.)
ggplot(gasdata, aes(x=renovated, y=price)) + geom_boxplot()
## Warning: Removed 12 rows containing non-finite values (stat_boxplot).
Based on the boxplot, there is not much difference in price between the renovated and non-renovated groups. The renovated group has a slightly higher price, i.e. slightly more heating is used in the renovated group than in the non-renovated group.
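The warning above reports that 12 rows were dropped because their price was not finite. As a rough sketch of how such rows can be counted before plotting, the example below uses a made-up vector; toy_price and its values are hypothetical and not taken from gasdata.

```r
# Made-up prices with two missing entries (illustration only)
toy_price <- c(35.2, NA, 62.4, NA, 184.2)

# Count the values that are NA, NaN, or infinite
n_dropped <- sum(!is.finite(toy_price))
n_dropped

# Summaries need na.rm = TRUE to skip the missing entries
mean(toy_price, na.rm = TRUE)
```

On the real data, the same check would be sum(!is.finite(gasdata$price)).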
To examine the result and check whether the difference is significant, we can use the two-sample t-test.
Before we run the t-test, we need to know how to use the subset command. In a t-test, we need to separate the two groups first using the command subset() and then compare the two groups. The general form of the subset command is subset(Your_Dataset_Name, Categorical_Variable=="Group_Name").
For example, in the gasdata, if we want to separate different addresses a, b, and c, we can build 3 subsets, called add_a, add_b, and add_c.
add_a = subset(gasdata, address=="a")
add_b = subset(gasdata, address=="b")
add_c = subset(gasdata, address=="c")
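To make the behaviour of subset() concrete, here is a minimal sketch on a made-up data frame. The name toy and all of its values are invented for illustration and are not the Macalester data.

```r
# Toy data frame (made-up values, not the Macalester data)
toy <- data.frame(
  address = c("a", "a", "b", "b", "c", "c"),
  price   = c(100, 120, 80, 90, 60, 70)
)

# Keep only the rows where address equals "b"
toy_b <- subset(toy, address == "b")

# The same rows selected with logical indexing
toy_b2 <- toy[toy$address == "b", ]

nrow(toy_b)    # number of rows kept
toy_b$price    # prices for address b only
```

Note that the text inside the quotes must match the value in the data exactly, including upper and lower case.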
Now let's continue with the t-test example.
# First use subset() to separate two groups
renov=subset(gasdata, renovated=="yes")
norenov=subset(gasdata, renovated=="no")
# Then run a t-test to compare the mean prices between the two groups
mytest = t.test(renov$price, norenov$price)
# Find the p-value of the t-test.
mytest$p.value
## [1] 0.2166
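A t-test has two hypotheses:
Null hypothesis: the mean prices of the renovated group and the non-renovated group are the same.
Alternative hypothesis: the mean prices of the two groups are different.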
Since the p-value is 0.2166 > 0.05, we do not reject the null hypothesis that the mean prices of the renovated group and the non-renovated group are the same. We therefore do not find a significant effect of renovated on price, i.e. on the amount of heating we use.
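The same kind of comparison can also be written with the formula interface of t.test(), which avoids building the two subsets by hand. The sketch below uses invented numbers (toy and its values are assumptions, not the gas data). By default t.test() runs Welch's two-sample test, which does not assume equal variances in the two groups.

```r
# Made-up data for illustration only
toy <- data.frame(
  renovated = rep(c("no", "yes"), each = 6),
  price     = c(35, 62, 184, 433, 510, 300,
                30, 55, 150, 390, 470, 280)
)

# Formula interface: compare mean price across the two levels of renovated
toy_test <- t.test(price ~ renovated, data = toy)

toy_test$p.value    # p-value of the two-sided test
toy_test$estimate   # the two group means
toy_test$conf.int   # 95% confidence interval for the difference in means

# Equivalent two-vector form, as used in the text above
toy_test2 <- t.test(toy$price[toy$renovated == "yes"],
                    toy$price[toy$renovated == "no"])
```

Both forms give the same p-value; only the sign of the estimated difference depends on which group is listed first.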
We can use a two-sample t-test to see whether the means of two groups A and B are different. However, if we have three groups, the two-sample t-test is no longer the best choice. We will instead use an ANOVA test to see whether there is any difference among the three group means.
Let's keep using gasdata, now looking only at two variables: price and address.
price is monthly money spent on heating
address is a categorical variable specifying location a, b, or c
In this case, we want to see whether the mean prices are the same in different addresses.
Again, we can draw a boxplot first to see whether the prices are the same in different addresses.
ggplot(gasdata,aes(x=address, y=price)) + geom_boxplot()
## Warning: Removed 12 rows containing non-finite values (stat_boxplot).
The boxplot suggests that the prices at addresses a, b, and c are not the same. We can use an ANOVA test to check this.
# Build a model first to explore the relationship between price and address, call it mod1
mod1 = lm(price~address, data=gasdata)
# Run an ANOVA test using the model we built
anova(mod1)
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## address 2 145954 72977 3.33 0.04 *
## Residuals 96 2106099 21939
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Find the p-value of the ANOVA test (Pr(>F) = 0.04*)
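The ANOVA test also has two hypotheses:
Null hypothesis: the mean prices at addresses a, b, and c are all the same.
Alternative hypothesis: at least one address has a mean price different from the others.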
Since the p-value is 0.04 < 0.05, we reject the null hypothesis that the mean prices at addresses a, b, and c are all the same. We conclude that at least one address has a mean price significantly different from the others, that is, the amount of heating used at that address is significantly different from the heating used at the other addresses.
But how can we tell exactly which addresses differ? Is there a significant difference between addresses a and b, or b and c, or all of them? To answer this question, we need to do some further tests using the TukeyHSD function.
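The p-value can also be pulled out of the ANOVA table in code rather than read off the printout. The sketch below does this on a made-up data frame; toy, toy_mod, and the numbers are invented for illustration and will not reproduce the table above.

```r
# Made-up prices for three addresses (illustration only)
toy <- data.frame(
  address = rep(c("a", "b", "c"), each = 5),
  price   = c(200, 220, 250, 210, 230,
              150, 160, 170, 140, 155,
              120, 130, 125, 135, 110)
)

toy_mod <- lm(price ~ address, data = toy)
toy_tab <- anova(toy_mod)

# The ANOVA table behaves like a data frame,
# so a cell can be selected by row and column name
toy_p <- toy_tab["address", "Pr(>F)"]
toy_p

# Decision at the 5% level
toy_p < 0.05
```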
TukeyHSD means Tukey's Honest Significant Difference method. The function TukeyHSD() creates a set of confidence intervals on the differences between means with the specified family-wise probability of coverage[1].
The general command is TukeyHSD(YOUR_MODEL_NAME).
TukeyHSD(mod1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = x)
##
## $address
## diff lwr upr p adj
## b-a -71.49 -158.3 15.320 0.1277
## c-a -88.67 -175.5 -1.867 0.0441
## c-b -17.19 -104.0 69.619 0.8849
The Tukey summary table shows us the difference between each pair of groups, the 95% confidence interval for that difference, and the p-value of each pairwise comparison.
The first column of the result table is the difference in mean price between each pair of addresses. Because both b-a and c-a are negative, the mean price at address a is the highest, which means the amount of heating used at address a is the highest.
The lwr and upr columns are the lower and upper bounds of the 95% confidence interval for each difference. You do not need to worry much about this.
The p adj column is the p-value, adjusted for making several comparisons at once. If the p-value < 0.05, then we say the mean prices at the two addresses are significantly different.
b-a in the first row means comparing b to a. The diff is -71.49, which means the mean price at address b is 71.49 lower than the mean price at address a. The p-value is 0.1277, which means we cannot reject the null hypothesis. Therefore we do not have evidence that the mean prices at addresses a and b differ.
c-a in the second row means comparing c to a. The diff is -88.67, which means the mean price at address c is 88.67 lower than the mean price at address a. The p-value is 0.0441, which means we can reject the null hypothesis and conclude that the mean prices at addresses a and c are significantly different.
c-b in the third row means comparing c to b. The diff is -17.19, which means the mean price at address c is 17.19 lower than the mean price at address b. The p-value is 0.8849, which means we cannot reject the null hypothesis. Therefore we do not have evidence that the mean prices at addresses b and c differ.
Looking at all the pairwise differences, the mean price/energy use at address a is the highest, although only the difference between a and c is statistically significant.
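In base R, TukeyHSD() has a method for models fitted with aov(). The call on an lm model shown above presumably relies on a method supplied by a loaded package, which is an assumption since the source does not list the packages in use. The sketch below therefore fits the model with aov() and shows how to pick out the significant pairs in code. The data are made up, so the numbers differ from the table above.

```r
# Made-up prices for three addresses (illustration only)
toy <- data.frame(
  address = factor(rep(c("a", "b", "c"), each = 5)),
  price   = c(200, 220, 250, 210, 230,
              150, 160, 170, 140, 155,
              120, 130, 125, 135, 110)
)

# Fit the one-way ANOVA with aov(), then run Tukey's HSD on it
toy_aov   <- aov(price ~ address, data = toy)
toy_tukey <- TukeyHSD(toy_aov)

# The result for each factor is a matrix with columns diff, lwr, upr, p adj
toy_pairs <- toy_tukey$address
toy_pairs

# Keep only the pairs whose adjusted p-value is below 0.05
toy_sig <- toy_pairs[toy_pairs[, "p adj"] < 0.05, , drop = FALSE]
rownames(toy_sig)
```

A pair is significant exactly when its confidence interval (lwr to upr) does not contain 0, so the two columns and the p adj column always tell the same story.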
Citation [1]: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/TukeyHSD.html