library(mosaic)
library(ggplot2)
The mosaicData package, which is loaded along with mosaic, includes a data frame called “Gestation”. It comes from the Child Health and Development Studies conducted in 1961 and 1962. Its variables record each infant's birth weight, birth date, and gestational period, along with parental information including age, education, height, weight, and the mother's smoking habits.
Because this data frame ships with a package we have already loaded, we can access it by using the command:
data(Gestation)
Now that the data is accessible, let us take a look at the first six cases of the data frame.
head(Gestation)
## id pluralty outcome date gestation sex wt parity race age ed ht wt.1
## 1 15 5 1 1411 284 1 120 1 8 27 5 62 100
## 2 20 5 1 1499 282 1 113 2 0 33 5 64 135
## 3 58 5 1 1576 279 1 128 1 0 28 2 64 115
## 4 61 5 1 1504 NA 1 123 2 0 36 5 69 190
## 5 72 5 1 1425 282 1 108 1 0 23 5 67 125
## 6 100 5 1 1673 286 1 136 4 0 25 2 62 93
## drace dage ded dht dwt marital inc smoke time number
## 1 8 31 5 65 110 1 1 0 0 0
## 2 0 38 5 70 148 1 4 0 0 0
## 3 5 32 1 NA NA 1 2 1 1 1
## 4 3 43 4 68 197 1 8 3 5 5
## 5 0 24 5 NA NA 1 1 1 1 5
## 6 3 28 2 64 130 1 4 2 2 2
When we observe the data frame, we notice many variables that could plausibly be associated with infants' birth weights. The first step in checking whether two variables are related is to plot them against each other.
Let us use infant's weight as the response variable (i.e. what we are interested in) and gestation as the explanatory variable (i.e. what we think might have a strong influence on infant's weight). Gestation is the length of the gestation period (in days) and infant's weight is measured in ounces. It makes intuitive sense that gestation period and infant's weight will be positively correlated, meaning that longer gestation periods tend to go with heavier infants.
ggplot(Gestation, aes(gestation, wt)) + geom_point() + labs(title="Infant's Weight by Gestation Period", x="Gestation Period (in days)", y="Infant's birth weight (in ounces)")
## Warning: Removed 13 rows containing missing values (geom_point).
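The warning appears because ggplot2 leaves out any row in which gestation or wt is missing (NA). As a small sketch, the six rows printed by head(Gestation) above can be retyped into a data frame to count such rows. The names first_six, complete, and n_missing are introduced here for illustration. Running the same check on the full Gestation data frame would be expected to give the 13 rows mentioned in the warning.

```r
# First six rows of Gestation, copied from the head() output above
# (only the two plotted columns are kept).
first_six <- data.frame(
  gestation = c(284, 282, 279, NA, 282, 286),
  wt        = c(120, 113, 128, 123, 108, 136)
)

# TRUE for rows in which both plotted values are present
complete <- complete.cases(first_six[, c("gestation", "wt")])

# Number of rows that would be dropped from the scatter-plot
n_missing <- sum(!complete)
n_missing
```

Only the fourth case has a missing gestation value, so one of these six rows would be dropped.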
Based on the above scatter-plot, we can see that there is a relationship between gestation period and infant's weight. We can add a regression line to the scatter-plot to see the direction and steepness of that relationship.
ggplot(Gestation, aes(gestation, wt)) + geom_point() + stat_smooth(method="lm", se=FALSE) + labs(title="Infant's Weight by Gestation Period", x="Gestation Period (in days)", y="Infant's birth weight (in ounces)")
## Warning: Removed 13 rows containing missing values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
The regression line has a positive slope, which indicates that gestation period and infant's weight are positively correlated in our sample. This matches our initial expectation!
To put it in simpler words, both the scatter-plot and the regression line illustrate patterns in the Gestation data. We would like these patterns to give us insight into a larger population of interest.
But how do we know whether these patterns provide us with enough evidence to infer that there is a significant relationship between gestation period and infant's weight in the entire population (and not just our sample data)?
This is where we use the process of hypothesis testing, which is a formal way to make conclusions about the entire population.
Hypothesis tests always involve two hypotheses, a p-value, and a conclusion.
Null Hypothesis (H0): This is the default hypothesis, which states that there is no linear relationship between the explanatory and response variables.
Alternative Hypothesis (Ha): The alternative hypothesis states that there is a linear relationship between the explanatory and response variables.
For the Gestation example, our hypotheses would be as follows:
Null Hypothesis (H0): There is no linear relationship between gestation period and infant's weight.
Alternative Hypothesis (Ha): There is a linear relationship between gestation period and infant's weight.
We decide whether or not to reject the null hypothesis based on a statistic called the “p-value”.
A p-value is the probability of seeing a relationship at least as strong as the one in our sample if the null hypothesis were true, i.e. if there were really no relationship in the population. A low p-value means that a relationship this strong would be unlikely to arise by chance alone. Thus, the smaller the p-value, the more evidence we have against the null hypothesis.
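One way to make the idea of a p-value concrete is a permutation test. This is a different technique from the test that lm() reports, but it rests on the same logic: shuffle the explanatory variable so that any link to the response is pure chance, then see how often a slope as steep as the observed one turns up. The sketch below uses simulated data, not the Gestation data, and the names x, y, obs_slope, and perm_p_value are illustrative.

```r
set.seed(1)

# Simulated data with a real linear relationship (illustrative only)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# Slope observed in the sample
obs_slope <- coef(lm(y ~ x))[["x"]]

# Slopes obtained when x is shuffled, so that any association is due to chance
n_perm <- 1000
perm_slopes <- replicate(n_perm, {
  x_shuffled <- sample(x)
  coef(lm(y ~ x_shuffled))[["x_shuffled"]]
})

# Share of shuffles whose slope is at least as extreme as the observed one.
# The +1 terms count the observed sample itself, so the estimate is never 0.
perm_p_value <- (sum(abs(perm_slopes) >= abs(obs_slope)) + 1) / (n_perm + 1)
perm_p_value
```

Because the simulated relationship is strong, shuffled data almost never produce a slope as steep as the observed one, so the resulting p-value is very small.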
In the scientific community, a p-value cut-off of 0.05 is standard. If a p-value is less than 0.05, then we have enough evidence to reject the null hypothesis in favor of the alternative. If a p-value is above 0.05, the data do not provide enough evidence to reject the null.
Let us now go back to the Gestation data. To find the p-value, we first need to fit a linear model with infant's weight (in ounces) as the response and gestation period (in days) as the explanatory variable, and then use the regressionAnalysis() function on the model. Let us name this model1.
model1 <- lm(wt ~ gestation, data = Gestation)
regressionAnalysis(model1)
## [,1] [,2]
## [1,] "Intercept coef" "-10.0641842448408"
## [2,] "gestation coef" "0.464262603017409"
## [3,] "R-Squared" "0.166344859631711"
## [4,] "gestation p-value" "3.2243619311291e-50"
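regressionAnalysis() is not defined in this section, so it is presumably a helper supplied alongside this material. If it is not available, the same four quantities can be read from base R's summary() of an lm fit. The sketch below uses only the six rows shown by head(Gestation), so its numbers will differ from the full-data output above. The names first_six, fit_small, and results are introduced here for illustration.

```r
# First six rows of Gestation (two columns), copied from the head() output
first_six <- data.frame(
  gestation = c(284, 282, 279, NA, 282, 286),
  wt        = c(120, 113, 128, 123, 108, 136)
)

# With R's default settings, lm() drops rows that contain missing values
fit_small <- lm(wt ~ gestation, data = first_six)
fit_summary <- summary(fit_small)

# Coefficient table: one row per term, with estimate, SE, t value, and p-value
coefs <- fit_summary$coefficients

results <- c(
  intercept_coef    = coefs["(Intercept)", "Estimate"],
  gestation_coef    = coefs["gestation", "Estimate"],
  r_squared         = fit_summary$r.squared,
  gestation_p_value = coefs["gestation", "Pr(>|t|)"]
)
results
```

Replacing first_six with Gestation should give the values printed above, assuming regressionAnalysis() reports the ordinary lm() estimates.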
To make things simpler, let us write the fitted model in y = a + bx form. “y” is the response variable, i.e. infant weight; “b” is the slope, i.e. the coefficient of gestation; “x” is the value of the explanatory variable, i.e. gestation; and “a” is the intercept.
The resultant equation is: wt = 0.4643*gestation - 10.064
This means that, in our sample, each additional day of gestation is associated with an average increase of 0.4643 ounces in infant's weight. Now, to determine whether or not this relationship is statistically significant, we look at the p-value of the slope term. This is the bottom right number in the regressionAnalysis() function's output. We see a value of 3.224e-50 (which can also be read as 3.224 × 10^-50), which is much lower than the cut-off of 0.05. This means that we have statistically significant evidence to reject the null hypothesis in favor of the alternative hypothesis (i.e. there is a linear relationship between gestation period and infant's birth weight). A relationship as strong as the one we observed would be highly unlikely to occur by chance if gestation period and infant's weight were unrelated.
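As a quick arithmetic check, the fitted equation can be evaluated directly. The sketch below plugs the coefficients printed above into a small helper and also converts R-Squared into a correlation coefficient. In a model with one explanatory variable, the correlation is the square root of R-Squared with the sign of the slope. The helper name predict_wt and the 280-day input are illustrative choices.

```r
# Coefficients and R-squared copied from the regressionAnalysis(model1) output
intercept <- -10.0641842448408
slope     <- 0.464262603017409
r_squared <- 0.166344859631711

# Predicted birth weight (ounces) for a gestation period given in days
predict_wt <- function(gestation_days) {
  intercept + slope * gestation_days
}

pred_280 <- predict_wt(280)              # roughly 119.9 ounces
per_week <- predict_wt(287) - pred_280   # change for 7 extra days, about 3.25 ounces

# Correlation implied by R-squared in a one-predictor linear model
r <- sign(slope) * sqrt(r_squared)       # about 0.41
```

An R-Squared of about 0.166 means that gestation period accounts for roughly 17% of the variation in birth weight in this sample. The relationship is clearly present, but most of the variation is left unexplained.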
Now that you have become pros at p-values, let us examine the relationship between infants' weights and their birth dates. Birth date is measured by the variable “date”, where 1096 indicates that the baby was born on January 1, 1961. We can state our hypotheses as follows:
Null Hypothesis (H0): There is no linear relationship between infants' birth weights and birth dates.
Alternative Hypothesis (Ha): There is a linear relationship between infants' birth weights and birth dates.
Let's start with a plot of the two variables to determine whether there appears to be a correlation between them.
ggplot(Gestation, aes(date, wt)) + geom_point() + stat_smooth(method="lm", se=FALSE) + labs(title="Infants' Birth Weights by Birth Dates", x="Infants' Birth Dates", y="Infants' birth weights (in ounces)")
We can see that the regression line is almost perfectly horizontal, which suggests that we will fail to reject the null hypothesis in this hypothesis test. To confirm this, let's build a linear model, called model2, and run the regressionAnalysis() function on this model.
model2 <- lm(wt ~ date, data = Gestation)
regressionAnalysis(model2)
## [,1] [,2]
## [1,] "Intercept coef" "105.937641290166"
## [2,] "date coef" "0.00888064045122317"
## [3,] "R-Squared" "0.00270971603106753"
## [4,] "date p-value" "0.0673287571647857"
The p-value for the slope is 0.0673, which is higher than the cut-off of 0.05. This means that we do not have enough evidence to reject the null hypothesis. If there were truly no relationship between birth weight and birth date, we would still see a relationship at least this strong in about 6.7% of samples. Thus, we fail to reject the null hypothesis. Note that this is not the same as showing that there is no relationship; the data simply do not provide enough evidence of one.
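The decision rule used in both tests can be written as a small helper. This is only a sketch: decide() is a hypothetical function introduced here, and the two p-values are copied from the regressionAnalysis() outputs above. A p-value above the cut-off leads to "fail to reject", never to "accept".

```r
# Compare a p-value with the cut-off (alpha) and report the decision
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) {
    "reject H0"
  } else {
    "fail to reject H0"
  }
}

p_gestation <- 3.2243619311291e-50   # slope p-value from model1
p_date      <- 0.0673287571647857    # slope p-value from model2

decision_gestation <- decide(p_gestation)   # "reject H0"
decision_date      <- decide(p_date)        # "fail to reject H0"
```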
In theory, the null hypothesis is either true or false, but a statistical test cannot tell us which with certainty. It only tells us how surprising our data would be if the null hypothesis were true. Therefore, it is possible to make wrong inferences from statistical tests. Hypothesis tests will not always be perfect because the data come from a random sample: there may be times that we reject the null hypothesis when it is actually true, or fail to reject it when it is actually false. These errors are called Type I and Type II errors.
A Type I error is incorrectly rejecting the null hypothesis, i.e. the null hypothesis is actually true, but the statistical test led us to believe that it is false. This situation is analogous to getting a false positive on a test.
Let us go back to the Gestation example once again. In the first model, where we study the relationship between infant's weight and gestation length, making a Type I error would mean that we have falsely concluded that there is a significant relationship between gestation period and infant's weight.
A Type II error is incorrectly failing to reject the null hypothesis, i.e. the alternative hypothesis is actually true, but the statistical test has not picked up on the relationship. This situation is analogous to a false negative, and it is more likely when sample sizes are small.
In the context of the Gestation example, making a Type II error in the second model would mean that we failed to detect a relationship between infant's weight and birth date that actually exists.
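For a given sample, lowering the p-value cut-off reduces Type I errors but makes Type II errors more likely. The only way to reduce both types of errors at once is to increase the sample size!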
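Both error rates can be explored by simulation. The sketch below repeatedly generates data, fits lm(), and records how often the slope's p-value falls below 0.05. The helper rejection_rate() is illustrative, and the true slope of 0.3, the noise level, and the sample sizes are arbitrary choices. When the true slope is 0, every rejection is a Type I error. When the true slope is not 0, every failure to reject is a Type II error.

```r
set.seed(42)

# Proportion of simulated samples in which H0 (slope = 0) is rejected
rejection_rate <- function(n, true_slope, n_sims = 1000, alpha = 0.05) {
  rejected <- replicate(n_sims, {
    x <- rnorm(n)
    y <- true_slope * x + rnorm(n)
    p <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
    p < alpha
  })
  mean(rejected)
}

# H0 true: rejections are Type I errors, so this rate should be near alpha
type1_rate <- rejection_rate(n = 50, true_slope = 0)

# H0 false: failures to reject are Type II errors
type2_small <- 1 - rejection_rate(n = 20,  true_slope = 0.3)
type2_large <- 1 - rejection_rate(n = 200, true_slope = 0.3)

c(type1 = type1_rate, type2_n20 = type2_small, type2_n200 = type2_large)
```

In a simulation like this, the Type I rate is governed by the cut-off, while the Type II rate falls as the sample size grows. A larger sample therefore leaves room to tighten the cut-off without losing the ability to detect real relationships.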
To find whether or not there is a significant relationship between two variables:
Choose your variables of interest.
State your null and alternative hypotheses.
Plot a scatter-plot with a regression line.
Make a linear model with your variables of interest.
Use the regressionAnalysis() function to obtain the equation of the best-fit line and the p-value of the slope.