Module 3: Making Graphs

By the end of this lesson, you should be able to:

Plotting Data Frames

In this module you will learn about data visualization with boxplots and scatterplots. Data visualization is an important tool for exploring the characteristics of individual variables and the relationships between several variables.

This course will use the 'ggplot2' package to visualize data frames, and this module will help you become more familiar with this package. To begin, install the package by going to the Packages tab in the Files/Plots/Packages pane, clicking “Install Packages”, and typing ggplot2 in the text box.
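If you prefer the console, the same installation step can be sketched as below. The requireNamespace() guard and the choice of CRAN mirror are assumptions made for this example rather than part of the course instructions; any CRAN mirror should work.

```r
# Install ggplot2 only if it is not already available.
# Assumption: the cloud CRAN mirror is reachable; swap in another mirror if you prefer.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
```

Installing is a one-time step; loading the package, described next, is needed in every new R session.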

Once the package is installed, load it by finding it in your list of packages in the Packages tab and checking the box next to its name. You can also load the package by running the following command in the console:

library(ggplot2)

As always, let's also load the mosaic package.

library(mosaic)

We will use the “Galton” data frame in this module, which is accessible with the data() function once the mosaic package is loaded. Let's also look at the first six cases of the data frame with head(), and use dim() to see how many cases and variables there are.

data(Galton)
head(Galton)
##   family father mother sex height nkids
## 1      1   78.5   67.0   M   73.2     4
## 2      1   78.5   67.0   F   69.2     4
## 3      1   78.5   67.0   F   69.0     4
## 4      1   78.5   67.0   F   69.0     4
## 5      2   75.5   66.5   M   73.5     4
## 6      2   75.5   66.5   M   72.5     4
dim(Galton)
## [1] 898   6

The Galton data frame contains 898 cases and 6 variables; however, we will only use the following three in this module: “height”, an adult child's height in inches; “sex”, the child's sex; and “mother”, the height of the child's mother in inches.
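As a quick check before plotting, the sketch below pulls out just these three variables and inspects their types. The name galton_sub is made up for this example, and loading the data directly from the mosaicData package (which mosaic loads for you) is an assumption that lets the snippet run on its own.

```r
# Load Galton directly from mosaicData so this snippet is self-contained.
# Assumption: mosaicData is installed (it comes along with mosaic).
data(Galton, package = "mosaicData")

# Keep only the three variables used in this module.
# galton_sub is a hypothetical name chosen for this example.
galton_sub <- Galton[, c("height", "sex", "mother")]

# str() reports the type of each variable (quantitative vs. categorical).
str(galton_sub)

# summary() gives numeric summaries for the heights and counts for each sex.
summary(galton_sub)
```

Knowing which variables are quantitative and which are categorical matters because it determines which kind of plot is appropriate.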

A. Boxplots

Boxplots are graphical summaries of variables' quartiles. Below is an example of a boxplot of the 'height' variable in the Galton data frame. In order to make the boxplot, we must give several pieces of information to R.

ggplot(Galton, aes(x=0, y=height)) +
  geom_boxplot() +  # don't forget the parentheses after geom_boxplot!
  labs(title="Boxplot of Height", x=" ", y="height (inches)")

[Figure: boxplot of height]

In the graph above, the bottom and top of the box mark the first and third quartiles (the 25th and 75th percentiles) of the heights in the data frame. The band inside the box is the median, or 50th percentile. Reading from the boxplot, the 25th percentile is approximately 64 inches, the median is approximately 66 inches, and the 75th percentile is approximately 69 inches. We can check these visual estimates numerically with the quantile() function:

quantile(Galton$height)
##   0%  25%  50%  75% 100% 
## 56.0 64.0 66.5 69.7 79.0

Notice that the output of the quantile() function on the height variable is consistent with our visual estimates above.

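Almost all of the data falls between the end points of the lines (called whiskers) extending from the box. By default, geom_boxplot() draws each whisker out to the most extreme observation that lies within 1.5 times the interquartile range of the box edge, and plots any points beyond that individually. The width of the box has no mathematical meaning.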
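To see where the lines extending from the box end, and how much of the data they cover, here is a minimal sketch that recomputes them by hand. It assumes the default 1.5 times interquartile range rule used by geom_boxplot(); the variable names are made up for this example.

```r
# Assumption: mosaicData is installed (it comes along with mosaic).
data(Galton, package = "mosaicData")

# First and third quartiles: the bottom and top of the box.
q <- unname(quantile(Galton$height, c(0.25, 0.75)))
iqr <- q[2] - q[1]  # interquartile range: the height of the box

# Fences sit 1.5 * IQR beyond each end of the box.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Each line ends at the most extreme observation inside its fence.
whisker_low  <- min(Galton$height[Galton$height >= lower_fence])
whisker_high <- max(Galton$height[Galton$height <= upper_fence])

# Share of all heights that fall between the two line ends.
share_inside <- mean(Galton$height >= whisker_low &
                     Galton$height <= whisker_high)

c(whisker_low = whisker_low, whisker_high = whisker_high,
  share_inside = share_inside)
```

A share_inside value close to 1 is what "almost all of the data" means here.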

Boxplots are useful for comparing a quantitative variable across the levels of a categorical variable. Suppose we want to compare the heights of males and females, coded 'M' and 'F' in the 'sex' variable. We can do this by setting x=sex inside aes() and updating the x label in labs(), which is demonstrated below. Don't forget to change the title!

ggplot(Galton, aes(x=sex, y=height)) +
  geom_boxplot() +  # don't forget the parentheses after geom_boxplot!
  labs(title="Boxplot of Height by Sex", y="height (inches)", x="sex (M,F)")

[Figure: side-by-side boxplots of height by sex]

Notice there are now two side-by-side boxplots. One represents males (denoted with an 'M') and the other represents females (denoted with an 'F'). Based on this plot, it appears that males tend to be taller than females.
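The visual comparison can be backed up numerically. The sketch below applies quantile() and median() to each sex separately using base R's tapply(); this is one possible approach rather than the course's prescribed method, and the variable names are made up for this example.

```r
# Assumption: mosaicData is installed (it comes along with mosaic).
data(Galton, package = "mosaicData")

# Apply quantile() to the heights within each level of sex.
quartiles_by_sex <- tapply(Galton$height, Galton$sex, quantile)
quartiles_by_sex

# The medians correspond to the bands inside the two boxes.
medians_by_sex <- tapply(Galton$height, Galton$sex, median)
medians_by_sex
```

If the median for 'M' is larger than the median for 'F', the numbers agree with what the side-by-side boxplots suggest.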

The dots beyond the ends of the lines are outliers: people who are unusually short or tall for their sex.
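To list the heights drawn as dots, a minimal sketch is given below. It assumes the default 1.5 times interquartile range rule used by geom_boxplot(), applied within each sex; find_outliers() is a hypothetical helper written for this example, not a function from ggplot2 or mosaic.

```r
# Assumption: mosaicData is installed (it comes along with mosaic).
data(Galton, package = "mosaicData")

# Hypothetical helper: return the values lying more than 1.5 * IQR
# below the first quartile or above the third quartile.
find_outliers <- function(h) {
  q <- unname(quantile(h, c(0.25, 0.75)))
  iqr <- q[2] - q[1]
  h[h < q[1] - 1.5 * iqr | h > q[2] + 1.5 * iqr]
}

# Apply the helper to the heights of each sex separately.
outliers_by_sex <- tapply(Galton$height, Galton$sex, find_outliers)
outliers_by_sex
```

Because the rule is applied within each group, a height that is unremarkable for one sex can still be flagged for the other.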

B. Scatterplots

Scatterplots are used to examine the relationship between two quantitative variables. The code below plots child's height (on the y-axis) against mother's height (on the x-axis):

ggplot(Galton, aes(x=mother, y=height)) +
  geom_point() +
  labs(title = "Scatterplot of Height by Mother's Height", x = "Mother's Height (inches)", y = "Height (inches)")

[Figure: scatterplot of height by mother's height]

As we discovered at the beginning of this module, there are 898 cases in the data frame, each of which is represented by a point on the scatterplot (cases with identical values are drawn on top of one another). For example, the bottom-most point represents a case in the sample with a height of 56 inches and a mother with a height of 60 inches.
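Individual points can be traced back to rows of the data frame. The sketch below looks up the case with the smallest height and counts how many distinct (mother, height) pairs there are; the variable names are made up for this example.

```r
# Assumption: mosaicData is installed (it comes along with mosaic).
data(Galton, package = "mosaicData")

# The bottom-most point is the case with the smallest height.
lowest <- Galton[which.min(Galton$height), c("height", "mother", "sex")]
lowest

# Cases that share the same mother's height and child's height are plotted
# in the same spot, so there can be fewer visible points than cases.
n_distinct_points <- nrow(unique(Galton[, c("mother", "height")]))
c(cases = nrow(Galton), distinct_points = n_distinct_points)
```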

Scatterplots can reveal whether two variables are related. Here, it appears that as mother's height increases, child's height tends to increase slightly as well.
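One way to make a weak trend easier to see is to overlay a straight line on the points and to compute the correlation. Neither step is required by this module; the sketch below uses ggplot2's geom_smooth() with a linear fit and base R's cor() purely as an illustration.

```r
library(ggplot2)

# Assumption: mosaicData is installed (it comes along with mosaic).
data(Galton, package = "mosaicData")

# Correlation between mother's height and child's height:
# a positive value means the two tend to increase together.
r <- cor(Galton$mother, Galton$height)
r

# Same scatterplot as before, with a fitted straight line added on top.
p <- ggplot(Galton, aes(x = mother, y = height)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Scatterplot of Height by Mother's Height",
       x = "Mother's Height (inches)", y = "Height (inches)")
p
```

A positive but small value of r and a gently rising line would both match the "slight increase" seen in the plot.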

Recap

If you are working with a categorical variable and a quantitative variable, use a boxplot. If you are working with two quantitative variables, use a scatterplot. There are three steps to making a plot with ggplot():

  1. Start with the ggplot() function: ggplot(data_name, aes(x=variable_name, y=variable_name))
  2. Specify the type of plot by adding either geom_boxplot() or geom_point()
  3. Label your graph with the labs() function, using three arguments: title="", x="", and y=""
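The three steps can be sketched one at a time by storing each stage in a variable. The names base_plot, with_geom, and final_plot are made up for this example; in practice the steps are usually chained together with + in a single command, as in the examples earlier in this module.

```r
library(ggplot2)

# Assumption: mosaicData is installed (it comes along with mosaic).
data(Galton, package = "mosaicData")

# Step 1: name the data frame and map variables to the x and y axes.
base_plot <- ggplot(Galton, aes(x = sex, y = height))

# Step 2: choose the type of plot (geom_point() would give a scatterplot,
# which suits two quantitative variables instead).
with_geom <- base_plot + geom_boxplot()

# Step 3: add a title and axis labels.
final_plot <- with_geom +
  labs(title = "Boxplot of Height by Sex", x = "sex (M,F)", y = "height (inches)")

# Typing the name of the plot draws it.
final_plot
```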