Module 2: Basic Functions

By the end of this lesson, you should be able to:

We will use the "iris" data in this module, which is built into R and is therefore available in RStudio without installing anything. It is loaded for you below.

data(iris)

These data contain measurements on the following variables for 150 irises: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width (all in centimeters), and Species.

Below is a sample of the data frame.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Types of Variables

Reminder! Variables are attributes that describe cases. There are two types of variables. Quantitative variables are numerical measures on cases while categorical variables represent group labels on cases. For the iris data frame the variables Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width are quantitative and the variable Species is categorical.

We can access a variable of a data frame by using the syntax "name_of_data_frame$name_of_variable". For example, we can type the following code into the console to see all of the Petal.Length values in the iris data frame.

iris$Petal.Length
##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
##  [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
##  [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
##  [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
##  [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
##  [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1
## [103] 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9
## [120] 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1
## [137] 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
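As a small sketch of the "$" syntax described above, the code below pulls out the same column in three ways and checks that they agree. The object names (by_dollar, by_name, by_index) are made up for illustration and are not part of the lesson.

```r
# iris ships with R's built-in datasets package
data(iris)

# Three ways to extract one variable as a plain vector
by_dollar <- iris$Petal.Length          # the "$" syntax from this lesson
by_name   <- iris[["Petal.Length"]]     # double-bracket access by name
by_index  <- iris[, "Petal.Length"]     # row/column indexing, all rows

# All three return the same numeric vector of 150 values
identical(by_dollar, by_name)
identical(by_dollar, by_index)
length(by_dollar)
```

The "$" form is the one used throughout this module; the other two are shown only to make clear that a variable is simply a column of the data frame.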

A. Functions on Categorical Variables

Categorical variables assign specific group labels to cases, and these values are referred to as levels. The 'Species' variable in the iris data frame has three levels: 'setosa', 'versicolor', and 'virginica'. These represent the three different species observed in the sample of 150 irises. In R, we can type the following code to summarize the levels of a categorical variable.

levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"

Now that we have confirmed Species has three levels, we can use the "table()" function to determine the number of observations on each level.

table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50

From the table, we can see the distribution of cases by species level; there are fifty cases for each species type.
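If proportions are more useful than raw counts, one possible follow-up is to pass the table to "prop.table()". This is a sketch that goes slightly beyond the functions named in this lesson; the object names species_counts and species_props are illustrative.

```r
data(iris)

# Count the cases at each level of Species
species_counts <- table(iris$Species)

# Convert the counts to proportions of the whole sample
species_props <- prop.table(species_counts)
species_props

# nlevels() reports how many levels a categorical variable has
nlevels(iris$Species)
```

Because each species has 50 of the 150 cases, each proportion is one third.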

B. Functions on Quantitative Variables

In R, there exist functions that return basic summary statistics on quantitative variables. In this course, we will use the functions "mean()" and "median()" to measure typical values of a quantitative variable, and the functions "sd()" and "quantile()" to measure the variability of a quantitative variable.

The "mean()" function returns the arithmetic average value for a variable in the data frame. The following code calculates the mean petal length in our iris sample, a value of 3.758 cm.

mean(iris$Petal.Length)
## [1] 3.758

The "median()" function returns the middle value of a variable. Half of the observed values are above the median and half are below. The following code calculates the median petal length in our iris sample, a value of 4.35 cm.

median(iris$Petal.Length)
## [1] 4.35
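To see what "mean()" and "median()" are doing, the sketch below recomputes both by hand and compares them with the built-in results. It assumes an even number of observations (true here, with 150 cases), so the median is the average of the two middle sorted values. The names x, n, manual_mean, and manual_median are illustrative.

```r
data(iris)

x <- iris$Petal.Length
n <- length(x)   # 150 observations

# Mean: add up all values and divide by how many there are
manual_mean <- sum(x) / n

# Median for an even n: average the two middle values after sorting
sorted_x <- sort(x)
manual_median <- (sorted_x[n / 2] + sorted_x[n / 2 + 1]) / 2

# Both should match the built-in functions
all.equal(manual_mean, mean(x))
all.equal(manual_median, median(x))
```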

The standard deviation measures the amount of variability in the data. We use the "sd()" function to calculate the standard deviation.

sd(iris$Petal.Length)
## [1] 1.765
sd(iris$Petal.Width)
## [1] 0.7622

The standard deviation of petal length in the iris sample is 1.765 cm, while the standard deviation of petal width is 0.762 cm. Because both variables are measured in centimeters, we can compare these values directly and conclude that petal lengths have greater variability than petal widths.
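The sketch below shows the calculation behind "sd()": the square root of the sum of squared deviations from the mean, divided by n - 1 (the sample standard deviation). The names x, n, and manual_sd are illustrative.

```r
data(iris)

x <- iris$Petal.Length
n <- length(x)

# Deviation of each value from the mean, squared and summed,
# then divided by n - 1 and square-rooted
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Should agree with the built-in function
all.equal(manual_sd, sd(x))
```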

We can also capture variability using quantiles, which we calculate with the "quantile()" function.

quantile(iris$Petal.Length)
##   0%  25%  50%  75% 100% 
## 1.00 1.60 4.35 5.10 6.90

These five numbers represent the minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum. From the output, we can see that petal lengths range from 1.00 cm to 6.90 cm, with about 25% of values at or below 1.60 cm and about 75% at or below 5.10 cm.
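As a sketch of how the five numbers relate to other functions, the code below checks the endpoints and the middle value against "min()", "max()", and "median()", and then requests different percentiles through the "probs" argument of "quantile()". The names x, q, and deciles are illustrative.

```r
data(iris)

x <- iris$Petal.Length

# Default call: 0%, 25%, 50%, 75%, and 100% quantiles
q <- quantile(x)

# The endpoints are the minimum and maximum; the 50% value is the median
all.equal(q[["0%"]], min(x))
all.equal(q[["50%"]], median(x))
all.equal(q[["100%"]], max(x))

# The probs argument requests other percentiles, here the 10th and 90th
deciles <- quantile(x, probs = c(0.10, 0.90))
deciles
```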