Module 2: Basic Functions

By the end of this lesson, you should be able to:

Load Project Mosaic
Differentiate between quantitative and categorical variables
Use basic statistical functions on variables of datasets, including:
- Levels
- Table
- Mean
- Standard Deviation
- Quantile
- Median

Project Mosaic

Project Mosaic provides the mosaic package for R, which supplies the statistical functions you will use in this course. To access these functions, you must load the package. In RStudio, you can check whether it is loaded by selecting 'Packages' in the Files/Plots/Packages pane and making sure the box next to mosaic is checked.

We can also load Project Mosaic with the following command:

library(mosaic)

We will use the 'iris' data in this module, which is built into R. It can be made available for analysis using the data() function, as shown below.

data(iris)

Now that we have loaded the data frame, we can look at the first six cases with the head() function.

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

The columns correspond to the following variables:

Sepal Length (cm)
Sepal Width (cm)
Petal Length (cm)
Petal Width (cm)
Species (I. setosa, I. versicolor, and I. virginica)
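
As a quick check on the description above, the sketch below asks R for the size of the data frame and the exact variable names. It uses only base R functions (dim() and names()); nothing here depends on the mosaic package.

```r
# iris ships with R, so data() is enough to make it available
data(iris)

# Number of cases (rows) and number of variables (columns)
dim(iris)
## [1] 150   5

# Variable names, spelled exactly as R expects them
names(iris)
```

The names printed by the last command are the ones to type when accessing a variable.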

Types of Variables

Reminder! Variables are attributes that describe cases. This section explores the topic further.

Pro Tip: Variable names are case-sensitive, meaning that to access a variable you must type its name with exactly the same spelling and capitalization.

Quantitative variables are numerical measures on cases, while categorical variables represent group labels on cases. In the iris data frame, we will work with Petal.Length, which is a quantitative variable, and Species, which is the categorical variable of interest. To refresh your memory on this data frame, the first six cases are printed below.

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
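
One way to confirm which kind of variable we are looking at is to ask R how it is stored. This is a minimal sketch using base R: quantitative variables are usually stored as numbers, and categorical variables are usually stored as factors.

```r
data(iris)

# Quantitative variable: stored as numbers
is.numeric(iris$Petal.Length)
## [1] TRUE

# Categorical variable: stored as a factor, R's type for group labels
is.factor(iris$Species)
## [1] TRUE

# Storage class of every variable in the data frame
sapply(iris, class)
```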

Throughout this module, the '$' operator is used to access a single variable of a data frame. For example, we can type the following command into the console to see all of the Petal.Length values in the iris data frame.

iris$Petal.Length

##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3
##  [18] 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4
##  [35] 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7
##  [52] 4.5 4.9 4.0 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1
##  [69] 4.5 3.9 4.8 4.0 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5
##  [86] 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1
## [103] 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9
## [120] 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1
## [137] 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
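
The sketch below shows two things about '$': the result is an ordinary vector with one value per case, and capitalization matters. The name petal_length is only an illustrative choice, not something defined by the course.

```r
data(iris)

# Store the variable in its own object (petal_length is a made-up name)
petal_length <- iris$Petal.Length

# One value per case
length(petal_length)
## [1] 150

# Double brackets with the quoted name return the same vector
identical(petal_length, iris[["Petal.Length"]])
## [1] TRUE

# Wrong capitalization: there is no variable with this name, so R returns NULL
is.null(iris$petal.length)
## [1] TRUE
```

Note that R does not raise an error for the misspelled name, so a NULL result is a good hint to recheck the spelling.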

A. Functions on Categorical Variables

Categorical variables assign specific group labels to cases, and these values are referred to as levels. The 'Species' variable in the iris data frame has three levels: 'setosa', 'versicolor', and 'virginica'. These represent the three different species observed in the sample of 150 irises. In R, we can type the following command to determine the various levels of a categorical variable.

levels(iris$Species)

## [1] "setosa"     "versicolor" "virginica"

Now that we have confirmed Species has three levels, suppose we want to know the number of cases for each level. We can use the table() function to determine these values.

table(iris$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50

From the table, we can see the distribution of cases by species level; there are fifty cases for each species type.
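
If proportions are more useful than counts, one possible approach is to pass the table to the base R function prop.table(). The object name species_counts is an assumption made for this example.

```r
data(iris)

# Counts of cases for each level (species_counts is a made-up name)
species_counts <- table(iris$Species)

# Number of levels of the categorical variable
nlevels(iris$Species)
## [1] 3

# The counts add up to the total number of cases
sum(species_counts)
## [1] 150

# Share of cases in each level; each species makes up one third of the sample
prop.table(species_counts)
```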

B. Functions on Quantitative Variables

Several numerical summaries of quantitative variables will be used throughout the course. Some describe a typical value of a variable, while others describe its variability. This course uses two measures of a typical value: the mean and the median. It also uses two measures of variability: the standard deviation and the quartiles.

These measurements can be calculated using functions in R.

The mean() function returns the arithmetic average value for a variable in the dataset. The following command returns a mean of 3.758 cm.

mean(iris$Petal.Length)

## [1] 3.758

The median() function returns the middle value of a variable. Half of the values are at or above the median and half of the values are at or below the median. The following command returns a median of 4.35 cm.

median(iris$Petal.Length)

## [1] 4.35
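
To see what mean() and median() are doing, both values can be reproduced by hand. This is only a sketch for illustration; the object names x, n, and sorted are our own choices.

```r
data(iris)

x <- iris$Petal.Length
n <- length(x)        # 150 cases

# Mean: the sum of the values divided by the number of values
sum(x) / n
## [1] 3.758

# Median: sort the values and take the middle one.
# n is even here, so we average the two middle values.
sorted <- sort(x)
(sorted[n / 2] + sorted[n / 2 + 1]) / 2
## [1] 4.35
```

The mean is noticeably smaller than the median here because the many short setosa petals pull the average down.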

The standard deviation measures the amount of variability in the data. We use the sd() function to calculate the standard deviation.

sd(iris$Petal.Length)

## [1] 1.765

sd(iris$Petal.Width)

## [1] 0.7622

The standard deviation of petal length in the iris data frame is 1.765 cm, while the standard deviation of petal width is 0.7622 cm. Because both variables are measured in centimeters, we can conclude that petal lengths have greater variability than petal widths.
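
For readers curious about what sd() computes, the sketch below rebuilds the sample standard deviation step by step: take each value's deviation from the mean, square it, add the squares up, divide by n - 1, and take the square root. The object names are our own.

```r
data(iris)

x <- iris$Petal.Length
n <- length(x)

# Distance of each value from the mean
deviations <- x - mean(x)

# Sample standard deviation: note the division by n - 1, not n
manual_sd <- sqrt(sum(deviations^2) / (n - 1))

# The hand calculation agrees with sd()
all.equal(manual_sd, sd(x))
## [1] TRUE
```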

We can also summarize the values with quantiles. Using the quantile() function, which returns the quartiles by default:

quantile(iris$Petal.Length)

##   0%  25%  50%  75% 100% 
## 1.00 1.60 4.35 5.10 6.90

we can see that 25% of our cases are at or below 1.60 cm, and 25% of our cases fall above 5.10 cm.
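
The quantile() function also accepts a probs argument for requesting specific quantiles. The sketch below asks for only the first and third quartiles and then measures the distance between them, which is one way to express variability with quartiles. The base R function IQR() gives the same distance directly; the object name q is our own.

```r
data(iris)

x <- iris$Petal.Length

# Request only the 25% and 75% quantiles
q <- quantile(x, probs = c(0.25, 0.75))
q

# Distance between the third and first quartiles
unname(q[2] - q[1])
## [1] 3.5

# IQR() returns the same distance in one step
IQR(x)
## [1] 3.5
```

The middle half of the petal lengths therefore spans about 3.5 cm.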

Recap:

You have now learned to differentiate between quantitative and categorical variables. You have also learned several functions for making calculations on both types of variables.

Quantitative variables are numerical measures on cases
- mean(data_name$variable_name)
- median(data_name$variable_name)
- sd(data_name$variable_name)
- quantile(data_name$variable_name)
Categorical variables represent group labels on cases
- levels(data_name$variable_name)
- table(data_name$variable_name)