Module 1: Introduction to R, RStudio, and Datasets

Before you read this lesson, you should:

Read the introduction to R and RStudio document
Watch the following screencasts:
- Introduction
- Functions

By the end of this lesson, you should be able to:

Use R as a calculator
Articulate the differences between functions and arguments
Use R functions to perform arithmetic operations
Differentiate between cases and variables
Import datasets into RStudio
Use basic functions to inspect datasets

R As a Calculator

You can use R as a calculator. Try typing the following command into the console:

3+5

## [1] 8

Note that '## [1] 8' is the output R prints for the command '3+5'. The [1] marks the position of the first value on that line of output.

There are several other commands you can use in R. Try these in the console.

3^2

## [1] 9

9^(.5)

## [1] 3

Note that the “^” sign is the exponentiation operator. We have used it to find both the square of 3 and the square root of 9, because raising a number to the power .5 is the same as taking its square root.

3*5

## [1] 15

3/5

## [1] 0.6
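The operators above can also be combined in a single command. The following is a small sketch, not part of the original lesson, showing that R follows the usual order of operations and that parentheses change which step happens first. The variable names are made up for illustration.

```r
# Multiplication is carried out before addition: 3 + (5 * 2)
without_parentheses <- 3 + 5 * 2
without_parentheses
## [1] 13

# Parentheses force the addition to happen first: (3 + 5) * 2
with_parentheses <- (3 + 5) * 2
with_parentheses
## [1] 16

# Exponents are carried out before multiplication: 2 * (3^2)
power_first <- 2 * 3^2
power_first
## [1] 18
```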

Functions in R

A function is a set of instructions that performs a specific task. In R, a function is written as a name followed by a set of parentheses. The inputs placed inside the parentheses are called arguments. At the beginning of this module, you practiced basic arithmetic in R. There are built-in functions that perform these tasks as well! For example:

sum(3,5)

## [1] 8

In this example, sum() is a function that takes numbers as arguments and returns their sum. When there is more than one argument, the arguments are separated by commas. Another function is sqrt(), which takes the square root of a number. Because it operates on a single input, sqrt() accepts exactly one argument. For example:

sqrt(9)

## [1] 3

This gives the answer 3, the same result as the command 9^(.5) from earlier in this module.
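The operator forms and function forms seen so far can be placed side by side. The sketch below is an illustration added under the assumption that you are working in a fresh R session; the variable names are hypothetical. It also shows that the result of one function can be used as the argument of another.

```r
# Addition written with an operator and with a function
sum_operator <- 3 + 5
sum_function <- sum(3, 5)
sum_operator
## [1] 8
sum_function
## [1] 8

# Square root written with an operator and with a function
root_operator <- 9^(.5)
root_function <- sqrt(9)

# Functions can be nested: sum(9, 16) is 25, and sqrt(25) is 5
nested_result <- sqrt(sum(9, 16))
nested_result
## [1] 5
```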

But if we give sqrt() more than one argument, we get an error, which is the message R prints when it is unable to perform the requested operation.

sqrt(9,16) #This will generate an error

## Error: 2 arguments passed to 'sqrt' which requires 1
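If you do want the square roots of several numbers, one common approach is to combine them into a single argument with c(), a base R function that is not covered in this lesson. The sketch below also uses tryCatch(), another base R function, to capture the error message as text instead of stopping; both are shown here only as an illustration.

```r
# c() combines 9 and 16 into one object, so sqrt() still receives one argument
roots <- sqrt(c(9, 16))
roots
## [1] 3 4

# Capture the error message from the two-argument call instead of halting
error_message <- tryCatch(
  sqrt(9, 16),
  error = function(e) conditionMessage(e)
)
error_message
```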

Cases, Variables, and Canonical Data Form

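Consider an experiment on a set of irises. The measurements are stored in a dataset called iris, which is built into R and is therefore available in RStudio without importing anything. This dataset contains the following information on 150 irises: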

Petal Width (cm)
Petal Length (cm)
Sepal Width (cm)
Sepal Length (cm)
Species Type (setosa, versicolor, or virginica)

In R, datasets are called data frames. The beginning of the 'iris' data frame is below.

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Take a look at the first row in the data frame above. This is the first iris in the data frame, and it has the following attributes:

A sepal length of 5.1 cm
A sepal width of 3.5 cm
A petal length of 1.4 cm
A petal width of 0.2 cm
It is a setosa iris

Each row in the data frame is one iris, and each column is an attribute of the irises. We can more generally define the aspects of the data frame:

A case is a trial in an experiment. Each iris is a distinct trial in our experiment, so each iris is a case.
A variable is a specific attribute of a case. Cases can have one or several variables. Each iris has a petal length, so petal length is a variable. Each iris also has a petal width, so petal width is another variable.

How do we organize this data?

We organize data frames in canonical data form. There are two important features of canonical data form:

Each row in the data frame represents a case
Each column in the data frame represents a variable

Look again at the beginning of the data frame above. Before moving on, make sure you can articulate the variables in our data frame, and understand the differences between variables and cases.
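The row and column structure described above can be checked directly in the console. This is a minimal sketch using base R functions (nrow(), ncol(), square brackets, and the $ sign) that are not otherwise introduced in this lesson; the variable names are made up for illustration.

```r
# Each row is a case, so the number of rows is the number of irises
n_cases <- nrow(iris)
n_cases
## [1] 150

# Each column is a variable
n_variables <- ncol(iris)
n_variables
## [1] 5

# One case: the first row, with every variable
first_case <- iris[1, ]
first_case

# One variable: the petal length of every case
petal_lengths <- iris$Petal.Length
length(petal_lengths)
## [1] 150
```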

Basic Functions on Datasets

Once you have read your data into R, there are a few functions you can use to check that it was read in correctly. The functions shown in this section come with R itself, so they are available as soon as R starts. Later work in this course also relies on the mosaic package. A package is a collection of functions and data, and the mosaic package is useful for statistical analysis. The mosaic package is loaded below.

require(mosaic)

Each of the following functions takes one argument: the name you have assigned to your data. These functions are demonstrated below:

The dim() function, which is short for dimensions, returns the number of cases (the number of rows) followed by the number of variables (the number of columns).

dim(iris)

## [1] 150   5

The names() function returns the names of the variables in the dataset.

names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

The head() function returns the first six cases in the dataset by default.

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
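The number of cases returned by head() can be changed with a second argument named n. The sketch below also shows tail(), a companion function in base R that returns the last cases instead of the first; neither the n argument nor tail() appears elsewhere in this lesson, and the variable names are hypothetical.

```r
# Ask for the first three cases instead of the default six
first_three <- head(iris, n = 3)
first_three

# tail() works the same way but starts from the end of the data frame
last_three <- tail(iris, n = 3)
last_three

# Both results are themselves data frames with three rows
nrow(first_three)
## [1] 3
```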

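The summary() function returns a summary of each variable in the dataset: the minimum, quartiles, median, mean, and maximum for numeric variables, and the number of cases in each category for a categorical variable such as Species.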

summary(iris)

##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##
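To tie this section together, here is one possible sketch of reading a dataset from a file and then checking it with the functions above. It assumes your data is stored as a CSV file and uses the base R functions write.csv() and read.csv(), which this lesson does not cover. The file here is a temporary copy of iris created only for the illustration, and the names csv_path and iris_copy are hypothetical.

```r
# Hypothetical file: write iris to a temporary CSV so there is something to read
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Read the file into R and assign it a name
iris_copy <- read.csv(csv_path)

# Check that the data was read in correctly
dim(iris_copy)     # should match dim(iris)
names(iris_copy)   # should match names(iris)
head(iris_copy)    # should show the same first six cases

# Remove the temporary file
unlink(csv_path)
```

With your own data, you would replace csv_path with the location of your file and skip the write.csv() step.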