Module 1: Introduction to R, RStudio, and Datasets

Before you read this lesson, you should:

- Read the introduction to R and RStudio document
- Watch the following screencasts:
  - Introduction
  - Functions

By the end of this lesson, you should be able to:

- Use R as a calculator
- Articulate the differences between functions and arguments
- Use R functions to perform arithmetic operations
- Differentiate between cases and variables
- Import datasets into RStudio
- Perform basic functions on datasets

R As a Calculator

You can use R as a calculator. Try typing the following command into the console:

3+5

## [1] 8

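Note that '## [1] 8' is the output R produces for the command '3+5'. The '[1]' marks the position of the first value on that line of output, and the '##' prefix is only how this document marks output; it does not appear in your console.

R supports several other arithmetic operators. Try these in the console.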

3^2

## [1] 9

9^(.5)

## [1] 3

Note that the “^” sign is shorthand for exponentiation. We have used it to find both the square of 3 and the square root of 9, since raising a number to the power .5 is the same as taking its square root.

3*5

## [1] 15

3/5

## [1] 0.6
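
The operators above can also be combined in a single command. As a small illustration (the particular numbers are chosen here and are not part of the lesson), R follows the usual order of operations, and parentheses change it:

```r
# Multiplication is done before addition.
3 + 5 * 2
## [1] 13

# Parentheses force the addition to happen first.
(3 + 5) * 2
## [1] 16

# Exponents are done before division, so the parentheses in 9^(1/2) matter.
9^(1/2)
## [1] 3
9^1/2
## [1] 4.5
```

Without the parentheses, 9^1/2 is read as (9^1)/2, which is why the last two commands give different answers.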

Functions in R

A function is a set of instructions that performs a specific task. In R, a function is written as a name followed by a set of parentheses. The values placed inside the parentheses are called arguments; they are the inputs the function works on. At the beginning of this module, you practiced basic arithmetic in R. There are built-in functions that perform these tasks as well! For example:

sum(3,5)

## [1] 8

In this example, sum() is a function that takes numbers as arguments and returns their sum. When there is more than one argument, the arguments are separated by commas. Another function is sqrt(), which takes the square root of a number. Unlike sum(), sqrt() accepts only one argument. For example:

sqrt(9)

## [1] 3

This gives the same answer, 3, as the command 9^(.5) did earlier.

However, if we pass more than one argument to sqrt(), we get an error. An error occurs when R is unable to perform the requested operation.

sqrt(9,16) #This will generate an error

## Error: 2 arguments passed to 'sqrt' which requires 1
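
One way to take the square root of several numbers at once is to bundle them into a single argument. The sketch below assumes the c() function, which this lesson has not introduced; it combines values into one vector, so sqrt() still receives exactly one argument:

```r
# c(9, 16) is one argument that happens to hold two numbers.
sqrt(c(9, 16))
## [1] 3 4

# sum() accepts a vector as well, and gives the same result as sum(9, 16).
sum(c(9, 16))
## [1] 25
```

The distinction is between the number of arguments (separated by commas inside the function's own parentheses) and the number of values a single argument contains.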

Cases, Variables, and Canonical Data Form

Consider an experiment in which the following attributes were measured on a sample of 150 irises:

- Petal Width (cm)
- Petal Length (cm)
- Sepal Width (cm)
- Sepal Length (cm)
- Species Type (setosa, versicolor, or virginica)

There are two important features of the experiment: cases and variables.

A case is a single unit observed in an experiment. In our experiment, each iris is a case.
A variable is a specific attribute of a case. Cases can have one or several variables. In our experiment, petal width, petal length, sepal width, sepal length, and species type are the variables.

The results of the experiment are stored in R as a dataset named "iris." Datasets in R are called data frames. The first six rows of the "iris" data frame are below.

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

How do we organize this data?

We organize data frames in canonical data form. There are two important features of canonical data form:

- Each row in the data frame represents a case
- Each column in the data frame represents a variable

For example, take a look at the first row in the data frame above. This represents the first iris in the sample, which has the following attributes:

- A sepal length of 5.1 cm
- A sepal width of 3.5 cm
- A petal length of 1.4 cm
- A petal width of 0.2 cm
- A species type of setosa
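
To look at a single case or a single variable directly, one option is R's bracket and dollar-sign notation. Neither has been introduced in this lesson, so treat the following as a preview rather than required material:

```r
# Row 1, all columns: the first case with all five of its variables.
iris[1, ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa

# One variable (column), shown for the first six cases.
iris$Sepal.Length[1:6]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4
```

Reading across a row gives every variable for one case; reading down a column gives one variable for every case.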

Basic Functions on Data Frames

There are built-in functions in R that help us explore the structures and basic features of our data frames.

The "dim()" function returns the number of cases (rows) and the number of variables (columns) in a data frame, in that order.

dim(iris)

## [1] 150   5
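
If only one of the two numbers is needed, the nrow() and ncol() functions return them separately. They are not covered in this lesson, but a short sketch may help connect dim() to the idea of cases and variables:

```r
# Number of cases (rows).
nrow(iris)
## [1] 150

# Number of variables (columns).
ncol(iris)
## [1] 5
```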

The "names()" function returns the names of the variables in a data frame.

names(iris)

## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
## [5] "Species"

The "head()" function returns the first cases of a data frame (six by default).

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
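
head() also accepts an optional second argument, n, that sets how many cases to show. As an illustration (the choice of 3 is arbitrary and not from the lesson):

```r
# Two arguments separated by a comma: the data frame and the number of cases.
head(iris, n = 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
```

When n is left out, as in head(iris), R falls back on the default of six cases.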

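The "summary()" function returns basic summary statistics for each variable in a data frame: the minimum, quartiles, median, mean, and maximum for numeric variables, and the number of cases in each category for categorical variables such as Species.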

summary(iris)

##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                
##                
##
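
The same statistics can be computed one at a time. The sketch below uses the dollar-sign notation (not introduced in this lesson) to pick out the Sepal.Length variable, together with the min(), max(), mean(), and round() functions; the results should agree with the first column of the summary above:

```r
# Smallest and largest sepal length in the sample.
min(iris$Sepal.Length)
## [1] 4.3
max(iris$Sepal.Length)
## [1] 7.9

# Mean sepal length, rounded to two decimal places.
round(mean(iris$Sepal.Length), 2)
## [1] 5.84
```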