Cheatsheet
Pro Tip: On your console, use the up and down arrows to navigate through past commands.
Reading Data into RStudio:
data_name <- read.csv("filename_of_dataset.csv")
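To try read.csv() without a real file, one option is to write a small table to a temporary CSV first and read it back. This is only a sketch: the scores data frame and the temporary file are invented for illustration and are not part of the course data.

```r
# Hypothetical example data, written to a temporary file so the sketch is self-contained
scores <- data.frame(name = c("Ana", "Ben", "Cy"), score = c(90, 85, 77))
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)

# Read the file back in; the file name must be wrapped in plain straight quotes
scores_from_file <- read.csv(csv_path)
print(scores_from_file)
```

With a real data-set, replace csv_path with the quoted file name, as in the line above.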
Storing variables in R:
x <- 1
(read as "x gets 1"; the <- operator assigns a value to a variable so you can refer to that value by name)
1. Introduction to R, RStudio and Data-Sets:
sum(number1, number2)
: Takes the sum of number1 and number2 and returns the result
sqrt(number)
: Calculates the square root of the number provided within the parentheses. This function takes only one argument
data(name_of_dataset)
: Loads a data-set that ships with R or with a loaded package so it can be used by name
head(name_of_dataset)
: Returns the first six cases of the data-set
dim(name_of_dataset)
: Returns the number of cases (rows) and variables (columns) present in the data-set. The first number represents the number of cases, and the second number represents the number of variables
names(name_of_dataset)
: Returns the names of the variables present in the data-set
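The functions in this section can be tried on any data-set. The sketch below assumes mtcars, a data-set that ships with base R, as a stand-in for name_of_dataset; the names n_cases and n_variables are made up for the example.

```r
# mtcars ships with base R (datasets package); it stands in for name_of_dataset
data(mtcars)

sum(2, 3)        # adds the two numbers
sqrt(16)         # square root of a single number
head(mtcars)     # first six cases (rows)
dim(mtcars)      # number of cases, then number of variables
names(mtcars)    # variable (column) names

# dim() returns two numbers: cases first, variables second
n_cases <- dim(mtcars)[1]
n_variables <- dim(mtcars)[2]
```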
Pro Tip: If you run into a + sign while typing a command in the console, RStudio is prompting you to finish the command with something like a ). If you can't find the mistake, you can exit by pressing the Esc key.
2. Basic Functions:
levels(name_of_dataset$name_of_categorical_variable)
: Lists the levels of a categorical variable
table(name_of_dataset$name_of_categorical_variable)
: Counts the number of cases at each level of the categorical variable
mean(name_of_dataset$name_of_variable)
: For example -> mean(Galton$height) - calculates and returns the mean height from the Galton data-set
median(name_of_dataset$name_of_variable)
: Used to find the median (middle) value of a variable
sd(name_of_dataset$name_of_variable)
: For example -> sd(Galton$height) - calculates and returns the standard deviation in height from the Galton data-set
quantile(name_of_dataset$name_of_variable)
: Returns the minimum, the three quartiles and the maximum of the variable (the 0%, 25%, 50%, 75% and 100% points)
IQR(name_of_dataset$name_of_variable)
: Returns the interquartile range, the spread of the middle half of the values. R is case-sensitive, and base R spells this function in capitals
var(name_of_dataset$name_of_variable)
: For example -> var(Galton$height) - returns the sample variance of height
View(name_of_dataset)
: Displays the data in a new tab
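As a rough illustration of these summary functions, the sketch below uses iris, a data-set built into base R, in place of Galton (which needs an extra package). The interquartile range is computed with base R's IQR(), spelled in capitals; the name quartiles is made up for the example.

```r
data(iris)  # built-in data-set, used here in place of Galton

levels(iris$Species)           # the levels of a categorical variable
table(iris$Species)            # number of cases at each level
mean(iris$Sepal.Length)        # mean
median(iris$Sepal.Length)      # median (middle value)
sd(iris$Sepal.Length)          # sample standard deviation
var(iris$Sepal.Length)         # sample variance (the standard deviation squared)
IQR(iris$Sepal.Length)         # 75% point minus 25% point

# quantile() returns five values by default: 0%, 25%, 50%, 75%, 100%
quartiles <- quantile(iris$Sepal.Length)
quartiles
```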
3. Making Graphs:
- Boxplot:
ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_boxplot() + labs(title="Title_name",
x="x-axis_label", y="y-axis_label")
- Scatter-plot:
ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_point() + labs(title = "Title_name",
x="x-axis_label", y="y-axis_label")
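The two templates above can be filled in as follows. This is a sketch that assumes the ggplot2 package is installed and uses the built-in iris and mtcars data-sets; the titles, axis labels and the names box_plot and scatter_plot are made up for the example.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# Boxplot: categorical explanatory variable on x, numeric response on y
box_plot <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  labs(title = "Sepal length by species", x = "Species", y = "Sepal length (cm)")

# Scatter-plot: numeric explanatory variable on x, numeric response on y
scatter_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Fuel economy vs. weight", x = "Weight (1000 lbs)", y = "Miles per gallon")

# print() draws a stored plot; typing its name at the console does the same
print(box_plot)
print(scatter_plot)
```

Storing a plot in a variable is optional; it just makes it easy to add more layers with + later.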
4. Making a Linear Model in R:
name_of_model = lm(response_variable ~ explanatory_variable, data= name_of_dataset)
regAnalysis(name_of_linear_model)
: Returns the p-value, coefficients and R-squared value of a linear model
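regAnalysis() is not a base R function, and this cheatsheet does not say which package or course script provides it. If it is unavailable, the same three quantities can be read from base R's summary(). The sketch below assumes the built-in mtcars data-set; car_model, model_summary and slope_p_value are names made up for the example.

```r
# mtcars is built in; mpg is the response variable, wt the explanatory variable
car_model <- lm(mpg ~ wt, data = mtcars)
model_summary <- summary(car_model)

coef(car_model)            # coefficients: intercept and slope
model_summary$r.squared    # R-squared value

# p-value for the slope, taken from the coefficient table
slope_p_value <- model_summary$coefficients["wt", "Pr(>|t|)"]
slope_p_value
```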
5. P-Values and Hypothesis Testing:
head(name_of_dataset)
ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_point() + labs(title = "Title_name",
x = "x-axis_label", y = "y-axis_label")
ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_point() + labs(title = "Title_name",
x = "x-axis_label", y = "y-axis_label") + stat_smooth(method = "lm", se = FALSE)
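One way to connect the regression line to a hypothesis test is to fit the same line with lm() and read the p-value for its slope. The sketch below assumes ggplot2 is installed and uses the built-in mtcars data-set. The 0.05 cut-off is the conventional threshold, used here as an assumption, and fit_plot, coefficient_table and p_value are names made up for the example.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# Scatter-plot with a straight regression line and no confidence band
fit_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Fuel economy vs. weight", x = "Weight (1000 lbs)", y = "Miles per gallon") +
  stat_smooth(method = "lm", se = FALSE, formula = y ~ x)
print(fit_plot)

# Null hypothesis: the slope is zero (no linear relationship)
coefficient_table <- summary(lm(mpg ~ wt, data = mtcars))$coefficients
p_value <- coefficient_table["wt", "Pr(>|t|)"]

# TRUE means the null hypothesis is rejected at the assumed 0.05 level
p_value < 0.05
```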
6. T-tests and ANOVA tests:
ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_boxplot() + labs(title="Title_name",
x="x-axis_label", y="y-axis_label")
name_of_subset = subset(name_of_dataset, conditional_statement)
- For example ->
Galton_tall = subset(Galton, height > 75)
: Makes a subset of the Galton data-set called Galton_tall, which includes only those individuals who are taller than 75 inches
- For categorical variables, for example ->
Galton_female = subset(Galton, sex == "F")
: Makes a subset of the Galton data-set that includes only females
- Subsets can include more than one condition; use the & symbol to join multiple conditions. For example ->
Galton_tallfemales = subset(Galton, height > 65 & sex == "F")
: Makes a subset of the Galton data-set that includes only those females who are taller than 65 inches
mytest = t.test(subset1$variable_of_choice, subset2$variable_of_choice)
: Performs a t-test to compare the two groups (in this case, the two subsets) and check for a significant difference between them. T-tests are used for comparisons between two groups
mytest$p.value
: Returns the p-value of the above t-test
anova(name_of_linear_model)
: Performs an ANOVA test to check for a statistically significant difference between three or more groups
TukeyHSD(name_of_linear_model)
: The summary table shows the difference between each pair of groups, the 95% confidence intervals and the p-value of each pairwise comparison
TukeyHSD(name_of_linear_model, conf.level=0.95)
: Performs a Tukey test and sets the confidence level of the reported intervals explicitly, here 95%
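The whole workflow of this section can be sketched on the built-in mtcars data-set. Two things here are assumptions rather than part of the course material: the variable choices (am, wt, cyl) and the use of aov() for the Tukey step. Base R's TukeyHSD() accepts an aov() fit; whether it also works directly on an lm() model depends on which packages are loaded.

```r
# Work on a copy so the built-in data-set is left untouched
cars <- mtcars
cars$cyl_group <- factor(cars$cyl)   # 4, 6 or 8 cylinders: three groups

# Subsets (am: 0 = automatic, 1 = manual)
automatic <- subset(cars, am == 0)
manual <- subset(cars, am == 1)
heavy_manual <- subset(cars, am == 1 & wt > 3)   # two conditions joined with &

# t-test: compares two groups
mytest <- t.test(automatic$mpg, manual$mpg)
mytest$p.value

# ANOVA: compares three or more groups
group_model <- lm(mpg ~ cyl_group, data = cars)
anova(group_model)

# Tukey test: pairwise differences, confidence intervals and adjusted p-values
tukey <- TukeyHSD(aov(mpg ~ cyl_group, data = cars), conf.level = 0.95)
tukey
```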
7. Standard Curves and Logarithmic Axes:
ggplot(name_of_dataset, aes(explanatory_variable, response_variable)) + geom_point() + labs(title = "Title_name",
x = "x-axis_label", y = "y-axis_label") + stat_smooth(method = "lm", se = FALSE) + coord_cartesian(xlim = c(0,35000),
ylim = c(0,190))
ggplot(name_of_dataset, aes(log(explanatory_variable), response_variable)) + geom_point() + labs(title = "Title_name",
x = "x-axis_label", y = "y-axis_label") + stat_smooth(method = "lm", se = FALSE) + coord_cartesian(xlim = c(0,35000),
ylim = c(0,190))
: Wrapping the explanatory variable in log() plots it on a natural-logarithm scale
regAnalysis(name_of_linear_model)
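A small sketch of a log-scaled standard curve follows. The standard data frame is invented purely for illustration (its signal is constructed to rise linearly with the natural log of concentration), and ggplot2 is assumed to be installed. Note that once the x variable is log(concentration), any xlim passed to coord_cartesian() has to be given in log units, so the limits are left out here.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# Hypothetical standard-curve data, built so that signal = 20 * log(concentration) - 30
standard <- data.frame(concentration = c(10, 100, 1000, 10000, 30000))
standard$signal <- 20 * log(standard$concentration) - 30

# log() in R is the natural logarithm
log_plot <- ggplot(standard, aes(log(concentration), signal)) +
  geom_point() +
  labs(title = "Standard curve", x = "log(concentration)", y = "Signal") +
  stat_smooth(method = "lm", se = FALSE, formula = y ~ x)
print(log_plot)

# The matching linear model uses the same log() transformation
log_model <- lm(signal ~ log(concentration), data = standard)
coef(log_model)   # intercept and slope on the log scale
```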