Cheatsheet
Pro Tip: On your console, use the up and down arrows to navigate through past commands.
Reading Data into RStudio:
data_name <- read.csv("filename_of_dataset.csv")
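As a minimal sketch of the read step, the snippet below first writes a small made-up table to a temporary CSV file and then reads it back. The file location, column names and values are all hypothetical, chosen only so the example runs on its own.

```r
# Create a tiny CSV file in a temporary location (hypothetical data).
tmp_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(90, 85, 77)),
          tmp_file, row.names = FALSE)

# Read the file into a data frame, exactly as you would with your own file name.
data_name <- read.csv(tmp_file)
head(data_name)
```

With your own data, replace tmp_file with the file name in quotes. The file has to be in your working directory, or you need to give its full path.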
Storing variables in R:
x <- 1 (read as "x gets 1"; the arrow assigns a value to a variable so that you can refer to the value by that name)
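A short sketch of how a stored value can be reused by name. The variable y is hypothetical and is only there to show x being used after it has been assigned.

```r
x <- 1        # x gets 1
y <- x + 2    # use the stored value by name
y             # typing a name on its own prints the value
```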
1. Introduction to R, RStudio and Data-Sets:
sum(number1, number2): Takes the sum of number1 and number2 and returns the result
sqrt(number): Used to calculate the square root of the number provided within the parentheses. This function takes only one argument
data(name_of_dataset): Loads the named data-set so that it is available as a data frame
head(name_of_dataset): Returns the first six cases of the data-set
dim(name_of_dataset): Returns the number of rows and columns in the data-set. The first number is the number of cases (rows), and the second number is the number of variables (columns)
names(name_of_dataset): Returns the names of the variables present in the data-set
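The functions above can be tried on mtcars, a data-set that ships with R. Using mtcars here is an assumption made for illustration; it stands in for whatever data-set you are working with.

```r
sum(3, 4)        # 7
sqrt(16)         # 4

data(mtcars)     # load the built-in mtcars data-set
head(mtcars)     # first six cases
dim(mtcars)      # 32 11 -> 32 cases (rows) and 11 variables (columns)
names(mtcars)    # names of the variables
```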
Pro Tip: If you run into a + sign while typing a command in the console, RStudio is prompting you to finish the command with something like a ). If you can't find the mistake, you can exit by pressing the Esc key.
2. Basic Functions:
levels(name_of_dataset$name_of_categorical_variable): Used to determine the levels of a categorical variable
table(name_of_dataset$name_of_categorical_variable): Used to determine the number of cases for each level of the categorical variable
mean(name_of_dataset$name_of_variable): For example -> mean(Galton$height) - calculates and returns the mean height from the Galton data-set
median(name_of_dataset$name_of_variable): Used to find the median (middle) value of a variable
sd(name_of_dataset$name_of_variable): For example -> sd(Galton$height) - calculates and returns the standard deviation in height from the Galton data-set
quantile(name_of_dataset$name_of_variable): Returns the minimum, the three quartiles and the maximum of the given variable
IQR(name_of_dataset$name_of_variable): Used to find the interquartile range, a measure of spread. R is case-sensitive, so the function name must be typed in capitals
var(name_of_dataset$name_of_variable): For example -> var(Galton$height) - returns the sample variance
View(name_of_dataset): Displays the data in a new tab
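A minimal sketch of these summary functions, run on the built-in iris data-set so that it works without extra packages. The choice of iris is an assumption for illustration; the same calls apply to Galton$height once the Galton data is loaded.

```r
data(iris)

levels(iris$Species)          # the levels of a categorical variable
table(iris$Species)           # number of cases in each level

x <- iris$Sepal.Length        # a numeric variable
mean(x)
median(x)
sd(x)
q <- quantile(x)              # 0%, 25%, 50%, 75%, 100%
q
IQR(x)                        # same as the 75% value minus the 25% value
var(x)                        # same as sd(x)^2
```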
3. Making Graphs:
- Boxplot:
ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_boxplot() + labs(title="Title_name", x="x-axis_label", y="y-axis_label")
- Scatter-plot:
ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_point() + labs(title="Title_name", x="x-axis_label", y="y-axis_label")
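The two templates above might be filled in as follows. This sketch assumes the ggplot2 package is installed and uses the built-in iris and mtcars data-sets as stand-ins; the titles and axis labels are made up for the example.

```r
library(ggplot2)

# Boxplot: categorical explanatory variable, numeric response variable
box_plot <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  labs(title = "Sepal length by species", x = "Species", y = "Sepal length (cm)")

# Scatter-plot: numeric explanatory variable, numeric response variable
scatter_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Fuel economy by weight", x = "Weight (1000 lbs)", y = "Miles per gallon")

box_plot        # typing the name of a stored plot draws it
scatter_plot
```

Storing a plot in a variable is optional. Running the ggplot(...) line on its own draws the plot straight away.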
4. Making a Linear Model in R:
name_of_model = lm(response_variable ~ explanatory_variable, data = name_of_dataset)
regAnalysis(name_of_linear_model): Returns the p-value, coefficients and R-squared value of a linear model
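A minimal sketch of fitting a linear model, using the built-in mtcars data-set as a stand-in. regAnalysis() is not part of base R and is presumably supplied with the course materials, so this example uses base R's summary() instead, which reports the same quantities (coefficients, p-values and R-squared).

```r
data(mtcars)

# response_variable ~ explanatory_variable
mpg_model <- lm(mpg ~ wt, data = mtcars)

model_summary <- summary(mpg_model)
model_summary                      # full regression table

coef(mpg_model)                    # intercept and slope
model_summary$r.squared            # R-squared value
model_summary$coefficients[, 4]    # p-values (the "Pr(>|t|)" column)
```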
5. P-Values and Hypothesis Testing:
head(name_of_dataset)
ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_point() + labs(title = "Title_name", x = "x-axis_label", y = "y-axis_label")
ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_point() + labs(title = "Title_name", x = "x-axis_label", y = "y-axis_label") + stat_smooth(method = "lm", se = FALSE)
6. T-tests and ANOVA tests:
ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_boxplot() + labs(title="Title_name", x="x-axis_label", y="y-axis_label")
name_of_subset = subset(name_of_dataset, conditional_statement)
- For example: Galton_tall = subset(Galton, height > 75) makes a subset of the Galton data-set called Galton_tall, which includes only those individuals who are taller than 75 inches.
- For categorical variables: Galton_female = subset(Galton, sex == "F") makes a subset of the Galton data-set that includes only females.
- Subsets can include more than one condition; use the "&" symbol to join the conditions. For example: Galton_tallfemales = subset(Galton, height > 65 & sex == "F") makes a subset of the Galton data-set that includes only those females who are taller than 65 inches.
mytest = t.test(subset1$variable_of_choice, subset2$variable_of_choice): Performs a t-test to compare the two groups (in this case, the two subsets) and check for a significant difference between them. T-tests are used for comparisons between two groups
mytest$p.value: Returns the p-value of the above t-test
anova(name_of_linear_model): Performs an ANOVA test to check for a statistically significant difference between three or more groups
TukeyHSD(name_of_linear_model): Does a Tukey test. The summary table shows the difference between each pair of groups, the 95% confidence interval and the p-value of each pairwise comparison
TukeyHSD(name_of_linear_model, conf.level=0.95): Does the same Tukey test with the confidence level set explicitly (0.95 gives 95% confidence intervals)
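The workflow in this section (subset, t-test, ANOVA, Tukey test) can be sketched on the built-in iris data-set, which is used here only as a stand-in. One assumption to note: in base R, TukeyHSD() is documented for models fitted with aov(), so this sketch fits the model with aov(). Passing an lm() model works only when a loaded package supplies that method.

```r
data(iris)

# Subsets: one condition, and two conditions joined with &
setosa      <- subset(iris, Species == "setosa")
versicolor  <- subset(iris, Species == "versicolor")
long_setosa <- subset(iris, Sepal.Length > 5 & Species == "setosa")

# T-test: compare one variable between two groups
mytest <- t.test(setosa$Sepal.Length, versicolor$Sepal.Length)
mytest$p.value

# ANOVA: compare three or more groups
species_model <- lm(Sepal.Length ~ Species, data = iris)
anova(species_model)

# Tukey test: pairwise differences, confidence intervals and p-values
tukey_result <- TukeyHSD(aov(Sepal.Length ~ Species, data = iris),
                         conf.level = 0.95)
tukey_result
```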
7. Standard Curves and Logarithmic Axes:
ggplot(name_of_dataset, aes(explanatory_variable, response_variable)) + geom_point() + labs(title = "Title_name", x = "x-axis_label", y = "y-axis_label") + stat_smooth(method = "lm", se = FALSE) + coord_cartesian(xlim = c(0,35000), ylim = c(0,190))
ggplot(name_of_dataset, aes(log(explanatory_variable), response_variable)) + geom_point() + labs(title = "Title_name", x = "x-axis_label", y = "y-axis_label") + stat_smooth(method = "lm", se = FALSE) + coord_cartesian(xlim = c(0,35000), ylim = c(0,190)): Used to convert the explanatory variable to a logarithmic scale
regAnalysis(name_of_linear_model)
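A hedged sketch of the logarithmic version, assuming ggplot2 is installed and using the built-in mtcars data-set as a stand-in for a standard-curve data-set. The limits in coord_cartesian() apply to the values that are actually plotted, so once the explanatory variable is wrapped in log(), xlim has to be given in log units. The limits below were picked for this particular data and are not general values.

```r
library(ggplot2)

# log() is the natural logarithm; log(disp) runs from about 4.3 to 6.2 here
log_plot <- ggplot(mtcars, aes(log(disp), mpg)) +
  geom_point() +
  labs(title = "Fuel economy by log displacement",
       x = "log(displacement)", y = "Miles per gallon") +
  stat_smooth(method = "lm", se = FALSE) +
  coord_cartesian(xlim = c(4, 6.5), ylim = c(10, 35))

log_plot

# Matching linear model on the log scale
log_model <- lm(mpg ~ log(disp), data = mtcars)
summary(log_model)$r.squared
```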