Cell Bio Gen Lab

Cheatsheet

Pro Tip: In the console, use the up and down arrow keys to scroll through past commands.

Reading Data into RStudio:

data_name <- read.csv("filename_of_dataset.csv")
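As a minimal sketch of the command above: the file, the data frame name growth_data and its values are invented for illustration. The example writes a small CSV to a temporary file so that it can be run anywhere, then reads it back.

```r
# Write a tiny example CSV to a temporary file (hypothetical data, for illustration only)
example_file <- tempfile(fileext = ".csv")
write.csv(data.frame(sample = c("A", "B", "C"), od600 = c(0.12, 0.45, 0.80)),
          example_file, row.names = FALSE)

# Read the file into a data frame; the filename must be in straight double quotes
growth_data <- read.csv(example_file)

head(growth_data)   # shows the first rows of the data frame
dim(growth_data)    # number of cases (rows), then number of variables (columns)
```

With your own data, replace example_file with the quoted name of a file in your working directory.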

Storing variables in R:

x <- 1 (read as "x gets 1"; the <- operator assigns a value to a variable so that you can refer to the value by that name)

1. Introduction to R, RStudio and Data-Sets:

sum(number1 , number2): Takes the sum of number1 and number2 and returns the result
sqrt(number): Calculates the square root of the number provided within the parentheses. This function takes only one argument
data(name_of_dataset): Loads a built-in or package data-set into your workspace as a data frame
head(name_of_dataset): Returns the first six cases (rows) of the data-set
dim(name_of_dataset): Returns the number of cases (rows) and variables (columns) present in the data-set. The first number represents the number of cases, and the second number represents the number of variables
names(name_of_dataset): Returns the names of the variables present in the data-set
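A short sketch of these functions in use. It assumes the iris data-set that ships with R, chosen only because it needs no extra packages; any data frame works the same way.

```r
sum(2, 3)      # adds the two numbers
sqrt(16)       # square root of a single number

data(iris)     # load the built-in iris data-set (an assumption for this example)
head(iris)     # first six cases
dim(iris)      # number of cases, then number of variables
names(iris)    # names of the variables
```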

Pro Tip: If you see a + sign at the start of a line in the console, RStudio is prompting you to finish the command, for example with a closing ). If you can't find the mistake, you can exit by pressing the Esc key.

2. Basic Functions:

levels(name_of_dataset$name_of_categorical_variable): Used to determine various levels of a categorical variable
table(name_of_dataset$name_of_categorical_variable): Used to determine the number of cases for each level of the categorical variable
mean(name_of_dataset$name_of_variable): For example -> mean(Galton$height) - calculates and returns the mean height from the Galton data-set
median(name_of_dataset$name_of_variable): Used to find the median (middle) value of a variable
sd(name_of_dataset$name_of_variable): For example -> sd(Galton$height) - calculates and returns the standard deviation in height from the Galton data-set
quantile(name_of_dataset$name_of_variable): Returns the minimum, first quartile, median, third quartile and maximum of the given variable
IQR(name_of_dataset$name_of_variable): Used to find the interquartile range (the spread of the middle 50% of the values). R is case-sensitive, so the function name must be written in capitals
var(name_of_dataset$name_of_variable): For example -> var(Galton$height) - returns the sample variance
View(name_of_dataset): Displays the data in a new tab
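The summary functions above can be tried out as follows. This is a sketch that assumes the built-in iris data-set in place of Galton, with Species as the categorical variable and Sepal.Length as the numeric one.

```r
data(iris)                      # built-in data-set, used here only as a stand-in

levels(iris$Species)            # the levels of the categorical variable
table(iris$Species)             # number of cases in each level

mean(iris$Sepal.Length)         # mean
median(iris$Sepal.Length)       # median (middle value)
sd(iris$Sepal.Length)           # standard deviation
var(iris$Sepal.Length)          # sample variance (equal to sd squared)
quantile(iris$Sepal.Length)     # minimum, quartiles and maximum
IQR(iris$Sepal.Length)          # interquartile range; note the capital letters
```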

3. Making Graphs:

Boxplot: ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_boxplot() + labs(title="Title_name", x="x-axis_label", y="y-axis_label")
Scatter-plot: ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_point() + labs(title="Title_name", x="x-axis_label", y="y-axis_label")
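A filled-in sketch of both templates, assuming the ggplot2 package is installed and using the built-in iris data-set as a stand-in. The titles and axis labels are made up for the example.

```r
library(ggplot2)   # provides ggplot(), aes(), geom_boxplot(), geom_point() and labs()

# Boxplot: categorical explanatory variable on x, numeric response variable on y
box_plot <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  labs(title = "Sepal length by species", x = "Species", y = "Sepal length (cm)")

# Scatter-plot: numeric explanatory variable on x, numeric response variable on y
scatter_plot <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  labs(title = "Petal width vs. petal length", x = "Petal length (cm)", y = "Petal width (cm)")

print(box_plot)       # draws the boxplot
print(scatter_plot)   # draws the scatter-plot
```

Storing a plot in a variable is optional; typing the ggplot(...) command directly in the console draws it immediately.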

4. Making a Linear Model in R:

name_of_model = lm(response_variable ~ explanatory_variable, data= name_of_dataset)
regAnalysis(name_of_linear_model): Returns the p-value, coefficients and R-squared value of a linear model
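regAnalysis() is not part of base R, so the sketch below assumes it is a helper supplied with the course. It shows one way to read the same three quantities (coefficients, p-value and R-squared) from a linear model with base R's summary(), using the built-in mtcars data-set as a stand-in.

```r
# Fit a linear model: fuel economy (response) explained by car weight (explanatory)
fit <- lm(mpg ~ wt, data = mtcars)

fit_summary <- summary(fit)

fit_summary$coefficients   # estimate, standard error, t value and p-value for each term
fit_summary$r.squared      # R-squared of the model

# p-value for the slope of the explanatory variable
slope_p <- fit_summary$coefficients["wt", "Pr(>|t|)"]
slope_p
```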

5. P-Values and Hypothesis Testing:

head(name_of_dataset)
ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_point() + labs(title = "Title_name", x = "x-axis_label", y = "y-axis_label")
ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_point() + labs(title = "Title_name", x = "x-axis_label", y = "y-axis_label") + stat_smooth(method = "lm", se = FALSE)

6. T-tests and ANOVA tests:

ggplot(name_of_dataset, aes(x=explanatory_variable, y=response_variable)) + geom_boxplot() + labs(title="Title_name", x="x-axis_label", y="y-axis_label")
name_of_subset = subset(name_of_dataset, conditional_statement)
For example: Galton_tall = subset(Galton, height > 75) makes a subset of the Galton data-set called Galton_tall, which includes only those individuals who are taller than 75 inches
For categorical variables: Galton_female = subset(Galton, sex == "F") makes a subset of the Galton data-set that includes only females
Subsets can include more than one condition; use the "&" symbol to join the conditions. For example: Galton_tallfemales = subset(Galton, height > 65 & sex == "F") makes a subset of the Galton data-set that includes only those females who are taller than 65 inches
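The same three kinds of subset can be sketched with the built-in iris data-set, used here as a stand-in so that the example runs without the Galton data. The subset names are made up for illustration.

```r
# Numeric condition: only flowers with sepals longer than 7 cm
long_sepals <- subset(iris, Sepal.Length > 7)

# Categorical condition: note the double equals sign and the straight quotes
setosa_only <- subset(iris, Species == "setosa")

# Two conditions joined with &: both must be true for a case to be kept
long_virginica <- subset(iris, Sepal.Length > 7 & Species == "virginica")

nrow(long_sepals)      # number of cases in each subset
nrow(setosa_only)
nrow(long_virginica)
```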
mytest = t.test(subset1$variable_of_choice, subset2$variable_of_choice): Performs a t-test to check for a significant difference between the means of two groups (in this case, the two subsets). T-tests are used for comparisons between exactly two groups
mytest$p.value: Returns the p-value of the above t-test
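A runnable sketch of the t-test workflow, assuming the built-in iris data-set as a stand-in: two subsets are made, the same variable is compared between them, and the p-value is extracted.

```r
# Two groups to compare (subset names are made up for this example)
setosa     <- subset(iris, Species == "setosa")
versicolor <- subset(iris, Species == "versicolor")

# Compare mean sepal length between the two groups
mytest <- t.test(setosa$Sepal.Length, versicolor$Sepal.Length)

mytest           # full output: t statistic, degrees of freedom, confidence interval
mytest$p.value   # just the p-value
```

By default, t.test() runs Welch's t-test, which does not assume that the two groups have equal variances.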
anova(name_of_linear_model): Performs an ANOVA test to check for a statistically significant difference between three or more groups
TukeyHSD(name_of_linear_model): Performs a Tukey test. The summary table shows the difference between each pair of groups, the 95% confidence interval for that difference and the p-value of the pairwise comparison
TukeyHSD(name_of_linear_model, conf.level=0.95): The same Tukey test with the confidence level stated explicitly (0.95 is the default); change conf.level to get a different confidence interval
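A sketch of ANOVA followed by a Tukey test, using the built-in PlantGrowth data-set (three groups) as a stand-in. One assumption to note: in base R, TukeyHSD() has a method for models fitted with aov(), so the example fits the model that way; passing a linear model directly, as in the commands above, is assumed to rely on a package loaded for the course.

```r
data(PlantGrowth)   # built-in data-set: plant weight for three treatment groups

# ANOVA on a linear model: is there a difference among the three groups?
growth_lm <- lm(weight ~ group, data = PlantGrowth)
anova(growth_lm)

# Tukey test on the equivalent aov fit: which pairs of groups differ?
growth_aov <- aov(weight ~ group, data = PlantGrowth)
tukey <- TukeyHSD(growth_aov, conf.level = 0.95)
tukey   # columns: diff (difference), lwr and upr (confidence interval), p adj (p-value)
```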

7. Standard Curves and Logarithmic Axes:

ggplot(name_of_dataset, aes(explanatory_variable, response_variable)) + geom_point() + labs(title = "Title_name", x = "x-axis_label", y = "y-axis_label") + stat_smooth(method = "lm", se = FALSE) + coord_cartesian(xlim = c(0,35000), ylim = c(0,190))
ggplot(name_of_dataset, aes(log(explanatory_variable), response_variable)) + geom_point() + labs(title = "Title_name", x = "x-axis_label", y = "y-axis_label") + stat_smooth(method = "lm", se = FALSE) + coord_cartesian(xlim = c(0,35000), ylim = c(0,190)): Plots the explanatory variable on a logarithmic scale. log() is the natural logarithm, so xlim should be adjusted to the range of the log-transformed values
regAnalysis(name_of_linear_model)
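A minimal sketch of the model behind a log-scale standard curve. The concentrations and signal values are invented purely for illustration, and summary() stands in for regAnalysis(), which is assumed to be a course-provided helper.

```r
# Hypothetical standard-curve data (made-up numbers, not real measurements)
standards <- data.frame(
  concentration = c(10, 100, 1000, 10000),
  signal        = c(55, 104, 147, 195)
)

# Fit the response against the natural log of the explanatory variable
curve_fit <- lm(signal ~ log(concentration), data = standards)

coef(curve_fit)                # intercept and slope on the log scale
summary(curve_fit)$r.squared   # how well the straight line fits the logged data

# Range of the log-transformed x values, useful when choosing xlim for the plot
range(log(standards$concentration))
```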
