## Welcome to R!
# If you're reading this, you have decided that you want to learn more about R and how to code in this language.
# This workshop walks through the first steps of coding in R.
# Once we have our feet wet, we will do a short example.

# R is an "object-oriented language". That means R is built around "objects" (or "variables"), which can contain data.
# Objects work with your code to run anything from rudimentary to complex methods or analyses.

#### Objects ####
# Let's take a look at some Objects in R. An Object can be one of a handful of types: character, numeric, integer, boolean (called "logical" in R).
# To make an object, type the object name and assign it a value using "<-" or "=".
# You'll notice that a character object is made with quotation marks, integers have an "L" at the end, and booleans show up in blue.
# Object Naming Convention: must start with a letter, is case sensitive, and can include "." and "_".
# To view the Object, simply type the Object name in your code and hit run.

# Object Types
# We saw above some of the data types in R.
# Specific object types are important in R because they allow certain mathematical, relational, or logical operations to be applied.
# For example, you don't want to add a character object to a boolean object.

NumericObject <- 1920232
Character_Object <- "Hi, I'm a Character Object"
Boolean.Object <- TRUE

# You can print an object in the console by running just the object name
NumericObject
Character_Object
Boolean.Object

#### Activity 1 ####
# Make your own Objects called My_Birth_Month_Num and My_Birth_Month_Text
# Assign them the appropriate numeric and character values
# Remember you need to use "" for character objects

#### Data Types ####
# Like Objects, Data Structures come in different types as well.
# As shown in the slides, we have Vectors, Lists, and Data Frames, but there are a few other data structures as well, such as matrices, arrays, tibbles, and more.
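
# Matrices are mentioned above but not shown elsewhere in this script.
# As a minimal sketch (the name Matrix.Example is made up for this illustration),
# a matrix holds values of a single type arranged in rows and columns:
Matrix.Example <- matrix(1:6, nrow = 2, ncol = 3)  # values fill column by column
Matrix.Example
dim(Matrix.Example)    # number of rows, then number of columns
class(Matrix.Example)  # reports what kind of structure the object is
# Vectors, lists, and data frames, which are used in the rest of the workshop, follow.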
# Look at the five data structures below and print them to see their differences.
Vector.123 <- c(1, 2, 3)
Vector.1to10 <- c(1:10)
Vector.Animals <- c("Cat", "Dog", "Fish")

List1 <- list(Animal = c("Dalmatians", "Dwarves", "Shreks"), Count = c(101, 7, 1))

City <- c("Ottawa", "Montreal", "Calgary", "Edmonton", "Toronto", "Vancouver", "Victoria", "Whitehorse")
Province <- c("ON", "QC", "AB", "AB", "ON", "BC", "BC", "YT")
Is.Capital <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)
Population <- c(994837, 1780000, 1336000, 981280, 2731571, 631486, 85792, 25085)
Square.Kilometres <- c(2790.30, 365.65, 825.56, 685.25, 630.20, 114.97, 19.47, 416.54)
Median.House.Value <- c(400148, 380419, 450338, 390262, 601922, 1102843, 501002, 384194)
DataFrame1 <- data.frame(City, Province, Is.Capital, Population, Square.Kilometres, Median.House.Value)
DataFrame1

#### Functions ####
# Functions are the lifeblood of R. Each function takes some input, does some work on it, and outputs something else.
# Each function has a name and requires some arguments.
# For example, the mean() function needs to be told which values to take the mean of.
# You can use a new object to store the results of a function.
# You can also use the function() function to make your own functions!

# Some Functions:
print("Hello World!")
mean(DataFrame1$Population)
Population.Median <- median(DataFrame1$Population)
strsplit("Let's Split this String!", split = " ")
String <- "Let's Split this String!"
# strsplit() has no default for split, so it must always be supplied
strsplit(String, split = " ")

# Getting Help:
help(strsplit)
?strsplit

# Make your own functions:
Fahrenheit.From.Celsius <- function(Celsius) {
  Fahrenheit <- Celsius * 9 / 5 + 32
  return(Fahrenheit)
}
Fahrenheit.From.Celsius(Celsius = 20)

#### Reading and Writing Data ####
# There is almost no type of data that R can't read and manipulate.
# We have already seen text files, spreadsheets, geospatial data, Word documents, and more; however, R can handle much more, from image files to sound clips and just about anything else.
# The main types of data that you will encounter are text files or spreadsheet-type files such as CSV (Comma-Separated Values).
# Base R is able to read these files with the readLines() and read.csv() functions.
# If there is a data type that you want to read into R and don't know how, do a Google search and there will likely be a package to help you!

# Let's load the data that we will be using later and keep it as an object
# It's a CSV file, so we'll be using the read.csv() function
# Run only the line whose path matches where the file is saved on your computer
GapData2007 <- read.csv("G:/R Workshops/Intro Workshop/Gapminder2007.csv", stringsAsFactors = FALSE)
GapData2007 <- read.csv("C:/Users/christophershoniker/Downloads/R May 27/Gapminder2007.csv", stringsAsFactors = FALSE)

# Some functions for checking your data:
# Let's take a look at our data
# To look at the first entries, use the head() function
# By default head() will show you the first 6 rows
head(GapData2007)

# Or to look at the last entries use:
# Type:
tail(GapData2007)

# We can also open up the full data in spreadsheet form with:
View(GapData2007)

# We can get more information by using the summary() function:
# Type:
summary(GapData2007)

# We can check the data type of individual columns by using class():
class(GapData2007$year)

# We can see the different countries listed using the unique() function:
# Type:
unique(GapData2007$country)

# And to see how many unique countries there are, we can check how long
# unique(GapData2007$country) is by applying the length() function
length(unique(GapData2007$country))

#### Activity 2 ####
# Hint: Google if you need to figure out the correct function for a specific task
# Using GapData2007:
# Part 1 - Get the average Population (pop) of all the countries in 2007
# Part 2 - What is the largest (maximum) population represented in the dataset?
# Part 3 - What is the smallest (minimum) population represented in the dataset?

#### Indexing ####
# When you want to grab only part of a vector, matrix, dataframe, or any other data structure, you will need to index it.
# The easiest way is to use square brackets: [].
# For a vector, type the vector name followed by square brackets containing the index position of the value (R starts counting at 1).
# For a two-dimensional object, you can grab a specific value by typing the row index followed by the column index: [row, column].
# You can grab a whole row or column of a dataframe by typing the row or column number and leaving the other blank, such as:
# row one: [1,] or column four: [,4]
# With dataframes and lists, you can also grab specific columns or items by using the dollar sign $ and typing the column name.

# Vector:
Vector.123[1]

# Dataframes:
head(GapData2007)
# First Row, 5th Column:
GapData2007[1, 5]
# All Rows, 2nd Column:
GapData2007[, 2]
# 4th Row, All Columns:
GapData2007[4, ]
# All Rows, Columns 2 and 6:
GapData2007[, c(2, 6)]
# All Rows, Our Column called gdpPercap
GapData2007$gdpPercap

# Lists:
List1$Animal

#### Activity 3 ####
# Part 1 - Index GapData2007 to print (show in console) the column country
# Part 2 - Print the population for Australia

#### Operators ####
# R has many operators which allow you to carry out mathematical and relational operations.
# The simplest use of R is as a calculator: add, subtract, multiply, divide, etc.
# R also has relational operators such as greater than, less than, equal to, etc.

# Mathematical Operators:
8 + 9
2 * 7
3 / 9
log(10)  # natural logarithm
2^3
GapData2007$lifeExp * 10
GapData2007$pop / 1000000

# Relational Operators:
8 == 9
8 != 9
2 < 100
50 >= 18

#### Activity 4 ####
# Combining what we learned
# Using GapData2007:
# Part 1 - Add the column LogGDPperCap to GapData2007 by taking the log() of gdpPercap and saving it to a new column
# Hint 1: Use $ to append a new column to a dataframe. The beginning of your statement should look like this:
# GapData2007$LogGDPperCap <-
# Part 2 - Create a new column called GDP by multiplying gdpPercap with pop

# Finishing the analysis
# Is there a correlation between log(GDP/Capita) and Life Expectancy?
plot(GapData2007$LogGDPperCap, GapData2007$lifeExp)
text(GapData2007$LogGDPperCap, GapData2007$lifeExp, labels = GapData2007$country, pos = 3, cex = 0.5)
cor(GapData2007$LogGDPperCap, GapData2007$lifeExp)

#### Conditional Statements ####
# Like many other programming languages, R allows for conditional statements such as If Statements or If-Else. We have seen () and [], now we will see {}.

# If Statement:
x <- 10
if (x < 100) {
  print("It's less than 100")
}

# If-Else Statement:
x <- 546
if (x < 100) {
  print("It's less than 100")
} else {
  print("It's NOT less than 100")
}

# Ifelse Statement:
?ifelse
ifelse(GapData2007$pop > 100000000, "Yes", "No")
GapData2007$More.Than.Hundred.Million <- ifelse(GapData2007$pop > 100000000, "Yes", "No")
# Check that the new column was added
head(GapData2007)

#### Loops ####
# Looping, cycling, or iterating simply means repeating the same instructions for each value in a set. It is nothing more than automating a multi-step process by running the steps as a batch.
# A "For Loop" runs a specific number of times.
# A "While Loop" keeps running as long as a condition is TRUE; however, we won't look at While Loops now.
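
# A common use of a For Loop is to build up a result one step at a time.
# As a minimal sketch (Running.Total is a name made up for this illustration),
# the loop below adds up the numbers 1 to 10, one number per pass:
Running.Total <- 0
for (k in 1:10) {
  Running.Total <- Running.Total + k  # add the current value of k to the total
}
Running.Total
# The same answer comes from sum(1:10); the loop just spells out each step.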
# For Loops:
for (i in 1:10) {
  print(i)
}

for (j in 2:7) {
  num <- j * j
  print(num)
}

#### Working Directory ####
# Let's set our Working Directory.
# The Working Directory is the main folder where R will look first for data or R scripts that you call upon.
# Where's my Working Directory?
getwd()

# To set your Working Directory (run only the line that matches a folder on your computer):
setwd("G:/R Workshops/Intro Workshop")
setwd("C:/Users/christophershoniker/Downloads/R May 27")

# To check your Working Directory:
# Again, type:
getwd()

#### Packages ####
# Packages are what make R so great.
# There is pretty much a package for everything, and if there isn't, then someone is probably working on one as we speak.
# Packages contain functions and sometimes data created by the community.
# The CRAN (Comprehensive R Archive Network) repository stores the most used packages.
# When you find a good package online that you want to use, you have to install it before first use, but you only have to do this once.
# Use the install.packages() function to install a package from CRAN.
# Once it's installed, you can load the package using the library() function.
# Once a package is loaded, you are free to run the functions that are in that package.
# Honestly, there's a package for everything!
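
# Because a package only needs to be installed once, it can be handy to check
# whether it is already installed before calling install.packages().
# A minimal sketch: Is.Installed is a helper name invented for this illustration;
# requireNamespace() is the base R function that does the actual check.
Is.Installed <- function(Package.Name) {
  # TRUE if the package can be found and loaded, FALSE otherwise
  # Nothing is installed by this function
  isTRUE(requireNamespace(Package.Name, quietly = TRUE))
}
Is.Installed("stats")  # the stats package ships with R
# One possible use (left as a comment so nothing is installed by accident):
# if (!Is.Installed("ggplot2")) install.packages("ggplot2")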
# Let's install and load a package that we'll be using later
# When you install a package, you must put its name in quotes, but quotes are not necessary when you load the package with library()
install.packages("ggplot2")
library(ggplot2)

# ggplot2 works by starting with a ggplot() function where we call out the data and aesthetics that we want to plot
# Then we add additional information such as the plot type (here it's a scatterplot), title, themes, labels, and much more
# Let's replot our scatterplot from earlier
# First let's load a special file for colouring our countries (the file is expected in the Working Directory)
country_colors2 <- read.csv("Country_Colors.csv")
country_colors <- as.vector(country_colors2$x)
names(country_colors) <- country_colors2$X

ggplot(GapData2007, aes(x = LogGDPperCap, y = lifeExp, color = country, size = pop)) +
  geom_point(alpha = 0.7) +
  theme_classic() +
  scale_colour_manual(values = country_colors) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(title = "2007 GDP per Capita versus Life Expectancy",
       x = "Log GDP per Capita", y = "Life Expectancy")

#### Statistical Analysis with R ####
# ANOVA Table
GD2007.Anova <- aov(pop ~ continent, data = GapData2007)
summary(GD2007.Anova)

# Linear Regression
fit1 <- lm(lifeExp ~ pop + gdpPercap, data = GapData2007)
summary(fit1)
residuals(fit1)
coefficients(fit1)

# Checking Residuals
hist(fit1$residuals)
qqnorm(fit1$residuals)

#### Introducing GGanimate to Make GIFs ####
# Let's install a new package which allows us to animate our plots
# What if we want to add our Year variable to our visualizations?
# We can make a GIF to show the change over time
# We'll need another package to help us called gganimate (it builds on ggplot2)
install.packages("gganimate")
library(gganimate)
install.packages("gifski")
library(gifski)
install.packages("png")
library(png)

# Now, let's run our code below to get a plot of GDP per Capita vs Life Expectancy over time
# Note: GapData is the full multi-year Gapminder dataset (GapData2007 only holds 2007);
# this script does not load it above, so read it in before running the code below
p <- ggplot(GapData, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
  geom_point(alpha = 0.7) +
  scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  theme(legend.position = 'none') +
  labs(title = 'Year: {frame_time}', x = 'GDP Per Capita', y = 'Life Expectancy') +
  transition_time(year) +
  ease_aes('linear')
animate(p, nframes = 100, fps = 10)

# If we want to add a label for Canada, we can do that with a little bit more code:
p2Canada <- ggplot(GapData, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
  geom_point(alpha = 0.7) +
  geom_text(data = subset(GapData, country == "Canada"), aes(gdpPercap, lifeExp, label = country)) +
  scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  theme(legend.position = 'none') +
  labs(title = 'Year: {frame_time}', x = 'GDP Per Capita', y = 'Life Expectancy') +
  transition_time(year) +
  ease_aes('linear')
animate(p2Canada, nframes = 100, fps = 10)

# That's it!

#### BONUS MATERIAL ####

#### Subsetting our Data ####
# There are a few ways to subset our data.
# We have already learned about indexing, so we will use that way first.
# Secondly, we will use the subset() function to grab a specific subset.
# There are many other ways in R to grab a portion of your data; however, we will stay with only these two for this tutorial.
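
# Before working on the real data, here is a minimal sketch of both approaches
# on a tiny made-up dataframe (Toy.Data and its values are invented for illustration only):
Toy.Data <- data.frame(Name = c("A", "B", "C", "D"),
                       Group = c("X", "Y", "X", "Y"),
                       Value = c(10, 20, 30, 40),
                       stringsAsFactors = FALSE)
# Indexing: keep the rows where the condition is TRUE, and all columns
Toy.Index <- Toy.Data[Toy.Data$Group == "X", ]
# subset(): the same condition, with the column name written directly
Toy.Subset <- subset(Toy.Data, Group == "X")
Toy.Index
nrow(Toy.Index) == nrow(Toy.Subset)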
# What if we're only interested in some of our data, say only Asian countries?
# We can take a subset in multiple ways
# First, let's use what we learned about indexing to make a new all-Asian dataset
Asia <- GapData[GapData$continent == "Asia", ]
head(Asia, n = 15)

# Let's take a look at what is actually going on here:
head(GapData$continent == "Asia", n = 62)
# The expression GapData$continent == "Asia" returns a boolean vector of TRUEs and FALSEs
# If the value for a row is TRUE, that row is kept in the new dataframe; otherwise it is removed

# That seemed to work: we can look at our global environment to see that Asia has only 396
# observations, which is far fewer than the 1704 in GapData
# We could also check this using the nrow() function
nrow(Asia)
nrow(GapData)

# Another way of grabbing a subset of data is to use the subset() function
# Let's take a look at what the subset() function does first
?subset
Asia2 <- subset(GapData, continent == "Asia")
nrow(Asia) == nrow(Asia2)

# Next, let's say we want a dataframe of specific countries
# Again, there are a couple of ways to do this, but we'll do it this way
# First make a vector of the countries that we want:
CountryVector <- c("Canada", "Mexico", "China", "Vietnam", "South Africa")
# Now, index like we did before:
UniqueCountries <- GapData[GapData$country %in% CountryVector, ]
head(UniqueCountries)

# Lastly, we can index on multiple conditions as well, using "&" for "and" and "|" for "or"
CanadaRecent <- GapData[GapData$country == "Canada" & GapData$year > 1980, ]
CanadaRecent

#### Analyzing our Data ####
# Now that our data is organized and clean, we can start to look at it.
# Simple analysis like finding the mean or median for an entire column is quite simple, but the aggregate() function lets us get statistics for specific groups.
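
# To see what aggregate() does on a small scale, here is a minimal sketch
# using a made-up dataframe (Toy.Scores and its values are invented for illustration only):
Toy.Scores <- data.frame(Team = c("Red", "Red", "Blue", "Blue"),
                         Score = c(2, 4, 10, 20))
# Split Score by Team, then take the mean within each Team
Team.Means <- aggregate(Score ~ Team, data = Toy.Scores, FUN = mean)
Team.Means
# The result is a new dataframe with one row per group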
# Let's now analyze some of the data
# Let's find the mean and median life expectancy for each year
aggregate(GapData$lifeExp, by = list(GapData$year), FUN = mean)
# Mean life expectancy for each combination of year and continent
aggregate(GapData$lifeExp, by = list(GapData$year, GapData$continent), FUN = mean)
aggregate(GapData$lifeExp, by = list(GapData$year), FUN = median)

# Or we could do it this way with the ~ (formula) notation:
aggregate(lifeExp ~ continent, GapData, median)
aggregate(lifeExp ~ continent, subset(GapData, year == 2007), median)
aggregate(lifeExp ~ year, GapData, median)