## Welcome to R!


# If you're reading this, you have decided to learn more about R and how to code in it.
# This workshop walks through the first steps of coding in R.
# Once we have our feet wet, we will work through a short example.



# R is an "object-oriented language".  That means R is built around "objects" (or "variables"), which hold data.
# Objects work with your code to run anything from rudimentary to complex methods or analyses.



#### Objects ####

# Let's take a look at some Objects in R.  An Object can be one of a handful of types: character, numeric, integer, boolean (R calls this type "logical").
# To make an object, type the object name and assign it a value using "<-" or "="
# You'll notice that a character object is made with quotation marks, integers have an "L" at the end, and booleans show up as blue.
# Object Naming Convention: Must start with a letter, is case sensitive, and can contain letters, numbers, "." and "_"
# To view the Object, simply type the Object name in your code and hit run

# Object Types

# We saw above some of the data types in R  

# Specific object types are important in R because they determine which mathematical, relational, or logical operations can be applied.
# For example, you don't want to add a character object to a boolean object.

NumericObject <- 1920232
Character_Object <- "Hi, I'm a Character Object"
Boolean.Object <- TRUE

# You can print the object in the console by running just the object name
NumericObject

Character_Object

Boolean.Object
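
# A small sketch (not part of the original workshop script) showing how to check an
# object's type with class(), and how the "L" suffix makes an integer.
# The Demo.* names are made up for this illustration.
Demo.Integer <- 42L
Demo.Numeric <- 42

class(Demo.Integer)   # "integer"
class(Demo.Numeric)   # "numeric"
class("42")           # "character"
class(TRUE)           # "logical" (R's name for boolean)

# Converting between types: as.numeric() turns the text "42" into the number 42
Demo.Converted <- as.numeric("42")
Demo.Converted + 1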


#### Activity 1 ####
# Make your own Objects called My_Birth_Month_Num and My_Birth_Month_Text
# Assign them the appropriate numeric and character values
# Remember you need to use "" for character objects


#### Data Types ####

# Like Objects, Data Structures come in different types as well.
# As shown in the slides, we have Vectors, Lists, and Data Frames, but there are a few other data structures as well, such as matrices, arrays, tibbles, and more.
# Look at the five objects below (three vectors, a list, and a data frame) and print them to see their differences.


Vector.123 <- c(1,2,3)
Vector.1to10 <- c(1:10)
Vector.Animals <- c("Cat", "Dog", "Fish")

List1 <- list(Animal = c("Dalmatians", "Dwarves", "Shreks"), Count = c(101, 7, 1))

City <- c("Ottawa", "Montreal", "Calgary", "Edmonton", "Toronto", "Vancouver", "Victoria", "Whitehorse")
Province <- c("ON", "QC", "AB", "AB", "ON", "BC", "BC", "YT")
Is.Capital <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)
Population <- c(994837, 1780000, 1336000, 981280, 2731571, 631486, 85792, 25085)
Square.Kilometres <- c(2790.30, 365.65, 825.56, 685.25, 630.20, 114.97, 19.47, 416.54)
Median.House.Value <- c(400148, 380419, 450338, 390262, 601922, 1102843, 501002, 384194)

DataFrame1 <- data.frame(City, Province, Is.Capital, Population, Square.Kilometres, Median.House.Value)

DataFrame1
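
# A quick sketch of how to tell these structures apart, using small made-up
# Demo.* objects (hypothetical names, not part of the workshop data).
# class() names the structure, length() counts its elements, and str() prints a compact summary.
Demo.Vector <- c(10, 20, 30)
Demo.List <- list(Name = "Ottawa", Codes = c(613, 343))
Demo.Matrix <- matrix(1:6, nrow = 2, ncol = 3)
Demo.Frame <- data.frame(Letter = c("a", "b"), Number = c(1, 2))

class(Demo.Vector)
class(Demo.List)
class(Demo.Matrix)
class(Demo.Frame)

length(Demo.List)   # a list can hold items of different types and lengths
dim(Demo.Matrix)    # rows, then columns
str(Demo.Frame)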

#### Functions ####

# Functions are the lifeblood of R.  Each function takes some input, does some work on it, and outputs something else.
# Each function has a name and requires some arguments.  
# For example, the mean() function requires the argument of what values to take the mean of.  
# You can use a new object to store the results of a function.  
# You can also use the function() function to make your own functions!


# Some Functions:
print("Hello World!")
mean(DataFrame1$Population)
Population.Median <- median(DataFrame1$Population)

strsplit("Let's Split this String!", split = " ")
String <- "Let's Split this String!"
# The split argument has no default, so it must be supplied here as well
strsplit(String, split = " ")


# Getting Help:
help(strsplit)
?strsplit

#Make your own functions:
Fahrenheit.From.Celsius <- function(Celsius){
  Fahrenheit <- Celsius*9/5 +32
  return(Fahrenheit)
}

Fahrenheit.From.Celsius(Celsius = 20)
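
# A sketch of the reverse conversion, this time with a default argument value.
# Celsius.From.Fahrenheit is a hypothetical function added for illustration.
Celsius.From.Fahrenheit <- function(Fahrenheit = 32){
  Celsius <- (Fahrenheit - 32) * 5/9
  return(Celsius)
}

Celsius.From.Fahrenheit()                  # no argument given, so the default of 32 is used
Celsius.From.Fahrenheit(Fahrenheit = 68)
Celsius.From.Fahrenheit(c(32, 50, 212))    # arithmetic in R works on whole vectors at once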



#### Reading and Writing Data ####

# There is almost no type of data that R can't read and manipulate.
# We have already seen text files, spreadsheets, geospatial data, word documents, and more; R can also handle image files, sound clips, and just about anything else.
# The main types of data that you will encounter are text files or spreadsheet-type files such as CSV (Comma-Separated Values).
# Base R is able to read these files with the readLines() and read.csv() functions.
# If there is a data type that you want to read into R and don't know how, do a Google search and there will likely be a package to help you!


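# Let's load the data that we will be using later and keep it as an object
# It's a csv file so we'll be using the read.csv() FUNCTION
# The two lines below point to the same file name in two different folders:
# change the path to wherever you saved Gapminder2007.csv and run only that line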

GapData2007 <- read.csv("G:/R Workshops/Intro Workshop/Gapminder2007.csv", stringsAsFactors = FALSE)
GapData2007 <- read.csv("C:/Users/christophershoniker/Downloads/R May 27/Gapminder2007.csv", stringsAsFactors = FALSE)
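
# The section title mentions writing data too. A minimal sketch of a round trip with
# write.csv() and read.csv(), using a made-up data frame and a temporary file
# (Demo.Pets, Demo.File, and Demo.Pets.Again are hypothetical names for this illustration).
Demo.Pets <- data.frame(Animal = c("Cat", "Dog", "Fish"), Legs = c(4, 4, 0))
Demo.File <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(Demo.Pets, Demo.File, row.names = FALSE)
Demo.Pets.Again <- read.csv(Demo.File, stringsAsFactors = FALSE)
Demo.Pets.Again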

# Some Functions for checking your data:

#Let's take a look at our data
#To look at the first entries use the head() function
#By default head() will show you the first 6 rows

head(GapData2007)

#Or to look at the last entries use:
# Type tail(GapData2007)



#We can also open up the full data in spreadsheet form by:

View(GapData2007)

#We can get more information by using the summary() function:
# Type summary(GapData2007)



#We can check the data type of individual columns by using class():

class(GapData2007$year)

#We can see the different countries listed using the unique() function:
# Type unique(GapData2007$country)



#And to see how many unique countries there are, we can check how long
#unique(GapData2007$country) is by applying the length() function

length(unique(GapData2007$country))
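
# Three more checks that are handy on any data frame: dim(), names(), and str().
# A sketch on a small made-up data frame (Demo.Check is a hypothetical name).
Demo.Check <- data.frame(country = c("Canada", "Mexico", "Canada"),
                         year = c(2002, 2007, 2007),
                         stringsAsFactors = FALSE)

dim(Demo.Check)      # number of rows, then number of columns
names(Demo.Check)    # column names
str(Demo.Check)      # type and first values of every column
length(unique(Demo.Check$country))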

#### Activity 2 ####
#Hint: Google if you need to figure out the correct function for a specific task
#Using GapData2007:
#Part 1 - Get the average Population (pop) of all the countries in 2007
#Part 2 - What is the largest (maximum) population represented in the dataset?
#Part 3 - What is the smallest (minimum) population represented in the dataset?





#### Indexing ####

# When you want to grab only part of a vector, matrix, dataframe, or any other data structure, you will need to index it.
# The easiest way is to use square brackets: [].
# For a vector, you simply type the vector name followed by square brackets containing the position of the value you want (R starts counting at 1).
# For a two dimensional object, you can grab a specific value by typing in the row value followed by the column value: [row, column].  
# You can grab a row or column of the dataframe by typing the row or column number and a blank such as:  
# row one: [1,] or for column four: [,4]  
# With dataframes and lists, you can also grab specific columns or items by using the dollar sign $ and typing the column name.  



#Vector:
Vector.123[1]

#Dataframes:
head(GapData2007)

#First Row, 5th Column:
GapData2007[1,5]
#All Rows, 2nd Column:
GapData2007[,2]
#4th Row, All Columns:
GapData2007[4,]
#All Rows, Columns 2 and 6:
GapData2007[,c(2,6)]
#All Rows, Our Column called gdpPercap
GapData2007$gdpPercap

#Lists:
List1$Animal
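
# Indexing with a condition: a sketch on a made-up data frame (Demo.Cities is a hypothetical name).
# Putting a TRUE/FALSE test where the row number goes keeps only the rows where the test is TRUE.
Demo.Cities <- data.frame(City = c("Ottawa", "Calgary", "Toronto"),
                          Population = c(994837, 1336000, 2731571),
                          stringsAsFactors = FALSE)

Demo.Cities$City == "Calgary"                               # FALSE TRUE FALSE
Demo.Cities[Demo.Cities$City == "Calgary", ]                # the whole row for Calgary
Demo.Cities[Demo.Cities$City == "Calgary", "Population"]    # one value, choosing the column by name
Demo.Cities[-1, ]                                           # a negative index drops that row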

#### Activity 3 ####
# Part 1 - Index GapData2007 to print (show in console) the column country
# Part 2 - Print the population for Australia





#### Operators ####

# R has many operators which allow you to carry out mathematical and relational operations.  
# The simplest use of R is to use it as a calculator: Add, Subtract, Multiply, Divide, etc  
# R also has relational operators such as greater than, less than, or equal to, etc

#Mathematical Operators:
8 + 9
2 * 7
3/9
log(10)   #natural logarithm (base e)
2^3
GapData2007$lifeExp * 10
GapData2007$pop / 1000000

#Relational Operators:
8 == 9
8 != 9
2 < 100
50 >= 18
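
# Two more operator families, sketched with plain numbers:
# logical operators combine TRUE/FALSE values, and %% and %/% give the remainder and the whole-number quotient.
(8 < 9) & (2 < 1)    # AND: TRUE only if both sides are TRUE
(8 < 9) | (2 < 1)    # OR: TRUE if at least one side is TRUE
!(8 < 9)             # NOT: flips TRUE to FALSE

17 %% 5              # remainder after division
17 %/% 5             # whole-number part of the division

# Operators work element by element on vectors
c(1, 50, 100) >= 50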

#### Activity 4 #### 
#Combining what we learned
#Using GapData2007:
#Part 1 - Add the column LogGDPperCap to GapData2007 by taking the log() of gdpPercap and saving it to a new column
#Hint 1: Use $ to append a new column to a dataframe: The beginning of your statement should look like this:
#GapData2007$LogGDPperCap <- 
#Part 2 - Create a new column called GDP by multiplying gdpPercap with pop





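#Finishing the analysis
#Is there a correlation between log(GDP/Capita) and Life Expectancy?
#Note: this uses the LogGDPperCap column from Activity 4, so complete Part 1 of that activity first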

plot(GapData2007$LogGDPperCap,GapData2007$lifeExp)
text(GapData2007$LogGDPperCap,GapData2007$lifeExp,labels = GapData2007$country, pos = 3, cex = 0.5)
cor(GapData2007$LogGDPperCap,GapData2007$lifeExp)


#### Conditional Statements ####

# Like many other programming languages, R supports conditional statements such as If and If-Else.  We have seen () and [], now we will see {}.

#If Statement:
x <- 10

if(x < 100){
  print("It's less than 100")
}


#If-Else Statement:
x <- 546

if(x < 100){
  print("It's less than 100")
} else {
  print("It's NOT less than 100")
}
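
# When there are more than two cases, conditions can be chained with "else if".
# A sketch using made-up objects Demo.Value and Demo.Label;
# R checks each condition in order and runs the first block whose condition is TRUE.
Demo.Value <- 250

if(Demo.Value < 100){
  Demo.Label <- "It's less than 100"
} else if(Demo.Value < 1000){
  Demo.Label <- "It's between 100 and 999"
} else {
  Demo.Label <- "It's 1000 or more"
}

print(Demo.Label)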

#ifelse() Function (tests every value in a vector at once):
?ifelse
ifelse(GapData2007$pop > 100000000, "Yes", "No")

GapData2007$More.Than.Hundred.Million <- ifelse(GapData2007$pop > 100000000, "Yes", "No")
#Check that the new column was added
head(GapData2007)


#### Loops ####

# Looping, cycling, or iterating just means repeating instructions across a set of values.  It is nothing more than automating a multi-step process by running the steps as a batch.
# A "For Loop" runs a specific number of times.
# A "While Loop" keeps running as long as a condition is TRUE; however, we won't look at While Loops now.

#For Loops:
for (i in 1:10){
  print(i)
}

for (j in 2:7){
  num <- j*j
  print(num)
}
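
# A For Loop does not have to run over numbers. A sketch that loops over a made-up
# character vector (Demo.Animals) and then builds up a running total (Demo.Total) inside a loop.
Demo.Animals <- c("Cat", "Dog", "Fish")

for (animal in Demo.Animals){
  print(paste("I have a", animal))
}

Demo.Total <- 0
for (k in 1:5){
  Demo.Total <- Demo.Total + k
}
Demo.Total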


#### Working Directory ####

# Let's set our Working Directory.
# The Working Directory is the main folder where R looks first for data or for R scripts that you call upon

# Where's my Working Directory?
getwd()

# To set your Working Directory (change the path to a folder on your own computer and run only that line):

setwd("G:/R Workshops/Intro Workshop")
setwd("C:/Users/christophershoniker/Downloads/R May 27")

# To check your Working Directory:
#Again type getwd()
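
# A sketch of helpers that pair well with the Working Directory:
# list.files() shows what is in the folder, and file.exists() checks for a file before you try to read it.
# "Gapminder2007.csv" is the workshop file; whether it is in your folder depends on your setup.
# Demo.Has.File is a hypothetical name for this illustration.
list.files()
Demo.Has.File <- file.exists("Gapminder2007.csv")
Demo.Has.File

# file.path() joins folder and file names into a single path
file.path(getwd(), "Gapminder2007.csv")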




#### Packages ####

# Packages are what make R so great.  
# There is pretty much a package for everything, and if there isn't then someone is probably working on one as we speak.  
# Packages contain functions and sometimes data created by the community.  
# The CRAN (Comprehensive R Archive Network) repository stores the most used packages.
# When you find a good package that you want to use, you have to install it first, but you only have to do this once.
# Use the install.packages() function to install a package from CRAN.  
# Once it's installed, you can load the package using the library() function.  

# Once a package is loaded, you are free to run the functions that are in that package.  

# Honestly, there's a package for everything!  



# Let's install and load a package that we'll be using later
# When you install the package, you must put its name in quotes, but quotes are not necessary when you load the package with library()

install.packages("ggplot2")
library(ggplot2)
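
# Running install.packages() every time the script runs is slow. A common pattern, sketched here
# with a hypothetical helper called Install.If.Missing, installs a package only when it is not already present.
# It returns TRUE if the package was already installed and FALSE if it had to be installed.
Install.If.Missing <- function(Package.Name){
  Already.Installed <- requireNamespace(Package.Name, quietly = TRUE)
  if(!Already.Installed){
    install.packages(Package.Name)
  }
  return(Already.Installed)
}

# "stats" ships with R, so this returns TRUE and installs nothing
Install.If.Missing("stats")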



#ggplot2 works by starting with a ggplot() function where we call out our data and the aesthetics that we want to plot
#Then we add additional information such as the plot type (here it's a scatterplot made with geom_point()), title, themes, labels, and much much more

#Let's replot our scatterplot from earlier

#First let's load a special file for colouring our countries (it must be in your Working Directory)
country_colors2 <- read.csv("Country_Colors.csv")

country_colors <- as.vector(country_colors2$x)
names(country_colors) <- country_colors2$X

ggplot(GapData2007,
       aes(x = LogGDPperCap, y = lifeExp, color = country, size = pop)) +
  geom_point(alpha = 0.7) +
  theme_classic() +
  scale_colour_manual(values = country_colors) +
  # One labs() call sets the title and both axis labels (the x axis is the logged value)
  labs(title = "2007 GDP per Capita versus Life Expectancy",
       x = "Log GDP per Capita", y = "Life Expectancy") +
  # Hide the legend and centre the title
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))





#### Statistical Analysis with R ####

# ANOVA Table
GD2007.Anova <- aov(pop ~ continent, data = GapData2007)

summary(GD2007.Anova)

# Linear Regression
fit1 <- lm(lifeExp ~ pop + gdpPercap, data = GapData2007)

summary(fit1)
residuals(fit1)
coefficients(fit1)

#Checking Residuals
hist(fit1$residuals)
qqnorm(fit1$residuals)
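
# Once a model is fitted, predict() applies it to new values. A sketch using R's built-in
# cars dataset (speed and stopping distance) as a stand-in, so it runs without the workshop file.
# Demo.Fit is a hypothetical name for this illustration.
Demo.Fit <- lm(dist ~ speed, data = cars)
coefficients(Demo.Fit)

# Predicted stopping distance for speeds of 10 and 20
Demo.Predictions <- predict(Demo.Fit, newdata = data.frame(speed = c(10, 20)))
Demo.Predictions

# qqline() adds a reference line to the Q-Q plot; points close to the line suggest roughly normal residuals
qqnorm(residuals(Demo.Fit))
qqline(residuals(Demo.Fit))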




#### Introducing GGanimate to Make GIFs ####

# Let's install a new package which allows us to animate our plots

#What if we want to add our year variable to our visualizations?  We can make a GIF to show the change over time
#We'll need another package to help us called gganimate (it builds on ggplot2)

install.packages("gganimate")
library(gganimate)

install.packages("gifski")
library(gifski)
install.packages("png")
library(png)
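
# The plots below use an object called GapData, the full Gapminder table with every year,
# which this script never loads. A minimal sketch of one way to get it, ASSUMING the CRAN
# package "gapminder" is an acceptable source (the workshop may have used its own CSV file instead).
# If the package is missing, run install.packages("gapminder") first.
if (!exists("GapData") && requireNamespace("gapminder", quietly = TRUE)) {
  GapData <- as.data.frame(gapminder::gapminder)
}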


#Now, let's run our code below to get a plot of GDP per Capita vs Life Expectancy over time

p <- ggplot(GapData, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
  geom_point(alpha = 0.7) +
  scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  theme(legend.position = 'none') +
  labs(title = 'Year: {frame_time}', x = 'GDP Per Capita', y = 'Life Expectancy') +
  transition_time(year) +
  ease_aes('linear')

#Render 100 frames at 10 frames per second
animate(p, nframes = 100, fps = 10)

#If we want to add a label for Canada, we can do that with a little bit more code:


p2Canada <- ggplot(GapData, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
  geom_point(alpha = 0.7) +
  geom_text(data = subset(GapData, GapData$country == "Canada"), aes(gdpPercap, lifeExp, label = country)) +
  scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  theme(legend.position = 'none') +
  labs(title = 'Year: {frame_time}', x = 'GDP Per Capita', y = 'Life Expectancy') +
  transition_time(year) +
  ease_aes('linear')

animate(p2Canada, nframes = 100, fps = 10)



# That's it!





#### BONUS MATERIAL ####


#### Subsetting our Data ####

# There are a few ways to subset our data.  
# We have already learned about indexing so we will use that way first.  
# Secondly, we will use the subset() function to grab a specific subset.  
# There are many other ways in R to grab a portion of your data, however, we will stay with only these two for this tutorial.


#What if we're only interested in some of our data, say only Asian countries
#We can take a subset in multiple ways
#First, let's use what we learned about indexing to make a new all-Asian dataset

Asia <- GapData[GapData$continent == "Asia",]
head(Asia, n=15)

#Let's take a look at what is actually going on here:
head(GapData$continent == "Asia", n=62)

#The expression GapData$continent == "Asia" returns a boolean vector of TRUE and FALSE values
#If a row's value is TRUE, that row is kept in the new dataframe; otherwise it is removed


#That seemed to work; we can look at our global environment to see that Asia has only 396
#observations, which is far fewer than the 1704 in GapData
#We could also check this using the nrow() function

nrow(Asia)
nrow(GapData)

#Another way of grabbing a subset of data is to use the subset() function
#Let's take a look at what the subset() function does first
?subset
Asia2 <- subset(GapData, GapData$continent == "Asia")

nrow(Asia) == nrow(Asia2)

#Next, let's say we want a dataframe of specific countries
#Again, there are a couple of ways to do this, but we'll do it this way

#First make a vector of the countries that we want:
CountryVector <- c("Canada", "Mexico", "China", "Vietnam", "South Africa")

#Now, index like we did before:
UniqueCountries <- GapData[GapData$country %in% CountryVector,]

head(UniqueCountries)

#Lastly, we can index on multiple conditions as well using "&" and "|" for "and" and "or"

CanadaRecent <- GapData[GapData$country == "Canada" & GapData$year > 1980,]

CanadaRecent
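
# subset() can combine conditions too, and its select argument picks which columns to keep.
# A sketch on a made-up data frame (Demo.Scores and Demo.Red.Recent are hypothetical names).
# Inside subset() you can use column names directly, without repeating the dataframe name and $.
Demo.Scores <- data.frame(Team = c("Red", "Red", "Blue", "Blue"),
                          Year = c(2001, 2002, 2001, 2002),
                          Score = c(10, 12, 7, 9),
                          stringsAsFactors = FALSE)

Demo.Red.Recent <- subset(Demo.Scores, Team == "Red" & Year > 2001, select = c(Team, Score))
Demo.Red.Recent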


#### Analyzing our Data ####

# Now that our data is organized and clean, we can start to look at it.
# Finding the mean or median for an entire column is quite simple, but the aggregate() function lets us get statistics for specific groups.

#Let's now analyze some of the data
#Let's find the mean and median life expectancy for each year

aggregate(GapData$lifeExp, by = list(GapData$year), FUN = mean)

aggregate(GapData$lifeExp, by = list(GapData$year, GapData$continent), FUN = mean)

aggregate(GapData$lifeExp, by = list(GapData$year), FUN = median)

#Or we could do it this way with the ~:

aggregate(lifeExp ~ continent, GapData, median)
aggregate(lifeExp ~ continent, subset(GapData, GapData$year == 2007), median)
aggregate(lifeExp ~ year, GapData, median)
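
# A sketch of aggregate() on a tiny made-up data frame, so the grouping is easy to check by eye
# (Demo.Marks and Demo.Means are hypothetical names).
Demo.Marks <- data.frame(Team = c("Red", "Red", "Blue", "Blue"),
                         Score = c(10, 12, 7, 9),
                         stringsAsFactors = FALSE)

# Mean Score for each Team: Blue is (7 + 9) / 2 and Red is (10 + 12) / 2
Demo.Means <- aggregate(Score ~ Team, data = Demo.Marks, FUN = mean)
Demo.Means

# tapply() is another base R option; it returns a named vector instead of a dataframe
tapply(Demo.Marks$Score, Demo.Marks$Team, mean)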