Programming with R

class: center, middle, inverse, title-slide

# Programming with R
## EPSY 887 - Fall 2019
### Jason Bryer, Ph.D.
### University at Albany
### 2019-09-19

---

# if Statements

`if` statements are a way of doing different operations conditionally. In this example, the `sample` function returns either a 0 or 1.

```r
if(sample(c(0,1), 1) > 0.5) {
	print('heads')
} else {
	print('tails')
}
```

```
## [1] "heads"
```

For simpler problems, the `ifelse` function may be easier.

```r
ifelse(sample(c(0,1), 1) > 0.5, 'heads', 'tails')
```

```
## [1] "heads"
```

---

# for loops

The `for` loop allows you to do some operation a number of times. You define some variable `i` that is set to somve value, then the all the R code within the braces (i.e. `{ }`) will be executed.

```r
for(i in 2:5) {
	print(i)
}
```

```
## [1] 2
## [1] 3
## [1] 4
## [1] 5
```

---

# Using for loop to recode

In this example, we wish to convert each of the Likert responses in the `mass` data frame to an ordered factor.

```r
for(i in 2:15) {
	mass[,i] <- factor(mass[,i], levels=1:5, labels=c('Strongly Disagree', 
			'Disagree', 'Neutral', 'Agree', 'Strongly Agree'), ordered=TRUE)
}
str(mass)
```

```
## Classes 'tbl_df', 'tbl' and 'data.frame':	75 obs. of  16 variables:
##  $ Gender: chr  "Female" "Female" "Male" "Female" ...
##  $ q1    : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q2    : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q3    : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q4    : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q5    : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q6    : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q7    : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q8    : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q9    : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q10   : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q11   : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q12   : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q13   : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ q14   : Ord.factor w/ 5 levels "Strongly Disagree"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Term  : chr  "Summer 2014" "Summer 2014" "Summer 2014" "Summer 2014" ...
```

---

# The apply functions

There are a number of `Xapply` functions that behave like `for` loops. They perform an operation over some data structure.

* `apply` - Used to execute a function to each row (`MARGIN = 1`) or column (`MARGIN = 2`).
* `lapply` - Used to execute a function to each element in a list. Returns a list of the same length of what was passed in.
* `sapply` - Used to execute a function to each element in a list. Returns a vector or array of the same length of what was passed in. This only works if the function executed on each element returns a vector of of the same length for all results.
* `vapply` - Is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.
* `tapply` - Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.
* `mappy` - Is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on.

---

# Examples

```r
apply(mass1[,2:15], 2, mean, na.rm = TRUE)
```

```
##   q1   q2   q3   q4   q5   q6   q7   q8   q9  q10  q11  q12  q13  q14 
## 3.60 3.35 4.35 2.50 3.60 2.90 2.65 3.75 2.90 2.65 2.90 2.45 2.85 3.15
```

```r
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
```

.pull-left[

```r
lapply(x, mean)
```

```
## $a
## [1] 5.5
## 
## $beta
## [1] 4.535125
## 
## $logic
## [1] 0.5
```

]
.pull-right[

```r
sapply(x, mean)
```

```
##        a     beta    logic 
## 5.500000 4.535125 0.500000
```

```r
sapply(x, quantile)
```

```
##          a        beta logic
## 0%    1.00  0.04978707   0.0
## 25%   3.25  0.25160736   0.0
## 50%   5.50  1.00000000   0.5
## 75%   7.75  5.05366896   1.0
## 100% 10.00 20.08553692   1.0
```
]

---

# while loops

`while` loops will run until some condition is met. In this example, consider 1=heads, 0=tails. How many random events would it take before getting 100 heads?

```r
sum <- 0
count <- 0
while(sum < 100) {
	sum <- sum + sample(c(0,1), 1)
	count <- count + 1
}
count # Number of loops until we got 100 head
```

```
## [1] 191
```

---

# Functions

Functions allow for organizing common procedures to easily be used later. You can specify any number of parameters with optional default values. The objects created and/or edited within the function are local to the function (i.e . not available to the calling environment). The `return` or `invisible` function is used to return value(s) to the environment calling the function. The `invisible` function will return a value but will not print it to the console if it is not assigned to an object.

```r
getStanding <- function(credits, 
						breaks=c(30, 60, 90),
						labels=c('Freshman','Sophomore', 'Junior', 'Senior')) {
	results <- cut(credits, breaks=c(-Inf, breaks, Inf), labels=labels, 
				   include.lowest=TRUE, right=FALSE)
	return(results)
}
getStanding(42)
```

```
## [1] Sophomore
## Levels: Freshman Sophomore Junior Senior
```

```r
getStanding(c(27, 60, 101, 200))
```

```
## [1] Freshman Junior   Senior   Senior  
## Levels: Freshman Sophomore Junior Senior
```

---

# The ... Operator

The `...` operator in R (technically an ellipsis, but often referred to as the dots parameter, or dot-dot-dot) allows for flexibility in how parameters are given to R. There are two major ways the `...` parameter is used:

1. To pass parameters to another function without having to explicitly define them.
2. To have an arbitrary number of parameters for a function.

---

# ... : Passing Parameters

```r
print_percent <- function(x, multiply = all(x <= 1.0), ...) {
	if(multiply) {
		x <- x * 100
	}
	x <- round(x, ...)
	return( paste0(x, '%') )
}
```

```r
print_percent(.2112)
```

```
## [1] "21%"
```

```r
print_percent(.2112, digits = 1)
```

```
## [1] "21.1%"
```

```r
print_percent(c(.4, .01, .05, .2112), digits = 2)
```

```
## [1] "40%"    "1%"     "5%"     "21.12%"
```

---

# ... : Arbitrary Arugements

```r
test_function <- function(..., 
		  param1, 
		  param2 = 'Some Default') {
	dots <- list(...)
	print('The ... parameters:')
	print(dots)
	if(missing(param1)) {
		print('param1 was not defined')
	} else {
		print(paste0('param1 = ', param1))
	}
	print(paste0('param2 = ', param2))
}
```

---

# ... : Arbitrary Arugements (cont)

```r
test_function()
```

```
## [1] "The ... parameters:"
## list()
## [1] "param1 was not defined"
## [1] "param2 = Some Default"
```

```r
test_function(param1 = 2112)
```

```
## [1] "The ... parameters:"
## list()
## [1] "param1 = 2112"
## [1] "param2 = Some Default"
```

---

# ... : Arbitrary Arugements (cont)

```r
test_function(a = 'a', b = 'b', 'c', param1 = 2112)
```

```
## [1] "The ... parameters:"
## $a
## [1] "a"
## 
## $b
## [1] "b"
## 
## [[3]]
## [1] "c"
## 
## [1] "param1 = 2112"
## [1] "param2 = Some Default"
```

---

# Object Oriented Programming in R

Object-oriented programming (OOP) is a way of programming where you define not only the data structure (attributes), but also the operations (actions). Consider a *football*. This is an object that *attributes* (e.g. size, color, shape) and *actions* (e.g. throw, catch, kick).

R has a number of OOP frameworks, however we will only consider the S3 OOP framwork since it is part of base R and you have already encountered it many times. It is an ad hoc OOP framwork in that actions are determined informally. **Generic** functions will dispatch to the class specific function using the class name. Consider the implementation of the `print` function:

```r
print
```

```
## function (x, ...) 
## UseMethod("print")
## <bytecode: 0x7f9f768bbbb0>
## <environment: namespace:base>
```

The `UseMethod` will determine the correct `print` method to call for the specified object. Consider an object where `class(object) <- c('MyClass')`. Calling `print(object)` will first look for a function `print.MyClass`, if that is not found then it will use the `print.default` function.

---

# Inheritance

A key feature of OOP frameworks is inheritance. Using our ball analogy, we could have a *super* class of type *ball*. It can define some features and actions. But we could also create a class *football* that inherits from *ball*, whereby we "inherit" certain attributes and actions, but can override and/or define new ones.

Consider the shape attribute of ball. We could represent that a simply the diamater since this is common for most balls. However, footballs are oblong and therefore require two diamaters.

A baseball could also inherit from ball, but we would override the *kick* action since that is not defined for baseball.

---

# OOP Example

```r
baseball <- function(...) {
	ball <- list(...)
	class(ball) <- c('baseball', 'ball')
	return(ball)
}

football <- function(...) {
	ball <- list(...)
	class(ball) <- c('football', 'ball')
	return(ball)
}

kick <- function(x, ...) {
	UseMethod("kick")
}

kick.default <- function(x, ...) {
	print("Kicking the ball like any other ball!")
}
```

---

# OOP Example (cont).

```r
my_baseball <- baseball()
my_football <- football()
kick(my_baseball)
```

```
## [1] "Kicking the ball like any other ball!"
```

```r
kick(my_football)
```

```
## [1] "Kicking the ball like any other ball!"
```

---

# OOP Example (cont).

```r
kick.baseball <- function(x, ...) {
	print("You don't kick baseballs!")
}

kick(my_baseball)
```

```
## [1] "You don't kick baseballs!"
```

```r
kick(my_football)
```

```
## [1] "Kicking the ball like any other ball!"
```

---

# OOP Example (cont).

```r
kick.football <- function(x, ...) {
	print("Kicking for field goal, or 1-point conversion?")
}

kick(my_baseball)
```

```
## [1] "You don't kick baseballs!"
```

```r
kick(my_football)
```

```
## [1] "Kicking for field goal, or 1-point conversion?"
```

---

# The trouble with tibbles

Consider the `mass` data frame, it has the following class definition: tbl_df, tbl, data.frame. Recall that the `[` operator on data frames is a function that has one important parameter, `drop`.

.pull-left[

```r
class(mass)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

```r
is.data.frame(mass)
```

```
## [1] TRUE
```

]
.pull-right[

We will explicitly convert `mass` to a data.frame.

```r
mass_df <- as.data.frame(mass)
class(mass_df)
```

```
## [1] "data.frame"
```

```r
is.data.frame(mass_df)
```

```
## [1] TRUE
```
]
---

# The trouble with tibbles (cont.)

Consider the situation where we wish to retrieve one column:

.pull-left[

```r
mass[,1]
```

```
## # A tibble: 75 x 1
##    Gender
##    <chr> 
##  1 Female
##  2 Female
##  3 Male  
##  4 Female
##  5 Female
##  6 Female
##  7 Female
##  8 Female
##  9 Female
## 10 Female
## # … with 65 more rows
```

```r
class(mass[,1])
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```
]

.pull-right[

```r
mass_df[,1]
```

```
##  [1] "Female" "Female" "Male"   "Female" "Female" "Female" "Female"
##  [8] "Female" "Female" "Female" "Female" "Female" "Male"   "Female"
## [15] "Male"   "Female" "Male"   "Male"   "Female" "Male"   "Female"
## [22] "Female" "Male"   "Female" "Female" "Female" "Female" "Female"
## [29] "Female" "Female" "Male"   "Male"   "Male"   "Male"   "Female"
## [36] "Female" "Female" "Male"   "Female" "Female" "Female" "Female"
## [43] "Female" "Female" "Female" "Female" "Female" "Female" "Male"  
## [50] "Female" "Female" "Female" "Female" "Female" "Female" "Female"
## [57] "Female" "Male"   "Male"   "Female" "Female" "Female" "Female"
## [64] "Female" "Female" "Female" "Female" "Male"   "Female" "Female"
## [71] "Female" "Female" "Female" "Female" "Male"
```

```r
class(mass_df[,1])
```

```
## [1] "character"
```

]

---

# Extended Example: Merging Multiple data.frames

See: https://github.com/jbryer/EPSY887-Fall2019/blob/master/R/merge_all.R

---

# Bootstrapping

* First introduced by Efron (1979) in [*Bootstrap Methods: Another Look at the Jackknife*](https://projecteuclid.org/euclid.aos/1176344552).
* Estimates confidence of statistics by resampling *with* replacement.
* The *bootstrap sample* provides an estimate of the sampling distribution.
* The `boot` R package provides a framework for doing bootstrapping: https://www.statmethods.net/advstats/bootstrapping.html

---

# Bootstrapping Example (Population)

Define our population with a uniform distribution.

```r
n <- 1e5
pop <- runif(n, 0, 1)
mean(pop)
```

```
## [1] 0.5008973
```

![](2019-09-19-Class_files/figure-html/unnamed-chunk-30-1.png)

---

# Bootstrapping Example (Sample)

We observe one random sample from the population.

```r
samp1 <- sample(pop, size = 50)
```

![](2019-09-19-Class_files/figure-html/unnamed-chunk-32-1.png)

---

# Bootstrapping Example (Estimate)

```r
boot.samples <- numeric(1000) # 1,000 bootstrap samples
for(i in seq_along(boot.samples)) { 
	tmp <- sample(samp1, size = length(samp1), replace = TRUE)
	boot.samples[i] <- mean(tmp)
}
head(boot.samples)
```

```
## [1] 0.5521028 0.5129548 0.5413835 0.5011787 0.5695299 0.5660828
```

---

# Bootstrapping Example (Distribution)

```r
d <- density(boot.samples)
h <- hist(boot.samples, plot=FALSE)
hist(boot.samples, main='Bootstrap Distribution', xlab="", freq=FALSE, 
     ylim=c(0, max(d$y, h$density)+.5), col=COL[1,2], border = "white", 
	 cex.main = 1.5, cex.axis = 1.5, cex.lab = 1.5)
lines(d, lwd=3)
```

![](2019-09-19-Class_files/figure-html/unnamed-chunk-34-1.png)

---

# 95% confidence interval

```r
c(mean(boot.samples) - 1.96 * sd(boot.samples), 
  mean(boot.samples) + 1.96 * sd(boot.samples))
```

```
## [1] 0.4643934 0.6211700
```

---

# Bootstrapping is not just for means!

```r
boot.samples.median <- numeric(1000) # 1,000 bootstrap samples
for(i in seq_along(boot.samples.median)) { 
	tmp <- sample(samp1, size = length(samp1), replace = TRUE)
	# NOTICE WE ARE NOW USING THE median FUNCTION!
	boot.samples.median[i] <- median(tmp) 
}
head(boot.samples.median)
```

```
## [1] 0.6431739 0.5998054 0.4876384 0.5642982 0.5740293 0.5740293
```

95% confidence interval for the median

```r
c(mean(boot.samples.median) - 1.96 * sd(boot.samples.median), 
  mean(boot.samples.median) + 1.96 * sd(boot.samples.median))
```

```
## [1] 0.4751294 0.6726717
```

---

# Your Turn: Create a Boostrap Function

The following R code should run successfully:

```r
# Setup our data
n <- 1e5
pop <- runif(n, 0, 1)
samp1 <- sample(pop, size = 50)

# Create an object that did the bootstrap samples	
boot1 <- bootstrap(samp1)
class(boot1) # Verify the class is "bootstrap"
# Call some S3 generic functions
print(boot1)
print(boot1, ci = 0.95)
summary(boot1)
summary(boot1, median)
summary(boot1, sd)
plot(boot1)
plot(boot1, metric = median)
```

---

# Define the required functions

Write the function that will create our object, and define our class (use `bootstrap` as the class name). The function will need (at least) two parameters: `x` which will be the sample from which the bootstrap distribution will be estimated, and `n.boot.samples` which defines the number of bootstrap samples to generate. You can optionally include the `...` parameter, but we won't use it here.

```r
bootstrap <- function(x, n.boot.samples = 1000, ...) {
	boot <- list()
	# Need to so some work here...
	class(boot) <- 'bootstrap'
	return(boot)
}
```

Define S3 functions/methods for `print`, `summary`, and `plot`:

```r
print.bootstrap <- function(x, ...) { 
}

summary.bootstrap <- function(x, ...) {
}

plot.bootstrap <- function(x, ...) {
}
```

---

# Bootstrap Function

Implement the `bootstrap` function such that it creates a `list` with two elements:
* `samples` - a list of length `n.boot.samples` where each element is a bootstrap sample (i.e. a vector with length equal to `length(x)`).
* `sample` - the value of `x`.

```r
bootstrap <- function(x, 
					  n.boot.samples = 1000,
					  ...) {
	bootstrap <- list()
	bootstrap$samples <- list()
	for(i in seq_len(n.boot.samples)) { 
		bootstrap$samples[[i]] <- sample(x, size = length(x), replace = TRUE)
	}
	bootstrap$sample <- x
	class(bootstrap) <- 'bootstrap'
	return(bootstrap)
}
```

---

# Summary Function

Implement the `summary` function so that it returns the bootstrap value. The user should be able to define what metric to use (e.g. `mean`, `median`, `sd`, etc.)

```r
summary.bootstrap <- function(x, metric = mean, ...) {
	boot_dist <- sapply(x$samples, metric, ...)
	return(mean(boot_dist))
}
```

---

# Print Function

Implement the `print` function so that it prints the following information to the console: (1) the number of bootstrap samples, (2) the bootstrap value, and (3) a confidence interval.

```r
print.bootstrap <- function(x, metric = mean, ci, ...) {
	boot_dist <- sapply(x$samples, metric, ...)
	value <- mean(boot_dist)
	value.sd <- sd(boot_dist)
	ci.str <- ''
	if(!missing(ci)) {
		cv <- abs(qnorm((1 - ci) / 2))
		ci.str <- paste0(' (', round(ci * 100), '% confidence interval: ',
						 round(value - cv * value.sd, digits = 2), ', ', 
						 round(value + cv * value.sd, digits = 2), ')')
	}
	print(paste0('Bootstrap value of ', round(value, digits = 2), ci.str,
			 ' was obtained from a bootstrap distribution (n = ', length(x$samples), 
			 ') of an original sample of ', length(x$sample), ' observations.'))
}
```

---

# Plot Function

Implement the `plot` function so that it plots a histogram of the bootstrap distributions.

```r
plot.bootstrap <- function(x, metric = mean, ...) {
	boot_dist <- sapply(x$samples, metric, ...)
	d <- density(boot_dist)
	h <- hist(boot_dist, plot=FALSE)
	hist(boot_dist, main='Bootstrap Distribution', xlab="", freq=FALSE, 
		 ylim=c(0, max(d$y, h$density)+.5), col='#569BBDC0', border = "white", 
		 cex.main = 1.5, cex.axis = 1.5, cex.lab = 1.5)
	lines(d, lwd=3)
}
```

---

# Putting it all together

```r
boot1 <- bootstrap(samp1)
class(boot1) # Verify the class is "bootstrap"
```

```
## [1] "bootstrap"
```

```r
print(boot1)
```

```
## [1] "Bootstrap value of 0.54 was obtained from a bootstrap distribution (n = 1000) of an original sample of 50 observations."
```

```r
print(boot1, ci = 0.95)
```

```
## [1] "Bootstrap value of 0.54 (95% confidence interval: 0.47, 0.62) was obtained from a bootstrap distribution (n = 1000) of an original sample of 50 observations."
```

```r
summary(boot1)
```

```
## [1] 0.5439282
```

```r
summary(boot1, median)
```

```
## [1] 0.5737131
```

---

```r
plot(boot1)
```

![](2019-09-19-Class_files/figure-html/unnamed-chunk-46-1.png)

---

```r
plot(boot1, metric = median)
```

![](2019-09-19-Class_files/figure-html/unnamed-chunk-47-1.png)