This document is based on a presentation I gave to the grad student organization of the Department of Integrative Biology, UW–Madison, in Fall 2018. I’ve made a few changes so that it reads more clearly as a stand-alone document.
Why use functions?
Two main advantages over copy and paste:
- Create fewer errors
- Improve readability of code
Consider the following example
Small errors are easy to make and can be annoying to find.
## Error in eval(predvars, data, env): object 'cyll' not found
## Error in eval(mf, parent.frame()): object 'mrcars' not found
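Errors like the two above usually come from a single mistyped character. As a minimal sketch (the exact calls are an assumption, since only the error messages are shown), a misspelled variable and a misspelled data frame name each stop lm before it fits anything. The calls are wrapped in tryCatch here only so that the script keeps running and the messages can be inspected.

```r
# Typo in a variable name inside the formula ("cyll" instead of "cyl")
err_var <- tryCatch(
  lm(mpg ~ cyll, mtcars),
  error = function(e) conditionMessage(e)
)

# Typo in the name of the data frame ("mrcars" instead of "mtcars")
err_data <- tryCatch(
  lm(mpg ~ cyl, mrcars),
  error = function(e) conditionMessage(e)
)

print(err_var)
print(err_data)
```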
The problem is even worse when you have lots of copying.
lm(mpg ~ cyl + disp + hp + drat, mtcars)
lm(mpg ~ cyl + disp + hp + wt, mtcars)
lm(mpg ~ cyl + disp + drat + wt, mtcars)
lm(mpg ~ cyl + hp + drat + wt, mtcars)
lm(mpg ~ disp + hp + drat + wt, mtcars)
lm(disp ~ mpg + cyl + hp + drat, mtcars)
lm(disp ~ mpg + cyl + hp + wt, mtcars)
lm(disp ~ mpg + cyl + drat + wt, mtcars)
lm(disp ~ mpg + hp + drat + wt, mtcars)
lm(disp ~ cyl + hp + drat + wt, mtcars)
lm(hp ~ mpg + cyl + disp + drat, mtcars)
lm(hp ~ mpg + cyl + disp + wt, mtcars)
lm(hp ~ mpg + cyl + drat + wt, mtcars)
lm(hp ~ mpg + disp + drat + wt, mtcars)
lm(hp ~ cyl + disp + drat + wt, mtcars)
Which is better?
lm_mpg <- lm(mpg ~ factor(cyl), mtcars)
lm_disp <- lm(disp ~ factor(cyl), mtcars)
lm_hp <- lm(hp ~ factor(cyl), mtcars)
lm_drat <- lm(drat ~ factor(cyl), mtcars)
lm_wt <- lm(wt ~ factor(cyl), mtcars)
lm_qsec <- lm(qsec ~ factor(cyl), mtcars)
lm_vs <- lm(vs ~ factor(cyl), mtcars)
lm_am <- lm(am ~ factor(cyl), mtcars)
lm_gear <- lm(gear ~ factor(cyl), mtcars)
lm_carb <- lm(carb ~ factor(cyl), mtcars)
or
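A version where the shared part of the call is written only once might look like the sketch below. The names y_vars, fit_by_cyl, and lms are hypothetical, and this is one possible way to do it rather than the only one.

```r
# Response variables: every column of mtcars except cyl
y_vars <- setdiff(names(mtcars), "cyl")

# The part common to all ten calls lives in the function body;
# the only thing that changes (the response) is the argument.
fit_by_cyl <- function(y) {
  form <- as.formula(paste(y, "~ factor(cyl)"))
  lm(form, data = mtcars)
}

# One model per response, stored in a named list
lms <- lapply(y_vars, fit_by_cyl)
names(lms) <- paste0("lm_", y_vars)
```

Individual fits are then available by name, for example lms$lm_mpg.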
Some R basics
Basics of functions in R
## [1] -3 -3 -3
## [1] 0 1 2
## [1] 3 3 3
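The code that produced the output above is not shown here. One hypothetical definition that is consistent with those three results is a two-argument function called with positional arguments, with a recycled argument, and with named arguments. The name subtract and the inputs are assumptions.

```r
# Hypothetical two-argument function: subtracts y from x
subtract <- function(x, y) {
  x - y
}

subtract(1:3, 4:6)          # arguments matched by position
subtract(1:3, 1)            # the shorter argument is recycled
subtract(y = 1:3, x = 4:6)  # arguments matched by name, in any order
```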
Flexibility of lists
## Error in x[[1]] <- matrix(0, 0, 0): replacement has length zero
## [[1]]
## <0 x 0 matrix>
##
## [[2]]
## data frame with 0 columns and 0 rows
##
## [[3]]
## [1] 0.1701993 0.3846910 0.3334314
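The output above appears to contrast an atomic vector with a list. As a sketch under that assumption (the original code is not shown), an element of a numeric vector cannot hold a matrix, while a list can hold objects of any class and size.

```r
# Each element of an atomic vector holds a single value,
# so assigning an empty matrix to one element fails.
x <- numeric(3)
res <- tryCatch(
  {
    x[[1]] <- matrix(0, 0, 0)
    "ok"
  },
  error = function(e) conditionMessage(e)
)
print(res)

# A list accepts a matrix, a data frame, and a numeric vector side by side.
x <- list(matrix(0, 0, 0), data.frame(), runif(3))
sapply(x, function(el) class(el)[1])
```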
The apply functions
- They allow you to apply a function to multiple inputs.
- lapply outputs a list; sapply simplifies the result to a vector or array when it can.
## [[1]]
## [1] 5
##
## [[2]]
## [1] 6
## [1] 5 6
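The calls behind the output above are not shown. A hypothetical pair of calls that gives the same results applies one function to the inputs 1 and 2, first with lapply and then with sapply. The helper add_four is an assumption.

```r
# Hypothetical helper used for both calls
add_four <- function(x) x + 4

as_list <- lapply(1:2, add_four)  # a list with one element per input
as_vec  <- sapply(1:2, add_four)  # the same values simplified to a vector

print(as_list)
print(as_vec)
```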
General process to “functionalize” code
- Break problem into smaller sub-problems.
- For each sub-problem, write a function.
- When writing each function:
  - The main function code will include the commonalities between all situations.
  - Features that aren’t common should be input to the function as arguments.
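As a small illustration of the last two points, consider standardizing columns of mtcars. The formula is the same every time, so it goes in the function body; the column changes, so it becomes the argument. The name standardize is hypothetical.

```r
# Copy-and-paste version: the same formula repeated for each column
mpg_z <- (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)
hp_z  <- (mtcars$hp - mean(mtcars$hp)) / sd(mtcars$hp)

# Function version: the commonality is the body, the difference is the argument
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

z_scores <- lapply(mtcars[c("mpg", "hp")], standardize)
```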
Example #1: Cleaning weird files
Suppose we have a folder full of CSV files like this:
## ## Data provided by X
##
## Ozone,Solar.R,Wind,Temp,Month,Day
## 41,190,7.4,67,5,1
## NA,NA,14.3,56,5,5
## --- instrument error
## 28,NA,14.9,66,5,6
## 23,299,8.6,65,5,7
## --- instrument error
## NA,194,8.6,69,5,10
##
## ## Year observed: 1990
Problems:
- Remove unnecessary lines from each file.
- Create a single data frame from multiple cleaned files.
Clean a single CSV file to a string
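The definition of clean_str is not shown here. A minimal sketch, assuming that the lines to drop are the blank lines, the lines starting with ## and the lines starting with ---, could look like this. The temporary folder and the file name file1.csv are made up so that the example can run on its own.

```r
# Hypothetical sketch: keep only the header and data rows of one file
clean_str <- function(file_name) {
  lines <- readLines(file_name)
  is_blank   <- grepl("^\\s*$", lines)
  is_comment <- grepl("^##", lines)   # e.g. "## Data provided by X"
  is_flag    <- grepl("^---", lines)  # e.g. "--- instrument error"
  keep <- !(is_blank | is_comment | is_flag)
  # Return the remaining lines as one newline-separated string
  paste(lines[keep], collapse = "\n")
}

# Write a small example file to a temporary folder to try it out
tmp_dir <- tempfile("weird_csvs")
dir.create(tmp_dir)
writeLines(
  c("## Data provided by X", "",
    "Ozone,Solar.R,Wind,Temp,Month,Day",
    "41,190,7.4,67,5,1",
    "--- instrument error",
    "28,NA,14.9,66,5,6",
    "", "## Year observed: 1990"),
  file.path(tmp_dir, "file1.csv")
)
file_names <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

cat(clean_str(file_names[1]))
```

Returning a single string with embedded newlines matters for the next step: readr::read_csv treats a string that contains a newline as literal data rather than as a file path.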
Clean multiple files then combine them into a single data frame:
clean_df <- function(file_names) {
  # Clean each file to a single string
  cleaned_strs <- lapply(file_names, clean_str)
  # Parse each cleaned string as CSV
  data_frames <- lapply(cleaned_strs, readr::read_csv)
  # Stack the per-file data frames by row
  combined_df <- dplyr::bind_rows(data_frames)
  return(as.data.frame(combined_df))
}
head(clean_df(file_names))
## Rows: 5 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Ozone, Solar.R, Wind, Temp, Month, Day
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 7 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Ozone, Solar.R, Wind, Temp, Month, Day
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    NA      NA 14.3   56     5   5
## 3    28      NA 14.9   66     5   6
## 4    23     299  8.6   65     5   7
## 5    NA     194  8.6   69     5  10
## 6     7      NA  6.9   74     5  11
Example #2: Fitting lots of models
How can we simplify this?
lm(mpg ~ cyl + disp + hp + drat, mtcars)
lm(mpg ~ cyl + disp + hp + wt, mtcars)
lm(mpg ~ cyl + disp + drat + wt, mtcars)
lm(mpg ~ cyl + hp + drat + wt, mtcars)
lm(mpg ~ disp + hp + drat + wt, mtcars)
lm(disp ~ mpg + cyl + hp + drat, mtcars)
lm(disp ~ mpg + cyl + hp + wt, mtcars)
lm(disp ~ mpg + cyl + drat + wt, mtcars)
lm(disp ~ mpg + hp + drat + wt, mtcars)
lm(disp ~ cyl + hp + drat + wt, mtcars)
lm(hp ~ mpg + cyl + disp + drat, mtcars)
lm(hp ~ mpg + cyl + disp + wt, mtcars)
lm(hp ~ mpg + cyl + drat + wt, mtcars)
lm(hp ~ mpg + disp + drat + wt, mtcars)
lm(hp ~ cyl + disp + drat + wt, mtcars)
Problems:
- Create all necessary formulas for each of the multiple Ys.
- Fit lm based on each of the created formulas.
Input information:
- Vector of Y variables (Ys)
- Vector of possible X variables (Xs)
- Number of X variables to include in each model (n_Xs)
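The values of these inputs are not shown here. The settings below are inferred from the 15 lm calls listed earlier (three responses, six candidate variables, four predictors per model), so treat them as an assumption rather than the original code.

```r
# Inferred inputs (assumption): they reproduce the 15 models listed earlier
Ys   <- c("mpg", "disp", "hp")
Xs   <- c("mpg", "cyl", "disp", "hp", "drat", "wt")
n_Xs <- 4

# Each Y is excluded from its own predictors, leaving 5 candidates,
# so there are choose(5, 4) = 5 models per Y.
n_models <- length(Ys) * choose(length(Xs) - 1, n_Xs)
print(n_models)
```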
Make vector of all necessary formulas:
make_forms <- function(y, Xs, n_Xs) {
  poss_Xs <- Xs[Xs != y]
  n_poss_Xs <- length(poss_Xs)
  # All possible combinations:
  combs <- combn(n_poss_Xs, n_Xs, simplify = FALSE)
  # Change to names:
  names_ <- lapply(combs, function(x) poss_Xs[x])
  # Combine each set to single RHS of formula:
  rhs <- sapply(names_, paste, collapse = " + ")
  # Whole formulas as strings:
  form_strings <- paste(y, "~", rhs)
  # Convert to formulas:
  forms <- sapply(form_strings, as.formula,
                  USE.NAMES = FALSE)
  return(forms)
}
Fit lm() based on a single formula:
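The definition of single_mod is not shown here. A hypothetical version is sketched below. It builds the call with do.call so that the printed Call shows the actual formula and data = mtcars instead of the argument name; this is one way to get that behavior, not necessarily the original one.

```r
# Hypothetical sketch: fit one linear model to mtcars from a formula
single_mod <- function(form) {
  # quote(mtcars) keeps the data frame's name (not its contents) in the call
  do.call("lm", list(formula = form, data = quote(mtcars)))
}

fit <- single_mod(mpg ~ cyl + disp + hp + drat)
print(fit$call)
```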
Put both steps together:
fit_models <- function(Ys, Xs, n_Xs) {
  forms <- lapply(Ys, make_forms,
                  Xs = Xs, n_Xs = n_Xs)
  forms <- c(forms, recursive = TRUE)
  lms <- lapply(forms, single_mod)
  return(lms)
}
model_fits <- fit_models(Ys, Xs, n_Xs)
model_fits[[1]]
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat, data = mtcars)
##
## Coefficients:
## (Intercept)          cyl         disp           hp         drat
##    23.98524     -0.81402     -0.01390     -0.02317      2.15405
More information
- Mailund, T. (2017). Functional Programming in R. doi: 10.1007/978-1-4842-2746-6
  - Free through UW
- Functional Programming in R with purrr (towardsdatascience.com)
- Functional Programming (in Advanced R)