5 Control Flow and Functions

Control flow determines in what order code is executed. There are two basic concepts of control flow: choices (e.g., if-statements) and loops. These concepts are by no means specific to R – you will find similar implementations in many programming languages. What is specific to R, however, is their syntax and their significance in comparison to other features, such as vectorization.

Because the same concept is expressed with slightly different syntax in each programming language, code snippets come in handy. In RStudio, go to Tools > Global Options > Code > Edit Snippets to get an overview of available snippets. For example, to define a new function, just type “fun” and hit tab to autocomplete using the code snippet.

# load packages (necessary for applications, not control flow itself)
library(tidyverse)
library(palmerpenguins)

5.1 Functions

Before we start looking at choices and loops, we will learn how to write a function. Strictly speaking, the concept of functions is not part of control flow, but choices and loops often make most sense within them. The code snippet already contains boilerplate text that represents the three components of a function definition: the function name, one or more function arguments and the actual code that modifies the input. Curly braces are only necessary if the function body consists of more than one expression.

name <- function(variables) {
  
}

Let us see an example of a function. In economic analysis, some researchers like adding a small number before taking the logarithm because the logarithm of zero is undefined (R returns -Inf). Although it is debatable whether this is good practice or not, we will implement it as our first function.

Start by thinking of our function as a black box. What are the inputs, what are the outputs? Then move on and flesh out the function’s internals. Our “log regularization”, as it is sometimes called, takes a numeric input and converts it into numeric output. By convention, we might call this numeric argument x but this choice is arbitrary. Now what happens to this argument once it enters the function? Spelled out in full detail, our algorithm might look like this: Take the input x, add 0.01 and store it in a new variable y. Then apply the logarithm and store the result in a new variable z. Finally return this output.

# verbose function definition
reg_log <- function(x) {
  y <- x + 0.01
  z <- log(y)
  return(z)
}

reg_log(0)
#> [1] -4.60517

To make our code more concise, we can avoid creating new variables by nesting functions. In our example, that means applying the logarithm directly to the sum of the two numbers: log(x + 0.01). Note that creating additional variables within a function is not necessarily a problem though: y and z in the example above only exist within the black box that is your function and will not clutter up the global environment.

It is also recommended not to use return() unless you have a complex function with many if-else statements, in which case you might want to make an early return explicit. If you do not call return(), the function returns the value of the last expression it evaluates.

# shortened function definition
reg_log2 <- function(x) log(x + 0.01)
reg_log2(0)
#> [1] -4.60517
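
As a minimal sketch of the one situation where an explicit return() helps, the function below leaves early for non-numeric input and otherwise relies on the implicit return of the last expression. The name reg_log_early and the choice to return NA are assumptions made for illustration only, not part of the course code:

```r
# hypothetical example: early return for a special case
reg_log_early <- function(x) {
  # leave the function immediately if the input is not numeric
  if (!is.numeric(x)) return(NA_real_)
  # otherwise the value of the last expression is returned implicitly
  log(x + 0.01)
}

reg_log_early("red")
#> [1] NA
reg_log_early(0)
#> [1] -4.60517
```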

To add more function arguments, simply list their names in the parentheses, separated by commas. Here, we are making our function more flexible by allowing the user to add an arbitrary number which we call offset. We can specify 0.01 as the default value that is used when no offset argument is provided with a name-value pair:

# additional argument with default value
reg_log3 <- function(x, offset = 0.01) log(x + offset)

reg_log3(0)
#> [1] -4.60517
reg_log3(0, offset = 2)
#> [1] 0.6931472

There is much more to say about functions but a complete discussion is outside the scope of this course. One final word of caution though: our function design is not fool-proof in any way yet. Nothing prevents the user from applying it to a character string, for example: reg_log3("red") simply fails with R's generic error about a non-numeric argument. If you plan to share your function with other users (or just your future self), make sure to create proper documentation, checks and error messages. The chapter on functions in R for Data Science provides a good starting point.
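
To give an idea of what such checks might look like, here is a minimal sketch. The function name reg_log_checked and the wording of the error message are assumptions for illustration; a real implementation would likely check more cases:

```r
# hypothetical variant of reg_log3() with basic input checks
reg_log_checked <- function(x, offset = 0.01) {
  # stop with an informative message if the input is not numeric
  if (!is.numeric(x)) {
    stop("`x` must be numeric, not ", class(x)[1], ".")
  }
  # stopifnot() is a compact alternative for simple conditions
  stopifnot(is.numeric(offset))
  log(x + offset)
}

reg_log_checked(0)
#> [1] -4.60517

# tryCatch() captures the error so that the script keeps running
tryCatch(reg_log_checked("red"), error = function(e) conditionMessage(e))
#> [1] "`x` must be numeric, not character."
```

Writing the message with stop() rather than relying on stopifnot() alone lets us tell the user which argument is wrong and what was supplied instead.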

5.2 Choices

An if-statement allows us to execute certain code if a logical condition is fulfilled; else specifies the case when it is not fulfilled. In the example below, a variable person is assigned the value "voter" or "non-voter", depending on whether age is greater than or equal to 18.

age <- 17

if (age >= 18) person <- "voter" else person <- "non-voter"
# or shorter:
person <- if (age >= 18) "voter" else "non-voter"

person
#> [1] "non-voter"

Again, longer statements go within curly braces. By nesting multiple if-statements, your case differentiation can become arbitrarily complex. Often, if-statements are used within functions to enable different behavior depending on some condition. In our example from above, we could multiply the offset by -1 to make it positive in case the user supplied a negative number (which could of course be achieved more easily by taking the absolute value):

reg_log4 <- function(x, offset) {
  # multiply by -1 if offset is negative
  if (offset < 0) offset <- offset * (-1)
  log(x + offset)
}

reg_log4(0, offset = 0.01)
#> [1] -4.60517
reg_log4(0, offset = -0.01)
#> [1] -4.60517

Note that if-statements as used above are not vectorized, i.e., the logical condition may only have length one. Within a tibble, if_else() and case_when() are useful functions to modify or create entire columns based on some condition.
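
The following sketch shows both functions inside mutate(). The tibble people and the age groups are made up for illustration:

```r
library(dplyr)

# hypothetical example data
people <- tibble(age = c(15, 17, 18, 42))

result <- people %>%
  mutate(
    # if_else() evaluates the condition for every row at once
    person = if_else(age >= 18, "voter", "non-voter"),
    # case_when() handles more than two cases; conditions are checked in order
    group = case_when(
      age < 16 ~ "child",
      age < 18 ~ "teenager",
      TRUE ~ "adult"
    )
  )

result$person
#> [1] "non-voter" "non-voter" "voter"     "voter"
```

The final TRUE ~ "adult" acts as a catch-all for every row that did not match an earlier condition.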

5.3 Exercises I

5.3.0.1 Question 1

Sometimes it is useful to scale values such that they have a mean of zero and a standard deviation of one. When you apply R’s function for scaling to a vector of numbers, you will end up with a matrix (with additional attributes that we can ignore for now). Write a function my_scale() that modifies the output of scale() internally to yield a vector instead:

# original scale function
scale(1:10)
#>             [,1]
#>  [1,] -1.4863011
#>  [2,] -1.1560120
#>  [3,] -0.8257228
#>  [4,] -0.4954337
#>  [5,] -0.1651446
#>  [6,]  0.1651446
#>  [7,]  0.4954337
#>  [8,]  0.8257228
#>  [9,]  1.1560120
#> [10,]  1.4863011
#> attr(,"scaled:center")
#> [1] 5.5
#> attr(,"scaled:scale")
#> [1] 3.02765

# new scale function
my_scale(1:10)
#>  [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446
#>  [6]  0.1651446  0.4954337  0.8257228  1.1560120  1.4863011
Answer

First, let us make sure we understand the code from the question. 1:10 is a vector that contains the numbers from 1 to 10 as elements. To make this property more transparent, we could also write

x <- 1:10
x
#>  [1]  1  2  3  4  5  6  7  8  9 10

scale(x)
#>             [,1]
#>  [1,] -1.4863011
#>  [2,] -1.1560120
#>  [3,] -0.8257228
#>  [4,] -0.4954337
#>  [5,] -0.1651446
#>  [6,]  0.1651446
#>  [7,]  0.4954337
#>  [8,]  0.8257228
#>  [9,]  1.1560120
#> [10,]  1.4863011
#> attr(,"scaled:center")
#> [1] 5.5
#> attr(,"scaled:scale")
#> [1] 3.02765

The output of scale(x) is a matrix. This fact is obscured by its “degenerate” nature with only one column. We can check whether it is a matrix with is.matrix(scale(x)) but a seasoned R user will recognize immediately from the console output that we are facing a 2-dimensional object with both row and column indices.

To extract a single element from a 2-dimensional object like a matrix, we need to specify two coordinates: m[3,5] would extract the element in the third row and the fifth column of a matrix m. If we omit the row number (like so: m[,1]), we would end up extracting all rows of a given column. This is exactly what we want: from the matrix produced by scale(x), we only want to extract the first column – its only column.

Within the function, we can create an intermediate matrix object m or index the output from scale(x) right away:

# with intermediate object
my_scale <- function(x) {
  m <- scale(x)
  m[,1]
}

# without intermediate object
my_scale <- function(x) scale(x)[,1]

5.3.0.2 Question 2

Re-write the function scale() from scratch. It should take a numeric vector, subtract its mean and divide the result by its standard deviation. The output should be a vector, not a matrix. Hint: You may use mean() and sd() to compute the mean and the standard deviation.

Answer
my_manual_scale <- function(x) {
  (x - mean(x)) / sd(x)
}

my_manual_scale(1:10)
#>  [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446
#>  [6]  0.1651446  0.4954337  0.8257228  1.1560120  1.4863011
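
As a quick sanity check (a sketch, not part of the original exercise), we can compare our manual version against the first column of scale(). all.equal() is used instead of == because it compares numbers up to a small tolerance:

```r
my_manual_scale <- function(x) (x - mean(x)) / sd(x)

# compare against the first (and only) column of the matrix from scale()
all.equal(my_manual_scale(1:10), scale(1:10)[, 1])
#> [1] TRUE
```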

5.3.0.3 Question 3

Some researchers favor the inverse hyperbolic sine function asinh() over the logarithm because it does not go towards negative infinity at zero. Explain what arguments the function below accepts and use it to apply an inverse hyperbolic sine transformation to a value of zero.

reglog <- function(x, fun = log) {
  fun(x)
}
Answer

The function above has a second argument that allows the user to pass a function (such as the logarithm or inverse hyperbolic sine) and have it applied to the other input, x. The name-value pair fun = log sets the logarithm as the default of fun. To use asinh (the inverse hyperbolic sine) instead of log, simply override the default value with fun = asinh when you call the function. Why not write asinh() instead? Adding parentheses would trigger a function call, whereas we want to hand R the function object itself. This ability to pass functions to other functions is one of the features that make R a so-called functional programming language.

# apply inverse hyperbolic sine to zero
reglog(0, fun = asinh)
#> [1] 0

5.3.0.4 Question 4

Try to figure out what the switch() function does in the example below and write a function that does the same while only using if and else statements.

count_legs <- function(x) {
  switch(x,
    dog = 4,
    chicken = 2,
    plant = 0
  )
}

count_legs("dog")
#> [1] 4
count_legs("chicken")
#> [1] 2
count_legs("plant")
#> [1] 0
Answer

The switch() function allows you to check against a sequence of logical conditions, each of which can also be represented by an if-statement. It works best with character strings as input: in the example above, it returns the value we assigned to each of the character strings "dog", "chicken" and "plant". One difference to keep in mind: for any other string, switch() returns NULL invisibly, whereas the if-else version below returns 0.

Hint: The code snippets “ei”, “if” and “el” will make your life easier when dealing with nested if-conditions.

count_legs_ifelse <- function(x) {
  if (x == "dog") {
    return(4)
  } else if (x == "chicken") {
    return(2)
  } else {
    return(0)
  }
}

count_legs_ifelse("dog")
#> [1] 4
count_legs_ifelse("chicken")
#> [1] 2
count_legs_ifelse("plant")
#> [1] 0
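
What should happen for input that matches none of the cases? The sketch below contrasts the original count_legs(), which silently returns NULL, with a hypothetical variant (the name count_legs_default and the error message are assumptions) that uses an unnamed final argument of switch() as a fallback:

```r
# original function: no fallback
count_legs <- function(x) {
  switch(x,
    dog = 4,
    chicken = 2,
    plant = 0
  )
}

# unmatched input returns NULL without any warning
is.null(count_legs("cat"))
#> [1] TRUE

# hypothetical variant with an explicit fallback
count_legs_default <- function(x) {
  switch(x,
    dog = 4,
    chicken = 2,
    plant = 0,
    # an unnamed final argument is used when nothing else matches
    stop("Unknown input: ", x)
  )
}

count_legs_default("dog")
#> [1] 4
```

Because switch() only evaluates the alternative it selects, stop() is triggered only for unknown input.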

5.4 Loops

Loops allow you to run code repeatedly. In most programming languages, you will encounter two types of loops: for and while loops. while loops run as long as some logical condition is true, whereas for loops cycle through the elements of a vector. Therefore, be careful when running while loops if there is a chance that the logical condition never turns false. The following lines of code (where the logical condition is simply hard-coded as TRUE) will print "hello" to the console without ever stopping:

# do not run
while (TRUE) {
  print("hello")
}
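
For contrast, here is a minimal sketch of a while loop that does terminate, because the variable in its condition (a counter we call i) is updated in every iteration:

```r
# counter that changes in every iteration
i <- 1

while (i <= 3) {
  print(i)
  # without this update, the condition would stay TRUE forever
  i <- i + 1
}
#> [1] 1
#> [1] 2
#> [1] 3
```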

For our purposes, for loops will not only be safer but also more useful. Type “for” and hit tab to activate the corresponding code snippet. As you can see, we can specify the vector over whose elements we want to loop and tell R what name we will use to refer to each element when we execute the loop.

for (variable in vector) {
  
}

For example, we can write a for loop that combines a greeting with each element of the vector containing the names of our family members and prints the result:

family <- c("Hans", "Erwin", "Ingrid")

for (name in family) {
  print(str_c("It's your birthday, ", name, "!"))
}
#> [1] "It's your birthday, Hans!"
#> [1] "It's your birthday, Erwin!"
#> [1] "It's your birthday, Ingrid!"

Instead of looping over the elements of the vector themselves, we can loop over their positions. This is useful if you want to save the output in a new vector. To make this operation efficient, create the output vector with the correct length before the loop starts. seq_along(family) creates a vector of indices matching the length of the vector family:

output <- vector("double", length = length(family))

for (i in seq_along(family)) {
  output[i] <- str_length(family[i])
}

output
#> [1] 4 5 6

5.5 Vectorization

Loops are much less important in R than in other programming languages because most functions in R are vectorized. If you find yourself using a lot of loops, this is usually a sign that the code could be written more simply or more efficiently. For example, to calculate the logarithm of each number from 1 to 10 or to append it to a character string, a single call to the function is sufficient:

log(1:10)
#>  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
#>  [6] 1.7917595 1.9459101 2.0794415 2.1972246 2.3025851
str_c("My age is ", 1:10)
#>  [1] "My age is 1"  "My age is 2"  "My age is 3" 
#>  [4] "My age is 4"  "My age is 5"  "My age is 6" 
#>  [7] "My age is 7"  "My age is 8"  "My age is 9" 
#> [10] "My age is 10"

Even the function we built ourselves works on a vector because all internal operations are vectorized:

reg_log(1:10)
#>  [1] 0.009950331 0.698134722 1.101940079 1.388791241
#>  [5] 1.611435915 1.793424749 1.947337701 2.080690761
#>  [9] 2.198335072 2.303584593
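
To see that the vectorized call really does the same work as a loop, the sketch below computes the result both ways and compares them. The variable name out is an arbitrary choice:

```r
reg_log <- function(x) log(x + 0.01)

# loop version: fill a pre-allocated vector element by element
out <- vector("double", length = 10)
for (i in 1:10) {
  out[i] <- reg_log(i)
}

# vectorized version: one call on the whole vector
all.equal(out, reg_log(1:10))
#> [1] TRUE
```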

Any function that is not vectorized yet can be applied to multiple elements at once using the function map() (or lapply(), in case you are not a friend of the tidyverse). For example, assume we would like to calculate the means of multiple numeric vectors that we have collected in a list called x. We can apply mean() to each element of the list with map(x, mean):

# define list of numeric vectors
x <- list(1:3, c(4, NA, 6), 5:10)
x
#> [[1]]
#> [1] 1 2 3
#> 
#> [[2]]
#> [1]  4 NA  6
#> 
#> [[3]]
#> [1]  5  6  7  8  9 10

# apply mean() to each vector
map(x, mean)
#> [[1]]
#> [1] 2
#> 
#> [[2]]
#> [1] NA
#> 
#> [[3]]
#> [1] 7.5

As you may have noticed, one of the numeric vectors contains a missing value and causes the mean to be missing as well. To ignore missing values in the calculation of the mean, we can specify na.rm = TRUE. But since we only provide the function name (mean) in our call to map(x, mean), it is not obvious where additional arguments should go. map() does forward extra arguments to the function (map(x, mean, na.rm = TRUE)), but recent versions of purrr recommend against this style. There are several alternatives. The first one is to create a new function with the desired behavior and call it by its name in map():

my_mean <- function(x) {
  mean(x, na.rm = TRUE)
}

map(x, my_mean)
#> [[1]]
#> [1] 2
#> 
#> [[2]]
#> [1] 5
#> 
#> [[3]]
#> [1] 7.5

In case we do not intend to re-use the function ever again, we can avoid giving it a name and define it as an “anonymous function” on the fly. The lambda notation (introduced in R version 4.1.0) provides an even shorter way to define the anonymous function by replacing “function” with a backslash.

# anonymous function
map(x, function(x) mean(x, na.rm = TRUE))
#> [[1]]
#> [1] 2
#> 
#> [[2]]
#> [1] 5
#> 
#> [[3]]
#> [1] 7.5

# using lambda notation
map(x, \(x) mean(x, na.rm = TRUE))
#> [[1]]
#> [1] 2
#> 
#> [[2]]
#> [1] 5
#> 
#> [[3]]
#> [1] 7.5

The map() (and lapply()) functions always return list objects but their close relatives (called map_*()) return an atomic vector of the type you choose. For example, map_dbl(x, my_mean) will return a vector of doubles (i.e., numeric values) instead of a list. map2() and pmap() allow you to iterate over multiple lists at once.
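
A short sketch of these variants, re-using x and my_mean from above. The inputs to map2_dbl() and pmap_dbl() are made-up numbers chosen only to show how the arguments are matched up:

```r
library(purrr)

x <- list(1:3, c(4, NA, 6), 5:10)
my_mean <- function(x) mean(x, na.rm = TRUE)

# map_dbl() returns a double vector instead of a list
map_dbl(x, my_mean)
#> [1] 2.0 5.0 7.5

# map2_dbl() iterates over two inputs in parallel
map2_dbl(c(1, 2, 3), c(10, 20, 30), \(a, b) a + b)
#> [1] 11 22 33

# pmap_dbl() takes a list of inputs of any number
pmap_dbl(list(c(1, 2), c(3, 4), c(5, 6)), \(a, b, c) a + b + c)
#> [1]  9 12
```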

What if the elements you would like to iterate over are not elements of a list but rather columns of a dataframe? The map() family of functions works in this case, too, because a dataframe is simply a list of same-length vectors under the hood. Yet there is another function for exactly this purpose that works well within other tidyverse verbs like mutate() and summarize(): across(). To calculate the mean (ignoring missing values) of the variables bill_length_mm and bill_depth_mm, type:

penguins %>% 
  summarize(across(
    .cols = c(bill_length_mm, bill_depth_mm),
    .fns = my_mean
  ))
#> # A tibble: 1 × 2
#>   bill_length_mm bill_depth_mm
#>            <dbl>         <dbl>
#> 1           43.9          17.2

Because across() is aware of the dataframe it is being called on, we can use tidy-select syntax to make our code more readable. The following code chunk is equivalent to the one above but selects the variables to summarize by the prefix their names share:

penguins %>% 
  summarize(across(starts_with("bill"), my_mean))
#> # A tibble: 1 × 2
#>   bill_length_mm bill_depth_mm
#>            <dbl>         <dbl>
#> 1           43.9          17.2

Similar to the map() functions, the .fns argument of across() accepts anonymous functions and lambda notation.
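
As a sketch, the summary from above can be written without defining my_mean first. The object name bill_means is an arbitrary choice:

```r
library(dplyr)
library(palmerpenguins)

# lambda notation inside across(); equivalent to passing my_mean
bill_means <- penguins %>%
  summarize(across(starts_with("bill"), \(x) mean(x, na.rm = TRUE)))

names(bill_means)
#> [1] "bill_length_mm" "bill_depth_mm"
```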

5.6 Exercises II

5.6.0.1 Question 1

Write a for loop that iterates through the numbers from 1 to 20 and prints only those that are prime to the console. You may use the following vector of primes in your code:

# define vector of primes
primes <- c(2, 3, 5, 7, 11, 13, 17, 19)

# Example: check if 4 matches an element of the vector
4 %in% primes
#> [1] FALSE
Answer

Note that you could check whether each number between 1 and 20 matches any element from primes element by element like so: i == primes[1] | i == primes[2] | .... The %in% notation simply makes your life easier here.

for (i in 1:20) {
  if (i %in% primes) print(i)
}
#> [1] 2
#> [1] 3
#> [1] 5
#> [1] 7
#> [1] 11
#> [1] 13
#> [1] 17
#> [1] 19
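
Since %in% is vectorized, the same numbers can also be obtained without a loop. This is only a sketch of an alternative, not what the question asks for, and the name candidates is an arbitrary choice:

```r
primes <- c(2, 3, 5, 7, 11, 13, 17, 19)
candidates <- 1:20

# %in% returns one TRUE/FALSE per candidate, which we use for subsetting
candidates[candidates %in% primes]
#> [1]  2  3  5  7 11 13 17 19
```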

5.6.0.2 Question 2

Use length() instead of seq_along() to produce the same output (a vector of indices).

family <- c("Hans", "Erwin", "Ingrid")
seq_along(family)
#> [1] 1 2 3
Answer
1:length(family)
#> [1] 1 2 3
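
The two approaches differ in one edge case, which is why seq_along() is generally the safer choice inside loops. The sketch below uses an empty vector (named empty here for illustration):

```r
empty <- character(0)

# seq_along() returns an empty index vector, so a loop would not run at all
seq_along(empty)
#> integer(0)

# 1:length() counts down from 1 to 0 and yields two invalid indices
1:length(empty)
#> [1] 1 0
```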

5.6.0.3 Question 3

Use map() to transform a value of zero with our reg_log3() function three times, using the three different offset values stored in the vector y. Is it necessary to use map() here?

# define function
reg_log3 <- function(x, offset = 0.01) log(x + offset)

# define offset values
y <- c(0.1, 0.01, 0.001)
Answer

The list or vector to which we apply map() can be used to specify any argument of the function that we want to execute repeatedly. In our example, the values from y are supposed to determine the offset of reg_log3() so we define an anonymous function that passes the value from y into the offset argument of reg_log3():

map(y, \(x) reg_log3(0, offset = x))
#> [[1]]
#> [1] -2.302585
#> 
#> [[2]]
#> [1] -4.60517
#> 
#> [[3]]
#> [1] -6.907755

Note that both the addition and log() are vectorized in R, so we could have achieved the same without map():

reg_log3(0, y)
#> [1] -2.302585 -4.605170 -6.907755

5.6.0.4 Question 4

In this task we will simulate normally distributed random variables with different means using rnorm(). Use map2() to simulate one with mean 1, two with mean 10 and three with mean 100.

Hint: map2() allows you to iterate over two vectors at the same time. The output should look like this:

#> [[1]]
#> [1] -0.607242
#> 
#> [[2]]
#> [1] 10.98899 10.92222
#> 
#> [[3]]
#> [1] 101.00195  98.71317 101.06087
Answer

Whenever you intend to perform an operation repeatedly with map(), try working with a single element first to build intuition. To simulate 5 normally distributed random variables with mean zero using rnorm(), execute the following code:

rnorm(5, mean = 0)
#> [1]  1.1772430 -1.0874512  0.7989446  0.3385811  1.1173001

Now to simulate a different number of variables with different means each, we have to iterate over two vectors at the same time. This is the job of map2():

# solution using map2()
map2(1:3, c(1, 10, 100), \(x, y) rnorm(x, mean = y))
#> [[1]]
#> [1] -0.0309682
#> 
#> [[2]]
#> [1] 10.230438  9.194334
#> 
#> [[3]]
#> [1] 100.10191  98.98828 100.48003

# equivalent to 
list(
  rnorm(1, mean = 1),
  rnorm(2, mean = 10),
  rnorm(3, mean = 100)
)
#> [[1]]
#> [1] -0.5115922
#> 
#> [[2]]
#> [1] 9.848258 8.878448
#> 
#> [[3]]
#> [1] 100.72125 100.81430  99.74266

To make it more intuitive, think about what happens in the first iteration: map2() will pass the first element of the first vector (the value 1) to the rnorm() argument that we have indicated with x. The first element of the second vector (1 as well) enters as the argument indicated with y, so the first element of the output list will be created by rnorm(1, mean = 1).
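
Because the draws are random, the numbers change with every run, as the differing outputs above show. A minimal sketch of how to make the simulation reproducible and check its structure follows; the seed value 123 and the name draws are arbitrary choices:

```r
library(purrr)

# fix the state of the random number generator
set.seed(123)
draws <- map2(1:3, c(1, 10, 100), \(x, y) rnorm(x, mean = y))

# each list element has the requested number of draws
lengths(draws)
#> [1] 1 2 3
```

Running the same lines again, including set.seed(), reproduces exactly the same values.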

5.6.0.5 Question 5

Try to figure out what the following code does and adjust the code such that the transformations (“log” and “asinh”) are added as prefixes to the variable names, not suffixes.

penguins %>% 
  transmute(across(
    .cols = where(is.numeric),
    .fns = list(log = log, asinh = asinh)
  ))
#> # A tibble: 344 × 10
#>    bill_length_mm_log bill_length_mm_asinh bill_depth_mm_log
#>                 <dbl>                <dbl>             <dbl>
#>  1               3.67                 4.36              2.93
#>  2               3.68                 4.37              2.86
#>  3               3.70                 4.39              2.89
#>  4              NA                   NA                NA   
#>  5               3.60                 4.30              2.96
#>  6               3.67                 4.36              3.03
#>  7               3.66                 4.35              2.88
#>  8               3.67                 4.36              2.98
#>  9               3.53                 4.22              2.90
#> 10               3.74                 4.43              3.01
#> # ℹ 334 more rows
#> # ℹ 7 more variables: bill_depth_mm_asinh <dbl>,
#> #   flipper_length_mm_log <dbl>,
#> #   flipper_length_mm_asinh <dbl>, body_mass_g_log <dbl>,
#> #   body_mass_g_asinh <dbl>, year_log <dbl>,
#> #   year_asinh <dbl>
Answer

With across() we can apply multiple transformations to the same variables by providing a list of functions to the .fns argument. When the list elements are named as in the example above (the left-hand side of log = log), their names are automatically appended to the original names of the variables you are transforming. To change this behavior, we can pass an arbitrary character string to the .names argument using "{.fn}" and "{.col}" as placeholders for the function and variable names.

Hint: transmute() is an alternative to mutate() that discards all columns that have not been modified. The only reason we use it here is to make the output less cluttered.

penguins %>% 
  transmute(across(
    .cols = where(is.numeric),
    .fns = list(log = log, asinh = asinh),
    # specify new variable names with placeholders for function and original column name
    .names = "{.fn}_{.col}"
  ))
#> # A tibble: 344 × 10
#>    log_bill_length_mm asinh_bill_length_mm log_bill_depth_mm
#>                 <dbl>                <dbl>             <dbl>
#>  1               3.67                 4.36              2.93
#>  2               3.68                 4.37              2.86
#>  3               3.70                 4.39              2.89
#>  4              NA                   NA                NA   
#>  5               3.60                 4.30              2.96
#>  6               3.67                 4.36              3.03
#>  7               3.66                 4.35              2.88
#>  8               3.67                 4.36              2.98
#>  9               3.53                 4.22              2.90
#> 10               3.74                 4.43              3.01
#> # ℹ 334 more rows
#> # ℹ 7 more variables: asinh_bill_depth_mm <dbl>,
#> #   log_flipper_length_mm <dbl>,
#> #   asinh_flipper_length_mm <dbl>, log_body_mass_g <dbl>,
#> #   asinh_body_mass_g <dbl>, log_year <dbl>,
#> #   asinh_year <dbl>