# 5 Control Flow and Functions

Control flow determines in what order code is executed. There are two basic concepts of control flow: choices (e.g., if-statements) and loops. These concepts are by no means specific to R – you will find similar implementations in many programming languages. What *is* specific to R, however, is their syntax and their significance in comparison to other features, such as vectorization.

Because the same concept is expressed with slightly different syntax in each programming language, code snippets come in handy. In RStudio, go to `Tools > Global Options > Code > Edit Snippets` to get an overview of available snippets. For example, to define a new function, just type “fun” and hit tab to autocomplete using the code snippet.

```
# load packages (necessary for applications, not control flow itself)
library(tidyverse)
library(palmerpenguins)
```

## 5.1 Functions

Before we start looking at choices and loops, we will learn how to write a function. Strictly speaking, the concept of functions is not part of control flow, but choices and loops often make most sense *within* them. The code snippet already contains boilerplate text that represents the three components of a function definition: the function name, the function arguments and the actual code that modifies the input. Curly braces are only necessary if the function body consists of more than one expression.

```
name <- function(variables) {
}
```

Let us see an example of a function. In economic analysis, some researchers like adding a small number before taking the logarithm to avoid ending up with missing or undefined values in the case of zero. Although it is debatable whether this is good practice or not, we will implement it as our first function.

Start by thinking of our function as a black box. What are the inputs, what are the outputs? Then move on and flesh out the function’s internals. Our “log regularization”, as it is sometimes called, takes a numeric input and converts it into numeric output. By convention, we might call this numeric argument `x`, but this choice is arbitrary. Now what happens to this argument once it enters the function? Spelled out in detail, our algorithm might look like this: take the input `x`, add 0.01 and store the sum in a new variable `y`. Then apply the logarithm and store the result in a new variable `z`. Finally, return this output.

```
# verbose function definition
reg_log <- function(x) {
  y <- x + 0.01
  z <- log(y)
  return(z)
}
reg_log(0)
#> [1] -4.60517
```

To make our code more concise, we can avoid creating new variables by nesting functions. In our example, that means applying the logarithm directly to the sum of the two numbers: `log(x + 0.01)`. Note that creating additional variables within a function is not necessarily a problem though: `y` and `z` in the example above only exist within the black box that is your function and will not clutter up the global environment.

It is also recommended not to use `return()` unless you have a very complex function with many if-else statements, in which case you might want to make an early return explicit. If you do not specify it, the function returns the value of the last expression it evaluates.

```
# shortened function definition
reg_log2 <- function(x) log(x + 0.01)
reg_log2(0)
#> [1] -4.60517
```

To add more function arguments, simply list their names in the parentheses, separated by commas. Here, we make our function more flexible by allowing the user to add an arbitrary number, which we call `offset`. With a name-value pair, we can specify 0.01 as the default value that is used when no `offset` argument is provided:

```
# additional argument with default value
reg_log3 <- function(x, offset = 0.01) log(x + offset)
reg_log3(0)
#> [1] -4.60517
reg_log3(0, offset = 2)
#> [1] 0.6931472
```

There is much more to say about functions but a complete discussion is outside the scope of this course. One final word of caution though: our function design is not fool-proof in any way yet. Nothing prevents the user from applying it to a character string, for example: `reg_log3("red")`. If you plan to share your function with other users (or just your future self), make sure to create proper documentation, checks and error messages. The chapter on functions in *R for Data Science* provides a good starting point.
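As a rough illustration of what such checks might look like, the sketch below adds input validation to our function. The name `reg_log_checked()` and the wording of the error messages are made up for this example; they are one possible design, not a prescribed one.

```r
# hypothetical variant of reg_log3() with basic input checks
reg_log_checked <- function(x, offset = 0.01) {
  # stop early with an informative message if the inputs are not numeric
  if (!is.numeric(x)) {
    stop("`x` must be numeric, not ", class(x)[1], ".")
  }
  if (!is.numeric(offset)) {
    stop("`offset` must be numeric, not ", class(offset)[1], ".")
  }
  log(x + offset)
}

# works as before for numeric input
reg_log_checked(0)

# reg_log_checked("red") now stops with our own error message
```

Checking the inputs at the top of the function means that a user sees a message written in terms of the function's arguments rather than an error raised somewhere inside the function body.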

## 5.2 Choices

An if-statement allows us to execute certain code if a logical condition is fulfilled; `else` specifies what happens when it is not fulfilled. In the example below, a variable `person` is assigned the value `"voter"` or `"non-voter"`, depending on whether `age` is greater than or equal to 18.

```
age <- 17
if (age >= 18) person <- "voter" else person <- "non-voter"
# or shorter:
person <- if (age >= 18) "voter" else "non-voter"
person
#> [1] "non-voter"
```

Again, longer statements go within curly braces. By nesting multiple if-statements, your case differentiation can become arbitrarily complex. Often, if-statements are used within functions to enable different behavior depending on some condition. In our example from above, we could multiply the offset by -1 to make it positive in case the user supplied a negative number (which could of course be achieved more easily by taking the absolute value):

```
reg_log4 <- function(x, offset) {
  # multiply by -1 if offset is negative
  if (offset < 0) offset <- offset * (-1)
  log(x + offset)
}
reg_log4(0, offset = 0.01)
#> [1] -4.60517
reg_log4(0, offset = -0.01)
#> [1] -4.60517
```

Note that if-statements as used above are not vectorized, i.e., the logical condition may only have length one. Within a tibble, `if_else()` and `case_when()` are useful functions to modify or create entire columns based on some condition.
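To make the difference concrete, here is a minimal sketch of the two vectorized helpers applied to a small tibble. The data and the column names (`people`, `person`, `group`) are invented for illustration.

```r
library(dplyr)

# hypothetical example data
people <- tibble(age = c(12, 17, 18, 70))

people <- people %>%
  mutate(
    # two-way choice, evaluated for every row at once
    person = if_else(age >= 18, "voter", "non-voter"),
    # multi-way choice: the first condition that is TRUE wins
    group = case_when(
      age < 18 ~ "minor",
      age < 65 ~ "adult",
      TRUE ~ "senior"
    )
  )

people
```

Unlike a plain `if`, both functions take a logical vector with one element per row, so no loop is needed.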

## 5.3 Exercises I

#### 5.3.0.1 Question 1

Sometimes it is useful to scale values such that they have a mean of zero and a standard deviation of one. When you apply R’s function for scaling to a vector of numbers, you will end up with a matrix (with additional attributes that we can ignore for now). Write a function `my_scale()` that modifies the output of `scale()` internally to yield a vector instead:

```
# original scale function
scale(1:10)
#> [,1]
#> [1,] -1.4863011
#> [2,] -1.1560120
#> [3,] -0.8257228
#> [4,] -0.4954337
#> [5,] -0.1651446
#> [6,] 0.1651446
#> [7,] 0.4954337
#> [8,] 0.8257228
#> [9,] 1.1560120
#> [10,] 1.4863011
#> attr(,"scaled:center")
#> [1] 5.5
#> attr(,"scaled:scale")
#> [1] 3.02765
# new scale function
my_scale(1:10)
#> [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446
#> [6] 0.1651446 0.4954337 0.8257228 1.1560120 1.4863011
```

## Answer

First, let us make sure we understand the code from the question. `1:10` is a vector that contains the numbers from 1 to 10 as elements. To make this property more transparent, we could also write:

```
x <- 1:10
x
#> [1] 1 2 3 4 5 6 7 8 9 10
scale(x)
#> [,1]
#> [1,] -1.4863011
#> [2,] -1.1560120
#> [3,] -0.8257228
#> [4,] -0.4954337
#> [5,] -0.1651446
#> [6,] 0.1651446
#> [7,] 0.4954337
#> [8,] 0.8257228
#> [9,] 1.1560120
#> [10,] 1.4863011
#> attr(,"scaled:center")
#> [1] 5.5
#> attr(,"scaled:scale")
#> [1] 3.02765
```

The output of `scale(x)` is a matrix. This fact is obscured by its “degenerate” nature with only one column. We can check whether it is a matrix with `is.matrix(scale(x))`, but a seasoned R user will recognize immediately from the console output that we are facing a 2-dimensional object with both row and column indices.

To extract a single element from a 2-dimensional object like a matrix, we need to specify two coordinates: `m[3,5]` would extract the element in the third row and the fifth column of a matrix `m`. If we omit the row number (like so: `m[,1]`), we end up extracting all rows of a given column. This is exactly what we want: from the matrix produced by `scale(x)`, we only want to extract the first column, which is its only column.
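The indexing rules can be tried out on a small matrix first. The matrix `m` below is made up for this illustration and has nothing to do with the output of `scale()`.

```r
# hypothetical 3 x 5 matrix, filled column by column with 1 to 15
m <- matrix(1:15, nrow = 3)

# single element: third row, fifth column
m[3, 5]

# all rows of the first column; the result is a plain vector
first_col <- m[, 1]
first_col
is.matrix(first_col)
```

Selecting a single column drops the second dimension by default, which is why the result is no longer a matrix.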

Within the function, we can create an intermediate matrix object `m` or index the output from `scale(x)` right away:
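Both variants might look as follows. This is a sketch of one possible solution; the name `my_scale_short()` is chosen here only to keep the two versions apart.

```r
# variant 1: store the matrix in an intermediate object
my_scale <- function(x) {
  m <- scale(x)
  m[, 1]
}

# variant 2: index the output of scale() directly
my_scale_short <- function(x) scale(x)[, 1]

my_scale(1:10)
my_scale_short(1:10)
```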

#### 5.3.0.2 Question 2

Re-write the function `scale()` from scratch. It should take a numeric vector, subtract its mean and divide the result by its standard deviation. The output should be a vector, not a matrix. *Hint:* You may use `mean()` and `sd()` to compute the mean and the standard deviation.
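## Answer

One possible solution is sketched below. The function name `my_scale2()` is an assumption made for this example. Because subtraction and division are vectorized, a single expression is enough.

```r
# subtract the mean, then divide by the standard deviation
my_scale2 <- function(x) (x - mean(x)) / sd(x)

my_scale2(1:10)

# compare with the first (and only) column of scale()
all.equal(my_scale2(1:10), scale(1:10)[, 1])
```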

#### 5.3.0.3 Question 3

Some researchers favor the inverse hyperbolic sine function `asinh()` over the logarithm because it does not go towards negative infinity at zero. Explain what arguments the function below accepts and use it to apply an inverse hyperbolic sine transformation to a value of zero.

```
reglog <- function(x, fun = log) {
  fun(x)
}
```

## Answer

The function above has a second argument that allows the user to pass a function (such as the logarithm or inverse hyperbolic sine) and have it applied to the other input, `x`. The name-value pair `fun = log` sets the logarithm as the default of `fun`. To use `asinh` (the inverse hyperbolic sine) instead of `log`, simply overwrite the default value with `fun = asinh` when you run the function. Why not write `asinh()` instead? Adding parentheses would trigger a function call; what we want instead is to hand R the function object itself, which we refer to by its name. This ability to pass functions to other functions is one of the features that make R a so-called functional programming language.

```
# apply inverse hyperbolic sine to zero
reglog(0, fun = asinh)
#> [1] 0
```

#### 5.3.0.4 Question 4

Try to figure out what the `switch()` function does in the example below and write a function that does the same while only using `if` and `else` statements.

```
count_legs <- function(x) {
  switch(x,
    dog = 4,
    chicken = 2,
    plant = 0
  )
}
count_legs("dog")
#> [1] 4
count_legs("chicken")
#> [1] 2
count_legs("plant")
#> [1] 0
```

## Answer

The `switch()` function compares its first argument against a series of alternatives, each of which could also be expressed as the condition of an if-statement. It works best with character strings as input: in the example above, it returns the value we assigned to each of the character strings `"dog"`, `"chicken"` and `"plant"`.

*Hint:* The code snippets “ei”, “if” and “el” will make your life easier when dealing with nested if-conditions.
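A version based on `if` and `else` could look like the sketch below. The name `count_legs2()` is chosen for this example. The final `else` branch returns `NULL` to mimic what `switch()` yields when no alternative matches.

```r
# if/else version of count_legs()
count_legs2 <- function(x) {
  if (x == "dog") {
    4
  } else if (x == "chicken") {
    2
  } else if (x == "plant") {
    0
  } else {
    # no match: return NULL, as switch() does without a default
    NULL
  }
}

count_legs2("dog")
count_legs2("chicken")
count_legs2("plant")
```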

## 5.4 Loops

Loops allow you to run code repeatedly. In most programming languages, you will encounter two types of loops: `for` and `while` loops. `for` loops cycle through the elements of a vector and stop after the last one, whereas `while` loops run as long as some logical condition is true. Therefore, be careful when running `while` loops if there is a chance that the logical condition never turns false. The following lines of code (where the logical condition is simply hard-coded as `TRUE`) will print `"hello"` to the console without ever stopping:

```
# do not run
while (TRUE) {
  print("hello")
}
```
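For comparison, here is a minimal sketch of a `while` loop that does terminate. The counter variable `i` is introduced for this example; the loop ends because `i` is increased in every iteration until the condition turns false.

```r
# safe to run: the condition becomes FALSE once i exceeds 3
i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}
```

Forgetting the line `i <- i + 1` would turn this into an infinite loop again.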

For our purposes, `for` loops will not only be safer but also more useful. Type “for” and hit tab to activate the corresponding code snippet. As you can see, we can specify the vector over whose elements we want to loop and tell R what name we will use to refer to each element when we execute the loop.

```
for (variable in vector) {
}
```

For example, we can write a `for` loop that combines a character string with each element of the vector containing the names of our family members and prints the result:

```
family <- c("Hans", "Erwin", "Ingrid")
for (name in family) {
  print(str_c("It's your birthday, ", name, "!"))
}
#> [1] "It's your birthday, Hans!"
#> [1] "It's your birthday, Erwin!"
#> [1] "It's your birthday, Ingrid!"
```

Instead of looping over the elements of the vector themselves, we can loop over their positions. This is necessary if you want to save the output in a new vector. To make this operation efficient, create the output vector with the correct length before the loop starts instead of growing it in each iteration. `seq_along(family)` creates a vector of indices matching the length of the vector `family`:

```
output <- vector("double", length = length(family))
for (i in seq_along(family)) {
  output[i] <- str_length(family[i])
}
output
#> [1] 4 5 6
```

## 5.5 Vectorization

Loops are much less important in R than in other programming languages because most functions are vectorized in R. If you find yourself using a lot of loops, this is usually an indicator for code that is flawed or at least not efficient. For example, to calculate the logarithm of each number from 1 to 10 or to append it to a character string, a single call to the function is sufficient:

```
log(1:10)
#> [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
#> [6] 1.7917595 1.9459101 2.0794415 2.1972246 2.3025851
str_c("My age is ", 1:10)
#> [1] "My age is 1" "My age is 2" "My age is 3"
#> [4] "My age is 4" "My age is 5" "My age is 6"
#> [7] "My age is 7" "My age is 8" "My age is 9"
#> [10] "My age is 10"
```

Even the function we built ourselves works on a vector because all internal operations are vectorized:

```
reg_log(1:10)
#> [1] 0.009950331 0.698134722 1.101940079 1.388791241
#> [5] 1.611435915 1.793424749 1.947337701 2.080690761
#> [9] 2.198335072 2.303584593
```

Any function that is not vectorized yet can be applied to multiple elements at once using the function `map()` (or `lapply()`, in case you are not a friend of the tidyverse). For example, assume we would like to calculate the means of multiple numeric vectors that we have collected in a list called `x`. We can apply `mean()` to each element of the list with `map(x, mean)`:

```
# define list of numeric vectors
x <- list(1:3, c(4, NA, 6), 5:10)
x
#> [[1]]
#> [1] 1 2 3
#>
#> [[2]]
#> [1] 4 NA 6
#>
#> [[3]]
#> [1] 5 6 7 8 9 10
# apply mean() to each vector
map(x, mean)
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] NA
#>
#> [[3]]
#> [1] 7.5
```

As you may have noticed, one of the numeric vectors contains a missing value and causes the mean to be undefined as well. To ignore missing values in the calculation of the mean, we can specify `na.rm = TRUE`. But since we only provide the function name (`mean`) in our call to `map(x, mean)`, it is not obvious where additional arguments should go. There are several ways to handle this. The first one is to create a new function with the desired behavior and call it by its name in `map()`:

```
my_mean <- function(x) {
  mean(x, na.rm = TRUE)
}
map(x, my_mean)
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] 5
#>
#> [[3]]
#> [1] 7.5
```

In case we do not intend to re-use the function ever again, we can avoid giving it a name and define it as an “anonymous function” on the fly. The lambda notation (introduced in R version 4.1.0) provides an even shorter way to define the anonymous function by replacing “function” with a backslash.

```
# anonymous function
map(x, function(x) mean(x, na.rm = TRUE))
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] 5
#>
#> [[3]]
#> [1] 7.5
# using lambda notation
map(x, \(x) mean(x, na.rm = TRUE))
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] 5
#>
#> [[3]]
#> [1] 7.5
```

The `map()` (and `lapply()`) functions always return list objects, but their close relatives (called `map_*()`) return a vector of a type of your choice. For example, `map_dbl(x, my_mean)` will return a vector of doubles (i.e., numeric values) instead of a list. `map2()` and `pmap()` allow you to iterate over multiple lists at once.
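The following sketch shows these relatives in action. It redefines `x` and `my_mean()` so that it runs on its own; the inputs to `map2_dbl()` and `pmap_dbl()` are invented for illustration.

```r
library(purrr)

# same list and helper function as above
x <- list(1:3, c(4, NA, 6), 5:10)
my_mean <- function(x) mean(x, na.rm = TRUE)

# returns a double vector instead of a list
means <- map_dbl(x, my_mean)
means

# iterate over two vectors in parallel
sums2 <- map2_dbl(c(1, 2, 3), c(10, 20, 30), \(a, b) a + b)
sums2

# iterate over any number of vectors collected in a list
sums3 <- pmap_dbl(list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9)), \(a, b, c) a + b + c)
sums3
```

The suffix (`_dbl`, `_chr`, `_lgl`, `_int`) states the expected type of the result, so a mismatch raises an error instead of silently producing a different kind of object.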

What if the elements you would like to iterate over are not elements of a list but rather columns of a dataframe? The `map()` family of functions works in this case, too, because a dataframe is simply a list of same-length vectors under the hood. Yet there is another function for exactly this purpose that works well within other tidyverse verbs like `mutate()` and `summarize()`: `across()`. To calculate the mean (ignoring missing values) of the variables `bill_length_mm` and `bill_depth_mm`, type:

```
penguins %>%
  summarize(across(
    .cols = c(bill_length_mm, bill_depth_mm),
    .fns = my_mean
  ))
#> # A tibble: 1 × 2
#> bill_length_mm bill_depth_mm
#> <dbl> <dbl>
#> 1 43.9 17.2
```

Because `across()` is aware of the dataframe it is being called on, we can use tidy-select syntax to make our code more readable. The following code chunk is equivalent to the one above but selects the variables to summarize by the prefix their names share:

```
penguins %>%
  summarize(across(starts_with("bill"), my_mean))
#> # A tibble: 1 × 2
#> bill_length_mm bill_depth_mm
#> <dbl> <dbl>
#> 1 43.9 17.2
```

Similar to the `map()` functions, the `.fns` argument of `across()` accepts anonymous functions and lambda notation.
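As a small sketch of this, the example below passes a lambda directly to `across()`. To keep it self-contained, it uses a tiny made-up tibble called `birds` instead of the `penguins` data; the values are invented.

```r
library(dplyr)

# hypothetical data with missing values
birds <- tibble(
  bill_length_mm = c(39.1, NA, 40.3),
  bill_depth_mm = c(18.7, 17.4, NA),
  species = c("a", "b", "c")
)

# lambda notation inside across(): no named helper function needed
bill_means <- birds %>%
  summarize(across(starts_with("bill"), \(x) mean(x, na.rm = TRUE)))

bill_means
```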

## 5.6 Exercises II

#### 5.6.0.1 Question 1

Write a `for` loop that iterates through the numbers from 1 to 20 and prints only those that are prime to the console. You may use the following vector of primes in your code:

```
# define vector of primes
primes <- c(2, 3, 5, 7, 11, 13, 17, 19)
# Example: check if 4 matches an element of the vector
4 %in% primes
#> [1] FALSE
```

## Answer

Note that you could check whether each number between 1 and 20 matches any element from `primes` element by element like so: `i == primes[1] | i == primes[2] | ...`. The `%in%` notation simply makes your life easier here.
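A loop along these lines might look as follows. This is a sketch of one possible solution that combines a `for` loop with an if-statement.

```r
# define vector of primes
primes <- c(2, 3, 5, 7, 11, 13, 17, 19)

# print i only if it matches an element of primes
for (i in 1:20) {
  if (i %in% primes) print(i)
}
```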

#### 5.6.0.2 Question 2

Use `length()` instead of `seq_along()` to produce the same output (a vector of indices).

## Answer

```
1:length(family)
#> [1] 1 2 3
```

#### 5.6.0.3 Question 3

Use `map()` to transform a value of zero with our `reg_log3()` function three times, using the three different offset values stored in the vector `y`. Is it necessary to use `map()` here?

```
# define function
reg_log3 <- function(x, offset = 0.01) log(x + offset)
# define offset values
y <- c(0.1, 0.01, 0.001)
```

## Answer

The list or vector to which we apply `map()` can be used to specify any argument of the function that we want to execute repeatedly. In our example, the values from `y` are supposed to determine the offset of `reg_log3()`, so we define an anonymous function that passes the value from `y` into the `offset` argument of `reg_log3()`:

```
map(y, \(x) reg_log3(0, x))
#> [[1]]
#> [1] -2.302585
#>
#> [[2]]
#> [1] -4.60517
#>
#> [[3]]
#> [1] -6.907755
```

Note that the addition is vectorized in R, so we could have achieved the same without `map()`:

```
reg_log3(0, y)
#> [1] -2.302585 -4.605170 -6.907755
```

#### 5.6.0.4 Question 4

In this task we will simulate normally distributed random variables with different means using `rnorm()`. Use `map2()` to simulate one with mean 1, two with mean 10 and three with mean 100.

*Hint:* `map2()` allows you to iterate over two vectors at the same time. The output should look like this:

```
#> [[1]]
#> [1] -0.607242
#>
#> [[2]]
#> [1] 10.98899 10.92222
#>
#> [[3]]
#> [1] 101.00195 98.71317 101.06087
```

## Answer

Whenever you intend to perform an operation repeatedly with `map()`, try working with a single element first to build intuition. To simulate 5 normally distributed random variables with mean zero using `rnorm()`, execute the following code:

```
rnorm(5, mean = 0)
#> [1] 1.1772430 -1.0874512 0.7989446 0.3385811 1.1173001
```

Now to simulate a different number of variables with different means each, we have to iterate over two vectors at the same time. This is the job of `map2()`:

```
# solution using map2()
map2(1:3, c(1, 10, 100), \(x, y) rnorm(x, mean = y))
#> [[1]]
#> [1] -0.0309682
#>
#> [[2]]
#> [1] 10.230438 9.194334
#>
#> [[3]]
#> [1] 100.10191 98.98828 100.48003
# equivalent to
list(
  rnorm(1, mean = 1),
  rnorm(2, mean = 10),
  rnorm(3, mean = 100)
)
#> [[1]]
#> [1] -0.5115922
#>
#> [[2]]
#> [1] 9.848258 8.878448
#>
#> [[3]]
#> [1] 100.72125 100.81430 99.74266
```

To make it more intuitive, think about what happens in the first iteration: `map2()` will pass the first element of the first vector (the value 1) to the `rnorm()` argument that we have indicated with `x`. The first element of the second vector (1 as well) enters as the argument indicated with `y`, so the first element of the output list will be created by `rnorm(1, mean = 1)`.
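The two outputs above differ only because every call to `rnorm()` draws new random numbers. As a sketch of how the equivalence could be checked, we can fix the random seed before each version; the seed value 123 and the object names `a` and `b` are arbitrary choices for this example.

```r
library(purrr)

# version 1: map2() over the counts and the means
set.seed(123)
a <- map2(1:3, c(1, 10, 100), \(x, y) rnorm(x, mean = y))

# version 2: the same three calls written out by hand
set.seed(123)
b <- list(
  rnorm(1, mean = 1),
  rnorm(2, mean = 10),
  rnorm(3, mean = 100)
)

# with the same seed, both versions should produce the same draws
identical(a, b)
```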

#### 5.6.0.5 Question 5

Try to figure out what the following code does and adjust it such that the transformations (“log” and “asinh”) are added as prefixes to the variable names, not suffixes.

```
penguins %>%
  transmute(across(
    .cols = where(is.numeric),
    .fns = list(log = log, asinh = asinh)
  ))
#> # A tibble: 344 × 10
#> bill_length_mm_log bill_length_mm_asinh bill_depth_mm_log
#> <dbl> <dbl> <dbl>
#> 1 3.67 4.36 2.93
#> 2 3.68 4.37 2.86
#> 3 3.70 4.39 2.89
#> 4 NA NA NA
#> 5 3.60 4.30 2.96
#> 6 3.67 4.36 3.03
#> 7 3.66 4.35 2.88
#> 8 3.67 4.36 2.98
#> 9 3.53 4.22 2.90
#> 10 3.74 4.43 3.01
#> # ℹ 334 more rows
#> # ℹ 7 more variables: bill_depth_mm_asinh <dbl>,
#> # flipper_length_mm_log <dbl>,
#> # flipper_length_mm_asinh <dbl>, body_mass_g_log <dbl>,
#> # body_mass_g_asinh <dbl>, year_log <dbl>,
#> # year_asinh <dbl>
```

## Answer

With `across()` we can apply multiple transformations to the same variables by providing a list of functions to the `.fns` argument. When the list elements are named as in the example above (the left-hand side of `log = log`), their names are automatically appended to the original names of the variables you are transforming. To change this behavior, we can pass an arbitrary character string to the `.names` argument using `"{.fn}"` and `"{.col}"` as placeholders for the function and variable names.

*Hint:* `transmute()` is an alternative to `mutate()` that discards all columns that have not been modified. The only reason we use it here is to make the output less cluttered.

```
penguins %>%
  transmute(across(
    .cols = where(is.numeric),
    .fns = list(log = log, asinh = asinh),
    # specify new variable names with placeholders for function and original column name
    .names = "{.fn}_{.col}"
  ))
#> # A tibble: 344 × 10
#> log_bill_length_mm asinh_bill_length_mm log_bill_depth_mm
#> <dbl> <dbl> <dbl>
#> 1 3.67 4.36 2.93
#> 2 3.68 4.37 2.86
#> 3 3.70 4.39 2.89
#> 4 NA NA NA
#> 5 3.60 4.30 2.96
#> 6 3.67 4.36 3.03
#> 7 3.66 4.35 2.88
#> 8 3.67 4.36 2.98
#> 9 3.53 4.22 2.90
#> 10 3.74 4.43 3.01
#> # ℹ 334 more rows
#> # ℹ 7 more variables: asinh_bill_depth_mm <dbl>,
#> # log_flipper_length_mm <dbl>,
#> # asinh_flipper_length_mm <dbl>, log_body_mass_g <dbl>,
#> # asinh_body_mass_g <dbl>, log_year <dbl>,
#> # asinh_year <dbl>
```