Code
options(tidyverse.quiet = TRUE) # Silences messages
library(tidyverse)
for
, do.call
, vapply
August 5, 2020
I can spend a lot of time trying in futility to optimize custom functions and code. I’ll take the time to rewrite the same thing in two or more ways and run some basic benchmarks for performance differences. Most of the time I’m just exploring the simplest possible way to perform an action and what would be the most generalizable solution that could extend to other problems. I ran into an issue where I had to use a quick function to find the total amount of time a subject in a clinical trial was on a specific dose. However, subjects had doses adjusted through the study for optimization. Further, we also needed to count the dosing based on the lowest dose provided.
For example, how long was Subject A on a >= 10mg dose?
So, let’s explore three different ways to solve this problem and why all three are valid and just as efficient.
Loading our tidyverse
packages first. Because we will be grouping and summarising data in a data.frame, I really don’t feel like going through the trouble of using base
solutions.
Let’s create some random data to represent our subjects, our doses, and an arbitrary time metric. We’re also going to make this harder by using the difftime
.
We’re also going to grab some names using a function from the wakefield
package which you can use to create fake data, appropriately named. 1
1 I’m not too savvy with this package yet but would like to experiment more for building dummy data sets.
subj <- sort(sample(wakefield::name(500), 1e4, TRUE))
dose <- sample(seq.int(10, 80, 10), 1e4, TRUE)
time <- as.difftime(runif(1e4) * 100, units = "days")
df <- tibble(subj, dose, time)
df
#> # A tibble: 10,000 × 3
#> subj dose time
#> <chr> <dbl> <drtn>
#> 1 Aadhya 10 81.430584 days
#> 2 Aadhya 10 80.324186 days
#> 3 Aadhya 40 64.917350 days
#> 4 Aadhya 80 17.631323 days
#> 5 Aadhya 40 8.970711 days
#> 6 Aadhya 20 96.518079 days
#> 7 Aadhya 60 61.493537 days
#> 8 Aadhya 20 29.073014 days
#> 9 Aadhya 80 6.221622 days
#> 10 Aaditri 50 54.387006 days
#> # … with 9,990 more rows
Finding the total sum of time at each dose is easy:
df %>%
group_by(subj, dose) %>%
summarise(time = sum(time), .groups = "drop") %>%
pivot_wider(
names_from = "dose",
names_sort = TRUE,
values_from = "time"
)
#> # A tibble: 500 × 9
#> subj `10` `20` `30` `40` `50` `60` `70` `80`
#> <chr> <drtn> <drtn> <drtn> <drt> <drt> <drt> <drt> <drt>
#> 1 Aadhya 161.754770 days 125.59109 days … 73.… … 61.… … 23.…
#> 2 Aaditri 30.671230 days 172.20370 days … 224.… 131.… 193.… 67.… 93.…
#> 3 Aaleeya 174.678539 days 115.27412 days 159.963… 198.… 184.… 77.… 214.… 292.…
#> 4 Aaleyah 183.860531 days 88.26322 days … 9.… 261.… 259.… 214.… 100.…
#> 5 Aamia 70.429895 days 149.78110 days 122.847… 40.… 32.… 74.… 116.… 161.…
#> 6 Aarish 232.188970 days 234.42468 days 70.517… … 7.… 111.… 153.… 118.…
#> 7 Abdulah 8.890604 days 37.66665 days 130.246… … … 215.… 159.… 162.…
#> 8 Abhinav 221.372920 days 108.87628 days 114.363… 280.… 136.… 100.… 97.… 95.…
#> 9 Abhiram 73.235939 days 182.81309 days 295.836… 107.… 96.… 137.… 123.… 68.…
#> 10 Abyade 107.638230 days 596.18206 days 168.711… 146.… 61.… 34.… 281.… 149.…
#> # … with 490 more rows
Oh, it looks like we have some missing values (see the first line where we are missing doses of 30
, 50
, and 70
for Aadhya
.). tidyr::pivot_wider()
allows us to fill in the missing values but like many tidyverse
functions, we’ll have to be explicit with the type so as to not case any accidental issues.
difftime0 <- as.difftime(0, units = "days")
res <- df %>%
group_by(subj, dose) %>%
summarise(time = sum(time), .groups = "drop") %>%
pivot_wider(
names_from = "dose",
names_sort = TRUE,
values_from = "time",
values_fill = difftime0
)
res
#> # A tibble: 500 × 9
#> subj `10` `20` `30` `40` `50` `60` `70` `80`
#> <chr> <drtn> <drtn> <drtn> <drt> <drt> <drt> <drt> <drt>
#> 1 Aadhya 161.754770 days 125.59109 days 0.000… 73.… 0.… 61.… 0.… 23.…
#> 2 Aaditri 30.671230 days 172.20370 days 0.000… 224.… 131.… 193.… 67.… 93.…
#> 3 Aaleeya 174.678539 days 115.27412 days 159.963… 198.… 184.… 77.… 214.… 292.…
#> 4 Aaleyah 183.860531 days 88.26322 days 0.000… 9.… 261.… 259.… 214.… 100.…
#> 5 Aamia 70.429895 days 149.78110 days 122.847… 40.… 32.… 74.… 116.… 161.…
#> 6 Aarish 232.188970 days 234.42468 days 70.517… 0.… 7.… 111.… 153.… 118.…
#> 7 Abdulah 8.890604 days 37.66665 days 130.246… 0.… 0.… 215.… 159.… 162.…
#> 8 Abhinav 221.372920 days 108.87628 days 114.363… 280.… 136.… 100.… 97.… 95.…
#> 9 Abhiram 73.235939 days 182.81309 days 295.836… 107.… 96.… 137.… 123.… 68.…
#> 10 Abyade 107.638230 days 596.18206 days 168.711… 146.… 61.… 34.… 281.… 149.…
#> # … with 490 more rows
This is much more troublesome as we have to create single length difftime
vector.
Now, we need to find the total length of time a subject was at each dose or greater.. This I wasn’t able to do with any built in functions (although I could have missed it). I’m not claiming this is the best function or anything, it’s not, but I have 3 different solutions and thought it was enough to spend my evening writing a blog post.
Never use a for loop, except if you have to. We’ll start with
When having to play with dates before, I found that the way I could retain the date values and still use a function from the apply
family was to stick with the lapply()
and then use the do.call()
function to apply the c
ombine function over my list. lapply()
retains the original classes and using the do.call(c, ...)
method will turn my list into a vector without removing the structure of the output
Let’s see how this plays out. If we use other methods, we lose the difftime class, which is noticeable as we don’t get our message of Time differences in days
before our results.
x <- lapply(dose, function(xx) sum(time[dose >= xx])) %>% unlist() %>% head()
print_with_class_type(x)
#> class numeric
#> typeof double
#> [1] 501497.92 501497.92 312102.52 61765.11 312102.52 437797.16
x <- sapply(dose, function(xx) sum(time[dose >= xx])) %>% head()
print_with_class_type(x)
#> class numeric
#> typeof double
#> [1] 501497.92 501497.92 312102.52 61765.11 312102.52 437797.16
x <- foo2(dose, time) %>% head()
print_with_class_type(x)
#> class difftime
#> typeof double
#> Time differences in days[1] 501497.92 501497.92 312102.52 61765.11 312102.52 437797.16
Here’s another neat little trick that with a good use case. We’re going to subset our out
object (again, a copy of the y
intput) and assign over it the result of our vapply
. We’re also going to cheat and use the first position of y
as our FUN.VALUE
.
Let’s take look just like before. vapply()
will try to simplify the FUN.VALUE
but as long as we use a single vector from the original input we can safely assign it back into our mock subset without worrying about losing our classes.
x <- vapply(dose, function(xx) sum(time[dose >= xx]), time[1]) %>% head()
print_with_class_type(x)
#> class numeric
#> typeof double
#> [1] 501497.92 501497.92 312102.52 61765.11 312102.52 437797.16
x < -foo3(dose, time) %>% head()
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE
print_with_class_type(x)
#> class numeric
#> typeof double
#> [1] 501497.92 501497.92 312102.52 61765.11 312102.52 437797.16
Now, all three of these solutions produce the same results and are fairly equivalent in human legibility. This means, for me at least, that the function which runs the fastest would be the result I keep. We’ll employ the bench
package and eponymous name for our consideration.
Of course, for this we’ll be running on the vectors first. We’ll also make certain that all of our outputs are the same with check = TRUE
(default) to make sure we didn’t miss anything either.
bench::mark(
`1` = foo1(dose, time),
`2` = foo2(dose, time),
`3` = foo3(dose, time),
check = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 1 3.01s 3.01s 0.332 2.96GB 35.2
#> 2 2 1.84s 1.84s 0.545 2.22GB 30.0
#> 3 3 1.91s 1.91s 0.524 2.21GB 24.1
And, well, they all run about the same. That kind of just leaves us with the sinking feeling that all of this was futile and that for
loops really aren’t that bad. In fact, the for
loop solution may be the easiest to read and doesn’t use any tricks that someone reviewing your code may not understand at first.
We know we have some missing values in our data, so we’re going to use the the tidyr::complete()
function to help with that.
res
#> # A tibble: 500 × 9
#> subj `10` `20` `30` `40` `50` `60` `70` `80`
#> <chr> <drtn> <drtn> <drtn> <drt> <drt> <drt> <drt> <drt>
#> 1 Aadhya 161.754770 days 125.59109 days 0.000… 73.… 0.… 61.… 0.… 23.…
#> 2 Aaditri 30.671230 days 172.20370 days 0.000… 224.… 131.… 193.… 67.… 93.…
#> 3 Aaleeya 174.678539 days 115.27412 days 159.963… 198.… 184.… 77.… 214.… 292.…
#> 4 Aaleyah 183.860531 days 88.26322 days 0.000… 9.… 261.… 259.… 214.… 100.…
#> 5 Aamia 70.429895 days 149.78110 days 122.847… 40.… 32.… 74.… 116.… 161.…
#> 6 Aarish 232.188970 days 234.42468 days 70.517… 0.… 7.… 111.… 153.… 118.…
#> 7 Abdulah 8.890604 days 37.66665 days 130.246… 0.… 0.… 215.… 159.… 162.…
#> 8 Abhinav 221.372920 days 108.87628 days 114.363… 280.… 136.… 100.… 97.… 95.…
#> 9 Abhiram 73.235939 days 182.81309 days 295.836… 107.… 96.… 137.… 123.… 68.…
#> 10 Abyade 107.638230 days 596.18206 days 168.711… 146.… 61.… 34.… 281.… 149.…
#> # … with 490 more rows
df %>%
complete(subj, dose, fill = list(time = difftime0)) %>%
group_by(subj, dose) %>%
summarise(time = sum(time), .groups = "drop_last") %>%
mutate(time = foo1(dose, time)) %>%
pivot_wider(
names_from = "dose",
names_sort = TRUE,
values_from = "time"
)
#> # A tibble: 500 × 9
#> # Groups: subj [500]
#> subj `10` `20` `30` `40` `50` `60` `70` `80`
#> <chr> <drtn> <drtn> <drtn> <drt> <drt> <drt> <drt> <drt>
#> 1 Aadhya 446.5804 days 284.8256 days 159.234… 159.… 85.… 85.… 23.… 23.…
#> 2 Aaditri 913.9830 days 883.3118 days 711.108… 711.… 486.… 354.… 161.… 93.…
#> 3 Aaleeya 1417.4559 days 1242.7773 days 1127.503… 967.… 769.… 584.… 507.… 292.…
#> 4 Aaleyah 1117.0842 days 933.2237 days 844.960… 844.… 835.… 573.… 314.… 100.…
#> 5 Aamia 769.4857 days 699.0558 days 549.274… 426.… 385.… 352.… 278.… 161.…
#> 6 Aarish 927.1830 days 694.9940 days 460.569… 390.… 390.… 382.… 271.… 118.…
#> 7 Abdulah 714.1323 days 705.2417 days 667.575… 537.… 537.… 537.… 321.… 162.…
#> 8 Abhinav 1154.8195 days 933.4466 days 824.570… 710.… 429.… 292.… 192.… 95.…
#> 9 Abhiram 1086.0283 days 1012.7924 days 829.979… 534.… 426.… 329.… 192.… 68.…
#> 10 Abyade 1545.7654 days 1438.1272 days 841.945… 673.… 526.… 465.… 431.… 149.…
#> # … with 490 more rows
And there you go. Three solutions, all the same.
Sometimes it’s useful to try to optimize code. Other times it’s just results in a blog post.
As long as your code is easy to read and not apparently slow, it’s probably fine.
---
title: Solving one task with three equivalent solutions
subtitle: "`for`, `do.call`, `vapply`"
date: "2020-08-05"
categories: ["R", "{bench}"]
---
I can spend a lot of time trying in futility to optimize custom functions and code.
I'll take the time to rewrite the same thing in two or more ways and run some basic benchmarks for performance differences.
Most of the time I'm just exploring the simplest possible way to perform an action and what would be the most generalizable solution that could extend to other problems.
I ran into an issue where I had to use a quick function to find the total amount of time a subject in a clinical trial was on a specific dose.
However, subjects had doses adjusted through the study for optimization.
Further, we also needed to count the dosing based on the lowest dose provided.
> For example, how long was Subject A on a >= 10mg dose?
So, let's explore three different ways to solve this problem and why all three are valid and just as efficient.
## Set up
Loading our `tidyverse` packages first.
Because we will be grouping and summarising data in a data.frame, I really don't feel like going through the trouble of using `base` solutions.
```{r setup}
#| include: false
set.seed(42)
print_with_class_type <- function(x) {
co <- capture.output(x)
cl <- class(x)
to <- typeof(x)
cat("class ", cl, "\n",
"typeof ", to, "\n",
co, sep = "")
invisible(x)
}
```
```{r}
options(tidyverse.quiet = TRUE) # Silences messages
library(tidyverse)
```
Let's create some random data to represent our subjects, our doses, and an arbitrary time metric.
We're also going to make this harder by using the `difftime`.
We're also going to grab some names using a function from the [`wakefield` package](https://github.com/trinker/wakefield) which you can use to create fake data, appropriately named. ^[I'm not too savvy with this package yet but would like to experiment more for building dummy data sets.]
```{r generating the data}
subj <- sort(sample(wakefield::name(500), 1e4, TRUE))
dose <- sample(seq.int(10, 80, 10), 1e4, TRUE)
time <- as.difftime(runif(1e4) * 100, units = "days")
df <- tibble(subj, dose, time)
df
```
Finding the total sum of time at each dose is easy:
```{r time for each dose}
df %>%
group_by(subj, dose) %>%
summarise(time = sum(time), .groups = "drop") %>%
pivot_wider(
names_from = "dose",
names_sort = TRUE,
values_from = "time"
)
```
Oh, it looks like we have some missing values (see the first line where we are missing doses of `30`, `50`, and `70` for `Aadhya`.).
`tidyr::pivot_wider()` allows us to fill in the missing values but like many `tidyverse` functions, we'll have to be explicit with the type so as to not case any accidental issues.
```{r time for each dose fill}
difftime0 <- as.difftime(0, units = "days")
res <- df %>%
group_by(subj, dose) %>%
summarise(time = sum(time), .groups = "drop") %>%
pivot_wider(
names_from = "dose",
names_sort = TRUE,
values_from = "time",
values_fill = difftime0
)
res
```
This is much more troublesome as we have to create single length `difftime` vector.
Now, we need to find the total length of time a subject was at each dose or greater..
This I wasn't able to do with any built in functions (although I could have missed it).
I'm not claiming this is the best function or anything, it's not, but I have 3 different solutions and thought it was enough to spend my evening writing a blog post.
### Solution 1: The for Loop
Never use a for loop, except if you have to.
We'll start with
```{r for loop}
foo1 <- function(x, y) {
out <- y
for (i in seq_along(y)) {
out[i] <- sum(y[x >= x[i]])
}
out
}
```
### Solution 2: Combining lapply
When having to play with dates before, I found that the way I could retain the date values and still use a function from the `apply` family was to stick with the `lapply()` and then use the `do.call()` function to apply the `c`ombine function over my list.
`lapply()` retains the original classes and using the `do.call(c, ...)` method will turn my list into a vector without removing the structure of the output
```{r lapply}
foo2 <- function(x, y) {
out <- lapply(x, function(xx) sum(y[x >= xx]))
do.call(c, out)
}
```
Let's see how this plays out.
If we use other methods, we lose the difftime class, which is noticeable as we don't get our message of `Time differences in days` before our results.
```{r}
x <- lapply(dose, function(xx) sum(time[dose >= xx])) %>% unlist() %>% head()
print_with_class_type(x)
x <- sapply(dose, function(xx) sum(time[dose >= xx])) %>% head()
print_with_class_type(x)
x <- foo2(dose, time) %>% head()
print_with_class_type(x)
```
### Solution 3: vapply with subset assigning
Here's another neat little trick that with a good use case.
We're going to subset our `out` object (again, a copy of the `y` intput) and assign over it the result of our `vapply`.
We're also going to _cheat_ and use the first position of `y` as our `FUN.VALUE`.
```{r vapply}
foo3 <- function(x, y) {
out <- y
out[] <- vapply(x, function(xx) sum(y[x >= xx]), y[1], USE.NAMES = FALSE)
out
}
```
Let's take look just like before.
`vapply()` will try to simplify the `FUN.VALUE` but as long as we use a single vector from the original input we can safely assign it back into our mock subset without worrying about losing our classes.
```{r vapply example}
x <- vapply(dose, function(xx) sum(time[dose >= xx]), time[1]) %>% head()
print_with_class_type(x)
x < -foo3(dose, time) %>% head()
print_with_class_type(x)
```
## Benchmarks
Now, all three of these solutions produce the same results and are fairly equivalent in human legibility.
This means, for me at least, that the function which runs the fastest would be the result I keep.
We'll employ the `bench` package and eponymous name for our consideration.
Of course, for this we'll be running on the vectors first.
We'll also make certain that all of our outputs are the same with `check = TRUE` (default) to make sure we didn't miss anything either.
```{r benchmarking}
bench::mark(
`1` = foo1(dose, time),
`2` = foo2(dose, time),
`3` = foo3(dose, time),
check = TRUE
)
```
And, well, they all run about the same.
That kind of just leaves us with the sinking feeling that all of this was futile and that `for` loops really aren't that bad.
In fact, the `for` loop solution may be the easiest to read and doesn't use any tricks that someone reviewing your code may not understand at first.
We know we have some missing values in our data, so we're going to use the the `tidyr::complete()` function to help with that.
```{r final output}
res
df %>%
complete(subj, dose, fill = list(time = difftime0)) %>%
group_by(subj, dose) %>%
summarise(time = sum(time), .groups = "drop_last") %>%
mutate(time = foo1(dose, time)) %>%
pivot_wider(
names_from = "dose",
names_sort = TRUE,
values_from = "time"
)
```
And there you go.
Three solutions, all the same.
Sometimes it's useful to try to optimize code.
Other times it's just results in a blog post.
As long as your code is easy to read and not apparently slow, it's probably fine.