Solving one task with three equivalent solutions

for, do.call, vapply

R
{bench}
Published

August 5, 2020

I can spend a lot of time trying in futility to optimize custom functions and code. I’ll take the time to rewrite the same thing in two or more ways and run some basic benchmarks for performance differences. Most of the time I’m just exploring the simplest possible way to perform an action and what would be the most generalizable solution that could extend to other problems. I ran into an issue where I had to use a quick function to find the total amount of time a subject in a clinical trial was on a specific dose. However, subjects had doses adjusted through the study for optimization. Further, we also needed to count the dosing based on the lowest dose provided.

For example, how long was Subject A on a >= 10mg dose?

So, let’s explore three different ways to solve this problem and why all three are valid and just as efficient.

Set up

Loading our tidyverse packages first. Because we will be grouping and summarising data in a data.frame, I really don’t feel like going through the trouble of using base solutions.

Code
options(tidyverse.quiet = TRUE) # Silences messages
library(tidyverse)

Let’s create some random data to represent our subjects, our doses, and an arbitrary time metric. We’re also going to make this harder by using the difftime.

We’re also going to grab some names using a function from the wakefield package which you can use to create fake data, appropriately named. 1

1 I’m not too savvy with this package yet but would like to experiment more for building dummy data sets.

Code
subj <- sort(sample(wakefield::name(500), 1e4, TRUE))
dose <- sample(seq.int(10, 80, 10), 1e4, TRUE)
time <- as.difftime(runif(1e4) * 100, units = "days")

df <- tibble(subj, dose, time)
df
#> # A tibble: 10,000 × 3
#>    subj     dose time          
#>    <chr>   <dbl> <drtn>        
#>  1 Aadhya     10 81.430584 days
#>  2 Aadhya     10 80.324186 days
#>  3 Aadhya     40 64.917350 days
#>  4 Aadhya     80 17.631323 days
#>  5 Aadhya     40  8.970711 days
#>  6 Aadhya     20 96.518079 days
#>  7 Aadhya     60 61.493537 days
#>  8 Aadhya     20 29.073014 days
#>  9 Aadhya     80  6.221622 days
#> 10 Aaditri    50 54.387006 days
#> # … with 9,990 more rows

Finding the total sum of time at each dose is easy:

Code
df %>% 
  group_by(subj, dose) %>% 
  summarise(time = sum(time), .groups = "drop") %>% 
  pivot_wider(
    names_from = "dose",
    names_sort = TRUE,
    values_from = "time"
    )
#> # A tibble: 500 × 9
#>    subj    `10`            `20`           `30`     `40`  `50`  `60`  `70`  `80` 
#>    <chr>   <drtn>          <drtn>         <drtn>   <drt> <drt> <drt> <drt> <drt>
#>  1 Aadhya  161.754770 days 125.59109 days        …  73.…     …  61.…     …  23.…
#>  2 Aaditri  30.671230 days 172.20370 days        … 224.… 131.… 193.…  67.…  93.…
#>  3 Aaleeya 174.678539 days 115.27412 days 159.963… 198.… 184.…  77.… 214.… 292.…
#>  4 Aaleyah 183.860531 days  88.26322 days        …   9.… 261.… 259.… 214.… 100.…
#>  5 Aamia    70.429895 days 149.78110 days 122.847…  40.…  32.…  74.… 116.… 161.…
#>  6 Aarish  232.188970 days 234.42468 days  70.517…     …   7.… 111.… 153.… 118.…
#>  7 Abdulah   8.890604 days  37.66665 days 130.246…     …     … 215.… 159.… 162.…
#>  8 Abhinav 221.372920 days 108.87628 days 114.363… 280.… 136.… 100.…  97.…  95.…
#>  9 Abhiram  73.235939 days 182.81309 days 295.836… 107.…  96.… 137.… 123.…  68.…
#> 10 Abyade  107.638230 days 596.18206 days 168.711… 146.…  61.…  34.… 281.… 149.…
#> # … with 490 more rows

Oh, it looks like we have some missing values (see the first line where we are missing doses of 30, 50, and 70 for Aadhya.). tidyr::pivot_wider() allows us to fill in the missing values but like many tidyverse functions, we’ll have to be explicit with the type so as to not case any accidental issues.

Code
difftime0 <- as.difftime(0, units = "days")

res <- df %>% 
  group_by(subj, dose) %>% 
  summarise(time = sum(time), .groups = "drop") %>% 
  pivot_wider(
    names_from = "dose",
    names_sort = TRUE,
    values_from = "time",
    values_fill = difftime0
  )
res
#> # A tibble: 500 × 9
#>    subj    `10`            `20`           `30`     `40`  `50`  `60`  `70`  `80` 
#>    <chr>   <drtn>          <drtn>         <drtn>   <drt> <drt> <drt> <drt> <drt>
#>  1 Aadhya  161.754770 days 125.59109 days   0.000…  73.…   0.…  61.…   0.…  23.…
#>  2 Aaditri  30.671230 days 172.20370 days   0.000… 224.… 131.… 193.…  67.…  93.…
#>  3 Aaleeya 174.678539 days 115.27412 days 159.963… 198.… 184.…  77.… 214.… 292.…
#>  4 Aaleyah 183.860531 days  88.26322 days   0.000…   9.… 261.… 259.… 214.… 100.…
#>  5 Aamia    70.429895 days 149.78110 days 122.847…  40.…  32.…  74.… 116.… 161.…
#>  6 Aarish  232.188970 days 234.42468 days  70.517…   0.…   7.… 111.… 153.… 118.…
#>  7 Abdulah   8.890604 days  37.66665 days 130.246…   0.…   0.… 215.… 159.… 162.…
#>  8 Abhinav 221.372920 days 108.87628 days 114.363… 280.… 136.… 100.…  97.…  95.…
#>  9 Abhiram  73.235939 days 182.81309 days 295.836… 107.…  96.… 137.… 123.…  68.…
#> 10 Abyade  107.638230 days 596.18206 days 168.711… 146.…  61.…  34.… 281.… 149.…
#> # … with 490 more rows

This is much more troublesome as we have to create single length difftime vector.

Now, we need to find the total length of time a subject was at each dose or greater.. This I wasn’t able to do with any built in functions (although I could have missed it). I’m not claiming this is the best function or anything, it’s not, but I have 3 different solutions and thought it was enough to spend my evening writing a blog post.

Solution 1: The for Loop

Never use a for loop, except if you have to. We’ll start with

Code
foo1 <- function(x, y) {
  out <- y
  
  for (i in seq_along(y)) {
    out[i] <- sum(y[x >= x[i]])
  }
  
  out
}

Solution 2: Combining lapply

When having to play with dates before, I found that the way I could retain the date values and still use a function from the apply family was to stick with the lapply() and then use the do.call() function to apply the combine function over my list. lapply() retains the original classes and using the do.call(c, ...) method will turn my list into a vector without removing the structure of the output

Code
foo2 <- function(x, y) {
  out <- lapply(x, function(xx) sum(y[x >= xx]))
  do.call(c, out)
}

Let’s see how this plays out. If we use other methods, we lose the difftime class, which is noticeable as we don’t get our message of Time differences in days before our results.

Code
x <- lapply(dose, function(xx) sum(time[dose >= xx])) %>% unlist() %>% head()
print_with_class_type(x)
#> class  numeric
#> typeof double
#> [1] 501497.92 501497.92 312102.52  61765.11 312102.52 437797.16

x <- sapply(dose, function(xx) sum(time[dose >= xx])) %>% head()
print_with_class_type(x)
#> class  numeric
#> typeof double
#> [1] 501497.92 501497.92 312102.52  61765.11 312102.52 437797.16

x <- foo2(dose, time) %>% head()
print_with_class_type(x)
#> class  difftime
#> typeof double
#> Time differences in days[1] 501497.92 501497.92 312102.52  61765.11 312102.52 437797.16

Solution 3: vapply with subset assigning

Here’s another neat little trick that with a good use case. We’re going to subset our out object (again, a copy of the y intput) and assign over it the result of our vapply. We’re also going to cheat and use the first position of y as our FUN.VALUE.

Code
foo3 <- function(x, y) {
  out <- y
  out[] <- vapply(x, function(xx) sum(y[x >= xx]), y[1], USE.NAMES = FALSE)
  out
}

Let’s take look just like before. vapply() will try to simplify the FUN.VALUE but as long as we use a single vector from the original input we can safely assign it back into our mock subset without worrying about losing our classes.

Code
x <- vapply(dose, function(xx) sum(time[dose >= xx]), time[1]) %>% head()
print_with_class_type(x)
#> class  numeric
#> typeof double
#> [1] 501497.92 501497.92 312102.52  61765.11 312102.52 437797.16

x < -foo3(dose, time) %>% head()
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE
print_with_class_type(x)
#> class  numeric
#> typeof double
#> [1] 501497.92 501497.92 312102.52  61765.11 312102.52 437797.16

Benchmarks

Now, all three of these solutions produce the same results and are fairly equivalent in human legibility. This means, for me at least, that the function which runs the fastest would be the result I keep. We’ll employ the bench package and eponymous name for our consideration.

Of course, for this we’ll be running on the vectors first. We’ll also make certain that all of our outputs are the same with check = TRUE (default) to make sure we didn’t miss anything either.

Code
bench::mark(
  `1` = foo1(dose, time),
  `2` = foo2(dose, time),
  `3` = foo3(dose, time),
  check = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 1             3.01s    3.01s     0.332    2.96GB     35.2
#> 2 2             1.84s    1.84s     0.545    2.22GB     30.0
#> 3 3             1.91s    1.91s     0.524    2.21GB     24.1

And, well, they all run about the same. That kind of just leaves us with the sinking feeling that all of this was futile and that for loops really aren’t that bad. In fact, the for loop solution may be the easiest to read and doesn’t use any tricks that someone reviewing your code may not understand at first.

We know we have some missing values in our data, so we’re going to use the the tidyr::complete() function to help with that.

Code
res
#> # A tibble: 500 × 9
#>    subj    `10`            `20`           `30`     `40`  `50`  `60`  `70`  `80` 
#>    <chr>   <drtn>          <drtn>         <drtn>   <drt> <drt> <drt> <drt> <drt>
#>  1 Aadhya  161.754770 days 125.59109 days   0.000…  73.…   0.…  61.…   0.…  23.…
#>  2 Aaditri  30.671230 days 172.20370 days   0.000… 224.… 131.… 193.…  67.…  93.…
#>  3 Aaleeya 174.678539 days 115.27412 days 159.963… 198.… 184.…  77.… 214.… 292.…
#>  4 Aaleyah 183.860531 days  88.26322 days   0.000…   9.… 261.… 259.… 214.… 100.…
#>  5 Aamia    70.429895 days 149.78110 days 122.847…  40.…  32.…  74.… 116.… 161.…
#>  6 Aarish  232.188970 days 234.42468 days  70.517…   0.…   7.… 111.… 153.… 118.…
#>  7 Abdulah   8.890604 days  37.66665 days 130.246…   0.…   0.… 215.… 159.… 162.…
#>  8 Abhinav 221.372920 days 108.87628 days 114.363… 280.… 136.… 100.…  97.…  95.…
#>  9 Abhiram  73.235939 days 182.81309 days 295.836… 107.…  96.… 137.… 123.…  68.…
#> 10 Abyade  107.638230 days 596.18206 days 168.711… 146.…  61.…  34.… 281.… 149.…
#> # … with 490 more rows

df %>% 
  complete(subj, dose, fill = list(time = difftime0)) %>% 
  group_by(subj, dose) %>% 
  summarise(time = sum(time), .groups = "drop_last") %>%
  mutate(time = foo1(dose, time)) %>% 
  pivot_wider(
    names_from = "dose",
    names_sort = TRUE,
    values_from = "time"
  )
#> # A tibble: 500 × 9
#> # Groups:   subj [500]
#>    subj    `10`           `20`           `30`      `40`  `50`  `60`  `70`  `80` 
#>    <chr>   <drtn>         <drtn>         <drtn>    <drt> <drt> <drt> <drt> <drt>
#>  1 Aadhya   446.5804 days  284.8256 days  159.234… 159.…  85.…  85.…  23.…  23.…
#>  2 Aaditri  913.9830 days  883.3118 days  711.108… 711.… 486.… 354.… 161.…  93.…
#>  3 Aaleeya 1417.4559 days 1242.7773 days 1127.503… 967.… 769.… 584.… 507.… 292.…
#>  4 Aaleyah 1117.0842 days  933.2237 days  844.960… 844.… 835.… 573.… 314.… 100.…
#>  5 Aamia    769.4857 days  699.0558 days  549.274… 426.… 385.… 352.… 278.… 161.…
#>  6 Aarish   927.1830 days  694.9940 days  460.569… 390.… 390.… 382.… 271.… 118.…
#>  7 Abdulah  714.1323 days  705.2417 days  667.575… 537.… 537.… 537.… 321.… 162.…
#>  8 Abhinav 1154.8195 days  933.4466 days  824.570… 710.… 429.… 292.… 192.…  95.…
#>  9 Abhiram 1086.0283 days 1012.7924 days  829.979… 534.… 426.… 329.… 192.…  68.…
#> 10 Abyade  1545.7654 days 1438.1272 days  841.945… 673.… 526.… 465.… 431.… 149.…
#> # … with 490 more rows

And there you go. Three solutions, all the same.

Sometimes it’s useful to try to optimize code. Other times it’s just results in a blog post.

As long as your code is easy to read and not apparently slow, it’s probably fine.