Selecting columns in {dplyr}

One feature of report building that has always bothered me was adjusting column names in final so I could have something prettier than a series of columns_with_underscores that people who use too much Excel find repulsive. The same is true of anyType of CamelCase. There is a functionality with dplyr that let’s me manage updating these names through a clean, reversible, and friendly manner. I discovered this like a toddler just seeing what could happen if I passed a named vector into a select() function and delighted with the result. Weird though because I didn’t remember seeing this in any of the documentation, and when I searched harder through dplyr and tidyselect I found nothing except for a sort of close but not really close enough reference in an faq in from tidyselect which warns against the use of external vectors. However, we should be safeguarded against accidents (and warnings) if we employ all_of() (and any_of()).

Let’s walk through an example.

Code

library(dplyr, warn.conflicts = FALSE)

We’ll use a data set from the psych package that has relatively, very short column names. We’ll want to take these and make them a bit more specific without incurring much penalty with ourselves. Trying to manage column names with spaces and other special characters can be a real thorn in the index finger.

Code

data("bfi", package = "psych")
sapa <- tibble::as_tibble(bfi)[1:100, c(1:2, 6:7, 11:12, 16:17, 26:28)] # shorter
sapa
#> # A tibble: 100 × 11
#>       A1    A2    C1    C2    E1    E2    N1    N2 gender education   age
#>    <int> <int> <int> <int> <int> <int> <int> <int>  <int>     <int> <int>
#>  1     2     4     2     3     3     3     3     4      1        NA    16
#>  2     2     4     5     4     1     1     3     3      2        NA    18
#>  3     5     4     4     5     2     4     4     5      2        NA    17
#>  4     4     4     4     4     5     3     2     5      2        NA    17
#>  5     2     3     4     4     2     2     2     3      1        NA    17
#>  6     6     6     6     6     2     1     3     5      2         3    21
#>  7     2     5     5     4     4     3     1     2      1        NA    18
#>  8     4     3     3     2     3     6     6     3      1         2    19
#>  9     4     3     6     6     5     3     5     5      1         1    19
#> 10     2     5     6     5     2     2     5     5      2        NA    17
#> # … with 90 more rows

We can create a named vector to help keep track of the longer, more specific names of our output data. The first 8 columns will be renamed and the demographic information will be moved to the start.

Code

long_names <- c(
  "gender",
  "education",
  "age",
  "Indifferent to feelings"      = "A1",
  "Inquire about well-being"     = "A2",
  "Exacting about work"          = "C1",
  "Continue until perfection"    = "C2",
  "Don't talk a lot"             = "E1",
  "Difficult to approach others" = "E2",
  "Get angry easily"             = "N1",
  "Get irritated easily"         = "N2"
)

Here’s the typical solution.

Code

sapa %>% 
  select(
    gender,
    education,
    age,
    `Indifferent to feelings`      = A1,
    `Inquire about well-being`     = A2,
    `Exacting about work`          = C1,
    `Continue until perfection`    = C2,
    `Don't talk a lot`             = E1,
    `Difficult to approach others` = E2,
    `Get angry easily`             = N1,
    `Get irritated easily`         = N2
  )
#> # A tibble: 100 × 11
#>    gender education   age `Indifferent to fe…` `Inquire about…` `Exacting abou…`
#>     <int>     <int> <int>                <int>            <int>            <int>
#>  1      1        NA    16                    2                4                2
#>  2      2        NA    18                    2                4                5
#>  3      2        NA    17                    5                4                4
#>  4      2        NA    17                    4                4                4
#>  5      1        NA    17                    2                3                4
#>  6      2         3    21                    6                6                6
#>  7      1        NA    18                    2                5                5
#>  8      1         2    19                    4                3                3
#>  9      1         1    19                    4                3                6
#> 10      2        NA    17                    2                5                6
#> # … with 90 more rows, and 5 more variables: `Continue until perfection` <int>,
#> #   `Don't talk a lot` <int>, `Difficult to approach others` <int>,
#> #   `Get angry easily` <int>, `Get irritated easily` <int>

We can use the tidyselect::all_of() function without as it is reexported with dplyr.

Code

sapa %>% 
  select(all_of(long_names))
#> # A tibble: 100 × 11
#>    gender education   age `Indifferent to fe…` `Inquire about…` `Exacting abou…`
#>     <int>     <int> <int>                <int>            <int>            <int>
#>  1      1        NA    16                    2                4                2
#>  2      2        NA    18                    2                4                5
#>  3      2        NA    17                    5                4                4
#>  4      2        NA    17                    4                4                4
#>  5      1        NA    17                    2                3                4
#>  6      2         3    21                    6                6                6
#>  7      1        NA    18                    2                5                5
#>  8      1         2    19                    4                3                3
#>  9      1         1    19                    4                3                6
#> 10      2        NA    17                    2                5                6
#> # … with 90 more rows, and 5 more variables: `Continue until perfection` <int>,
#> #   `Don't talk a lot` <int>, `Difficult to approach others` <int>,
#> #   `Get angry easily` <int>, `Get irritated easily` <int>

But is it faster?

Code

foo_select_all_of <- function() {
  long_names <- c(
    "gender",
    "education",
    "age",
    "Indifferent to feelings"      = "A1",
    "Inquire about well-being"     = "A2",
    "Exacting about work"          = "C1",
    "Continue until perfection"    = "C2",
    "Don't talk a lot"             = "E1",
    "Difficult to approach others" = "E2",
    "Get angry easily"             = "N1",
    "Get irritated easily"         = "N2"
  )
  
  sapa %>% 
  select(all_of(long_names))
}

foo_select <- function() {
  sapa %>% 
  select(
    gender,
    education,
    age,
    `Indifferent to feelings`      = A1,
    `Inquire about well-being`     = A2,
    `Exacting about work`          = C1,
    `Continue until perfection`    = C2,
    `Don't talk a lot`             = E1,
    `Difficult to approach others` = E2,
    `Get angry easily`             = N1,
    `Get irritated easily`         = N2
  )
}

bench::mark(
  foo_select_all_of(),
  foo_select()
)
#> # A tibble: 2 × 6
#>   expression               min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 foo_select_all_of()   1.51ms   1.81ms      446.    8.67KB     19.1
#> 2 foo_select()          5.97ms   6.77ms      143.   41.45KB     28.6

Yes.

We get the same result and don’t need to clog up the piped if we need to do some mutation, grouping, summarising, etc. This also lets us separate out definitions of the data in case we need to change things:

Code

long_names_less <- long_names[c(1, 3, grep("about", names(long_names)))]

sapa %>% 
  select(all_of(long_names_less))
#> # A tibble: 100 × 4
#>    gender   age `Inquire about well-being` `Exacting about work`
#>     <int> <int>                      <int>                 <int>
#>  1      1    16                          4                     2
#>  2      2    18                          4                     5
#>  3      2    17                          4                     4
#>  4      2    17                          4                     4
#>  5      1    17                          3                     4
#>  6      2    21                          6                     6
#>  7      1    18                          5                     5
#>  8      1    19                          3                     3
#>  9      1    19                          3                     6
#> 10      2    17                          5                     6
#> # … with 90 more rows

Using any_of() instead we could essentially pre-define more “programming” and “output” names and pass it to whatever you are working with. This has been useful by establishing a saved vector of names and using it across multiple reports to keep our naming convention consistent.

We can even write some short functions in case we need to use an output we’ve created before:

Code

names_fill <- function(x) {
  nm <- names(x)
  blanks <- nm == ""
  names(x)[blanks] <- x[blanks]
  x
}

Code

sapa2 <- sapa %>% select(all_of(long_names))
long_names_switched <- mark::names_switch(names_fill(long_names))
long_names
#>                                                           
#>                     "gender"                  "education" 
#>                                   Indifferent to feelings 
#>                        "age"                         "A1" 
#>     Inquire about well-being          Exacting about work 
#>                         "A2"                         "C1" 
#>    Continue until perfection             Don't talk a lot 
#>                         "C2"                         "E1" 
#> Difficult to approach others             Get angry easily 
#>                         "E2"                         "N1" 
#>         Get irritated easily 
#>                         "N2"
long_names_switched
#>                         gender                      education 
#>                       "gender"                    "education" 
#>                            age                             A1 
#>                          "age"      "Indifferent to feelings" 
#>                             A2                             C1 
#>     "Inquire about well-being"          "Exacting about work" 
#>                             C2                             E1 
#>    "Continue until perfection"             "Don't talk a lot" 
#>                             E2                             N1 
#> "Difficult to approach others"             "Get angry easily" 
#>                             N2 
#>         "Get irritated easily"

sapa2 %>% 
  select(all_of(long_names_switched))
#> # A tibble: 100 × 11
#>    gender education   age    A1    A2    C1    C2    E1    E2    N1    N2
#>     <int>     <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#>  1      1        NA    16     2     4     2     3     3     3     3     4
#>  2      2        NA    18     2     4     5     4     1     1     3     3
#>  3      2        NA    17     5     4     4     5     2     4     4     5
#>  4      2        NA    17     4     4     4     4     5     3     2     5
#>  5      1        NA    17     2     3     4     4     2     2     2     3
#>  6      2         3    21     6     6     6     6     2     1     3     5
#>  7      1        NA    18     2     5     5     4     4     3     1     2
#>  8      1         2    19     4     3     3     2     3     6     6     3
#>  9      1         1    19     4     3     6     6     5     3     5     5
#> 10      2        NA    17     2     5     6     5     2     2     5     5
#> # … with 90 more rows

And now our names are back to normal.