Code
library(dplyr, warn.conflicts = FALSE)
September 24, 2020
One feature of report building that has always bothered me was adjusting column names in final so I could have something prettier than a series of columns_with_underscores that people who use too much Excel find repulsive. The same is true of anyType of CamelCase. There is a functionality with dplyr
that let’s me manage updating these names through a clean, reversible, and friendly manner. I discovered this like a toddler just seeing what could happen if I passed a named vector into a select()
function and delighted with the result. Weird though because I didn’t remember seeing this in any of the documentation, and when I searched harder through dplyr
and tidyselect
I found nothing except for a sort of close but not really close enough reference in an faq in from tidyselect
which warns against the use of external vectors. However, we should be safeguarded against accidents (and warnings) if we employ all_of()
(and any_of()
).
Let’s walk through an example.
We’ll use a data set from the psych
package that has relatively, very short column names. We’ll want to take these and make them a bit more specific without incurring much penalty with ourselves. Trying to manage column names with spaces and other special characters can be a real thorn in the index finger.
data("bfi", package = "psych")
sapa <- tibble::as_tibble(bfi)[1:100, c(1:2, 6:7, 11:12, 16:17, 26:28)] # shorter
sapa
#> # A tibble: 100 × 11
#> A1 A2 C1 C2 E1 E2 N1 N2 gender education age
#> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 2 4 2 3 3 3 3 4 1 NA 16
#> 2 2 4 5 4 1 1 3 3 2 NA 18
#> 3 5 4 4 5 2 4 4 5 2 NA 17
#> 4 4 4 4 4 5 3 2 5 2 NA 17
#> 5 2 3 4 4 2 2 2 3 1 NA 17
#> 6 6 6 6 6 2 1 3 5 2 3 21
#> 7 2 5 5 4 4 3 1 2 1 NA 18
#> 8 4 3 3 2 3 6 6 3 1 2 19
#> 9 4 3 6 6 5 3 5 5 1 1 19
#> 10 2 5 6 5 2 2 5 5 2 NA 17
#> # … with 90 more rows
We can create a named vector to help keep track of the longer, more specific names of our output data. The first 8 columns will be renamed and the demographic information will be moved to the start.
long_names <- c(
"gender",
"education",
"age",
"Indifferent to feelings" = "A1",
"Inquire about well-being" = "A2",
"Exacting about work" = "C1",
"Continue until perfection" = "C2",
"Don't talk a lot" = "E1",
"Difficult to approach others" = "E2",
"Get angry easily" = "N1",
"Get irritated easily" = "N2"
)
Here’s the typical solution.
sapa %>%
select(
gender,
education,
age,
`Indifferent to feelings` = A1,
`Inquire about well-being` = A2,
`Exacting about work` = C1,
`Continue until perfection` = C2,
`Don't talk a lot` = E1,
`Difficult to approach others` = E2,
`Get angry easily` = N1,
`Get irritated easily` = N2
)
#> # A tibble: 100 × 11
#> gender education age `Indifferent to fe…` `Inquire about…` `Exacting abou…`
#> <int> <int> <int> <int> <int> <int>
#> 1 1 NA 16 2 4 2
#> 2 2 NA 18 2 4 5
#> 3 2 NA 17 5 4 4
#> 4 2 NA 17 4 4 4
#> 5 1 NA 17 2 3 4
#> 6 2 3 21 6 6 6
#> 7 1 NA 18 2 5 5
#> 8 1 2 19 4 3 3
#> 9 1 1 19 4 3 6
#> 10 2 NA 17 2 5 6
#> # … with 90 more rows, and 5 more variables: `Continue until perfection` <int>,
#> # `Don't talk a lot` <int>, `Difficult to approach others` <int>,
#> # `Get angry easily` <int>, `Get irritated easily` <int>
We can use the tidyselect::all_of()
function without as it is reexported with dplyr
.
sapa %>%
select(all_of(long_names))
#> # A tibble: 100 × 11
#> gender education age `Indifferent to fe…` `Inquire about…` `Exacting abou…`
#> <int> <int> <int> <int> <int> <int>
#> 1 1 NA 16 2 4 2
#> 2 2 NA 18 2 4 5
#> 3 2 NA 17 5 4 4
#> 4 2 NA 17 4 4 4
#> 5 1 NA 17 2 3 4
#> 6 2 3 21 6 6 6
#> 7 1 NA 18 2 5 5
#> 8 1 2 19 4 3 3
#> 9 1 1 19 4 3 6
#> 10 2 NA 17 2 5 6
#> # … with 90 more rows, and 5 more variables: `Continue until perfection` <int>,
#> # `Don't talk a lot` <int>, `Difficult to approach others` <int>,
#> # `Get angry easily` <int>, `Get irritated easily` <int>
foo_select_all_of <- function() {
long_names <- c(
"gender",
"education",
"age",
"Indifferent to feelings" = "A1",
"Inquire about well-being" = "A2",
"Exacting about work" = "C1",
"Continue until perfection" = "C2",
"Don't talk a lot" = "E1",
"Difficult to approach others" = "E2",
"Get angry easily" = "N1",
"Get irritated easily" = "N2"
)
sapa %>%
select(all_of(long_names))
}
foo_select <- function() {
sapa %>%
select(
gender,
education,
age,
`Indifferent to feelings` = A1,
`Inquire about well-being` = A2,
`Exacting about work` = C1,
`Continue until perfection` = C2,
`Don't talk a lot` = E1,
`Difficult to approach others` = E2,
`Get angry easily` = N1,
`Get irritated easily` = N2
)
}
bench::mark(
foo_select_all_of(),
foo_select()
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 foo_select_all_of() 1.51ms 1.81ms 446. 8.67KB 19.1
#> 2 foo_select() 5.97ms 6.77ms 143. 41.45KB 28.6
Yes.
We get the same result and don’t need to clog up the piped if we need to do some mutation, grouping, summarising, etc. This also lets us separate out definitions of the data in case we need to change things:
long_names_less <- long_names[c(1, 3, grep("about", names(long_names)))]
sapa %>%
select(all_of(long_names_less))
#> # A tibble: 100 × 4
#> gender age `Inquire about well-being` `Exacting about work`
#> <int> <int> <int> <int>
#> 1 1 16 4 2
#> 2 2 18 4 5
#> 3 2 17 4 4
#> 4 2 17 4 4
#> 5 1 17 3 4
#> 6 2 21 6 6
#> 7 1 18 5 5
#> 8 1 19 3 3
#> 9 1 19 3 6
#> 10 2 17 5 6
#> # … with 90 more rows
Using
any_of()
instead we could essentially pre-define more “programming” and “output” names and pass it to whatever you are working with. This has been useful by establishing a saved vector of names and using it across multiple reports to keep our naming convention consistent.
We can even write some short functions in case we need to use an output we’ve created before:
sapa2 <- sapa %>% select(all_of(long_names))
long_names_switched <- mark::names_switch(names_fill(long_names))
long_names
#>
#> "gender" "education"
#> Indifferent to feelings
#> "age" "A1"
#> Inquire about well-being Exacting about work
#> "A2" "C1"
#> Continue until perfection Don't talk a lot
#> "C2" "E1"
#> Difficult to approach others Get angry easily
#> "E2" "N1"
#> Get irritated easily
#> "N2"
long_names_switched
#> gender education
#> "gender" "education"
#> age A1
#> "age" "Indifferent to feelings"
#> A2 C1
#> "Inquire about well-being" "Exacting about work"
#> C2 E1
#> "Continue until perfection" "Don't talk a lot"
#> E2 N1
#> "Difficult to approach others" "Get angry easily"
#> N2
#> "Get irritated easily"
sapa2 %>%
select(all_of(long_names_switched))
#> # A tibble: 100 × 11
#> gender education age A1 A2 C1 C2 E1 E2 N1 N2
#> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 1 NA 16 2 4 2 3 3 3 3 4
#> 2 2 NA 18 2 4 5 4 1 1 3 3
#> 3 2 NA 17 5 4 4 5 2 4 4 5
#> 4 2 NA 17 4 4 4 4 5 3 2 5
#> 5 1 NA 17 2 3 4 4 2 2 2 3
#> 6 2 3 21 6 6 6 6 2 1 3 5
#> 7 1 NA 18 2 5 5 4 4 3 1 2
#> 8 1 2 19 4 3 3 2 3 6 6 3
#> 9 1 1 19 4 3 6 6 5 3 5 5
#> 10 2 NA 17 2 5 6 5 2 2 5 5
#> # … with 90 more rows
And now our names are back to normal.
---
title: "Selecting columns in `{dplyr}`"
subtitle: Quick tips
date: "2020-09-24"
categories: ["R", "{dplyr}", "{tidyselect}"]
---
One feature of report building that has always bothered me was adjusting column names in final so I could have something prettier than a series of columns_with_underscores that people who use too much Excel find repulsive.
The same is true of anyType of CamelCase.
There is a functionality with `dplyr` that let's me manage updating these names through a clean, reversible, and friendly manner.
I discovered this like a toddler just seeing what could happen if I passed a named vector into a `select()` function and delighted with the result.
Weird though because I didn't remember seeing this in any of the documentation, and when I searched harder through `dplyr` and `tidyselect` I found nothing except for a sort of close but not really close enough reference in an [faq in from `tidyselect`](https://github.com/r-lib/tidyselect/blob/master/man/faq/external-vector.Rmd) which warns against the use of external vectors.
However, we should be safeguarded against accidents (and warnings) if we employ `all_of()` (and `any_of()`).
Let's walk through an example.
```{r}
library(dplyr, warn.conflicts = FALSE)
```
We'll use a data set from the `psych` package that has relatively, very short column names.
We'll want to take these and make them a bit more specific without incurring much penalty with ourselves.
Trying to manage column names with spaces and other special characters can be a real thorn in the index finger.
```{r}
data("bfi", package = "psych")
sapa <- tibble::as_tibble(bfi)[1:100, c(1:2, 6:7, 11:12, 16:17, 26:28)] # shorter
sapa
```
We can create a named vector to help keep track of the longer, more specific names of our output data.
The first 8 columns will be renamed and the demographic information will be moved to the start.
```{r}
long_names <- c(
"gender",
"education",
"age",
"Indifferent to feelings" = "A1",
"Inquire about well-being" = "A2",
"Exacting about work" = "C1",
"Continue until perfection" = "C2",
"Don't talk a lot" = "E1",
"Difficult to approach others" = "E2",
"Get angry easily" = "N1",
"Get irritated easily" = "N2"
)
```
Here's the typical solution.
```{r}
sapa %>%
select(
gender,
education,
age,
`Indifferent to feelings` = A1,
`Inquire about well-being` = A2,
`Exacting about work` = C1,
`Continue until perfection` = C2,
`Don't talk a lot` = E1,
`Difficult to approach others` = E2,
`Get angry easily` = N1,
`Get irritated easily` = N2
)
```
We can use the `tidyselect::all_of()` function without as it is reexported with `dplyr`.
```{r}
sapa %>%
select(all_of(long_names))
```
<details>
<summary>But is it faster?</summary>
```{r benchmark}
foo_select_all_of <- function() {
long_names <- c(
"gender",
"education",
"age",
"Indifferent to feelings" = "A1",
"Inquire about well-being" = "A2",
"Exacting about work" = "C1",
"Continue until perfection" = "C2",
"Don't talk a lot" = "E1",
"Difficult to approach others" = "E2",
"Get angry easily" = "N1",
"Get irritated easily" = "N2"
)
sapa %>%
select(all_of(long_names))
}
foo_select <- function() {
sapa %>%
select(
gender,
education,
age,
`Indifferent to feelings` = A1,
`Inquire about well-being` = A2,
`Exacting about work` = C1,
`Continue until perfection` = C2,
`Don't talk a lot` = E1,
`Difficult to approach others` = E2,
`Get angry easily` = N1,
`Get irritated easily` = N2
)
}
bench::mark(
foo_select_all_of(),
foo_select()
)
```
Yes.
</details>
We get the same result and don't need to clog up the piped if we need to do some mutation, grouping, summarising, etc.
This also lets us separate out definitions of the data in case we need to change things:
```{r}
long_names_less <- long_names[c(1, 3, grep("about", names(long_names)))]
sapa %>%
select(all_of(long_names_less))
```
> Using `any_of()` instead we could essentially pre-define more "programming" and "output" names and pass it to whatever you are working with.
This has been useful by establishing a saved vector of names and using it across multiple reports to keep our naming convention consistent.
We can even write some short functions in case we need to use an output we've created before:
```{r}
names_fill <- function(x) {
nm <- names(x)
blanks <- nm == ""
names(x)[blanks] <- x[blanks]
x
}
```
```{r}
sapa2 <- sapa %>% select(all_of(long_names))
long_names_switched <- mark::names_switch(names_fill(long_names))
long_names
long_names_switched
sapa2 %>%
select(all_of(long_names_switched))
```
And now our names are back to normal.