wat? Character NAs – J M Barbone

Note

This interesting behavior was related to {fuj} v0.2.0 lst() behavior. An issue was created to track the fix: jmbarbone/fuj#60.

Short answer: Yes. Feel free to read through the journey of discovery.

The original post showed the behavior with substitute(), which made this feel a little more confusing that it really is.

When trying to compare two values, R makes an attempt to find a common data type. This shouldn’t be new:

Code

# double and integer
0.0 == 0L
#> [1] TRUE

# double and logical
0 == FALSE
#> [1] TRUE

# double and character
0 == "0"
#> [1] TRUE

# date and character
as.Date("2024-05-08") == "2024-05-08"
#> [1] TRUE

In the above examples, all the values are coerced to the same type before the comparison is made. These all resolve in the values being equal. However, there is a special behavior invoked when we compare lists.

Here’s the wat simplified:

Code

# NA resolves to NA
NA == ""
#> [1] NA
NA_character_ == ""
#> [1] NA

# but not in a list
list(NA) == ""
#> [1] FALSE

# unless it's a character
list(NA_character_) == ""
#> [1] NA

Let’s rewrite this as direct character conversions:

Code

as.character(NA)
#> [1] NA
as.character(NA_character_)
#> [1] NA
as.character(list(NA))
#> [1] "NA"
as.character(list(NA_character_))
#> [1] NA

There it is: list(NA) turns into "NA". wat? Well, these two help files provide the relevant information on what is happening here:

Language objects such as symbols and calls are deparsed to character strings before comparison.
?base::Comparison, Details

For lists and pairlists (including language objects such as calls) it deparses the elements individually, except that it extracts the first element of length-one character vectors.
?base::character, Value

The extraction of length 1 character vectors is the described behavior that I was missing. When calling deparse() on a non-list vector, we get a different result for NA and NA_character_: So, here’s what happens:

R detects that one comparison is a list
R converts the list vector to a character vector
For each element in the list, R converts checks if it is a single length character vector or not
If it is a single length character vector, R simply returns the (first element of the) value
If it is not a single length character vector, R deparses the value

We can replicate this with a custom function:

Code

my_convert <- function(x) {
  do <- function(i) {
    if (is.character(i) && length(i) == 1L) {
      # extracts the first element of length-one character vectors
      return(i[1L])
    }
    
    # deparses elements individually
    deparse1(i, control = NULL)
  }
  
  # determine routine for each element
  vapply(x, do, "", USE.NAMES = FALSE)
}

x <- list(1:3, NA, NA_real_, NA_character_)
waldo::compare(as.character(x), sapply(x, deparse, control = NULL))
#> `old`: "1:3" "NA" "NA" NA  
#> `new`: "1:3" "NA" "NA" "NA"
waldo::compare(as.character(x), my_convert(x))
#> ✔ No differences

Tip

There is a default control of "keepNA", which retains additional information about our NA values. The list to character conversion doesn’t seem to use this as all the values are resolved to "NA" rather than "NA_character_", or the like.

Code

sapply(x, deparse, control = "keepNA")
#> [1] "1:3"           "NA"            "NA_real_"      "NA_character_"

--- title: "wat? Character NAs" subtitle: "But it is documented" date: 2024-05-20 categories: ["R", "{fuj}", "wat"] draft: false --- ::: {.callout-note} This interesting behavior was related to [`{fuj} v0.2.0 lst()`](https://github.com/jmbarbone/fuj/blob/v0.2.0/R/list.R) behavior. An issue was created to track the fix: [jmbarbone/fuj#60](https://github.com/jmbarbone/fuj/issues/60). ::: <iframe src="https://fosstodon.org/@barbone/112401680792851241/embed" class="mastodon-embed" style="max-width: 100%; border: 0" width="800" allowfullscreen="allowfullscreen"></iframe><script src="https://fosstodon.org/embed.js" async="async"></script> Short answer: Yes. Feel free to read through the journey of discovery. The original post showed the behavior with [`substitute()`](https://rdrr.io/r/base/substitute.html), which made this feel a little more confusing that it really is. When trying to compare two values, **R** makes an attempt to find a common data type. This shouldn't be new: ```{r} # double and integer 0.0 == 0L # double and logical 0 == FALSE # double and character 0 == "0" # date and character as.Date("2024-05-08") == "2024-05-08" ``` In the above examples, all the values are coerced to the same type before the comparison is made. These all resolve in the values being _equal_. However, there is a special behavior invoked when we compare lists. Here's the _wat_ simplified: ```{r} # NA resolves to NA NA == "" NA_character_ == "" # but not in a list list(NA) == "" # unless it's a character list(NA_character_) == "" ``` Let's rewrite this as direct character conversions: ```{r} as.character(NA) as.character(NA_character_) as.character(list(NA)) as.character(list(NA_character_)) ``` There it is: `list(NA)` turns into `"NA"`. _wat_? Well, these two help files provide the relevant information on what is happening here: > Language objects such as symbols and calls are deparsed to character strings before comparison. [`?base::Comparison`, Details](https://rdrr.io/r/base/Comparison.html) > For lists and pairlists (including language objects such as calls) it deparses the elements individually, except that it extracts the first element of length-one character vectors. [`?base::character`, Value](https://rdrr.io/r/base/character.html) The extraction of length 1 character vectors is the described behavior that I was missing. When calling [`deparse()`](https://rdrr.io/r/base/deparse.html) on a non-list vector, we get a different result for `NA` and `NA_character_`: So, here's what happens: 1. **R** detects that one comparison is a list 2. **R** converts the list vector to a character vector 3. For each element in the list, **R** converts checks if it is a single length character vector or not 4. If it is a single length character vector, **R** simply returns the (first element of the) value 5. If it is not a single length character vector, **R** _deparses_ the value We can replicate this with a custom function: ```{r} my_convert <- function(x) { do <- function(i) { if (is.character(i) && length(i) == 1L) { # extracts the first element of length-one character vectors return(i[1L]) } # deparses elements individually deparse1(i, control = NULL) } # determine routine for each element vapply(x, do, "", USE.NAMES = FALSE) } x <- list(1:3, NA, NA_real_, NA_character_) waldo::compare(as.character(x), sapply(x, deparse, control = NULL)) waldo::compare(as.character(x), my_convert(x)) ``` ::: {.callout-tip} There is a default `control` of `"keepNA"`, which retains additional information about our `NA` values. The `list` to `character` conversion doesn't seem to use this as all the values are resolved to `"NA"` rather than `"NA_character_"`, or the like. ```{r} sapply(x, deparse, control = "keepNA") ``` :::