Excluding missing or bad data in R: not as easy as it should be!
This is part of an ongoing series of posts about functions in my R package (https://github.com/Deleetdk/kirkegaard). Suppose you have a list or a simple vector (lists are vectors) with some data. However, some of it is missing or bad in various ways: NA, NULL, NaN, Inf (or -Inf). Usually we want to get rid of these data points, but that can be difficult with the built-in functions. R's built-in functions for handling missing (or bad) data are:
is.na
is.nan
is.infinite / is.finite
is.null
Unfortunately, they are not consistently vectorized and some of them match multiple types. For instance:
x = list(1, NA, 2, NULL, 3, NaN, 4, Inf) #example list
is.na(x)
#> [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
So, is.na actually matches NaN as well. What about is.nan?
is.nan(x)
#> Error in is.nan(x) : default method not implemented for type 'list'
So is.nan has no method for lists, i.e. it is not vectorized over them. And it gets worse when we apply it element by element:
sapply(x, is.nan)
#> [[1]]
#> [1] FALSE
#>
#> [[2]]
#> [1] FALSE
#>
#> [[3]]
#> [1] FALSE
#>
#> [[4]]
#> logical(0)
#>
#> [[5]]
#> [1] FALSE
#>
#> [[6]]
#> [1] TRUE
#>
#> [[7]]
#> [1] FALSE
#>
#> [[8]]
#> [1] FALSE
Note that calling is.nan on NULL returns an empty logical vector (logical(0)) instead of FALSE. Because that element has length zero, sapply can no longer simplify its result, so we get a list back instead of a logical vector we can subset with. is.infinite behaves the same way: it is not vectorized over lists and gives logical(0) for NULL. Now suppose you want a robust function for excluding missing data, one that is also specific about which types it excludes. I could not find such a function, so I wrote one, exclude_missing. Testing it:
are_equal(exclude_missing(x), list(1, 2, 3, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .NA = F), list(1, NA, 2, 3, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .NULL = F), list(1, 2, NULL, 3, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .NaN = F), list(1, 2, 3, NaN, 4))
#> [1] TRUE
are_equal(exclude_missing(x, .Inf = F), list(1, 2, 3, 4, Inf))
#> [1] TRUE
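For readers curious how this kind of type-specific exclusion could be built on top of the base functions, here is a minimal sketch. It is not the code from the package: drop_missing_sketch and its inner helper is_scalar_hit are hypothetical names, and only the argument names (.NA, .NULL, .NaN, .Inf) are borrowed from the calls above. The idea is to test each element separately with vapply, which always returns a plain logical vector, and to deal with NULL before any of the is.* functions see it.

```r
# Hypothetical sketch, NOT the kirkegaard implementation of exclude_missing.
drop_missing_sketch <- function(x, .NA = TRUE, .NULL = TRUE, .NaN = TRUE, .Inf = TRUE) {
  # TRUE only when `el` is a length-one atomic value for which `test` is TRUE
  is_scalar_hit <- function(el, test) {
    is.atomic(el) && length(el) == 1L && isTRUE(test(el))
  }

  # One logical per element: should this element be dropped?
  drop <- vapply(x, function(el) {
    if (is.null(el)) return(.NULL)                      # handle NULL first
    if (is_scalar_hit(el, is.nan)) return(.NaN)         # NaN before NA, since is.na(NaN) is TRUE
    if (is_scalar_hit(el, is.na)) return(.NA)
    if (is_scalar_hit(el, is.infinite)) return(.Inf)    # matches both Inf and -Inf
    FALSE                                               # everything else is kept
  }, logical(1))

  x[!drop]
}

x <- list(1, NA, 2, NULL, 3, NaN, 4, Inf)
identical(drop_missing_sketch(x), list(1, 2, 3, 4))
identical(drop_missing_sketch(x, .NaN = FALSE), list(1, 2, 3, NaN, 4))
```

Checking NaN before NA is what gives the function its specificity, and vapply with logical(1) would raise an error rather than silently return a list if any element produced something other than a single TRUE or FALSE.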
So, in every test, exclude_missing excludes only the types we want excluded, and it does not fail because the base R functions lack vectorization over lists.
Edit: it turns out that there are more problems:
is.na(list(NA))
#> [1] TRUE
So, for some reason, is.na returns TRUE when given a list containing NA. I don't think this should happen.
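To make the consequence concrete, here is a small sketch of how this bites when is.na is applied element by element to a list that contains a nested list. As far as I can tell from the base R help page for is.na, the list method flags an element only when it is a length-one atomic vector holding NA or NaN, which would explain the result. The names nested and is_scalar_na are made up for this example.

```r
nested <- list(1, list(NA), NA)

# Called on the outer list: the nested list(NA) is not atomic, so it is not flagged.
outer_check <- is.na(nested)
outer_check
#> [1] FALSE FALSE  TRUE

# Called element by element: the second call is is.na(list(NA)), which is TRUE.
inner_check <- vapply(nested, function(el) isTRUE(is.na(el)), logical(1))
inner_check
#> [1] FALSE  TRUE  TRUE

# A guard that accepts only a length-one atomic NA (hypothetical helper).
is_scalar_na <- function(el) is.atomic(el) && length(el) == 1L && is.na(el)
guarded <- vapply(nested, is_scalar_na, logical(1))
guarded
#> [1] FALSE FALSE  TRUE
```

With a guard like this, an element that is itself a list is left alone rather than being treated as missing.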