Excluding missing or bad data in R: not as easy as it should be!
This is part of an ongoing series of posts about functions in my R package (https://github.com/Deleetdk/kirkegaard).

Suppose you have a list or a simple vector (lists are vectors too) with some data, but some of it is missing or bad in various ways: NA, NULL, NaN, Inf (or -Inf). Usually we want to get rid of these data points, but that can be difficult with the built-in functions. R's built-in functions for handling missing (or bad) data are:
is.na
is.nan
is.null
is.infinite / is.finite
Unfortunately, they are not consistently vectorized and some of them match multiple types. For instance:
    x = list(1, NA, 2, NULL, 3, NaN, 4, Inf) # example list
    is.na(x)
    #> [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
So, is.na actually matches NaN as well. What about is.nan?
    is.nan(x)
    #> Error in is.nan(x) : default method not implemented for type 'list'
That turns out not to be vectorized over lists at all. It gets worse when we apply it element by element:
    sapply(x, is.nan)
    #> [[1]]
    #> [1] FALSE
    #>
    #> [[2]]
    #> [1] FALSE
    #>
    #> [[3]]
    #> [1] FALSE
    #>
    #> [[4]]
    #> logical(0)
    #>
    #> [[5]]
    #> [1] FALSE
    #>
    #> [[6]]
    #> [1] TRUE
    #>
    #> [[7]]
    #> [1] FALSE
    #>
    #> [[8]]
    #> [1] FALSE
Note that calling is.nan on NULL returns an empty logical vector (logical(0)) instead of FALSE. This also makes sapply return a list instead of a logical vector that we could subset with. is.infinite behaves the same way: it is not vectorized over lists and gives logical(0) for NULL.

Suppose instead that you want a robust function for handling missing data, and one that is specific about which types it excludes. I could not find such a function, so I wrote one. Testing it:
    are_equal(exclude_missing(x), list(1, 2, 3, 4))
    #> [1] TRUE
    are_equal(exclude_missing(x, .NA = F), list(1, NA, 2, 3, 4))
    #> [1] TRUE
    are_equal(exclude_missing(x, .NULL = F), list(1, 2, NULL, 3, 4))
    #> [1] TRUE
    are_equal(exclude_missing(x, .NaN = F), list(1, 2, 3, NaN, 4))
    #> [1] TRUE
    are_equal(exclude_missing(x, .Inf = F), list(1, 2, 3, 4, Inf))
    #> [1] TRUE
So in every case it excludes only the types we asked it to exclude, and it does not fail because of the lack of vectorization in the base R functions.

Edit: it turns out that there are more problems:
    is.na(list(NA))
    #> [1] TRUE
So is.na returns TRUE for a list element that holds a single NA. The help page ?is.na does document this for list elements that are length-one atomic vectors, but I think it shouldn't happen.
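
For illustration, the behavior tested earlier in this post could be approximated along the following lines. This is only a minimal sketch, not the actual code in the kirkegaard package: the names exclude_missing_sketch and is_scalar_match are hypothetical, and only the argument names (.NA, .NULL, .NaN, .Inf) are taken from the calls shown in the post. It also assumes that only length-one atomic elements should ever count as NA, NaN or infinite.

```r
# Hypothetical sketch; NOT the kirkegaard package's implementation.

# TRUE only if `el` is a length-one atomic value for which `test` is TRUE.
# NULL, longer vectors and nested lists give FALSE instead of logical(0)
# or an error.
is_scalar_match <- function(el, test) {
  is.atomic(el) && length(el) == 1L && isTRUE(test(el))
}

exclude_missing_sketch <- function(x, .NA = TRUE, .NULL = TRUE,
                                   .NaN = TRUE, .Inf = TRUE) {
  # vapply guarantees exactly one logical per element, so the result
  # is always a plain logical vector that can be used for subsetting.
  drop <- vapply(x, function(el) {
    (.NULL && is.null(el)) ||
      (.NaN && is_scalar_match(el, is.nan)) ||
      # is.na() is also TRUE for NaN, so NaN is ruled out explicitly
      (.NA && is_scalar_match(el, function(v) is.na(v) && !is.nan(v))) ||
      (.Inf && is_scalar_match(el, is.infinite))
  }, logical(1))
  x[!drop]
}

x <- list(1, NA, 2, NULL, 3, NaN, 4, Inf)
identical(exclude_missing_sketch(x), list(1, 2, 3, 4))
#> [1] TRUE
identical(exclude_missing_sketch(x, .NaN = FALSE), list(1, 2, 3, NaN, 4))
#> [1] TRUE
```

The design choice here is to test each element separately and wrap every base R predicate so that it always yields a single TRUE or FALSE, which sidesteps both the missing list methods and the logical(0) result for NULL.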