kirkegaard: conditional recoding with conditional_change()
Working with large public datasets usually requires recoding variables, which can be quite repetitive. When a variable has only a few possible values, plyr's mapvalues() works well (see my answer at SO). However, when the number of distinct values is large or not known in advance, listing every value to map is impractical. What one wants instead is conditional recoding: apply a test to each value and use the result to decide whether to change it. With the base R approach this gets somewhat messy:
d$R0006500[d$R0006500<0] <- NA
d$R0007900[d$R0007900<0] <- NA
d$R0002200[d$R0002200<0] <- NA
d$R0002500[d$R0002500<0] <- NA
d$R0217900[d$R0217900<0] <- NA
d$R0618301[d$R0618301<0] <- NA
d$R7007300[d$R7007300<0] <- NA
Above, values below 0 in 7 variables are recoded to NA (the variables are from the NLSY dataset). This clearly violates DRY (don't repeat yourself), so a better approach is needed. I came up with this:
v_vars = c("R0006500", "R0007900", "R0002200", "R0002500", "R0217900", "R0618301", "R7007300")
for (var in v_vars) {
  d[var] = conditional_change(d[var], func_str = "<0", new_value = NA)
}
This is somewhat shorter (191 vs. 222 characters) and easily handles more variables. It is built using a functional programming approach to remapping: one supplies a function that returns a boolean vector, which is then used to decide which values to remap. To understand how it works, I will go over the functions it makes use of.
Condition functions
These are simple functions that return a boolean vector when applied to a vector. For instance, I have made is_negative(), which works as expected:
> is_negative(-1:1)
[1] TRUE FALSE FALSE
There are also companion functions: is_positive() and is_zero(). However, those are only three options, and ideally we would not have to write an anonymous function every time we want something slightly different, such as a test of whether a value is below 5. Therefore, we need a function factory: math_to_function(). It takes a string and returns a condition function like the ones above. For instance:
> less_than_five = math_to_function("<5")
> less_than_five(1:10)
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
First we make a new function that tests for less than five, then we test it on a vector.
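To make the idea of a function factory concrete, here is a minimal sketch of how such a function could be written. It is not the package's actual code: the name math_to_function_sketch() and the parse/eval strategy are assumptions made for illustration.

```r
# Hypothetical re-implementation for illustration; the real
# math_to_function() in the package may work differently.
math_to_function_sketch = function(str) {
  # Build the source text of a one-argument function,
  # e.g. "function(x) x <5", then parse and evaluate it
  # to obtain an actual R function.
  eval(parse(text = paste0("function(x) x ", str)))
}

less_than_five = math_to_function_sketch("<5")
less_than_five(1:10)
```

One caveat with this particular sketch: a string such as "<-5" would be parsed as an assignment (x <- 5) rather than a comparison, so a negative threshold needs a space, as in "< -5".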
After that, we are ready to understand conditional_change(). It takes three things: an object whose values should be changed, either a condition function or a function string (which is passed to math_to_function()), and the new value to use. For instance:
> conditional_change(1:10, func_str = "<5", new_value = NA)
[1] NA NA NA NA 5 6 7 8 9 10
> conditional_change(1:10, func_str = "> 9", new_value = NA)
[1] 1 2 3 4 5 6 7 8 9 NA
> conditional_change(data.frame(1:10), func_str = "> 9", new_value = NA)
   X1.10
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    NA
> conditional_change(list(1:10, 1:10), func_str = "> 9", new_value = NA)
[[1]]
[1] 1 2 3 4 5 6 7 8 9 NA
[[2]]
[1] 1 2 3 4 5 6 7 8 9 NA
And so on.
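The behavior shown in these examples can be approximated with a short recursive function. The following is only a sketch under assumptions: the name conditional_change_sketch(), the argument name func, and the recursion over lists are my guesses at a reasonable design, not the package's implementation.

```r
# Hypothetical sketch; argument names other than func_str and
# new_value are assumptions, as is the internal logic.
conditional_change_sketch = function(x, func, func_str, new_value) {
  # Accept either a condition function or a string such as "<0".
  if (missing(func)) {
    func = eval(parse(text = paste0("function(x) x ", func_str)))
  }

  # Lists are handled element by element; a data frame is a list
  # of columns, and x[] = ... keeps its class and names intact.
  if (is.list(x)) {
    x[] = lapply(x, function(el) {
      conditional_change_sketch(el, func = func, new_value = new_value)
    })
    return(x)
  }

  # Atomic vectors and matrices: replace wherever the condition is TRUE.
  # which() drops NA results, so existing missing values are left alone.
  x[which(func(x))] = new_value
  x
}

conditional_change_sketch(1:10, func_str = "<5", new_value = NA)
conditional_change_sketch(data.frame(a = 1:10), func_str = "> 9", new_value = NA)
```

Recursing over lists is what lets one function cover vectors, data frames and plain lists alike, which is the property the examples above rely on.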
Future improvements
There are various ways one could improve upon the approach taken here. First, one could vectorize conditional_change() so that it takes a list of functions (or a vector of function strings) together with a vector of new values, and applies each pair in turn. This would save some typing in some cases. Second, instead of condition functions, one could use functions that return the recoded values directly. So one could use functions like:
recode_outliers = function(x) {
  x[x < -2] = -2
  x[x > 2] = 2
  return(x)
}
> recode_outliers(-5:5)
[1] -2 -2 -2 -2 -1 0 1 2 2 2 2
> recode_outliers(matrix(-4:4, nrow=3))
     [,1] [,2] [,3]
[1,]   -2   -1    2
[2,]   -2    0    2
[3,]   -2    1    2
Perhaps this is a better approach.
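For comparison, the first idea (vectorizing over several conditions) could look roughly like the sketch below. The function conditional_change_multi() and its arguments func_strs and new_values are hypothetical names introduced here, and this version only handles atomic vectors and matrices.

```r
# Hypothetical helper, not part of the package as described above.
conditional_change_multi = function(x, func_strs, new_values) {
  # Each condition string needs exactly one replacement value.
  stopifnot(length(func_strs) == length(new_values))

  for (i in seq_along(func_strs)) {
    # Turn the i-th string (e.g. "> 2") into a condition function.
    cond = eval(parse(text = paste0("function(x) x ", func_strs[i])))
    # Apply the rules in order; later rules see earlier replacements.
    x[which(cond(x))] = new_values[i]
  }
  x
}

conditional_change_multi(-5:5, func_strs = c("< -2", "> 2"), new_values = c(-2, 2))
```

With these two rules the call reproduces what recode_outliers() does for a vector. Because the rules are applied one after another, their order can matter when conditions overlap.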