Converting a data.frame to a numerical matrix in R, not so easy!
Edit 2019
You don't really need the below. Look at the `model.matrix()` function, which converts data frames into numeric matrices for glmnet and similar functions.
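As a minimal sketch of that route (the data frame df here is a made-up example, not from my original code), model.matrix() takes a formula and returns a numeric matrix. Note that it dummy-codes factors into 0/1 indicator columns instead of using their integer codes, which is usually what a regression wants anyway:

```r
# Hypothetical example data: one numeric column and one factor column
df = data.frame(a = 1:3, b = factor(c("j", "k", "l")))

# The formula ~ . uses every column of df as a predictor
X = model.matrix(~ ., data = df)

# Drop the "(Intercept)" column that model.matrix() adds
X = X[, colnames(X) != "(Intercept)", drop = FALSE]

print(X)
str(X)
```

The intercept column is dropped because glmnet fits its own intercept by default.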
Original post
Sometimes you need to use a function that wants a numeric matrix as input. One such function is cv.glmnet(), which performs lasso regression with cross-validation, which is very cool. Unfortunately, it is picky about how it wants the input data. Here are some lines of my code:
fit_cv = cv.glmnet(x = as.matrix(temp_df[predictors]), #predictor vars matrix
y = as.matrix(temp_df[dependent]), #dep var matrix
weights = weights_, #weights
alpha = alpha_) #type of shrinkage
We see that x must be a matrix of the predictors and y a matrix with the dependent variable (usually just one). The weights and alpha arguments are optional, but since I am working with aggregate data, I almost always use weights. Alpha controls the kind of shrinkage used (alpha = 1 is the lasso, alpha = 0 is ridge). All well and good, until it isn't. In my case, the predictor data.frame contains some factor variables. R stores these internally as integer codes, but displays them as strings. For instance:
DF = data.frame(a = 1:3, b = letters[10:12],
c = seq(as.Date("2004-01-01"), by = "week", len = 3),
stringsAsFactors = TRUE)
Which prints out like this:
> DF
a b c
1 1 j 2004-01-01
2 2 k 2004-01-08
3 3 l 2004-01-15
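The integer codes behind the factor column can be inspected directly. A small sketch on the same DF (the second factor, f, is a made-up example to show a common pitfall when the labels themselves look like numbers):

```r
DF = data.frame(a = 1:3, b = letters[10:12],
                c = seq(as.Date("2004-01-01"), by = "week", length.out = 3),
                stringsAsFactors = TRUE)

levels(DF$b)      # the labels R displays: "j" "k" "l"
as.integer(DF$b)  # the codes R stores: 1 2 3

# Pitfall: for number-like labels, the codes are not the labels
f = factor(c("10", "20", "10"))
as.integer(f)                # codes: 1 2 1
as.numeric(as.character(f))  # label values: 10 20 10
```

So converting a factor to numbers gives you positions in the level list, which is only meaningful if that is what the model should see.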
However, if we use my as.matrix() approach from above, we get:
> as.matrix(DF)
a b c
[1,] "1" "j" "2004-01-01"
[2,] "2" "k" "2004-01-08"
[3,] "3" "l" "2004-01-15"
Which is not what we wanted. It gave us a character matrix, which cv.glmnet() then rejects with a nonsensical error. That unhelpful message cost me some time tracking down the actual problem. Save yourself and others time: always write good error messages for functions that will be used more than a couple of times!
Is there some easy built-in way to solve the problem?
> as.numeric(DF)
Error: (list) object cannot be coerced to type 'double'
The easiest solution did not work.
> as.numeric(DF$b)
[1] 1 2 3
However, it does work for a single column. So maybe we can just try using it on all the columns:
> apply(DF, 2, as.numeric)
a b c
[1,] 1 NA NA
[2,] 2 NA NA
[3,] 3 NA NA
Warning messages:
1: In apply(DF, 2, as.numeric) : NAs introduced by coercion
2: In apply(DF, 2, as.numeric) : NAs introduced by coercion
What? What is going on?
> apply(as.matrix(DF), 2, as.numeric)
a b c
[1,] 1 NA NA
[2,] 2 NA NA
[3,] 3 NA NA
Warning messages:
1: In apply(as.matrix(DF), 2, as.numeric) : NAs introduced by coercion
2: In apply(as.matrix(DF), 2, as.numeric) : NAs introduced by coercion
It looks like apply() does a silent as.matrix(), which then causes the NAs. OK. How do we convert just the factor columns then? Maybe try some of the fancier built-in conversion calls:
> as.matrix.data.frame(DF)
a b c
[1,] "1" "j" "2004-01-01"
[2,] "2" "k" "2004-01-08"
[3,] "3" "l" "2004-01-15"
Nope.
> as.data.frame.matrix(DF)
a b c
1 1 j 12418
2 2 k 12425
3 3 l 12432
Closer: this time the date got converted to a number, but the factor got converted to character, not integers. We could do a loop:
> for (col_idx in seq_along(DF)) {
+ DF[col_idx] = as.numeric(DF[[col_idx]])
+ }
> DF = as.matrix(DF)
> str(DF)
num [1:3, 1:3] 1 2 3 1 2 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "a" "b" "c"
Which works, but now it is getting silly. Maybe some implicit loops (starting again from the original DF, since the loop above overwrote it):
> lapply(DF, as.numeric)
$a
[1] 1 2 3
$b
[1] 1 2 3
$c
[1] 12418 12425 12432
Closer, but it returns a list, not a matrix. Maybe just try converting:
> as.matrix(lapply(DF, as.numeric))
[,1]
a Numeric,3
b Numeric,3
c Numeric,3
But no no, life isn't that easy. What about as.data.frame?
> as.data.frame(lapply(DF, as.numeric))
a b c
1 1 1 12418
2 2 2 12425
3 3 3 12432
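Huh, that works, but as.matrix() didn't: called on a plain list, it builds a one-column list-matrix instead of binding the vectors together as columns. Oh well, just one final step: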
> DF = as.matrix(as.data.frame(lapply(DF, as.numeric)))
> str(DF)
num [1:3, 1:3] 1 2 3 1 2 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "a" "b" "c"
We got what we wanted!
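For comparison, base R also ships data.matrix(), which is documented to replace factors with their internal codes and convert other non-numeric columns with as.numeric(). A short sketch, assuming the same example DF as above, that checks it against the manual route:

```r
DF = data.frame(a = 1:3, b = letters[10:12],
                c = seq(as.Date("2004-01-01"), by = "week", length.out = 3),
                stringsAsFactors = TRUE)

# One call: factors become integer codes, dates become day counts
m = data.matrix(DF)
str(m)

# The manual route from this post, for comparison
manual = as.matrix(as.data.frame(lapply(DF, as.numeric)))

# Compare the values only, ignoring attributes such as dimnames
all.equal(m, manual, check.attributes = FALSE)
```

Like the manual route, this hands the model bare integer codes for factors, so it only makes sense when those codes are meaningful as numbers.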
Sometimes, R does not make your life easy.