R functions for analyzing missing data
I'm reading Missing Data: A Gentle Introduction and it mentions various methods to understand how data are missing in a given dataset. The book, however, is light on actual tools. So, since I have already implemented a few functions in my package for handling missing data, I decided to implement a few more. These have the prefix miss_ to easily find them. I have renamed the existing functions to also follow this pattern if necessary (keeping the old names working as well so as not to break existing code).
Amount and distribution of missing data
The authors distinguish between three ways of measuring the amount of missing data: 1) by case, 2) by variable, and 3) by cell. Or in plain English: which proportion of cases/vars/cells have at least one missing variable? miss_amount calculates these three proportions:
> miss_amount(iris_miss)
cases with missing data vars with missing data cells with missing data
0.42 1.00 0.10
The dataset used, iris_miss, is the famous iris dataset (N=150) where I have removed 10% of the data at random. Luckily, the results tell us that indeed 10% of the cells have missing values. 42% of the cases have at least one missing datapoint.
However, it is usually more interesting to know more than just whether or not there is missing data by case/var, but also how many. miss_by_case calculates the number of missing datapoints by case:
> miss_by_case(iris_miss)
[1] 0 1 0 1 0 0 0 0 0 2 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 2 0 0 0 3 0 0 0 1 1 0 0 1 1 1 0 0 1 0 2 1 1 0 1 0 0 0 0 0 3 2 0
[59] 1 0 0 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0 0 0 0 1 1 0 4 0 0
[117] 1 0 1 1 1 0 0 0 0 2 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 1 0 0 0 1
This is most useful when combined with table:
> miss_by_case(iris_miss) %>% table()
.
0 1 2 3 4
87 55 5 2 1
Thus, we see that most cases that miss data actually only miss 1 datapoint. One can count the missing datapoints by variable with miss_by_var:
> miss_by_var(iris_miss)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
16 14 16 12 17
The above information is most easily understood with the help of a plot, made with miss_plot:
> miss_plot(iris_miss)
> ggsave("miss_plot_case.png")
Saving 8.95 x 5.67 in image
> miss_plot(iris_miss, case = F)
> ggsave("miss_plot_var.png")
Saving 8.95 x 5.67 in image
The first has number of missing datapoints by case and the second has by variable.
Patterns of missing data
Statisticians talk about three ways data can be missing that is related to the statistical 'mechanism' that caused them to be missing. They have confusing names, so I will just call them type 1, 2 and 3 (MCAR, MAR and MNAR). Type 1 is data that is just missing at random. For instance, if we delete datapoints at random from a dataset as we did above, they are missing at random. Or, if we instead of giving participants all items, we gave them a random subset, they would also be missing at random.
Type 2 is when the data are missing and this is some (possibly probabilistic) function of the other data in the dataset. In this way, the data are missing but not at random. For instance, if younger children tend not to report their own weights because they don't know them. However, they report their age and the missingness of the weight is related to their age.
Type 3 is when the data are missing but that this is a function of the very same datapoints that are missing. For instance, if we ask people whether they have a criminal record and people with criminal records tend to not want to acknowledge this.
Type 1 missing data we can just ignore. There is no particular pattern in how it is missing. This reduces our sample size, but does not bias the estimates of the relationships in any particular direction. Type 2 missing data, we can perhaps deal with because we can impute it based on the other variable(s) in the dataset. Type 3 missing data, however, there is not much we can directly do about. If we have other datasets with the same variable (e.g. police records), then we can compare the distributions. This would allow us to see that in our sample, the proportion of people with a criminal record is too low, so we suspect that the persons with criminal records tend to not report them.
How can we figure out which type of missingness we have? First, we can create a matrix of the missing data. This is not so useful in itself, but it is useful for further analyses. miss_matrix creates this:
> iris_miss[1:10, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 <NA>
3 4.7 3.2 1.3 0.2 setosa
4 4.6 NA 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 NA 3.1 1.5 0.1 <NA>
> miss_matrix(iris_miss[1:10, ])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 0 0 0 0 0
2 0 0 0 0 1
3 0 0 0 0 0
4 0 1 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 1 0 0 0 1
To save save, we only look at the first 10 rows. We see that some data are missing (NA). In the matrix below, we see that these cells have 1's and the others have 0's. Simple.
If we inspect the first ten cases with missing data, we can see that they tend to miss data in different ways:
> iris_miss[miss_by_case(iris_miss) >= 1, ] %>% head(10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2 4.9 3.0 1.4 0.2 <NA>
4 4.6 NA 1.5 0.2 setosa
10 NA 3.1 1.5 0.1 <NA>
12 4.8 3.4 NA 0.2 setosa
14 4.3 3.0 1.1 0.1 <NA>
16 5.7 NA 1.5 0.4 setosa
19 5.7 NA 1.7 0.3 setosa
27 NA 3.4 1.6 NA setosa
31 NA 3.1 NA 0.2 <NA>
35 NA 3.1 1.5 0.2 setosa
We see that case 2 is missing 1 datapoint in the 5th variable, case 4 is missing 1 datapoint in the 2nd variable. Cases 27 and 31 are missing multiple datapoints but in different ways. How can we measure this? We can use the powers of 2 method. We create a vector of powers of 2 that is the same length as the number of variables, so in this case {2, 4, 8, 16, 32}. Then we multiply this vector row-wise with the missing data matrix, and this gives us:
> miss_matrix(iris_miss)[miss_by_case(iris_miss) >= 1, ] %>% head(10) %>% t() %>% multiply_by(c(2, 4, 8, 16, 32)) %>% t()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
2 0 0 0 0 32
4 0 4 0 0 0
10 2 0 0 0 32
12 0 0 8 0 0
14 0 0 0 0 32
16 0 4 0 0 0
19 0 4 0 0 0
27 2 0 0 16 0
31 2 0 8 0 32
35 2 0 0 0 0
Compare this with the missing matrix above. It is the same, except that the cells with 1's have been replaced with powers of 2 based on the variable they are in. Cells in the first are replaced with 2's (2^1 = 2), those in the second with 4's (2^2 = 4) and so on. This itself is not useful, but when we sum the numbers by row, we get a unique ID for each way the data can be missing. These values are calculated by miss_pattern:
> miss_pattern(iris_miss)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
0 32 0 4 0 0 0 0 0 34 0 8 0 32 0 4 0 0 4 0 0 0 0 0 0 0 18 0 0 0
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
42 0 0 0 2 4 0 0 32 32 8 0 0 32 0 18 2 2 0 2 0 0 0 0 0 52 20 0 8 0
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
0 2 0 2 8 8 32 2 0 8 32 32 0 2 16 0 0 0 2 0 0 0 0 0 0 16 4 0 0 4
91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
0 16 0 4 32 8 0 2 2 0 0 8 32 2 16 4 0 0 0 0 32 4 0 60 0 0 8 0 8 4
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
4 0 0 0 0 40 0 0 0 0 0 0 0 32 0 16 8 0 0 0 8 16 0 0 0 8 0 0 0 16
The results given this way are not so understandable, but we can calculate the table from them:
> miss_pattern(iris_miss) %>% table()
.
0 2 4 8 16 18 20 32 34 40 42 52 60
87 12 11 13 7 2 1 12 1 1 1 1 1
Now we see it more clearly. These are the patterns of missing data; the ways data can be missing. The powers of 2 are the simple cases of 1 datapoint missing in different variables which we saw earlier. The others are composites. For instance, 18 is 2+16, thus representing cases where data are missing in variable 1 (2) and variable 4 (16).
This provides us with a measure of the complexity of the missing data, namely, we count the number of patterns found in the data and the number of cases. The fraction of this represents the complexity. miss_complexity calculates this:
> miss_complexity(iris_miss)
[1] 0.04666667
Thus, in our case, the pattern of missing data is relatively simple.
Type of missing data?
Sometimes we don't know how the data is missing. It may be type 1, 2, 3, or some combination. How do we find out? We can distinguish between type 1 and 2 by analyzing how the missing data relates to other variables. If the data are missing at random, there should be no observable relationships between data missing in one variable and values of the other variables aside from chance. Examining an entire dataset for non-random missingness thus requires doing a lot of calculations because we must compare the missing data in each variable with each other variable's data, thus p*(p-1) comparisons. miss_analyze does this for us:
> miss_analyze(iris_miss)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Sepal.Length NA 0.06736091 0.41567496 0.46781290 0.40
Sepal.Width -0.02591574 NA -0.08697338 0.02125442 0.01
Petal.Length -0.37577044 0.09577354 NA -0.42815844 0.25
Petal.Width -0.55242759 -0.06717756 -0.40272417 NA 0.15
Species 0.27082506 -0.01113144 0.22816556 0.25833418 NA
Each cell is the effect size (Cohen's d) calculated as splitting the values of the other variable into 2 groups based on whether they miss data in the first variable. The grouping variable is on the left and the outcome variable is on the top. Thus, we see that there seems to be some patterns in how the data are missing for this dataset. For instance, cases that have missing data for Petal.Length have smaller Sepal.Lengths (d = -38). Since we know this dataset actually has data missing at random, these must represent chance findings. If we look at all the numbers together, we generally do see that the values are smallish.
Note that because one cannot calculate Cohen's d/correlations for nominal variables (like Species), I have used Cramer's V converted to Cohen's d instead. This has the implication that the value cannot be negative.
Our example dataset has N=150. On average, Cohen's d values will then be calculated from groups with sizes 10 and 140, so if we do 20 comparisons as we did above, we are likely to observe some largeish values, just by chance. The largest value above is .55, representing a medium sized effect. The mean absolute d is .23.
If we examined a larger dataset that also has data missing at random, we should see much smaller values:
> set.seed(1)
> rnorm(5e3) %>% matrix(nrow = 1000) %>% as.data.frame() %>% df_addNA() %>% miss_analyze()
V1 V2 V3 V4 V5
V1 NA -0.07951442 0.19249439 -0.01830883 -0.11540971
V2 -0.09698717 NA -0.11780557 -0.05403185 0.09778738
V3 0.11954180 0.03310052 NA -0.01917229 0.13087361
V4 -0.06356078 0.12931941 0.02131987 NA 0.26977611
V5 -0.16705158 -0.01179713 0.01402070 0.03015518 NA
Here, we generated a dataset with 1000 cases and 5 variables, then added 10% missing data at random as before. We do see that the values have become much smaller. The mean abs. d is .07.
R code
The functions can be found in my github package: https://github.com/Deleetdk/kirkegaard
The example code: https://gist.github.com/Deleetdk/fcfcc6ba6f1c39a234717dc3e0c83369