Handling Missing Data in R

2019-10-17

Evaluating the Nature of Missingness

Classifications of Missing Data

MCAR - Missing completely at random if the events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random. When data are MCAR, the analysis performed on the data is unbiased; however, data are rarely MCAR.
MAR - Missing at random occurs when the missingness is not random, but where missingness can be fully accounted for by variables where there is complete information.
NMAR - Not missing at random (also known as nonignorable nonresponse) is data that is neither MAR nor MCAR (i.e. the value of the variable that’s missing is related to the reason it’s missing).

Methods for Handling Missing Data Kabacoff, 2011, p. 355

Mammal Sleep Data

Allison, T. & Chichetti, D. (1976). Sleep in mammals: ecological and constitutional correlates. Science 194 (4266), 732–734.

Abstract The interrelationships between sleep, ecological, and constitutional variables were assessed statistically for 39 mammalian species. Slow-wave sleep is negatively associated with a factor related to body size, which suggests that large amounts of this sleep phase are disadvantageous in large species. Paradoxical sleep is associated with a factor related to predatory danger, which suggests that large amounts of this sleep phase are disadvantageous in prey species.

BodyWgt - Body weight in kilograms
BrainWgt - Brain weight in grams
NonD - Nondreaming sleep (hours / day)
Dream - Dreaming sleep (hours / day)
Sleep - Total sleep
Span - Life-span in years
Gest - Gestation time in days
Pred - Predation index: 1 = minimum (least likely to be preyed upon) to 5 = maximum (most likely to be preyed upon)
Exp - Sleep exposure index: 1 = least exposed (e.g. animal sleeps in a well-protected den) to 5 = most exposed
Danger - Overall danger index (based on the above two indices and other information): 1 = least danger (from other animals) to 5 = most danger (from other animals)

Mammal Sleep Data

data(sleep, package="VIM")
str(sleep)

## 'data.frame':    62 obs. of  10 variables:
##  $ BodyWgt : num  6654 1 3.38 0.92 2547 ...
##  $ BrainWgt: num  5712 6.6 44.5 5.7 4603 ...
##  $ NonD    : num  NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
##  $ Dream   : num  NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
##  $ Sleep   : num  3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
##  $ Span    : num  38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
##  $ Gest    : num  645 42 60 25 624 180 35 392 63 230 ...
##  $ Pred    : int  3 3 1 5 3 4 1 4 1 1 ...
##  $ Exp     : int  5 1 1 2 5 4 1 5 2 1 ...
##  $ Danger  : int  3 3 1 3 4 4 1 4 1 1 ...

Complete Cases

complete.cases(sleep)

##  [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
## [23]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [34]  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [45]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
## [56] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

head(sleep[complete.cases(sleep),])

##    BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
## 2    1.000      6.6  6.3   2.0   8.3  4.5   42    3   1      3
## 5 2547.000   4603.0  2.1   1.8   3.9 69.0  624    3   5      4
## 6   10.550    179.5  9.1   0.7   9.8 27.0  180    4   4      4
## 7    0.023      0.3 15.8   3.9  19.7 19.0   35    1   1      1
## 8  160.000    169.0  5.2   1.0   6.2 30.4  392    4   5      4
## 9    3.300     25.6 10.9   3.6  14.5 28.0   63    1   2      1

Incomplete Cases

head(sleep[!complete.cases(sleep),])

##     BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
## 1  6654.000   5712.0   NA    NA   3.3 38.6  645    3   5      3
## 3     3.385     44.5   NA    NA  12.5 14.0   60    1   1      1
## 4     0.920      5.7   NA    NA  16.5   NA   25    5   2      3
## 13    0.550      2.4  7.6   2.7  10.3   NA   NA    2   1      2
## 14  187.100    419.0   NA    NA   3.1 40.0  365    5   5      5
## 19    1.410     17.5  4.8   1.3   6.1 34.0   NA    1   2      1

How much is missing?

Number of missing values

sum(is.na(sleep$Dream))

## [1] 12

Percent missing

mean(is.na(sleep$Dream))

## [1] 0.1935484

Percent of rows with missing data

mean(!complete.cases(sleep))

## [1] 0.3225806

Pattern of Missingness

md.pattern(sleep, rotate.names = TRUE)

##    BodyWgt BrainWgt Pred Exp Danger Sleep Span Gest Dream NonD   
## 42       1        1    1   1      1     1    1    1     1    1  0
## 9        1        1    1   1      1     1    1    1     0    0  2
## 3        1        1    1   1      1     1    1    0     1    1  1
## 2        1        1    1   1      1     1    0    1     1    1  1
## 1        1        1    1   1      1     1    0    1     0    0  3
## 1        1        1    1   1      1     1    0    0     1    1  2
## 2        1        1    1   1      1     0    1    1     1    0  2
## 2        1        1    1   1      1     0    1    1     0    0  3
##          0        0    0   0      0     4    4    4    12   14 38

Visualizing Missingness

aggr(sleep, prop=FALSE, numbers=TRUE)

Visualizing Missingness

matrixplot(sleep)

## 
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.

Visualizing Missingness

marginplot(sleep[,c('Gest','Dream')], pch=c(20), col=c('darkgray','red','blue'))

Shadow Matrix

sm <- as.data.frame(abs(is.na(sleep)))
head(sleep)

##    BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
## 1 6654.000   5712.0   NA    NA   3.3 38.6  645    3   5      3
## 2    1.000      6.6  6.3   2.0   8.3  4.5   42    3   1      3
## 3    3.385     44.5   NA    NA  12.5 14.0   60    1   1      1
## 4    0.920      5.7   NA    NA  16.5   NA   25    5   2      3
## 5 2547.000   4603.0  2.1   1.8   3.9 69.0  624    3   5      4
## 6   10.550    179.5  9.1   0.7   9.8 27.0  180    4   4      4

head(sm)

##   BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
## 1       0        0    1     1     0    0    0    0   0      0
## 2       0        0    0     0     0    0    0    0   0      0
## 3       0        0    1     1     0    0    0    0   0      0
## 4       0        0    1     1     0    1    0    0   0      0
## 5       0        0    0     0     0    0    0    0   0      0
## 6       0        0    0     0     0    0    0    0   0      0

Correlation of Missingness

Examine the correlation of missingness between variables

# Extract varabibles that have any missingness
y <- sm[which(sapply(sm, sd) > 0)] 
cor(y)

##              NonD       Dream       Sleep        Span        Gest
## NonD   1.00000000  0.90711474  0.48626454  0.01519577 -0.14182716
## Dream  0.90711474  1.00000000  0.20370138  0.03752394 -0.12865350
## Sleep  0.48626454  0.20370138  1.00000000 -0.06896552 -0.06896552
## Span   0.01519577  0.03752394 -0.06896552  1.00000000  0.19827586
## Gest  -0.14182716 -0.12865350 -0.06896552  0.19827586  1.00000000

Relationship between missingness and observed variables

cor(sleep, y, use='pairwise.complete.obs')

##                 NonD       Dream        Sleep        Span        Gest
## BodyWgt   0.22682614  0.22259108  0.001684992 -0.05831706 -0.05396818
## BrainWgt  0.17945923  0.16321105  0.007859438 -0.07921370 -0.07332961
## NonD              NA          NA           NA -0.04314514 -0.04553485
## Dream    -0.18895206          NA -0.188952059  0.11699247  0.22774685
## Sleep    -0.08023157 -0.08023157           NA  0.09638044  0.03976464
## Span      0.08336361  0.05981377  0.005238852          NA -0.06527277
## Gest      0.20239201  0.05140232  0.159701523 -0.17495305          NA
## Pred      0.04758438 -0.06834378  0.202462711  0.02313860 -0.20101655
## Exp       0.24546836  0.12740768  0.260772984 -0.19291879 -0.19291879
## Danger    0.06528387 -0.06724755  0.208883617 -0.06666498 -0.20443928

Rows are observed variables, columns missing indicators. Nondreaming (NonD) sleep scores are more likely to be missing with larger body weights (BodyWgt) with r=0.227. Since the correlations are not very large suggests that the nature of the missingness deviates minimally from the MCAR and MAR assumptions.

Understanding missingness

Kabacoff (2011, p. 362) suggests the following questions to address:

What percentage of the data is missing?
Is it concentrated in a few variables, or widely distributed?
Does it appear to be random?
Does the covariation of missing data with each other or with the observed data suggest a possible mechanism that’s producing the missing values.

Options for analyzing data with missing values

Complete case analysis (listwise deletiong) - Use the na.omit function to remove any rows with missing values.
Pairwise deletion
Multiple imputation
Simple imputation - replace values with a single value (e.g. mean, median, mode)

Imputing Missing Values

Multiple Imputation

First introduced by Rubin (1987) as way of handling missing data. Steps for multiple imputation:

Impute the missing values by using an appropriate model which incorporates random variation.
Repeat the first step multiple times (typicall 5 to 10 times is sufficient).
Perform the desired analysis on each imputed dataset.
Pool the results from step 3.

The most common approach is multiple imputation by chained equations (MICE). This has been shown to work very well when data is missing at random, but also in when data is missing not at random (Sulis, Porcu, & Mariano, 2017).

See volume 45 of the Journal of Statistical Software which is a special volume on multiple imputation: http://www.jstatsoft.org/v45/.

Steps for Multiple Imputation Kabacoff, 2011, p. 366

Imputing

Using the mice package to impute missing values.

imp <- mice(sleep, printFlag=FALSE, seed=1234)
imp

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##  BodyWgt BrainWgt     NonD    Dream    Sleep     Span     Gest     Pred 
##       ""       ""    "pmm"    "pmm"    "pmm"    "pmm"    "pmm"       "" 
##      Exp   Danger 
##       ""       "" 
## PredictorMatrix:
##          BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
## BodyWgt        0        1    1     1     1    1    1    1   1      1
## BrainWgt       1        0    1     1     1    1    1    1   1      1
## NonD           1        1    0     1     1    1    1    1   1      1
## Dream          1        1    1     0     1    1    1    1   1      1
## Sleep          1        1    1     1     0    1    1    1   1      1
## Span           1        1    1     1     1    0    1    1   1      1
## Number of logged events:  7 
##   it im  dep meth   out
## 1  1  1 Span  pmm Sleep
## 2  1  1 Gest  pmm Sleep
## 3  2  4 Span  pmm Sleep
## 4  2  4 Gest  pmm Sleep
## 5  2  5 Span  pmm Sleep
## 6  2  5 Gest  pmm Sleep

Checking Distributions of Imputated Values

If we assume the data is MCAR, then the imputations should have the same distribution as the observed data.

stripplot(imp, Dream~.imp, pch=20, cex=2)

Checking Distributions of Imputated Values

stripplot(imp)

Getting Complete Dataset

dataset5 <- complete(imp, 5)
head(dataset5)

##    BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
## 1 6654.000   5712.0  3.3   0.0   3.3 38.6  645    3   5      3
## 2    1.000      6.6  6.3   2.0   8.3  4.5   42    3   1      3
## 3    3.385     44.5 11.0   1.4  12.5 14.0   60    1   1      1
## 4    0.920      5.7 15.2   1.2  16.5  3.9   25    5   2      3
## 5 2547.000   4603.0  2.1   1.8   3.9 69.0  624    3   5      4
## 6   10.550    179.5  9.1   0.7   9.8 27.0  180    4   4      4

Pooling the imputed datasets

fit <- with(imp, lm(Dream ~ Span + Gest))
pooled <- pool(fit) 
summary(pooled)

##                 estimate   std.error  statistic       df      p.value
## (Intercept)  2.531560872 0.260063277  9.7344035 44.41076 1.379785e-12
## Span        -0.004478330 0.011716849 -0.3822128 55.79406 7.037555e-01
## Gest        -0.004022046 0.001481345 -2.7151303 53.29365 8.909314e-03

Not Missing at Random

How much missingness across groups matters?

data(lalonde, package='Matching')
Tr <- lalonde$treat
X <- lalonde[,c('age','educ','black','hisp','married','nodegr','re74','re75')]
lalonde.glm <- glm(treat ~ ., family=binomial, data=cbind(treat=Tr, X))
summary(lalonde.glm)

## 
## Call:
## glm(formula = treat ~ ., family = binomial, data = cbind(treat = Tr, 
##     X))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4358  -0.9904  -0.9071   1.2825   1.6946  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  1.178e+00  1.056e+00   1.115  0.26474   
## age          4.698e-03  1.433e-02   0.328  0.74297   
## educ        -7.124e-02  7.173e-02  -0.993  0.32061   
## black       -2.247e-01  3.655e-01  -0.615  0.53874   
## hisp        -8.528e-01  5.066e-01  -1.683  0.09228 . 
## married      1.636e-01  2.769e-01   0.591  0.55463   
## nodegr      -9.035e-01  3.135e-01  -2.882  0.00395 **
## re74        -3.161e-05  2.584e-05  -1.223  0.22122   
## re75         6.161e-05  4.358e-05   1.414  0.15744   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 604.20  on 444  degrees of freedom
## Residual deviance: 587.22  on 436  degrees of freedom
## AIC: 605.22
## 
## Number of Fisher Scoring iterations: 4

Data Preparation

Create a copy of the covariates to simulate missing at random (mar) and not missing at random (nmar).

lalonde.mar <- X
lalonde.nmar <- X
missing.rate <- .2                  # What percent of rows will have missing data
missing.cols <- c('nodegr', 're75') # The columns we will add missing values to
missing.ratio <- 1.5                # Ratio of missingness for treatment-to-control

Vectors indiciating which rows are treatment and control.

treat.rows <- which(lalonde$treat == 1)
control.rows <- which(lalonde$treat == 0)

Add missingness to the existing data. For the not missing at random data treatment units will have twice as many missing values as the control group.

for(i in missing.cols) {
    lalonde.mar[sample(nrow(lalonde), nrow(lalonde) * missing.rate), i] <- NA
    lalonde.nmar[sample(treat.rows, length(treat.rows) * missing.rate * missing.ratio), i] <- NA
    lalonde.nmar[sample(control.rows, length(control.rows) * missing.rate), i] <- NA
}

Proportion of missing values

For our data missing at random.

prop.table(table(is.na(lalonde.mar[,missing.cols[1]]), Tr, useNA='ifany'), 2)

##        Tr
##                 0         1
##   FALSE 0.7923077 0.8108108
##   TRUE  0.2076923 0.1891892

For our data not missing at random.

prop.table(table(is.na(lalonde.nmar[,missing.cols[1]]), Tr, useNA='ifany'), 2)

##        Tr
##                 0         1
##   FALSE 0.8000000 0.7027027
##   TRUE  0.2000000 0.2972973

Multiple Imputation

Create a shadow matrix. This is a logical vector where each cell is TRUE if the value is missing in the original data frame.

shadow.matrix.mar <- as.data.frame(is.na(lalonde.mar[,missing.cols,drop=FALSE]))
shadow.matrix.nmar <- as.data.frame(is.na(lalonde.nmar[,missing.cols,drop=FALSE]))

Change the column names to include "_miss" in their name.

names(shadow.matrix.mar) <- names(shadow.matrix.nmar) <- paste0(names(shadow.matrix.mar), '_miss')

Impute the missing values using the mice package

mice.mar <- mice(lalonde.mar, m=1, printFlag=FALSE, seed = 2112)
mice.nmar <- mice(lalonde.nmar, m=1, printFlag=FALSE, seed = 2112)

Get the imputed data set.

complete.mar <- complete(mice.mar)
complete.nmar <- complete(mice.nmar)

Check the regerssion results (MAR)

lalonde.mar.glm <- glm(treat~., data=cbind(treat=Tr, complete.mar, shadow.matrix.mar))
summary(lalonde.mar.glm)

## 
## Call:
## glm(formula = treat ~ ., data = cbind(treat = Tr, complete.mar, 
##     shadow.matrix.mar))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6429  -0.3945  -0.3367   0.5665   0.7940  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      7.046e-01  2.466e-01   2.857  0.00448 **
## age              1.248e-03  3.420e-03   0.365  0.71546   
## educ            -1.263e-02  1.704e-02  -0.741  0.45911   
## black           -3.303e-02  9.013e-02  -0.366  0.71418   
## hisp            -1.825e-01  1.170e-01  -1.561  0.11936   
## married          3.578e-02  6.776e-02   0.528  0.59775   
## nodegr          -1.883e-01  7.491e-02  -2.513  0.01232 * 
## re74            -6.253e-06  5.876e-06  -1.064  0.28788   
## re75             1.093e-05  1.001e-05   1.091  0.27577   
## nodegr_missTRUE -3.520e-02  5.867e-02  -0.600  0.54884   
## re75_missTRUE   -2.190e-02  5.844e-02  -0.375  0.70805   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2409686)
## 
##     Null deviance: 108.09  on 444  degrees of freedom
## Residual deviance: 104.58  on 434  degrees of freedom
## AIC: 642.44
## 
## Number of Fisher Scoring iterations: 2

Check the regerssion results (NMAR)

lalonde.nmar.glm <- glm(treat~., data=cbind(treat=Tr, complete.nmar, shadow.matrix.nmar))
summary(lalonde.nmar.glm)

## 
## Call:
## glm(formula = treat ~ ., data = cbind(treat = Tr, complete.nmar, 
##     shadow.matrix.nmar))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7777  -0.4193  -0.3031   0.5451   0.8019  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      6.367e-01  2.558e-01   2.489   0.0132 *
## age              1.124e-03  3.467e-03   0.324   0.7459  
## educ            -1.082e-02  1.706e-02  -0.634   0.5262  
## black           -7.999e-02  8.761e-02  -0.913   0.3618  
## hisp            -1.834e-01  1.157e-01  -1.585   0.1136  
## married          6.740e-02  6.619e-02   1.018   0.3091  
## nodegr          -1.565e-01  7.449e-02  -2.100   0.0363 *
## re74            -1.135e-06  5.872e-06  -0.193   0.8468  
## re75            -1.688e-06  1.016e-05  -0.166   0.8682  
## nodegr_missTRUE  1.146e-01  5.469e-02   2.095   0.0367 *
## re75_missTRUE    1.250e-01  5.430e-02   2.302   0.0218 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2366897)
## 
##     Null deviance: 108.09  on 444  degrees of freedom
## Residual deviance: 102.72  on 434  degrees of freedom
## AIC: 634.47
## 
## Number of Fisher Scoring iterations: 2

Discussion

Additional Resources

Wikipedia article
VIM Vignette
mice package website: https://stefvanbuuren.name/mice/
mitools package: http://cran.cnr.berkeley.edu/web/packages/mitools/index.html
Rubin, D. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
Gelman and Hill’s chapter on missing data: http://www.stat.columbia.edu/~gelman/arm/missing.pdf
Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: what is it and how does it work?. International journal of methods in psychiatric research, 20(1), 40–49. doi:10.1002/mpr.329