2019-10-17
MCAR - Missing completely at random if the events that lead to any particular data-item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random. When data are MCAR, the analysis performed on the data is unbiased; however, data are rarely MCAR.
MAR - Missing at random occurs when the missingness is not random, but where missingness can be fully accounted for by variables where there is complete information.
NMAR - Not missing at random (also known as nonignorable nonresponse) is data that is neither MAR nor MCAR (i.e. the value of the variable that’s missing is related to the reason it’s missing).
Allison, T. & Chichetti, D. (1976). Sleep in mammals: ecological and constitutional correlates. Science 194 (4266), 732–734.
Abstract The interrelationships between sleep, ecological, and constitutional variables were assessed statistically for 39 mammalian species. Slow-wave sleep is negatively associated with a factor related to body size, which suggests that large amounts of this sleep phase are disadvantageous in large species. Paradoxical sleep is associated with a factor related to predatory danger, which suggests that large amounts of this sleep phase are disadvantageous in prey species.
BodyWgt
- Body weight in kilogramsBrainWgt
- Brain weight in gramsNonD
- Nondreaming sleep (hours / day)Dream
- Dreaming sleep (hours / day)Sleep
- Total sleepSpan
- Life-span in yearsGest
- Gestation time in daysPred
- Predation index: 1 = minimum (least likely to be preyed upon) to 5 = maximum (most likely to be preyed upon)Exp
- Sleep exposure index: 1 = least exposed (e.g. animal sleeps in a well-protected den) to 5 = most exposedDanger
- Overall danger index (based on the above two indices and other information): 1 = least danger (from other animals) to 5 = most danger (from other animals)data(sleep, package="VIM") str(sleep)
## 'data.frame': 62 obs. of 10 variables: ## $ BodyWgt : num 6654 1 3.38 0.92 2547 ... ## $ BrainWgt: num 5712 6.6 44.5 5.7 4603 ... ## $ NonD : num NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ... ## $ Dream : num NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ... ## $ Sleep : num 3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ... ## $ Span : num 38.6 4.5 14 NA 69 27 19 30.4 28 50 ... ## $ Gest : num 645 42 60 25 624 180 35 392 63 230 ... ## $ Pred : int 3 3 1 5 3 4 1 4 1 1 ... ## $ Exp : int 5 1 1 2 5 4 1 5 2 1 ... ## $ Danger : int 3 3 1 3 4 4 1 4 1 1 ...
complete.cases(sleep)
## [1] FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [12] TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE ## [23] TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE ## [34] TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE ## [45] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE ## [56] FALSE TRUE TRUE TRUE TRUE TRUE FALSE
head(sleep[complete.cases(sleep),])
## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger ## 2 1.000 6.6 6.3 2.0 8.3 4.5 42 3 1 3 ## 5 2547.000 4603.0 2.1 1.8 3.9 69.0 624 3 5 4 ## 6 10.550 179.5 9.1 0.7 9.8 27.0 180 4 4 4 ## 7 0.023 0.3 15.8 3.9 19.7 19.0 35 1 1 1 ## 8 160.000 169.0 5.2 1.0 6.2 30.4 392 4 5 4 ## 9 3.300 25.6 10.9 3.6 14.5 28.0 63 1 2 1
head(sleep[!complete.cases(sleep),])
## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger ## 1 6654.000 5712.0 NA NA 3.3 38.6 645 3 5 3 ## 3 3.385 44.5 NA NA 12.5 14.0 60 1 1 1 ## 4 0.920 5.7 NA NA 16.5 NA 25 5 2 3 ## 13 0.550 2.4 7.6 2.7 10.3 NA NA 2 1 2 ## 14 187.100 419.0 NA NA 3.1 40.0 365 5 5 5 ## 19 1.410 17.5 4.8 1.3 6.1 34.0 NA 1 2 1
Number of missing values
sum(is.na(sleep$Dream))
## [1] 12
Percent missing
mean(is.na(sleep$Dream))
## [1] 0.1935484
Percent of rows with missing data
mean(!complete.cases(sleep))
## [1] 0.3225806
md.pattern(sleep, rotate.names = TRUE)
## BodyWgt BrainWgt Pred Exp Danger Sleep Span Gest Dream NonD ## 42 1 1 1 1 1 1 1 1 1 1 0 ## 9 1 1 1 1 1 1 1 1 0 0 2 ## 3 1 1 1 1 1 1 1 0 1 1 1 ## 2 1 1 1 1 1 1 0 1 1 1 1 ## 1 1 1 1 1 1 1 0 1 0 0 3 ## 1 1 1 1 1 1 1 0 0 1 1 2 ## 2 1 1 1 1 1 0 1 1 1 0 2 ## 2 1 1 1 1 1 0 1 1 0 0 3 ## 0 0 0 0 0 4 4 4 12 14 38
aggr(sleep, prop=FALSE, numbers=TRUE)
matrixplot(sleep)
## ## Click in a column to sort by the corresponding variable. ## To regain use of the VIM GUI and the R console, click outside the plot region.
marginplot(sleep[,c('Gest','Dream')], pch=c(20), col=c('darkgray','red','blue'))
sm <- as.data.frame(abs(is.na(sleep))) head(sleep)
## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger ## 1 6654.000 5712.0 NA NA 3.3 38.6 645 3 5 3 ## 2 1.000 6.6 6.3 2.0 8.3 4.5 42 3 1 3 ## 3 3.385 44.5 NA NA 12.5 14.0 60 1 1 1 ## 4 0.920 5.7 NA NA 16.5 NA 25 5 2 3 ## 5 2547.000 4603.0 2.1 1.8 3.9 69.0 624 3 5 4 ## 6 10.550 179.5 9.1 0.7 9.8 27.0 180 4 4 4
head(sm)
## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger ## 1 0 0 1 1 0 0 0 0 0 0 ## 2 0 0 0 0 0 0 0 0 0 0 ## 3 0 0 1 1 0 0 0 0 0 0 ## 4 0 0 1 1 0 1 0 0 0 0 ## 5 0 0 0 0 0 0 0 0 0 0 ## 6 0 0 0 0 0 0 0 0 0 0
Examine the correlation of missingness between variables
# Extract varabibles that have any missingness y <- sm[which(sapply(sm, sd) > 0)] cor(y)
## NonD Dream Sleep Span Gest ## NonD 1.00000000 0.90711474 0.48626454 0.01519577 -0.14182716 ## Dream 0.90711474 1.00000000 0.20370138 0.03752394 -0.12865350 ## Sleep 0.48626454 0.20370138 1.00000000 -0.06896552 -0.06896552 ## Span 0.01519577 0.03752394 -0.06896552 1.00000000 0.19827586 ## Gest -0.14182716 -0.12865350 -0.06896552 0.19827586 1.00000000
cor(sleep, y, use='pairwise.complete.obs')
## NonD Dream Sleep Span Gest ## BodyWgt 0.22682614 0.22259108 0.001684992 -0.05831706 -0.05396818 ## BrainWgt 0.17945923 0.16321105 0.007859438 -0.07921370 -0.07332961 ## NonD NA NA NA -0.04314514 -0.04553485 ## Dream -0.18895206 NA -0.188952059 0.11699247 0.22774685 ## Sleep -0.08023157 -0.08023157 NA 0.09638044 0.03976464 ## Span 0.08336361 0.05981377 0.005238852 NA -0.06527277 ## Gest 0.20239201 0.05140232 0.159701523 -0.17495305 NA ## Pred 0.04758438 -0.06834378 0.202462711 0.02313860 -0.20101655 ## Exp 0.24546836 0.12740768 0.260772984 -0.19291879 -0.19291879 ## Danger 0.06528387 -0.06724755 0.208883617 -0.06666498 -0.20443928
Rows are observed variables, columns missing indicators. Nondreaming (NonD
) sleep scores are more likely to be missing with larger body weights (BodyWgt
) with r=0.227. Since the correlations are not very large suggests that the nature of the missingness deviates minimally from the MCAR and MAR assumptions.
Kabacoff (2011, p. 362) suggests the following questions to address:
na.omit
function to remove any rows with missing values.First introduced by Rubin (1987) as way of handling missing data. Steps for multiple imputation:
Impute the missing values by using an appropriate model which incorporates random variation.
Repeat the first step multiple times (typicall 5 to 10 times is sufficient).
Perform the desired analysis on each imputed dataset.
Pool the results from step 3.
The most common approach is multiple imputation by chained equations (MICE). This has been shown to work very well when data is missing at random, but also in when data is missing not at random (Sulis, Porcu, & Mariano, 2017).
See volume 45 of the Journal of Statistical Software which is a special volume on multiple imputation: http://www.jstatsoft.org/v45/.
Using the mice
package to impute missing values.
imp <- mice(sleep, printFlag=FALSE, seed=1234) imp
## Class: mids ## Number of multiple imputations: 5 ## Imputation methods: ## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred ## "" "" "pmm" "pmm" "pmm" "pmm" "pmm" "" ## Exp Danger ## "" "" ## PredictorMatrix: ## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger ## BodyWgt 0 1 1 1 1 1 1 1 1 1 ## BrainWgt 1 0 1 1 1 1 1 1 1 1 ## NonD 1 1 0 1 1 1 1 1 1 1 ## Dream 1 1 1 0 1 1 1 1 1 1 ## Sleep 1 1 1 1 0 1 1 1 1 1 ## Span 1 1 1 1 1 0 1 1 1 1 ## Number of logged events: 7 ## it im dep meth out ## 1 1 1 Span pmm Sleep ## 2 1 1 Gest pmm Sleep ## 3 2 4 Span pmm Sleep ## 4 2 4 Gest pmm Sleep ## 5 2 5 Span pmm Sleep ## 6 2 5 Gest pmm Sleep
If we assume the data is MCAR, then the imputations should have the same distribution as the observed data.
stripplot(imp, Dream~.imp, pch=20, cex=2)
stripplot(imp)
dataset5 <- complete(imp, 5) head(dataset5)
## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger ## 1 6654.000 5712.0 3.3 0.0 3.3 38.6 645 3 5 3 ## 2 1.000 6.6 6.3 2.0 8.3 4.5 42 3 1 3 ## 3 3.385 44.5 11.0 1.4 12.5 14.0 60 1 1 1 ## 4 0.920 5.7 15.2 1.2 16.5 3.9 25 5 2 3 ## 5 2547.000 4603.0 2.1 1.8 3.9 69.0 624 3 5 4 ## 6 10.550 179.5 9.1 0.7 9.8 27.0 180 4 4 4
fit <- with(imp, lm(Dream ~ Span + Gest)) pooled <- pool(fit) summary(pooled)
## estimate std.error statistic df p.value ## (Intercept) 2.531560872 0.260063277 9.7344035 44.41076 1.379785e-12 ## Span -0.004478330 0.011716849 -0.3822128 55.79406 7.037555e-01 ## Gest -0.004022046 0.001481345 -2.7151303 53.29365 8.909314e-03
data(lalonde, package='Matching') Tr <- lalonde$treat X <- lalonde[,c('age','educ','black','hisp','married','nodegr','re74','re75')] lalonde.glm <- glm(treat ~ ., family=binomial, data=cbind(treat=Tr, X)) summary(lalonde.glm)
## ## Call: ## glm(formula = treat ~ ., family = binomial, data = cbind(treat = Tr, ## X)) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.4358 -0.9904 -0.9071 1.2825 1.6946 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.178e+00 1.056e+00 1.115 0.26474 ## age 4.698e-03 1.433e-02 0.328 0.74297 ## educ -7.124e-02 7.173e-02 -0.993 0.32061 ## black -2.247e-01 3.655e-01 -0.615 0.53874 ## hisp -8.528e-01 5.066e-01 -1.683 0.09228 . ## married 1.636e-01 2.769e-01 0.591 0.55463 ## nodegr -9.035e-01 3.135e-01 -2.882 0.00395 ** ## re74 -3.161e-05 2.584e-05 -1.223 0.22122 ## re75 6.161e-05 4.358e-05 1.414 0.15744 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 604.20 on 444 degrees of freedom ## Residual deviance: 587.22 on 436 degrees of freedom ## AIC: 605.22 ## ## Number of Fisher Scoring iterations: 4
Create a copy of the covariates to simulate missing at random (mar) and not missing at random (nmar).
lalonde.mar <- X lalonde.nmar <- X missing.rate <- .2 # What percent of rows will have missing data missing.cols <- c('nodegr', 're75') # The columns we will add missing values to missing.ratio <- 1.5 # Ratio of missingness for treatment-to-control
Vectors indiciating which rows are treatment and control.
treat.rows <- which(lalonde$treat == 1) control.rows <- which(lalonde$treat == 0)
Add missingness to the existing data. For the not missing at random data treatment units will have twice as many missing values as the control group.
for(i in missing.cols) { lalonde.mar[sample(nrow(lalonde), nrow(lalonde) * missing.rate), i] <- NA lalonde.nmar[sample(treat.rows, length(treat.rows) * missing.rate * missing.ratio), i] <- NA lalonde.nmar[sample(control.rows, length(control.rows) * missing.rate), i] <- NA }
For our data missing at random.
prop.table(table(is.na(lalonde.mar[,missing.cols[1]]), Tr, useNA='ifany'), 2)
## Tr ## 0 1 ## FALSE 0.7923077 0.8108108 ## TRUE 0.2076923 0.1891892
For our data not missing at random.
prop.table(table(is.na(lalonde.nmar[,missing.cols[1]]), Tr, useNA='ifany'), 2)
## Tr ## 0 1 ## FALSE 0.8000000 0.7027027 ## TRUE 0.2000000 0.2972973
Create a shadow matrix. This is a logical vector where each cell is TRUE if the value is missing in the original data frame.
shadow.matrix.mar <- as.data.frame(is.na(lalonde.mar[,missing.cols,drop=FALSE])) shadow.matrix.nmar <- as.data.frame(is.na(lalonde.nmar[,missing.cols,drop=FALSE]))
Change the column names to include "_miss" in their name.
names(shadow.matrix.mar) <- names(shadow.matrix.nmar) <- paste0(names(shadow.matrix.mar), '_miss')
Impute the missing values using the mice package
mice.mar <- mice(lalonde.mar, m=1, printFlag=FALSE, seed = 2112) mice.nmar <- mice(lalonde.nmar, m=1, printFlag=FALSE, seed = 2112)
Get the imputed data set.
complete.mar <- complete(mice.mar) complete.nmar <- complete(mice.nmar)
lalonde.mar.glm <- glm(treat~., data=cbind(treat=Tr, complete.mar, shadow.matrix.mar)) summary(lalonde.mar.glm)
## ## Call: ## glm(formula = treat ~ ., data = cbind(treat = Tr, complete.mar, ## shadow.matrix.mar)) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -0.6429 -0.3945 -0.3367 0.5665 0.7940 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 7.046e-01 2.466e-01 2.857 0.00448 ** ## age 1.248e-03 3.420e-03 0.365 0.71546 ## educ -1.263e-02 1.704e-02 -0.741 0.45911 ## black -3.303e-02 9.013e-02 -0.366 0.71418 ## hisp -1.825e-01 1.170e-01 -1.561 0.11936 ## married 3.578e-02 6.776e-02 0.528 0.59775 ## nodegr -1.883e-01 7.491e-02 -2.513 0.01232 * ## re74 -6.253e-06 5.876e-06 -1.064 0.28788 ## re75 1.093e-05 1.001e-05 1.091 0.27577 ## nodegr_missTRUE -3.520e-02 5.867e-02 -0.600 0.54884 ## re75_missTRUE -2.190e-02 5.844e-02 -0.375 0.70805 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 0.2409686) ## ## Null deviance: 108.09 on 444 degrees of freedom ## Residual deviance: 104.58 on 434 degrees of freedom ## AIC: 642.44 ## ## Number of Fisher Scoring iterations: 2
lalonde.nmar.glm <- glm(treat~., data=cbind(treat=Tr, complete.nmar, shadow.matrix.nmar)) summary(lalonde.nmar.glm)
## ## Call: ## glm(formula = treat ~ ., data = cbind(treat = Tr, complete.nmar, ## shadow.matrix.nmar)) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -0.7777 -0.4193 -0.3031 0.5451 0.8019 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.367e-01 2.558e-01 2.489 0.0132 * ## age 1.124e-03 3.467e-03 0.324 0.7459 ## educ -1.082e-02 1.706e-02 -0.634 0.5262 ## black -7.999e-02 8.761e-02 -0.913 0.3618 ## hisp -1.834e-01 1.157e-01 -1.585 0.1136 ## married 6.740e-02 6.619e-02 1.018 0.3091 ## nodegr -1.565e-01 7.449e-02 -2.100 0.0363 * ## re74 -1.135e-06 5.872e-06 -0.193 0.8468 ## re75 -1.688e-06 1.016e-05 -0.166 0.8682 ## nodegr_missTRUE 1.146e-01 5.469e-02 2.095 0.0367 * ## re75_missTRUE 1.250e-01 5.430e-02 2.302 0.0218 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 0.2366897) ## ## Null deviance: 108.09 on 444 degrees of freedom ## Residual deviance: 102.72 on 434 degrees of freedom ## AIC: 634.47 ## ## Number of Fisher Scoring iterations: 2
mice
package website: https://stefvanbuuren.name/mice/mitools
package: http://cran.cnr.berkeley.edu/web/packages/mitools/index.html