Multiple Imputations with Missing Data

This example demonstrates how to estimate causal effects with multiple imputations by using cmest when the dataset contains missing values. For this purpose, we simulate some data containing a continuous baseline confounder \(C_1\), a binary baseline confounder \(C_2\), a binary exposure \(A\), a binary mediator \(M\) and a binary outcome \(Y\). The true regression models for \(A\), \(M\) and \(Y\) are: \[logit(E(A|C_1,C_2))=0.2+0.5C_1+0.1C_2\] \[logit(E(M|A,C_1,C_2))=1+2A+1.5C_1+0.8C_2\] \[logit(E(Y|A,M,C_1,C_2)))=-3-0.4A-1.2M+0.5AM+0.3C_1-0.6C_2\]

To create the dataset with missing values, we randomly delete 10% of values in the dataset.

set.seed(1)
expit <- function(x) exp(x)/(1+exp(x))
n <- 10000
C1 <- rnorm(n, mean = 1, sd = 0.1)
C2 <- rbinom(n, 1, 0.6)
A <- rbinom(n, 1, expit(0.2 + 0.5*C1 + 0.1*C2))
M <- rbinom(n, 1, expit(1 + 2*A + 1.5*C1 + 0.8*C2))
Y <- rbinom(n, 1, expit(-3 - 0.4*A - 1.2*M + 0.5*A*M + 0.3*C1 - 0.6*C2))
data_noNA <- data.frame(A, M, Y, C1, C2)
missing <- sample(1:(5*n), n*0.1, replace = FALSE)
C1[missing[which(missing <= n)]] <- NA
C2[missing[which((missing > n)*(missing <= 2*n) == 1)] - n] <- NA
A[missing[which((missing > 2*n)*(missing <= 3*n) == 1)] - 2*n] <- NA
M[missing[which((missing > 3*n)*(missing <= 4*n) == 1)] - 3*n] <- NA
Y[missing[which((missing > 4*n)*(missing <= 5*n) == 1)] - 4*n] <- NA
data <- data.frame(A, M, Y, C1, C2)

The DAG for this scientific setting is:

library(CMAverse)

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: mstate

## Loading required package: survival

## Loading required package: tidyr

cmdag(outcome = "Y", exposure = "A", mediator = "M",
      basec = c("C1", "C2"), postc = NULL, node = TRUE, text_col = "white")

To conduct multiple imputations, we set the multimp argument to be TRUE. The regression-based approach is used for illustration. The multiple imputation results:

res_multimp <- cmest(data = data, model = "rb", outcome = "Y", exposure = "A",
                     mediator = "M", basec = c("C1", "C2"), EMint = TRUE,
                     mreg = list("logistic"), yreg = "logistic",
                     astar = 0, a = 1, mval = list(1), 
                     estimation = "paramfunc", inference = "delta", 
                     multimp = TRUE, m = 10)

summary(res_multimp)

## Causal Mediation Analysis
## 
## # Regressions with imputed dataset 1 
## 
## ## Outcome regression: 
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output[[1L]]$yreg)$data, weights = getCall(x$reg.output[[1L]]$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5694  -0.2416  -0.1744  -0.1601   3.0488  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.6067     0.7610  -4.740 2.14e-06 ***
## A             0.6880     0.5286   1.301  0.19311    
## M            -1.0444     0.3732  -2.798  0.00514 ** 
## C1            0.9828     0.6875   1.429  0.15288    
## C2           -0.8432     0.1443  -5.843 5.12e-09 ***
## A:M          -0.4189     0.5542  -0.756  0.44969    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2030.4  on 9999  degrees of freedom
## Residual deviance: 1972.2  on 9994  degrees of freedom
## AIC: 1984.2
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## ## Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output[[1L]]$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output[[1L]]$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2915   0.1100   0.1627   0.2428   0.5763  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.2134     0.6391   0.334  0.73848    
## A             1.7157     0.1461  11.747  < 2e-16 ***
## C1            2.2136     0.6476   3.418  0.00063 ***
## C2            0.9837     0.1369   7.185 6.72e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2264.5  on 9999  degrees of freedom
## Residual deviance: 2034.0  on 9996  degrees of freedom
## AIC: 2042
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Regressions with imputed dataset 2 
## 
## ## Outcome regression: 
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output[[2L]]$yreg)$data, weights = getCall(x$reg.output[[2L]]$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5675  -0.2408  -0.1734  -0.1610   3.0621  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.4557     0.7724  -4.474 7.67e-06 ***
## A             0.7235     0.5289   1.368  0.17129    
## M            -1.1024     0.3747  -2.942  0.00326 ** 
## C1            0.8216     0.7014   1.171  0.24143    
## C2           -0.8368     0.1454  -5.753 8.75e-09 ***
## A:M          -0.3930     0.5556  -0.707  0.47935    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1999.6  on 9999  degrees of freedom
## Residual deviance: 1941.7  on 9994  degrees of freedom
## AIC: 1953.7
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## ## Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output[[2L]]$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output[[2L]]$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2872   0.1122   0.1581   0.2495   0.5865  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.06114    0.64423   0.095 0.924394    
## A            1.74914    0.14643  11.945  < 2e-16 ***
## C1           2.38172    0.65266   3.649 0.000263 ***
## C2           0.89858    0.13520   6.646 3.01e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2279.3  on 9999  degrees of freedom
## Residual deviance: 2048.3  on 9996  degrees of freedom
## AIC: 2056.3
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Regressions with imputed dataset 3 
## 
## ## Outcome regression: 
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output[[3L]]$yreg)$data, weights = getCall(x$reg.output[[3L]]$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5610  -0.2410  -0.1711  -0.1586   3.0539  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.4754     0.7762  -4.477 7.55e-06 ***
## A             0.6578     0.5285   1.245  0.21327    
## M            -1.0783     0.3738  -2.885  0.00391 ** 
## C1            0.8717     0.7056   1.235  0.21666    
## C2           -0.8594     0.1457  -5.897 3.70e-09 ***
## A:M          -0.3875     0.5546  -0.699  0.48473    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1999.6  on 9999  degrees of freedom
## Residual deviance: 1941.0  on 9994  degrees of freedom
## AIC: 1953
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## ## Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output[[3L]]$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output[[3L]]$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2840   0.1135   0.1603   0.2463   0.5915  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.02469    0.64619  -0.038 0.969527    
## A            1.70728    0.14525  11.754  < 2e-16 ***
## C1           2.48160    0.65504   3.788 0.000152 ***
## C2           0.90288    0.13534   6.671 2.53e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2271.9  on 9999  degrees of freedom
## Residual deviance: 2047.8  on 9996  degrees of freedom
## AIC: 2055.8
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Regressions with imputed dataset 4 
## 
## ## Outcome regression: 
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output[[4L]]$yreg)$data, weights = getCall(x$reg.output[[4L]]$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6131  -0.2409  -0.1769  -0.1626   3.0550  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.5583     0.7664  -4.643 3.44e-06 ***
## A             0.8684     0.5120   1.696  0.08984 .  
## M            -1.0941     0.3742  -2.924  0.00346 ** 
## C1            0.9264     0.6936   1.336  0.18167    
## C2           -0.8117     0.1435  -5.655 1.56e-08 ***
## A:M          -0.5374     0.5389  -0.997  0.31868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2038.1  on 9999  degrees of freedom
## Residual deviance: 1977.6  on 9994  degrees of freedom
## AIC: 1989.6
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## ## Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output[[4L]]$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output[[4L]]$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2910   0.1110   0.1596   0.2461   0.5821  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.1220     0.6485   0.188 0.850782    
## A             1.7418     0.1467  11.877  < 2e-16 ***
## C1            2.3162     0.6558   3.532 0.000413 ***
## C2            0.9316     0.1361   6.847 7.55e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2264.5  on 9999  degrees of freedom
## Residual deviance: 2035.6  on 9996  degrees of freedom
## AIC: 2043.6
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Regressions with imputed dataset 5 
## 
## ## Outcome regression: 
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output[[5L]]$yreg)$data, weights = getCall(x$reg.output[[5L]]$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5628  -0.2464  -0.1719  -0.1620   3.0578  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.1670     0.7635  -4.148 3.35e-05 ***
## A             0.6129     0.5183   1.182 0.237047    
## M            -1.2113     0.3598  -3.367 0.000761 ***
## C1            0.6585     0.7005   0.940 0.347196    
## C2           -0.8612     0.1449  -5.942 2.81e-09 ***
## A:M          -0.2803     0.5454  -0.514 0.607333    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2015.0  on 9999  degrees of freedom
## Residual deviance: 1953.3  on 9994  degrees of freedom
## AIC: 1965.3
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## ## Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output[[5L]]$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output[[5L]]$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2762   0.1129   0.1605   0.2487   0.5715  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.2134     0.6455   0.331 0.740950    
## A             1.7214     0.1460  11.791  < 2e-16 ***
## C1            2.2401     0.6538   3.426 0.000612 ***
## C2            0.8976     0.1355   6.625 3.47e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2264.5  on 9999  degrees of freedom
## Residual deviance: 2041.5  on 9996  degrees of freedom
## AIC: 2049.5
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Effect decomposition on the risk ratio scale via the regression-based approach with multiple imputation
##  
## Closed-form parameter function estimation with 
##  delta method standard errors, confidence intervals and p-values 
##  
##                Estimate Std.error  95% CIL 95% CIU    P.val    
## Rcde            1.35895   0.23436  0.96919   1.905   0.0753 .  
## Rpnde           1.44731   0.25058  1.03084   2.032   0.0327 *  
## Rtnde           1.37654   0.23224  0.98896   1.916   0.0582 .  
## Rpnie           0.92859   0.03747  0.85799   1.005   0.0663 .  
## Rtnie           0.88318   0.05446  0.78264   0.997   0.0440 *  
## Rte             1.27825   0.20622  0.93173   1.754   0.1281    
## ERcde           0.32787   0.20880 -0.08137   0.737   0.1164    
## ERintref        0.12035   0.12110 -0.11700   0.358   0.3203    
## ERintmed       -0.09812   0.09897 -0.29210   0.096   0.3215    
## ERpnie         -0.07139   0.03744 -0.14478   0.002   0.0566 .  
## ERcde(prop)     1.17873   0.23528  0.71758   1.640 5.45e-07 ***
## ERintref(prop)  0.43179   0.47321 -0.49568   1.359   0.3615    
## ERintmed(prop) -0.35192   0.38637 -1.10919   0.405   0.3624    
## ERpnie(prop)   -0.25861   0.23334 -0.71595   0.199   0.2677    
## pm             -0.61053   0.48589 -1.56286   0.342   0.2089    
## int             0.07988   0.08780 -0.09220   0.252   0.3629    
## pe             -0.17873   0.23528 -0.63988   0.282   0.4475    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Rcde: controlled direct effect risk ratio; Rpnde: pure natural direct effect risk ratio; Rtnde: total natural direct effect risk ratio; Rpnie: pure natural indirect effect risk ratio; Rtnie: total natural indirect effect risk ratio; Rte: total effect risk ratio; ERcde: excess relative risk due to controlled direct effect; ERintref: excess relative risk due to reference interaction; ERintmed: excess relative risk due to mediated interaction; ERpnie: excess relative risk due to pure natural indirect effect; ERcde(prop): proportion ERcde; ERintref(prop): proportion ERintref; ERintmed(prop): proportion ERintmed; ERpnie(prop): proportion ERpnie; pm: overall proportion mediated; int: overall proportion attributable to interaction; pe: overall proportion eliminated)
## 
## Relevant variable values: 
## $a
## [1] 1
## 
## $astar
## [1] 0
## 
## $yval
## [1] "1"
## 
## $mval
## $mval[[1]]
## [1] 1
## 
## 
## $basecval
## $basecval[[1]]
## [1] 0.9993387
## 
## $basecval[[2]]
## [1] 0.6060235

Compare the multiple imputation results with true results:

res_noNA <- cmest(data = data_noNA, model = "rb", outcome = "Y", exposure = "A",
                  mediator = "M", basec = c("C1", "C2"), EMint = TRUE,
                  mreg = list("logistic"), yreg = "logistic",
                  astar = 0, a = 1, mval = list(1), 
                  estimation = "paramfunc", inference = "delta")

summary(res_noNA)

## Causal Mediation Analysis
## 
## # Outcome regression:
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output$yreg)$data, weights = getCall(x$reg.output$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5681  -0.2414  -0.1741  -0.1604   3.0521  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.5731     0.7713  -4.633 3.61e-06 ***
## A             0.7084     0.5283   1.341  0.17990    
## M            -1.0471     0.3736  -2.803  0.00507 ** 
## C1            0.9337     0.6973   1.339  0.18054    
## C2           -0.8415     0.1443  -5.829 5.56e-09 ***
## A:M          -0.4222     0.5541  -0.762  0.44607    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2022.7  on 9999  degrees of freedom
## Residual deviance: 1965.0  on 9994  degrees of freedom
## AIC: 1977
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2578   0.1145   0.1647   0.2519   0.5563  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4353     0.6410   0.679  0.49712    
## A             1.7076     0.1443  11.837  < 2e-16 ***
## C1            1.9982     0.6474   3.086  0.00203 ** 
## C2            0.9024     0.1345   6.709 1.96e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2294.0  on 9999  degrees of freedom
## Residual deviance: 2072.5  on 9996  degrees of freedom
## AIC: 2080.5
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Effect decomposition on the risk ratio scale via the regression-based approach
##  
## Closed-form parameter function estimation with 
##  delta method standard errors, confidence intervals and p-values 
##  
##                Estimate Std.error  95% CIL 95% CIU    P.val    
## Rcde            1.33138   0.22278  0.95912   1.848   0.0872 .  
## Rpnde           1.41988   0.23708  1.02358   1.970   0.0358 *  
## Rtnde           1.34928   0.22056  0.97940   1.859   0.0669 .  
## Rpnie           0.93338   0.03576  0.86587   1.006   0.0719 .  
## Rtnie           0.88697   0.05295  0.78902   0.997   0.0445 *  
## Rte             1.25939   0.19804  0.92535   1.714   0.1425    
## ERcde           0.30416   0.20019 -0.08820   0.697   0.1287    
## ERintref        0.11571   0.11472 -0.10914   0.341   0.3131    
## ERintmed       -0.09387   0.09323 -0.27659   0.089   0.3140    
## ERpnie         -0.06662   0.03576 -0.13670   0.003   0.0624 .  
## ERcde(prop)     1.17262   0.23540  0.71124   1.634 6.32e-07 ***
## ERintref(prop)  0.44610   0.49051 -0.51528   1.407   0.3631    
## ERintmed(prop) -0.36189   0.39881 -1.14354   0.420   0.3642    
## ERpnie(prop)   -0.25684   0.23474 -0.71691   0.203   0.2739    
## pm             -0.61872   0.50289 -1.60437   0.367   0.2186    
## int             0.08421   0.09264 -0.09736   0.266   0.3633    
## pe             -0.17262   0.23540 -0.63400   0.289   0.4634    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Rcde: controlled direct effect risk ratio; Rpnde: pure natural direct effect risk ratio; Rtnde: total natural direct effect risk ratio; Rpnie: pure natural indirect effect risk ratio; Rtnie: total natural indirect effect risk ratio; Rte: total effect risk ratio; ERcde: excess relative risk due to controlled direct effect; ERintref: excess relative risk due to reference interaction; ERintmed: excess relative risk due to mediated interaction; ERpnie: excess relative risk due to pure natural indirect effect; ERcde(prop): proportion ERcde; ERintref(prop): proportion ERintref; ERintmed(prop): proportion ERintmed; ERpnie(prop): proportion ERpnie; pm: overall proportion mediated; int: overall proportion attributable to interaction; pe: overall proportion eliminated)
## 
## Relevant variable values: 
## $a
## [1] 1
## 
## $astar
## [1] 0
## 
## $yval
## [1] "1"
## 
## $mval
## $mval[[1]]
## [1] 1
## 
## 
## $basecval
## $basecval[[1]]
## [1] 0.9993463
## 
## $basecval[[2]]
## [1] 0.6061

2022-03-30