This example demonstrates how to estimate causal effects with multiple imputations by using cmest when the dataset contains missing values. For this purpose, we simulate some data containing a continuous baseline confounder \(C_1\), a binary baseline confounder \(C_2\), a binary exposure \(A\), a binary mediator \(M\) and a binary outcome \(Y\). The true regression models for \(A\), \(M\) and \(Y\) are: \[logit(E(A|C_1,C_2))=0.2+0.5C_1+0.1C_2\] \[logit(E(M|A,C_1,C_2))=1+2A+1.5C_1+0.8C_2\] \[logit(E(Y|A,M,C_1,C_2)))=-3-0.4A-1.2M+0.5AM+0.3C_1-0.6C_2\]

To create the dataset with missing values, we randomly delete 10% of values in the dataset.

set.seed(1)
expit <- function(x) exp(x)/(1+exp(x))
n <- 10000
C1 <- rnorm(n, mean = 1, sd = 0.1)
C2 <- rbinom(n, 1, 0.6)
A <- rbinom(n, 1, expit(0.2 + 0.5*C1 + 0.1*C2))
M <- rbinom(n, 1, expit(1 + 2*A + 1.5*C1 + 0.8*C2))
Y <- rbinom(n, 1, expit(-3 - 0.4*A - 1.2*M + 0.5*A*M + 0.3*C1 - 0.6*C2))
data_noNA <- data.frame(A, M, Y, C1, C2)
missing <- sample(1:(5*n), n*0.1, replace = FALSE)
C1[missing[which(missing <= n)]] <- NA
C2[missing[which((missing > n)*(missing <= 2*n) == 1)] - n] <- NA
A[missing[which((missing > 2*n)*(missing <= 3*n) == 1)] - 2*n] <- NA
M[missing[which((missing > 3*n)*(missing <= 4*n) == 1)] - 3*n] <- NA
Y[missing[which((missing > 4*n)*(missing <= 5*n) == 1)] - 4*n] <- NA
data <- data.frame(A, M, Y, C1, C2)

The DAG for this scientific setting is:

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: mstate
## Loading required package: survival
## Loading required package: tidyr
cmdag(outcome = "Y", exposure = "A", mediator = "M",
      basec = c("C1", "C2"), postc = NULL, node = TRUE, text_col = "white")

To conduct multiple imputations, we set the multimp argument to be TRUE. The regression-based approach is used for illustration. The multiple imputation results:

res_multimp <- cmest(data = data, model = "rb", outcome = "Y", exposure = "A",
                     mediator = "M", basec = c("C1", "C2"), EMint = TRUE,
                     mreg = list("logistic"), yreg = "logistic",
                     astar = 0, a = 1, mval = list(1), 
                     estimation = "paramfunc", inference = "delta", 
                     multimp = TRUE, m = 10)
summary(res_multimp)
## Causal Mediation Analysis
## 
## # Regressions with imputed dataset 1 
## 
## ## Outcome regression: 
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output[[1L]]$yreg)$data, weights = getCall(x$reg.output[[1L]]$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5694  -0.2416  -0.1744  -0.1601   3.0488  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.6067     0.7610  -4.740 2.14e-06 ***
## A             0.6880     0.5286   1.301  0.19311    
## M            -1.0444     0.3732  -2.798  0.00514 ** 
## C1            0.9828     0.6875   1.429  0.15288    
## C2           -0.8432     0.1443  -5.843 5.12e-09 ***
## A:M          -0.4189     0.5542  -0.756  0.44969    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2030.4  on 9999  degrees of freedom
## Residual deviance: 1972.2  on 9994  degrees of freedom
## AIC: 1984.2
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## ## Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output[[1L]]$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output[[1L]]$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2915   0.1100   0.1627   0.2428   0.5763  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.2134     0.6391   0.334  0.73848    
## A             1.7157     0.1461  11.747  < 2e-16 ***
## C1            2.2136     0.6476   3.418  0.00063 ***
## C2            0.9837     0.1369   7.185 6.72e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2264.5  on 9999  degrees of freedom
## Residual deviance: 2034.0  on 9996  degrees of freedom
## AIC: 2042
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Regressions with imputed dataset 2 
## 
## ## Outcome regression: 
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output[[2L]]$yreg)$data, weights = getCall(x$reg.output[[2L]]$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5675  -0.2408  -0.1734  -0.1610   3.0621  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.4557     0.7724  -4.474 7.67e-06 ***
## A             0.7235     0.5289   1.368  0.17129    
## M            -1.1024     0.3747  -2.942  0.00326 ** 
## C1            0.8216     0.7014   1.171  0.24143    
## C2           -0.8368     0.1454  -5.753 8.75e-09 ***
## A:M          -0.3930     0.5556  -0.707  0.47935    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1999.6  on 9999  degrees of freedom
## Residual deviance: 1941.7  on 9994  degrees of freedom
## AIC: 1953.7
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## ## Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output[[2L]]$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output[[2L]]$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2872   0.1122   0.1581   0.2495   0.5865  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.06114    0.64423   0.095 0.924394    
## A            1.74914    0.14643  11.945  < 2e-16 ***
## C1           2.38172    0.65266   3.649 0.000263 ***
## C2           0.89858    0.13520   6.646 3.01e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2279.3  on 9999  degrees of freedom
## Residual deviance: 2048.3  on 9996  degrees of freedom
## AIC: 2056.3
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Regressions with imputed dataset 3 
## 
## ## Outcome regression: 
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output[[3L]]$yreg)$data, weights = getCall(x$reg.output[[3L]]$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5610  -0.2410  -0.1711  -0.1586   3.0539  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.4754     0.7762  -4.477 7.55e-06 ***
## A             0.6578     0.5285   1.245  0.21327    
## M            -1.0783     0.3738  -2.885  0.00391 ** 
## C1            0.8717     0.7056   1.235  0.21666    
## C2           -0.8594     0.1457  -5.897 3.70e-09 ***
## A:M          -0.3875     0.5546  -0.699  0.48473    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1999.6  on 9999  degrees of freedom
## Residual deviance: 1941.0  on 9994  degrees of freedom
## AIC: 1953
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## ## Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output[[3L]]$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output[[3L]]$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2840   0.1135   0.1603   0.2463   0.5915  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.02469    0.64619  -0.038 0.969527    
## A            1.70728    0.14525  11.754  < 2e-16 ***
## C1           2.48160    0.65504   3.788 0.000152 ***
## C2           0.90288    0.13534   6.671 2.53e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2271.9  on 9999  degrees of freedom
## Residual deviance: 2047.8  on 9996  degrees of freedom
## AIC: 2055.8
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Regressions with imputed dataset 4 
## 
## ## Outcome regression: 
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output[[4L]]$yreg)$data, weights = getCall(x$reg.output[[4L]]$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6131  -0.2409  -0.1769  -0.1626   3.0550  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.5583     0.7664  -4.643 3.44e-06 ***
## A             0.8684     0.5120   1.696  0.08984 .  
## M            -1.0941     0.3742  -2.924  0.00346 ** 
## C1            0.9264     0.6936   1.336  0.18167    
## C2           -0.8117     0.1435  -5.655 1.56e-08 ***
## A:M          -0.5374     0.5389  -0.997  0.31868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2038.1  on 9999  degrees of freedom
## Residual deviance: 1977.6  on 9994  degrees of freedom
## AIC: 1989.6
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## ## Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output[[4L]]$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output[[4L]]$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2910   0.1110   0.1596   0.2461   0.5821  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.1220     0.6485   0.188 0.850782    
## A             1.7418     0.1467  11.877  < 2e-16 ***
## C1            2.3162     0.6558   3.532 0.000413 ***
## C2            0.9316     0.1361   6.847 7.55e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2264.5  on 9999  degrees of freedom
## Residual deviance: 2035.6  on 9996  degrees of freedom
## AIC: 2043.6
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Regressions with imputed dataset 5 
## 
## ## Outcome regression: 
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output[[5L]]$yreg)$data, weights = getCall(x$reg.output[[5L]]$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5628  -0.2464  -0.1719  -0.1620   3.0578  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.1670     0.7635  -4.148 3.35e-05 ***
## A             0.6129     0.5183   1.182 0.237047    
## M            -1.2113     0.3598  -3.367 0.000761 ***
## C1            0.6585     0.7005   0.940 0.347196    
## C2           -0.8612     0.1449  -5.942 2.81e-09 ***
## A:M          -0.2803     0.5454  -0.514 0.607333    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2015.0  on 9999  degrees of freedom
## Residual deviance: 1953.3  on 9994  degrees of freedom
## AIC: 1965.3
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## ## Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output[[5L]]$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output[[5L]]$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2762   0.1129   0.1605   0.2487   0.5715  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.2134     0.6455   0.331 0.740950    
## A             1.7214     0.1460  11.791  < 2e-16 ***
## C1            2.2401     0.6538   3.426 0.000612 ***
## C2            0.8976     0.1355   6.625 3.47e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2264.5  on 9999  degrees of freedom
## Residual deviance: 2041.5  on 9996  degrees of freedom
## AIC: 2049.5
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Effect decomposition on the risk ratio scale via the regression-based approach with multiple imputation
##  
## Closed-form parameter function estimation with 
##  delta method standard errors, confidence intervals and p-values 
##  
##                Estimate Std.error  95% CIL 95% CIU    P.val    
## Rcde            1.35895   0.23436  0.96919   1.905   0.0753 .  
## Rpnde           1.44731   0.25058  1.03084   2.032   0.0327 *  
## Rtnde           1.37654   0.23224  0.98896   1.916   0.0582 .  
## Rpnie           0.92859   0.03747  0.85799   1.005   0.0663 .  
## Rtnie           0.88318   0.05446  0.78264   0.997   0.0440 *  
## Rte             1.27825   0.20622  0.93173   1.754   0.1281    
## ERcde           0.32787   0.20880 -0.08137   0.737   0.1164    
## ERintref        0.12035   0.12110 -0.11700   0.358   0.3203    
## ERintmed       -0.09812   0.09897 -0.29210   0.096   0.3215    
## ERpnie         -0.07139   0.03744 -0.14478   0.002   0.0566 .  
## ERcde(prop)     1.17873   0.23528  0.71758   1.640 5.45e-07 ***
## ERintref(prop)  0.43179   0.47321 -0.49568   1.359   0.3615    
## ERintmed(prop) -0.35192   0.38637 -1.10919   0.405   0.3624    
## ERpnie(prop)   -0.25861   0.23334 -0.71595   0.199   0.2677    
## pm             -0.61053   0.48589 -1.56286   0.342   0.2089    
## int             0.07988   0.08780 -0.09220   0.252   0.3629    
## pe             -0.17873   0.23528 -0.63988   0.282   0.4475    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Rcde: controlled direct effect risk ratio; Rpnde: pure natural direct effect risk ratio; Rtnde: total natural direct effect risk ratio; Rpnie: pure natural indirect effect risk ratio; Rtnie: total natural indirect effect risk ratio; Rte: total effect risk ratio; ERcde: excess relative risk due to controlled direct effect; ERintref: excess relative risk due to reference interaction; ERintmed: excess relative risk due to mediated interaction; ERpnie: excess relative risk due to pure natural indirect effect; ERcde(prop): proportion ERcde; ERintref(prop): proportion ERintref; ERintmed(prop): proportion ERintmed; ERpnie(prop): proportion ERpnie; pm: overall proportion mediated; int: overall proportion attributable to interaction; pe: overall proportion eliminated)
## 
## Relevant variable values: 
## $a
## [1] 1
## 
## $astar
## [1] 0
## 
## $yval
## [1] "1"
## 
## $mval
## $mval[[1]]
## [1] 1
## 
## 
## $basecval
## $basecval[[1]]
## [1] 0.9993387
## 
## $basecval[[2]]
## [1] 0.6060235

Compare the multiple imputation results with true results:

res_noNA <- cmest(data = data_noNA, model = "rb", outcome = "Y", exposure = "A",
                  mediator = "M", basec = c("C1", "C2"), EMint = TRUE,
                  mreg = list("logistic"), yreg = "logistic",
                  astar = 0, a = 1, mval = list(1), 
                  estimation = "paramfunc", inference = "delta")
summary(res_noNA)
## Causal Mediation Analysis
## 
## # Outcome regression:
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output$yreg)$data, weights = getCall(x$reg.output$yreg)$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5681  -0.2414  -0.1741  -0.1604   3.0521  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.5731     0.7713  -4.633 3.61e-06 ***
## A             0.7084     0.5283   1.341  0.17990    
## M            -1.0471     0.3736  -2.803  0.00507 ** 
## C1            0.9337     0.6973   1.339  0.18054    
## C2           -0.8415     0.1443  -5.829 5.56e-09 ***
## A:M          -0.4222     0.5541  -0.762  0.44607    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2022.7  on 9999  degrees of freedom
## Residual deviance: 1965.0  on 9994  degrees of freedom
## AIC: 1977
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2578   0.1145   0.1647   0.2519   0.5563  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4353     0.6410   0.679  0.49712    
## A             1.7076     0.1443  11.837  < 2e-16 ***
## C1            1.9982     0.6474   3.086  0.00203 ** 
## C2            0.9024     0.1345   6.709 1.96e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2294.0  on 9999  degrees of freedom
## Residual deviance: 2072.5  on 9996  degrees of freedom
## AIC: 2080.5
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Effect decomposition on the risk ratio scale via the regression-based approach
##  
## Closed-form parameter function estimation with 
##  delta method standard errors, confidence intervals and p-values 
##  
##                Estimate Std.error  95% CIL 95% CIU    P.val    
## Rcde            1.33138   0.22278  0.95912   1.848   0.0872 .  
## Rpnde           1.41988   0.23708  1.02358   1.970   0.0358 *  
## Rtnde           1.34928   0.22056  0.97940   1.859   0.0669 .  
## Rpnie           0.93338   0.03576  0.86587   1.006   0.0719 .  
## Rtnie           0.88697   0.05295  0.78902   0.997   0.0445 *  
## Rte             1.25939   0.19804  0.92535   1.714   0.1425    
## ERcde           0.30416   0.20019 -0.08820   0.697   0.1287    
## ERintref        0.11571   0.11472 -0.10914   0.341   0.3131    
## ERintmed       -0.09387   0.09323 -0.27659   0.089   0.3140    
## ERpnie         -0.06662   0.03576 -0.13670   0.003   0.0624 .  
## ERcde(prop)     1.17262   0.23540  0.71124   1.634 6.32e-07 ***
## ERintref(prop)  0.44610   0.49051 -0.51528   1.407   0.3631    
## ERintmed(prop) -0.36189   0.39881 -1.14354   0.420   0.3642    
## ERpnie(prop)   -0.25684   0.23474 -0.71691   0.203   0.2739    
## pm             -0.61872   0.50289 -1.60437   0.367   0.2186    
## int             0.08421   0.09264 -0.09736   0.266   0.3633    
## pe             -0.17262   0.23540 -0.63400   0.289   0.4634    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Rcde: controlled direct effect risk ratio; Rpnde: pure natural direct effect risk ratio; Rtnde: total natural direct effect risk ratio; Rpnie: pure natural indirect effect risk ratio; Rtnie: total natural indirect effect risk ratio; Rte: total effect risk ratio; ERcde: excess relative risk due to controlled direct effect; ERintref: excess relative risk due to reference interaction; ERintmed: excess relative risk due to mediated interaction; ERpnie: excess relative risk due to pure natural indirect effect; ERcde(prop): proportion ERcde; ERintref(prop): proportion ERintref; ERintmed(prop): proportion ERintmed; ERpnie(prop): proportion ERpnie; pm: overall proportion mediated; int: overall proportion attributable to interaction; pe: overall proportion eliminated)
## 
## Relevant variable values: 
## $a
## [1] 1
## 
## $astar
## [1] 0
## 
## $yval
## [1] "1"
## 
## $mval
## $mval[[1]]
## [1] 1
## 
## 
## $basecval
## $basecval[[1]]
## [1] 0.9993463
## 
## $basecval[[2]]
## [1] 0.6061