Using a DAG to simulate data with the dagR library

Directed acyclic graphs (DAGs), and causal graphs in general, provide a framework for making assumptions explicit and identifying confounders or mediators of the relationship between the exposure of interest and outcome that need to be adjusted for in analysis. Recently, I ran into the need to generate data from a DAG for a paper I am writing with my peers Kevin McIntyre and Joshua Wiener. After a quick Google search, I was pleasantly surprised to see there were several options to do so. In particular, the dagR library provides “functions to draw, manipulate, [and] evaluate directed acyclic graphs and simulate corresponding data”.

Besides dagR's reference manual, a short letter published in Epidemiology, and a limited collection of examples, I couldn't find too many resources regarding how to use the functionality provided by dagR. The goal of this blog post is to provide an expository example of how to create a DAG and generate data from it using the dagR library.

To simulate data from a DAG with dagR, we need to:

  1. Create the DAG of interest using the dag.init function by specifying its nodes (exposure, outcome, and covariates) and their directed arcs (directed arrows to/from nodes).
  2. Pass the DAG from (1) to the dag.sim function and specify the number of observations to be generated, arc coefficients, node types (binary or continuous), and parameters of the node distributions (Normal or Bernoulli).

For this tutorial, we are going to try to replicate the simple confounding/common cause DAG presented in Figure 1b, as well as the more complex DAG in Figure 2a, of Shrier and Platt's (2008) paper, Reducing bias through directed acyclic graphs.

library(dagR)
set.seed(12345)

Simple confounding / common cause DAG


Figure 1b of Shrier and Platt (2008).

Specify

In dagR, the node corresponding to the primary exposure X is assigned an ID of 0. The node corresponding to the outcome O is assigned an ID of -1. Any covariates are assigned an integer ID between 1 and length(covs), the total number of covariates. Covariates can be specified as either known or measured (covs = 1), or unknown or unmeasured (covs = 2).

Arcs are defined by providing the ID of the origin node followed by the ID of the terminal node. Arcs are directed by default (assocs = 0) but can be specified as undirected (assocs = 1).

confound_dag <- dag.init(
                         # REQUIRED: SPECIFY COVARIATES AND THEIR TYPE
                         covs = c(1),      # specify that single covariate (common cause) is "known"

                         # OPTIONAL: PROVIDE CUSTOM LABELS FOR DAG NODES AND SIM DATA
                         symbols = c('X',  # symbol for primary exposure within DAG
                                     'C',  # "      "   common cause     "      "
                                     'O'), # "      "   outcome          "      "

                         # REQUIRED: SPECIFY ARCS BETWEEN X, C, AND O
                         arcs = c(0, -1,   # arc from primary exposure to outcome
                                  1,  0,   # arc from common cause to exposure
                                  1, -1),  # arc from common cause to outcome

                         # OPTIONAL: PROVIDE NAMES FOR X, C, AND O (USED IN LEGEND)
                         x.name = 'Primary exposure',   # name of primary exposure
                         y.name = 'Outcome',            # name of outcome
                         cov.names = c('Common cause'), # (vector of) names of covariates

                         # REQUIRED: TURN OFF WEIRD DEFAULTS
                         noxy   = T        # tell dagR we want to specify our *own* arc between X and O
                         )

That's it! We've specified the simple confounding DAG. To make sure we have done so correctly, we can take a look at the details of our DAG object by printing confound_dag, as well as its summary via summary(confound_dag).

confound_dag
## $cov.types
## [1]  0  1 -1
## 
## $x
## [1] 0.0 0.5 1.0
## 
## $y
## [1] 0.0 0.7 0.0
## 
## $arc
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    1
## [3,]    2    3
## 
## $arc.type
## [1] 0 0 0
## 
## $curve.x
## [1] NA NA NA
## 
## $curve.y
## [1] NA NA NA
## 
## $xgap
## [1] 0.04
## 
## $ygap
## [1] 0.05
## 
## $len
## [1] 0.1
## 
## $names
## [1] "Primary exposure" "Common cause"     "Outcome"         
## 
## $symbols
## [1] "X" "C" "O"
## 
## $version
## [1] "1.2.0"
## 
## Directed acyclic graph (DAG) object created by dagR version 1.2.0.
## For a convenient interpreted presentation, use summary().
  • $cov.types summarizes the variables in our model and their types, confirming that we have 1 primary exposure with ID 0, 1 covariate with ID 1, and 1 outcome with ID -1.
  • $x and $y provide the (x, y) co-ordinates of nodes for plotting.
  • $arc confirms the arcs that we’ve drawn. Here, nodes are numbered such that node 1 is the primary exposure, node 3 is the outcome, and node 2 is the common cause.
  • $arc.type confirms that all of our arcs are directed (0), as desired.
  • $xgap, $ygap, and $len are graphical parameters corresponding to the arcs.
  • Finally, our $names and $symbols appear to correctly correspond with our $cov.types.

Perfect!
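
We can also translate the internal arc numbering in $arc back into our node symbols as a quick sanity check. This is just base R applied to the DAG object, not a dagR function:

# Map each arc's internal node numbers back to the node symbols
apply(confound_dag$arc, MARGIN = 1,
      FUN = function(a) paste(confound_dag$symbols[a[1]], "->", confound_dag$symbols[a[2]]))
## [1] "X -> O" "C -> X" "C -> O"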

summary(confound_dag)
## DAG created by dagR version 1.2.0.
##  - from X = Primary exposure to Y = Outcome
##  - acyclic: TRUE
##  - 3 nodes: X, Y, 1 cov(s) (0 unknown/unmeasured)
##    - node 2 = Common cause
##  - alternative node symbols:  X C O
##  - 3 arc(s) (0 undirected)
##  - currently unadjusted
##  - potentially biasing paths not sought
##  - possible adjustment sets not examined

Our summary appears to confirm our interpretation of our DAG object.
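
The last three summary lines hint at dagR functionality we aren't using here: searching for biasing paths and examining adjustment sets. The dag.adjust function handles this; below is a minimal sketch, under the assumption that its argument A takes the internal node numbers shown in $arc (so the common cause C would be node 2):

# Sketch only (assumed API): adjust for the common cause C (internal node 2),
# then re-summarize; the summary should then report the adjustment and any
# biasing paths found.
confound_adj <- dag.adjust(confound_dag, A = c(2))
summary(confound_adj)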

Plot

The dag.draw function can be used to plot a DAG object.

confound_fig <- dag.draw(
                        confound_dag,      # DAG object
                        numbering = F,     # Number arcs: T/F
                        legend = T,        # Legend: T/F
                        noxy = 2           # do not draw ? above X -> O arc
                       )

Figure: the simple confounding DAG drawn by dag.draw().

Simulate

The algorithm used to simulate data is described in the dagR documentation as follows:

Simulation steps:

  1. Simulate data for nodes i without ancestors, drawing from a Normal distribution with mean mu[i] and standard deviation stdev[i] (continuous node), or drawing from Bernoulli events with probability mu[i] (binary node).

  2. Simulate data for nodes i for which all ancestors already have been simulated by multiplying the ancestor values with the corresponding arc coefficients and summing them up, shifting the resulting values to the mean mu[i] specified for the currently simulated node (logit-transformed if binary), then adding noise drawn from a Normal distribution with mean 0 and standard deviation stdev[i], finally using the inverse logit of the resulting values as success probabilities for simulating binary data if node is binary.
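
Step (2) is a mouthful, so here is a minimal sketch of what it amounts to for a single child node (my own illustration of the quoted recipe, not dagR's actual internals):

# Minimal sketch of simulation step (2) for one child node
simulate_child <- function(ancestors, b, mu, stdev, binary = FALSE) {
  logit  <- function(p) log(p / (1 - p))
  ilogit <- function(x) 1 / (1 + exp(-x))
  n      <- nrow(as.matrix(ancestors))
  raw    <- as.vector(as.matrix(ancestors) %*% b)  # weighted sum of ancestor values
  target <- if (binary) logit(mu) else mu          # shift target (logit scale if binary)
  vals   <- raw - mean(raw) + target + rnorm(n, mean = 0, sd = stdev)
  if (binary) rbinom(n, size = 1, prob = ilogit(vals)) else vals
}

The worked equations below make this concrete for our confounding DAG.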

Suppose we want to generate n = 100 observations based on our DAG. In our generated sample, we want the primary exposure X to be a binary variable with an observed prevalence of p = 0.5, the common cause C to be a continuous variable distributed according to N(\mu = 120, \sigma = 20), and the outcome O to be a continuous variable centered at O^{*} = 10.

The common cause node C is our only node without ancestors. According to the algorithm above, data for C will be generated first by drawing n = 100 observations from a N(120, 20) distribution.

The primary exposure X only has one ancestor while the outcome O has two, so X is the next candidate for simulation. X is binary, so its data are first generated on the logit scale such that,

    \[\text{logit}~p = \log \left(\frac{p}{1-p}\right) = b_0 + b_{CX} \times C\]


where b_{CX} is the direct effect of C \rightarrow X and p is the prevalence of X. Then, the desired prevalence p^{*} can be generated from C using an operation similar to,

    \[p_i = \text{ilogit} \left\{ \left[b_{CX} \times C_i - \text{mean}(b_{CX} \times C_i) + \text{logit}~p^{*} \right] + \epsilon_i \right \}\]


where the \epsilon_i are random noise generated from N(0, \sigma_{X}). Finally, X_i \sim Bernoulli(p_i). In this simulation, b_{CX} = \log 2 and \sigma_X = 3.

Finally, to generate O centered at O^{*},

    \[O_i = \left[(b_{CO} \times C_i + b_{XO} \times X_i) - \text{mean}(b_{CO} \times C_i + b_{XO} \times X_i) + O^{*}\right] + \gamma_i\]


where b_{XO} is the direct effect of X \rightarrow O, b_{CO} is the direct effect of C \rightarrow O, and the \gamma_i are random noise generated from N(0, \sigma_{O}). In this simulation, b_{CO} = 1.5, b_{XO} = 5, and \sigma_O = 1.

confound_sim <- dag.sim(
                        # REQUIRED: PROVIDE DAG OBJECT AND DESIRED SAMPLE SIZE
                        confound_dag,
                        n = 100,

                        # REQUIRED: SPECIFY NODE TYPES (1 = BINARY, 0 = CTS)
                        binary = c(1,  # binary primary exposure X
                                   0,  # continuous common cause C
                                   0), # continuous outcome O

                        # REQUIRED: SPECIFY MEAN AND SD OF CTS NODES
                        # (*) MU = PREVALENCE FOR BINARY NODES
                        mu = c(0.5,    # prevalence p of primary exposure X
                               120,    # mean of common cause C
                               10),    # "    "  outcome O
                        stdev = c(3,   # sd of random noise added when generating X
                                  20,  # "  "  common cause C
                                  1),  # "  "  random noise added when generating O

                        # REQUIRED: SPECIFY DIRECT EFFECTS OF ARCS
                        b = c(5,       # direct effect b of X -> O 
                              log(2),  # direct effect *exp(b)* of C -> X (since X binary)
                              1.5),    # direct effect b of C -> O

                        # OPTIONAL: SET SEED FOR REPRODUCIBILITY
                        seed = 12345,  # set seed for reproducibility
                        naming = 2,    # use custom labels for output variable names
                        verbose = F    # suppress alg. output
)
## Note: undirected arcs are discarded, an xy arc is appended.
## simulating node 2 from Normal, with mean SD 120 20
## node 1 : calculated from nodes 2 via arcs 2 with exp(b) 2 , shifted to probability 0.5 (added noise SD: 3 );
## node 3 : calculated from nodes 1 2 1 via arcs 1 3 4 with b 5 1.5 0 , shifted to mean 10 (added noise SD: 1 );

Note that dag.sim discards any undirected arcs and appends its own X \rightarrow O arc with coefficient 0 (arc 4 in the log above), which is why node 3 is listed as calculated from arcs 1, 3, and 4. Let's check out the data that gets spit out by dag.sim.

head(confound_sim)
##   X         C          O
## 1 1 131.71058  21.701275
## 2 1 134.18932  27.229184
## 3 0 117.81393  -2.878963
## 4 0 110.93006 -15.095081
## 5 1 132.11775  22.855403
## 6 0  83.64088 -53.952826
apply(confound_sim, MARGIN = 2, FUN = mean)
##         X         C         O 
##   0.54000 124.90394  10.07882

The data has our desired variable types and approximately the desired means and prevalence.

Let’s see if we can use the generated C values to derive X and O as outlined above.

# Logit and inverse logit functions for generating X
logit <- function(p){log(p/(1-p))}
ilogit <- function(x){1/(1+exp(-x))}

s_X  = 3
s_O  = 1
b_CX = log(2)
b_CO = 1.5
b_XO = 5

p_i = ilogit((b_CX * confound_sim$C - mean(b_CX * confound_sim$C) + logit(0.5)) +
             rnorm(n = 100, mean = 0, sd = s_X))
X_i = sapply(p_i, FUN = function(x) rbinom(n = 1, size = 1, prob = x))
mean(X_i)
## [1] 0.56
O_i = (b_CO * confound_sim$C + b_XO * X_i -
       mean(b_CO * confound_sim$C + b_XO * X_i) + 10) +
      rnorm(n = 100, mean = 0, sd = s_O)
mean(O_i)
## [1] 9.946751

Sweet, this is pretty close to what we got from our simulation. (The individual values won't match observation-for-observation since we drew fresh noise, but the summary statistics line up.)

Finally, we can construct some regression models using our simulated data to determine if they behave how we would expect. In a model regressing O on X without adjustment for C, we would expect the estimated effect of X \rightarrow O to be distorted due to confounding by C. After adjustment for the common cause C, the estimated effect of X \rightarrow O should be approximately unbiased.

lm(O ~ X, data = confound_sim)
## 
## Call:
## lm(formula = O ~ X, data = confound_sim)
## 
## Coefficients:
## (Intercept)            X  
##      -22.20        59.78

In the model including X alone, the direct effect of X has been overestimated as a result of failing to adjust for C.

lm(O ~ X + C, data = confound_sim)
## 
## Call:
## lm(formula = O ~ X + C, data = confound_sim)
## 
## Coefficients:
## (Intercept)            X            C  
##    -180.689        4.860        1.506

After adjustment for C, we have approximately recovered our direct effect of X \rightarrow O which we set as b_{XO} = 5.
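
To get a sense of the uncertainty around this estimate, we can also look at a 95% confidence interval, which should cover the true b_{XO} = 5:

# 95% CI for the adjusted effect of X
confint(lm(O ~ X + C, data = confound_sim))["X", ]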

More complex DAG



Figure 2a of Shrier and Platt (2008).

Specify

complex_dag <- dag.init(
                         # REQUIRED: SPECIFY COVARIATES AND THEIR TYPE
                         covs = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),

                         # OPTIONAL: PROVIDE CUSTOM LABELS FOR DAG NODES AND SIM DATA
                         symbols = c('X',   # 0: Warm-up exercises (primary exposure)
                                     'IGP', # 1: Intra-game proprioception 
                                     'TMA', # 2: Team motivation, aggression 
                                     'PI',  # 3: Previous injury 
                                     'CS',  # 4: Contact sport 
                                     'C',   # 5: Coach 
                                     'PGP', # 6: Pre-game proprioception 
                                     'FL',  # 7: Fitness level 
                                     'G',   # 8: Genetics  
                                     'NMF', # 9: Neuromuscular fatigue
                                     'CTD', # 10: Connective Tissue Disorder
                                     'TW',  # 11: Tissue weakness
                                     'O'),  # -1: Injury (outcome)

                         # REQUIRED: SPECIFY ARCS BETWEEN NODES
                         arcs = c(0,   1,      # X to IGP
                                  1,  -1,      # IGP to O
                                  2,   0,      # TMA to X
                                  2,   3,      # TMA to PI
                                  4,   3,      # CS to PI
                                  4,   1,      # CS to IGP
                                  6,   0,      # PGP to X
                                  5,   2,      # C to TMA
                                  5,   7,      # C to FL
                                  7,   6,      # FL to PGP
                                  7,   9,      # FL to NMF
                                  8,   7,      # G to FL
                                  8,   9,      # G to NMF
                                  8,  10,      # G to CTD
                                  10,  9,      # CTD to NMF
                                  10, 11,      # CTD to TW
                                  11, -1,      # TW to O
                                  9,   1,      # NMF to IGP
                                  9,  -1,      # NMF to O
                                   1,  -1),     # IGP to O (note: duplicates the IGP to O arc above)

                         # OPTIONAL: PROVIDE NAMES FOR NODES (USED IN LEGEND)
                         x.name = 'Warm-up Exercises',   # name of primary exposure
                         y.name = 'Injury',            # name of outcome
                         cov.names = c('Intra-game proprioception',
                                       'Team motivation, aggression',
                                       'Previous injury',
                                       'Contact sport',
                                       'Coach',
                                       'Pre-game proprioception',
                                       'Fitness level',
                                       'Genetics',
                                       'Neuromuscular fatigue',
                                       'Connective Tissue Disorder',
                                       'Tissue weakness'),

                         # REQUIRED: TURN OFF WEIRD DEFAULTS
                         noxy   = T        # tell dagR not to automatically add an X -> O arc
                         )
complex_dag
## $cov.types
##  [1]  0  1  1  1  1  1  1  1  1  1  1  1 -1
## 
## $x
##  [1]  0.000000000 -0.176148078 -0.106217783  0.005025253  0.150000000  0.318826668  0.500000000  0.681173332  0.850000000
## [10]  0.994974747  1.106217783  1.176148078  1.000000000
## 
## $y
##  [1] 0.0000000 0.1811733 0.3500000 0.4949747 0.6062178 0.6761481 0.7000000 0.6761481 0.6062178 0.4949747 0.3500000
## [12] 0.1811733 0.0000000
## 
## $arc
##       [,1] [,2]
##  [1,]    1    2
##  [2,]    2   13
##  [3,]    3    1
##  [4,]    3    4
##  [5,]    5    4
##  [6,]    5    2
##  [7,]    7    1
##  [8,]    6    3
##  [9,]    6    8
## [10,]    8    7
## [11,]    8   10
## [12,]    9    8
## [13,]    9   10
## [14,]    9   11
## [15,]   11   10
## [16,]   11   12
## [17,]   12   13
## [18,]   10    2
## [19,]   10   13
## [20,]    2   13
## 
## $arc.type
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 
## $curve.x
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 
## $curve.y
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 
## $xgap
## [1] 0.04
## 
## $ygap
## [1] 0.05
## 
## $len
## [1] 0.1
## 
## $names
##  [1] "Warm-up Exercises"           "Intra-game proprioception"   "Team motivation, aggression"
##  [4] "Previous injury"             "Contact sport"               "Coach"                      
##  [7] "Pre-game proprioception"     "Fitness level"               "Genetics"                   
## [10] "Neuromuscular fatigue"       "Connective Tissue Disorder"  "Tissue weakness"            
## [13] "Injury"                     
## 
## $symbols
##  [1] "X"   "IGP" "TMA" "PI"  "CS"  "C"   "PGP" "FL"  "G"   "NMF" "CTD" "TW"  "O"  
## 
## $version
## [1] "1.2.0"
## 
## Directed acyclic graph (DAG) object created by dagR version 1.2.0.
## For a convenient interpreted presentation, use summary().
summary(complex_dag)
## DAG created by dagR version 1.2.0.
##  - from X = Warm-up Exercises to Y = Injury
##  - acyclic: TRUE
##  - 13 nodes: X, Y, 11 cov(s) (0 unknown/unmeasured)
##    - node 2 = Intra-game proprioception
##    - node 3 = Team motivation, aggression
##    - node 4 = Previous injury
##    - node 5 = Contact sport
##    - node 6 = Coach
##    - node 7 = Pre-game proprioception
##    - node 8 = Fitness level
##    - node 9 = Genetics
##    - node 10 = Neuromuscular fatigue
##    - node 11 = Connective Tissue Disorder
##    - node 12 = Tissue weakness
##  - alternative node symbols:  X IGP TMA PI CS C PGP FL G NMF CTD TW O
##  - 20 arc(s) (0 undirected)
##  - currently unadjusted
##  - potentially biasing paths not sought
##  - possible adjustment sets not examined

Plot

By default, dagR arranges covariates in a “semi-circle” pattern, which doesn't work too well for a large DAG such as this one.

complex_fig0 <- dag.draw(complex_dag)

Figure: the complex DAG with dagR's default semi-circle layout.

We can provide custom co-ordinates to make our DAG more closely resemble the form of Figure 2a.

# Co-ordinates specified in the order of symbols
# "X"   "IGP" "TMA" "PI"  "CS"  "C"   "PGP" "FL"  "G"   "NMF" "CTD" "TW"  "O"  
complex_dag$x <- c(0, 1, 0, 0.5, 1.0, 0.5, 0.5, 1.0, 1.5, 1.5, 2.0, 2.0, 2)
complex_dag$y <- c(0.25, 0.25, 1, 0.5, 1.0, 1.5, 1.0, 1.5, 1.5, 1.0, 1.5, 0.5, 0.25)
complex_fig1 <- dag.draw(complex_dag, noxy = 2)

Figure: the complex DAG with custom co-ordinates, resembling Figure 2a.

Simulate

I'm going to randomly generate prevalences between 0.1 and 0.9 for the binary variables and means between 10 and 100 for the continuous variables. All standard deviations will be set to 1, and all twenty direct effects will be drawn from a Uniform(-1, 1) distribution.
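
One practical note: the call below draws these parameters inline inside dag.sim(), which makes them awkward to report afterwards. A small variation (my own preference, not a dagR requirement) is to draw and store them first, then pass the stored vectors to the mu and b arguments:

# Optional variation: store the random parameters so they can be inspected and reused
mus <- c(runif(1, min = 0.1, max = 0.9),  # X (prevalence)
         runif(2, min = 10,  max = 100),  # IGP, TMA (means)
         runif(3, min = 0.1, max = 0.9),  # PI, CS, C
         runif(2, min = 10,  max = 100),  # PGP, FL
         runif(5, min = 0.1, max = 0.9))  # G, NMF, CTD, TW, O
bs <- runif(20, min = -1, max = 1)        # direct effects for the 20 arcs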

complex_sim <- dag.sim(
                        # REQUIRED: PROVIDE DAG OBJECT AND DESIRED SAMPLE SIZE
                        complex_dag,
                        n = 100,

                        # REQUIRED: SPECIFY NODE TYPES (1 = BINARY, 0 = CTS)
                        binary = c(1,  # X
                                   0,  # IGP
                                   0,  # TMA
                                   1,  # PI
                                   1,  # CS
                                   1,  # C
                                   0,  # PGP
                                   0,  # FL
                                   1,  # G
                                   1,  # NMF
                                   1,  # CTD
                                   1,  # TW
                                   1), # O

                        # REQUIRED: SPECIFY MEAN AND SD OF CTS NODES
                        # (*) MU = PREVALENCE FOR BINARY NODES
                        mu = c(runif(n = 1, min = 0.1, max = 0.9),    # X
                               runif(n = 2, min = 10,  max = 100),    # IGP, TMA
                               runif(n = 3, min = 0.1, max = 0.9),    # PI, CS, C
                               runif(n = 2, min = 10,  max = 100),    # PGP, FL
                               runif(n = 5, min = 0.1, max = 0.9)),   # G, NMF, CTD, TW, O
                        stdev = rep(1, 13),

                        # REQUIRED: SPECIFY DIRECT EFFECTS OF 20 ARCS
                        b = runif(n = 20, min = -1, max = 1),

                        # OPTIONAL: SET SEED FOR REPRODUCIBILITY
                        seed = 12345,  # set seed for reproducibility
                        naming = 2,    # use custom labels for output variable names
                        verbose = F    # suppress alg. output
)
## Note: undirected arcs are discarded, an xy arc is appended.
## simulating node 5 from Binomial, with prob 0.465184768103063
## simulating node 6 from Binomial, with prob 0.23309742808342
## simulating node 9 from Binomial, with prob 0.682164203003049
## node 11 : calculated from nodes 9 via arcs 14 with exp(b) 0.836072462190722 , shifted to probability 0.127628348022699 (added noise SD: 1 );
## node 12 : calculated from nodes 11 via arcs 16 with exp(b) 0.854790032790292 , shifted to probability 0.221898792125285 (added noise SD: 1 );
## node 3 : calculated from nodes 6 via arcs 8 with b 0.524216595105827 , shifted to mean 78.4884095494635 (added noise SD: 1 );
## node 4 : calculated from nodes 3 5 via arcs 4 5 with exp(b) 0.444515879520425 2.66794269837119 , shifted to probability 0.808899652957916 (added noise SD: 1 );
## node 8 : calculated from nodes 6 9 via arcs 9 12 with b -0.336310201324522 0.907394020818174 , shifted to mean 55.8301902096719 (added noise SD: 1 );
## node 10 : calculated from nodes 8 9 11 via arcs 11 13 15 with exp(b) 1.4709273414126 2.0516112877568 2.16855062119996 , shifted to probability 0.891789550334215 (added noise SD: 1 );
## node 7 : calculated from nodes 8 via arcs 10 with b 0.697887991555035 , shifted to mean 39.2585848039016 (added noise SD: 1 );
## node 1 : calculated from nodes 3 7 via arcs 3 7 with exp(b) 1.78283463041697 1.98592216989379 , shifted to probability 0.676723117008805 (added noise SD: 1 );
## node 2 : calculated from nodes 1 5 10 via arcs 1 6 18 with b 0.732577715069056 0.576788076665252 0.703376833815128 , shifted to mean 88.8195873773657 (added noise SD: 1 );
## node 13 : calculated from nodes 2 12 10 2 1 via arcs 2 17 19 20 21 with exp(b) 0.669151770139957 1.78079431432041 0.543818744665322 0.710830900554524 1 , shifted to probability 0.688547961972654 (added noise SD: 1 );
head(complex_sim)
##   X      IGP      TMA PI CS C      PGP       FL G NMF CTD TW O
## 1 0 88.64873 78.96827  1  0 0 39.65966 58.32906 1   1   0  0 0
## 2 1 89.27663 79.58407  0  0 1 40.24144 54.95490 0   1   0  0 0
## 3 0 88.73670 78.42844  0  0 1 39.05994 55.80366 1   0   0  1 1
## 4 0 87.55727 78.42953  1  0 1 38.62691 55.84037 1   1   0  0 0
## 5 0 87.70700 78.87595  1  0 0 38.22488 55.41399 1   1   0  1 1
## 6 1 89.25711 78.44596  0  0 0 39.28925 54.65930 1   1   0  0 1

Beauty!
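
As with the simple DAG, we can spot-check the simulated prevalences and means against the values echoed in the simulation log above:

# Compare column means against the mu values reported in the log
round(apply(complex_sim, MARGIN = 2, FUN = mean), 3)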


Use this code for your own research by downloading this blogpost as an RMarkdown file here.

