Analyze and visualize DAGs in R by integrating the dagR, daggity, and ggdag libraries

Huzzah!!

This blog post makes 3/3 for the year. I made it, folks! I wasn’t sure I would. It was a jam-packed year for us: I turned 30 and finished my Ph.D. at Western (not causal – well, maybe), we sold our first house and moved from Canada to the USA, and I promptly started my postdoc at Harvard. But it was a good year for us, too. We are having fun exploring Boston, I’m enjoying my postdoc work and co-workers, and I can now force my family to address all my Christmas presents to Dr. Emma – worth it. Very much looking forward to what the New Year brings.



Here’s looking at you, New Year! (Left: Dr. Emma, Right: Just Ethan)

D’ya like DAGs?

If so, good news! This blogpost is all about DAGs. Directed acyclic graphs, that is. One of the most popular posts on my site is Using a DAG to simulate data with the dagR library. I think this is really interesting. Generation of data from randomized controlled trials is relatively straightforward, but I imagine this is not the case for observational studies, confounding and all. DAGs represent a way to formalize our assumptions around causal mechanisms, i.e., X causes Y, and this relationship is confounded by Z, for example. As the adoption of causal inference grows (look at any statistical conference programme), it makes sense that we would want to assess novel methods using data generated according to realistic mechanisms. Simulation according to DAGs allows us to do this, or at least attempt to.

There are two things I want to comment on regarding my previous blogpost. First, I’d really like to see some more sophisticated ways of generating data according to DAGs. The dagR library is limited to continuous and binary exposures, which may be good enough for now but won’t be enough in the long run (survival endpoints are very common in medicine, so too are ordinal exposures or outcomes, etc). I find that the way direct effects are specified and defined also makes it challenging to know the “ground truth” in some cases. I’d like to see the packages do more heavy lifting.

I’m starting to see more materials around DAG simulation, like the DagSim framework for Python or papers like Illustrating How to Simulate Data From Directed Acyclic Graphs to Understand Epidemiologic Concepts and DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation. There might be some promising nuggets in here already, and movement in this area suggests we should see something promising soon. Also, thank you to the people working on this!

Second, the default diagrams produced by dagR aren’t bad, but they aren’t great. Sorry (see also: thank you above). Writing this blog post, I also noticed that an article on the use of dagR has also been published by the package authors in 2022. Funny enough, they seem to recognize it too – suggesting that the ggdag library in R can be used to improve the graphical layout of DAGs. Hey, wouldn’t you know it – that’s exactly what this blog post is about!! A great feature of the dagR library is that you can convert a dagR object to dagitty object with a single command. So, let’s make your already programmed DAG work for you!

TL;DR

This blog post is a quick tutorial on how to transform dagR objects into dagitty objects so you can construct publication-worthy visualizations of your DAG using the ggdag library.

library(dagR)
library(dagitty)
library(ggdag)
library(tidyverse)

Confounding triangle

In my previous blog post, we did the hard work of specifying our DAGs in dagR to simulate data. I’m all about working smarter, not harder. So, let’s copy our previously written DAG code.

Check out Using a DAG to simulate data with the dagR library for more detail on how to construct and simulate data using DAGs in R!

We’ll start first with the classic confounding triangle: exposure X causes outcome O, and this relationship is confounded by their common cause C.

dagR_tri <- dag.init(
                         # REQUIRED: SPECIFY COVARIATES AND THEIR TYPE
                         covs = c(1),      # specify that single covariate (common cause) is "known"

                         # OPTIONAL: PROVIDE CUSTOM LABELS FOR DAG NODES AND SIM DATA
                         symbols = c('X',  # symbol for primary exposure within DAG
                                     'C',  # "      "   common cause     "      "
                                     'O'), # "      "   outcome          "      "

                         # REQUIRED: SPECIFY ARCS BETWEEN X, C, AND O
                         arcs = c(0, -1,   # arc from primary exposure to outcome
                                  1,  0,   # arc from common cause to exposure
                                  1, -1),  # arc from common cause to outcome

                         # OPTIONAL: PROVIDE NAMES FOR X, C, AND O (USED IN LEGEND)
                         x.name = 'Primary exposure',   # name of primary exposure
                         y.name = 'Outcome',            # name of outcome
                         cov.names = c('Common cause'), # (vector of) names of covariates

                         # REQUIRED: TURN OFF WEIRD DEFAULTS
                         noxy   = T        # tell dagR we want to specify our *own* arc between X and O
                         )

For reference, here is the default dagR plot of our confounding triangle:

confound_fig <- dag.draw(
                        dagR_tri,      # DAG object
                        numbering = F,     # Number arcs: T/F
                        legend = T,        # Legend: T/F
                        noxy = 2           # do not draw ? above X -> O arc
                       )

plot of chunk unnamed-chunk-3

Again not bad, but nothing to write home about. So, let’s transform our dagR object into a dagitty object so we can pass it to ggdag. This can be done with a single function call, sweet!

dagitty_tri <- dagR2dagitty(dagR_tri, only.code=F)

The following dagitty code is produced:

dagitty_tri
dag {
C
O [outcome]
X [exposure]
C -> O
C -> X
X -> O
}

This is a really nice feature of dagR. You plug this code into the online dagitty interface at dagitty.net, allowing you to explore your more complex DAGs using point-and-click functionality.

To plot our new dagitty object using ggdag, we need to “tidy up” our dagitty object using the tidy_dagitty function.

# Convert to tidy DAG object
tidy_tri <- tidy_dagitty(dagitty_tri)

If we simply use the ggdag command, we will get a plot of our nodes and arcs.

# ggdag of our confounding triangle!
main_tri <- ggdag(tidy_tri, text_size = 5) + 
  theme_dag() +
  labs(title='Confounding triangle')
main_tri

plot of chunk unnamed-chunk-8

Looks pretty! Our layout is a bit disorganized, though. We can fix this by specifying node coordinates explicitly within our dagitty object and tidying up our DAG again.

Note: coordinates are relative, so it doesn’t seem to matter exactly what value you put. For example, if you want one node higher than another, specify a higher y value. If you want a node on the same level, specify the same y value, etc. X and Y coordinates should be specified for nodes according to their name or label within the dagitty object.

coordinates(dagitty_tri) <- list(
  x = c(X=0, O=2, C=1),
  y = c(X=0, O=0, C=1)
  )
tidy_tri <- tidy_dagitty(dagitty_tri)
main_tri <- ggdag(tidy_tri, text_size = 5) + 
  theme_dag() +
  labs(title='Confounding triangle')
main_tri

plot of chunk unnamed-chunk-12

That’s better!

If we want to highlight “special nodes,” in this case our exposure and outcome of interest, we can use the following command instead:

status_tri <- ggdag_status(tidy_tri, text_size=5) +
  theme_dag() +
  labs(title='Confounding triangle', col='Status') +
  theme(legend.position='bottom')
status_tri

plot of chunk unnamed-chunk-14
Another nice feature of the ggdag library is that it will visualize our adjustment sets for us!

# adjustment sets for our confounding triangle
adj_tri <- ggdag_adjustment_set(tidy_tri, text_size = 5) +
  theme_dag() +
  labs(title = 'Adjustment sets', col='Adjusted', shape='Adjusted') +
  theme(legend.position='bottom')
adj_tri 

plot of chunk unnamed-chunk-16

Brilliant, we get exactly the result we expect. If we adjust for common cause C, we can unbiasedly estimate the total effect of exposure X on outcome O.

We can also print out the adjustment sets using dagitty, for reference:

adjustmentSets(x=dagitty_tri, exposure="X", outcome="O",
               effect = c('total', 'direct'))
## { C }

Finally, we can mash our plots together to create a two-panel figure for publication:

gridExtra::grid.arrange(status_tri, adj_tri, ncol=2)

plot of chunk unnamed-chunk-18

It’s that easy!!11!

Shrier & Platt (2008)

Okay, let’s practice our newly learned skills on a more complex DAG. Again, we will reuse our already specified dagR object to recreate the complex DAG from Shrier & Platt (2008).

Check out Using a DAG to simulate data with the dagR library for more detail on how to construct and simulate data using DAGs in R!

dagR_shrier <- dag.init(
                         # REQUIRED: SPECIFY COVARIATES AND THEIR TYPE
                         covs = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),

                         # OPTIONAL: PROVIDE CUSTOM LABELS FOR DAG NODES AND SIM DATA
                         symbols = c('X',   # 0: Warm-up exercises (primary exposure)
                                     'IGP', # 1: Intra-game proprioception 
                                     'TMA', # 2: Team motivation, aggression 
                                     'PI',  # 3: Previous injury 
                                     'CS',  # 4: Contact sport 
                                     'C',   # 5: Coach 
                                     'PGP', # 6: Pre-game proprioception 
                                     'FL',  # 7: Fitness level 
                                     'G',   # 8: Genetics  
                                     'NMF', # 9: Neuromuscular fatigue
                                     'CTD', # 10: Connective Tissue Disorder
                                     'TW',  # 11: Tissue weakness
                                     'O'),  # -1: Injury (outcome)

                         # REQUIRED: SPECIFY ARCS BETWEEN X, C, AND O
                         arcs = c(0,   1,      # X to IGP
                                  1,  -1,      # IGP to O
                                  2,   0,      # TMA to X
                                  2,   3,      # TMA to PI
                                  4,   3,      # CS to PI
                                  4,   1,      # CS to IGP
                                  6,   0,      # PGP to X
                                  5,   2,      # C to TMA
                                  5,   7,      # C to FL
                                  7,   6,      # FL to PGP
                                  7,   9,      # FL to NMF
                                  8,   7,      # G to FL
                                  8,   9,      # G to NMF
                                  8,  10,      # G to CTD
                                  10,  9,      # CTD to NMF
                                  10, 11,      # CTD to TW
                                  11, -1,      # TW to O
                                  9,   1,      # NMF to IGP
                                  9,  -1,      # NMF to O
                                  1,  -1),     # arc from common cause to outcome

                         # OPTIONAL: PROVIDE NAMES FOR X, C, AND O (USED IN LEGEND)
                         x.name = 'Warm-up Exercises',   # name of primary exposure
                         y.name = 'Injury',            # name of outcome
                         cov.names = c('Intra-game proprioception',
                                       'Team motivation, aggression',
                                       'Previous injury',
                                       'Contact sport',
                                       'Coach',
                                       'Pre-game proprioception',
                                       'Fitness level',
                                       'Genetics',
                                       'Neuromuscular fatigue',
                                       'Connective Tissue Disorder',
                                       'Tissue weakness'),

                         # REQUIRED: TURN OFF WEIRD DEFAULTS
                         noxy   = T        # tell dagR we want to specify our *own* arc between X and O
                         )

For reference, here is the default dagR plot of this DAG:

shrier_fig <- dag.draw(
                        dagR_shrier,      # DAG object
                        numbering = F,     # Number arcs: T/F
                        legend = T,        # Legend: T/F
                        noxy = 2           # do not draw ? above X -> O arc
                       )

plot of chunk unnamed-chunk-20
Aaaah! Let’s clean this up.

dagitty_shrier <- dagR2dagitty(dagR_shrier, only.code=F)
coordinates(dagitty_shrier) <- list(
  x = c(X=0, IGP=2, TMA=0, PI=1, CS=2, C=1, PGP=1, FL=2, G=3, NMF=3, 
        CTD=4, TW=4, O=4),
  y = c(X=0, IGP=0, TMA=2, PI=1, CS=2, C=3, PGP=2, FL=3, G=3, NMF=2, 
        CTD=3, TW=1, O=0)
)
tidy_shrier <- tidy_dagitty(dagitty_shrier)
status_shrier <- ggdag_status(tidy_shrier, text_size=5) +
  theme_dag() +
  labs(title='Shrier & Platt (2008)', col='Status') +
  theme(legend.position='bottom')
status_shrier

plot of chunk unnamed-chunk-25
A work of art, truly. Now lets take a look at adjustment sets. I imagine for a DAG of this size and complexity, there will be quite a few.

Adjustment sets for the total effect of exposure X on outcome O are provided by:

adjustmentSets(x=dagitty_shrier, exposure="X", outcome="O", effect = c('total'))
## { NMF, TW }
## { CTD, NMF }
## { FL, G }
## { C, FL }
## { FL, TMA }
## { C, PGP }
## { PGP, TMA }

Adjustment sets for the direct effect of exposure X on outcome O are provided by:

adjustmentSets(x=dagitty_shrier, exposure="X", outcome="O", effect = c('direct'))
## { IGP, NMF, TW }
## { CTD, IGP, NMF }
## { FL, G, IGP, NMF }
## { C, FL, IGP, NMF }
## { FL, IGP, NMF, TMA }
## { C, IGP, NMF, PGP }
## { IGP, NMF, PGP, TMA }
# adjustment sets for our confounding triangle
adj_shrier <- ggdag_adjustment_set(tidy_shrier, text_size = 5) +
  theme_dag() +
  labs(title = 'Adjustment sets', col='Adjusted', shape='Adjusted') +
  theme(legend.position='bottom')
adj_shrier

plot of chunk unnamed-chunk-29

Ummm, yeah okay. Lots of adjustment sets, or should I say, adjustment blobs. As a note, it also appears as though only total adjustment sets are plotted. Perhaps this is a scenario where you might want to report your adjustment sets in text form… 🙂


If you are reading this and know how to better customize this adjustment plot, please leave a comment!

The end



Well — that’s all you’ll hear from me for this year, folks!
Thanks for reading, and I wish anyone that stumbles upon these ramblings a very happy new year!

Published by

Emma Davies Smith

Emma Davies Smith is currently a Postdoctoral Research Fellow in Biostatistics at the Harvard T. H. Chan School of Public Health. Her current research interests include randomized controlled trials, correlated data, clinical epidemiology, and causal inference. When she's not working on expanding her knowledge of statistics, she's busy petting cats and unsuccessfully convincing her husband to let her adopt them, hiking, and concocting indie and folk rock playlists.

Leave a Reply

Your email address will not be published. Required fields are marked *