Embracing the EM algorithm: One continuous response


I’m currently working on a project that revolves around the EM algorithm, and am finally realizing the power of this machinery. It really is like that movie with Jim Carrey where he can’t stop seeing the number 23 everywhere, except for me it’s the EM algorithm. Apparently this is called THE BAADER-MEINHOF PHENOMENON, oooh that’s fancy. You’ve probably seen the EM algorithm around too – though perhaps you didn’t know it. It’s commonly used for estimation with missing data. A modified EM algorithm (EMis) is used by the Amelia library in R. The EM algorithm also underpins latent variable models, which makes sense because latent variables are really missing observations when you think about it, right?! The more I learn about statistics, the more I realize most things are really missing data problems… cough potential outcomes cough

Anyways, I was previously taught the EM algorithm using the classic multinomial example. This is a great teaching tool, but I’ve never run into a situation like this in my life (yet). But, I do run into mixture distributions a surprising amount – mostly when investigating heterogeneity within patient populations. There’s a whole textbook on this, see: Medical Applications of Finite Mixture Models. The EM algorithm makes a lot more sense to me in the context of mixture models:

  • We sample a group of patients and observe their response.
  • We notice a bimodal structure in the response distribution.
  • We hypothesize the observed distribution actually corresponds to two subpopulations or “classes.”
  • We don’t know who belongs to which subpopulation.
  • We estimate the probability of latent class membership using the EM algorithm.

Wouldn’t ya know it, this is unsupervised clustering.

In this blog post, I motivate the EM algorithm in the context of a two-component Gaussian mixture model. A thorough walkthrough of the underlying theory is provided. In this case, estimators take a nice closed form, but this is rarely the case for complex problems encountered in practice. R code for implementating the EM algorithm using the closed form estimators is provided. I also demonstrate how this model can be easily fit using the flexmix library.

Figure: A two-component Gaussian mixture density.

Figure: A two-component Gaussian mixture density.

Continue reading Embracing the EM algorithm: One continuous response

Analyze and visualize DAGs in R by integrating the dagR, daggity, and ggdag libraries


This blog post makes 3/3 for the year. I made it, folks! I wasn’t sure I would. It was a jam-packed year for us: I turned 30 and finished my Ph.D. at Western (not causal – well, maybe), we sold our first house and moved from Canada to the USA, and I promptly started my postdoc at Harvard. But it was a good year for us, too. We are having fun exploring Boston, I’m enjoying my postdoc work and co-workers, and I can now force my family to address all my Christmas presents to Dr. Emma – worth it. Very much looking forward to what the New Year brings.

Here’s looking at you, New Year! (Left: Dr. Emma, Right: Just Ethan)

D’ya like DAGs?

If so, good news! This blogpost is all about DAGs. Directed acyclic graphs, that is. One of the most popular posts on my site is Using a DAG to simulate data with the dagR library. I think this is really interesting. Generation of data from randomized controlled trials is relatively straightforward, but I imagine this is not the case for observational studies, confounding and all. DAGs represent a way to formalize our assumptions around causal mechanisms, i.e., X causes Y, and this relationship is confounded by Z, for example. As the adoption of causal inference grows (look at any statistical conference programme), it makes sense that we would want to assess novel methods using data generated according to realistic mechanisms. Simulation according to DAGs allows us to do this, or at least attempt to.

There are two things I want to comment on regarding my previous blogpost. First, I’d really like to see some more sophisticated ways of generating data according to DAGs. The dagR library is limited to continuous and binary exposures, which may be good enough for now but won’t be enough in the long run (survival endpoints are very common in medicine, so too are ordinal exposures or outcomes, etc). I find that the way direct effects are specified and defined also makes it challenging to know the “ground truth” in some cases. I’d like to see the packages do more heavy lifting.

I’m starting to see more materials around DAG simulation, like the DagSim framework for Python or papers like Illustrating How to Simulate Data From Directed Acyclic Graphs to Understand Epidemiologic Concepts and DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation. There might be some promising nuggets in here already, and movement in this area suggests we should see something promising soon. Also, thank you to the people working on this!

Second, the default diagrams produced by dagR aren’t bad, but they aren’t great. Sorry (see also: thank you above). Writing this blog post, I also noticed that an article on the use of dagR has also been published by the package authors in 2022. Funny enough, they seem to recognize it too – suggesting that the ggdag library in R can be used to improve the graphical layout of DAGs. Hey, wouldn’t you know it – that’s exactly what this blog post is about!! A great feature of the dagR library is that you can convert a dagR object to dagitty object with a single command. So, let’s make your already programmed DAG work for you!


This blog post is a quick tutorial on how to transform dagR objects into dagitty objects so you can construct publication-worthy visualizations of your DAG using the ggdag library.


Continue reading Analyze and visualize DAGs in R by integrating the dagR, daggity, and ggdag libraries

Nonparametric neighbours: U-statistic structural components and jackknife pseudo-observations for the AUC

Two of my recent blog posts focused on two different, but as we will see related, methods which essentially transform observed responses into a summary of their contribution to an estimate: structural components resulting from Sen’s (1960) decomposition of U-statistics and pseudo-observations resulting from application of the leave-one-out jackknife. As I note in this comment, I think the real value of deconstructing estimators in this way results from the use of these quantities, which in special (but common) cases are asymptotically uncorrelated and identically distributed, to: (1) simplify otherwise complex variance estimates and construct interval estimates, and (2) apply regression methods to estimators without an existing regression framework.

As discussed by Miller (1974), pseudo-observations may be treated as approximately independent and identically distributed random variables when the quantity of interest is a function of the mean or variance, and more generally, any function of a U-statistic. Several other scenarios where these methods are applicable are also outlined. Many estimators of popular “parameters” can actually be expressed as U-statistics. Thus, these methods are quite broadly applicable. A review of basic U-statistic theory and some common examples, notably the difference in means or the Wilcoxon Mann-Whitney test statistic, can be found within my blog post: One, Two, U: Examples of common one- and two-sample U-statistics.

As an example of use case (1), Delong et al. (1988) used structural components to estimate the variances and covariances of the areas under multiple, correlated receiver operator curves or multiple AUCs. Hanley and Hajian-Tilaki (1997) later referred to the methods of Delong et al. (1988) as “the cleanest and most elegant approach to variances and covariances of AUCs.” As an example of use case (2), Andersen & Pohar Perme (2010) provide a thorough summary of how pseudo-observations can be used to construct regression models for important survival parameters like survival at a single time point and the restricted mean survival time.

Now, structural components are restricted to U-statistics while pseudo-observations may be used more generally, as discussed. But, if we construct pseudo-observations for U-statistics, one of several “valid” scenarios, what is the relationship between these two quantities? Hanley and Hajian-Tilaki (1997) provide a lovely discussion of the equivalence of these two methods when applied to the area under the receiver operating characteristic curve or simply the AUC. This blog post follows their discussion, providing concrete examples of computing structural components and pseudo-observations using R, and demonstrating their equivalence in this special case.

Continue reading Nonparametric neighbours: U-statistic structural components and jackknife pseudo-observations for the AUC

Resampling, the jackknife, and pseudo-observations

Resampling methods approximate the sampling distribution of a statistic or estimator. In essence, a sample taken from the population is treated as a population itself. A large number of new samples, or resamples, are taken from this “new population”, commonly with replacement, and within each of these resamples, the estimate of interest is re-obtained. A large number of these estimate replicates can then be used to construct the empirical sampling distribution from which confidence intervals, bias, and variance may be estimated. These methods are particularly advantageous for statistics or estimators for which no standard methods apply or are difficult to derive.

The jackknife is a popular resampling method, first introduced by Quenouille in 1949 as a method of bias estimation. In 1958, jackknifing was both named by Tukey and expanded to include variance estimation. A jackknife is a multipurpose tool, similar to a swiss army knife, that can get its user out of tricky situations. Efron later developed the arguably most popular resampling method, the bootstrap, in 1979 after being inspired by the jackknife.

In Efron’s (1982) book The jackknife, the bootstrap, and other resampling plans, he states,

Good simple ideas, of which the jackknife is a prime example, are our most precious intellectual commodity, so there is no need to apologize for the easy mathematical level.

Despite existing since the 1940’s, resampling methods were infeasible due to the computational power required to perform resampling and recalculate estimates many times. With today’s computing power, the uncomplicated yet powerful jackknife, and resampling methods more generally, should be a tool in every analyst’s toolbox.

Continue reading Resampling, the jackknife, and pseudo-observations