Multiple Imputation by Chained Equations with Predictive Mean Matching for Cluster Randomized Trials using mice and miceadds in R

A journal article recently published in Clinical Trials caught my eye: “How is missing data handled in cluster randomized trials? A review of trials published in the NIHR Journals Library 1997-2024.” I immediately knew this was going to be a poor report card. Indeed, the abstract summarizes:

Among the 110 identified cluster randomized controlled trials, 45% (50/110) did not report or take any action on missing data in either primary analysis or sensitivity analysis. In total, 75% (82/110) of the identified cluster randomized controlled trials did not impute missing values in their primary analysis. Advanced methods like multiple imputation were applied in only 15% (16/110) of primary analyses and 28% (31/110) of sensitivity analyses. On the contrary, the review highlighted that missing data handling methods have evolved over time, with an increasing adoption of multiple imputation since 2017.

It is promising to see that adoption of multiple imputation is increasing, but almost half of all studied cluster trials did not address missing data at all! BIG YIKES?!? Honestly, I have to say I am not at all surprised. Missing data is hard. Cluster randomized trials are hard. Together…

Do you ever feel like you just can’t do anything right?! Well, *do I have* more problems for you…

Working on cluster randomized trials introduces a slew of additional considerations, even in the presence of “simpler” parallel two-arm designs with completely observed data. Do I have enough clusters in each arm to reliably estimate between-cluster variances? What about the number of individuals per cluster to estimate within-cluster variances? What if I have repeated measures for each subject, so I have additional correlation to worry about? WHAT ABOUT MY TREATMENT EFFECT?!!! Should I be estimating cluster-level effects or individual-level effects? How do I incorporate cluster-level and individual-level covariates into my models? WHAT DO MY TREATMENT EFFECTS EVEN REPRESENT?!!!

If you are going to address missing data in your cluster trial, you need to make further considerations and assumptions. We can’t even evaluate most of these assumptions. Are my data MCAR, MAR, or MNAR? If I think my data is missing not at random, what assumptions do I feel I can reasonably make about the missing mechanism? What if I’m wrong? What are reasonable sensitivity analyses to perform? What auxiliary variables do I need to include in my imputation models? Should they be at the cluster-level or the individual-level? Perhaps most importantly, how do I properly capture clustering using multiple imputation?!!! Maybe I could borrow observed responses from similar participants in the same cluster… But what if my clusters are small? What if my clusters are large? Does my strategy need to change? (SPOILER ALERT: yes)

No wonder multiple imputation was only applied in 15% of studied primary analyses. Not to mention that two-level predictive mean matching, which I personally think is a very elegant and pragmatic approach to imputing ordinal data, was not even implemented in the popular add-on missing data R package “mice” until 2016. If you wanted to use existing tools before that, you would often be faced with suboptimal imputation using a continuous model that doesn’t capture the bounded, discrete nuances of your data. Real talk: It’s also possible, particularly in the presence of relatively few missing observations, that the statisticians on some of these cluster trials “not addressing missing data” weighed the risk of bias due to the missing data against their confidence in the required assumptions or inputs, or the effort required to implement them, and decided it was not worthwhile. Practical considerations are important too.

Anyhow, while I do think there is still a long way to go to make missing data methods for cluster trials accessible, the good news is that several, different two-level (level 1: individual; level 2: cluster) multiple imputation methods are now available in the popular and easy-to-use mice and add-on miceadds packages in R. This includes predictive mean matching! Yay! In this blog post, I will demonstrate how to impute missing ordinal data in a cluster randomized trial using two-level predictive mean matching via the mice and miceadds packages so you can avoid contributing to bad report cards! More yay – we love avoiding citation for bad practices!
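To give a flavour of what that looks like, here is a minimal sketch rather than the full analysis from the post. It assumes the mice, miceadds and lme4 packages are installed. The toy data set and every variable name in it (`cluster`, `arm`, `x`, `y`) are made up for illustration. In the predictor matrix, `-2` flags the cluster identifier and `1` flags a fixed-effect predictor.

```r
library(mice)      # chained-equations engine
library(miceadds)  # supplies the "2l.pmm" imputation method (uses lme4 internally)

set.seed(2025)
n_clusters <- 20
n_per_cluster <- 15

# Hypothetical toy trial: clusters alternate between control (0) and treatment (1)
cluster <- rep(seq_len(n_clusters), each = n_per_cluster)
arm <- rep(rep(0:1, length.out = n_clusters), each = n_per_cluster)
u <- rep(rnorm(n_clusters, sd = 0.5), each = n_per_cluster)  # cluster random intercepts
x <- rnorm(n_clusters * n_per_cluster)                       # individual-level covariate
latent <- 0.5 * arm + 0.4 * x + u + rnorm(length(x))

# Ordinal outcome scored 1 to 5, with missingness depending on the observed x (MAR)
y <- as.numeric(cut(latent, breaks = c(-Inf, -1, 0, 1, 2, Inf)))
y[runif(length(y)) < plogis(-1.5 + 0.5 * x)] <- NA

dat <- data.frame(cluster = as.integer(cluster), arm = arm, x = x, y = y)

# Only y has missing values, so only its method and predictor row matter
meth <- make.method(dat)
meth["y"] <- "2l.pmm"

pred <- make.predictorMatrix(dat)
pred["y", "cluster"] <- -2        # cluster identifier
pred["y", c("arm", "x")] <- 1     # fixed effects in the imputation model

imp <- mice(dat, m = 5, method = meth, predictorMatrix = pred,
            seed = 2025, printFlag = FALSE)
completed <- mice::complete(imp, 1)  # first of the five completed data sets
table(completed$y)
```

Because predictive mean matching borrows observed values from donors, every imputed `y` is one of the scores actually seen in the data, which is exactly what we want for a bounded ordinal scale.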


EM Algorithm Essentials: Estimating standard errors using the empirical information matrix

Introduction

At the end of my latest blog post, I promised that I would talk about how to perform constrained maximization using unconstrained optimizers. This can be accomplished by employing clever transformations and the nice properties of maximum likelihood estimators – see this fantastic post from the now defunct 🙁 Econometrics Beat. Eventually I will get around to discussing more of the statistical details behind this approach and its implementation from scratch in R.

RIGHT NOW, I want to talk about how to obtain standard errors for Gaussian mixture model parameters estimated using the EM algorithm. So far, I’ve written two posts with respect to Gaussian mixtures and the EM algorithm:

Embracing the EM algorithm: One continuous response, which motivates the theory behind the EM algorithm using a two component Gaussian mixture model; and
EM Algorithm Essentials: Maximizing objective functions using R’s optim, which demonstrates how to implement log-likelihood maximization from scratch using the optim function.

Both posts so far have focused only on POINT ESTIMATION! That is, obtaining estimates of our mixture model parameters.

Any estimate we obtain using any statistical method has some uncertainty associated with it. We quantify the uncertainty of a parameter estimate by its STANDARD ERROR. If we repeated the same experiment with comparable samples a large number of times, the standard error would reflect how much our estimates differ from experiment to experiment. Indeed, we would estimate the standard error as the standard deviation of the effect estimates across experiments. If we can understand how our estimates vary across experiments, i.e., estimate their standard errors, we can perform statistical inference by testing hypotheses or constructing confidence intervals, for example.
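Here is a small simulation of that thought experiment, as a sketch only. The sample size, number of repetitions and the Normal(10, 2) population are arbitrary choices for illustration. We repeat the "experiment" many times, estimate a mean each time, and take the standard deviation of those estimates.

```r
set.seed(2024)

n_experiments <- 5000     # how many times we repeat the experiment
n_per_experiment <- 50    # sample size within each experiment
true_sd <- 2

# One estimate (the sample mean) per repeated experiment
estimates <- replicate(
  n_experiments,
  mean(rnorm(n_per_experiment, mean = 10, sd = true_sd))
)

# Standard error estimated as the spread of the estimates across experiments
empirical_se <- sd(estimates)

# What theory says the standard error of a sample mean should be
theoretical_se <- true_sd / sqrt(n_per_experiment)

c(empirical = empirical_se, theoretical = theoretical_se)
```

The two numbers should land very close to each other, which is the whole point: when theory gives us the standard error, we don't need to run thousands of experiments.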

We usually do not need to conduct all of these experiments because theory tells us what form this distribution of effect estimates takes! We refer to this distribution as the SAMPLING DISTRIBUTION. When the form of the sampling distribution is not obvious, we may use resampling techniques to estimate standard errors. The EM algorithm, however, attempts to obtain maximum likelihood estimates (MLEs), which have theoretical sampling distributions. For well-behaved densities, MLEs can be shown to be asymptotically unbiased and normally distributed, with a variance-covariance matrix equal to the inverse of the Fisher information matrix, i.e., a function of the expected second derivatives of the log-likelihood or, equivalently, the expected square (outer product) of the first derivatives…
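To make the information-matrix idea concrete, here is a sketch on a deliberately simple problem: a single Gaussian sample rather than a mixture. The simulated data and starting values are assumptions for illustration. We minimize the negative log-likelihood with `optim`, ask for the numerically differentiated Hessian, and invert it to get standard errors.

```r
set.seed(123)
y <- rnorm(200, mean = 5, sd = 2)

# Negative log-likelihood; sigma is optimized on the log scale so it stays positive
negloglik <- function(par, y) {
  mu <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

fit <- optim(
  par = c(median(y), log(mad(y))),  # rough data-driven starting values
  fn = negloglik,
  y = y,
  method = "BFGS",
  hessian = TRUE                    # numerical second derivatives at the optimum
)

# Hessian of the NEGATIVE log-likelihood = observed information matrix
vcov_hat <- solve(fit$hessian)
se_mu <- sqrt(vcov_hat[1, 1])

# Textbook standard error of the MLE of a Gaussian mean, for comparison
sigma_mle <- sqrt(mean((y - mean(y))^2))
c(from_hessian = se_mu, closed_form = sigma_mle / sqrt(length(y)))
```

The second diagonal element of `vcov_hat` is the variance of the estimate of log(sigma), not of sigma itself, because that is the scale we optimized on.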

The tricky bit with the EM algorithm is we don’t maximize the observed log-likelihood directly to estimate Gaussian mixture parameters. Instead, we maximize a more convenient objective function. People much smarter than I have figured out this can give you the same answer. A question remains: if standard errors are estimated using the second derivative of the log-likelihood, but we used the objective function, how do we properly quantify uncertainty? This is particularly pressing because the derivatives of the log-likelihood are not straightforward, for the same reasons that make optimization difficult.

Isaac Meilijson, another person smarter than me, provides us with a wealth of guidance in his 1989 paper titled A Fast Improvement to the EM Algorithm on its Own Terms (I love this title). In this blog post, we summarize and demonstrate how to construct simple estimators of the observed information matrix, and subsequently the standard errors of our EM estimates, when we have independent and identically distributed responses arising from a two-component Gaussian mixture model. Essentially, due to some nice properties, we can simply use the derivatives of our objective function, in lieu of our log-likelihood, to estimate the observed information. Standard errors computed using the observed information are then compared to those obtained using numerical differentiation of the observed log-likelihood.
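In symbols, and in my own notation rather than necessarily the paper's: let $\theta$ collect the mixture parameters, $y_i$ be the $i$th of $N$ independent responses, $z_i$ its latent label, and $f_c$ the complete-data density. The "nice property" is that each observation's score can be obtained from the EM objective instead of the observed log-likelihood:

$s_i(\theta) = \frac{\partial}{\partial \theta} \log f(y_i; \theta) = \frac{\partial}{\partial \theta'} E\left[ \log f_c(y_i, z_i; \theta') \mid y_i; \theta \right] \Big|_{\theta' = \theta}.$

The empirical information matrix is then built from these individual scores:

$I_e(\hat{\theta}) = \sum_{i=1}^{N} s_i(\hat{\theta})\, s_i(\hat{\theta})^{\top} - \frac{1}{N} \left( \sum_{i=1}^{N} s_i(\hat{\theta}) \right) \left( \sum_{i=1}^{N} s_i(\hat{\theta}) \right)^{\top},$

where the second term vanishes at the MLE because the total score is zero there. Standard errors follow from the inverse:

$\widehat{SE}(\hat{\theta}_j) = \sqrt{ \left[ I_e(\hat{\theta})^{-1} \right]_{jj} }.$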


EM Algorithm Essentials: Maximizing objective functions using R’s optim

Overview

In my previous blog post, I motivated the EM algorithm in the context of estimating the parameters of a two-component Gaussian mixture density. In this case, we can write the estimators of the mixing probability, means, and variances in a nice closed form, and I demonstrated how to implement the corresponding iterative estimation procedure from scratch. Results were then compared to those obtained from the very nice R package flexmix.

However, rarely do we get such nice closed form estimators! We usually need to use numerical methods to maximize our objective function directly. In this blog post, I demonstrate how we can specify our objective function, and use the optim function in R to obtain our parameter estimates. optim has lots of options, and we will cover how to change the optimization procedure and implement restrictions on our parameter spaces.
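As a quick taste of those options, here is a sketch on a toy single-Gaussian likelihood, not the mixture objective from the post. The simulated data and starting values are assumptions for illustration. It shows two ways to keep a standard deviation positive: box constraints via `method = "L-BFGS-B"`, and an unconstrained optimizer applied to a log-transformed parameter.

```r
set.seed(42)
y <- rnorm(500, mean = 3, sd = 1.5)

# optim() minimizes by default, so we hand it the NEGATIVE log-likelihood
negloglik <- function(par, y) {
  -sum(dnorm(y, mean = par[1], sd = par[2], log = TRUE))
}

# Option 1: restrict the parameter space directly with box constraints
fit_bounded <- optim(
  par = c(median(y), mad(y)),
  fn = negloglik,
  y = y,
  method = "L-BFGS-B",
  lower = c(-Inf, 1e-4),   # sd must stay strictly positive
  upper = c(Inf, Inf)
)

# Option 2: optimize log(sd) with an unconstrained method, then back-transform
negloglik_log <- function(par, y) {
  negloglik(c(par[1], exp(par[2])), y)
}
fit_transformed <- optim(
  par = c(median(y), log(mad(y))),
  fn = negloglik_log,
  y = y,
  method = "BFGS"
)

rbind(
  bounded = fit_bounded$par,
  transformed = c(fit_transformed$par[1], exp(fit_transformed$par[2]))
)
```

Both routes should agree to a few decimal places. Always check `$convergence` (0 means success) before trusting either.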

EM for two-component Gaussian mixture

Let’s quickly recap our motivation, previously discussed in Embracing the EM algorithm: One continuous response.

We randomly sample $N$ patients from a population, and examine the empirical density of their responses. We notice two modes, and based on prior knowledge, hypothesize that the density is actually a mixture of two Gaussian densities. For example, the density centered around greater responses may correspond to “healthy” patients and the other to “ill” patients. We would like to (1) estimate the probability of belonging to each subpopulation; and (2) estimate subpopulation parameters, e.g., the mean response among the healthy and among the ill. But, we have a problem: we don’t know who belongs to each subpopulation. In other words, subpopulation labels are unobserved or “latent.”

Figure: Observed two-component Gaussian mixture density (purple), and distribution of responses among latent healthy (blue) and ill (red) patient subpopulations.

We can represent the density of the observed responses as a mixture of the subpopulation densities:

$f(y) = \pi~ f_1(y) + (1-\pi)~ f_2(y).$

That is, individuals belong to the first subpopulation, distributed according to density $f_1(y)$, with probability $\pi$, and to the second, distributed according to $f_2(y)$, with probability $1-\pi$. We assume both densities are Gaussian with respective means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$.
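The mixture density above translates almost line for line into R. In this sketch the function names (`dmix`, `loglik_mix`) and the "true" parameter values are made up for illustration. We simulate by first drawing the latent label and then drawing the response from the matching Gaussian.

```r
# Mixture density: mix_prob * f1(y) + (1 - mix_prob) * f2(y)
dmix <- function(y, mix_prob, mu1, mu2, sigma1, sigma2) {
  mix_prob * dnorm(y, mean = mu1, sd = sigma1) +
    (1 - mix_prob) * dnorm(y, mean = mu2, sd = sigma2)
}

# Observed-data log-likelihood of the mixture
loglik_mix <- function(y, mix_prob, mu1, mu2, sigma1, sigma2) {
  sum(log(dmix(y, mix_prob, mu1, mu2, sigma1, sigma2)))
}

set.seed(1)
n <- 2000
mix_prob <- 0.3                       # probability of the first subpopulation
z <- rbinom(n, size = 1, prob = mix_prob)   # latent labels (never seen in practice)
y <- ifelse(z == 1,
            rnorm(n, mean = 2, sd = 1),     # subpopulation 1
            rnorm(n, mean = 6, sd = 1.2))   # subpopulation 2

# The log-likelihood should favour the true values over a clearly wrong guess
loglik_true <- loglik_mix(y, 0.3, 2, 6, 1, 1.2)
loglik_wrong <- loglik_mix(y, 0.5, 0, 4, 1, 1.2)
c(true = loglik_true, wrong = loglik_wrong)
```

In a real analysis we only get to see `y`; `z` is kept here purely so we can check our work.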


Embracing the EM algorithm: One continuous response

Overview

I’m currently working on a project that revolves around the EM algorithm, and am finally realizing the power of this machinery. It really is like that movie with Jim Carrey where he can’t stop seeing the number 23 everywhere, except for me it’s the EM algorithm. Apparently this is called THE BAADER-MEINHOF PHENOMENON, oooh that’s fancy. You’ve probably seen the EM algorithm around too – though perhaps you didn’t know it. It’s commonly used for estimation with missing data. A modified EM algorithm (EMis) is used by the Amelia library in R. The EM algorithm also underpins latent variable models, which makes sense because latent variables are really missing observations when you think about it, right?! The more I learn about statistics, the more I realize most things are really missing data problems… cough potential outcomes cough

Anyways, I was previously taught the EM algorithm using the classic multinomial example. This is a great teaching tool, but I’ve never run into a situation like this in my life (yet). But, I do run into mixture distributions a surprising amount – mostly when investigating heterogeneity within patient populations. There’s a whole textbook on this, see: Medical Applications of Finite Mixture Models. The EM algorithm makes a lot more sense to me in the context of mixture models:

We sample a group of patients and observe their response.
We notice a bimodal structure in the response distribution.
We hypothesize the observed distribution actually corresponds to two subpopulations or “classes.”
We don’t know who belongs to which subpopulation.
We estimate the probability of latent class membership using the EM algorithm.

Wouldn’t ya know it, this is unsupervised clustering.

In this blog post, I motivate the EM algorithm in the context of a two-component Gaussian mixture model. A thorough walkthrough of the underlying theory is provided. In this case, estimators take a nice closed form, but this is rarely the case for complex problems encountered in practice. R code for implementing the EM algorithm using the closed form estimators is provided. I also demonstrate how this model can be easily fit using the flexmix library.
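As a preview, here is a compact sketch of what the closed-form EM iterations can look like. It is not necessarily the exact code from the full post. The simulated data, starting values and stopping rule are assumptions for illustration. The E-step computes each patient's probability of belonging to class 1, and the M-step plugs those probabilities into weighted means and variances.

```r
set.seed(2023)
n <- 1000
z <- rbinom(n, 1, 0.4)                                   # latent class labels
y <- ifelse(z == 1, rnorm(n, 10, 1), rnorm(n, 15, 1.5))  # observed responses

# Starting values: equal mixing, lower/upper quartiles as means, pooled sd
pi_hat <- 0.5
mu <- unname(quantile(y, c(0.25, 0.75)))
sigma <- rep(sd(y), 2)
loglik_old <- -Inf

for (iter in 1:500) {
  # E-step: posterior probability that each observation belongs to class 1
  d1 <- pi_hat * dnorm(y, mu[1], sigma[1])
  d2 <- (1 - pi_hat) * dnorm(y, mu[2], sigma[2])
  w <- d1 / (d1 + d2)

  # M-step: closed-form weighted updates
  pi_hat <- mean(w)
  mu <- c(sum(w * y) / sum(w), sum((1 - w) * y) / sum(1 - w))
  sigma <- c(sqrt(sum(w * (y - mu[1])^2) / sum(w)),
             sqrt(sum((1 - w) * (y - mu[2])^2) / sum(1 - w)))

  # Stop when the observed log-likelihood barely changes
  loglik_new <- sum(log(pi_hat * dnorm(y, mu[1], sigma[1]) +
                        (1 - pi_hat) * dnorm(y, mu[2], sigma[2])))
  if (abs(loglik_new - loglik_old) < 1e-8) break
  loglik_old <- loglik_new
}

round(c(pi = pi_hat, mu1 = mu[1], mu2 = mu[2],
        sigma1 = sigma[1], sigma2 = sigma[2]), 3)
```

The posterior probabilities in `w` are the unsupervised clustering part: thresholding them at 0.5 assigns each patient to their most likely class.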

Figure: A two-component Gaussian mixture density.
