A journal article recently published in Clinical Trials caught my eye: “How is missing data handled in cluster randomized trials? A review of trials published in the NIHR Journals Library 1997-2024.” I immediately knew this was going to be a poor report card. Indeed, the abstract summarizes:
Among the 110 identified cluster randomized controlled trials, 45% (50/110) did not report or take any action on missing data in either primary analysis or sensitivity analysis. In total, 75% (82/110) of the identified cluster randomized controlled trials did not impute missing values in their primary analysis. Advanced methods like multiple imputation were applied in only 15% (16/110) of primary analyses and 28% (31/110) of sensitivity analyses. On the contrary, the review highlighted that missing data handling methods have evolved over time, with an increasing adoption of multiple imputation since 2017.
It is promising to see that adoption of multiple imputation is increasing, but almost half of all studied cluster trials did not address missing data at all! BIG YIKES?!? Honestly, I have to say I am not at all surprised. Missing data is hard. Cluster randomized trials are hard. Together…

If you are going to address missing data in your cluster trial, you need to make further considerations and assumptions, and we can’t even evaluate most of these assumptions. Are my data missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)? If I think my data are missing not at random, what assumptions do I feel I can reasonably make about the missingness mechanism? What if I’m wrong? What are reasonable sensitivity analyses to perform? What auxiliary variables do I need to include in my imputation models? Should they be at the cluster level or the individual level? Perhaps most importantly, how do I properly capture clustering using multiple imputation?!!! Maybe I could borrow observed responses from similar participants in the same cluster… But what if my clusters are small? What if my clusters are large? Does my strategy need to change? (SPOILER ALERT: yes)
No wonder multiple imputation was only applied in 15% of studied primary analyses. Not to mention that two-level predictive mean matching, which I personally think is a very elegant and pragmatic approach to imputing ordinal data, was not even implemented in an add-on to the popular missing data R package “mice” until 2016. If you wanted to use existing tools before that, you would often be faced with suboptimal imputation using a continuous model that doesn’t capture the bounded, discrete nuances of your data. Real talk: It’s also possible, particularly when relatively few observations are missing, that the statisticians on some of these cluster trials “not addressing missing data” weighed the risk of bias due to the missing data against their confidence in the required assumptions or inputs, or against the effort required to implement a method, and decided it was not worthwhile. Practical considerations are important too.
Anyhow, while I do think there is still a long way to go to make missing data methods for cluster trials accessible, the good news is that several different two-level (level 1: individual; level 2: cluster) multiple imputation methods are now available in the popular and easy-to-use mice package and its add-on miceadds package in R. This includes predictive mean matching! Yay! In this blog post, I will demonstrate how to impute missing ordinal data in a cluster randomized trial using two-level predictive mean matching via the mice and miceadds packages so you can avoid contributing to bad report cards! More yay – we love avoiding being cited for bad practices!

