Embracing the EM algorithm: One continuous response

Overview

I’m currently working on a project that revolves around the EM algorithm, and am finally realizing the power of this machinery. It really is like that movie with Jim Carrey where he can’t stop seeing the number 23 everywhere, except for me it’s the EM algorithm. Apparently this is called THE BAADER-MEINHOF PHENOMENON, oooh that’s fancy. You’ve probably seen the EM algorithm around too – though perhaps you didn’t know it. It’s commonly used for estimation with missing data. A modified EM algorithm (EMis) is used by the Amelia library in R. The EM algorithm also underpins latent variable models, which makes sense because latent variables are really missing observations when you think about it, right?! The more I learn about statistics, the more I realize most things are really missing data problems… cough potential outcomes cough

Anyways, I was previously taught the EM algorithm using the classic multinomial example. This is a great teaching tool, but I’ve never run into a situation like this in my life (yet). But, I do run into mixture distributions a surprising amount – mostly when investigating heterogeneity within patient populations. There’s a whole textbook on this, see: Medical Applications of Finite Mixture Models. The EM algorithm makes a lot more sense to me in the context of mixture models:

  • We sample a group of patients and observe their response.
  • We notice a bimodal structure in the response distribution.
  • We hypothesize the observed distribution actually corresponds to two subpopulations or “classes.”
  • We don’t know who belongs to which subpopulation.
  • We estimate the probability of latent class membership using the EM algorithm.

Wouldn’t ya know it, this is unsupervised clustering.

In this blog post, I motivate the EM algorithm in the context of a two-component Gaussian mixture model. A thorough walkthrough of the underlying theory is provided. In this case, the estimators take a nice closed form, but this is rarely true for the more complex problems encountered in practice. R code for implementing the EM algorithm using the closed-form estimators is provided. I also demonstrate how this model can be easily fit using the flexmix library.

Figure: A two-component Gaussian mixture density.

Theory walkthrough

Density of a two-component Gaussian mixture

Consider a random sample of i=1, ..., N individuals. Let Y_{1}, ..., Y_{N} represent their independent and identically distributed responses, which we assume arise from a two-component Gaussian mixture model. Individuals belong to the first component (i.e., subpopulation) with probability \pi_1 = \pi and to the second component with probability \pi_2 = (1-\pi). The (mixture) density of the responses is formally provided by,

    \begin{equation*} f(y) = \pi_1 f_{1}(y) + \pi_2 f_{2}(y) \end{equation*}

where f_k(y)=f_k(y; \mu_k, \sigma_k) corresponds to the Normal density of the k^{th} subpopulation, k \in \lbrace 1, 2\rbrace, with mean \mu_k and variance \sigma_{k}^2. Equivalently, in the simple two-class scenario, we can express this as:

    \begin{equation*} f(y) = \pi f_{1}(y) + (1-\pi) f_{2}(y) \end{equation*}

since \pi_2 = 1-\pi_1. Thus, the mixture density is essentially an average of the two component densities, weighted by their respective component probabilities.
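
To make this concrete, here is a minimal R sketch of the mixture density, using the same illustrative parameter values as the simulation later in this post (\pi = 0.6, \mu_1 = 5, \sigma_1 = 1, \mu_2 = 2, \sigma_2 = 1.25):

# Mixture density: weighted average of two Normal densities
dmix <- function(y, pi1, mu1, sigma1, mu2, sigma2){
  pi1 * dnorm(y, mean=mu1, sd=sigma1) + (1 - pi1) * dnorm(y, mean=mu2, sd=sigma2)
}
# Plot the mixture density over a grid of y values
curve(dmix(x, pi1=0.6, mu1=5, sigma1=1, mu2=2, sigma2=1.25),
      from=-2, to=9, xlab='y', ylab='f(y)')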

Let \pmb{\theta}' = (\pi, \mu_1, \mu_2, \sigma_1, \sigma_2) represent the vector of all parameters. The corresponding likelihood is:

    \begin{align*} L(\pmb{\theta}) = \prod_{i=1}^{N} f(y_{i}; \pmb{\theta}) = \prod_{i=1}^{N} \left\lbrace \pi f_{1}(y_i) + (1-\pi) f_{2}(y_i) \right\rbrace \end{align*}

and subsequently, the log-likelihood is:

    \begin{align*} \ell(\pmb{\theta}) = \log \left[ \prod_{i=1}^{N} f(y_i; \pmb{\theta}) \right] = \sum_{i=1}^{N} \log \left\lbrace \pi f_{1}(y_i) + (1-\pi) f_{2}(y_i) \right\rbrace. \end{align*}

While the density and corresponding (log-)likelihood appear simple in form, the weighted sum inside the logarithm makes the log-likelihood \ell(\pmb{\theta}) surprisingly challenging to maximize. Maximum likelihood estimation for mixture distributions is also notoriously ill-behaved (e.g., boundary problems, likelihood singularities). The EM algorithm reconceptualizes this problem to make our lives easier.
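
For the curious, nothing stops us from handing \ell(\pmb{\theta}) directly to a general-purpose optimizer such as optim. A rough sketch (assuming a response vector y, with \pi and the standard deviations transformed to an unconstrained scale) might look like the following, but this route is sensitive to starting values and can wander toward the degenerate solutions mentioned above:

# Negative observed-data log-likelihood on an unconstrained scale
negll <- function(theta, y){
  pi1    <- plogis(theta[1])                        # maps to (0, 1)
  mu1    <- theta[2]; mu2 <- theta[3]
  sigma1 <- exp(theta[4]); sigma2 <- exp(theta[5])  # maps to (0, Inf)
  -sum(log(pi1 * dnorm(y, mu1, sigma1) + (1 - pi1) * dnorm(y, mu2, sigma2)))
}
# e.g., fit <- optim(par=c(0, 4, 1, 0, 0), fn=negll, y=y)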

Latent class membership

Now, suppose that for each individual there exists an unobserved or “latent” binary random variable Z_{i} \sim \text{Bernoulli}(\pi) which dictates their component membership.
Then, the (observed) Y_i can be re-expressed in terms of the (latent) Z_{i}:

    \begin{equation*} Y_i = Z_{i} X_{1i} + (1-Z_{i}) X_{2i} \end{equation*}

where the X_{ki} are random deviates distributed according to the k^{th} density. That is, X_{1i} \stackrel{i.i.d.}{\sim} N(\mu_1, \sigma_1^2) and X_{2i} \stackrel{i.i.d.}{\sim} N(\mu_2, \sigma_2^2). The X_{1i}, X_{2i}, and Z_{i} are assumed to be independent.
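
As a quick illustration of this representation (again borrowing the parameter values used in the simulation later in the post), we could generate a handful of responses by first drawing Z_i and then mixing the component draws. The toy variable names below are hypothetical and not used elsewhere:

# Illustrative only: simulate Y_i = Z_i * X_1i + (1 - Z_i) * X_2i
n_toy  <- 10
z_toy  <- rbinom(n_toy, size=1, prob=0.6)   # latent class memberships
x1_toy <- rnorm(n_toy, mean=5, sd=1)        # deviates from component 1
x2_toy <- rnorm(n_toy, mean=2, sd=1.25)     # deviates from component 2
y_toy  <- z_toy * x1_toy + (1 - z_toy) * x2_toy
cbind(z_toy, y_toy)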

Notice that by definition, the conditional density of Y_i given Z_i is simply the Normal density corresponding to the component indicated by Z_i. This can be compactly represented as:

    \begin{equation*} f(y|z) = \left\lbrace f_{1}(y) \right\rbrace^{z} \left\lbrace f_{2}(y) \right\rbrace^{1-z}. \end{equation*}

That is, f(y|z=1) = f_{1}(y) and f(y|z=0) = f_{2}(y).

The multiplication rule for densities tells us that f(y, z) = f(y|z) f(z). Plugging in f(y|z) and f(z), we obtain the joint density of Y_i and Z_i:

    \begin{align*} f(y, z) &= \left[ \left\lbrace f_{1}(y) \right\rbrace^{z} \left\lbrace f_{2}(y) \right\rbrace^{1-z} \right] \left[ \pi^{z} (1-\pi)^{1-z} \right] = \left\lbrace \pi f_{1}(y) \right\rbrace^{z} \left\lbrace (1-\pi) f_{2}(y) \right\rbrace^{1-z}. \end{align*}

We can see that the density of the Y_i remains f(y) as defined above by simply marginalizing (i.e., summing) the joint distribution over the distinct values of Z_i:

    \begin{align*} f(y) &= \sum_{z=0}^{1} f(y, z) = \pi f_{1}(y) + (1-\pi) f_{2}(y). \end{align*}

Complete data likelihood

Although we obtain the same density for Y_i as before, the advantage of redefining the Y_i in this way is that we can now work with the joint density f(y, z). Unlike the marginal density f(y), f(y, z) has a product representation and is therefore much better suited to logarithmic transformation. This simplifies estimation significantly.

If we observed both Y_i and Z_i, the complete data likelihood would take the form:

    \begin{align*} L_{C}(\pmb{\theta}) &= \prod_{i=1}^{N} f(y_i, z_i; \pmb{\theta}) = \prod_{i=1}^{N} \left\lbrace \pi f_{1}(y_i) \right\rbrace^{z_i} \left\lbrace (1-\pi) f_{2}(y_i) \right\rbrace^{1-z_i} \end{align*}

and subsequently, the complete data log-likelihood is:

    \begin{align*} \ell_{C}(\pmb{\theta}) &= \sum_{i=1}^{N} z_i \left\lbrace \log \pi + \log f_1(y_i) \right\rbrace + (1-z_i) \left\lbrace \log (1-\pi) + \log f_2(y_i) \right\rbrace. \end{align*}

Notice two things: (1) the logarithm now acts directly on the Normal densities; and (2) the complete data log-likelihood is simply the sum of contributions from the two components. As a result, derivatives of the complete data log-likelihood will be extremely straightforward to calculate.

Conditional expectation of complete log-likelihood

Of course, we can’t actually observe z_i, so we cannot maximize the complete data log-likelihood directly. But, recall that conditional expectations are only functions of the parameters and the conditioning random variable. Thus, by taking the expectation of the complete data log-likelihood conditional on the observed Y_i, we get something we can actually maximize directly. When we condition on Y_i, the terms within the squiggly brackets of \ell_C can be treated as fixed, implying that:

    \begin{align*} \mathbb{E}\left[ \ell_{C}(\pmb{\theta}) | \pmb{Y} \right] = \sum_{i=1}^{N} \mathbb{E}(Z_i | Y_i) \left\lbrace \log \pi + \log f_1(y_i) \right\rbrace + \mathbb{E}(1-Z_i | Y_i) \left\lbrace \log (1-\pi) + \log f_2(y_i) \right\rbrace. \end{align*}

Using our previous work, we can easily derive the density of Z_i conditional on Y_i:

    \begin{align*} f(z|y) &= \frac{f(y, z)}{f(y)} = \frac{\left\lbrace \pi f_{1}(y) \right\rbrace^{z} \left\lbrace (1-\pi) f_{2}(y) \right\rbrace^{1-z}}{\pi f_{1}(y) + (1-\pi) f_{2}(y)}. \end{align*}

Now, note that \mathbb{E}(Z_i | Y_i) = \Pr(Z_i = 1 | Y_i) = f(z=1|y) and similarly \mathbb{E}(1-Z_i | Y_i) = f(z=0|y).
Plugging f(z|y) into the conditional expectation, we obtain:

    \begin{align*} \mathbb{E}\left[ \ell_{C}(\pmb{\theta}) | \pmb{Y} \right] &= \sum_{i=1}^{N} w_i \left\lbrace \log \pi + \log f_1(y_i) \right\rbrace + (1-w_i) \left\lbrace\log (1-\pi) + \log f_2(y_i) \right\rbrace \end{align*}

where:

    \[w_i = f(z=1|y_i) = \frac{\pi f_1(y_i)}{\pi f_1(y_i) + (1-\pi) f_2(y_i)}\]

    \[1-w_i = f(z=0|y_i) = \frac{(1-\pi) f_2(y_i)}{\pi f_1(y_i) + (1-\pi) f_2(y_i)}\]
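
As a quick numeric check of these weights (again using the illustrative values \pi = 0.6, \mu_1 = 5, \sigma_1 = 1, \mu_2 = 2, \sigma_2 = 1.25), a response of y = 4.5 sits close to \mu_1 and so should be assigned to component 1 with high probability:

# Posterior probability of component 1 membership for a single response
y0  <- 4.5
num <- 0.6 * dnorm(y0, mean=5, sd=1)
den <- 0.6 * dnorm(y0, mean=5, sd=1) + 0.4 * dnorm(y0, mean=2, sd=1.25)
num / den  # roughly 0.92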

The conditional expectation of the complete log-likelihood no longer features the unobserved Z_i which is nice. But, it does feature \pi – so how can we estimate these parameters?

The EM algorithm

We can maximize the expected complete log-likelihood with respect to \pmb{\theta}, treating the weights w_i as fixed. The catch is that the weights themselves depend on \pmb{\theta}, so we must iterate. In essence:

  1. Expectation: Current parameter values (initially \pmb{\theta}^{(0)}) are used to compute the weights w_{i}^{(0)}.
  2. Maximization: The conditional expectation of the complete log-likelihood is then maximized with respect to \pmb{\theta}, treating the w_{i}^{(0)} as constant and yielding updated estimates \pmb{\theta}^{(1)}.

The expectation and maximization steps are then repeated until convergence is reached. Hence, the “EM” algorithm!

Formally, at the v^{th} iteration, we consider the objective function:

    \begin{align*} Q(\pmb{\theta}, \pmb{\theta}^{(v)}) &= \sum_{i=1}^{N} w_i^{(v)} \left\lbrace \log \pi + \log f_1(y_i) \right\rbrace + (1-w_i^{(v)}) \left\lbrace \log (1-\pi) + \log f_2(y_i) \right\rbrace \end{align*}

where

    \begin{equation*} w_i^{(v)} = \frac{\pi^{(v)} f_1\left(Y_i; \mu_1^{(v)}, \sigma_1^{(v)}\right)}{\pi^{(v)} f_1\left(Y_i; \mu_1^{(v)}, \sigma_1^{(v)}\right) + \left(1-\pi^{(v)}\right) f_2\left(Y_i; \mu_2^{(v)}, \sigma_2^{(v)}\right)}. \end{equation*}

Notice that the weights at the v^{th} step are computed from the parameter estimates at that same step; maximizing the objective then yields the updated estimates \pmb{\theta}^{(v+1)}. The EM algorithm then proceeds as follows:

  1. Initialize the parameter vector, \pmb{\theta}^{(0)}.
  2. E-step: Compute the weights w_{i}^{(0)} using \pmb{\theta}^{(0)}.
  3. M-step: Maximize the objective function with respect to \pmb{\theta} to obtain \pmb{\theta}^{(1)}:

        \begin{equation*} \pmb{\theta}^{(1)} = \arg \max_{\pmb{\theta}} Q(\pmb{\theta}, \pmb{\theta}^{(0)}) \end{equation*}

  4. Check for convergence in Q or the parameter vector. If convergence is met, stop. Else, repeat steps 2 through 4.

Note: The performance of the EM algorithm is very sensitive to initialization. A commonly used strategy is to apply the K-means clustering algorithm, and then estimate parameters within each cluster (demonstrated shortly).

Analytic form of the M-step

For this simple problem, the M-step can be represented as a set of formulas for each entry in the parameter vector \pmb{\theta}. Recall that the k^{th} Normal density is provided by,

    \begin{equation*} f_k(y) = \frac{1}{\sqrt{2\pi \sigma_k^2}} \exp \left\lbrace - \frac{(y - \mu_k)^2}{2\sigma_{k}^{2}} \right\rbrace. \end{equation*}

Plugging this into our objective function (and dropping the additive constant involving 2\pi), we obtain:

    \[Q(\pmb{\theta}, \pmb{\theta}^{(v)}) = \sum_{i=1}^{N} w_i^{(v)} \left\lbrace \log \pi + \left(-\frac{1}{2} \log \sigma_1^2 - \frac{(Y_i - \mu_1)^2}{2\sigma_1^2}\right) \right\rbrace\]

    \[+ \sum_{i=1}^{N} (1-w_i^{(v)}) \left\lbrace \log (1-\pi) + \left(-\frac{1}{2} \log \sigma_2^2 - \frac{(Y_i - \mu_2)^2}{2\sigma_2^2}\right) \right\rbrace\]

To simplify notation, let Q^{(v)} = Q(\pmb{\theta}, \pmb{\theta}^{(v)}). Taking partial derivatives with respect to each parameter provides:

    \[\frac{\partial}{\partial \pi} Q^{(v)} = \sum_{i=1}^{N} \left( \frac{w_{i}^{(v)}}{\pi} - \frac{(1-w_{i}^{(v)})}{1-\pi} \right)\]

    \[\frac{\partial}{\partial \mu_1} Q^{(v)} = \sum_{i=1}^{N} w_{i}^{(v)} \frac{(Y_i - \mu_1)}{\sigma_1^2}\]

    \[\frac{\partial}{\partial \sigma_1^2} Q^{(v)} = \sum_{i=1}^{N} w_{i}^{(v)} \left( -\frac{1}{2\sigma_1^2} + \frac{(Y_i-\mu_1)^2}{2\sigma_1^4} \right)\]

Let N_1^{(v)} = \sum_{i=1}^{N} w_{i}^{(v)}, which can be interpreted as “the effective number of subjects assigned to mixture component 1” at iteration v. Setting each partial derivative equal to zero and solving provides estimators:

    \begin{equation*} \pi^{(v+1)} = \frac{N_{1}^{(v)}}{N} \end{equation*}

    \begin{equation*} \mu_{1}^{(v+1)} = \frac{1}{N^{(v)}_{1}} \sum_{i=1}^{N} w_{i}^{(v)} Y_{i} \end{equation*}

    \begin{equation*} \sigma_{1}^{2 ~(v+1)} = \frac{1}{N^{(v)}_{1}} \sum_{i=1}^{N} w_{i}^{(v)} \left(Y_{i} - \mu_{1}^{(v+1)}\right)^2 \end{equation*}

Similar results are obtained for \mu_2 and \sigma_2^2, with w_i^{(v)} replaced by (1-w_i^{(v)}) and N_{1}^{(v)} replaced by N_{2}^{(v)} = \sum_{i=1}^{N} (1-w_{i}^{(v)}). Thus, the estimators at the (v+1)^{th} iteration are weighted estimators, where the weights are the estimated class membership probabilities from the v^{th} iteration.
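
As a quick sanity check of the \pi update, note that the partial derivative with respect to \pi depends on the data only through N_{1}^{(v)}. Setting it to zero:

    \begin{align*} \frac{\partial}{\partial \pi} Q^{(v)} = \frac{N_{1}^{(v)}}{\pi} - \frac{N - N_{1}^{(v)}}{1-\pi} = 0 \quad \Longrightarrow \quad (1-\pi) N_{1}^{(v)} = \pi \left( N - N_{1}^{(v)} \right) \quad \Longrightarrow \quad \pi = \frac{N_{1}^{(v)}}{N}. \end{align*}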

Implementation

Data generation

We set the following:

  • Sample size to N=5,000;
  • f_{1}(y) equal to N(\mu_1 = 5, \sigma_1 = 1);
  • f_{2}(y) equal to N(\mu_2 = 2, \sigma_2 = 1.25);
  • Mixing probability equal to \pi = 0.6.

The following chunk generates data from the specified mixture by first sampling the true value of the latent class Z_i from a Bernoulli distribution. Based on the drawn Z_i, the response Y_i is generated from the corresponding Normal density.

library(tidyverse)
set.seed(12345)

# Simulate two-component Gaussian mixture
## Sample size
N <- 5000
## Parameters for first component
mu1 <- 5; sigma1 <- 1
## Parameters for second component
mu2 <- 2; sigma2 <- 1.25
## Mixing probability
pi <- 0.6
## True parameter vector
truth <- c(pi, mu1, sigma1, mu2, sigma2)

## Sample latent class
z <- rbinom(n=N, size=1, prob=pi)
## Sample response based on latent class
y1 <- rnorm(n=sum(z==1), mean=mu1, sd=sigma1)
y2 <- rnorm(n=sum(z==0), mean=mu2, sd=sigma2)
# Combine all responses into a single vector
y  <- c(y1, y2)
## Visualize mixture distribution
qplot(x=y, geom=c('histogram')) + 
  labs(x='Observed response, y', y='Observed frequency') +
  theme_bw()

Figure: Histogram of N=5000 responses generated from our two-component Gaussian mixture.

From scratch: analytic estimators

Helper functions

It is helpful to first define all functions we will need for our EM algorithm. In the following chunks, we define 5 functions:

  • f1: The density of the first component
  • f2: The density of the second component
  • estep: The expectation (E) step, which calculates the weights
  • mstep: The maximization step, which updates parameter estimates
  • ll: The observed-data log-likelihood, calculated from the current parameter estimates and used to monitor convergence.

The density functions are simply reskinned versions of dnorm, to align with the notation used here.

# First density
f1 <- function(y, mu1, sigma1){
  dnorm(y, mean=mu1, sd=sigma1)
}

# Second density
f2 <- function(y, mu2, sigma2){
  dnorm(y, mean=mu2, sd=sigma2)
}

The E-step uses parameter estimates from the previous step to update weights at the current step.

# E-step: update weights
estep <- function(y, pi1, mu1, sigma1, mu2, sigma2){
  pi2 <- 1 - pi1
  w1_num <- pi1 * f1(y, mu1, sigma1)
  w1_den <- pi1 * f1(y, mu1, sigma1) + pi2 * f2(y, mu2, sigma2)
  w1 <- w1_num / w1_den
  return(w1)
}

The M-step uses the updated weights to update parameter estimates. Notice we are estimating \sigma_k here rather than \sigma_{k}^2, hence the square root.

# M-step: update parameter estimates
mstep <- function(y, w1){
  # weights of component 2
  w2 <- 1 - w1
  # mixture probability (use length(y) rather than relying on the global N)
  pi1 <- sum(w1) / length(y)
  # first component params
  mu1 <- sum(w1 * y) / sum(w1)
  sigma1 <- sqrt(sum(w1 * (y-mu1)^2) / sum(w1))
  # second component params
  mu2 <- sum(w2 * y) / sum(w2)
  sigma2 <- sqrt(sum(w2 * (y-mu2)^2) / sum(w2))
  # estimated parameter vector
  est <- c(pi1, mu1, sigma1, mu2, sigma2)
  return(est)
}

Finally, we calculate the observed-data log-likelihood based on the current parameter estimates in order to check convergence.

# Log-likelihood: calculate observed-data log-likelihood from current parameter estimates
ll <- function(y, pi1, mu1, sigma1, mu2, sigma2){
  pi2 <- 1 - pi1
  inner <- log(pi1 * f1(y, mu1, sigma1) + pi2 * f2(y, mu2, sigma2))
  ll <- sum(inner)
  return(ll)
}

Putting it all together

We allow for a maximum of B=1000 iterations of the EM algorithm. If the log-likelihood improves by less than some small tolerance between successive iterations, iteration is stopped.

Parameter estimates at each iteration are stored within the list est_list, while the value of the log-likelihood is stored within the vector ll_vec. This will allow us to see how our EM algorithm converges!

On the first iteration, b=1, the k-means algorithm is applied to the observed responses with k=2. Parameter estimates are then obtained within the two predicted clusters to serve as initial values.

ll_vec <- c()
est_list <- list()

for (b in 1:1000){
  
  # Initialize parameter values 
  if (b==1){     
    # Apply k-means with two centers
    kinit <- kmeans(y, centers=2)
    # Store predicted cluster labels
    cluster <- kinit$cluster
    
    # Estimate mixture probability
    pi1 <- sum(cluster==1)/N
    # Estimate first component parameters
    y1 <- y[cluster==1]
    mu1 <- mean(y1)
    sigma1 <- sd(y1)
    # Estimate second component parameters
    y2 <- y[cluster==2]
    mu2 <- mean(y2)
    sigma2 <- sd(y2)
    # Objective function
    obj <- ll(y, pi1, mu1, sigma1, mu2, sigma2)
  } 
  
  else { 
    # Iterative updating
    # estep
    w1 <- estep(y, pi1, mu1, sigma1, mu2, sigma2)
    # mstep
    est <- mstep(y, w1)
    pi1 <- est[1] 
    mu1 <- est[2] ; sigma1 <- est[3]
    mu2 <- est[4] ; sigma2 <- est[5]
    # value of objective function
    obj <- ll(y, pi1, mu1, sigma1, mu2, sigma2)
  }
  
  # Update log-likelihood vector and estimate list
  ll_vec[b] <- obj
  est_list[[b]] <- c('pi1'=pi1, 
                     'mu1'=mu1, 'sigma1'=sigma1, 
                     'mu2'=mu2, 'sigma2'=sigma2)
  
  if (b > 2) { if (ll_vec[b] - ll_vec[(b-1)] < 0.001) break }
}

We get pretty dang close to the truth with only 61 iterations using the EM algorithm!

rbind(c('True values', truth),
      c('Initial estimate', round(est_list[[1]], 3)),
      c('Best estimate', round(est_list[[length(ll_vec)]], 3)))
##                         pi1     mu1     sigma1  mu2     sigma2 
## [1,] "True values"      "0.6"   "5"     "1"     "2"     "1.25" 
## [2,] "Initial estimate" "0.375" "1.756" "1.052" "5.002" "0.917"
## [3,] "Best estimate"    "0.404" "1.994" "1.277" "5.001" "0.981"
rbind(c('Best log-lik', round(ll_vec[length(ll_vec)], 3)),
      c('No. of iter', length(ll_vec)))
##      [,1]           [,2]       
## [1,] "Best log-lik" "-9844.273"
## [2,] "No. of iter"  "61"
plot(ll_vec, type='l')

Figure: Observed-data log-likelihood at each iteration of the EM algorithm.

You might notice that the component labels produced by our algorithm are the opposite of what we defined, i.e., classes 1 and 2 are switched so that \pi_1 = 0.4 rather than 0.6. That’s because mixture component labels are not identifiable – the infamous “label switching” problem! But that’s okay, we are smart enough to recognize which class is which here… This can get a bit dicey in more complex scenarios, however.
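
If we wanted to align the labels automatically, one simple convention is to relabel so that component 1 is always the component with the larger mean. A quick sketch (the align_labels helper is hypothetical, written for the named estimate vectors stored in est_list above):

# Hypothetical helper: swap labels so component 1 has the larger mean
align_labels <- function(est){
  # est is a named vector: (pi1, mu1, sigma1, mu2, sigma2)
  if (est['mu1'] < est['mu2']) {
    est <- c(pi1    = unname(1 - est['pi1']),
             mu1    = unname(est['mu2']), sigma1 = unname(est['sigma2']),
             mu2    = unname(est['mu1']), sigma2 = unname(est['sigma1']))
  }
  est
}
round(align_labels(est_list[[length(est_list)]]), 3)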

Out-of-the-box: using flexmix

flexmix is a fantastic package which revolves around fitting mixture distributions and provides, you guessed it, a lot of flexibility.

library(flexmix)

To use flexmix for this simple scenario, we create a dataframe with our observed responses.

df <- data.frame(y=y)

Then, we apply the flexmix function, first specifying the model formula for the responses. Here, y ~ 1 specifies an intercept-only model, i.e., no covariates. We then specify the number of components k=2 and the distribution of the responses via FLXMRglm (here, Gaussian).

m1 <- flexmix(y ~ 1, data=df, k=2,  model = FLXMRglm(family='gaussian'))
m1
## 
## Call:
## flexmix(formula = y ~ 1, data = df, k = 2, model = FLXMRglm(family = "gaussian"))
## 
## Cluster sizes:
##    1    2 
## 1949 3051 
## 
## convergence after 101 iterations
cbind(parameters(m1, component=1), parameters(m1, component=2))
##                    Comp.1    Comp.2
## coef.(Intercept) 2.044882 5.0225334
## sigma            1.303764 0.9701286

We see very similar results from both implementations! After 101 iterations, flexmix assigns 3051 of the N=5,000 observations to the component with mean 5.02, so the corresponding mixing probability is roughly 3051/5000 ≈ 0.61 based on the hard cluster assignments, and the estimated parameters are approximately (0.61, 5.02, 0.97, 2.04, 1.30). Lovely.
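
Rather than back-calculating the mixing probability from the hard cluster sizes, flexmix also exposes these quantities directly (if I recall the accessors correctly, prior() returns the estimated mixing probabilities and posterior() returns the per-observation membership probabilities, i.e., our EM weights):

# Estimated mixing probabilities
prior(m1)
# Posterior membership probabilities (analogous to the weights w_i)
head(posterior(m1))
# Hard cluster assignments
table(clusters(m1))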

And they all lived happily ever after

Hope you found this helpful! In the future, I hope to discuss more complicated extensions such as mixtures of linear regressions, which allow the responses to depend on covariates, and mixture-of-experts models, which allow both the responses and the class probabilities to depend on covariates! In these cases, I will also explore how optim can be used to maximize the objective function when closed-form estimators are not available.

