EM Algorithm Essentials: Maximizing objective functions using R’s optim

Overview

In my previous blogpost, I motivated the EM algorithm in the context of estimating the parameters of a two-component Gaussian mixture density. In this case, we can write the estimators of the mixing probability, means, and variance in a nice closed form, and I demonstrated how to implement the corresponding iterative estimation procedure from scratch. Results were then compared to those obtained from the very nice R package flexmix.

However, rarely do we get such nice closed form estimators! We usually need to use numerical methods to maximize our objective function directly. In this blog post, I demonstrate how we can specify our objective function, and use the optim function in R to obtain our parameter estimates. optim has lots of options, and we will cover how to change the optimization procedure and implement restrictions on our parameter spaces.

EM for two-component Gaussian mixture

Let’s quickly recap our motivation, previously discussed in Embracing the EM algorithm: One continuous response.

We randomly sample N patients from a population, and examine the empirical density of their responses. We notice two modes, and based on prior knowledge, hypothesize that the density is actually a mixture of two Gaussian densities. For example, the density centered around greater responses may correspond to “healthy” patients and the other to “ill” patients. We would like to (1) estimate the probability of belonging to each subpopulation; and (2) estimate subpopulation parameters, e.g. mean response in among the healthy and ill. But, we have a problem: we don’t know who belongs to each subpopulation. In other words, subpopulation labels are unobserved or “latent.”

Figure: Observed two-component Gaussian mixture density (purple), and distribution of responses among latent healthy (blue) and ill (red) patient subpopulations.

Figure: Observed two-component Gaussian mixture density (purple), and distribution of responses among latent healthy (blue) and ill (red) patient subpopulations.

We can represent the density of the observed responses as a mixture of the subpopulation densities:

    \[f(y) = \pi~ f_1(y) + (1-\pi)~ f_2(y).\]

That is, individuals belong to the first subpopulation, or are distributed according to density f_{1}(y), according to probability \pi and to the second, distributed per f_2(y), with probability 1-\pi. We assume both densities are Gaussian with respective means \mu_1 and \mu_2 and variances \sigma_1^2 and \sigma_2^2.

Including our “mixing probability” \pi, the vector of all parameters is \pmb{\theta} = (\pi, \mu_1, \sigma_1, \mu_2, \sigma_2). Suppose we use a traditional maximum likelihood approach to estimation, and begin by defining the marginal log-likelihood as a function of our sample of N observed responses y:

    \begin{equation*} \ell(\pmb{\theta}) = \log \left\lbrace \prod_{i=1}^{N} f(y_i;\pmb{\theta}) \right\rbrace = \sum_{i=1}^{N} \log \left\lbrace \pi~ f_{1}(y_i) + (1-\pi) ~f_{2}(y_i) \right\rbrace. \end{equation*}

While this appears simple in form, this log-likelihood can be surprisingly challenging to maximize. The EM algorithm gets around this through clever re-expression of the mixture density in terms of the observed response and a latent random variable corresponding to subpopulation membership.

Let Z_{i} \sim \text{Bernoulli}(\pi) be a random subpopulation indicator, equal to 1 if individual i belongs to the first subpopulation (“healthy”) and 0 otherwise (“ill”). We have by definition f(y|z=1) = f_{1}(y) and f(y|z=0) = f_{2}(y), which can be represented compactly as:

    \begin{equation*} f(y|z) = \left\lbrace f_{1}(y) \right\rbrace^{z} \left\lbrace f_{2}(y) \right\rbrace^{1-z}. \end{equation*}

The joint density of Y_i and Z_i is then given through application of Bayes’ Theorem,

    \begin{equation*} f(y, z) = f(y|z) ~ f(z) = \left\lbrace \pi f_{1}(y) \right\rbrace^{z} \left\lbrace (1-\pi) f_{2}(y) \right\rbrace^{1-z} \end{equation*}

and using marginalization, you can check that we obtain the same form of f(y) given above.

If we observed both Y_i and Z_i, the complete data log-likelihood would take the form:

    \begin{align*} \ell_{C}(\pmb{\theta}) &= \sum_{i=1}^{N} z_i \left\lbrace \log \pi + \log f_1(y_i) \right\rbrace + (1-z_i) \left\lbrace \log (1-\pi) + \log f_2(y_i) \right\rbrace. \end{align*}

The complete data log-likelihood is better behaved for at least two reasons: (1) the logarithm now acts directly on the Normal densities; and (2) the complete data log-likelihood is simply the sum of contributions from the two components. But, of course, we don’t observe the Z_i! By taking the expectation of the complete data log-likelihood conditional on the observed Y_i, we get a function no longer featuring Z_i:

    \begin{equation*} \mathbb{E}\left[ \ell_{C}(\pmb{\theta}) | \pmb{Y} \right] = \sum_{i=1}^{N} w_i \left\lbrace \log \pi + \log f_1(y_i) \right\rbrace + (1-w_i) \left\lbrace\log (1-\pi) + \log f_2(y_i) \right\rbrace \end{equation*}

where

    \begin{equation*} w_i = f(z=1|y_i) = \frac{\pi f_1(y_i)}{\pi~ f_1(y_i) + (1-\pi)~ f_2(y_i)} \end{equation*}

and similarly, 1-w_i = f(z=0|y_i). We can maximize this function with respect to \pmb{\theta}, treating the weights w_i as fixed. But, since the weights feature the parameters, we will need to use iteration.

Formally, at the v^{th} iteration, we consider the objective function:

    \begin{align*} Q(\pmb{\theta} \mid \pmb{\theta}^{(v)}) &= \sum_{i=1}^{N} w_i^{(v)} \left\lbrace \log \pi + \log f_1(y_i) \right\rbrace + (1-w_i^{(v)}) \left\lbrace \log (1-\pi) + \log f_2(y_i) \right\rbrace \end{align*}

where

    \begin{equation*} w_i^{(v)} =\frac{\pi^{(v-1)} f_1\left(Y_i; \mu_1^{(v-1)}, \sigma_1^{(v-1)}\right)}{\pi^{(v-1)} f_1\left(Y_i; \mu_1^{(v-1)}, \sigma_1^{(v-1)}\right) + \left(1-\pi^{(v-1)}\right) f_2\left(Y_i; \mu_2^{(v-1)}, \sigma_2^{(v-1)}\right)}. \end{equation*}

Notice that weights at the v^{th} step are estimated using parameter estimates from the previous (v-1)^{th} step. The EM algorithm then proceeds as follows:

  1. Initialize the parameter vector, \pmb{\theta}^{(0)}.
  2. E-step: Estimate the weights w_{i}^{(1)} using \pmb{\theta}^{(0)}.
  3. M-step: Maximize the objective function with respect to \pmb{\theta} to obtain \pmb{\theta}^{(1)}:

        \begin{equation*} \pmb{\theta}^{(1)} = \arg \max_{\pmb{\theta}} Q(\pmb{\theta} \mid \pmb{\theta}^{(0)}) \end{equation*}

  4. Check for convergence in Q or the parameter vector. If convergence is met, stop. Else, repeat steps 2 through 4.

In what follows, we implement this algorithm using optim to perform our M-step!

Implementation

The EM algorithm splits things nicely into two steps, and we will follow this setup here by defining modular functions for our E-step and our M-step. But, first things first, we need to generate some data and initialize our estimates.

Data generation

For comparison, we will use exactly the same data generation process as the companion blogpost Embracing the EM algorithm: One continuous response. We set the following:

  • Sample size to N=5,000;
  • f_{1}(y) equal to N(\mu_1 = 5, \sigma_1 = 1);
  • f_{2}(y) equal to N(\mu_2 = 2, \sigma_2 = 1.25);
  • Mixing probability equal to \pi = 0.6.

The following chunk generates data from the specified mixture by first sampling the true value of the latent class Z_i from a Bernoulli distribution. Conditional on the drawn Z_i, the response Y_i is generated from the corresponding Gaussian distribution.

set.seed(12345)

# Simulate two-component Gaussian mixture
## Sample size
N = 5000
## Parameters for first component
mu1 <- 5; sigma1 <- 1
## Parameters for second component
mu2 <- 2; sigma2 <- 1.25
## Mixing probability
pi <- 0.6
## True parameter vector
truth <- c(pi, mu1, sigma1, mu2, sigma2)

## Sample latent class
z <- rbinom(n=N, size=1, prob=pi)
## Sample response based on latent class
y1 <- rnorm(n=sum(z==1), mean=mu1, sd=sigma1)
y2 <- rnorm(n=sum(z==0), mean=mu2, sd=sigma2)
# Combine all responses into a single vector
y  <- c(y1, y2)
library(tidyverse)
## Visualize mixture distribution
qplot(x=y, geom=c('histogram')) + 
  labs(x='Observed response, y', y='Observed frequency') +
  theme_bw()
Figure: Histogram of N=5000 responses generated from our two-component Gaussian mixture.

Figure: Histogram of N=5000 responses generated from our two-component Gaussian mixture.

E-step

Given the previous iteration’s parameter estimates, the E-step estimates the weights w_i. For simplification, I first reskin dnorm to align with the notation used here. If log=T then the log density is returned (helpful for log-likelihood).

# First density
f1 <- function(y, Mu1, Sigma1, log=F){ dnorm(y, mean=Mu1, sd=Sigma1, log=log) }

# Second density
f2 <- function(y, Mu2, Sigma2, log=F){ dnorm(y, mean=Mu2, sd=Sigma2, log=log) }

We need to pass in the data y and a vector of parameter values theta (corresponding to \pmb{\theta} defined previously).

EStep <- function(y, theta){
  # E-STEP: Estimate weights
  # Extract parameter values
  Pi  <- theta[1]
  Mu1 <- theta[2]; Sigma1 <- theta[3]
  Mu2 <- theta[4]; Sigma2 <- theta[5]
  
  # Compute numerator and denominator of weight w
  Numerator <- Pi * f1(y, Mu1, Sigma1)
  Denominator <- Pi * f1(y, Mu1, Sigma1) + (1-Pi) * f2(y, Mu2, Sigma2)
  
  # Calculate and return weights W
  w <- Numerator / Denominator
  return(w)
}

M-step using optim

Now for the moment you’ve all been waiting for – performing the Maximization or M-step using optim!

We first need to turn our objective function (the function we want to maximize) into an R function that we can later pass to optim. The arguments of Q are: the data y, the current values of the weights w, and the current parameter estimates theta.

Q <- function(y, w, theta){
  # OBJECTIVE FUNCTION
  
  # Extract parameter values
  Pi  <- theta[1]
  Mu1 <- theta[2]; Sigma1 <- theta[3]
  Mu2 <- theta[4]; Sigma2 <- theta[5]
  
  # Calculate two terms of Q per subject (notice log=T in f)
  Term1 <- w*log(Pi) + w*f1(y, Mu1, Sigma1, log=T)
  Term2 <- (1-w)*log(1-Pi) + (1-w)*f2(y, Mu2, Sigma2, log=T)
  
  # Add 'em all together and return
  Q <- sum(Term1) + sum(Term2)
  return(Q)
}

We wrap optim in an M-step function which requires the same arguments as Q. The optim function will return the maximized value of Q and corresponding parameter estimates in a named list, so we store the results to access them. We use ‘BFGS’ as our unconstrained optimization technique (default), but there are several options available such as ‘Nelder-Mead’. See ?optim for more details.

MStep <- function(y, w, theta){
  # M-STEP: Maximize objective function
  # Specify optim arguments
  maxQ <- optim(# function to optimize
                fn = Q,
                # initial (current) parameter values
                par = theta,
                # additional user specified arguments to pass to fn
                y = y,
                w = w,
                # optimization method, see help for options
                method = 'BFGS',
                # optim minimizes by default, specify fnscale = -1 to maximize
                control = list(fnscale = -1)
  )
  
  # Return results of maximization = (Q, theta)
  results <- list(Qval=maxQ$value, theta=maxQ$par)
}

Initialization and iteration

Each step of maximizing the objective function Q should increase the value of the marginal log-likelihood. However, the EM algorithm is only guaranteed to find a local maxima. Thus, in general, it is quite sensitive to the choice of starting estimates. There are a lot of strategies for dealing with this, such as selecting multiple sets of initial values and retaining the set which performs best, i.e., results in the greatest likelihood. For complex models, it can also be challenging to even know how to reasonably obtain initial estimates!

Here, we use the k-means algorithm to cluster the data into two groups and treat predicted labels as the true value of Z_i. Within the two groups defined by Z_i, we can independently obtain initial estimates of \mu_i and \sigma_i.

Q_vec <- c()
est_list <- list()

for (b in 1:1000){
  
  # Initialize parameter values 
  if (b==1){     
    # Apply k-means with two centers
    kinit <- kmeans(y, centers=2)
    # Store predicted cluster labels
    cluster <- kinit$cluster
    
    # Estimate mixture probability
    Pi <- sum(cluster==1)/N
    # Estimate first component parameters
    y1 <- y[cluster==1]
    Mu1 <- mean(y1)
    Sigma1 <- sd(y1)
    # Estimate second component parameters
    y2 <- y[cluster==2]
    Mu2 <- mean(y2)
    Sigma2 <- sd(y2)
    # Parameter vector
    theta <- c(Pi, Mu1, Sigma1, Mu2, Sigma2)
    # Objective function
    Qval <- Q(y, EStep(y, theta), theta)
  } 
  
  else { 
    # E-Step
    w <- EStep(y, theta)
    # M-Step
    M <- MStep(y, w, theta)
    Qval <- M$Qval 
    theta <- M$theta
  }
  
  # Update log-likelihood vector and estimate list
  Q_vec[b] <- Qval
  est_list[[b]] <- theta 
  
  if (b > 1) { if (abs(Q_vec[b] - Q_vec[(b-1)]) < 0.01) break }
}
rbind(c('True values:', truth),
      c('Inital estimate:', round(est_list[[1]], 3)),
      c('Best estimate:', round(est_list[[length(Q_vec)]], 3)))
##      [,1]               [,2]    [,3]    [,4]    [,5]    [,6]   
## [1,] "True values:"     "0.6"   "5"     "1"     "2"     "1.25" 
## [2,] "Inital estimate:" "0.375" "1.756" "1.052" "5.002" "0.917"
## [3,] "Best estimate:"   "0.406" "2"     "1.28"  "5.004" "0.979"

Ah! This is something we saw in Embracing the EM algorithm: One continuous response! Cluster or component labels are not unique. Here, the algorithm has “swapped” the component labels. That’s okay, we are smart enough to figure out which is which. However, this can be tricky for more complex problems… Constrained optimization presents one solution, discussed shortly.

rbind(c('Initial objective:', round(Q_vec[1], 3)),
      c('Best objective:', round(Q_vec[length(Q_vec)], 3)),
      c('No. of iterations:', length(Q_vec)))
##      [,1]                 [,2]        
## [1,] "Initial objective:" "-10586.343"
## [2,] "Best objective:"    "-10909.606"
## [3,] "No. of iterations:" "87"
LL <- function(y, theta){
  # Extract parameter values
  Pi  <- theta[1]
  Mu1 <- theta[2]; Sigma1 <- theta[3]
  Mu2 <- theta[4]; Sigma2 <- theta[5]
  Term <- log(Pi * f1(y, Mu1, Sigma1) + (1-Pi) * f2(y, Mu2, Sigma2))
  LL <- sum(Term)
  return(LL)
}
 plot(unlist(lapply(est_list, FUN=function(X) LL(y, X))), 
      type='l', xlab='Iteration', ylab='Marginal log-likelihood')
Marginal (observed) log-likelihood at each iteration of the EM algorithm.

Marginal (observed) log-likelihood at each iteration of the EM algorithm.

Constrained optimization with L-BFGS-B

The previous M-step and optim procedure performed unconstrained optimization for all parameters in theta over the range (-\infty, +\infty). However, we often need to constrain certain parameters to fall within a given range. For example, the variances \sigma^2_{1} and \sigma^2_{2} only exist within the range (0, +\infty). Another example is correlation which is bounded within (-1, 1).

By setting method = 'L-BFGS-B', we can set lower and upper constraints on the optimization of specific parameters. The following code sets bounds of (0, +\infty) for each of the \sigma_{i}. We can also address our label problem, by restricting the range of \pi to (0.5, 1) so the first component is forced to be properly identified. Unbounded variables have lower and upper set to -Inf and Inf, respectively.

MStep2 <- function(y, w, theta){
  # M-STEP: Maximize objective function
  # Specify optim arguments
  maxQ <- optim(# function to optimize
                fn = Q,
                # initial (current) parameter values
                par = theta,
                # additional user specified arguments to pass to fn
                y = y,
                w = w,
                # optimization method, see help for options
                method = 'L-BFGS-B',
                # Bounds on Pi, Mu1, Sigma1, Mu2, Sigma2
                lower = c(0.5, -Inf,   0, -Inf,   0),
                upper = c(1.0,  Inf, Inf,  Inf, Inf),
                # optim minimizes by default, specify fnscale = -1 to maximize
                control = list(fnscale = -1)
  )
  
  # Return results of maximization = (Q, theta)
  results <- list(Qval=maxQ$value, theta=maxQ$par)
}
Q_vec <- c()
est_list <- list()

for (b in 1:1000){
  
  # Initialize parameter values 
  if (b==1){     
    # Apply k-means with two centers
    kinit <- kmeans(y, centers=2)
    # Store predicted cluster labels
    cluster <- kinit$cluster
    
    # Estimate mixture probability
    Pi <- sum(cluster==1)/N
    # Estimate first component parameters
    y1 <- y[cluster==1]
    Mu1 <- mean(y1)
    Sigma1 <- sd(y1)
    # Estimate second component parameters
    y2 <- y[cluster==2]
    Mu2 <- mean(y2)
    Sigma2 <- sd(y2)
    # Parameter vector
    theta <- c(Pi, Mu1, Sigma1, Mu2, Sigma2)
    # Objective function
    Qval <- Q(y, EStep(y, theta), theta)
  } 
  
  else { 
    # E-Step
    w <- EStep(y, theta)
    # M-Step
    M <- MStep2(y, w, theta)
    Qval <- M$Qval 
    theta <- M$theta
  }
  
  # Update log-likelihood vector and estimate list
  Q_vec[b] <- Qval
  est_list[[b]] <- theta 
  
  if (b > 1) { if (abs(Q_vec[b] - Q_vec[(b-1)]) < 0.01) break }
}
rbind(c('True values:', truth),
      c('Inital estimate:', round(est_list[[1]], 3)),
      c('Best estimate:', round(est_list[[length(Q_vec)]], 3)))
##      [,1]               [,2]    [,3]    [,4]    [,5]    [,6]   
## [1,] "True values:"     "0.6"   "5"     "1"     "2"     "1.25" 
## [2,] "Inital estimate:" "0.625" "5.002" "0.917" "1.756" "1.052"
## [3,] "Best estimate:"   "0.593" "5.006" "0.978" "2.005" "1.282"

And voila! The components are properly ordered. We don’t see a difference in the \hat{\sigma}_i, but they aren’t problematically close to 0.

Concluding thoughts

Motivated by a two-component Gaussian mixture, this blog post demonstrates how to maximize objective functions using R’s optim function. Unconstrained maximization using BFGS and constrained maximization using L-BFGS-B is demonstrated. There are alternative ways of performing constrained optimization, for example, by performing unconstrained optimization on a transformed parameter space (similar idea to use of a link function in generalized linear models). I’ll cover this in my next blogpost, as it gets into some neat properties of maximum likelihood estimation and optimization! Hope this is helpful.

P.S. Shoutout and thanks to Dr. Tom Chen and Ph.D. student Nick Birk for their help on related implementations.

Published by

Emma Davies Smith

Emma Davies Smith is currently a postdoctoral research fellow at the Harvard School of Public Health. Her current research interests include clinical trial methodology, nonparametric methods, missing data, data visualization, and communication. When she's not working on expanding her knowledge of statistics, she's busy petting cats and unsuccessfully convincing her husband to let her adopt them, hiking, and concocting indie and folk rock playlists.

Leave a Reply

Your email address will not be published. Required fields are marked *