EM Algorithm Essentials: Estimating standard errors using the empirical information matrix

Introduction

At the end of my latest blog post, I promised that I would talk about how to perform constrained maximization using unconstrained optimizers. This can be accomplished by employing clever transformations and nice properties of maximum likelihood estimators – see this fantastic post from the now defunct 🙁 Econometrics Beat. Eventually I will get around to discussing more of the statistical details behind this approach and its implementation from scratch in R.

RIGHT NOW, I want to talk about how to obtain standard errors for Gaussian mixture model parameters estimated using the EM algorithm. So far, I’ve written two posts with respect to Gaussian mixtures and the EM algorithm:

  1. Embracing the EM algorithm: One continuous response, which motivates the theory behind the EM algorithm using a two component Gaussian mixture model; and
  2. EM Algorithm Essentials: Maximizing objective functions using R’s optim, which demonstrates how to implement log-likelihood maximization from scratch using the optim function.

Both posts so far have focused only on POINT ESTIMATION! That is, obtaining estimates of our mixture model parameters.

Any estimate we obtain using any statistical method has some uncertainty associated with it. We quantify the uncertainty of a parameter estimate by its STANDARD ERROR. If we repeated the same experiment with comparable samples a large number of times, the standard error would reflect how much our estimates differ from experiment to experiment. Indeed, we would estimate the standard error as the standard deviation of the effect estimates across experiments. If we can understand how our estimates vary across experiments, i.e., estimate their standard errors, we can perform statistical inference by testing hypotheses or constructing confidence intervals, for example.

We usually do not need to conduct all of these experiments because theory tells us what form this distribution of effect estimates takes! We refer to this distribution as the SAMPLING DISTRIBUTION. When the form of the sampling distribution is not obvious, we may use resampling techniques to estimate standard errors. The EM algorithm, however, attempts to obtain maximum likelihood estimates (MLEs), which have theoretical sampling distributions. For well-behaved densities, MLEs can be shown to be asymptotically normal and unbiased with a variance-covariance matrix equal to the inverse of the Fisher information matrix, i.e., a function of the second derivatives of the log-likelihood or, equivalently, the outer product of the first derivatives…

The tricky bit with the EM algorithm is that we don’t maximize the observed log-likelihood directly to estimate Gaussian mixture parameters. Instead, we maximize a more convenient objective function. People much smarter than I have figured out this can give you the same answer. A question remains: if standard errors are estimated using the second derivative of the log-likelihood, but we maximized the objective function instead, how do we properly quantify uncertainty? This is particularly tricky because the derivatives of the observed log-likelihood are not straightforward, for the same reasons that make its direct maximization difficult.

Isaac Meilijson, another person smarter than me, provides us with a wealth of guidance in his 1989 paper titled A Fast Improvement to the EM Algorithm on its Own Terms (I love this title). In this blog post, we summarize and demonstrate how to construct simple estimators of the observed information matrix, and subsequently the standard errors of our EM estimates, when we have independent and identically distributed responses arising from a two-component Gaussian mixture model. Essentially, due to some nice properties, we can simply use the derivatives of our objective function, in lieu of our log-likelihood, to estimate the observed information. Standard errors computed using the observed information are then compared to those obtained using numerical differentiation of the observed log-likelihood.

Review

Fisher information

Consider N random variables \pmb{Y} = (Y_1, ..., Y_N)^{T} which are independent and identically distributed (i.i.d.) according to some density f(y; \pmb{\theta}) with p parameters, \pmb{\theta} = (\theta_1, ..., \theta_p)^{T}. Let \pmb{y} = (y_1, ..., y_N)^{T} represent the observed values or realization of \pmb{Y}.

We wish to estimate \pmb{\theta} using maximum likelihood estimation. Because the y_i are i.i.d., the likelihood (equal to the joint density of \pmb{y}) is defined as,

    \[L(\pmb{\theta} \mid \pmb{y}) = \prod_{i=1}^{N} f(y_i; \pmb{\theta}).\]

We desire the estimates \hat{\pmb{\theta}} which maximize this likelihood,

    \[\hat{\pmb{\theta}} = \arg \max_{\pmb{\theta}} L(\pmb{\theta} \mid \pmb{y}).\]

That is, we seek the parameter values which make the observed data most likely. Rather than maximizing the likelihood directly, it is convenient to instead maximize the log-likelihood, particularly because many common f(\cdot), including the normal, Bernoulli, Poisson, and exponential distributions, belong to the exponential family. The log-likelihood is thus,

    \[\ell(\pmb{\theta} \mid \pmb{y}) = \sum_{i=1}^{N} \log f(y_i; \pmb{\theta}).\]

Maximizing the log-likelihood with respect to \pmb{\theta} is equivalent to solving the p equations,

    \[S(\pmb{y}; \pmb{\theta}) = \frac{\partial \ell(\pmb{\theta} \mid \pmb{y})}{\partial \pmb{\theta}} = \pmb{0}\]

which yields the p maximum likelihood estimates \hat{\pmb{\theta}}=(\hat{\theta}_1, ..., \hat{\theta}_p)^{T}.
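
For instance (a standard textbook example, not specific to these posts), for an i.i.d. normal sample with known variance \sigma^2, the single score equation is,

    \[S(\pmb{y}; \mu) = \sum_{i=1}^{N} \frac{y_i - \mu}{\sigma^2} = 0,\]

which yields \hat{\mu} = \bar{y}, the sample mean.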

When certain conditions are met (they are for the following examples), asymptotically \hat{\pmb{\theta}} \sim N_{p}(\pmb{\theta}, \text{I}(\pmb{\theta})^{-1}) where \text{I}(\pmb{\theta})^{-1} is the inverse of the expected Fisher information matrix. The expected Fisher information matrix can be expressed as,

    \[\text{I}(\pmb{\theta}) = -\mathbb{E} \left[ \frac{\partial^2}{\partial \pmb{\theta} \partial \pmb{\theta}^{T}} \ell(\pmb{\theta} \mid \pmb{y}) \right]\]

or the negative expected value of the second derivative of the log-likelihood. If the second derivative of the log-likelihood is small in magnitude when evaluated at \hat{\pmb{\theta}}, it suggests the likelihood is flat in the region of the MLE, or that there is lots of uncertainty with respect to \hat{\pmb{\theta}}. In that case, \widehat{\text{Var}}(\hat{\pmb{\theta}}) = \text{I}(\hat{\pmb{\theta}})^{-1} will be large. Oftentimes, the observed Fisher information is used for variance estimation instead,

    \[\text{J}(\hat{\pmb{\theta}}) = - \left[ \frac{\partial^2}{\partial \pmb{\theta} \partial \pmb{\theta}^{T}} \ell(\pmb{\theta} \mid \pmb{y}) \right]_{\pmb{\theta} = \hat{\pmb{\theta}}}\]

as it avoids the need to take the expectation.

The vector of first derivatives can be re-expressed as,

    \[S(\pmb{y}; \pmb{\theta}) = \sum_{i=1}^{N} \frac{\partial}{\partial \pmb{\theta}} \log f(y_{i}; \pmb{\theta}) = \sum_{i=1}^{N} S(y_i; \pmb{\theta})\]

and is referred to as the score function. The score function has very interesting, and convenient properties:

  1. The expected value of the score function, evaluated at the true \pmb{\theta}, is zero: \mathbb{E}[S(\pmb{y}; \pmb{\theta})] = \pmb{0};
  2. When evaluated at the MLE \hat{\pmb{\theta}}, the score function is equal to \pmb{0} by definition: S(\pmb{y}; \hat{\pmb{\theta}}) = \pmb{0}; and
  3. The variance of the score function is the Fisher information: \text{Var}[S(\pmb{y}; \pmb{\theta})] = \text{I}(\pmb{\theta}).

This suggests a third form of the Fisher information matrix, referred to as the empirical Fisher information matrix,

    \[\text{H}(\pmb{\theta}) = \widehat{\text{Var}}(S(\pmb{y}; \pmb{\theta}))\]

or equivalently, because the individual score components S(y_i; \pmb{\theta}) are also i.i.d., H(\pmb{\theta}) = N \widehat{\text{Var}}(S(y_i; \pmb{\theta})):

    \[\text{H}(\pmb{\theta}) = N \left\lbrace \frac{1}{N} \sum_{i=1}^{N} S(y_i; \pmb{\theta}) ~S(y_i; \pmb{\theta})^{T} - \frac{1}{N^2} \left( \sum_{i=1}^{N} S(y_i; \pmb{\theta})\right) \left( \sum_{i=1}^{N} S(y_i; \pmb{\theta})\right)^{T} \right\rbrace.\]

When evaluated at the MLE \hat{\pmb{\theta}}, this simplifies to,

    \[\text{H}(\hat{\pmb{\theta}}) = \sum_{i=1}^{N} S(y_i; \hat{\pmb{\theta}}) ~S(y_i; \hat{\pmb{\theta}})^{T}.\]

We can then invert the empirical Fisher information matrix at \hat{\pmb{\theta}} to estimate the covariance of the MLE estimates, \widehat{\text{Var}}(\hat{\pmb{\theta}}) = H(\hat{\pmb{\theta}})^{-1}. This is a nice result because it means we only need to compute the first derivatives of the log-likelihood. First derivatives can be challenging, let alone higher order derivatives.
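
To make these three flavours of the information concrete before moving on to mixtures, here is a minimal R sketch (my addition, not part of the earlier posts) using an i.i.d. Poisson sample. For the Poisson, the score contribution is S(y_i; \lambda) = y_i / \lambda - 1, the MLE is \hat{\lambda} = \bar{y}, and the expected information is \text{I}(\lambda) = N/\lambda, so all three standard error estimates should be close to \sqrt{\hat{\lambda}/N}.

# Minimal sketch: expected, observed, and empirical information for a Poisson sample
set.seed(1)
Npois <- 1000
lambda <- 3
ypois <- rpois(n=Npois, lambda=lambda)

# MLE of lambda is the sample mean
lambda_hat <- mean(ypois)

# Score contributions S(y_i; lambda) = y_i/lambda - 1, evaluated at the MLE
S_i <- ypois/lambda_hat - 1

# Expected information I = N/lambda, observed J = sum(y_i)/lambda^2,
# empirical H = sum of squared score contributions
I_exp <- Npois/lambda_hat
J_obs <- sum(ypois)/lambda_hat^2
H_emp <- sum(S_i^2)

# Corresponding standard error estimates; all three should be nearly identical
sqrt(1/c('expected'=I_exp, 'observed'=J_obs, 'empirical'=H_emp))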

Objective function for two-component mixture model

Suppose that \pmb{Y} = (Y_1, ..., Y_N)^{T} is a vector of N i.i.d. random variables following a two-component Gaussian mixture with mixing probability \pi and density,

    \[f(y) = \pi f_1(y) + (1-\pi) f_2(y)\]

where f_{k}(y) corresponds to the Normal density of subpopulation k \in \lbrace 1,2 \rbrace with mean \mu_{k} and variance \sigma^{2}_{k}. Let y_i represent the realization of Y_i and \pmb{\theta} = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2) represent the vector of all parameters. The observed log-likelihood is then,

    \[\ell(\pmb{\theta} \mid \pmb{y}) = \sum_{i=1}^{N} \log \left\lbrace \pi f_1(y_i) + (1-\pi) f_2(y_i) \right\rbrace.\]

For reasons previously explained elsewhere, mixture log-likelihoods can be challenging to maximize. Instead, we use the EM algorithm to iteratively maximize the objective function,

    \[Q(\pmb{\theta} \mid \pmb{\theta}^{(\nu)}) = \sum_{i=1}^{N} \left\lbrace w_{i}^{(\nu)} \left[ \log \pi + \log f_1 (y_i) \right] + (1-w_i^{(\nu)}) \left[ \log(1-\pi) + \log f_2(y_i) \right] \right\rbrace\]

where parameter estimates at iteration (\nu+1) are obtained as,

    \[\pmb{\theta}^{(\nu+1)} = \arg \max_{\pmb{\theta}} Q(\pmb{\theta} \mid \pmb{\theta}^{(\nu)})\]

using fixed weights,

    \[w_{i}^{(\nu)} = \frac{\pi^{(\nu)} f_1(y_i; \mu_{1}^{(\nu)}, \sigma_1^{(\nu)})}{\pi^{(\nu)} f_1(y_i; \mu_{1}^{(\nu)}, \sigma_1^{(\nu)}) + (1-\pi^{(\nu)}) f_2(y_i; \mu_{2}^{(\nu)}, \sigma_2^{(\nu)})}.\]

That is, the EM algorithm solves,

    \[\frac{\partial}{\partial \pmb{\theta}} Q(\pmb{\theta} \mid \pmb{\theta}^{(\nu)}) = \pmb{0}\]

at iteration (\nu+1), rather than S(\pmb{y}; \pmb{\theta}) = \pmb{0}. If Q(\pmb{\theta}^{(\nu+1)} \mid \pmb{\theta}^{(\nu)}) > Q(\pmb{\theta}^{(\nu)} \mid \pmb{\theta}^{(\nu)}) then \ell(\pmb{\theta}^{(\nu+1)}) > \ell(\pmb{\theta}^{(\nu)}). Thus, increasing Q will increase \ell, though there is no guarantee that maximizing Q will yield the MLE.
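
As an aside, although the implementation below reuses optim for the M-step, these estimating equations have closed-form solutions for the Gaussian mixture: weighted averages of the data with weights w_i^{(\nu)}. A minimal sketch (MStepClosedForm is my own illustrative function, not part of the earlier posts):

# Closed-form M-step updates obtained by solving dQ/dtheta = 0 for the Gaussian mixture
MStepClosedForm <- function(y, w){
  Pi   <- mean(w)                           # update for mixing probability
  Mu1  <- sum(w*y)/sum(w)                   # weighted mean, component 1
  Var1 <- sum(w*(y - Mu1)^2)/sum(w)         # weighted variance, component 1
  Mu2  <- sum((1-w)*y)/sum(1-w)             # weighted mean, component 2
  Var2 <- sum((1-w)*(y - Mu2)^2)/sum(1-w)   # weighted variance, component 2
  return(c(Pi=Pi, Mu1=Mu1, Var1=Var1, Mu2=Mu2, Var2=Var2))
}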

Standard errors for EM estimates

No matter! The results of Louis (1982), summarized in Meilijson (1989), tell us that for any \pmb{\theta}, S(\pmb{y}; \pmb{\theta}) = Q'(\pmb{y}; \pmb{\theta}) where,

    \[Q'(\pmb{y}; \pmb{\theta}) = \left[ \frac{\partial}{\partial \pmb{\theta}_0} Q(\pmb{\theta}_0 \mid \pmb{\theta}) \right]_{\pmb{\theta}_0 = \pmb{\theta}}.\]

This result also holds for a single y_i such that Q'(y_i; \pmb{\theta}) = S(y_i; \pmb{\theta}). This means that for EM estimates \hat{\pmb{\theta}}, we can estimate the empirical information matrix using the “objective score function” Q' rather than the score function S:

    \[\text{H}(\hat{\pmb{\theta}}) = \widehat{\text{Var}}(Q'(\pmb{y}; \hat{\pmb{\theta}}))\]

or equivalently, \text{H}(\hat{\pmb{\theta}}) = N \widehat{\text{Var}}(Q'(y_i; \hat{\pmb{\theta}})). Awesome, let’s test this out!

“Objective score function” for Gaussian mixture

For notational simplicity, let:

    \[\frac{\partial Q_i}{\partial \pmb{\theta}} = \left[ \frac{\partial}{\partial \pmb{\theta}_0} Q_i( \pmb{\theta}_0 \mid \pmb{\theta}) \right]_{\pmb{\theta}_0 = \pmb{\theta}}\]

where Q_i represents the i^{th} element or summand of the objective function. Then, the elements of Q'(y_i; \pmb{\theta}) are:

    \[\frac{\partial Q_i}{\partial \pi} = \left( \frac{w_i}{\pi} - \frac{1-w_i}{1-\pi}\right)\]

    \[\frac{\partial Q_i}{\partial \mu_{1}} = w_{i} \left( \frac{y_i - \mu_1}{\sigma_{1}^2}\right)\]

    \[\frac{\partial Q_i}{\partial \sigma^2_{1}} = w_{i} \left( \frac{(y_i - \mu_1)^2 - \sigma^2_{1}}{2 ~\sigma_{1}^4} \right)\]

    \[\frac{\partial Q_i}{\partial \mu_{2}} = (1-w_{i}) \left( \frac{y_i - \mu_2}{\sigma_{2}^2}\right)\]

    \[\frac{\partial Q_i}{\partial \sigma^2_{2}} = (1-w_{i}) \left( \frac{(y_i - \mu_2)^2 - \sigma^2_{2}}{2 ~\sigma_{2}^4} \right)\]

In this case, it can be verified directly, with some additional work, that Q'(\pmb{y}; \pmb{\theta}) = S(\pmb{y}; \pmb{\theta}). For example, differentiating the observed log-likelihood with respect to \mu_1 and recognizing the resulting ratio as the weight w_i,

    \[\frac{\partial \ell}{\partial \mu_1} = \sum_{i=1}^{N} \frac{\pi f_{1}(y_i)}{\pi f_1(y_i) + (1-\pi) f_{2}(y_i)} \left( \frac{y_i - \mu_1}{\sigma_1^2} \right) = \frac{\partial Q}{\partial \mu_{1}}.\]

P.S. You can see how the second derivatives of the observed log-likelihood could get really nasty, really quickly.

Empirical information for simulated data

Data generation

Here, we apply the empirical information estimator to data generated in the same way as Embracing the EM algorithm: One continuous response. The true vector of parameters is \pmb{\theta} = (\pi = 0.6, \mu_1=5, \sigma_1^2 = 1, \mu_2 = 2, \sigma_2^2 = 1.25^2).

set.seed(12345)

# Simulate two-component Gaussian mixture
## Sample size
N <- 5000
## Parameters for first component
mu1 <- 5; var1 <- 1
## Parameters for second component
mu2 <- 2; var2 <- 1.25^2
## Mixing probability
pi <- 0.6
## True parameter vector
truth <- c(pi, mu1, var1, mu2, var2)

## Sample latent class
z <- rbinom(n=N, size=1, prob=pi)
## Sample response based on latent class
y1 <- rnorm(n=sum(z==1), mean=mu1, sd=sqrt(var1))
y2 <- rnorm(n=sum(z==0), mean=mu2, sd=sqrt(var2))
# Combine all responses into a single vector
y  <- c(y1, y2)

EM estimation

EM estimates are obtained by following EM Algorithm Essentials: Maximizing objective functions using R’s optim. The only difference is that we perform an additional E-step using the final EM estimates to update the weights.

# First density
f1 <- function(y, Mu1, Var1, log=F){ dnorm(y, mean=Mu1, sd=sqrt(Var1), log=log) }

# Second density
f2 <- function(y, Mu2, Var2, log=F){ dnorm(y, mean=Mu2, sd=sqrt(Var2), log=log) }

EStep <- function(y, theta){
  # E-STEP: Estimate weights
  # Extract parameter values
  Pi  <- theta[1]
  Mu1 <- theta[2]; Var1 <- theta[3]
  Mu2 <- theta[4]; Var2 <- theta[5]
  
  # Compute numerator and denominator of weight w
  Numerator <- Pi * f1(y, Mu1, Var1)
  Denominator <- Pi * f1(y, Mu1, Var1) + (1-Pi) * f2(y, Mu2, Var2)
  
  # Calculate and return weights W
  w <- Numerator / Denominator
  return(w)
}

Q <- function(y, w, theta){
  # OBJECTIVE FUNCTION
  
  # Extract parameter values
  Pi  <- theta[1]
  Mu1 <- theta[2]; Var1 <- theta[3]
  Mu2 <- theta[4]; Var2 <- theta[5]
  
  # Calculate two terms of Q per subject (notice log=T in f)
  Term1 <- w*log(Pi) + w*f1(y, Mu1, Var1, log=T)
  Term2 <- (1-w)*log(1-Pi) + (1-w)*f2(y, Mu2, Var2, log=T)
  
  # Add 'em all together and return
  Q <- sum(Term1) + sum(Term2)
  return(Q)
}

MStep <- function(y, w, theta){
  # M-STEP: Maximize objective function
  # Specify optim arguments
  maxQ <- optim(# function to optimize
                fn = Q,
                # initial (current) parameter values
                par = theta,
                # additional user specified arguments to pass to fn
                y = y,
                w = w,
                # optimization method, see help for options
                method = 'BFGS',
                # optim minimizes by default, specify fnscale = -1 to maximize
                control = list(fnscale = -1)
  )
  
  # Return results of maximization = (Q, theta)
  results <- list(Qval=maxQ$value, theta=maxQ$par)
  return(results)
}
Q_vec <- c()
est_list <- list()

for (b in 1:1000){
  
  # Initialize parameter values 
  if (b==1){     
    # Apply k-means with two centers
    kinit <- kmeans(y, centers=2)
    # Store predicted cluster labels
    cluster <- kinit$cluster
    
    # Estimate mixture probability
    Pi <- sum(cluster==1)/N
    # Estimate first component parameters
    y1 <- y[cluster==1]
    Mu1 <- mean(y1)
    Var1 <- var(y1)
    # Estimate second component parameters
    y2 <- y[cluster==2]
    Mu2 <- mean(y2)
    Var2 <- var(y2)
    # Parameter vector
    theta <- c(Pi, Mu1, Var1, Mu2, Var2)
    # Objective function
    Qval <- Q(y, EStep(y, theta), theta)
  } else {
    # E-Step
    w <- EStep(y, theta)
    # M-Step
    M <- MStep(y, w, theta)
    Qval <- M$Qval 
    theta <- M$theta
  }
  
  # Update log-likelihood vector and estimate list
  Q_vec[b] <- Qval
  est_list[[b]] <- theta 
  
  if (b > 1) { if (abs(Q_vec[b] - Q_vec[(b-1)]) < 0.01) break }
}
# Final estimates of Theta
ThetaEM <- est_list[[length(Q_vec)]]

The final estimates were 0.41, 2, 1.64, 5, 0.96, which align closely with the true Theta = 0.6, 5, 1, 2, 1.5625, but with the component labels reversed. Here, we fix the labels and re-estimate the weights using the relabelled Theta.

Theta <- c('Pi'=1-ThetaEM[1], 'Mu1'=ThetaEM[4], 'Var1'=ThetaEM[5], 'Mu2'=ThetaEM[2], 'Var2'=ThetaEM[3])
w <- EStep(y, Theta)

As a result, we have: \hat{\pmb{\theta}} = 0.59, 5, 0.96, 2, 1.64.

Empirical information and standard errors

Now, we define functions corresponding to the elements of the “objective score function” Q'(\pmb{y}; \hat{\pmb{\theta}}). Remember: to obtain the empirical information matrix, we estimate the variance of the individual objective score vectors. Thus, we need the value of Q' for each individual. The following code returns a matrix with one row per individual and p=5 columns, corresponding to their objective score vectors.

# Derivative of Q w.r.t Pi
dQi_Pi   <- function(wi, yi, Theta){ wi/Theta['Pi'] - (1-wi)/(1-Theta['Pi'])}

# Derivative of Q for first component parameters
dQi_Mu1  <- function(wi, yi, Theta){ wi*(yi-Theta['Mu1'])/Theta['Var1'] }
dQi_Var1 <- function(wi, yi, Theta){ wi*((yi - Theta['Mu1'])^2 - Theta['Var1'])/(2*Theta['Var1']^2)}

# Derivative of Q for second component parameters
dQi_Mu2  <- function(wi, yi, Theta){ (1-wi)*(yi-Theta['Mu2'])/Theta['Var2'] }
dQi_Var2 <- function(wi, yi, Theta){ (1-wi)*((yi-Theta['Mu2'])^2 - Theta['Var2'])/(2*Theta['Var2']^2)}

# Store all results with one row vector per individual
dQi_vec <- function(wi, yi, Theta){
  dQi <- cbind(dQi_Pi(wi, yi, Theta), 
               dQi_Mu1(wi, yi, Theta), 
               dQi_Var1(wi, yi, Theta),
               dQi_Mu2(wi, yi, Theta), 
               dQi_Var2(wi, yi, Theta))
  colnames(dQi) <- paste0('d',names(Theta))
  return(dQi)
}
# Calculate objective score components, and print a sample
dQi <- dQi_vec(w,y,Theta)
head(dQi)
##             dPi       dMu1       dVar1       dMu2      dVar2
## [1,] 0.95027906 -0.7872663 -0.05291849 0.22518904 0.08927032
## [2,] 1.63465475  0.6178672 -0.32241758 0.02620293 0.02514214
## [3,] 1.59079522  0.2134612 -0.48677399 0.04408917 0.03631870
## [4,] 0.06442835 -0.8809340  0.31884283 0.38535047 0.07091540
## [5,] 1.08828061 -0.7185802 -0.14532087 0.19270057 0.08536470
## [6,] 1.62744436  0.5275321 -0.37362115 0.02930063 0.02723943

Next, we estimate H(\hat{\pmb{\theta}}) by multiplying the sample size N by the estimated variance of the objective score components, \widehat{\text{Var}}(Q'(y_i; \hat{\pmb{\theta}})). The variance-covariance matrix of \hat{\pmb{\theta}} is obtained by inverting H(\hat{\pmb{\theta}}) using the solve function. Taking the square root of the diagonal of the variance-covariance matrix gives us the standard errors for each element of \hat{\pmb{\theta}}.

H <- N*var(dQi)
rownames(H) <- colnames(H) <- names(Theta)
EmpVar <- solve(H)
EmpSE <- sqrt(diag(EmpVar))
## H:
H
##              Pi        Mu1       Var1        Mu2       Var2
## Pi   15291.9181 1731.67519 -895.24744 1391.26330  486.43460
## Mu1   1731.6752 2336.78831  547.37106 -319.38856  -16.21773
## Var1  -895.2474  547.37106 1035.94009   79.56311  -58.43365
## Mu2   1391.2633 -319.38856   79.56311  811.46535 -215.95742
## Var2   486.4346  -16.21773  -58.43365 -215.95742  246.26277
## Variance-covariance matrix:
EmpVar
##                 Pi           Mu1          Var1          Mu2         Var2
## Pi    0.0003675139 -0.0006374766  0.0006594684 -0.001445827 -0.001879341
## Mu1  -0.0006374766  0.0016459897 -0.0014395852  0.002811064  0.003491127
## Var1  0.0006594684 -0.0014395852  0.0023248343 -0.002804938 -0.003305549
## Mu2  -0.0014458268  0.0028110637 -0.0028049383  0.007467666  0.008924143
## Var2 -0.0018793411  0.0034911274 -0.0033055495  0.008924143  0.015044395
## Standard errors:
EmpSE
##         Pi        Mu1       Var1        Mu2       Var2 
## 0.01917065 0.04057080 0.04821654 0.08641566 0.12265559

Putting our point estimates and their standard errors based on the empirical information matrix H(\hat{\pmb{\theta}}) side by side, we have:

cbind(Theta, EmpSE)
##          Theta      EmpSE
## Pi   0.5937860 0.01917065
## Mu1  5.0046047 0.04057080
## Var1 0.9581729 0.04821654
## Mu2  2.0020342 0.08641566
## Var2 1.6396322 0.12265559

Comparison to numerical differentiation

Apparently the empirical information has a subpar reputation?? The realistic alternative to using the empirical information matrix is to use numerical differentiation to compute the Hessian of the observed log-likelihood, and then multiply that by -1 to approximate the observed Fisher information. Let’s use the hessian function within the numDeriv library to do just that, and compare!

library(numDeriv)

Here, we specify a function for the observed log-likelihood.

obsLL <- function(Theta, y){
  #' Observed log-likelihood
  summand <- log(Theta['Pi'] * f1(y, Theta['Mu1'], Theta['Var1']) +
                   (1-Theta['Pi']) * f2(y, Theta['Mu2'], Theta['Var2']))
  LL <- sum(summand)
  return(LL)
}
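
Before computing the Hessian, we can also use numDeriv’s grad function for a quick sanity check of the Louis (1982) identity Q'(\pmb{y}; \pmb{\theta}) = S(\pmb{y}; \pmb{\theta}). At any parameter value (here, an arbitrary perturbation of Theta; this check is my addition, not part of the original workflow), the column sums of the objective score components should match the numerical gradient of the observed log-likelihood.

# Optional check of the identity Q'(y; theta) = S(y; theta) at an arbitrary theta
thetaCheck <- Theta * 1.05
wCheck     <- EStep(y, thetaCheck)
cbind('Qprime'   = colSums(dQi_vec(wCheck, y, thetaCheck)),
      'numScore' = grad(func=function(Theta, y) obsLL(Theta, y), x=thetaCheck, y=y))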

Next, we use hessian to obtain the matrix of second derivatives of the observed log-likelihood. Taking the inverse of the negative Hessian gives us the variance-covariance matrix, and our SEs are again equal to the square root of its diagonal.

estSE <- function(y, Theta){
  Hess <- hessian(func=function(Theta, y) obsLL(Theta, y), x=Theta, y=y)
  colnames(Hess) <- rownames(Hess) <- names(Theta)
  Var  <- solve(-Hess)
  SE   <- sqrt(diag(Var))
  return(list('Hessian'=Hess, 'Var'=Var, 'SE'=SE))
}

Applying this to our data and estimates, we get very similar results to before:

numResults <- estSE(y, Theta)
numResults
## $Hessian
##               Pi         Mu1        Var1         Mu2       Var2
## Pi   -15288.8597 -1731.18348   894.96393 -1391.41293 -487.01278
## Mu1   -1731.1835 -2336.44501  -540.42459   319.32468   16.21448
## Var1    894.9639  -540.42459 -1074.39207   -79.54719   58.42197
## Mu2   -1391.4129   319.32468   -79.54719  -810.75429  205.53416
## Var2   -487.0128    16.21448    58.42197   205.53416 -233.31043
## 
## $Var
##                 Pi           Mu1          Var1          Mu2         Var2
## Pi    0.0003406151 -0.0005777326  0.0005757352 -0.001316482 -0.001766736
## Mu1  -0.0005777326  0.0015142485 -0.0012547940  0.002528440  0.003224413
## Var1  0.0005757352 -0.0012547940  0.0020627894 -0.002421239 -0.002905448
## Mu2  -0.0013164817  0.0025284404 -0.0024212392  0.006841598  0.008344544
## Var2 -0.0017667358  0.0032244133 -0.0029054483  0.008344544  0.014821678
## 
## $SE
##         Pi        Mu1       Var1        Mu2       Var2 
## 0.01845576 0.03891335 0.04541794 0.08271395 0.12174431

and putting everything side-by-side we have:

cbind(Theta, EmpSE, 'NumSE'=numResults$SE)
##          Theta      EmpSE      NumSE
## Pi   0.5937860 0.01917065 0.01845576
## Mu1  5.0046047 0.04057080 0.03891335
## Var1 0.9581729 0.04821654 0.04541794
## Mu2  2.0020342 0.08641566 0.08271395
## Var2 1.6396322 0.12265559 0.12174431

In this case at least, results are very similar between the two approaches! The empirical SEs are slightly larger than the numerical SEs, so, if anything, they will result in slightly more conservative inference.
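
As a quick illustration of the inference promised in the introduction (my addition, and only a rough sketch relying on the asymptotic normality of the MLEs), approximate 95% Wald confidence intervals can be formed from the point estimates and their empirical standard errors:

# Approximate 95% Wald confidence intervals based on the empirical SEs
WaldCI <- cbind('Estimate' = Theta,
                'Lower'    = Theta - qnorm(0.975)*EmpSE,
                'Upper'    = Theta + qnorm(0.975)*EmpSE)
round(WaldCI, 3)

Here, each interval covers the corresponding true parameter value, though intervals for the variance parameters are often better constructed on the log scale.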

Conclusion

In this blog post, we reviewed variance estimation for maximum likelihood estimates based on inversion of the Fisher information matrix. We then used properties of the score function and the EM objective function to derive the form of the empirical information matrix for EM estimates. Results were applied to a simulated dataset generated from a two-component Gaussian mixture model. Empirical standard errors were then compared to those obtained through numerical differentiation of the observed log-likelihood. Standard errors were similar between the two methods, with empirical standard errors being slightly larger than their numerical counterparts.

I think that the empirical information matrix approach to estimating SEs is really elegant in the i.i.d. case, requiring only the first derivatives of the objective function. The objective function is much easier to differentiate than the observed log-likelihood, and its derivatives take the same values as the score function. These days, numerical differentiation is really easy to implement (one function definition + one line of code here), so why not! What I take away from this is that the empirical information matrix can serve as a good sanity check for your numerical results.

Thanks for reading!


P.S. Shoutout to Dr. Tom Chen for first showing me how to use the hessian function; PhD candidate Nick Birk for performing a quick lit review on this topic; and Dr. Rui Wang for pointing me in the direction of Louis (1982).
