ML Theory: Kernel Density Estimation

Motivation

It is important to have an understanding of some of the more traditional approaches to function estimation and classification before delving into the trendier topics of neural networks and decision trees. Many of these methods build on an understanding of each other and thus to truly be a MACHINE LEARNING MASTER, we’ve got to pay our dues. We will therefore start with the slightly less sexy topic of kernel density estimation.

Let X be a random variable with a continuous distribution function (CDF) F(x) = Pr(X \leq x) and probability density function (PDF)

    \[f(x) = \frac{d}{dx} F(x)\]

Our goal is to estimate f(x) from a random sample \lbrace X_1, \hdots, X_n \rbrace. Estimation of f(x) has a number of applications including construction of the popular Naive Bayes classifier,

    \[ \hat{Pr}(C = c | X = x_0) = \frac{\hat{\pi}_c \hat{f}_{c}(x_0)}{\sum_{k=1}^{C} \hat{\pi}_{k} \hat{f}_{k}(x_0)} \]

Derivation

The CDF is naturally estimated by the empirical distribution function (EDF)

    \[ \hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i \leq x)\]

where

    \[ \mathbb{I}(X_i \leq x) = \begin{cases} 1 & X_i \leq x \\ 0 & \text{otherwise} \end{cases} \]

I’m not saying naturally to be a jerk! I know the feeling of reading proof-heavy journal articles that end sections with “extension to the d-dimensional case is trivial”, it’s not fun when it’s not trivial to you. \hat{F}(x) essentially estimates Pr(X \leq x), the probability of X being less than some threshold x, as the proportion of observations in our sample less than or equal to x.

It might seem natural to estimate the density f(x) as the derivative of \hat{F}(x) but \hat{F}(x) is just a collection of mass points and is not continuous. Instead, lets consider a discrete derivative. For some small \lambda > 0,

    \[ \hat{f}(x) = \frac{\hat{F}(x + \lambda) - \hat{F}(x - \lambda)}{2\lambda} \]

This can be re-expressed as

    \[ \hat{f}(x) = \frac{\sum_{i=1}^{n} \mathbb{I}(x - \lambda \leq X_i \leq x + \lambda)}{2n\lambda} \]

since Pr(X \leq b) - Pr(X \leq a) = Pr(a \leq X \leq b) (draw a picture if you need to convince yourself)! Simplifying further,

    \begin{align*} \hat{f}(x) &=\frac{\sum_{i=1}^{n} \mathbb{I}(-\lambda \leq X_i - x \leq \lambda)}{2n\lambda} \\ &= \frac{1}{2n\lambda} \sum_{i=1}^{n} \mathbb{I} \left( |X_i - x| \leq \lambda \right) \\ &= \frac{1}{2n\lambda} \sum_{i=1}^{n} \mathbb{I} \left( \frac{|X_i - x|}{\lambda} \leq 1 \right) \\ &= \frac{1}{n\lambda} \sum_{i=1}^{n} K \left(\frac{X_i - x}{\lambda} \right) \end{align*}

where

    \[ K(u) = \begin{cases} \frac{1}{2} & |u| \leq 1 \\ 0 & |u| > 1 \end{cases} \]

is the uniform density function on [-1, 1]. Mathemagic!

From our derivation, we see that \hat{f}(x) essentially determines the number of observations within a small distance, \lambda, of x. The bandwidth \lambda dictates the size of the window for which X_i are considered. That is, the bandwidth controls the degree of smoothing. The greater the number of observations within this window, the greater \hat{f}(x) is.

Our derived estimate \hat{f}(x) is a special case of what is referred to as a kernel estimator. The general case is

(1)   \begin{equation*} \hat{f}(x) = \frac{1}{n\lambda} \sum_{i=1}^{n} K \left(\frac{X_i - x}{\lambda} \right) \end{equation*}

where K(u) is a kernel function.

Kernel Functions

A kernel function K(u) : \mathbb{R} \rightarrow \mathbb{R} is any function which satisfies

    \[ \int_{-\infty}^{\infty} K(u) du = 1 \]

The kernel function acts as our weighting function, assigning less mass to observations farther from x. This helps to ensure that our fitted curve is smooth.

Non-negative kernels satisfy K(u) \geq 0 for all u and are therefore probability density functions. Symmetric kernels satisfy K(u) = K(-u) for all u. The Gaussian, or Normal, distribution is a popular symmetric, non-negative kernel.

The moments of a kernel are defined as

    \[ \kappa_j(K) = \int_{-\infty}^{\infty} u^{j} K(u) du \]

The order of a kernel, \nu, is defined as the order of the first non-zero moment. For example, if \kappa_1(K) = 0 and \kappa_2(K) > 0, then K is a second-order kernel and \nu = 2. Symmetric non-negative kernels are second-order and hence second-order kernels are the most common in practice.

Other popular kernels include the Epanechnikov, uniform, bi-weight, and tri-weight kernels. The Epanechnikov kernel is considered to be the optimal kernel as it minimizes error. Choice of the bandwidth, however, is often more influential on estimation quality than choice of kernel.

Kernel density estimates for various bandwidths.
Kernel density estimates for various bandwidths. The thick black line represents the optimal bandwidth, \lambda = 0.344. The jagged dotted line is the estimate of f(x) when the bandwidth is halved, \lambda = 0.172. The flatter, bell-shaped curve represents \lambda = 1.5 which clearly oversmooths the data.

Bandwidth Considerations

As noted above, the bandwidth \lambda determines the size of the envelope around x and thus the number of x_i used for estimation. In the case of a Gaussian kernel, \lambda would translate to the standard deviation. For k-nearest neighbours, \lambda would translate to the size of the neighbourhood expressed as the span k/N (k points within the N observation training set).

The infamous bias-variance trade-off must be considered when selecting \lambda. If we choose a small value of \lambda, we consider a smaller number of x_i. This results in higher variance due to smaller sample size but less bias as each x_i will be closer to x. As we increase \lambda, our window size increases and we consider a larger number of x_i. This reduces our variance but our bias will now be higher as we are using x_i that are further from x and thus information that might not be particularly relevant.

In other words, if \lambda is too large we will smooth out important information but if it is too small, our estimate will be too rough and contain unnecessary noise. Choosing \lambda is no easy task and several methods for bandwidth selection have been proposed including cross-validation methods, rules of thumb, and visual inspection.

Personally, I prefer to use cross-validation as a starting point since I try to minimize the effect of my own biases on estimation. However, these methods aren’t perfect and if feasible, I will follow this up with visual inspection to ensure that the CV bandwidth makes sense in the context of my data and problem. I will generally select a slightly rougher fit over a smoother fit as it is easier for the human eye to imagine a smoother fit than a rougher fit!

Properties of the Kernel Density Estimator

The kernel density estimator

    \[ \hat{f}(x) = \frac{1}{n\lambda} \sum_{i=1}^{n} K \left( \frac{X_i - x}{\lambda} \right) \]

has several convenient numerical properties:

  1. If the kernel function K(u) is non-negative, the density estimator \hat{f}(x) is also non-negative.
  2. \hat{f}(x) integrates to one, making it a valid density function when K is non-negative.

    To prove this, let

        \[u = \frac{X_i - x}{\lambda}\]

    Then, via a change-of-variables,

        \begin{align*} \int_x \hat{f}(x) ~dx &= \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{\infty} \frac{1}{\lambda} K \left(\frac{X_i - x}{\lambda} \right)~dx \\ &= \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{\infty} K(u) ~ du \\ &= \frac{1}{n} \sum_{i=1}^{n} 1 \\  &= 1 \end{align*}

  3. The mean of the estimated density is \bar{X}.

    Using the following transformation,

        \[ u = \frac{X_i - x}{\lambda} \Rightarrow x = u\lambda + X_i \]

    and thus

        \begin{align*} E[X] = \int_{x} x \hat{f}(x) ~dx &= \frac{1}{n} \sum_{n} \int_{x} \frac{x}{\lambda} K \left(\frac{X_i - x}{\lambda} \right)~dx \\ &= \frac{1}{n} \sum_{n} \int_{u} (X_i + u\lambda) K(u) ~du \\ &= \frac{1}{n} \sum_{n} X_i \int_{u} K(u) ~du + \frac{1}{n} \sum_{n} \int_{u} u K(u) ~du \end{align*}

    Recall that the kernel function K(u) must integrate to one and that for second-order kernels,

        \[ \kappa_1(K) = \int_{u} u K(u) ~ du = 0 \]

    Therefore,

        \[ \int_{x} x \hat{f}(x) ~dx = \frac{1}{n} \sum_{n} X_i = \bar{X}\]

  4. The variance of the estimated density is \hat{\sigma}^2 + \lambda^{2} \kappa_2(K). That is, the sample variance is inflated by a factor of \lambda^{2} \kappa_2(K) when the density is estimated.

    The variance of X is given by

        \[ Var[X] = E[X^2] - (E[X])^2 \]

    The second moment of the estimated density is

        \begin{align*}  E[X^2] &= \int_{x} x^2 \hat{f}(x) ~dx  = \frac{1}{n} \sum_{n} \int_{x} \frac{x^2}{\lambda} K\left(\frac{X_i - x}{\lambda} \right)~dx \\  &= \frac{1}{n} \sum_n \int_{x} (X_i + u\lambda)^2 K(u)~du \\  &= \frac{1}{n} \sum_{n} X_i^2 + \frac{2}{n} \sum_{n} \lambda X_i \int_{u} u K(u)~du + \frac{1}{n} \sum_{n} \lambda^2 \int_{u} u^2 K(u) ~du \\  &= \frac{1}{n} \sum_{n} X_i^2 + \frac{2}{n} \sum_{n}\lambda  X_i \kappa_1(K) + \lambda^2 \kappa_2(K) \\  &= \frac{1}{n} \sum_{n} X_i^2 + \lambda^2 \kappa_2(K)  \end{align*}

    Thus,

        \begin{align*}  Var[X] &= \frac{1}{n} X_i^2 + \lambda^2 \kappa_2(K) - \left( \frac{1}{n} \sum_{n} X_i \right)^2 \\  &= \left\lbrace \frac{1}{n} \sum_{n} X_i^2 - \left( \frac{1}{n} \sum_{n} X_i \right)^2 \right\rbrace + \lambda^2 \kappa_2(K) \\  &= \hat{\sigma}^2 + \lambda^2 \kappa_2(K)  \end{align*}

Summary

  • The empirical distribution function (EDF) assigns a mass of 1/N to each x_i, resulting in a discrete or “jumpy” estimate.
  • Kernel density estimators (KDE) estimate f(x) by constructing a neighbourhood around the point of interest x. Observations within this neighbourhood are then assigned a mass based on their distance from x via a kernel function, resulting in a smooth estimate.
  • Popular kernel choices are the Gaussian and Epanechnikov kernels. These kernels are second-order kernels, suggesting they are both proper, symmetric density functions.
  • The size of the neighbourhood is dictated by the bandwidth \lambda. Special care must be taken when selecting \lambda in order to ensure that the bias-variance trade-off is balanced for your problem. Different methods such as CV are available to assist you with optimal bandwidth selection.
  • Choice of bandwidth has a larger impact on estimation quality than your choice of kernel.

I will be extending the kernel density estimator to kernel regression in my future blog posts and conducting a case study in R that uses these methods, stay tuned!

Published by

Emma Smith

Emma Smith is a young statistician who's on a mission to convince the masses statistics is as awesome as she *knows* it is! When she's not working on expanding her knowledge of machine learning and mathematical statistics, she's busy petting cats and unsuccessfully convincing her boyfriend to let her adopt them, hiking, concocting indie and folk rock playlists, and kicking butt in roller derby.

3 thoughts on “ML Theory: Kernel Density Estimation”

Leave a Reply

Your email address will not be published. Required fields are marked *