learning Archives • Statisticelle

Motivation

For observed pairs $(x_i, y_i)$, $i = 1, \ldots, n$, the relationship between $x$ and $y$ can be defined generally as

$y_i = f(x_i) + \varepsilon_i$

where $f(x_i) = E[y_i | x = x_i]$ and $\varepsilon_i \stackrel{iid}{\sim} (0, \sigma^2)$. If we are unsure about the form of $f(\cdot)$, our objective may be to estimate $f(\cdot)$ without making too many assumptions about its shape. In other words, we aim to “let the data speak for itself”.

Simulated scatterplot of $y = f(x) + \varepsilon$. Here, $x \sim \text{Uniform}(0, 10)$ and $\varepsilon \sim N(0, 1)$. The true function $f(x) = \sin(x)$ is displayed in green.
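Data like those in the scatterplot could be simulated along the following lines. This is only a sketch in Python using the standard library; the sample size `n`, the `seed`, and the function name `simulate` are assumptions made for illustration, since the caption specifies only the distributions and the true function.

```python
import math
import random

def simulate(n=200, seed=1):
    """Draw n pairs (x_i, y_i) with y_i = sin(x_i) + eps_i.

    n and seed are illustrative choices, not values from the post.
    """
    rng = random.Random(seed)
    # x ~ Uniform(0, 10)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # eps ~ N(0, 1), added to the true function f(x) = sin(x)
    ys = [math.sin(x) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

xs, ys = simulate()
```

Fixing the seed keeps the simulated sample reproducible between runs.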

Non-parametric approaches require only that $f(\cdot)$ be smooth and continuous. These assumptions are far less restrictive than those of parametric approaches, which widens the range of possible fits and provides additional flexibility. This makes non-parametric models particularly appealing when prior knowledge about the functional form of $f(\cdot)$ is limited.

Estimating the Regression Function

If multiple values of $y$ were observed at each $x$, $f(x)$ could be estimated by averaging the responses at each $x$. However, since $x$ is often continuous, it can take on a wide range of values, so repeated observations at the same $x$ are rare. Instead, a neighbourhood of $x$ is considered.

Result of averaging $y_i$ at each $x_i$. The fit is extremely rough due to gaps in $x$ and the small number of $y$ values observed at each $x$.
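The naive approach of averaging the response at each distinct $x$ can be sketched as follows. The helper name `average_by_x` and the toy data are hypothetical and only meant to illustrate the idea.

```python
from collections import defaultdict

def average_by_x(xs, ys):
    """Average the responses observed at each distinct value of x."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    # One mean per distinct x, in increasing order of x
    return {x: sum(vals) / len(vals) for x, vals in sorted(groups.items())}

# Toy data: two responses at x = 1.0, a single response at x = 2.0
fit = average_by_x([1.0, 1.0, 2.0], [2.0, 4.0, 5.0])
```

With continuous $x$, most groups contain a single observation, so the "average" simply reproduces the noisy $y_i$, which is why the resulting fit is so rough.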

Define the neighbourhood around $x$ as $x \pm \lambda$ for some bandwidth $\lambda > 0$. Then, a simple non-parametric estimate of $f(x)$ can be constructed as the average of the $y_i$'s corresponding to the $x_i$ within this neighbourhood. That is,

(1) $\hat{f}_{\lambda}(x) = \frac{\sum_{i=1}^{n} \mathbb{I}(|x - x_i| \leq \lambda)~ y_i}{\sum_{i=1}^{n} \mathbb{I}(|x - x_i| \leq \lambda)} = \frac{\sum_{i=1}^{n} K\left( \frac{x - x_i}{\lambda} \right) y_i}{\sum_{i=1}^{n} K\left( \frac{x - x_i}{\lambda} \right)}$

where

$K(u) = \begin{cases} \frac{1}{2} & |u| \leq 1 \\ 0 & \text{o.w.} \end{cases}$

is the uniform kernel. This estimator, referred to as the Nadaraya-Watson estimator, can be generalized to any kernel function $K(u)$ (see my previous blog post). It is, however, convention to use kernel functions of order $\nu = 2$ (e.g. the Gaussian and Epanechnikov kernels).
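A minimal sketch of the Nadaraya-Watson estimator in Python may help make equation (1) concrete. The function names below are my own and purely illustrative, and returning NaN when no observation receives positive weight is an assumption about how to handle an empty neighbourhood. Note that the constant $\frac{1}{2}$ in the uniform kernel cancels between numerator and denominator, so the estimate reduces to a plain local average.

```python
import math

def uniform_kernel(u):
    """K(u) = 1/2 for |u| <= 1, and 0 otherwise."""
    return 0.5 if abs(u) <= 1 else 0.0

def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def epanechnikov_kernel(u):
    """K(u) = 3/4 (1 - u^2) for |u| <= 1, and 0 otherwise."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1 else 0.0

def nadaraya_watson(x, xs, ys, bandwidth, kernel=uniform_kernel):
    """Kernel-weighted average of ys, evaluated at the point x."""
    weights = [kernel((x - xi) / bandwidth) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        # No observation has positive weight at x (assumed convention)
        return float("nan")
    return sum(w * yi for w, yi in zip(weights, ys)) / total

# Toy data: only x_i = 2 and x_i = 3 lie within 1 of x = 2.5,
# so the uniform-kernel estimate is (4 + 6) / 2 = 5.0
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 4.0, 6.0, 8.0]
estimate = nadaraya_watson(2.5, toy_x, toy_y, bandwidth=1.0)
```

Swapping in `gaussian_kernel` or `epanechnikov_kernel` changes only the weights: observations closer to $x$ then contribute more than those near the edge of the neighbourhood.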

The red line is the result of estimating $f(x)$ with a Gaussian kernel and an arbitrarily selected bandwidth of $\lambda = 1.25$. The green line represents the true function $\sin(x)$.
