Kernel Regression

Motivation

For observed pairs (x_i, y_i), i = 1, …, n, the relationship between x and y can be defined generally as

    \[ y_i = m(x_i) + \varepsilon_i \]

where f(x_i) = E[y_i | x = x_i] and \varepsilon_i \stackrel{iid}{\sim} (0, \sigma^2). If we are unsure about the form of m(\cdot), our objective may be to estimate m(\cdot) without making too many assumptions about its shape. In other words, we aim to “let the data speak for itself”.

Simulated scatterplot of y = f(x) + \epsilon. Here, x \sim Uniform(0, 10) and \epsilon \sim N(0, 1). The true function f(x) = sin(x) is displayed in green.

Non-parametric approaches require only that m(\cdot) be smooth and continuous. These assumptions are far less restrictive than alternative parametric approaches, thereby increasing the number of potential fits and providing additional flexibility. This makes non-parametric models particularly appealing when prior knowledge about m(\cdot)‘s functional form is limited.

Estimating the Regression Function

If multiple values of y were observed at each x, f(x) could be estimated by averaging the value of the response at each x. However, since x is often continuous, it can take on a wide range of values making this quite rare. Instead, a neighbourhood of x is considered.

Result of averaging y_i at each x_i. The fit is extremely rough due to gaps in x and low y frequency at each x.

Define the neighbourhood around x as x \pm \lambda for some bandwidth \lambda > 0. Then, a simple non-parametric estimate of m(x) can be constructed as average of the y_i‘s corresponding to the x_i within this neighbourhood. That is,

(1)   \begin{equation*} \hat{f}_{\lambda}(x) = \frac{\sum_{n} \mathbb{I}(|x - x_i| \leq \lambda)~ y_i}{\sum_{n} \mathbb{I}(|x - x_i| \leq \lambda)} = \frac{\sum_n K\left( \frac{x - x_i}{\lambda} \right) y_i}{\sum_n K\left( \frac{x - x_i}{\lambda} \right) } \end{equation*}

where

    \[ K(u) = \begin{cases} \frac{1}{2} & |u| \leq 1 \\ 0 & \text{o.w.} \end{cases} \]

is the uniform kernel. This estimator, referred to as the Nadaraya-Watson estimator, can be generalized to any kernel function K(u) (see my previous blog bost). It is, however, convention to use kernel functions of degree \nu = 2 (e.g. the Gaussian and Epanechnikov kernels).

The red line is the result of estimating f(x) with a Gaussian kernel and arbitrarily selected bandwidth of \lambda = 1.25. The green line represents the true function sin(x).

Continue reading Kernel Regression

Kernel Density Estimation

Motivation

It is important to have an understanding of some of the more traditional approaches to function estimation and classification before delving into the trendier topics of neural networks and decision trees. Many of these methods build on an understanding of each other and thus to truly be a MACHINE LEARNING MASTER, we’ve got to pay our dues. We will therefore start with the slightly less sexy topic of kernel density estimation.

Let X be a random variable with a continuous distribution function (CDF) F(x) = Pr(X \leq x) and probability density function (PDF)

    \[f(x) = \frac{d}{dx} F(x)\]

Our goal is to estimate f(x) from a random sample \lbrace X_1, …, X_n \rbrace. Estimation of f(x) has a number of applications including construction of the popular Naive Bayes classifier,

    \[ \hat{Pr}(C = c | X = x_0) = \frac{\hat{\pi}_c \hat{f}_{c}(x_0)}{\sum_{k=1}^{C} \hat{\pi}_{k} \hat{f}_{k}(x_0)} \]

Continue reading Kernel Density Estimation

Advent of Code 2017 in R: Day 2

Day 2 of the Advent of Code provides us with a tab delimited input consisting of numbers 2-4 digits long and asks us to calculate its “checksum”. checksum is defined as the sum of the difference between each row’s largest and smallest values. Awesome! This is a problem that is well-suited for base R.

I started by reading the file in using read.delim, specifying header = F in order to ensure that numbers within the first row of the data are not treated as variable names.

When working with short problems like this where I know I won’t be rerunning my code or reloading my data often, I will use file.choose() in my read.whatever functions for speed. file.choose() opens Windows Explorer, allowing you to navigate to your file path.

input <- read.delim(file.choose(), header = F)

# Check the dimensions of input to ensure the data read in correctly.
dim(input)

After checking the dimensions of our input, everything looks good. As suspected, this is a perfect opportunity to use some vectorization via the apply function.

row_diff <- apply(input, 1, function(x) max(x) - min(x))
checksum <- sum(row_diff)
checksum

Et voilà, the answer is 45,972! Continue reading Advent of Code 2017 in R: Day 2