regression Archives • Statisticelle

Two of my recent blog posts focused on two different, but as we will see related, methods which essentially transform observed responses into a summary of their contribution to an estimate: structural components resulting from Sen’s (1960) decomposition of U-statistics and pseudo-observations resulting from application of the leave-one-out jackknife. As I note in this comment, I think the real value of deconstructing estimators in this way results from the use of these quantities, which in special (but common) cases are asymptotically uncorrelated and identically distributed, to: (1) simplify otherwise complex variance estimates and construct interval estimates, and (2) apply regression methods to estimators without an existing regression framework.

As discussed by Miller (1974), pseudo-observations may be treated as approximately independent and identically distributed random variables when the quantity of interest is a function of the mean or variance, and more generally, any function of a U-statistic. Several other scenarios where these methods are applicable are also outlined. Many estimators of popular “parameters” can actually be expressed as U-statistics. Thus, these methods are quite broadly applicable. A review of basic U-statistic theory and some common examples, notably the difference in means or the Wilcoxon Mann-Whitney test statistic, can be found within my blog post: One, Two, U: Examples of common one- and two-sample U-statistics.

As an example of use case (1), DeLong et al. (1988) used structural components to estimate the variances and covariances of the areas under multiple correlated receiver operating characteristic (ROC) curves, or AUCs. Hanley and Hajian-Tilaki (1997) later referred to the methods of DeLong et al. (1988) as “the cleanest and most elegant approach to variances and covariances of AUCs.” As an example of use case (2), Andersen & Pohar Perme (2010) provide a thorough summary of how pseudo-observations can be used to construct regression models for important survival parameters like survival at a single time point and the restricted mean survival time.

Now, structural components are restricted to U-statistics while pseudo-observations may be used more generally, as discussed. But, if we construct pseudo-observations for U-statistics, one of several “valid” scenarios, what is the relationship between these two quantities? Hanley and Hajian-Tilaki (1997) provide a lovely discussion of the equivalence of these two methods when applied to the area under the receiver operating characteristic curve or simply the AUC. This blog post follows their discussion, providing concrete examples of computing structural components and pseudo-observations using R, and demonstrating their equivalence in this special case.
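One way to see the connection is the following minimal Python sketch, which computes both quantities for the AUC and compares them. It assumes the jackknife is applied within each group separately, with that group's size as the multiplier in the pseudo-observation; under other conventions (for example, the pooled sample size) the two need not match exactly. The data and function names are hypothetical.

```python
def psi(x, y):
    """Mann-Whitney kernel: 1 if x > y, 0.5 if tied, 0 otherwise."""
    return 1.0 if x > y else (0.5 if x == y else 0.0)

def auc(xs, ys):
    """Two-sample U-statistic estimate of the AUC."""
    return sum(psi(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def structural_components(xs, ys):
    """Average the kernel over the opposite group for each subject."""
    v10 = [sum(psi(x, y) for y in ys) / len(ys) for x in xs]
    v01 = [sum(psi(x, y) for x in xs) / len(xs) for y in ys]
    return v10, v01

def pseudo_observations(xs, ys):
    """Leave-one-out jackknife pseudo-observations, computed within each group."""
    m, n = len(xs), len(ys)
    theta = auc(xs, ys)
    # Drop one diseased subject at a time, keeping all non-diseased subjects
    px = [m * theta - (m - 1) * auc(xs[:i] + xs[i + 1:], ys) for i in range(m)]
    # Drop one non-diseased subject at a time, keeping all diseased subjects
    py = [n * theta - (n - 1) * auc(xs, ys[:j] + ys[j + 1:]) for j in range(n)]
    return px, py

# Hypothetical scores for diseased (xs) and non-diseased (ys) subjects
xs = [3.1, 2.4, 4.0, 1.9, 2.4]
ys = [1.2, 2.4, 0.7, 2.0]

v10, v01 = structural_components(xs, ys)
px, py = pseudo_observations(xs, ys)

# Largest absolute discrepancy between the two constructions
max_diff = max(abs(a - b) for a, b in zip(v10 + v01, px + py))
print(max_diff)
```

Algebraically, leaving out diseased subject $i$ gives $\hat{\theta}_{-i} = (m\hat{\theta} - V_{10}(X_i))/(m-1)$, so $m\hat{\theta} - (m-1)\hat{\theta}_{-i} = V_{10}(X_i)$, and the discrepancy above is zero up to floating-point error.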

Continue reading Nonparametric neighbours: U-statistic structural components and jackknife pseudo-observations for the AUC

Motivation

For observed pairs $(x_i, y_i)$, $i = 1, \ldots, n$, the relationship between $x$ and $y$ can be defined generally as

$y_i = f(x_i) + \varepsilon_i$

where $f(x_i) = E[y_i | x = x_i]$ and $\varepsilon_i \stackrel{iid}{\sim} (0, \sigma^2)$. If we are unsure about the form of $f(\cdot)$, our objective may be to estimate $f(\cdot)$ without making too many assumptions about its shape. In other words, we aim to “let the data speak for itself”.

Simulated scatterplot of $y = f(x) + \varepsilon$. Here, $x \sim \text{Uniform}(0, 10)$ and $\varepsilon \sim N(0, 1)$. The true function $f(x) = \sin(x)$ is displayed in green.

Non-parametric approaches require only that $f(\cdot)$ be smooth and continuous. These assumptions are far less restrictive than those of parametric approaches, thereby increasing the number of potential fits and providing additional flexibility. This makes non-parametric models particularly appealing when prior knowledge about the functional form of $f(\cdot)$ is limited.

Estimating the Regression Function

If multiple values of $y$ were observed at each $x$, $f(x)$ could be estimated by averaging the value of the response at each $x$. However, since $x$ is often continuous, it can take on a wide range of values, so repeated observations at the same $x$ are rare. Instead, a neighbourhood of $x$ is considered.

Result of averaging $y_i$ at each $x_i$. The fit is extremely rough due to gaps in $x$ and the small number of $y$ values observed at each $x$.

Define the neighbourhood around $x$ as $x \pm \lambda$ for some bandwidth $\lambda > 0$. Then, a simple non-parametric estimate of $f(x)$ can be constructed as the average of the $y_i$'s corresponding to the $x_i$ within this neighbourhood. That is,

$\hat{f}_{\lambda}(x) = \frac{\sum_{i=1}^{n} \mathbb{I}(|x - x_i| \leq \lambda)~ y_i}{\sum_{i=1}^{n} \mathbb{I}(|x - x_i| \leq \lambda)} = \frac{\sum_{i=1}^{n} K\left( \frac{x - x_i}{\lambda} \right) y_i}{\sum_{i=1}^{n} K\left( \frac{x - x_i}{\lambda} \right) } \qquad (1)$

where

$K(u) = \begin{cases} \frac{1}{2} & |u| \leq 1 \\ 0 & \text{o.w.} \end{cases}$

is the uniform kernel. This estimator, referred to as the Nadaraya-Watson estimator, can be generalized to any kernel function $K(u)$ (see my previous blog post). It is, however, convention to use kernel functions of order $\nu = 2$ (e.g. the Gaussian and Epanechnikov kernels).

The red line is the result of estimating $f(x)$ with a Gaussian kernel and arbitrarily selected bandwidth of $\lambda = 1.25$. The green line represents the true function $\sin(x)$.
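The estimator in (1) can be sketched in a few lines. The Python snippet below is a minimal illustration and not the code behind the figures: the sample size of 200, the random seed, the evaluation grid, and all function names are assumptions made for the example. It simulates data as in the scatterplot ($x \sim \text{Uniform}(0, 10)$, $\varepsilon \sim N(0, 1)$, $f(x) = \sin(x)$) and evaluates the Nadaraya-Watson fit with a Gaussian kernel and $\lambda = 1.25$.

```python
import math
import random

def uniform_kernel(u):
    """Uniform kernel: 1/2 on [-1, 1], 0 otherwise."""
    return 0.5 if abs(u) <= 1 else 0.0

def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def nadaraya_watson(x0, xs, ys, bandwidth, kernel):
    """Kernel-weighted average of ys at the point x0, as in equation (1)."""
    weights = [kernel((x0 - xi) / bandwidth) for xi in xs]
    total = sum(weights)
    if total == 0:
        # No observations fall inside the neighbourhood of x0
        return float("nan")
    return sum(w * y for w, y in zip(weights, ys)) / total

# Simulated data (sample size and seed are assumptions for this sketch)
random.seed(1)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [math.sin(x) + random.gauss(0, 1) for x in xs]

# Evaluate the fit on a coarse grid over [0, 10]
grid = [0.5 * i for i in range(21)]
fit = [nadaraya_watson(x0, xs, ys, 1.25, gaussian_kernel) for x0 in grid]

for x0, f_hat in zip(grid, fit):
    print(f"x = {x0:4.1f}  fitted = {f_hat:6.3f}  true = {math.sin(x0):6.3f}")
```

Passing `uniform_kernel` instead of `gaussian_kernel` reproduces the plain neighbourhood average: the constant 1/2 cancels between the numerator and denominator, leaving the mean of the $y_i$ with $|x - x_i| \leq \lambda$.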

Continue reading Kernel Regression