Getting to know U: the asymptotic distribution of a single U-statistic

After my last grand slam title, U-, V-, and Dupree statistics, I was really feeling the pressure to keep my title game strong. Thank you to my wonderful friend Steve Lee for suggesting this beautiful title.

Overview

A statistical functional is any real-valued function of a distribution function F such that

    \[ \theta = T(F) \]

and represents a characteristic of the distribution F, such as the mean, variance, or quantiles.

Often, F is unknown but assumed to belong to a broad class of distribution functions \mathcal{F}, subject only to mild restrictions such as continuity or the existence of specific moments.

A random sample X_1, …, X_n \stackrel{i.i.d}{\sim} F can be used to construct the empirical cumulative distribution function (ECDF) \hat{F}_n,

    \[ \hat{F}_{n}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i \leq x) \]

which assigns mass \frac{1}{n} to each X_i.

\hat{F}_{n} is a valid, discrete CDF which can be substituted for F to obtain \hat{\theta} = T(\hat{F}_n). These estimators are referred to as plug-in estimators for obvious reasons.
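For instance, the ECDF and a plug-in estimate take only a couple of lines of R; a minimal sketch, using simulated standard normal data purely for illustration:

# Minimal sketch: ECDF and a plug-in estimate of the mean
set.seed(1)
x <- rnorm(100)

# ecdf() returns a step function placing mass 1/n on each observation
F_hat <- ecdf(x)
F_hat(0)       # estimated P(X <= 0)

# The plug-in estimator of the mean is the expectation under F_hat, i.e. the sample mean
mean(x)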

For more details on statistical functionals and plug-in estimators, you can check out my blog post Plug-in estimators of statistical functionals!

Many statistical functionals take the form of an expectation with respect to F of a real-valued function \phi of a \leq n arguments,

    \[ \theta = T(F) = \mathbb{E}_{F}~ \phi(X_1, …, X_a) .\]

When \phi(x_1, …, x_a) is a function symmetric in its arguments, e.g. \phi(x_1, x_2) = \phi(x_2, x_1) for a = 2, it is referred to as a symmetric kernel of degree a. If \phi is not symmetric, a symmetric equivalent \phi^{*} can always be found,

    \[\phi^{*}(x_1, …, x_a) = \frac{1}{a!} \sum_{\pi ~\in~ \Pi} \phi(x_{\pi(1)}, …, x_{\pi(a)})\]

where \Pi represents the set of all permutations of the indices 1, …, a.
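For example, the asymmetric kernel \phi(x_1, x_2) = x_1^2 - x_1 x_2, whose expectation under i.i.d. sampling is \sigma^2, symmetrizes to the familiar variance kernel,

    \[\phi^{*}(x_1, x_2) = \frac{1}{2}\left[ (x_1^2 - x_1 x_2) + (x_2^2 - x_2 x_1) \right] = \frac{(x_1 - x_2)^2}{2}.\]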

A statistical functional \theta = T(F) belongs to a special family of expectation functionals when:

  1. T(F) = \mathbb{E}_F ~\phi(X_1, …, X_a), and
  2. \phi(X_1, …, X_a) is a symmetric kernel of degree a.

Plug-in estimators of expectation functionals are referred to as V-statistics and can be expressed explicitly as,

    \[V_n = \frac{1}{n^a} \sum_{i_1 = 1}^{n} … \sum_{i_a = 1}^{n} \phi(X_{i_1}, …, X_{i_a}) \]

so that V_n is the average of \phi evaluated at all n^a ordered a-tuples drawn with replacement from X_1, …, X_n. Since the same X_i can appear more than once within a summand, V_n is generally biased.
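As a concrete illustration, take the variance kernel \phi(x_1, x_2) = (x_1 - x_2)^2 / 2. A minimal sketch in R, averaging \phi over all n^2 ordered pairs by brute force:

# Minimal sketch: V-statistic for the variance kernel phi(x1, x2) = (x1 - x2)^2 / 2
set.seed(1)
x <- rnorm(20)

# Average phi over ALL n^2 ordered pairs -- indices are allowed to repeat
V_n <- mean(outer(x, x, function(a, b) (a - b)^2 / 2))
V_n
mean((x - mean(x))^2)    # identical: the plug-in (biased, divide-by-n) variance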

By restricting the summands to distinct indices only, an unbiased estimator known as a U-statistic arises. In fact, when the family of distributions \mathcal{F} is large enough, it can be shown that a U-statistic can always be constructed for expectation functionals.

Since \phi is symmetric, we can require that 1 \leq i_1 < ... < i_a \leq n, resulting in {n \choose a} ways of choosing a distinct indices from 1, ..., n. The U-statistic is then the average of \phi evaluated at all {n \choose a} distinct combinations of size a from X_1, ..., X_n,

    \[U_n = \frac{1}{{n \choose a}} \mathop{\sum … \sum} \limits_{1 \leq i_1 < ... < i_a \leq n} \phi(X_{i_1}, ..., X_{i_a}).\]

While i_j \neq i_k within each summand now, each X_i still appears in multiple summands, so U_n is a sum of correlated terms. As a result, the classical central limit theorem for i.i.d. sums cannot be relied upon directly to determine the limiting distribution of U_n.
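Continuing the variance-kernel example, the U-statistic averages \phi over the {n \choose 2} distinct pairs only; a minimal sketch in R:

# Minimal sketch: U-statistic for the variance kernel over distinct pairs only
set.seed(1)
x     <- rnorm(20)
pairs <- combn(x, 2)      # each column holds one pair (x_i, x_j) with i < j
U_n   <- mean((pairs[1, ] - pairs[2, ])^2 / 2)
U_n
var(x)                    # identical: the unbiased sample variance (divide by n - 1)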

For more details on expectation functionals and their estimators, you can check out my blog post U-, V-, and Dupree statistics!

This blog post provides a walk-through derivation of the limiting, or asymptotic, distribution of a single U-statistic U_n.

Continue reading Getting to know U: the asymptotic distribution of a single U-statistic

U-, V-, and Dupree statistics

To start, I apologize for this blog’s title but I couldn’t resist referencing the Owen Wilson classic You, Me, and Dupree – wow! The other gold-plated candidate was U-statistics and You. Please, please, hold your applause.


My previous blog post defined statistical functionals as any real-valued function of an unknown CDF, T(F), and explained how plug-in estimators could be constructed by substituting the empirical cumulative distribution function (ECDF) \hat{F}_{n} for the unknown CDF F. Plug-in estimators of the mean and variance were provided and used to demonstrate that plug-in estimators can be biased.

    \[ \hat{\mu} = \mathbb{E}_{\hat{F}_n}[X] = \sum_{i=1}^{n} X_i P(X = X_i) = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}_{n} \]

    \[ \hat{\sigma}^{2} = \mathbb{E}_{\hat{F}_{n}}[(X- \mathbb{E}_{\hat{F}_n}[X])^2] = \mathbb{E}_{\hat{F}_n}[(X - \bar{X}_{n})^2] = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_{n})^2. \]
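In R, these plug-in estimates are simply the sample mean and the divide-by-n variance; a minimal sketch contrasting the latter with the unbiased var():

# Minimal sketch: plug-in mean and variance vs. the unbiased sample variance
set.seed(1)
x <- rnorm(30)

mean(x)                  # plug-in estimate of the mean (unbiased)
mean((x - mean(x))^2)    # plug-in estimate of the variance (divides by n, biased)
var(x)                   # usual sample variance (divides by n - 1, unbiased)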

Statistical functionals that meet the following two criteria represent a special family of functionals known as expectation functionals:

1) T(F) is the expectation of a function g with respect to the distribution function F; and

    \[ T(F) = \mathbb{E}_{F} ~g(X)\]

2) the function g(\cdot) takes the form of a symmetric kernel.

Expectation functionals encompass many common parameters and are well-behaved. Plug-in estimators of expectation functionals, named V-statistics after von Mises, can be obtained but may be biased. It is, however, always possible to construct an unbiased estimator of expectation functionals regardless of the underlying distribution function F. These estimators are named U-statistics, with the “U” standing for unbiased.

This blog post provides 1) the definitions of symmetric kernels and expectation functionals; 2) an overview of plug-in estimators of expectation functionals or V-statistics; 3) an overview of unbiased estimators for expectation functionals or U-statistics.

Continue reading U-, V-, and Dupree statistics

Parametric vs. Nonparametric Approach to Estimations

Parametric statistics assume that the unknown CDF F belongs to a family of CDFs characterized by a parameter (vector) \theta. Since the form of F is assumed, the target of estimation is its parameters \theta, and all uncertainty about F reduces to uncertainty about its parameters. The parameters are estimated by \hat{\theta}, and the estimates are then substituted into the assumed distribution to conduct inference for the quantities of interest. If the assumed form of F is incorrect, inference may be inaccurate, or trends in the data may be missed.

To demonstrate the parametric approach, consider n = 100 independent and identically distributed random variables X_1, …, X_n generated from an exponential distribution with rate \lambda = 2. Investigators wish to estimate the 75^{th} percentile and erroneously assume that their data are normally distributed. Thus, F is assumed to be the normal CDF, but \mu and \sigma^2 are unknown. The parameters \mu and \sigma are estimated in the usual way by \bar{x} and s, respectively. Since the normal distribution belongs to the location-scale family, an estimate of the p^{th} percentile is provided by,

    \[x_p = \bar{x} + s\Phi^{-1}(p)\]

where \Phi^{-1} is the standard normal quantile function, also known as the probit.

set.seed(12345)
library(tidyverse, quietly = T)
# Generate data from Exp(2)
x <- rexp(n = 100, rate = 2)

# True value of 75th percentile with rate = 2
true <- qexp(p = 0.75, rate = 2) 
true
## [1] 0.6931472
# Estimate mu and sigma
xbar <- mean(x)
s    <- sd(x)

# Estimate 75th percentile assuming mu = xbar and sigma = s
param_est <- xbar + s * qnorm(p = 0.75)
param_est
## [1] 0.8792925

The true value of the 75^{th} percentile of \text{Exp}(2) is 0.69 while the parametric estimate is 0.88.

Nonparametric statistics make fewer assumptions about the unknown distribution F, requiring only mild conditions such as continuity or the existence of specific moments. Instead of estimating parameters of F, F itself is the target of estimation. F is commonly estimated by the empirical cumulative distribution function (ECDF) \hat{F},

    \[\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i \leq x).\]

Any statistic that can be expressed as a function of the CDF, known as a statistical functional and denoted \theta = T(F), can be estimated by substituting \hat{F} for F. That is, plug-in estimators can be obtained as \hat{\theta} = T(\hat{F}).
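Returning to the exponential example above, a nonparametric plug-in estimate of the 75^{th} percentile can be obtained by inverting \hat{F} directly, which is what quantile() with type = 1 computes; a minimal sketch reusing the simulated x:

# Minimal sketch: nonparametric plug-in estimate of the 75th percentile
# type = 1 inverts the ECDF, i.e. uses F-hat in place of F
nonparam_est <- quantile(x, probs = 0.75, type = 1)
nonparam_est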

Continue reading Parametric vs. Nonparametric Approach to Estimations

The Probabilistic Index for Two Normally Distributed Outcomes

Consider a two-armed study comparing a placebo and treatment. In general, the probabilistic index (PI) is defined as,

    \[\text{PI} = P(X < Y) + \frac{1}{2} P(X = Y)\]

and is interpreted as the probability that a subject in the treatment group will have an increased response compared to a subject in the placebo group. The probabilistic index is a particularly useful effect measure for ordinal data, where effects can be difficult to define and interpret owing to the absence of a meaningful notion of difference between categories. However, it can also be used for continuous data, noting that when the outcome is continuous, P(X = Y) = 0 and the PI reduces to P(X < Y).

PI = 0.5 suggests an increased outcome is equally likely for subjects in the placebo and treatment group, while PI > 0.5 suggests an increased outcome is more likely for subjects in the treatment group compared to the placebo group, and the opposite is true when PI < 0.5.

Simulation

Suppose X \sim N(\mu_X, \sigma^{2}_{X}) and Y \sim N(\mu_Y, \sigma^{2}_{Y}) represent the independent outcomes in the placebo and treatment groups, respectively, and that an increased value of the outcome is the desired response.
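Since X and Y are independent, Y - X \sim N(\mu_Y - \mu_X, \sigma^{2}_{X} + \sigma^{2}_{Y}) and the probabilistic index has a convenient closed form in this setting,

    \[\text{PI} = P(X < Y) = \Phi\left( \frac{\mu_Y - \mu_X}{\sqrt{\sigma^{2}_{X} + \sigma^{2}_{Y}}} \right)\]

where \Phi denotes the standard normal CDF.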

We simulate n_X = n_Y = 50 observations from each group such that treatment truly increases the outcome and the within-group variances are equal, \sigma^{2}_{X} = \sigma^{2}_{Y}.

# Loading required libraries
library(tidyverse)
library(gridExtra)

# Setting seed for reproducibility
set.seed(12345)
# Simulating data
n_X = n_Y = 50
sigma_X = sigma_Y = 1
mu_X = 5; mu_Y = 7

outcome_X = rnorm(n = n_X, mean = mu_X, sd = sigma_X)
outcome_Y = rnorm(n = n_Y, mean = mu_Y, sd = sigma_Y)

df <- data.frame(Group = c(rep('Placebo', n_X), rep('Treatment', n_Y)),
                 Outcome = c(outcome_X, outcome_Y))

Examining side-by-side histograms and boxplots of the outcomes within each group, there appears to be strong evidence that treatment increases the outcome as desired. Thus, we would expect a probabilistic index close to 1 as most outcomes in the treatment group appear larger than those of the placebo group.

# Histogram by group
hist_p <- df %>%
  ggplot(aes(x = Outcome, fill = Group)) +
    geom_histogram(position = 'identity', alpha = 0.75, bins = 10) + 
    theme_bw() +
    labs(y = 'Frequency')

# Boxplot by group
box_p <- df %>%
  ggplot(aes(x = Outcome, fill = Group)) +
    geom_boxplot() + 
    theme_bw() +
    labs(y = NULL)  # the boxplot's y-axis carries no meaningful scale here

# Combine plots
grid.arrange(hist_p, box_p, nrow = 2)

(Figure: side-by-side histograms and boxplots of Outcome by Group.)
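As a quick numerical check, the empirical probabilistic index can be computed by comparing every placebo outcome with every treatment outcome; a minimal sketch (ties are ignored since the outcomes are continuous):

# Minimal sketch: empirical probabilistic index from the simulated data
# Proportion of all (placebo, treatment) pairs where the treatment outcome is larger
PI_hat <- mean(outer(outcome_X, outcome_Y, FUN = "<"))
PI_hat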

Continue reading The Probabilistic Index for Two Normally Distributed Outcomes