Getting to know U: the asymptotic distribution of a single U-statistic

After my last grand slam title, U-, V-, and Dupree statistics, I was really feeling the pressure to keep my title game strong. Thank you to my wonderful friend Steve Lee for suggesting this beautiful title.

Overview

A statistical functional is any real-valued function of a distribution function F such that

    \[ \theta = T(F) \]

and represents a characteristic of the distribution F, such as the mean, variance, or quantiles.

Often, F is unknown but assumed to belong to a broad class of distribution functions \mathcal{F}, subject only to mild restrictions such as continuity or the existence of specific moments.

A random sample X_1, …, X_n \stackrel{i.i.d}{\sim} F can be used to construct the empirical cumulative distribution function (ECDF) \hat{F}_n,

    \[ \hat{F}_{n}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i \leq x) \]

which assigns mass \frac{1}{n} to each X_i.

\hat{F}_{n} is a valid, discrete CDF which can be substituted for F to obtain \hat{\theta} = T(\hat{F}_n). These estimators are referred to as plug-in estimators for obvious reasons.
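For instance, the ECDF and a plug-in estimate take only a couple of lines of R; a minimal sketch, using simulated standard normal data purely for illustration:

# Minimal sketch: ECDF and a plug-in estimate of the mean
set.seed(1)
x <- rnorm(100)

# ecdf() returns a step function placing mass 1/n on each observation
F_hat <- ecdf(x)
F_hat(0)       # estimated P(X <= 0)

# The plug-in estimator of the mean is the expectation under F_hat, i.e. the sample mean
mean(x)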

For more details on statistical functionals and plug-in estimators, you can check out my blog post Plug-in estimators of statistical functionals!

Many statistical functionals take the form of an expectation with respect to F of a real-valued function \phi of a \leq n arguments,

    \[ \theta = T(F) = \mathbb{E}_{F}~ \phi(X_1, …, X_a) .\]

When \phi(x_1, …, x_a) is a function symmetric in its arguments, e.g. \phi(x_1, x_2) = \phi(x_2, x_1) for a = 2, it is referred to as a symmetric kernel of degree a. If \phi is not symmetric, a symmetric equivalent \phi^{*} can always be found,

    \[\phi^{*}(x_1, …, x_a) = \frac{1}{a!} \sum_{\pi ~\in~ \Pi} \phi(x_{\pi(1)}, …, x_{\pi(a)})\]

where \Pi represents the set of all permutations of the indices 1, …, a.
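For example, the asymmetric kernel \phi(x_1, x_2) = x_1^2 - x_1 x_2, whose expectation under i.i.d. sampling is \sigma^2, symmetrizes to the familiar variance kernel,

    \[\phi^{*}(x_1, x_2) = \frac{1}{2}\left[ (x_1^2 - x_1 x_2) + (x_2^2 - x_2 x_1) \right] = \frac{(x_1 - x_2)^2}{2}.\]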

A statistical functional \theta = T(F) belongs to a special family of expectation functionals when:

  1. T(F) = \mathbb{E}_F ~\phi(X_1, …, X_a), and
  2. \phi(X_1, …, X_a) is a symmetric kernel of degree a.

Plug-in estimators of expectation functionals are referred to as V-statistics and can be expressed explicitly as,

    \[V_n = \frac{1}{n^a} \sum_{i_1 = 1}^{n} … \sum_{i_a = 1}^{n} \phi(X_{i_1}, …, X_{i_a}) \]

so that V_n is the average of \phi evaluated at all n^a ordered a-tuples drawn with replacement from X_1, …, X_n. Since the same X_i can appear more than once within a summand, V_n is generally biased.
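As a concrete illustration, take the variance kernel \phi(x_1, x_2) = (x_1 - x_2)^2 / 2. A minimal sketch in R, averaging \phi over all n^2 ordered pairs by brute force:

# Minimal sketch: V-statistic for the variance kernel phi(x1, x2) = (x1 - x2)^2 / 2
set.seed(1)
x <- rnorm(20)

# Average phi over ALL n^2 ordered pairs -- indices are allowed to repeat
V_n <- mean(outer(x, x, function(a, b) (a - b)^2 / 2))
V_n
mean((x - mean(x))^2)    # identical: the plug-in (biased, divide-by-n) variance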

By restricting the summands to distinct indices only, an unbiased estimator known as a U-statistic arises. In fact, when the family of distributions \mathcal{F} is large enough, it can be shown that a U-statistic can always be constructed for expectation functionals.

Since \phi is symmetric, we can require that 1 \leq i_1 < ... < i_a \leq n, resulting in {n \choose a} ways of choosing a distinct indices from 1, ..., n. The U-statistic is then the average of \phi evaluated at all {n \choose a} distinct combinations of size a from X_1, ..., X_n,

    \[U_n = \frac{1}{{n \choose a}} \mathop{\sum … \sum} \limits_{1 \leq i_1 < ... < i_a \leq n} \phi(X_{i_1}, ..., X_{i_a}).\]

While i_j \neq i_k within each summand now, each X_i still appears in multiple summands, so U_n is a sum of correlated terms. As a result, the classical central limit theorem for i.i.d. sums cannot be relied upon directly to determine the limiting distribution of U_n.
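Continuing the variance-kernel example, the U-statistic averages \phi over the {n \choose 2} distinct pairs only; a minimal sketch in R:

# Minimal sketch: U-statistic for the variance kernel over distinct pairs only
set.seed(1)
x     <- rnorm(20)
pairs <- combn(x, 2)      # each column holds one pair (x_i, x_j) with i < j
U_n   <- mean((pairs[1, ] - pairs[2, ])^2 / 2)
U_n
var(x)                    # identical: the unbiased sample variance (divide by n - 1)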

For more details on expectation functionals and their estimators, you can check out my blog post U-, V-, and Dupree statistics!

This blog post provides a walk-through derivation of the limiting, or asymptotic, distribution of a single U-statistic U_n.

Continue reading Getting to know U: the asymptotic distribution of a single U-statistic

U-, V-, and Dupree statistics

To start, I apologize for this blog’s title but I couldn’t resist referencing the Owen Wilson classic You, Me, and Dupree – wow! The other gold-plated candidate was U-statistics and You. Please, please, hold your applause.


My previous blog post defined statistical functionals as any real-valued function of an unknown CDF, T(F), and explained how plug-in estimators could be constructed by substituting the empirical cumulative distribution function (ECDF) \hat{F}_{n} for the unknown CDF F. Plug-in estimators of the mean and variance were provided and used to demonstrate that plug-in estimators can be biased.

    \[ \hat{\mu} = \mathbb{E}_{\hat{F}_n}[X] = \sum_{i=1}^{n} X_i P(X = X_i) = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}_{n} \]

    \[ \hat{\sigma}^{2} = \mathbb{E}_{\hat{F}_{n}}[(X- \mathbb{E}_{\hat{F}_n}[X])^2] = \mathbb{E}_{\hat{F}_n}[(X - \bar{X}_{n})^2] = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_{n})^2. \]
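In R, these plug-in estimates are simply the sample mean and the divide-by-n variance; a minimal sketch contrasting the latter with the unbiased var():

# Minimal sketch: plug-in mean and variance vs. the unbiased sample variance
set.seed(1)
x <- rnorm(30)

mean(x)                  # plug-in estimate of the mean (unbiased)
mean((x - mean(x))^2)    # plug-in estimate of the variance (divides by n, biased)
var(x)                   # usual sample variance (divides by n - 1, unbiased)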

Statistical functionals that meet the following two criteria represent a special family of functionals known as expectation functionals:

1) T(F) is the expectation of a function g with respect to the distribution function F; and

    \[ T(F) = \mathbb{E}_{F} ~g(X)\]

2) the function g(\cdot) takes the form of a symmetric kernel.

Expectation functionals encompass many common parameters and are well-behaved. Plug-in estimators of expectation functionals, named V-statistics after von Mises, can be obtained but may be biased. It is, however, always possible to construct an unbiased estimator of expectation functionals regardless of the underlying distribution function F. These estimators are named U-statistics, with the “U” standing for unbiased.

This blog post provides 1) the definitions of symmetric kernels and expectation functionals; 2) an overview of plug-in estimators of expectation functionals or V-statistics; 3) an overview of unbiased estimators for expectation functionals or U-statistics.

Continue reading U-, V-, and Dupree statistics

Parametric vs. Nonparametric Approach to Estimations

Parametric statistics assume that the unknown CDF F belongs to a family of CDFs characterized by a parameter (vector) \theta. Since the form of F is assumed, the target of estimation is its parameters \theta, and all uncertainty about F reduces to uncertainty about its parameters. The parameters are estimated by \hat{\theta}, and the estimates are then substituted into the assumed distribution to conduct inference for the quantities of interest. If the assumed form of F is incorrect, inference may be inaccurate, or trends in the data may be missed.

To demonstrate the parametric approach, consider n = 100 independent and identically distributed random variables X_1, …, X_n generated from an exponential distribution with rate \lambda = 2. Investigators wish to estimate the 75^{th} percentile and erroneously assume that their data are normally distributed. Thus, F is assumed to be the normal CDF, but \mu and \sigma^2 are unknown. The parameters \mu and \sigma are estimated in the usual way by \bar{x} and s, respectively. Since the normal distribution belongs to the location-scale family, an estimate of the p^{th} percentile is provided by,

    \[x_p = \bar{x} + s\Phi^{-1}(p)\]

where \Phi^{-1} is the standard normal quantile function, also known as the probit.

set.seed(12345)
library(tidyverse, quietly = T)
# Generate data from Exp(2)
x <- rexp(n = 100, rate = 2)

# True value of 75th percentile with rate = 2
true <- qexp(p = 0.75, rate = 2) 
true
## [1] 0.6931472
# Estimate mu and sigma
xbar <- mean(x)
s    <- sd(x)

# Estimate 75th percentile assuming mu = xbar and sigma = s
param_est <- xbar + s * qnorm(p = 0.75)
param_est
## [1] 0.8792925

The true value of the 75^{th} percentile of \text{Exp}(2) is 0.69 while the parametric estimate is 0.88.

Nonparametric statistics make fewer assumptions about the unknown distribution F, requiring only mild conditions such as continuity or the existence of specific moments. Instead of estimating parameters of F, F itself is the target of estimation. F is commonly estimated by the empirical cumulative distribution function (ECDF) \hat{F},

    \[\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i \leq x).\]

Any statistic that can be expressed as a function of the CDF, known as a statistical functional and denoted \theta = T(F), can be estimated by substituting \hat{F} for F. That is, plug-in estimators can be obtained as \hat{\theta} = T(\hat{F}).
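Returning to the exponential example above, a nonparametric plug-in estimate of the 75^{th} percentile can be obtained by inverting \hat{F} directly, which is what quantile() with type = 1 computes; a minimal sketch reusing the simulated x:

# Minimal sketch: nonparametric plug-in estimate of the 75th percentile
# type = 1 inverts the ECDF, i.e. uses F-hat in place of F
nonparam_est <- quantile(x, probs = 0.75, type = 1)
nonparam_est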

Continue reading Parametric vs. Nonparametric Approach to Estimations

The Probabilistic Index for Two Normally Distributed Outcomes

Consider a two-armed study comparing a placebo and treatment. In general, the probabilistic index (PI) is defined as,

    \[\text{PI} = P(X < Y) + \frac{1}{2} P(X = Y)\]

and is interpreted as the probability that a subject in the treatment group will have an increased response compared to a subject in the placebo group. The probabilistic index is a particularly useful effect measure for ordinal data, where effects can be difficult to define and interpret owing to the absence of a meaningful notion of difference between categories. However, it can also be used for continuous data, noting that when the outcome is continuous, P(X = Y) = 0 and the PI reduces to P(X < Y).

PI = 0.5 suggests an increased outcome is equally likely for subjects in the placebo and treatment group, while PI > 0.5 suggests an increased outcome is more likely for subjects in the treatment group compared to the placebo group, and the opposite is true when PI < 0.5.

Simulation

Suppose X \sim N(\mu_X, \sigma^{2}_{X}) and Y \sim N(\mu_Y, \sigma^{2}_{Y}) represent the independent outcomes in the placebo and treatment groups, respectively, and that an increased value of the outcome is the desired response.
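Since X and Y are independent, Y - X \sim N(\mu_Y - \mu_X, \sigma^{2}_{X} + \sigma^{2}_{Y}) and the probabilistic index has a convenient closed form in this setting,

    \[\text{PI} = P(X < Y) = \Phi\left( \frac{\mu_Y - \mu_X}{\sqrt{\sigma^{2}_{X} + \sigma^{2}_{Y}}} \right)\]

where \Phi denotes the standard normal CDF.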

We simulate n_X = n_Y = 50 observations from each group such that treatment truly increases the outcome and the within-group variances are equal, \sigma^{2}_{X} = \sigma^{2}_{Y}.

# Loading required libraries
library(tidyverse)
library(gridExtra)

# Setting seed for reproducibility
set.seed(12345)
# Simulating data
n_X = n_Y = 50
sigma_X = sigma_Y = 1
mu_X = 5; mu_Y = 7

outcome_X = rnorm(n = n_X, mean = mu_X, sd = sigma_X)
outcome_Y = rnorm(n = n_Y, mean = mu_Y, sd = sigma_Y)

df <- data.frame(Group = c(rep('Placebo', n_X), rep('Treatment', n_Y)),
                 Outcome = c(outcome_X, outcome_Y))

Examining side-by-side histograms and boxplots of the outcomes within each group, there appears to be strong evidence that treatment increases the outcome as desired. Thus, we would expect a probabilistic index close to 1 as most outcomes in the treatment group appear larger than those of the placebo group.

# Histogram by group
hist_p <- df %>%
  ggplot(aes(x = Outcome, fill = Group)) +
    geom_histogram(position = 'identity', alpha = 0.75, bins = 10) + 
    theme_bw() +
    labs(y = 'Frequency')

# Boxplot by group
box_p <- df %>%
  ggplot(aes(x = Outcome, fill = Group)) +
    geom_boxplot() + 
    theme_bw() +
    labs(y = NULL)  # the boxplot's y-axis carries no meaningful scale here

# Combine plots
grid.arrange(hist_p, box_p, nrow = 2)

(Figure: side-by-side histograms and boxplots of Outcome by Group.)
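As a quick numerical check, the empirical probabilistic index can be computed by comparing every placebo outcome with every treatment outcome; a minimal sketch (ties are ignored since the outcomes are continuous):

# Minimal sketch: empirical probabilistic index from the simulated data
# Proportion of all (placebo, treatment) pairs where the treatment outcome is larger
PI_hat <- mean(outer(outcome_X, outcome_Y, FUN = "<"))
PI_hat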

Continue reading The Probabilistic Index for Two Normally Distributed Outcomes