# Getting to know U: the asymptotic distribution of a single U-statistic

After my last grand slam title, "U-, V-, and Dupree statistics", I was really feeling the pressure to keep my title game strong. Thank you to my wonderful friend Steve Lee for suggesting this beautiful title.

## Overview

A statistical functional is any real-valued function $T$ of a distribution function $F$ such that

$$\theta = T(F),$$

and represents a characteristic of the distribution $F$. Examples include the mean, variance, and quantiles.

Often $F$ is unknown but is assumed to belong to a broad class of distribution functions $\mathcal{F}$ subject only to mild restrictions such as continuity or existence of specific moments.

A random sample $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} F$ can be used to construct the empirical cumulative distribution function (ECDF) $\hat{F}_n$,

$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(X_i \leq x),$$

which assigns mass $\frac{1}{n}$ to each $X_i$.

$\hat{F}_n$ is a valid, discrete CDF which can be substituted for $F$ to obtain $\hat{\theta} = T(\hat{F}_n)$. These estimators are referred to as plug-in estimators for obvious reasons.
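To make the plug-in idea concrete, here is a minimal Python sketch. The helper name `ecdf` and the four data values are my own illustrative choices, not part of the original derivation. It builds the ECDF and then computes plug-in estimates of the mean and variance by weighting each observation by 1/n.

```python
import numpy as np

def ecdf(sample):
    """Return a function x -> F_hat_n(x), the share of observations <= x."""
    data = np.sort(np.asarray(sample, dtype=float))
    n = data.size

    def F_hat(x):
        # side="right" counts how many sorted observations are <= x
        return np.searchsorted(data, x, side="right") / n

    return F_hat

sample = np.array([2.0, 4.0, 4.0, 7.0])  # illustrative data (assumption)
F_hat = ecdf(sample)

# Plug-in estimators replace F by F_hat_n, i.e. weight each X_i by 1/n.
plug_in_mean = np.sum(sample) / sample.size
plug_in_var = np.sum((sample - plug_in_mean) ** 2) / sample.size
```

Note that the plug-in variance divides by n rather than n - 1, so it coincides with `np.var(sample)` under NumPy's default `ddof=0`.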

For more details on statistical functionals and plug-in estimators, you can check out my blog post Plug-in estimators of statistical functionals!

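Many statistical functionals take the form of an expectation of a real-valued function $\phi$ with respect to $F$ such that for $a \leq n$,

$$\theta = T(F) = \mathbb{E}_F\, \phi(X_1, \ldots, X_a).$$

When $\phi(x_1, \ldots, x_a)$ is a function symmetric in its arguments such that, e.g., $\phi(x_1, x_2) = \phi(x_2, x_1)$, it is referred to as a symmetric kernel of degree $a$. If $\phi$ is not symmetric, a symmetric equivalent $\phi^{*}$ can always be found,

$$\phi^{*}(x_1, \ldots, x_a) = \frac{1}{a!} \sum_{\pi \in \Pi} \phi(x_{\pi(1)}, \ldots, x_{\pi(a)}),$$

where $\Pi$ represents the set of all permutations of the indices $1, \ldots, a$.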
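The symmetrization step can be sketched in a few lines of Python. The function `symmetrize` and the asymmetric kernel below are hypothetical examples of mine. The kernel is chosen so that its expectation is the variance, and averaging it over both argument orders recovers the familiar symmetric kernel, half the squared difference.

```python
from itertools import permutations
from math import factorial

def symmetrize(phi, a):
    """Average phi over all a! orderings of its a arguments."""
    def phi_star(*xs):
        if len(xs) != a:
            raise ValueError(f"expected {a} arguments, got {len(xs)}")
        total = sum(phi(*perm) for perm in permutations(xs))
        return total / factorial(a)
    return phi_star

# Hypothetical asymmetric kernel: E[phi] = E[X^2] - (E[X])^2 = Var(X)
def phi(x1, x2):
    return x1 ** 2 - x1 * x2

# Symmetric equivalent; algebraically equal to (x1 - x2)^2 / 2
phi_star = symmetrize(phi, 2)
```

Here `phi(1.0, 3.0)` and `phi(3.0, 1.0)` differ, while `phi_star` returns the same value for both orderings.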

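A statistical functional $\theta = T(F)$ belongs to a special family of expectation functionals when:

1. $T(F) = \mathbb{E}_F\, \phi(X_1, \ldots, X_a)$, and
2. $\phi(x_1, \ldots, x_a)$ is a symmetric kernel of degree $a$.

Plug-in estimators of expectation functionals are referred to as V-statistics and can be expressed explicitly as,

$$V_n = \frac{1}{n^a} \sum_{i_1=1}^{n} \cdots \sum_{i_a=1}^{n} \phi(X_{i_1}, \ldots, X_{i_a}),$$

so that $V_n$ is the average of $\phi$ evaluated at all $n^a$ ordered selections of size $a$, with repetition allowed, from $X_1, \ldots, X_n$. Since the same $X_i$ can appear more than once within a summand, $V_n$ is generally biased.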

By restricting the summands to distinct indices only, an unbiased estimator known as a U-statistic arises. In fact, when the family of distributions $\mathcal{F}$ is large enough, it can be shown that a U-statistic can always be constructed for expectation functionals.

Since $\phi$ is symmetric, we can require that $1 \leq i_1 < \cdots < i_a \leq n$, resulting in $\binom{n}{a}$ combinations of the subscripts $1, \ldots, n$. The U-statistic is then the average of $\phi$ evaluated at all $\binom{n}{a}$ distinct combinations of $X_1, \ldots, X_n$,

$$U_n = \binom{n}{a}^{-1} \sum_{1 \leq i_1 < \cdots < i_a \leq n} \phi(X_{i_1}, \ldots, X_{i_a}).$$

While $i_j \neq i_k$ within each summand now, each $X_i$ still appears in multiple summands, suggesting that $U_n$ is the sum of correlated terms. As a result, the central limit theorem cannot be applied directly to determine the limiting distribution of $U_n$.
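The difference between the two estimators is easy to see numerically. The following sketch assumes the degree-2 variance kernel (half the squared difference) and a small made-up sample. The function names are mine. It enumerates index tuples by brute force, which is only sensible for tiny n.

```python
from itertools import combinations, product
import numpy as np

def phi(x1, x2):
    """Symmetric kernel of degree 2 whose expectation is Var(X)."""
    return 0.5 * (x1 - x2) ** 2

def v_statistic(sample, kernel, a):
    # All n^a index tuples, repeats allowed (positions, not values, are paired)
    vals = [kernel(*tup) for tup in product(sample, repeat=a)]
    return sum(vals) / len(vals)

def u_statistic(sample, kernel, a):
    # All C(n, a) subsets of distinct positions
    vals = [kernel(*tup) for tup in combinations(sample, a)]
    return sum(vals) / len(vals)

sample = [2.0, 4.0, 4.0, 7.0]  # illustrative data (assumption)
v_n = v_statistic(sample, phi, 2)  # matches np.var(sample, ddof=0)
u_n = u_statistic(sample, phi, 2)  # matches np.var(sample, ddof=1)
```

For this kernel the V-statistic is the biased variance estimator (divide by n) and the U-statistic is the usual unbiased one (divide by n - 1).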

For more details on expectation functionals and their estimators, you can check out my blog post U-, V-, and Dupree statistics!

This blog post provides a walk-through derivation of the limiting, or asymptotic, distribution of a single U-statistic $U_n$.

## Variance derivation

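We must start by assuming the form of the family of distributions $\mathcal{F}$ to which $F$ can belong. It suffices that $\mathcal{F}$ is the family of all distributions for which,

$$\mathbb{E}_F\, \phi^2(X_1, \ldots, X_a) < \infty,$$

such that $\operatorname{Var}_F\, \phi(X_1, \ldots, X_a)$ exists. Then, since $U_n = \binom{n}{a}^{-1} \sum_{1 \leq i_1 < \cdots < i_a \leq n} \phi(X_{i_1}, \ldots, X_{i_a})$, we have

$$\operatorname{Var}\, U_n = \operatorname{Var}\left[ \binom{n}{a}^{-1} \sum_{1 \leq i_1 < \cdots < i_a \leq n} \phi(X_{i_1}, \ldots, X_{i_a}) \right].$$

Equivalently, let $\mathcal{B}$ represent the set of all $\binom{n}{a}$ combinations $\beta = \{\beta_1, \ldots, \beta_a\}$ of the subscripts $1, \ldots, n$, then

$$\operatorname{Var}\, U_n = \binom{n}{a}^{-2} \operatorname{Var}\left[ \sum_{\beta \in \mathcal{B}} \phi(X_{\beta_1}, \ldots, X_{\beta_a}) \right].$$

Recalling that

$$\operatorname{Var}\left[ \sum_{i} Y_i \right] = \sum_{i} \sum_{j} \operatorname{Cov}[Y_i, Y_j],$$

we can re-express $\operatorname{Var}\, U_n$ as,

$$\operatorname{Var}\, U_n = \binom{n}{a}^{-2} \sum_{\beta \in \mathcal{B}} \sum_{\gamma \in \mathcal{B}} \operatorname{Cov}\left[ \phi(X_{\beta_1}, \ldots, X_{\beta_a}),\ \phi(X_{\gamma_1}, \ldots, X_{\gamma_a}) \right].$$

Let’s focus attention on a single summand,

$$\operatorname{Cov}\left[ \phi(X_{\beta_1}, \ldots, X_{\beta_a}),\ \phi(X_{\gamma_1}, \ldots, X_{\gamma_a}) \right].$$

Let $c$ represent the number of common elements between $\beta$ and $\gamma$. Then $0 \leq c \leq a$, as $\beta$ and $\gamma$ can be identical, such that they share all elements, or completely distinct, such that they share no elements, or anything in between.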

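Recall that

$$\operatorname{Cov}[X, Y] = \mathbb{E}\left[ (X - \mathbb{E}X)(Y - \mathbb{E}Y) \right] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y],$$

and by definition,

$$\mathbb{E}_F\, \phi(X_1, \ldots, X_a) = \theta.$$

Now, to simplify notation, let $X_\beta$ represent the elements $X_{\beta_1}, \ldots, X_{\beta_a}$ and $X_\gamma$ represent the elements $X_{\gamma_1}, \ldots, X_{\gamma_a}$. With $c$ common elements,

$$\operatorname{Cov}\left[ \phi(X_\beta), \phi(X_\gamma) \right] = \mathbb{E}\left[ (\phi(X_\beta) - \theta)(\phi(X_\gamma) - \theta) \right] = \mathbb{E}\left[ \phi(X_\beta)\, \phi(X_\gamma) \right] - \theta^2.$$

To simplify this statement, we need to combine some clever conditioning with the all-powerful law of iterated expectation (a.k.a. the law of total expectation). Conditioning on the $c$ common elements will make the two terms within the expectation independent, as they share no other elements.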

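For $0 \leq c \leq a$, define

$$\phi_c(x_1, \ldots, x_c) = \mathbb{E}_F\, \phi(x_1, \ldots, x_c, X_{c+1}, \ldots, X_a) = \mathbb{E}\left[ \phi(X_1, \ldots, X_a) \mid X_1 = x_1, \ldots, X_c = x_c \right].$$

Then, applying the law of iterated expectation,

$$\mathbb{E}\, \phi_c(X_1, \ldots, X_c) = \mathbb{E}\left[ \mathbb{E}\left[ \phi(X_1, \ldots, X_a) \mid X_1, \ldots, X_c \right] \right] = \mathbb{E}\, \phi(X_1, \ldots, X_a) = \theta.$$

We also define $\sigma_c^2 = \operatorname{Var}\, \phi_c(X_1, \ldots, X_c)$.

Note that when $c = 0$,

$$\phi_0 = \mathbb{E}_F\, \phi(X_1, \ldots, X_a) = \theta,$$

such that $\sigma_0^2 = 0$, and when $c = a$,

$$\phi_a(x_1, \ldots, x_a) = \phi(x_1, \ldots, x_a),$$

such that $\sigma_a^2 = \operatorname{Var}\, \phi(X_1, \ldots, X_a)$.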

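Let $S = \beta \cap \gamma$ denote the $c$ indices shared by $\beta$ and $\gamma$, and $X_S$ the corresponding observations. Then, the law of iterated expectation tells us,

$$\mathbb{E}\left[ \phi(X_\beta)\, \phi(X_\gamma) \right] = \mathbb{E}\left[ \mathbb{E}\left[ \phi(X_\beta)\, \phi(X_\gamma) \mid X_S \right] \right].$$

Since $\phi(X_\beta)$ and $\phi(X_\gamma)$ become independent when we condition on $X_S$, we have

$$\mathbb{E}\left[ \phi(X_\beta)\, \phi(X_\gamma) \right] = \mathbb{E}\left[ \mathbb{E}\left[ \phi(X_\beta) \mid X_S \right] \mathbb{E}\left[ \phi(X_\gamma) \mid X_S \right] \right] = \mathbb{E}\left[ \phi_c^2(X_1, \ldots, X_c) \right],$$

where the last equality uses the symmetry of $\phi$ and the fact that the observations are i.i.d., and since $\theta$ is just a constant,

$$\operatorname{Cov}\left[ \phi(X_\beta), \phi(X_\gamma) \right] = \mathbb{E}\left[ \phi_c^2(X_1, \ldots, X_c) \right] - \theta^2.$$

Noting that $\mathbb{E}\, \phi_c(X_1, \ldots, X_c) = \theta$,

$$\operatorname{Cov}\left[ \phi(X_\beta), \phi(X_\gamma) \right] = \mathbb{E}\left[ \phi_c^2(X_1, \ldots, X_c) \right] - \left( \mathbb{E}\, \phi_c(X_1, \ldots, X_c) \right)^2 = \operatorname{Var}\, \phi_c(X_1, \ldots, X_c) = \sigma_c^2.$$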

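How many pairs $\beta$ and $\gamma$ have $c$ elements in common?

1. There are $\binom{n}{a}$ ways to select the $a$ indices of $\beta$.
2. Then, there are $\binom{a}{c}$ ways of selecting the $c$ elements from $\beta$ shared by $\gamma$.
3. Finally, there are $\binom{n-a}{a-c}$ ways of selecting the remaining $a - c$ elements of $\gamma$ from the $n - a$ remaining elements available.

Combining all of these components together, we obtain

$$\operatorname{Var}\, U_n = \binom{n}{a}^{-2} \sum_{c=0}^{a} \binom{n}{a} \binom{a}{c} \binom{n-a}{a-c}\, \sigma_c^2,$$

but since $\sigma_0^2 = 0$, simplifying yields,

$$\operatorname{Var}\, U_n = \binom{n}{a}^{-1} \sum_{c=1}^{a} \binom{a}{c} \binom{n-a}{a-c}\, \sigma_c^2.$$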
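The exact finite-sample variance is simple to evaluate once the component variances are known. Below is a small sketch. The function names are mine, and the component variances used for the variance kernel (0.5 and 2 for standard normal data) are values I worked out for illustration rather than results stated in this post. It also checks the counting argument: summing the number of partners over c = 0, ..., a must recover all combinations.

```python
from math import comb

def u_stat_variance(n, a, sigma2):
    """Exact Var(U_n); sigma2[c - 1] holds sigma_c^2 for c = 1, ..., a."""
    if len(sigma2) != a:
        raise ValueError("need one variance component per c = 1, ..., a")
    if n < a:
        raise ValueError("need n >= a")
    total = sum(
        comb(a, c) * comb(n - a, a - c) * sigma2[c - 1]
        for c in range(1, a + 1)
    )
    return total / comb(n, a)

def overlap_counts_cover_all_pairs(n, a):
    """Each beta has C(a, c) * C(n - a, a - c) partners sharing c elements."""
    partners = sum(comb(a, c) * comb(n - a, a - c) for c in range(a + 1))
    return partners == comb(n, a)

# Sample mean: a = 1, phi(x) = x, sigma_1^2 = Var(X) = 4 gives Var(X) / n
var_mean = u_stat_variance(n=10, a=1, sigma2=[4.0])

# Variance kernel (x1 - x2)^2 / 2 with N(0, 1) data (assumed values):
# sigma_1^2 = 0.5 and sigma_2^2 = 2, which gives 2 / (n - 1)
var_s2 = u_stat_variance(n=10, a=2, sigma2=[0.5, 2.0])
```

With n = 10 these reduce to the textbook results: 4/10 for the sample mean and 2/9 for the unbiased sample variance of standard normal data.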

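Note that for large $n$,

$$\binom{n}{a} \approx \frac{n^a}{a!} \quad \text{and} \quad \binom{n-a}{a-c} \approx \frac{(n-a)^{a-c}}{(a-c)!}.$$

Then, we can rewrite $\operatorname{Var}\, U_n$ as,

$$\operatorname{Var}\, U_n \approx \sum_{c=1}^{a} \frac{(a!)^2}{c!\, [(a-c)!]^2}\, \frac{(n-a)^{a-c}}{n^a}\, \sigma_c^2.$$

Expanding out the first few terms,

$$\operatorname{Var}\, U_n \approx a^2\, \frac{(n-a)^{a-1}}{n^a}\, \sigma_1^2 + \frac{a^2 (a-1)^2}{2}\, \frac{(n-a)^{a-2}}{n^a}\, \sigma_2^2 + \cdots.$$

Since $\frac{(n-a)^{a-1}}{n^a} \approx \frac{1}{n}$ and $\frac{(n-a)^{a-2}}{n^a} = O\left( \frac{1}{n^2} \right)$, simplifying yields

$$\operatorname{Var}\, U_n = \frac{a^2}{n}\, \sigma_1^2 + O\left( \frac{1}{n^2} \right).$$

The first term of the variance dominates and we have,

$$\operatorname{Var}\, U_n \approx \frac{a^2 \sigma_1^2}{n},$$

where

$$\sigma_1^2 = \operatorname{Var}\, \phi_1(X_1) = \operatorname{Var}\left[ \mathbb{E}\left[ \phi(X_1, \ldots, X_a) \mid X_1 \right] \right].$$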
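As a worked instance of this approximation, assume the degree-2 variance kernel $\phi(x_1, x_2) = (x_1 - x_2)^2 / 2$ (my choice for illustration), and write $\mu$, $\sigma^2$, and $\mu_4$ for the mean, variance, and fourth central moment of $F$, all assumed finite. Then,

$$\begin{aligned}
\phi_1(x_1) &= \mathbb{E}\left[ \tfrac{1}{2}(x_1 - X_2)^2 \right] = \frac{(x_1 - \mu)^2 + \sigma^2}{2}, \\
\sigma_1^2 &= \operatorname{Var}\, \phi_1(X_1) = \frac{\operatorname{Var}\left[ (X_1 - \mu)^2 \right]}{4} = \frac{\mu_4 - \sigma^4}{4}, \\
\operatorname{Var}\, U_n &\approx \frac{a^2 \sigma_1^2}{n} = \frac{\mu_4 - \sigma^4}{n}.
\end{aligned}$$

For standard normal data, $\mu_4 = 3$ and $\sigma^2 = 1$, so the approximation gives $2/n$, which is close to the exact variance $2/(n-1)$ of the unbiased sample variance.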

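So, now we know that $\mathbb{E}\, U_n = \theta$ and $\operatorname{Var}\, U_n \approx \frac{a^2 \sigma_1^2}{n}$, but we want to know if

$$\sqrt{n}\,(U_n - \theta) \xrightarrow{d} N(0,\ a^2 \sigma_1^2).$$

Again, we cannot use the central limit theorem (CLT) to show that $U_n$ is asymptotically normal since it is the sum of dependent terms. However, we can:

1. Construct a surrogate $U_n^{*}$ for which the CLT does apply, and
2. Demonstrate that $\sqrt{n}\,(U_n - \theta)$ converges to the same distribution as $\sqrt{n}\, U_n^{*}$.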

## Selection and distribution of a surrogate

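What should we choose as the surrogate? Well, we want something that will have the same mean and variance as $U_n - \theta$. The variance only involves $\sigma_1^2$, which suggests conditioning on a single $X_i$. Thus, as a surrogate, let’s consider

$$U_n^{*} = \sum_{i=1}^{n} \mathbb{E}\left[ U_n - \theta \mid X_i \right].$$

Let’s start by expanding the summand,

$$\mathbb{E}\left[ U_n - \theta \mid X_i \right] = \binom{n}{a}^{-1} \sum_{\beta \in \mathcal{B}} \mathbb{E}\left[ \phi(X_\beta) - \theta \mid X_i \right].$$

If $i \notin \beta$ then $c = 0$ and,

$$\mathbb{E}\left[ \phi(X_\beta) - \theta \mid X_i \right] = \phi_0 - \theta = 0,$$

else if $i \in \beta$ then $c = 1$ and,

$$\mathbb{E}\left[ \phi(X_\beta) - \theta \mid X_i \right] = \phi_1(X_i) - \theta.$$

Of the $\binom{n}{a}$ terms, how many include $i$? If $i \in \beta$, there are $n - 1$ possible choices for the remaining subscripts, of which we require $a - 1$. Thus we have,

$$\mathbb{E}\left[ U_n - \theta \mid X_i \right] = \frac{\binom{n-1}{a-1}}{\binom{n}{a}} \left( \phi_1(X_i) - \theta \right) = \frac{a}{n} \left( \phi_1(X_i) - \theta \right), \quad \text{so that} \quad U_n^{*} = \frac{a}{n} \sum_{i=1}^{n} \left( \phi_1(X_i) - \theta \right).$$

The $\phi_1(X_i)$ are independent and identically distributed with mean $\theta$ and variance $\sigma_1^2$. The expectation of $U_n^{*}$ with respect to $F$ is,

$$\mathbb{E}\, U_n^{*} = \frac{a}{n} \sum_{i=1}^{n} \left( \mathbb{E}\, \phi_1(X_i) - \theta \right) = 0,$$

and the corresponding variance is,

$$\operatorname{Var}\, U_n^{*} = \frac{a^2}{n^2} \sum_{i=1}^{n} \operatorname{Var}\, \phi_1(X_i) = \frac{a^2 \sigma_1^2}{n}.$$

Finally, $U_n^{*}$ is the sum of independent and identically distributed random variables and so the central limit theorem tells us,

$$\sqrt{n}\, U_n^{*} \xrightarrow{d} N(0,\ a^2 \sigma_1^2).$$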
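A quick Monte Carlo sketch can make the surrogate tangible. It assumes standard normal data and the variance kernel (half the squared difference), for which the degree is 2, the target is 1, the conditional kernel works out to (x^2 + 1)/2, and the first variance component is 0.5. These values, the seed, and the simulation sizes are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2024)
n, reps = 200, 20_000
a, theta, sigma1_sq = 2, 1.0, 0.5  # variance kernel under N(0, 1) (assumption)

# One simulated sample of size n per row
x = rng.standard_normal((reps, n))

# phi_1(x) = E[(x - X_2)^2 / 2] = (x^2 + 1) / 2 when X_2 ~ N(0, 1)
phi_1 = 0.5 * (x ** 2 + 1.0)

# Surrogate U_n^* = (a / n) * sum_i (phi_1(X_i) - theta), one value per row
u_star = (a / n) * np.sum(phi_1 - theta, axis=1)

scaled = np.sqrt(n) * u_star
mc_mean = scaled.mean()  # should be near 0
mc_var = scaled.var()    # should be near a^2 * sigma_1^2 = 2
```

Because the surrogate is a plain average of i.i.d. terms, its scaled variance matches the target exactly in expectation, and the simulated values differ from 0 and 2 only by Monte Carlo noise.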

## Convergence to surrogate

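We can express our quantity of interest in terms of our surrogate as,

$$\sqrt{n}\,(U_n - \theta) = \sqrt{n}\, U_n^{*} + \sqrt{n}\,(U_n - \theta - U_n^{*}).$$

Thus if we can show that,

$$\sqrt{n}\,(U_n - \theta - U_n^{*}) \xrightarrow{P} 0,$$

we have our desired result, by Slutsky's theorem! To prove this, it is sufficient to demonstrate convergence in quadratic mean,

$$\mathbb{E}\left[ \left( \sqrt{n}\,(U_n - \theta - U_n^{*}) \right)^2 \right] \to 0.$$

So expanding the quadratic mean, we have

$$\mathbb{E}\left[ \left( \sqrt{n}\,(U_n - \theta - U_n^{*}) \right)^2 \right] = n\, \mathbb{E}\left[ (U_n - \theta)^2 \right] - 2n\, \mathbb{E}\left[ (U_n - \theta)\, U_n^{*} \right] + n\, \mathbb{E}\left[ (U_n^{*})^2 \right].$$

Since $\mathbb{E}\, U_n = \theta$, $\mathbb{E}\, U_n^{*} = 0$, and $\theta$ is a constant, this simplifies to

$$\mathbb{E}\left[ \left( \sqrt{n}\,(U_n - \theta - U_n^{*}) \right)^2 \right] = n \operatorname{Var}\, U_n - 2n \operatorname{Cov}\left[ U_n, U_n^{*} \right] + n \operatorname{Var}\, U_n^{*}.$$

We know $\operatorname{Var}\, U_n$ and $\operatorname{Var}\, U_n^{*}$ from our earlier work, both of which equal $\frac{a^2 \sigma_1^2}{n}$, exactly for $U_n^{*}$ and to leading order for $U_n$. We can recycle our tricks from earlier to figure out $\operatorname{Cov}\left[ U_n, U_n^{*} \right]$.

Expanding the covariance yields,

$$\operatorname{Cov}\left[ U_n, U_n^{*} \right] = \mathbb{E}\left[ (U_n - \theta)\, U_n^{*} \right] = \frac{a}{n} \sum_{i=1}^{n} \mathbb{E}\left[ (U_n - \theta)\left( \phi_1(X_i) - \theta \right) \right].$$

Conditioning on $X_i$ and applying the law of iterated expectation,

$$\operatorname{Cov}\left[ U_n, U_n^{*} \right] = \frac{a}{n} \sum_{i=1}^{n} \mathbb{E}\left[ \mathbb{E}\left[ U_n - \theta \mid X_i \right] \left( \phi_1(X_i) - \theta \right) \right].$$

Plugging in our previous results yields,

$$\operatorname{Cov}\left[ U_n, U_n^{*} \right] = \frac{a}{n} \sum_{i=1}^{n} \mathbb{E}\left[ \frac{a}{n} \left( \phi_1(X_i) - \theta \right)^2 \right] = \frac{a^2}{n^2} \cdot n\, \sigma_1^2 = \frac{a^2 \sigma_1^2}{n}.$$

Finally, for large $n$, we have

$$\mathbb{E}\left[ \left( \sqrt{n}\,(U_n - \theta - U_n^{*}) \right)^2 \right] \approx a^2 \sigma_1^2 - 2 a^2 \sigma_1^2 + a^2 \sigma_1^2 = 0.$$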
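The vanishing remainder can also be checked by simulation. This sketch again assumes standard normal data and the variance kernel, so that the U-statistic is the unbiased sample variance (`ddof=1` in NumPy) and the surrogate uses the conditional kernel (x^2 + 1)/2. The helper name, seed, and sample sizes are illustrative choices of mine. In this special case the mean squared scaled remainder works out to 2/(n - 1), so it should shrink by roughly a factor of ten each time n grows tenfold.

```python
import numpy as np

rng = np.random.default_rng(7)
reps, a, theta = 5_000, 2, 1.0  # variance kernel under N(0, 1) (assumption)

def scaled_remainder_msq(n):
    """Monte Carlo estimate of E[(sqrt(n) * (U_n - theta - U_n^*))^2]."""
    x = rng.standard_normal((reps, n))
    # U-statistic for the kernel (x1 - x2)^2 / 2 is the unbiased sample variance
    u_n = x.var(axis=1, ddof=1)
    # Surrogate built from phi_1(x) = (x^2 + 1) / 2
    u_star = (a / n) * np.sum(0.5 * (x ** 2 + 1.0) - theta, axis=1)
    remainder = np.sqrt(n) * (u_n - theta - u_star)
    return np.mean(remainder ** 2)

results = {n: scaled_remainder_msq(n) for n in (10, 100, 1000)}
```

The entries of `results` decrease toward zero as n grows, which is the quadratic-mean convergence used in the argument above.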

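Huzzah! Finally we have shown that a single U-statistic is asymptotically normally distributed with mean $\theta$ and variance $\frac{a^2 \sigma_1^2}{n}$,

$$\sqrt{n}\,(U_n - \theta) \xrightarrow{d} N(0,\ a^2 \sigma_1^2).$$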