After my last grand slam title, *U-, V-, and Dupree statistics*, I was really feeling the pressure to keep my title game strong. Thank you to my wonderful friend Steve Lee for suggesting this beautiful title.
Overview
A statistical functional is any real-valued function of a distribution function $F$ such that

$$\theta = T(F),$$

and represents some characteristic of the distribution $F$; examples include the mean, variance, and quantiles. Often, $F$ is unknown but is assumed to belong to a broad class of distribution functions $\mathcal{F}$, subject only to mild restrictions such as continuity or existence of specific moments.

A random sample $X_1, \ldots, X_n \overset{iid}{\sim} F$ can be used to construct the empirical cumulative distribution function (ECDF) $\hat{F}_n$,

$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i \leq x),$$

which assigns mass $1/n$ to each $X_i$. $\hat{F}_n$ is a valid, discrete CDF which can be substituted for $F$ to obtain $\hat{\theta} = T(\hat{F}_n)$. These estimators are referred to as plug-in estimators for obvious reasons.
For more details on statistical functionals and plug-in estimators, you can check out my blog post Plug-in estimators of statistical functionals!
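To make the plug-in idea concrete, here is a minimal R sketch (my own illustration, not from the original post) using the mean functional as an example:

```r
set.seed(42)

# Draw a random sample from some unknown F; here, Exponential(1)
x <- rexp(100, rate = 1)

# The ECDF assigns mass 1/n to each X_i
F_hat <- ecdf(x)
F_hat(1)  # plug-in estimate of F(1) = P(X <= 1)

# Plug-in estimate of the mean functional T(F) = E(X):
# integrating x with respect to F_hat is just the sample average
mean(x)
```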
Many statistical functionals take the form of an expectation of a real-valued function $\phi$ with respect to $F$ such that for $a \leq n$,

$$T(F) = \mathbb{E}_F\, \phi(X_1, \ldots, X_a).$$

When $\phi(x_1, \ldots, x_a)$ is a function symmetric in its arguments such that, e.g., $\phi(x_1, x_2) = \phi(x_2, x_1)$, it is referred to as a symmetric kernel of degree $a$. If $\phi$ is not symmetric, a symmetric equivalent $\phi^*$ can always be found,

$$\phi^*(x_1, \ldots, x_a) = \frac{1}{a!} \sum_{\pi \,\in\, \Pi} \phi(x_{\pi(1)}, \ldots, x_{\pi(a)}),$$

where $\Pi$ represents the set of all permutations of the indices $(1, \ldots, a)$.
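As a concrete illustration of symmetrization (my own example, using the variance functional): the kernel $\phi(x_1, x_2) = x_1^2 - x_1 x_2$ satisfies $\mathbb{E}_F\, \phi(X_1, X_2) = \text{Var}(X)$ but is not symmetric. Averaging over both orderings of its arguments yields the familiar symmetric variance kernel:

```latex
% Symmetrizing phi(x1, x2) = x1^2 - x1*x2, a kernel of degree a = 2:
\phi^*(x_1, x_2)
  = \frac{1}{2!} \left[ \phi(x_1, x_2) + \phi(x_2, x_1) \right]
  = \frac{x_1^2 + x_2^2}{2} - x_1 x_2
  = \frac{(x_1 - x_2)^2}{2}
```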
A statistical functional $\theta = T(F)$ belongs to a special family of expectation functionals when:

- $T(F) = \mathbb{E}_F\, \phi(X_1, \ldots, X_a)$, and
- $\phi$ is a symmetric kernel of degree $a$.
Plug-in estimators of expectation functionals are referred to as V-statistics and can be expressed explicitly as,

$$V_n = T(\hat{F}_n) = \frac{1}{n^a} \sum_{i_1 = 1}^{n} \cdots \sum_{i_a = 1}^{n} \phi(X_{i_1}, \ldots, X_{i_a}),$$

so that $V_n$ is the average of $\phi$ evaluated at all possible permutations of size $a$ from $X_1, \ldots, X_n$. Since the $X_i$ can appear more than once within each summand, $V_n$ is generally biased.
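Here is a small R sketch (again my own, with the symmetric variance kernel $\phi(x_1, x_2) = (x_1 - x_2)^2 / 2$ of degree $a = 2$) showing that the corresponding V-statistic is the biased variance estimator:

```r
set.seed(42)
x <- rnorm(25, mean = 0, sd = 2)
n <- length(x)

# Symmetric kernel of degree a = 2 whose expectation is Var(X)
phi <- function(x1, x2) (x1 - x2)^2 / 2

# V-statistic: average phi over ALL n^2 ordered pairs, duplicates included
V_n <- mean(outer(x, x, phi))

# V_n equals the variance estimator with divisor n (biased), not n - 1
all.equal(V_n, var(x) * (n - 1) / n)  # TRUE
```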
By restricting the summands to distinct indices only, an unbiased estimator known as a U-statistic arises. In fact, when the family of distributions $\mathcal{F}$ is large enough, it can be shown that a U-statistic can always be constructed for expectation functionals.

Since $\phi$ is symmetric, we can require that $1 \leq i_1 < \cdots < i_a \leq n$, resulting in $\binom{n}{a}$ combinations of the subscripts $(i_1, \ldots, i_a)$. The U-statistic is then the average of $\phi$ evaluated at all $\binom{n}{a}$ distinct combinations of $X_1, \ldots, X_n$,

$$U_n = \binom{n}{a}^{-1} \sum_{1 \leq i_1 < \cdots < i_a \leq n} \phi(X_{i_1}, \ldots, X_{i_a}).$$

While $i_1 \neq \cdots \neq i_a$ within each summand now, each $X_i$ still appears in multiple summands, suggesting that $U_n$ is the sum of correlated terms. As a result, the central limit theorem cannot be relied upon to determine the limiting distribution of $U_n$.
For more details on expectation functionals and their estimators, you can check out my blog post U-, V-, and Dupree statistics!
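Continuing the same illustrative example in R: restricting the average to the $\binom{n}{2}$ distinct pairs turns the biased V-statistic into the unbiased sample variance.

```r
set.seed(42)
x <- rnorm(25, mean = 0, sd = 2)
phi <- function(x1, x2) (x1 - x2)^2 / 2  # variance kernel, degree a = 2

# U-statistic: average phi over the choose(n, 2) distinct index pairs only
pairs <- combn(length(x), 2)
U_n <- mean(phi(x[pairs[1, ]], x[pairs[2, ]]))

# With this kernel, U_n is exactly the unbiased sample variance
all.equal(U_n, var(x))  # TRUE
```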
This blog post provides a walk-through derivation of the limiting, or asymptotic, distribution of a single U-statistic $U_n$.
Variance derivation
We must start by assuming the form of the family of distributions $\mathcal{F}$ to which $F$ can belong. It suffices that $\mathcal{F}$ is the family of all distributions for which,

$$\mathbb{E}_F\, \phi^2(X_1, \ldots, X_a) < \infty,$$

such that $\text{Var}_F\, \phi$ exists. Then, since $\mathbb{E}\, U_n = \theta$, we have

$$\text{Var}(U_n) = \mathbb{E}\, (U_n - \theta)^2.$$
Equivalently, let $C$ represent the set of all $\binom{n}{a}$ combinations of the subscripts $\{i_1, \ldots, i_a\}$ satisfying $1 \leq i_1 < \cdots < i_a \leq n$; then

$$U_n = \binom{n}{a}^{-1} \sum_{c \,\in\, C} \phi(X_c),$$

where $X_c$ denotes the elements $(X_{i_1}, \ldots, X_{i_a})$ indexed by the combination $c$.
Recalling that

$$U_n - \theta = \binom{n}{a}^{-1} \sum_{c \,\in\, C} \left(\phi(X_c) - \theta\right),$$

we can re-express $\text{Var}(U_n)$ as,

$$\text{Var}(U_n) = \binom{n}{a}^{-2} \sum_{c \,\in\, C} \sum_{c' \,\in\, C} \mathbb{E}\left[\left(\phi(X_c) - \theta\right) \left(\phi(X_{c'}) - \theta\right)\right].$$
Let’s focus attention on a single summand,

$$\mathbb{E}\left[\left(\phi(X_c) - \theta\right) \left(\phi(X_{c'}) - \theta\right)\right].$$

Let $k$ represent the number of common elements between $c$ and $c'$. Then $0 \leq k \leq a$, as $c$ and $c'$ can be identical, such that they share all $a$ elements; completely distinct, such that they share no elements; or anything in between.
Recall that

$$\mathbb{E}\, \phi(X_c) = \theta,$$

and by definition,

$$\text{Cov}\left(\phi(X_c), \phi(X_{c'})\right) = \mathbb{E}\left[\left(\phi(X_c) - \theta\right) \left(\phi(X_{c'}) - \theta\right)\right].$$

Now, to simplify notation, let $X_c$ represent the elements $(X_1, \ldots, X_k, X_{k+1}, \ldots, X_a)$ and $X_{c'}$ represent the elements $(X_1, \ldots, X_k, X'_{k+1}, \ldots, X'_a)$. With $k$ common elements,

$$\text{Cov}\left(\phi(X_c), \phi(X_{c'})\right) = \mathbb{E}\left[\left(\phi(X_1, \ldots, X_k, X_{k+1}, \ldots, X_a) - \theta\right) \left(\phi(X_1, \ldots, X_k, X'_{k+1}, \ldots, X'_a) - \theta\right)\right].$$
To simplify this statement, we need to combine some clever conditioning with the all-powerful law of iterated expectation (a.k.a. law of total expectation). Conditioning on the $k$ common elements $X_1, \ldots, X_k$ will make the two terms within the expectation independent, as they share no other elements.
For $1 \leq k \leq a$, define

$$\phi_k(x_1, \ldots, x_k) = \mathbb{E}\, \phi(x_1, \ldots, x_k, X_{k+1}, \ldots, X_a).$$

Then, applying the law of iterated expectation,

$$\mathbb{E}\left[\phi_k(X_1, \ldots, X_k)\right] = \mathbb{E}\, \phi(X_1, \ldots, X_a) = \theta.$$

We also define $\sigma_k^2 = \text{Var}\left(\phi_k(X_1, \ldots, X_k)\right)$.

Note that when $k = 0$,

$$\phi_0 = \mathbb{E}\, \phi(X_1, \ldots, X_a) = \theta,$$

such that $\sigma_0^2 = 0$, and when $k = a$,

$$\phi_a(x_1, \ldots, x_a) = \phi(x_1, \ldots, x_a),$$

such that $\sigma_a^2 = \text{Var}_F\, \phi$.
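To see what $\phi_1$ looks like in practice, consider again the illustrative variance kernel $\phi(x_1, x_2) = (x_1 - x_2)^2 / 2$, with $\mu = \mathbb{E}\, X$ and $\sigma^2 = \text{Var}(X)$ (my own worked example):

```latex
% Conditioning on x1 and averaging over X2:
\phi_1(x_1) = \mathbb{E}\, \phi(x_1, X_2)
            = \tfrac{1}{2}\, \mathbb{E}(x_1 - X_2)^2
            = \tfrac{1}{2} \left[ (x_1 - \mu)^2 + \sigma^2 \right],
% so that
\sigma_1^2 = \text{Var}(\phi_1(X_1)) = \tfrac{1}{4}\, \text{Var}\left[ (X_1 - \mu)^2 \right]
```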
Let $\bar{\phi}_c = \phi(X_c) - \theta$ and $\bar{\phi}_{c'} = \phi(X_{c'}) - \theta$. Then, the law of iterated expectation tells us,

$$\mathbb{E}\left[\bar{\phi}_c\, \bar{\phi}_{c'}\right] = \mathbb{E}\left[\, \mathbb{E}\left[\bar{\phi}_c\, \bar{\phi}_{c'} \mid X_1, \ldots, X_k\right]\right].$$

Since $\bar{\phi}_c$ and $\bar{\phi}_{c'}$ become independent when we condition on $X_1, \ldots, X_k$, we have

$$\mathbb{E}\left[\bar{\phi}_c\, \bar{\phi}_{c'} \mid X_1, \ldots, X_k\right] = \mathbb{E}\left[\bar{\phi}_c \mid X_1, \ldots, X_k\right] \mathbb{E}\left[\bar{\phi}_{c'} \mid X_1, \ldots, X_k\right],$$

and since $\theta$ is just a constant,

$$\mathbb{E}\left[\bar{\phi}_c \mid X_1, \ldots, X_k\right] = \phi_k(X_1, \ldots, X_k) - \theta, \quad \text{so that} \quad \mathbb{E}\left[\bar{\phi}_c\, \bar{\phi}_{c'} \mid X_1, \ldots, X_k\right] = \left(\phi_k(X_1, \ldots, X_k) - \theta\right)^2.$$

Noting that $\mathbb{E}\, \phi_k(X_1, \ldots, X_k) = \theta$,

$$\text{Cov}\left(\phi(X_c), \phi(X_{c'})\right) = \mathbb{E}\left(\phi_k(X_1, \ldots, X_k) - \theta\right)^2 = \text{Var}\left(\phi_k(X_1, \ldots, X_k)\right) = \sigma_k^2.$$
How many pairs $\phi(X_c)$ and $\phi(X_{c'})$ have $k$ elements in common?

- There are $\binom{n}{a}$ ways to select the indices $c$.
- Then, there are $\binom{a}{k}$ ways of selecting the $k$ elements of $c$ shared by $c'$.
- Finally, there are $\binom{n-a}{a-k}$ ways of selecting the remaining $a - k$ elements of $c'$ from the $n - a$ elements available.
Combining all of these components together, we obtain

$$\text{Var}(U_n) = \binom{n}{a}^{-2} \sum_{k=0}^{a} \binom{n}{a} \binom{a}{k} \binom{n-a}{a-k} \sigma_k^2,$$

but since $\sigma_0^2 = 0$, simplifying yields,

$$\text{Var}(U_n) = \binom{n}{a}^{-1} \sum_{k=1}^{a} \binom{a}{k} \binom{n-a}{a-k} \sigma_k^2.$$
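For instance (my own check of the formula, not part of the original derivation), setting $a = 2$ gives an exact expression whose leading term already hints at the asymptotic behaviour:

```latex
% Exact variance of a U-statistic with a kernel of degree a = 2:
\text{Var}(U_n)
  = \binom{n}{2}^{-1} \left[ \binom{2}{1} \binom{n-2}{1} \sigma_1^2
                           + \binom{2}{2} \binom{n-2}{0} \sigma_2^2 \right]
  = \frac{2(n-2)\, \sigma_1^2 + \sigma_2^2}{\binom{n}{2}}
  \approx \frac{4 \sigma_1^2}{n} \quad \text{for large } n
```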
Note that for large $n$,

$$\binom{n}{a} \sim \frac{n^a}{a!} \quad \text{and} \quad \binom{n-a}{a-k} \sim \frac{n^{a-k}}{(a-k)!}.$$

Then, we can rewrite $\text{Var}(U_n)$ as,

$$\text{Var}(U_n) \sim \frac{a!}{n^a} \sum_{k=1}^{a} \binom{a}{k} \frac{n^{a-k}}{(a-k)!}\, \sigma_k^2 = \sum_{k=1}^{a} \binom{a}{k} \frac{a!}{(a-k)!}\, \frac{\sigma_k^2}{n^k}.$$
Expanding out the first few terms,

$$\text{Var}(U_n) \sim \binom{a}{1} \frac{a!}{(a-1)!}\, \frac{\sigma_1^2}{n} + \binom{a}{2} \frac{a!}{(a-2)!}\, \frac{\sigma_2^2}{n^2} + \cdots + \frac{a!\, \sigma_a^2}{n^a}.$$

Since $\binom{a}{1} = a$ and $a!/(a-1)! = a$, simplifying yields

$$\text{Var}(U_n) \sim \frac{a^2 \sigma_1^2}{n} + O\!\left(\frac{1}{n^2}\right).$$

The first term of the variance dominates and we have,

$$\text{Var}(U_n) \sim \frac{a^2 \sigma_1^2}{n},$$

where

$$\sigma_1^2 = \text{Var}\left(\phi_1(X_1)\right) = \text{Var}\left(\mathbb{E}\left[\phi(X_1, \ldots, X_a) \mid X_1\right]\right).$$
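A quick Monte Carlo sanity check in R (my own, using the illustrative variance kernel with standard normal data, for which $\sigma_1^2 = \tfrac{1}{4}\text{Var}[(X - \mu)^2] = 1/2$):

```r
set.seed(42)
n <- 50; a <- 2; sigma1_sq <- 1 / 2  # for X ~ N(0, 1), Var[(X - mu)^2] = 2

# Simulate many U-statistics; with the variance kernel, U_n = var(x)
U <- replicate(10000, var(rnorm(n)))

var(U)               # empirical variance of U_n, approx 0.041
a^2 * sigma1_sq / n  # asymptotic approximation: 4 * (1/2) / 50 = 0.04
```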
So, now we know that $\mathbb{E}\, U_n = \theta$ and $\text{Var}(U_n) \sim a^2 \sigma_1^2 / n$, but we want to know if

$$\sqrt{n}\, (U_n - \theta) \overset{d}{\rightarrow} N(0,\, a^2 \sigma_1^2).$$

Again, we cannot use the central limit theorem (CLT) to show that $U_n$ is asymptotically normal since it is the sum of dependent terms. However, we can:

- Construct a surrogate $U_n^*$ for which the CLT does apply, and
- Demonstrate that $U_n$ converges to the same distribution as $U_n^*$.
Selection and distribution of a surrogate
What should we choose as the surrogate? Well, we want something that will have the same mean and variance as $U_n$. The variance only involves $\sigma_1^2 = \text{Var}(\phi_1(X_1))$, which suggests conditioning on a single $X_i$. Thus, as a surrogate, let's consider the projection

$$U_n^* = \theta + \sum_{i=1}^{n} \mathbb{E}\left[U_n - \theta \mid X_i\right].$$
Let’s start by expanding the summand,

$$\mathbb{E}\left[U_n - \theta \mid X_i\right] = \binom{n}{a}^{-1} \sum_{c \,\in\, C} \mathbb{E}\left[\phi(X_c) - \theta \mid X_i\right].$$

If $i \in c$, then $\phi(X_c)$ and $X_i$ are dependent and,

$$\mathbb{E}\left[\phi(X_c) - \theta \mid X_i\right] = \phi_1(X_i) - \theta,$$

else if $i \notin c$, then $\phi(X_c)$ and $X_i$ are independent and,

$$\mathbb{E}\left[\phi(X_c) - \theta \mid X_i\right] = \mathbb{E}\left[\phi(X_c)\right] - \theta = 0.$$
Of the $\binom{n}{a}$ terms, how many include $X_i$? If $i \in c$, there are $n - 1$ possible choices for the remaining subscripts, of which we require $a - 1$. Thus we have,

$$\mathbb{E}\left[U_n - \theta \mid X_i\right] = \binom{n}{a}^{-1} \binom{n-1}{a-1} \left(\phi_1(X_i) - \theta\right) = \frac{a}{n} \left(\phi_1(X_i) - \theta\right),$$

so that

$$U_n^* = \theta + \frac{a}{n} \sum_{i=1}^{n} \left(\phi_1(X_i) - \theta\right).$$
The $\phi_1(X_i)$ are independent and identically distributed with mean $\theta$ and variance $\sigma_1^2$. The expectation of $U_n^*$ with respect to $F$ is,

$$\mathbb{E}\, U_n^* = \theta + \frac{a}{n} \sum_{i=1}^{n} \left(\mathbb{E}\, \phi_1(X_i) - \theta\right) = \theta,$$

and the corresponding variance is,

$$\text{Var}(U_n^*) = \frac{a^2}{n^2} \sum_{i=1}^{n} \text{Var}\left(\phi_1(X_i)\right) = \frac{a^2 \sigma_1^2}{n}.$$

Finally, $U_n^*$ is the sum of independent and identically distributed random variables and so the central limit theorem tells us,

$$\sqrt{n}\, (U_n^* - \theta) \overset{d}{\rightarrow} N(0,\, a^2 \sigma_1^2).$$
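In R, we can compute both $U_n$ and the surrogate $U_n^*$ under the illustrative variance kernel, where $\theta = \sigma^2$ and $\phi_1(x) = [(x - \mu)^2 + \sigma^2] / 2$, and see that they are already close for moderate $n$ (again, my own sketch):

```r
set.seed(42)
n <- 100
x <- rnorm(n)  # X ~ N(0, 1), so mu = 0 and theta = sigma^2 = 1

# U_n: the unbiased sample variance (kernel (x1 - x2)^2 / 2, a = 2)
U_n <- var(x)

# Surrogate: U_n_star = theta + (a / n) * sum(phi_1(X_i) - theta)
theta <- 1
phi_1 <- ((x - 0)^2 + 1) / 2  # phi_1(x) = ((x - mu)^2 + sigma^2) / 2
U_n_star <- theta + (2 / n) * sum(phi_1 - theta)

c(U_n = U_n, U_n_star = U_n_star)  # nearly identical
```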
Convergence to surrogate
We can express our quantity of interest $\sqrt{n}\, (U_n - \theta)$ in terms of our surrogate $\sqrt{n}\, (U_n^* - \theta)$ as,

$$\sqrt{n}\, (U_n - \theta) = \sqrt{n}\, (U_n^* - \theta) + \sqrt{n}\, (U_n - U_n^*).$$

Thus, if we can show that

$$\sqrt{n}\, (U_n - U_n^*) \overset{p}{\rightarrow} 0,$$

we have our desired result by Slutsky's theorem! To prove this, it is sufficient to demonstrate convergence in quadratic mean, which implies convergence in probability,

$$\mathbb{E}\left[\sqrt{n}\, (U_n - U_n^*)\right]^2 \rightarrow 0.$$

So, expanding the quadratic mean, we have

$$\mathbb{E}\left[\sqrt{n}\, (U_n - U_n^*)\right]^2 = n\, \mathbb{E}\left[(U_n - \theta) - (U_n^* - \theta)\right]^2 = n \left[\mathbb{E}\, (U_n - \theta)^2 - 2\, \mathbb{E}\left[(U_n - \theta)(U_n^* - \theta)\right] + \mathbb{E}\, (U_n^* - \theta)^2\right].$$
Since $\mathbb{E}\, U_n = \theta$, $\mathbb{E}\, U_n^* = \theta$, and $\theta$ is a constant, this simplifies to

$$\mathbb{E}\left[\sqrt{n}\, (U_n - U_n^*)\right]^2 = n \left[\text{Var}(U_n) - 2\, \text{Cov}(U_n, U_n^*) + \text{Var}(U_n^*)\right].$$
We know $\text{Var}(U_n)$ and $\text{Var}(U_n^*)$ from our earlier work, both of which equal $a^2 \sigma_1^2 / n$ asymptotically. We can recycle our tricks from earlier to figure out $\text{Cov}(U_n, U_n^*)$.
Expanding the covariance yields,

$$\text{Cov}(U_n, U_n^*) = \mathbb{E}\left[(U_n - \theta)(U_n^* - \theta)\right] = \frac{a}{n} \sum_{i=1}^{n} \mathbb{E}\left[(U_n - \theta)\left(\phi_1(X_i) - \theta\right)\right].$$

Conditioning on $X_i$ and applying the law of iterated expectation,

$$\mathbb{E}\left[(U_n - \theta)\left(\phi_1(X_i) - \theta\right)\right] = \mathbb{E}\left[\left(\phi_1(X_i) - \theta\right) \mathbb{E}\left[U_n - \theta \mid X_i\right]\right] = \frac{a}{n}\, \mathbb{E}\left(\phi_1(X_i) - \theta\right)^2 = \frac{a}{n}\, \sigma_1^2.$$

Plugging in our previous results yields,

$$\text{Cov}(U_n, U_n^*) = \frac{a}{n} \sum_{i=1}^{n} \frac{a}{n}\, \sigma_1^2 = \frac{a^2 \sigma_1^2}{n}.$$
Finally, for large $n$, we have

$$n \left[\text{Var}(U_n) - 2\, \text{Cov}(U_n, U_n^*) + \text{Var}(U_n^*)\right] \sim n \left[\frac{a^2 \sigma_1^2}{n} - \frac{2\, a^2 \sigma_1^2}{n} + \frac{a^2 \sigma_1^2}{n}\right] \rightarrow 0.$$
Huzzah! Finally, we have shown that a single U-statistic $U_n$ is asymptotically normally distributed with mean $\theta$ and variance $a^2 \sigma_1^2 / n$,

$$\sqrt{n}\, (U_n - \theta) \overset{d}{\rightarrow} N(0,\, a^2 \sigma_1^2).$$
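As a final sanity check (my own simulation, not part of the original derivation), we can watch the result materialize for the variance kernel with standard normal data, where $\theta = 1$ and $a^2 \sigma_1^2 = 4 \cdot \tfrac{1}{2} = 2$:

```r
set.seed(42)
n <- 200

# 10,000 replicates of sqrt(n) * (U_n - theta), with U_n the sample variance
z <- replicate(10000, sqrt(n) * (var(rnorm(n)) - 1))

mean(z)               # approx 0
var(z)                # approx a^2 * sigma_1^2 = 2
qqnorm(z); qqline(z)  # points hug the line: approximately normal
```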
Click here to download this blog post as an RMarkdown (.Rmd) file!
