After my last grand slam title, "U-, V-, and Dupree statistics", I was really feeling the pressure to keep my title game strong. Thank you to my wonderful friend Steve Lee for suggesting this beautiful title.
Overview
A statistical functional is any real-valued function of a distribution function $F$ such that
$$\theta = T(F),$$
and represents a characteristic of the distribution $F$. Examples include the mean, variance, and quantiles.
Often $F$ is unknown but is assumed to belong to a broad class of distribution functions $\mathcal{F}$ subject only to mild restrictions such as continuity or existence of specific moments.
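A random sample $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} F$ can be used to construct the empirical cumulative distribution function (ECDF) $\hat{F}_n$,
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(X_i \leq x),$$
which assigns mass $\frac{1}{n}$ to each $X_i$.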
The ECDF $\hat{F}_n$ is a valid, discrete CDF which can be substituted for $F$ to obtain the estimator $\hat{\theta} = T(\hat{F}_n)$. These estimators are referred to as plug-in estimators for obvious reasons.
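As a minimal sketch of the plug-in idea, the Python snippet below builds an ECDF and evaluates the mean and variance functionals at it. The function names and the four data points are made up for illustration and are not from the original post.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical CDF of `sample` as a function of x."""
    ordered = sorted(sample)
    n = len(ordered)

    def f_hat(x):
        # Each observation carries mass 1/n, so the ECDF at x is the
        # proportion of observations less than or equal to x.
        return bisect_right(ordered, x) / n

    return f_hat


def plug_in_mean(sample):
    """Mean functional evaluated at the ECDF: the sample average."""
    return sum(sample) / len(sample)


def plug_in_variance(sample):
    """Variance functional evaluated at the ECDF (divides by n, not n - 1)."""
    m = plug_in_mean(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)


sample = [2.0, 4.0, 4.0, 7.0]  # hypothetical data for illustration
f_hat = ecdf(sample)
print(f_hat(4.0), plug_in_mean(sample), plug_in_variance(sample))
```

Note that the plug-in variance divides by the sample size rather than the sample size minus one, which is why it is biased.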
For more details on statistical functionals and plug-in estimators, you can check out my blog post Plug-in estimators of statistical functionals!
Many statistical functionals take the form of an expectation of a real-valued function $\phi$ with respect to $F$ such that for $a \leq n$,
$$\theta = T(F) = E_F\, \phi(X_1, \dots, X_a).$$
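When $\phi(x_1, \dots, x_a)$ is a function symmetric in its arguments such that, e.g., $\phi(x_1, x_2) = \phi(x_2, x_1)$, it is referred to as a symmetric kernel of degree $a$. If $\phi$ is not symmetric, a symmetric equivalent $\phi^*$ can always be found,
$$\phi^*(x_1, \dots, x_a) = \frac{1}{a!} \sum_{\pi \in \Pi} \phi(x_{\pi(1)}, \dots, x_{\pi(a)}),$$
where $\Pi$ represents the set of all permutations of the indices $1, \dots, a$.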
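Symmetrizing a kernel by averaging over all orderings of its arguments can be sketched in a few lines of Python. The helper `symmetrize` and the example kernel are hypothetical names chosen for illustration. The example kernel $x_1^2 - x_1 x_2$ has expectation equal to the variance for i.i.d. arguments, and its symmetric version works out to $(x_1 - x_2)^2 / 2$.

```python
from itertools import permutations
from math import factorial


def symmetrize(kernel, degree):
    """Average `kernel` over all orderings of its `degree` arguments."""

    def symmetric_kernel(*args):
        if len(args) != degree:
            raise ValueError("wrong number of arguments")
        total = sum(kernel(*perm) for perm in permutations(args))
        return total / factorial(degree)

    return symmetric_kernel


def asymmetric(x1, x2):
    # Not symmetric: swapping the arguments changes the value.
    return x1 ** 2 - x1 * x2


sym = symmetrize(asymmetric, 2)
print(asymmetric(1.0, 3.0), asymmetric(3.0, 1.0))  # differ
print(sym(1.0, 3.0), sym(3.0, 1.0))                # agree
```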
A statistical functional $\theta = T(F)$ belongs to a special family of expectation functionals when:
- $T(F) = E_F\, \phi(X_1, \dots, X_a)$, and
- $\phi(x_1, \dots, x_a)$ is a symmetric kernel of degree $a$.
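Plug-in estimators of expectation functionals are referred to as V-statistics and can be expressed explicitly as,
$$V_n = \frac{1}{n^a} \sum_{i_1=1}^{n} \cdots \sum_{i_a=1}^{n} \phi(X_{i_1}, \dots, X_{i_a}),$$
so that $V_n$ is the average of $\phi$ evaluated at all $n^a$ possible ordered selections, with repetition allowed, of size $a$ from $X_1, \dots, X_n$. Since the same $X_i$ can appear more than once within each summand, $V_n$ is generally biased.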
By restricting the summands to distinct indices only, an unbiased estimator known as a U-statistic arises. In fact, when the family of distributions $\mathcal{F}$ is large enough, it can be shown that a U-statistic can always be constructed for expectation functionals.
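Since $\phi$ is symmetric, we can require that $1 \leq i_1 < \dots < i_a \leq n$, resulting in $\binom{n}{a}$ combinations of the subscripts $1, \dots, n$. The U-statistic is then the average of $\phi$ evaluated at all $\binom{n}{a}$ distinct combinations of $X_1, \dots, X_n$,
$$U_n = \binom{n}{a}^{-1} \sum_{1 \leq i_1 < \dots < i_a \leq n} \phi(X_{i_1}, \dots, X_{i_a}).$$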
While $i_j \neq i_k$ within each summand now, each $X_i$ still appears in multiple summands, suggesting that $U_n$ is the sum of correlated terms. As a result, the classical central limit theorem for independent summands cannot be applied directly to determine the limiting distribution of $U_n$.
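To make the difference between the two estimators concrete, here is a small Python sketch that evaluates both for the degree-2 kernel $(x_1 - x_2)^2 / 2$, whose expectation is the variance. The function names and data are made up for illustration. With this kernel the U-statistic reproduces the usual unbiased sample variance, while the V-statistic reproduces the biased plug-in variance.

```python
from itertools import combinations, product


def u_statistic(sample, kernel, degree):
    """Average the kernel over all subsets of `degree` distinct observations."""
    values = [kernel(*subset) for subset in combinations(sample, degree)]
    return sum(values) / len(values)


def v_statistic(sample, kernel, degree):
    """Average the kernel over all ordered selections, repeats allowed."""
    values = [kernel(*chosen) for chosen in product(sample, repeat=degree)]
    return sum(values) / len(values)


def variance_kernel(x1, x2):
    # Symmetric kernel of degree 2 whose expectation is Var(X).
    return 0.5 * (x1 - x2) ** 2


sample = [2.0, 4.0, 4.0, 7.0]  # hypothetical data for illustration
u_n = u_statistic(sample, variance_kernel, 2)
v_n = v_statistic(sample, variance_kernel, 2)
print(u_n, v_n)
```

Brute-force enumeration like this is only practical for small samples, since the number of combinations grows quickly with the sample size.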
For more details on expectation functionals and their estimators, you can check out my blog post U-, V-, and Dupree statistics!
This blog post provides a walk-through derivation of the limiting, or asymptotic, distribution of a single U-statistic $U_n$.
Variance derivation
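We must start by assuming the form of the family of distributions $\mathcal{F}$ to which $F$ can belong. It suffices that $\mathcal{F}$ is the family of all distributions for which,
$$E_F\, \phi^2(X_1, \dots, X_a) < \infty,$$
such that $\text{Var}_F\, \phi(X_1, \dots, X_a)$ exists. Then, since $\text{Var}[cY] = c^2\, \text{Var}[Y]$ for any constant $c$, we have
$$\text{Var}\, U_n = \binom{n}{a}^{-2} \text{Var}\left[ \sum_{1 \leq i_1 < \dots < i_a \leq n} \phi(X_{i_1}, \dots, X_{i_a}) \right].$$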
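Equivalently, let $\mathcal{B}$ represent the set of all $\binom{n}{a}$ combinations $\beta = (\beta_1, \dots, \beta_a)$ of the subscripts $1, \dots, n$, then
$$\text{Var}\, U_n = \binom{n}{a}^{-2} \text{Var}\left[ \sum_{\beta \in \mathcal{B}} \phi(X_{\beta_1}, \dots, X_{\beta_a}) \right].$$
Recalling that
$$\text{Var}\left[ \sum_{i} Y_i \right] = \sum_{i} \sum_{j} \text{Cov}[Y_i, Y_j],$$
we can re-express $\text{Var}\, U_n$ as,
$$\text{Var}\, U_n = \binom{n}{a}^{-2} \sum_{\beta \in \mathcal{B}} \sum_{\beta' \in \mathcal{B}} \text{Cov}\left[ \phi(X_{\beta_1}, \dots, X_{\beta_a}),\ \phi(X_{\beta'_1}, \dots, X_{\beta'_a}) \right].$$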
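Let's focus attention on a single summand,
$$\text{Cov}\left[ \phi(X_{\beta_1}, \dots, X_{\beta_a}),\ \phi(X_{\beta'_1}, \dots, X_{\beta'_a}) \right].$$
Let $c$ represent the number of common elements between the index sets $\beta$ and $\beta'$. Then,
$$0 \leq c \leq a,$$
as $\beta$ and $\beta'$ can be identical, such that they share all $a$ elements, or completely distinct, such that they share no elements, or anything in between.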
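Recall that
$$E\, \phi(X_{\beta_1}, \dots, X_{\beta_a}) = E\, \phi(X_{\beta'_1}, \dots, X_{\beta'_a}) = \theta,$$
and by definition,
$$\text{Cov}\left[ \phi(X_{\beta_1}, \dots, X_{\beta_a}),\ \phi(X_{\beta'_1}, \dots, X_{\beta'_a}) \right] = E\left[ \left( \phi(X_{\beta_1}, \dots, X_{\beta_a}) - \theta \right) \left( \phi(X_{\beta'_1}, \dots, X_{\beta'_a}) - \theta \right) \right].$$
Now, to simplify notation, let $X_1, \dots, X_c, X_{c+1}, \dots, X_a$ represent the elements $X_{\beta_1}, \dots, X_{\beta_a}$ and $X_1, \dots, X_c, X'_{c+1}, \dots, X'_a$ represent the elements $X_{\beta'_1}, \dots, X_{\beta'_a}$. With $c$ common elements,
$$\text{Cov}\left[ \phi(X_{\beta_1}, \dots, X_{\beta_a}),\ \phi(X_{\beta'_1}, \dots, X_{\beta'_a}) \right] = E\left[ \left( \phi(X_1, \dots, X_c, X_{c+1}, \dots, X_a) - \theta \right) \left( \phi(X_1, \dots, X_c, X'_{c+1}, \dots, X'_a) - \theta \right) \right].$$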
To simplify this statement, we need to combine some clever conditioning with the all-powerful law of iterated expectation (a.k.a. the law of total expectation).
Conditioning on the common elements $X_1, \dots, X_c$ will make the two terms within the expectation independent as they share no other elements.
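For $c = 0, 1, \dots, a$, define
$$\phi_c(x_1, \dots, x_c) = E\left[ \phi(x_1, \dots, x_c, X_{c+1}, \dots, X_a) \right],$$
the expectation of the kernel with its first $c$ arguments held fixed. Then, applying the law of iterated expectation,
$$E\, \phi_c(X_1, \dots, X_c) = E\left\{ E\left[ \phi(X_1, \dots, X_a) \mid X_1, \dots, X_c \right] \right\} = E\, \phi(X_1, \dots, X_a) = \theta.$$
We also define $\sigma_c^2 = \text{Var}\left[ \phi_c(X_1, \dots, X_c) \right]$.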
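Note that when $c = 0$,
$$\phi_0 = E\, \phi(X_1, \dots, X_a) = \theta,$$
such that $\sigma_0^2 = 0$, and when $c = a$,
$$\phi_a(x_1, \dots, x_a) = \phi(x_1, \dots, x_a),$$
such that $\sigma_a^2 = \text{Var}\, \phi(X_1, \dots, X_a)$.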
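The partially averaged kernels and their variances can be computed exactly when the data follow a distribution with small finite support. The Python sketch below does this by enumeration for the variance kernel $(x_1 - x_2)^2 / 2$, assuming observations uniform on $\{0, 1, 2\}$. Both the distribution and the function names are illustrative assumptions, not part of the original derivation.

```python
from fractions import Fraction
from itertools import product


def variance_kernel(x1, x2):
    # Symmetric kernel of degree 2 whose expectation is Var(X).
    return Fraction((x1 - x2) ** 2, 2)


def sigma2_c(kernel, degree, support, c):
    """Variance of the kernel averaged over its last (degree - c) arguments,
    for i.i.d. observations uniform on `support` (assumed distribution)."""

    def phi_c(fixed):
        # Hold the first c arguments fixed and average over the rest.
        rest = list(product(support, repeat=degree - c))
        return sum(kernel(*fixed, *r) for r in rest) / len(rest)

    values = [phi_c(fixed) for fixed in product(support, repeat=c)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


support = [0, 1, 2]
sigmas = [sigma2_c(variance_kernel, 2, support, c) for c in range(3)]
print(sigmas)
```

The first entry is zero because averaging over every argument leaves only a constant, and the last entry is the variance of the kernel itself.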
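Let $Y = \phi(X_1, \dots, X_c, X_{c+1}, \dots, X_a)$ and $Y' = \phi(X_1, \dots, X_c, X'_{c+1}, \dots, X'_a)$. Then, the law of iterated expectation tells us,
$$\text{Cov}[Y, Y'] = E\left[ (Y - \theta)(Y' - \theta) \right] = E\left\{ E\left[ (Y - \theta)(Y' - \theta) \mid X_1, \dots, X_c \right] \right\}.$$
Since $Y$ and $Y'$ become independent when we condition on $X_1, \dots, X_c$, we have
$$\text{Cov}[Y, Y'] = E\left\{ E\left[ Y - \theta \mid X_1, \dots, X_c \right]\, E\left[ Y' - \theta \mid X_1, \dots, X_c \right] \right\},$$
and since $\theta$ is just a constant,
$$\text{Cov}[Y, Y'] = E\left[ \left( \phi_c(X_1, \dots, X_c) - \theta \right)^2 \right].$$
Noting that $E\, \phi_c(X_1, \dots, X_c) = \theta$,
$$\text{Cov}[Y, Y'] = \text{Var}\left[ \phi_c(X_1, \dots, X_c) \right] = \sigma_c^2.$$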
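How many pairs of index sets $\beta$ and $\beta'$ have exactly $c$ elements in common?
- There are $\binom{n}{a}$ ways to select the indices $\beta$.
- Then, there are $\binom{a}{c}$ ways of selecting the $c$ elements from $\beta$ shared by $\beta'$.
- Finally, there are $\binom{n-a}{a-c}$ ways of selecting the remaining $a - c$ elements of $\beta'$ from the remaining $n - a$ elements available.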
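The counting argument above can be checked by brute force for small cases. The following Python sketch enumerates every ordered pair of index subsets, tallies them by overlap size, and compares the tallies with the product of the three binomial coefficients. The sample size 6 and subset size 2 are arbitrary illustrative choices.

```python
from itertools import combinations
from math import comb


def overlap_counts(n, a):
    """Count ordered pairs of size-a subsets of {0, ..., n-1} by overlap size."""
    subsets = [frozenset(s) for s in combinations(range(n), a)]
    counts = [0] * (a + 1)
    for first in subsets:
        for second in subsets:
            counts[len(first & second)] += 1
    return counts


n, a = 6, 2  # small illustrative values
brute_force = overlap_counts(n, a)
formula = [comb(n, a) * comb(a, c) * comb(n - a, a - c) for c in range(a + 1)]
print(brute_force, formula)
```

The counts add up to the square of the number of subsets, as they must, since every ordered pair has some overlap size between zero and the subset size.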
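Combining all of these components together, we obtain
$$\text{Var}\, U_n = \binom{n}{a}^{-2} \sum_{c=0}^{a} \binom{n}{a} \binom{a}{c} \binom{n-a}{a-c} \sigma_c^2,$$
but since $\sigma_0^2 = 0$, simplifying yields,
$$\text{Var}\, U_n = \binom{n}{a}^{-1} \sum_{c=1}^{a} \binom{a}{c} \binom{n-a}{a-c} \sigma_c^2.$$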
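Note that for large $n$,
$$\binom{n}{a}^{-1} \binom{a}{c} \binom{n-a}{a-c} = \frac{a!^2}{c!\, (a-c)!^2} \cdot \frac{(n-a)!^2}{n!\, (n-2a+c)!} \sim \frac{a!^2}{c!\, (a-c)!^2} \cdot \frac{1}{n^c}.$$
Then, we can rewrite $\text{Var}\, U_n$ as,
$$\text{Var}\, U_n \sim \sum_{c=1}^{a} \frac{a!^2}{c!\, (a-c)!^2} \cdot \frac{\sigma_c^2}{n^c}.$$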
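Expanding out the first few terms,
$$\text{Var}\, U_n \sim \frac{a!^2}{1!\, (a-1)!^2} \cdot \frac{\sigma_1^2}{n} + \frac{a!^2}{2!\, (a-2)!^2} \cdot \frac{\sigma_2^2}{n^2} + \dots + a! \cdot \frac{\sigma_a^2}{n^a}.$$
Since $\frac{a!}{(a-1)!} = a$ and $\frac{a!}{(a-2)!} = a(a-1)$, simplifying yields
$$\text{Var}\, U_n \sim \frac{a^2}{n} \sigma_1^2 + \frac{a^2 (a-1)^2}{2 n^2} \sigma_2^2 + \dots + \frac{a!}{n^a} \sigma_a^2.$$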
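Provided $\sigma_1^2 > 0$, the first term of the variance dominates and we have,
$$\text{Var}\, U_n \sim \frac{a^2}{n} \sigma_1^2,$$
where
$$\sigma_1^2 = \text{Var}\left[ \phi_1(X_1) \right] = \text{Var}\left\{ E\left[ \phi(X_1, \dots, X_a) \mid X_1 \right] \right\}.$$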
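To see how quickly the leading term takes over, the Python sketch below compares the exact finite-sample variance with its leading term. As a loudly labeled assumption, it uses the variance kernel $(x_1 - x_2)^2 / 2$ with standard normal data, for which the conditional variances work out to 0.5 (conditioning on one observation) and 2 (conditioning on both). The function names are hypothetical.

```python
from math import comb


def exact_variance(n, a, sigma2):
    """Exact variance of a degree-a U-statistic; sigma2[c] is the variance
    of the kernel after conditioning on c arguments (sigma2[0] is zero)."""
    return sum(
        comb(a, c) * comb(n - a, a - c) / comb(n, a) * sigma2[c]
        for c in range(1, a + 1)
    )


def leading_term(n, a, sigma2):
    """Dominant term of the variance for large n."""
    return a ** 2 * sigma2[1] / n


# Assumed values: variance kernel with standard normal observations.
sigma2 = [0.0, 0.5, 2.0]
for n in (10, 100, 1000):
    print(n, exact_variance(n, 2, sigma2), leading_term(n, 2, sigma2))
```

Under these assumptions the exact variance reduces to 2 / (n - 1), the familiar variance of the sample variance for normal data, while the leading term is 2 / n.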
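So, now we know that $E\, U_n = \theta$ and $\text{Var}\, U_n \sim \frac{a^2}{n} \sigma_1^2$, but we want to know if
$$\sqrt{n}\, (U_n - \theta) \overset{D}{\longrightarrow} N(0,\ a^2 \sigma_1^2).$$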
Again, we cannot use the central limit theorem (CLT) directly to show that $U_n$ is asymptotically normal since it is the sum of dependent terms. However, we can:
- Construct a surrogate $U_n^*$ for which the CLT does apply, and
- Demonstrate that $\sqrt{n}\, (U_n - \theta)$ converges to the same distribution as $\sqrt{n}\, U_n^*$.
Selection and distribution of a surrogate
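What should we choose as the surrogate? Well, we want something that will have the same mean and, at least asymptotically, the same variance as $U_n - \theta$. The variance only involves $\sigma_1^2$, which suggests conditioning on a single $X_i$. Thus, as a surrogate, let's consider
$$U_n^* = \sum_{i=1}^{n} E\left[ U_n - \theta \mid X_i \right].$$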
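Let's start by expanding the summand,
$$E\left[ U_n - \theta \mid X_i \right] = \binom{n}{a}^{-1} \sum_{\beta \in \mathcal{B}} E\left[ \phi(X_{\beta_1}, \dots, X_{\beta_a}) - \theta \mid X_i \right],$$
where $\mathcal{B}$ is the set of all $\binom{n}{a}$ combinations $\beta$ of the subscripts. If $i \in \beta$ then $E\left[ \phi(X_{\beta_1}, \dots, X_{\beta_a}) \mid X_i \right] = \phi_1(X_i)$ and,
$$E\left[ \phi(X_{\beta_1}, \dots, X_{\beta_a}) - \theta \mid X_i \right] = \phi_1(X_i) - \theta,$$
else if $i \notin \beta$ then $E\left[ \phi(X_{\beta_1}, \dots, X_{\beta_a}) \mid X_i \right] = E\, \phi(X_{\beta_1}, \dots, X_{\beta_a}) = \theta$ and,
$$E\left[ \phi(X_{\beta_1}, \dots, X_{\beta_a}) - \theta \mid X_i \right] = 0.$$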
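Of the $\binom{n}{a}$ terms, how many include $i$? If $i \in \beta$, there are $n - 1$ possible choices for the remaining subscripts, of which we require $a - 1$, so $\binom{n-1}{a-1}$ terms include $i$. Thus we have,
$$E\left[ U_n - \theta \mid X_i \right] = \binom{n}{a}^{-1} \binom{n-1}{a-1} \left( \phi_1(X_i) - \theta \right) = \frac{a}{n} \left( \phi_1(X_i) - \theta \right),$$
and therefore,
$$U_n^* = \frac{a}{n} \sum_{i=1}^{n} \left( \phi_1(X_i) - \theta \right).$$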
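The $\phi_1(X_i)$ are independent and identically distributed with mean $\theta$ and variance $\sigma_1^2$. The expectation of $U_n^* = \frac{a}{n} \sum_{i=1}^{n} \left( \phi_1(X_i) - \theta \right)$ with respect to $F$ is,
$$E\, U_n^* = \frac{a}{n} \sum_{i=1}^{n} \left( E\, \phi_1(X_i) - \theta \right) = 0,$$
and the corresponding variance is,
$$\text{Var}\, U_n^* = \frac{a^2}{n^2} \sum_{i=1}^{n} \text{Var}\left[ \phi_1(X_i) \right] = \frac{a^2}{n} \sigma_1^2.$$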
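Finally, $U_n^*$ is a scaled sum of independent and identically distributed random variables and so the central limit theorem tells us,
$$\sqrt{n}\, U_n^* = \frac{a}{\sqrt{n}} \sum_{i=1}^{n} \left( \phi_1(X_i) - \theta \right) \overset{D}{\longrightarrow} N(0,\ a^2 \sigma_1^2).$$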
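A minimal Python sketch of the surrogate may help fix ideas. It checks that the share of index subsets containing a fixed index equals the kernel degree divided by the sample size, and then evaluates the surrogate for the variance kernel $(x_1 - x_2)^2 / 2$. As an illustrative assumption, the data are taken to be standard normal, so that the target is 1 and the kernel conditioned on one observation $x$ is $(x^2 + 1) / 2$. All names are hypothetical.

```python
from math import comb


def projection_weight(n, a):
    """Share of the size-a index subsets that contain one fixed index."""
    return comb(n - 1, a - 1) / comb(n, a)


def surrogate(sample, phi_1, theta, a):
    """Scaled sum of centered, conditioned kernel values (the surrogate)."""
    n = len(sample)
    return a / n * sum(phi_1(x) - theta for x in sample)


def phi_1_normal(x):
    # Assumed: variance kernel conditioned on one standard normal observation.
    return 0.5 * (x * x + 1.0)


sample = [2.0, 0.0]  # hypothetical data for illustration
print(projection_weight(10, 3))
print(surrogate(sample, phi_1_normal, theta=1.0, a=2))
```

In this particular case the surrogate simplifies to the average of the squared observations minus one, which is a plain average of independent terms.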
Convergence to surrogate
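We can express our quantity of interest $\sqrt{n}\, (U_n - \theta)$ in terms of our surrogate $U_n^*$ as,
$$\sqrt{n}\, (U_n - \theta) = \sqrt{n}\, U_n^* + \sqrt{n}\, (U_n - \theta - U_n^*).$$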
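Thus if we can show that,
$$\sqrt{n}\, (U_n - \theta - U_n^*) \overset{P}{\longrightarrow} 0,$$
we have our desired result by Slutsky's theorem! To prove this, it is sufficient to demonstrate convergence in quadratic mean,
$$n\, E\left[ (U_n - \theta - U_n^*)^2 \right] \longrightarrow 0.$$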
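So expanding the quadratic mean, we have
$$n\, E\left[ \left( (U_n - \theta) - U_n^* \right)^2 \right] = n\, E\left[ (U_n - \theta)^2 \right] + n\, E\left[ (U_n^*)^2 \right] - 2n\, E\left[ (U_n - \theta)\, U_n^* \right].$$
Since $E\, U_n = \theta$, $E\, U_n^* = 0$, and $\theta$ is a constant, this simplifies to
$$n\, \text{Var}\, U_n + n\, \text{Var}\, U_n^* - 2n\, \text{Cov}[U_n, U_n^*].$$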
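We know $n\, \text{Var}\, U_n$ and $n\, \text{Var}\, U_n^*$ from our earlier work: the latter equals $a^2 \sigma_1^2$ exactly and the former converges to $a^2 \sigma_1^2$ as $n$ grows. We can recycle our tricks from earlier to figure out $\text{Cov}[U_n, U_n^*]$.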
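Expanding the covariance yields,
$$\text{Cov}[U_n, U_n^*] = \binom{n}{a}^{-1} \frac{a}{n} \sum_{\beta \in \mathcal{B}} \sum_{i=1}^{n} \text{Cov}\left[ \phi(X_{\beta_1}, \dots, X_{\beta_a}),\ \phi_1(X_i) \right].$$
Conditioning on $X_i$ and applying the law of iterated expectation,
$$\text{Cov}\left[ \phi(X_{\beta_1}, \dots, X_{\beta_a}),\ \phi_1(X_i) \right] = E\left\{ \left( \phi_1(X_i) - \theta \right) E\left[ \phi(X_{\beta_1}, \dots, X_{\beta_a}) - \theta \mid X_i \right] \right\} = \begin{cases} \sigma_1^2 & \text{if } i \in \beta, \\ 0 & \text{if } i \notin \beta. \end{cases}$$
Plugging in our previous results yields,
$$\text{Cov}[U_n, U_n^*] = \binom{n}{a}^{-1} \frac{a}{n} \cdot n \binom{n-1}{a-1} \sigma_1^2 = \frac{a^2}{n} \sigma_1^2.$$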
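Finally, for large $n$, we have
$$n\, \text{Var}\, U_n + n\, \text{Var}\, U_n^* - 2n\, \text{Cov}[U_n, U_n^*] \longrightarrow a^2 \sigma_1^2 + a^2 \sigma_1^2 - 2 a^2 \sigma_1^2 = 0.$$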
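The vanishing quadratic mean can be illustrated numerically. Because the surrogate's variance and its covariance with the U-statistic are both the leading term of the U-statistic's variance, the scaled quadratic mean reduces to the sample size times the gap between the exact variance and that leading term. The Python sketch below evaluates this gap, again assuming (for illustration only) the variance kernel with standard normal data, where the conditional variances are 0.5 and 2. The function names are hypothetical.

```python
from math import comb


def exact_variance(n, a, sigma2):
    """Exact variance of a degree-a U-statistic; sigma2[c] is the variance
    of the kernel after conditioning on c arguments (sigma2[0] is zero)."""
    return sum(
        comb(a, c) * comb(n - a, a - c) / comb(n, a) * sigma2[c]
        for c in range(1, a + 1)
    )


def scaled_quadratic_mean(n, a, sigma2):
    """n times the mean squared gap between the centered U-statistic
    and its surrogate."""
    leading = a ** 2 * sigma2[1] / n
    # Var(U) + Var(surrogate) - 2 Cov = Var(U) - leading term
    return n * (exact_variance(n, a, sigma2) - leading)


sigma2 = [0.0, 0.5, 2.0]  # assumed: variance kernel, standard normal data
gaps = [scaled_quadratic_mean(n, 2, sigma2) for n in (10, 100, 1000, 10000)]
print(gaps)
```

Under these assumptions the gap equals 2 / (n - 1), which shrinks to zero as the sample size grows.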
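Huzzah! Finally we have shown that a single U-statistic $U_n$ is asymptotically normally distributed with mean $\theta$ and variance $\frac{a^2}{n} \sigma_1^2$,
$$\sqrt{n}\, (U_n - \theta) \overset{D}{\longrightarrow} N(0,\ a^2 \sigma_1^2).$$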
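A quick Monte Carlo check of this result can be sketched in Python. As an illustrative assumption, the simulation uses the unbiased sample variance, which is the U-statistic for the degree-2 kernel $(x_1 - x_2)^2 / 2$, with standard normal data, so the target is 1 and the limiting variance should be about 2. The sample size, number of replications, and seed are arbitrary choices.

```python
import random


def sample_variance(xs):
    """Unbiased sample variance: the U-statistic for kernel (x1 - x2)^2 / 2."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def simulate(n, reps, seed=0):
    """Draw `reps` values of sqrt(n) * (U_n - theta) under standard normal data."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        draws.append(n ** 0.5 * (sample_variance(sample) - 1.0))
    return draws


draws = simulate(n=200, reps=2000)
mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean) ** 2 for d in draws) / len(draws)
print(mc_mean, mc_var)  # should land near 0 and near 2, up to simulation noise
```

A histogram or normal quantile plot of the simulated draws would be a natural next step for judging normality by eye.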