Getting to know U: the asymptotic distribution of a single U-statistic

After my last grand slam title, U-, V-, and Dupree statistics I was really feeling the pressure to keep my title game strong. Thank you to my wonderful friend Steve Lee for suggesting this beautiful title.

Overview

A statistical functional is any real-valued function of a distribution function F such that

    \[ \theta = T(F) \]

and represents characteristics of the distribution F and include the mean, variance, and quantiles.

Often times F is unknown but is assumed to belong to a broad class of distribution functions \mathcal{F} subject only to mild restrictions such as continuity or existence of specific moments.

A random sample X_1, …, X_n \stackrel{i.i.d}{\sim} F can be used to construct the empirical cumulative distribution function (ECDF) \hat{F}_n,

    \[ \hat{F}_{n}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i \leq x) \]

which assigns mass \frac{1}{n} to each X_i.

\hat{F}_{n} is a valid, discrete CDF which can be substituted for F to obtain \hat{\theta} = T(\hat{F}_n). These estimators are referred to as plug-in estimators for obvious reasons.

For more details on statistical functionals and plug-in estimators, you can check out my blog post Plug-in estimators of statistical functionals!

Many statistical functionals take the form of an expectation of a real-valued function \phi with respect to F such that for a \leq n,

    \[ \theta = T(F) = \mathbb{E}_{F}~ \phi(X_1, …, X_a) .\]

When \phi(x_1, …, x_a) is a function symmetric in its arguments such that, for e.g. \phi(x_1, x_2) = \phi(x_2, x_1), it is referred to as a symmetric kernel of degree a. If \phi is not symmetric, a symmetric equivalent \phi^{*} can always be found,

    \[\phi^{*}(x_1, …, x_a) = \frac{1}{a!} \sum_{\pi ~\in~ \Pi} \phi(x_{\pi(1)}, …, x_{\pi(a)})\]

where \Pi represents the set of all permutations of the indices 1, …, a.

A statistical functional \theta = T(F) belongs to a special family of expectation functionals when:

  1. T(F) = \mathbb{E}_F ~\phi(X_1, …, X_a), and
  2. \phi(X_1, …, X_a) is a symmetric kernel of degree a.

Plug-in estimators of expectation functionals are referred to as V-statistics and can be expressed explicitly as,

    \[V_n = \frac{1}{n^a} \sum_{i_1 = 1}^{n} … \sum_{i_a = 1}^{n} \phi(X_{i_1}, …, X_{i_a}) \]

so that V_n is the average of \phi evaluated at all possible permutations of size a from X_1, …, X_n. Since the X_i can appear more than once within each summand, V_n is generally biased.

By restricting the summands to distinct indices only an unbiased estimator known as a U-statistic arises. In fact, when the family of distributions \mathcal{F} is large enough, it can be shown that a U-statistic can always be constructed for expectation functionals.

Since \phi is symmetric, we can require that 1 \leq i_1 < ... < i_a \leq n, resulting in {n \choose a} combinations of the subscripts 1, ..., a. The U-statistic is then the average of \phi evaluated at all {n \choose a} distinct combinations of X_1, ..., X_n,

    \[U_n = \frac{1}{{n \choose a}} \mathop{\sum … \sum} \limits_{1 \leq i_1 < ... < i_a \leq n} \phi(X_{i_1}, ..., X_{i_a}).\]

While i_j \neq i_k within each summand now, each X_i still appears in multiple summands, suggesting that U_n is the sum of correlated terms. As a result, the central limit theorem cannot be relied upon to determine the limiting distribution of U_n.

For more details on expectation functionals and their estimators, you can check out my blog post U-, V-, and Dupree statistics!

This blog post provides a walk-through derivation of the limiting, or asymptotic, distribution of a single U-statistic U_n.

Variance derivation

We must start by assuming the form of the family of distributions \mathcal{F} to which F can belong. It suffices that \mathcal{F} is the family of all distributions for which,

    \[ \text{Var}_{F}~ \phi(X_1, …, X_a) < \infty \]

such that \text{Var}_{F} ~ U_n exists. Then, since \text{Var}_F~ [aX] = a^2 ~\text{Var}_F ~[X], we have

    \[ \text{Var}_{F}~ U_n = \frac{1}{{n \choose a}^2} \text{Var}_{F} \mathop{\sum … \sum} \limits_{1 \leq i_1 < ... < i_a \leq n} \phi(X_{i_1}, ..., X_{i_a}) .\]

Equivalently, let \mathcal{C} represent the set of all combinations of the subscripts (1, ..., a), then

    \[ \text{Var}_{F}~ U_n = \frac{1}{{n \choose a}^2} \text{Var}_{F} \sum_{\mathbf{i} ~\in~ \mathcal{C}} \phi(X_{i_1}, ..., X_{i_a}) .\]

Recalling that

    \[ \text{Var}_{F} \sum_{i=1}^{n} a_i Z_i = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j ~\text{Cov}[Z_i, Z_j]\]

we can re-express \text{Var}_{F} ~ U_n as,

    \[\text{Var}_{F}~ U_n = \frac{1}{{n \choose a}^2} \sum_{\mathbf{i} ~\in~ \mathcal{C}} \sum_{\mathbf{j} ~\in~ \mathcal{C}} \text{Cov}\left[\phi(X_{i_1}, …, X_{i_a}),~ \phi(X_{j_1}, …, X_{j_a})\right].\]

Let’s focus attention on a single summand, \text{Cov}\left[\phi(X_{i_1}, …, X_{i_a}),~ \phi(X_{j_1}, …, X_{j_a}) \right].

Let c represent the common elements between (i_1, …, i_a) and (j_1, …, j_a). Then, 0 \leq c \leq a as (i_1, …, i_a) and (j_1, …, j_a) can be identical, such that they share all a elements, or completely distinct, such that they share no elements, or anything in between.

Recall that

    \[ \text{Cov}[X, Y] = \mathbb{E}_{F} \left[(X - \mathbb{E}_F~X) (Y - \mathbb{E}_F~Y)\right]\]

and by definition,

    \[ \mathbb{E}_{F}~ \phi(X_{i_1}, …, X_{i_a}) = \theta .\]

Now, to simplify notation, let (X_1, …, X_a) represent the elements (X_{i_1}, …, X_{i_a}) and (X'_{1}, …, X'_{a}) represent the elements (X_{j_1}, …, X_{j_a}). With c common elements,

    \[ \text{Cov}\left[\phi(X_{i_1}, …, X_{i_a}),~ \phi(X_{j_1}, …, X_{j_a})\right] = \mathbb{E}_{F} \left[ \left\{\phi(X_1, …, X_c, X_{c+1}, …, X_a) - \theta \right\}\left\{\phi(X_1, …, X_c, X'_{c+1}, …, X'_a) - \theta \right\} \right] \]

To simplify this statement, we need to combine some clever conditioning with the all powerful law of iterated expectation (a.k.a law of total expectation).
Conditioning on the c common elements X_1, …, X_c will make the two terms within the expectation independent as they share no other elements.

For c = 0, 1, …, a, define

    \[ \phi_{c}(X_1, …, X_c) = \mathbb{E}_F \left[ \phi(X_1, …, X_a) | X_1, …, X_c \right].\]

Then, applying the law of iterated expectation,

    \[ \mathbb{E}_F~ \phi_c(X_1, …, X_c) = \mathbb{E}_F \Big[ \mathbb{E}_F\left[\phi(X_1, …, X_a) | X_1, …, X_c\right]\Big] = \mathbb{E}_F~ \phi(X_1, …, X_a) = \theta.\]

We also define \sigma_{c}^2 = \text{Var}_{F}~ \phi_{c} (X_1, …, X_c).

Note that when c=0,

    \[\phi_0 = \mathbb{E}_F~ \phi(X_1, …, X_a) = \theta\]

such that \sigma^{2}_{0} = 0 and when c=a,

    \[\phi_a = \mathbb{E}_F \Big[ \mathbb{E}_F\left[\phi(X_1, …, X_a) | X_1, …, X_a \right]\Big] = \phi(X_1, …, X_a)\]

such that \sigma_{a}^2 = \text{Var}_{F}~ \phi(X_1, …, X_a).

Let S_1 = (X_1, …, X_c, X_{c+1}, …, X_a) and S_2 = (X_1, …, X_c, X'_{c+1}, …, X'_{a}). Then, the law of iterated expectation tells us,

    \[\text{Cov}[\phi(S_1), \phi(S_2)] = \mathbb{E}_F \Big[ \mathbb{E}_F \big[ (\phi(S_1) - \theta)(\phi(S_2) - \theta) | X_1, …, X_c\big] \Big].\]

Since \phi(S_1) and \phi(S_2) become independent when we condition on X_1, …, X_c, we have

    \[ \text{Cov}[\phi(S_1), \phi(S_2)] = \mathbb{E}_{F} \Big[ \mathbb{E}_F \left[ \phi(S_1) - \theta | X_1, …., X_c \right] \mathbb{E}_F \left[ \phi(S_2) - \theta | X_1, …., X_c \right] \Big]\]

and since \theta is just a constant,

    \[ \text{Cov}[\phi(S_1), \phi(S_2)] = \mathbb{E}_{F} \Big[ \Big(\mathbb{E}_F \left[ \phi(S_1) | X_1, …., X_c \right] - \theta \Big) \Big(\mathbb{E}_F \left[ \phi(S_2) | X_1, …., X_c \right] - \theta \Big)\Big].\]

Noting that \mathbb{E}_F \left[ \phi(S_1) | X_1, …., X_c \right] = \mathbb{E}_F \left[ \phi(S_2) | X_1, …., X_c \right] = \phi_{c}(X_1, …, X_c),

    \[ \text{Cov}[\phi(S_1), \phi(S_2)] = \mathbb{E}_{F} \Big[ \Big( \phi_{c}(X_1, …, X_c) - \theta \Big)^2\Big] = \text{Var}_F~ \phi_c (X_1, …, X_c) = \sigma^2_c.\]

How many (X_{i_1}, …, X_{i_a}) and (X_{j_1}, …, X_{j_a}) have c elements in common?

  1. There are {n \choose a} ways to select the indices (i_{1}, …, i_{a}).
  2. Then, there are {a \choose c} ways of selecting the c elements from (i_{1}, …, i_{a}) shared by (j_{1}, …, j_{a}).
  3. Finally, there are {n-a \choose a-c} ways of selecting the remaining a-c elements of (j_{1}, …, j_{a}) from the remaining n-a elements available.

Combining all of these components together, we obtain

    \[ \text{Var}_{F}~U_n = \frac{1}{{n \choose a}^2} \sum_{c=0}^{a} {n \choose a}{a \choose c}{n-a \choose a-c} \sigma^{2}_c\]

but since \sigma^{2}_0 = 0, simplifying yields,

    \[ \text{Var}_{F}~U_n = \frac{1}{{n \choose a}} \sum_{c=1}^{a} {a \choose c}{n-a \choose a-c} \sigma^{2}_c.\]

Note that for large n,

    \[{n-a \choose k} = \frac{(n-a)!}{k! (n-a-k)!} = \frac{1}{k!} (n-a)(n-a-1)…(n-a-k+1) \sim \frac{n^k}{k!}.\]

Then, we can rewrite \text{Var}_{F}~U_n as,

    \[\text{Var}_F ~ U_n = \frac{a! (n-a)!}{n!} \sum_{c=1}^{a} \frac{a!}{c!(a-c)!} \frac{n^{a-c}}{(a-c)!} \sigma^{2}_c.\]

Expanding out the first few terms,

    \[\text{Var}_F ~ U_n = \frac{a! (n-a)!}{n!} \left[ \frac{a!}{(a-1)!} \frac{n^{a-1}}{(a-1)!} \sigma^{2}_1 + \frac{a!}{2!(a-2)!} \frac{n^{a-2}}{(a-2)!} \sigma^{2}_2 + … \right].\]

Since a!/(a-1)! = a and n! / (n-a)! \sim n^{a}, simplifying yields

    \[\text{Var}_F ~ U_n = \frac{a^2}{n} \sigma^{2}_1 + \frac{a^2 (a-1)^2}{2 n^2} \sigma^{2}_2 + … ~.\]

The first term of the variance dominates and we have,

    \[\text{Var}_F \approx \frac{a^2}{n} \sigma^{2}_1\]

where

    \[\sigma^{2}_1 = \text{Var}_F~ \phi_1(X_1) = \text{Var}_F \Big[ \mathbb{E}_F \left[ \phi(X_1, …, X_a) | X_1 \right]\Big].\]

So, now we know that \mathbb{E}_F~U_n = \theta and \text{Var}_F~U_n = \frac{a^2}{n} \sigma^{2}_1, but we want to know if

    \[ \sqrt{n} (U_n - \theta) \rightarrow N \left(0, a^2 \sigma^{2}_1 \right).\]

Again, we cannot use the central limit theorem (CLT) to show that U_n is asymptotically normal since it is the sum of dependent terms. However, we can:

  1. Construct a surrogate U^{*}_n for which the CLT does apply, and
  2. Demonstrate that U_n converges to the same distribution as U^{*}_n.

Selection and distribution of a surrogate

What should we choose as the surrogate? Well, we want something that will have the same mean and variance as U_n - \theta. The variance only involves \sigma^{2}_1 which suggests conditioning on a single X_i. Thus, as a surrogate, lets consider

    \[U_n^{*} = \sum_{i=1}^{n} \mathbb{E}_F \left[ U_n - \theta | X_i \right]\]

Let’s start by expanding the summand,

    \[ \mathbb{E}_F \left[U_n - \theta | X_i \right] = \frac{1}{{n \choose a}} \mathop{\sum … \sum} \limits_{1 \leq i_1 < ... < i_a \leq n} \Big( \mathbb{E}_F \left[\phi(X_{i_1}, ..., X_{i_a})|X_i\right] - \theta \Big)\]

If X_i \in (X_{i1}, …, X_{i_a}) then c = 1 and,

    \[\mathbb{E}_F [\phi(X_{i_1}, …, X_{i_a}|X_i] - \theta = \phi_1(X_i) - \theta\]

else if X_i \notin (X_{i1}, …, X_{i_a}) then c = 0 and,

    \[\mathbb{E}_F [\phi(X_{i_1}, …, X_{i_a})|X_i] - \theta = \mathbb{E}_F [\phi(X_{i_1}, …, X_{i_a})] - \theta=0.\]

Of the {n \choose a} terms, how many include X_i? If i \in (i_1, …, i_a), there are n-1 possible choices for the remaining subscripts, of which we require a-1. Thus we have,

    \[U_n^{*} = \frac{{n-1 \choose a-1}}{{n \choose a}} \sum_{i=1}^{n} \Big[ \phi_1(X_i) - \theta \Big]= \frac{a}{n} \sum_{i=1}^{n} \Big[ \phi_1(X_i) - \theta \Big].\]

The \phi_1(X_i) are independent and identically distributed with mean \theta and variance \sigma^{2}_1. The expectation of U_n^{*} with respect to F is,

    \[\mathbb{E}_F~U_n^{*} = a ~\mathbb{E}_F \Big[ \phi_1(X_i) - \theta \Big] = 0\]

and the corresponding variance is,

    \[\text{Var}_F~ U_n^{*} = \frac{a^2}{n^2} \sum_{i=1}^{n} \text{Var}_F~ \phi_1(X_i) = \frac{a^2}{n} \sigma_1^{2}. \]

Finally, U_n^{*} is the sum of independent and identically distributed random variables and so the central limit theorem tells us,

    \[ \sqrt{n} U_n^{*} \rightarrow N(0, a^2 \sigma_1^2).\]

Convergence to surrogate

We can express our quantity of interest \sqrt{n} (U_n - \theta) in terms of our surrogate U^{*}_n as,

    \[ \sqrt{n} (U_n - \theta) = \sqrt{n} U_{n}^{*} + \sqrt{n} (U_n - \theta - U_n^{*}). \]

Thus if we can show that,

    \[ \sqrt{n} (U_n - \theta - U_n^{*}) \stackrel{P}{\rightarrow} 0 \]

we have our desired result! To prove this, it is sufficient to demonstrate

    \[\mathbb{E}_F~ \left[ \sqrt{n} (U_n - \theta - U_n^{*}) \right]^2 \rightarrow 0.\]

So expanding the quadratic mean, we have

    \[\mathbb{E}_F~ \left[ \sqrt{n} (U_n - \theta - U_n^{*}) \right]^2 = n \mathbb{E}_F~ {U_n^{*}}^{2} - 2n \mathbb{E}_F \Big[(U_n - \theta) U_n^{*} \Big] + n \mathbb{E}_F~ (U_n - \theta)^2.\]

Since \mathbb{E}_F~ U_n^{*} = \mathbb{E}_F~ (U_n - \theta)^2, Var[X] = E[X^2] - E[X]^2, and \theta is a constant, this simplifies to

    \[\mathbb{E}_F~ \left[ \sqrt{n} (U_n - \theta - U_n^{*}) \right]^2 = n \text{Var}_F ~{U_n^{*}} - 2n \text{Cov} \Big[U_n - \theta, U_n^{*} \Big] + n \text{Var}_F~ U_n.\]

We know \text{Var}_F ~U_n^{*} and \text{Var}_F~ U_n from our earlier work, both of which equal \frac{a^2}{n}\sigma^{2}_1. We can recycle our tricks from earlier to figure out \text{Cov}[U_n-\theta, U_n^{*}].

Expanding the covariance yields,

    \[\text{Cov}[U_n, U_n^{*}] = \mathbb{E}_F [(U_n - \theta) U_n] = \frac{a}{n} \sum_{i=1}^{n} \mathbb{E}_F \Big[ \Big(U_n - \theta \Big) \Big( \phi_1(X_i) - \theta\Big) \Big].\]

Conditioning on X_i and applying the law of iterated expectation,

    \[ \text{Cov}[U_n, U_n^{*}] = \frac{a}{n} \sum_{i=1}^{n} \mathbb{E}_F \Big[ \mathbb{E}_F \left[\Big(\phi_1(X_i) - \theta\Big) \Big(U_n - \theta\Big)\Big| X_i \right]\Big] = \frac{a}{n} \sum_{i=1}^{n} \mathbb{E}_F \Big[ \Big(\phi_1(X_i) - \theta\Big)\mathbb{E}_F \left[ \Big(U_n - \theta\Big)\Big| X_i \right]\Big].\]

Plugging in our previous results yields,

    \[ \text{Cov}[U_n, U_n^{*}] = \frac{a^2}{n^2} \sum_{i=1}^{n} \mathbb{E}_F ~ \Big[\phi_1(X_i) - \theta)\Big]^2 = \frac{a^2}{n^2} \sum_{i=1}^{n} \text{Var}_F~ \phi_1(X_i) = \frac{a^2}{n} \sigma^{2}_1.\]

Finally, for large n, we have

    \[\mathbb{E}_F~ \left[ \sqrt{n} (U_n - \theta - U_n^{*}) \right]^2 = n \left( \frac{a^2}{n} \sigma^{2}_1 \right) - 2n \left( \frac{a^2}{n} \sigma^{2}_1 \right) + n \left( \frac{a^2}{n} \sigma^{2}_1 \right) = 0.\]

Huzzah! Finally we have shown that a single U-statistic U_n is asymptotically normally distributed with mean \theta = T(F) and variance \frac{a^2}{n} \sigma^{2}_1,

    \[\sqrt{n} (U_n - \theta) \rightarrow N(0, a^2 \sigma_1^{2}).\]


Click here to download this blog post as an RMarkdown (.Rmd) file!

Published by

Emma Smith

Emma Smith is a young statistician who's on a mission to convince the masses statistics is as awesome as she *knows* it is! When she's not working on expanding her knowledge of machine learning and mathematical statistics, she's busy petting cats and unsuccessfully convincing her boyfriend to let her adopt them, hiking, concocting indie and folk rock playlists, and kicking butt in roller derby.

2 thoughts on “Getting to know U: the asymptotic distribution of a single U-statistic”

Leave a Reply

Your email address will not be published. Required fields are marked *