U-, V-, and Dupree statistics

To start, I apologize for this blog’s title, but I couldn’t resist referencing the Owen Wilson classic You, Me and Dupree – wow! The other gold-plated candidate was U-statistics and You. Please, please, hold your applause.

My previous blog post defined statistical functionals as any real-valued function of an unknown CDF, T(F), and explained how plug-in estimators could be constructed by substituting the empirical cumulative distribution function (ECDF) \hat{F}_{n} for the unknown CDF F. Plug-in estimators of the mean and variance were provided and used to demonstrate plug-in estimators’ potential to be biased.

    \[ \hat{\mu} = \mathbb{E}_{\hat{F}_n}[X] = \sum_{i=1}^{n} X_i P(X = X_i) = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}_{n} \]

    \[ \hat{\sigma}^{2} = \mathbb{E}_{\hat{F}_{n}}[(X- \mathbb{E}_{\hat{F}_n}[X])^2] = \mathbb{E}_{\hat{F}_n}[(X - \bar{X}_{n})^2] = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_{n})^2. \]
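These two plug-in estimators are easy to sanity-check numerically. Here is a quick sketch (in Python rather than the post’s R, with variable names of my own choosing) confirming that they coincide with the sample mean and the divide-by-n sample variance:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=3.0, size=50)

# Plug-in estimator of the mean: the sample mean
mu_hat = x.sum() / len(x)

# Plug-in estimator of the variance: note the divide-by-n, not n - 1
sigma2_hat = ((x - mu_hat) ** 2).sum() / len(x)

assert np.isclose(mu_hat, np.mean(x))
assert np.isclose(sigma2_hat, np.var(x))  # np.var defaults to ddof=0
```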

Statistical functionals that meet the following two criteria represent a special family of functionals known as expectation functionals:

1) T(F) is the expectation of a function g with respect to the distribution function F; and

    \[ T(F) = \mathbb{E}_{F} ~g(X)\]

2) the function g(\cdot) takes the form of a symmetric kernel.

Expectation functionals encompass many common parameters and are well-behaved. Plug-in estimators of expectation functionals, named V-statistics after von Mises, can be obtained but may be biased. It is, however, always possible to construct an unbiased estimator of expectation functionals regardless of the underlying distribution function F. These estimators are named U-statistics, with the “U” standing for unbiased.

This blog post provides 1) the definitions of symmetric kernels and expectation functionals; 2) an overview of plug-in estimators of expectation functionals or V-statistics; 3) an overview of unbiased estimators for expectation functionals or U-statistics.

Kernels, degree, and symmetry

Consider a real-valued function \phi(x_1, …, x_a) such that for random variables X_1, …, X_n \stackrel{i.i.d}{\sim} F,

    \[ \theta = T(F) = \mathbb{E}_{F} ~\phi(X_1, \ldots, X_a) ~,~ 1 \leq a \leq n . \]

\phi(x_1, …, x_a) is referred to as a kernel of degree a and is, by definition, an unbiased estimator of \theta. \phi is said to be symmetric in its arguments, or simply symmetric, if it is invariant to permutations of its arguments x_1, …, x_a. For example, a kernel of degree 2 is symmetric if \phi(x_1, x_2) = \phi(x_2, x_1).

If \phi is not symmetric, a symmetric function \phi^{*} can always be constructed from it. Because the X_i’s are identically distributed, they may be considered “exchangeable” such that for any permutation \pi = \{\pi(1), …, \pi(a)\} of the a random variables,

    \[ \mathbb{E}_{F}~\phi(X_1, \ldots, X_a) = \mathbb{E}_{F}~\phi\left(X_{\pi(1)}, \ldots, X_{\pi(a)}\right) . \]

There are a! possible permutations of \phi’s a arguments. Then, since \phi is an unbiased estimator of \theta, the average of \phi across all a! permutations of its arguments,

    \[ \phi^{*}(x_1, \ldots, x_a) = \frac{1}{a!} \sum_{\pi \in \Pi} \phi\left(x_{\pi(1)}, \ldots, x_{\pi(a)}\right) \]

is both an unbiased estimator for \theta and symmetric in its arguments such that

    \[ \mathbb{E}_{F} ~\phi(X_1, \ldots, X_a) = \mathbb{E}_{F} ~\phi^{*}(X_1, \ldots, X_a) = \theta. \]

Thus, without loss of generality, the kernel \phi may always be assumed to be symmetric.

As an example, \phi(x_1, x_2) = x_1 x_2 is a symmetric kernel of degree 2 since \phi(x_2, x_1) = x_2 x_1 = x_1 x_2 = \phi(x_1, x_2) and \mathbb{E}_{F} (X_1X_2) = \mu^2.

On the other hand, \phi(x_1, x_2) = x_{1}^{2} - x_1 x_2 is an unbiased estimator of \sigma^2 such that

    \[\mathbb{E}_{F}(X_1^2 - X_1 X_2) = \mathbb{E}_{F}(X_1^2) - \mu^2 = \sigma^2\]

but it is not symmetric since, in general, x_1^2 - x_1 x_2 \neq x_2^2 - x_2 x_1. The corresponding symmetric kernel of degree 2, \phi^{*}(x_1, x_2), is then

    \[ \phi^{*}(x_1, x_2) = \frac{1}{2} \left[\phi (x_1, x_2) + \phi(x_2, x_1)\right] = \frac{x_1^2 - 2x_1 x_2 + x_2^2}{2} = \frac{(x_1 - x_2)^2}{2}. \]
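This averaging recipe is mechanical enough to automate. Below is a small Python sketch (the helper symmetrize is my own invention, not from the post) that symmetrizes an arbitrary kernel by averaging over all a! argument orders, then checks that the asymmetric variance kernel above is mapped to (x_1 - x_2)^2 / 2:

```python
from itertools import permutations

import numpy as np

def symmetrize(phi, a):
    """Return the kernel averaged over all a! orderings of its arguments."""
    def phi_star(*args):
        return np.mean([phi(*(args[i] for i in p))
                        for p in permutations(range(a))])
    return phi_star

# Asymmetric kernel of degree 2 that is unbiased for the variance
phi = lambda x1, x2: x1 ** 2 - x1 * x2
phi_star = symmetrize(phi, 2)

# phi_star(x1, x2) matches (x1 - x2)^2 / 2 and is symmetric
assert np.isclose(phi_star(1.0, 4.0), (1.0 - 4.0) ** 2 / 2)
assert np.isclose(phi_star(1.0, 4.0), phi_star(4.0, 1.0))
```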

Expectation functionals

Statistical functionals that can be expressed as the expectation of a symmetric kernel of degree a with respect to F,

    \[ T(F) = \mathbb{E}_{F} ~\phi(X_1, \ldots, X_a) \]

represent a special family known as expectation functionals or regular functionals. For a refresher on what \mathbb{E}_F means, refer to my previous blog post on plug-in estimators!

Common examples of expectation functionals include moments,

    \[\mu_{r} = \mathbb{E}_{F}~X^{r} = \int x^{r} ~dF(x) \]

variance,

    \[\sigma^2 = \int \int \frac{(x_1 - x_2)^2}{2} ~dF(x_1)~dF(x_2)\]

and covariance,

    \[\sigma_{XY} = \int \int \frac{(x_1 - x_2)(y_1 - y_2)}{2} ~dF(x_1, y_1) ~dF(x_2, y_2).\]

When a = 1, expectation functionals may also be referred to as linear functionals since, for any mixture F_\alpha = \alpha F_1 + (1-\alpha) F_2, ~ \alpha \in (0,1),

    \[ T (F_\alpha) = \int g(x) ~d\{\alpha F_1(x) + (1-\alpha) F_2(x) \} = \alpha \int g(x) ~dF_1(x) + (1-\alpha) \int g(x) ~dF_2(x) \]

    \[T(F_\alpha) = \alpha~ T(F_1) + (1-\alpha)~ T(F_2).\]


V-statistics are plug-in estimators of expectation functionals such that for X_1, …, X_n \stackrel{i.i.d}{\sim} F,

    \[ V_n = T(\hat{F}_n) = \mathbb{E}_{\hat{F}_{n}} \phi(X_1, \ldots, X_a) ~,~a \leq n. \]

As noted previously, the empirical cumulative distribution function \hat{F}_n(x) defined as

    \[ \hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i \leq x)\]

is a valid distribution function which assigns mass \frac{1}{n} to each X_i. That is, for a random variable Y \sim \hat{F}_{n},

    \[ P(Y = X_1) = P(Y = X_2) = \ldots = P(Y = X_n) = \frac{1}{n}.\]

Now, consider a such random variables Y_1, …, Y_a \stackrel{i.i.d}{\sim} \hat{F}_{n}. Then, there are a total of n^{a} equally-likely realizations of (Y_1, …, Y_a). For example, all Y_i could equal X_1, all Y_i could equal X_n, or anything in between such as (Y_1 = X_3, Y_2 = X_n, Y_3 = X_n, …, Y_a = X_4).

The Y_i are independent and thus the plug-in estimator of an expectation functional is equal to

    \[ V_n = \mathbb{E}_{\hat{F}_{n}} ~\phi(Y_1, \ldots, Y_a) = \int \cdots \int \phi(y_1, \ldots, y_a) ~d\hat{F}_{n}(y_1)~\cdots~ d\hat{F}_{n}(y_a). \]

Since \hat{F}_n is discrete and the support of each Y_i is the random sample X_1, …, X_n, the plug-in estimator V_n can be explicitly expressed as,

    \[ V_n = \frac{1}{n^a} \sum_{i_1 = 1}^{n} \sum_{i_2 = 1}^{n} \cdots \sum_{i_a = 1}^{n} \phi(X_{i_1}, X_{i_2}, \ldots, X_{i_a}) \]

so that V_n is the sample average of \phi evaluated at all n^a possible realizations of (Y_1, …, Y_a), or equivalently, the sample average of \phi evaluated at all n^{a} permutations of size a from X_1, …, X_n.
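The n^a-term sum is straightforward, if expensive, to compute directly with itertools.product. In the sketch below (the function name is my own), plugging in the kernel \phi(x_1, x_2) = x_1 x_2 from earlier yields V_n = \bar{X}_n^2, the plug-in estimator of \mu^2:

```python
from itertools import product

import numpy as np

def v_statistic(phi, x, a):
    """Average phi over all n^a index tuples, repeats allowed."""
    n = len(x)
    return sum(phi(*(x[i] for i in idx))
               for idx in product(range(n), repeat=a)) / n ** a

rng = np.random.default_rng(0)
x = rng.normal(size=30)

# Symmetric kernel of degree 2 for mu^2: phi(x1, x2) = x1 * x2
v_n = v_statistic(lambda x1, x2: x1 * x2, x, a=2)

# V_n reduces to the square of the sample mean
assert np.isclose(v_n, np.mean(x) ** 2)
```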

Bias of V-statistics

When a = 1, the V-statistic takes the form of a traditional sample mean,

    \[ V_n = \frac{1}{n} \sum_{i_1 = 1}^{n} \phi(X_{i_1}) \]

which is unbiased, and its asymptotic distribution is provided by the central limit theorem,

    \[ \sqrt{n} (V_n - T(F)) \rightarrow N(0, \mathrm{Var}_{F}~\phi(X)). \]

Now, consider the form of V_n when a = 2,

    \[ V_n = \frac{1}{n^2} \sum_{i_1 = 1}^{n} \sum_{i_2 = 1}^{n} \phi(X_{i_1}, X_{i_2}).\]

Notice that the sum contains terms for which i_1 = i_2. We can expand the sum to make these terms explicit,

    \[V_n = \frac{1}{n^2} \left\{ \mathop{\sum\sum}\limits_{i_1 \neq i_2} \phi(X_{i_1}, X_{i_2}) + \sum_{i_1 = 1}^{n} \phi(X_{i_1}, X_{i_1})\right\}.\]

There are n(n-1) terms for which i_1 \neq i_2 and n terms for which i_1 = i_2. Taking the expectation of V_n with respect to F yields,

    \[\mathbb{E}_{F} V_n = \frac{n-1}{n} \mathbb{E}_{F} ~\phi(X_{i_1}, X_{i_2}) + \frac{1}{n} \mathbb{E}_{F} ~\phi(X_{i_1}, X_{i_1}).\]

Since X_{i_1} and X_{i_2} are independent when i_1 \neq i_2, \mathbb{E}_{F} ~\phi(X_{i_1}, X_{i_2}) = T(F) by definition such that,

    \[\mathbb{E}_{F}~ V_n = \frac{n-1}{n} T(F) + \frac{1}{n} \mathbb{E}_{F}~ \phi(X_{i_1}, X_{i_1}).\]

Clearly \mathbb{E}_{F}~ V_n \neq T(F) in general, and thus V_n is a biased estimator of T(F). V_n is generally biased for a > 1 because subscript duplication (i_{j} = i_{k}) introduces dependence within its summands. Note, however, that the bias of V_n approaches 0 as n \rightarrow \infty.

As an example, consider the plug-in estimator for the variance using the symmetric kernel we defined above,

    \[ V_n = \frac{1}{n^2} \sum_{i_1 = 1}^{n} \sum_{i_2 = 1}^{n} \frac{(X_{i_1} - X_{i_2})^2}{2}. \]

Expanding the square and splitting the sum into i_1 \neq i_2 and i_1 = i_2,

    \[ V_n = \frac{1}{n^2} \left\{ \mathop{\sum\sum}\limits_{i_1 \neq i_2} \frac{X_{i_1}^2 - 2 X_{i_1} X_{i_2} + X_{i_2}^2}{2} + \sum_{i_1 = 1}^{n} \frac{2 X_{i_1}^2 - 2 X_{i_1}^2}{2}\right\} . \]

The second sum is clearly zero and we are left with,

    \[ V_n = \frac{1}{n^2} \mathop{\sum\sum}\limits_{i_1 \neq i_2} \frac{X_{i_1}^2 - 2 X_{i_1} X_{i_2} + X_{i_2}^2}{2}. \]

Now, taking the expectation of V_n with respect to F,

    \[ \mathbb{E}_{F}~ V_n = \frac{1}{n^2} \mathop{\sum\sum}\limits_{i_1 \neq i_2} \frac{ \mathbb{E}_F [X_{i_1}^2] - 2 \mathbb{E}_F [X_{i_1}]~ \mathbb{E}_F[X_{i_2}] + \mathbb{E}_F [X_{i_2}^2]}{2}. \]

Since the X_i are identically distributed, let’s drop the subscripts and aggregate terms recalling that there are n(n-1) terms for which i_1 \neq i_2,

    \[ \mathbb{E}_{F}~ V_n = \frac{n(n-1)}{n^2} \frac{ 2\mathbb{E}_F [X^2] - 2 \mathbb{E}_F [X]^2}{2}. \]

Recalling that \sigma^2 = E[X^2] - E[X]^2, simplifying leaves us with

    \[ \mathbb{E}_F ~V_n = \frac{(n-1)}{n} \sigma^2 \]

which takes the general form of \mathbb{E}_F~ V_n we derived previously.
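The derivation can be confirmed on actual numbers. In fact, for the variance kernel the identity holds sample by sample, not just in expectation: the brute-force V-statistic coincides exactly with the divide-by-n sample variance, i.e. (n-1)/n times the unbiased estimator. A quick Python check (not from the original post):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=25)
n = len(x)

# Brute-force V-statistic for the variance kernel over all n^2 pairs
v_n = np.mean([(xi - xj) ** 2 / 2 for xi in x for xj in x])

# It equals the biased, divide-by-n sample variance...
assert np.isclose(v_n, np.var(x))  # np.var defaults to ddof=0
# ...which is (n - 1)/n times the unbiased estimator
assert np.isclose(v_n, (n - 1) / n * np.var(x, ddof=1))
```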


U-statistics

Since V_n is biased as a result of subscript duplication, a natural solution is to restrict the sums to distinct combinations of the subscripts (i_1, …, i_a). Because the kernel \phi is symmetric, we can further require that 1 \leq i_1 < … < i_a \leq n, leaving {n \choose a} subscript combinations to consider. Let \mathcal{C} represent the set of all {n \choose a} subscript combinations. Then, the resulting estimator of T(F) is the U-statistic

    \[ U_n = \frac{1}{{n \choose a}} \sum_{c ~\in~ \mathcal{C}} \phi (X_{i_1}, \ldots, X_{i_a}) \]

or equivalently,

    \[ U_n = \frac{1}{{n \choose a}} \mathop{\sum\ldots\sum}\limits_{1 \leq i_1 < \ldots < i_a \leq n} \phi (X_{i_1}, \ldots, X_{i_a}). \]

Now that all subscripts within the summands are distinct, we have

    \[ \mathbb{E}_F~ U_n = \mathbb{E}_F~ \phi (X_{i_1}, \ldots, X_{i_a}) = T(F) \]

so that U_n is unbiased for T(F), hence the name!

Returning to the variance example, the corresponding U-statistic is

    \[ U_n = \frac{1}{{n \choose 2}} \sum_{c ~\in~ \mathcal{C}} \frac{(X_{i_1} - X_{i_2})^2}{2} = \frac{1}{{n \choose 2}} \sum_{c ~\in~ \mathcal{C}} \frac{X_{i_1}^2 - 2 X_{i_1} X_{i_2} + X_{i_2}^2}{2}. \]

Taking the expectation of U_n with respect to F yields,

    \[ \mathbb{E}_F ~U_n = \frac{1}{{n \choose 2}} \sum_{c ~\in~ \mathcal{C}} \frac{ \mathbb{E}_F [X_{i_1}^2] - 2 \mathbb{E}_F [X_{i_1}]~ \mathbb{E}_F[X_{i_2}] + \mathbb{E}_F [X_{i_2}^2]}{2}. \]

Since the X_i are identically distributed, dropping the subscripts and aggregating terms gives,

    \[ \mathbb{E}_F ~U_n = \mathbb{E}_F [X^2] - \mathbb{E}_F [X]^2 = \sigma^2 \]

so that the U-statistic is unbiased for the population variance as desired.
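A direct implementation makes the contrast with V_n concrete: itertools.combinations enumerates the {n \choose a} distinct subscript sets, and for the variance kernel the resulting U-statistic reproduces the familiar unbiased sample variance exactly (a Python sketch, with a function name of my own):

```python
from itertools import combinations

import numpy as np

def u_statistic(phi, x, a):
    """Average phi over all C(n, a) distinct index combinations."""
    combos = list(combinations(range(len(x)), a))
    return sum(phi(*(x[i] for i in c)) for c in combos) / len(combos)

rng = np.random.default_rng(2)
x = rng.normal(size=20)

# Symmetric variance kernel of degree 2
phi_var = lambda x1, x2: (x1 - x2) ** 2 / 2
u_n = u_statistic(phi_var, x, a=2)

# For this kernel, U_n is exactly the unbiased sample variance
assert np.isclose(u_n, np.var(x, ddof=1))
```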

It can be shown that U-statistics are asymptotically normal. However, the central limit theorem cannot be applied directly to prove this result: each X_i appears in more than one summand of U_n, making U_n a sum of dependent terms. As a result, a clever technique known as the H-projection, named after Wassily Hoeffding, is required.

Dupree statistics

D_n =

Download this blogpost as an RMarkdown (.Rmd) file!

Published by

Emma Smith

