Simplifying U-statistic variance estimation with Sen’s structural components

Sen (1960) proved that U-statistics could be decomposed into identically distributed and asymptotically uncorrelated “structural components.”

The mean of these structural components is equivalent to the U-statistic, and the variance of the structural components can be used to estimate the variance of the U-statistic, bypassing the often challenging derivation of its conditional variance components.

One-sample U-statistics

Review of known properties

Consider an independent and identically distributed sample of m random variables: X_{1}, …, X_{m} \stackrel{i.i.d.}{\sim} F. Then, a one-sample U-statistic with a symmetric kernel \phi consisting of a arguments takes the form

    \[ U = \frac{1}{{m \choose a}} \mathop{\sum \cdots \sum} \limits_{1 \leq i_1 < ... < i_{a} \leq m} \phi(X_{i_1}, ..., X_{i_a}) \]

where the sum extends over all {m \choose a} distinct combinations of a of the m random variables. U-statistic theory then tells us that

    \[ \text{Var}_{F}~U = \frac{1}{{m \choose a}} \sum_{c=1}^{a} {a \choose c} {m-a \choose a-c} \sigma^{2}_c \]

where

    \[ \sigma^{2}_{c} = \text{Var}_F \left[ \mathbb{E}_{F} \left( \phi(X_1, ..., X_a) | X_1, ..., X_c \right) \right]. \]

For large samples, the variance of U can be approximated by

    \[ \text{Var}~U \approx \frac{a^2}{m} \sigma_1^2. \]
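To see where this comes from, note that the c = 1 term of the exact variance dominates: for large m, the terms with c \geq 2 are of order m^{-2} or smaller, while

    \[ \frac{1}{{m \choose a}} {a \choose 1} {m-a \choose a-1} \sigma_1^2 = \frac{a^2 \left[ (m-a)! \right]^2}{(m-2a+1)! \; m!} \, \sigma_1^2 \approx \frac{a^2}{m} \sigma_1^2. \]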

For further information on one-sample U-statistics, including derivation of variance estimates, see: Getting to know U: the asymptotic distribution of a single U-statistic.

Commonly, a=1 or a=2. When a=1, the form of the U-statistic reduces to

    \[ U = \frac{1}{m} \sum_{i=1}^{m} \phi(X_i) \]

so that U takes the form of the mean of the kernel across the m observations and,

    \[ \text{Var} ~U = \frac{\sigma_1^{2}}{m} \]

resembles the variance of a sample mean.
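As a quick sanity check (a minimal sketch, not from the original post), taking \phi(x) = x recovers the familiar behaviour of the sample mean:

X <- rnorm(100)
U <- mean(X)        # U-statistic with kernel phi(x) = x, i.e. the sample mean
var(X) / length(X)  # estimate of Var U = sigma_1^2 / m, since sigma_1^2 = Var(phi(X))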

When a=2,

    \[ U = \frac{1}{{m \choose 2}} \mathop{\sum \sum} \limits_{1 \leq i_1 < i_{2} \leq m} \phi(X_{i_1}, X_{i_2}) \]

or equivalently,

    \[ U = \frac{1}{m(m-1)} \mathop{\sum \sum} \limits_{i_1 \neq i_{2}} \phi(X_{i_1}, X_{i_2}) \]

and,

    \[ \text{Var} ~U = \frac{2}{m(m-1)} \left[ 2(m-2)\sigma_1^2 + \sigma_2^2 \right] \approx \frac{4}{m} \sigma_1^2. \]
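For concreteness, a U-statistic with an a = 2 kernel can be computed by brute force, averaging the kernel over all {m \choose 2} pairs. A minimal sketch (u_stat is a hypothetical helper, not from any package):

# Average an a = 2 kernel over all m-choose-2 index pairs 1 <= i1 < i2 <= m
u_stat <- function(X, kernel){
  mean(combn(seq_along(X), 2, function(idx) kernel(X[idx[1]], X[idx[2]])))
}

With the variance kernel \phi(x_1, x_2) = (x_1 - x_2)^2/2 used later in this post, u_stat(X, function(x1, x2) (x1 - x2)^2/2) reproduces var(X).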

For examples of how to derive \sigma_c^2 for common one-sample U-statistics, see: One, Two, U: Examples of common one- and two-sample U-statistics.

Structural components

In 1960, Sen demonstrated that a one-sample U-statistic based on a sample of size m could be decomposed into m structural components, denoted V_i for i = 1, …, m. Sen also proved that the V_i are identically distributed and asymptotically uncorrelated.

When a=1, the i^{th} structural component is simply

    \[ V_i = \phi(x_i) \]

or the kernel evaluated at the i^{th} observation X_i.

When a=2,

    \[ V_i = \frac{1}{m-1} \sum_{\substack{i_1 = 1 \\ i_1 \neq i}}^{m} \phi(x_i, X_{i_1}) \]

or the mean of all (m-1) kernel terms involving X_i.

The mean of the m structural components is equivalent to U,

    \[ U = \frac{1}{m} \sum_{i=1}^{m} V_i = \overline{V}_m \]
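For a = 2, this is easy to verify directly, since averaging the V_i reproduces the symmetric sum over all ordered pairs:

    \[ \frac{1}{m} \sum_{i=1}^{m} V_i = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{i_1 \neq i} \phi(X_i, X_{i_1}) = \frac{1}{m(m-1)} \mathop{\sum \sum} \limits_{i_1 \neq i_2} \phi(X_{i_1}, X_{i_2}) = U. \]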

Sen also proved that the first conditional variance component \sigma_1^2 can be consistently estimated as the sample variance of the components,

    \[ s_{m}^2 = \frac{1}{m-1} \sum_{i=1}^{m} (V_i - U)^2 \]

such that

    \[ \widehat{\text{Var}}~ U = \frac{a^2}{m} s_m^{2}. \]

That is, \widehat{\text{Var}}~U = s_m^{2}/m when a=1 and \widehat{\text{Var}}~U = 4 s_m^{2}/m when a=2.
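In code, the whole recipe reduces to a couple of lines. A minimal sketch (sen_var is a hypothetical helper, not from any package):

# Sen's variance estimator: a^2 * s_m^2 / m, where s_m^2 is the sample
# variance of the structural components V and a is the number of kernel arguments
sen_var <- function(V, a){
  a^2 / length(V) * var(V)
}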

Example: Variance

As an example, consider the one-sample U-statistic for the sample variance. I used U-statistic theory to derive the limiting distribution of this estimator in a previous blog post, One, Two, U: Examples of common one- and two-sample U-statistics.

In summary, the U-statistic for the unbiased sample variance s^2 takes the form,

    \[ U = \frac{1}{{m \choose 2}} \mathop{\sum \sum} \limits_{1 \leq i < j \leq m} \frac{(X_{i} - X_{j})^2}{2} = \frac{1}{m(m-1)} \mathop{\sum \sum} \limits_{i \neq j} \frac{(X_{i} - X_{j})^2}{2}\]

where \phi(x_1, x_2) = (x_1 - x_2)^2/2 is an unbiased, symmetric kernel for \sigma^2.
Using U-statistic theory, the asymptotic distribution of this estimator U is given by

    \[ \sqrt{m} \left( U - \sigma^2 \right) \rightarrow N \left(0, \mu_4 - \sigma^4 \frac{(m-3)}{(m-1)} \right) \]

where \mu_4 is the fourth central moment. Note that this variance considers both variance components \sigma^2_1 and \sigma^2_2.

However, recall that with large m, the variance of U can be approximated using only the first conditional variance component \sigma^2_1 such that,

    \[ \text{Var}_{F} ~U \approx \frac{a^2}{m} \sigma^{2}_1 \]

where a represents the number of arguments within the kernel \phi (here a=2) and

    \[\sigma^2_1 = \frac{\mu_4 - \sigma^4}{4}\]

so that

    \[ \text{Var}_{F} ~U \approx \frac{\mu_4 - \sigma^4}{m}. \]
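The value of \sigma_1^2 follows by conditioning on the first argument: with \mu = \mathbb{E}_F X,

    \[ \phi_1(x) = \mathbb{E}_{F} \left[ \frac{(x - X_2)^2}{2} \right] = \frac{(x - \mu)^2 + \sigma^2}{2} \quad \text{and} \quad \sigma_1^2 = \text{Var}_F~\phi_1(X_1) = \frac{\text{Var}\left[(X - \mu)^2\right]}{4} = \frac{\mu_4 - \sigma^4}{4}. \]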

Following Equation (2.1) of Sen (1960), the i^{th} structural component of U is defined in this scenario as,

    \[ V_{i} = \frac{1}{m-1} \sum_{\substack{i_1 = 1 \\ i_1 \neq i}}^{m} \frac{(x_i - X_{i_1})^2}{2} \]

or the average of the (m-1) kernel terms involving X_i (since X_i is not compared with itself within U).
Then,

    \[ U = \frac{1}{m} \sum_{i=1}^{m} V_i = \frac{1}{m(m-1)} \sum_{i \neq j} \frac{(X_{i} - X_{j})^2}{2} \]

as expected!

Since U can be expressed as a mean of the structural components V_i which are asymptotically uncorrelated, it follows that the variance of U can also be estimated by considering

    \[ s_m^2 = \frac{1}{m-1} \sum_{i=1}^{m} \left(V_i - U \right)^2 \]

such that

    \[ \widehat{\text{Var}}~U \approx \frac{4}{m} s_{m}^{2}. \]

Now, let’s simulate a large dataset to explore!

set.seed(12345)

# Simulate m = 1000 Xi from N(mu = 0, sd = 10)
m <- 1000
mu <- 0
sigma <- 10
X <- rnorm(n = m, mean=mu, sd=sigma)

# Parameter values
# True value of fourth central moment of N(mu, sigma) = 3*sigma^4
mu4 <- 3*sigma^4
sigma4 <- sigma^4

# Variance approximation using sigma_1^2
true_VarU <- (mu4 - sigma4)/m

true_VarU
## [1] 20
# Estimates
U <- var(X)

Xbar <- mean(X)
Xbar4 <- 1/(m-1) * sum((X-Xbar)^4)  # sample estimate of the fourth central moment mu4

# Approximated variance using estimates
est_VarU <- (Xbar4 - U^2)/m

est_VarU
## [1] 19.62728

The true (approximate) variance of U, \text{Var}~U \approx 4 \sigma_1^2/m, is 20, and the corresponding sample estimate is 19.63.

Next, let's construct the m structural components V_i ~(i = 1, …, m) of U and check that their mean is equal to U.

kernel <- function(x1, x2){
  (x1 - x2)^2/2
}

# Structural components: V[i] is the mean of the (m-1) kernel terms involving X[i]
V <- numeric(m)
for (i in 1:m){
  V[i] <- 1/(m-1) * sum(kernel(X[i], X[-i]))
}

U == mean(V)
## [1] TRUE

Awesome, the mean of the structural components \bar{V} = 99.75 is exactly equal to the value of our U-statistic U = 99.75, as expected.
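(As an aside, exact equality between two floating-point computations is a small stroke of luck here; the tolerance-based comparison is the safer general check.)

all.equal(U, mean(V))  # compares up to numerical tolerance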

Next, let’s compare the sample variance of the structural components to the true and estimated variance of U constructed with only the first conditional variance component…

sm2 <- var(V)
est_VarV <- sm2*4/m

true_VarU; est_VarU; est_VarV
## [1] 20
## [1] 19.62728
## [1] 19.67656

Huzzah, we see that all three quantities are very similar indeed!

Two-sample U-statistics

Review of known properties

Sen (1960) also demonstrated that this result can be extended to multiple independent i.i.d. samples. We focus on two independent samples: X_{1}, …, X_{m} \stackrel{i.i.d.}{\sim} F and Y_{1}, …, Y_{n} \stackrel{i.i.d.}{\sim} G, in which case the two-sample U-statistic takes the general form,

    \[ U = \frac{1}{{m \choose a}{n \choose b}} \mathop{\sum \sum} \limits_{\substack{1 \leq i_1 < ... < i_{a} \leq m \\ 1 \leq j_1 < ... < j_b \leq n}} \phi(X_{i_1}, ..., X_{i_a}; Y_{j_1}, ..., Y_{j_b}) \]

where a and b represent the number of arguments within the kernel for each sample, respectively. The variance of the two-sample U-statistic for large samples is provided by

    \[ \text{Var}~U = \frac{a^2}{m} \sigma_{10}^2 + \frac{b^2}{n} \sigma_{01}^2. \]

For further information on two-sample U-statistics, including a sketch derivation of variance estimates, see: Much Two U About Nothing: Extension of U-statistics to multiple independent samples.

Focusing on the common scenario in which a=b=1, the two-sample U-statistic reduces to

    \[ U = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \phi(X_i; Y_j) \]

and

    \[ \text{Var}~U = \frac{\sigma_{10}^2}{m} + \frac{\sigma_{01}^{2}}{n}. \]

For examples of how to derive \sigma_{01}^2 and \sigma_{10}^2 for common two-sample U-statistics, see: One, Two, U: Examples of common one- and two-sample U-statistics.

Structural components

With two samples, structural components and their variance are constructed for each sample. Within the first sample,

    \[ V_i = \frac{1}{n} \sum_{j=1}^{n} \phi(x_i; Y_j) \]

and within the second sample,

    \[ V_j = \frac{1}{m} \sum_{i=1}^{m} \phi(X_i; y_j). \]

The mean of the structural components within either sample \overline{V}_{X} or \overline{V}_{Y} is equal to U and

    \[ \widehat{\text{Var}}~U \approx \frac{s_m^{2}}{m} + \frac{s_n^{2}}{n} \]

where s_m^{2} and s_n^{2} are the estimated variances of the structural components within the first and second sample, respectively.

Example: Mann-Whitney U-statistic when F=G

Consider the form of the Mann-Whitney U-statistic for two independent random samples of a continuous outcome so that P(X=Y) = 0,

    \[ U = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \mathbb{I}(X_i < Y_j). \]

Thus, the form of the structural components within the first sample is

    \[ V_i = \frac{1}{n} \sum_{j=1}^{n} \mathbb{I}(x_i < Y_j) \]

and within the second sample,

    \[ V_j = \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}(X_i < y_j). \]

The Mann-Whitney U-statistic is commonly used to test the null hypothesis that the distribution of the outcome within each group is the same, or H_0: F=G. In a previous blog post, it was demonstrated that under this null hypothesis,

    \[ \sigma_{10}^{2} = \sigma_{01}^{2} = \frac{1}{12} \]

and

    \[ \text{Var}~U = \frac{N}{12mn} \]

where N = m + n.
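For intuition, the 1/12 arises because the distribution function evaluated at a random draw is uniform under the null: when F = G is continuous, F(X) \sim \text{Uniform}(0, 1), so

    \[ \sigma_{10}^2 = \text{Var}_F \left[ P(Y > X \mid X) \right] = \text{Var} \left[ 1 - F(X) \right] = \frac{1}{12} \]

and likewise for \sigma_{01}^2.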

Let’s conduct another small simulation to see how the true variance compares to the estimated variance using the structural components!

m <- 500
n <- 1000
mu <- 0
sigma <- 10

# Simulate X and Y so that H0: F = G is true
X <- rnorm(m, mean = mu, sd = sigma)
Y <- rnorm(n, mean = mu, sd = sigma)

kernel <- function(xi, yj){
  ifelse(xi < yj, 1, 0)
}

# Compute kernel value for each pair of Xi and Yj
k <- matrix(NA, nrow=m, ncol=n)
for (i in 1:m){
  for (j in 1:n){
    k[i, j] <- kernel(X[i], Y[j])
  }
}

# Exact variance of U under H0; approximately N/(12mn) for large m and n
VarU <- (m+n+1)/(12*m*n)
VarU
## [1] 0.0002501667
U <- sum(k) / length(k)
Vi <- 1/n * rowSums(k)  # structural components for the X sample
Vj <- 1/m * colSums(k)  # structural components for the Y sample

mean(Vi) ; mean(Vj)
## [1] 0.501896
## [1] 0.501896
sm2 <- var(Vi)
sn2 <- var(Vj)

VarV <- sm2/m + sn2/n

VarU; VarV
## [1] 0.0002501667
## [1] 0.0002545398

Again, we see the estimated variance using the structural components is very close to the true variance!
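As a design note, the explicit double loop is easy to read but slow in R. The same kernel matrix can be built in a single vectorized call; a minimal sketch, assuming the same X, Y, and kernel as above:

# Vectorized alternative: evaluate the kernel on every (Xi, Yj) pair at once
k <- outer(X, Y, FUN = kernel)
U <- mean(k)  # identical to sum(k) / length(k) above

Since ifelse() is already vectorized, outer() simply recycles X down the rows and Y across the columns of the resulting m x n matrix.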

Concluding thoughts

The structural components introduced by Sen (1960) greatly reduce the complexity of variance estimation for U-statistics. The structural components, however, are only asymptotically uncorrelated, and performance may suffer in small samples (not investigated here). Nonparametric estimation struggles in general with small samples, as the empirical cumulative distribution function on which it rests may be badly biased (e.g., due to an "unlucky" or unrepresentative sample). Thus, these approaches may be most suitable for reasonably sized samples.

Sen, P. K. (1960). On Some Convergence Properties of U-Statistics. Calcutta Statistical Association Bulletin, 10(1-2), 1–18. doi:10.1177/0008068319600101


3 thoughts on “Simplifying U-statistic variance estimation with Sen’s structural components”

  1. Thanks very much for this post. I’ve been looking for this result for quite a while! It is not very well known, and deserves to be in every book that covers U-statistics. I’m curious how you found it?

    1. Hi Glen,

      I’m glad you found the blog post helpful. 🙂
      I’m part of a research group working on estimating nonparametric treatment effects in two-arm clinical trials.

      The effect we are interested in is equivalent to the area under the receiver operating curve, for which estimation has long been of interest to Radiology.
      Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36. doi:10.1148/radiology.143.1.7063747

      This result was employed by
      DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics, 44(3), 837. doi:10.2307/2531595
      to derive variance and covariance estimators for multiple AUCs, and seems to have been rediscovered several times in different ways!

      Cheers,
      Emma
