One, Two, U: Examples of common one- and two-sample U-statistics

My previous two blog posts revolved around derivation of the limiting distribution of U-statistics for one sample and multiple independent samples.

For derivation of the limiting distribution of a U-statistic for a single sample, check out Getting to know U: the asymptotic distribution of a single U-statistic.

For derivation of the limiting distribution of a U-statistic for multiple independent samples, check out Much Two U About Nothing: Extension of U-statistics to multiple independent samples.

The notation within these derivations can get quite complicated, and it may be unclear how to actually derive the components of the limiting distribution in practice.

In this blog post, I provide two examples each of common one-sample U-statistics (variance, Kendall’s tau) and two-sample U-statistics (difference of two means, Wilcoxon Mann-Whitney rank-sum statistic) and derive their limiting distributions using our previously developed theory.

Asymptotic distribution of U-statistics

One sample

For a single sample, $X_1, \ldots, X_n \overset{iid}{\sim} F$, the U-statistic is given by

$$U = \binom{n}{r}^{-1} \sum_{\beta \in \mathcal{B}} \phi(X_{\beta_1}, \ldots, X_{\beta_r})$$

where $\phi$ is a symmetric kernel of degree $r$, $\mathcal{B}$ is the set of all $\binom{n}{r}$ subsets of $r$ indices chosen from $\{1, \ldots, n\}$, and $U$ is unbiased for $\theta = E[\phi(X_1, \ldots, X_r)]$.

For a review of what it means for $\phi$ to be symmetric, check out U-, V-, and Dupree Statistics.

In the examples covered by this blog post, $r = 2$, so we can re-write $U$ as,

$$U = \binom{n}{2}^{-1} \sum_{i < j} \phi(X_i, X_j) = \frac{2}{n(n-1)} \sum_{i < j} \phi(X_i, X_j)$$

Alternatively, this is equivalent to,

$$U = \frac{1}{n(n-1)} \sum_{i \neq j} \phi(X_i, X_j)$$

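To make the $r = 2$ formula concrete, here is a minimal Python sketch (mine, not from the original posts; the helper name `u_stat` and the example data are made up). It averages a user-supplied symmetric kernel over all $\binom{n}{2}$ pairs; the example kernel $(x_1 - x_2)^2/2$ is the variance kernel derived later in the post.

```python
from itertools import combinations

def u_stat(x, kernel):
    """One-sample U-statistic of degree 2: average a symmetric
    kernel over all C(n, 2) unordered pairs of observations."""
    pairs = list(combinations(x, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)

# Example kernel: phi(x1, x2) = (x1 - x2)^2 / 2, which is unbiased
# for the variance, so u equals the usual sample variance S^2.
x = [1.0, 4.0, 2.0, 8.0]
u = u_stat(x, lambda a, b: (a - b) ** 2 / 2)
```
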
The limiting variance of $U$ is given by,

$$\text{Var}(U) = \binom{n}{r}^{-1} \sum_{c=1}^{r} \binom{r}{c} \binom{n-r}{r-c} \sigma_c^2$$

where

$$\sigma_c^2 = \text{Var}\big(\phi_c(X_1, \ldots, X_c)\big), \qquad \phi_c(x_1, \ldots, x_c) = E\big[\phi(x_1, \ldots, x_c, X_{c+1}, \ldots, X_r)\big]$$

or equivalently,

$$\sigma_c^2 = \text{Cov}\big(\phi(X_1, \ldots, X_r),\ \phi(X_1, \ldots, X_c, X_{r+1}, \ldots, X_{2r-c})\big)$$

Note that when $c = r$, $\sigma_r^2 = \text{Var}\big(\phi(X_1, \ldots, X_r)\big)$.

For $r = 2$, these expressions reduce to

$$\text{Var}(U) = \frac{2}{n(n-1)} \big[ 2(n-2)\,\sigma_1^2 + \sigma_2^2 \big]$$

where

$$\sigma_1^2 = \text{Var}\big( E[\phi(X_1, X_2) \mid X_1] \big)$$

and

$$\sigma_2^2 = \text{Var}\big( \phi(X_1, X_2) \big)$$

The limiting distribution of $U$ for $r = 2$ is then,

$$\sqrt{n}\,(U - \theta) \overset{d}{\to} N\big(0,\ 4\sigma_1^2\big)$$

For derivation of the limiting distribution of a U-statistic for a single sample, check out Getting to know U: the asymptotic distribution of a single U-statistic.

Two independent samples

For two independent samples denoted $X_1, \ldots, X_n \overset{iid}{\sim} F$ and $Y_1, \ldots, Y_m \overset{iid}{\sim} G$, the two-sample U-statistic is given by

$$U = \binom{n}{r}^{-1} \binom{m}{s}^{-1} \sum_{\alpha \in \mathcal{A}} \sum_{\beta \in \mathcal{B}} \phi\big(X_{\alpha_1}, \ldots, X_{\alpha_r};\ Y_{\beta_1}, \ldots, Y_{\beta_s}\big)$$

where $\phi$ is a kernel that is independently symmetric within the two blocks $(x_1, \ldots, x_r)$ and $(y_1, \ldots, y_s)$, $\mathcal{A}$ and $\mathcal{B}$ are the sets of all $\binom{n}{r}$ and $\binom{m}{s}$ subsets of indices from the two samples, and $U$ is unbiased for $\theta = E\big[\phi(X_1, \ldots, X_r;\ Y_1, \ldots, Y_s)\big]$.

In the examples covered by this blog post, $r = s = 1$, reducing the U-statistic to,

$$U = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \phi(X_i, Y_j)$$

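The $r = s = 1$ case is just an average over all cross-sample pairs. Here is a minimal sketch (mine, not from the original posts; the function name and toy data are made up) using the mean-difference kernel $\phi(x, y) = x - y$ covered later in the post.

```python
def u_stat_two_sample(x, y, kernel):
    """Two-sample U-statistic with r = s = 1: average the kernel
    over all n * m cross-sample pairs."""
    return sum(kernel(a, b) for a in x for b in y) / (len(x) * len(y))

# Example kernel: phi(x, y) = x - y, so U is the difference in sample means.
u = u_stat_two_sample([1.0, 2.0, 3.0], [4.0, 6.0], lambda a, b: a - b)
```
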
The limiting variance of $U$ is given by,

$$\text{Var}(U) \approx \frac{r^2\,\sigma_{10}^2}{n} + \frac{s^2\,\sigma_{01}^2}{m}$$

where

$$\sigma_{10}^2 = \text{Var}\big(E\big[\phi(X_1, \ldots, X_r;\ Y_1, \ldots, Y_s) \mid X_1\big]\big)$$

and

$$\sigma_{01}^2 = \text{Var}\big(E\big[\phi(X_1, \ldots, X_r;\ Y_1, \ldots, Y_s) \mid Y_1\big]\big)$$

Equivalently,

$$\sigma_{10}^2 = \text{Cov}\big(\phi(X_1, X_2, \ldots, X_r;\ Y_1, \ldots, Y_s),\ \phi(X_1, X_{r+1}, \ldots, X_{2r-1};\ Y_{s+1}, \ldots, Y_{2s})\big)$$

and

$$\sigma_{01}^2 = \text{Cov}\big(\phi(X_1, \ldots, X_r;\ Y_1, Y_2, \ldots, Y_s),\ \phi(X_{r+1}, \ldots, X_{2r};\ Y_1, Y_{s+1}, \ldots, Y_{2s-1})\big)$$

For $r = s = 1$, these expressions reduce to

$$\text{Var}(U) = \frac{\sigma_{10}^2}{n} + \frac{\sigma_{01}^2}{m}$$

where

$$\sigma_{10}^2 = \text{Var}\big(E[\phi(X_1, Y_1) \mid X_1]\big) = \text{Cov}\big(\phi(X_1, Y_1),\ \phi(X_1, Y_2)\big)$$

and

$$\sigma_{01}^2 = \text{Var}\big(E[\phi(X_1, Y_1) \mid Y_1]\big) = \text{Cov}\big(\phi(X_1, Y_1),\ \phi(X_2, Y_1)\big)$$

The limiting distribution of $U$ for $r = s = 1$ and $N = n + m$ with $n/N \to \lambda_X > 0$ and $m/N \to \lambda_Y > 0$ is then,

$$\sqrt{N}\,(U - \theta) \overset{d}{\to} N\!\left(0,\ \frac{\sigma_{10}^2}{\lambda_X} + \frac{\sigma_{01}^2}{\lambda_Y}\right)$$

For derivation of the limiting distribution of a U-statistic for multiple independent samples, check out Much Two U About Nothing: Extension of U-statistics to multiple independent samples.

Examples of one-sample U-statistics

Variance

Suppose we have an independent and identically distributed random sample of size $n$, $X_1, \ldots, X_n \overset{iid}{\sim} F$.
We wish to estimate the variance, which can be expressed as an expectation functional,

$$\theta = \sigma^2 = E[X^2] - E[X]^2$$

In order to estimate $\sigma^2$ using a U-statistic, we need to identify a kernel function that is unbiased for $\sigma^2$ and symmetric in its arguments. We start by considering,

$$h(x_1, x_2) = x_1^2 - x_1 x_2$$

$h(x_1, x_2)$ is unbiased for $\sigma^2$ since

$$E[h(X_1, X_2)] = E[X_1^2] - E[X_1]\,E[X_2] = E[X^2] - E[X]^2 = \sigma^2$$

but is not symmetric since

$$h(x_1, x_2) = x_1^2 - x_1 x_2 \neq x_2^2 - x_2 x_1 = h(x_2, x_1)$$

Thus, the corresponding symmetric kernel can be constructed as

$$\phi(x_1, \ldots, x_r) = \frac{1}{r!} \sum_{\pi \in \Pi} h\big(x_{\pi(1)}, \ldots, x_{\pi(r)}\big)$$

Here, $r$ is the number of arguments and $\Pi$ is the set of all $r!$ permutations of the arguments $(x_1, \ldots, x_r)$.

Then, the symmetric kernel which is unbiased for the variance is,

$$\phi(x_1, x_2) = \frac{1}{2}\big[h(x_1, x_2) + h(x_2, x_1)\big] = \frac{1}{2}\big(x_1^2 - x_1 x_2 + x_2^2 - x_2 x_1\big) = \frac{(x_1 - x_2)^2}{2}$$

An unbiased estimator of $\sigma^2$ is then the U-statistic,

$$U = \binom{n}{2}^{-1} \sum_{i < j} \frac{(X_i - X_j)^2}{2}$$

or equivalently,

$$U = \frac{1}{n(n-1)} \sum_{i \neq j} \frac{(X_i - X_j)^2}{2}$$

Focusing on the second form of the sum and recognizing that

$$\sum_{i \neq j} X_i^2 = (n - 1) \sum_{i=1}^{n} X_i^2$$

and,

$$\sum_{i \neq j} X_i X_j = \left(\sum_{i=1}^{n} X_i\right)^2 - \sum_{i=1}^{n} X_i^2 = n^2 \bar{X}^2 - \sum_{i=1}^{n} X_i^2$$

we have,

$$\sum_{i \neq j} \frac{(X_i - X_j)^2}{2} = \sum_{i \neq j} X_i^2 - \sum_{i \neq j} X_i X_j = n \sum_{i=1}^{n} X_i^2 - n^2 \bar{X}^2$$

Plugging this simplified expression back into our formula for $U$, we obtain

$$U = \frac{1}{n(n-1)}\left(n \sum_{i=1}^{n} X_i^2 - n^2 \bar{X}^2\right) = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n \bar{X}^2\right) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 = S^2$$

as desired.
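
We can confirm this equivalence numerically. The sketch below (mine, not from the original post) compares the pairwise U-statistic to the usual $S^2$ on simulated data:

```python
import random
from itertools import combinations

random.seed(1)
x = [random.gauss(0, 1) for _ in range(50)]
n = len(x)

# U-statistic with the symmetric kernel phi(a, b) = (a - b)^2 / 2
u = sum((a - b) ** 2 / 2 for a, b in combinations(x, 2)) / (n * (n - 1) / 2)

# The usual unbiased sample variance S^2
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

# The two agree up to floating-point error, as the algebra above shows.
```
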

It is well-known that $S^2$ is an unbiased estimator of the population variance such that,

$$E[S^2] = \sigma^2$$

but what about the variance of $S^2$? For a sample size of $n$ and $r = 2$,

$$\text{Var}(S^2) = \frac{2}{n(n-1)} \big[ 2(n-2)\,\sigma_1^2 + \sigma_2^2 \big]$$

To derive the first variance component $\sigma_1^2$, we start by taking the expectation of our kernel conditional on $X_1 = x_1$,

$$\phi_1(x_1) = E\big[\phi(x_1, X_2)\big] = \frac{1}{2} E\big[(x_1 - X_2)^2\big] = \frac{1}{2}\big(x_1^2 - 2\mu x_1 + \mu^2 + \sigma^2\big) = \frac{(x_1 - \mu)^2 + \sigma^2}{2}$$

Now, our first variance component is just equal to the variance of $\phi_1(X_1)$ and since $\sigma^2/2$ is just a constant, we have

$$\sigma_1^2 = \text{Var}\big(\phi_1(X_1)\big) = \frac{1}{4}\,\text{Var}\big((X_1 - \mu)^2\big) = \frac{1}{4}\Big(E\big[(X_1 - \mu)^4\big] - \sigma^4\Big) = \frac{\mu_4 - \sigma^4}{4}$$

where $\mu_4 = E\big[(X - \mu)^4\big]$ is the fourth central moment.

Next, recognizing that $\sigma_2^2 = \text{Var}\big(\phi(X_1, X_2)\big)$ and recycling our “add zero” trick, $X_1 - X_2 = (X_1 - \mu) - (X_2 - \mu)$, yields an expression for our second variance component $\sigma_2^2$,

$$\sigma_2^2 = E\big[\phi(X_1, X_2)^2\big] - E\big[\phi(X_1, X_2)\big]^2$$

We know that the kernel is an unbiased estimator of $\sigma^2$ by definition, so that,

$$E\big[\phi(X_1, X_2)\big]^2 = \big(\sigma^2\big)^2 = \sigma^4$$

To simplify the remaining expectation, recall that,

$$(a - b)^4 = a^4 - 4a^3 b + 6a^2 b^2 - 4a b^3 + b^4$$

and let $a = X_1 - \mu$ and $b = X_2 - \mu$. Since $a$ and $b$ are independent with mean zero, $E[a^4] = E[b^4] = \mu_4$, $E[a^2 b^2] = \sigma^4$, and the odd cross-terms vanish. Then,

$$E\big[\phi(X_1, X_2)^2\big] = \frac{1}{4} E\big[(a - b)^4\big] = \frac{1}{4}\big(\mu_4 + 6\sigma^4 + \mu_4\big) = \frac{2\mu_4 + 6\sigma^4}{4}$$

Substituting this back into our expression for $\sigma_2^2$, we have

$$\sigma_2^2 = \frac{2\mu_4 + 6\sigma^4}{4} - \sigma^4 = \frac{\mu_4 + \sigma^4}{2}$$

Finally, plugging our two variance components into our expression for $\text{Var}(S^2)$,

$$\text{Var}(S^2) = \frac{2}{n(n-1)}\left[2(n-2)\cdot\frac{\mu_4 - \sigma^4}{4} + \frac{\mu_4 + \sigma^4}{2}\right] = \frac{\mu_4}{n} - \frac{(n-3)\,\sigma^4}{n(n-1)}$$

Then, our asymptotic result for $r = 2$ tells us,

$$\sqrt{n}\,\big(S^2 - \sigma^2\big) \overset{d}{\to} N\big(0,\ 4\sigma_1^2\big) = N\big(0,\ \mu_4 - \sigma^4\big)$$

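As a sanity check (my own, not from the post), the finite-sample variance formula can be verified exactly for a toy distribution: take $X$ uniform on $\{0, 1\}$ (so $\sigma^2 = 1/4$ and $\mu_4 = 1/16$) and $n = 3$, and enumerate all $2^3$ equally likely samples with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

# Exact check of Var(S^2) = mu4 / n - (n - 3) sigma^4 / (n (n - 1))
# for X uniform on {0, 1} (a toy choice) and n = 3.
n = 3
sigma2 = Fraction(1, 4)   # Var(X)
mu4 = Fraction(1, 16)     # E[(X - 1/2)^4]

s2_values = []
for sample in product([0, 1], repeat=n):   # all 8 equally likely samples
    xbar = Fraction(sum(sample), n)
    s2_values.append(sum((xi - xbar) ** 2 for xi in sample) / (n - 1))

mean_s2 = sum(s2_values) / len(s2_values)
var_s2 = sum((v - mean_s2) ** 2 for v in s2_values) / len(s2_values)

formula = mu4 / n - (n - 3) * sigma2 ** 2 / (n * (n - 1))
# mean_s2 equals sigma2 (unbiasedness) and var_s2 equals the formula exactly.
```
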
Kendall’s Tau

Consider bivariate, continuous observations of the form

$$(X_1, Y_1), \ldots, (X_n, Y_n) \overset{iid}{\sim} F_{XY}$$

A pair of observations, $(X_i, Y_i)$ and $(X_j, Y_j)$, is considered “concordant” if

$$(X_j - X_i)(Y_j - Y_i) > 0$$

and “discordant” otherwise.

The probability that two observations are concordant is then,

$$p_c = P\big[(X_2 - X_1)(Y_2 - Y_1) > 0\big]$$

and the probability that two observations are discordant is then,

$$p_d = P\big[(X_2 - X_1)(Y_2 - Y_1) < 0\big] = 1 - p_c$$

Kendall’s Tau, denoted $\tau$, is the proportion of concordant pairs minus the proportion of discordant pairs, or the difference between $p_c$ and $p_d$ such that,

$$\tau = p_c - p_d = 2p_c - 1$$

$\tau$ ranges between $-1$ and $1$ and is used as a measure of the strength of monotone increasing/decreasing relationships, with $\tau = 0$ suggesting that $X$ and $Y$ are independent and $\tau = 1$ suggesting a perfect monotonic increasing relationship between $X$ and $Y$.

Based on our definition of $\tau$, the form of the symmetric kernel is immediately obvious,

$$\phi\big((x_1, y_1), (x_2, y_2)\big) = 2\,I\big[(x_2 - x_1)(y_2 - y_1) > 0\big] - 1$$

where $I[\cdot]$ is an indicator function taking the value $1$ when its argument is true and $0$ otherwise.

Note that

$$2\,I\big[(x_2 - x_1)(y_2 - y_1) > 0\big] - 1 = \text{sign}\big[(x_2 - x_1)(y_2 - y_1)\big]$$

and

$$\text{sign}\big[(x_2 - x_1)(y_2 - y_1)\big] = \text{sign}(x_2 - x_1)\,\text{sign}(y_2 - y_1)$$

so that, for continuous data (where ties occur with probability zero), our kernel may be re-expressed as,

$$\phi\big((x_1, y_1), (x_2, y_2)\big) = \text{sign}(x_2 - x_1)\,\text{sign}(y_2 - y_1)$$

This will come in handy later.

Now that we have identified our kernel function, we can construct our U-statistic,

$$\hat{\tau} = \binom{n}{2}^{-1} \sum_{i < j} \text{sign}(X_j - X_i)\,\text{sign}(Y_j - Y_i)$$

It is obvious that $E[\hat{\tau}] = \tau$. Once again, $r = 2$ and the variance of $\hat{\tau}$ is given by,

$$\text{Var}(\hat{\tau}) = \frac{2}{n(n-1)} \big[ 2(n-2)\,\sigma_1^2 + \sigma_2^2 \big]$$

For the purposes of demonstration and to simplify derivation of the variance components, suppose we are operating under the null hypothesis that $X$ and $Y$ are independent, or equivalently,

$$H_0: F_{XY}(x, y) = F_X(x)\,F_Y(y)$$

To find our first variance component $\sigma_1^2$, we must find the expectation of our kernel conditional on $(X_1, Y_1) = (x_1, y_1)$,

$$\phi_1(x_1, y_1) = E\big[\text{sign}(X_2 - x_1)\,\text{sign}(Y_2 - y_1)\big] = E\big[\text{sign}(X_2 - x_1)\big]\,E\big[\text{sign}(Y_2 - y_1)\big]$$

where the expectation factors because $X_2$ and $Y_2$ are independent under $H_0$. If $X_2 \sim F_X$ and $Y_2 \sim F_Y$, then $E\big[\text{sign}(X_2 - x_1)\big] = 1 - 2F_X(x_1)$ and $E\big[\text{sign}(Y_2 - y_1)\big] = 1 - 2F_Y(y_1)$, and,

$$\phi_1(x_1, y_1) = \big(1 - 2F_X(x_1)\big)\big(1 - 2F_Y(y_1)\big).$$

Then, the first variance component is given by,

$$\sigma_1^2 = \text{Var}\big(\phi_1(X_1, Y_1)\big) = \text{Var}\Big[\big(1 - 2F_X(X_1)\big)\big(1 - 2F_Y(Y_1)\big)\Big]$$

$F_X(X_1)$ and $F_Y(Y_1)$ are independent random variables distributed according to $\text{Uniform}(0, 1)$.

If $X \sim F$ then $F(X) \sim \text{Uniform}(0, 1)$. Thus, if we let $A = F_X(X_1)$ and $B = F_Y(Y_1)$, $A$ and $B$ are both distributed according to $\text{Uniform}(0, 1)$.

Since $1 - 2A$ and $1 - 2B$ are independent, applying the identity $\text{Var}(CD) = E[C^2]\,E[D^2] - E[C]^2\,E[D]^2$ for independent $C$ and $D$ yields,

$$\sigma_1^2 = E\big[(1 - 2A)^2\big]\,E\big[(1 - 2B)^2\big] - E[1 - 2A]^2\,E[1 - 2B]^2$$

Recall that if $W \sim \text{Uniform}(a, b)$,

$$E[W] = \frac{a + b}{2} \qquad \text{and} \qquad \text{Var}(W) = \frac{(b - a)^2}{12}$$

For $A \sim \text{Uniform}(0, 1)$, we have

$$E[1 - 2A] = 1 - 2E[A] = 1 - 2\left(\frac{1}{2}\right) = 0$$

and

$$E\big[(1 - 2A)^2\big] = \text{Var}(1 - 2A) + E[1 - 2A]^2 = 4\,\text{Var}(A) = \frac{4}{12} = \frac{1}{3}$$

The same is true for $1 - 2B$.

Plugging our results back into our equation for $\sigma_1^2$ yields,

$$\sigma_1^2 = \frac{1}{3} \cdot \frac{1}{3} - 0 = \frac{1}{9}$$

Next, $\sigma_2^2 = \text{Var}\big(\phi\big((X_1, Y_1), (X_2, Y_2)\big)\big)$ and,

$$\sigma_2^2 = E\big[\phi^2\big] - E[\phi]^2$$

By definition, $\phi = 2I - 1$ where $I = I\big[(X_2 - X_1)(Y_2 - Y_1) > 0\big]$, so that,

$$I \sim \text{Bernoulli}(p_c)$$

Note that since $(X_1, Y_1)$ and $(X_2, Y_2)$ are identically distributed and continuous, either $(X_2 - X_1)(Y_2 - Y_1) > 0$ or $(X_2 - X_1)(Y_2 - Y_1) < 0$, so that

$$p_c = 1 - p_d = \frac{1 + \tau}{2} = \frac{1}{2} \text{ under } H_0.$$

Then we can use the properties of the Bernoulli distribution to derive the properties of $I$ we need. That is,

$$E[I] = p_c = \frac{1}{2}$$

and

$$\text{Var}(I) = p_c(1 - p_c) = \frac{1}{4}$$

Finally, we have

$$E\big[\phi^2\big] = E\big[(2I - 1)^2\big] = 4\big(\text{Var}(I) + E[I]^2\big) - 4E[I] + 1 = 4\left(\frac{1}{4} + \frac{1}{4}\right) - 2 + 1 = 1$$

The same arguments hold for $E[\phi]$ and we obtain,

$$E[\phi] = 2E[I] - 1 = 2p_c - 1 = \tau$$

However, since $\tau = 0$ under the null hypothesis, $\sigma_2^2 = E\big[\phi^2\big] - \tau^2 = 1$.

Now that we have determined the value of $\sigma_1^2$ and $\sigma_2^2$ under the null hypothesis that $X$ and $Y$ are independent, we can plug these components into our formula for $\text{Var}(\hat{\tau})$, giving us

$$\text{Var}(\hat{\tau}) = \frac{2}{n(n-1)}\left[2(n-2)\cdot\frac{1}{9} + 1\right] = \frac{2(2n + 5)}{9\,n(n-1)}$$

Our asymptotic result for $r = 2$ tells us,

$$\sqrt{n}\,\hat{\tau} \overset{d}{\to} N\!\left(0,\ \frac{4}{9}\right)$$

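To see the estimator in action, here is a minimal sketch (mine, not from the original post; the toy data are made up) that computes $\hat{\tau}$ directly from its U-statistic definition:

```python
from itertools import combinations

def kendall_tau_hat(points):
    """U-statistic estimate of Kendall's tau: the average of
    sign((xj - xi) * (yj - yi)) over all pairs (no ties assumed)."""
    signs = [
        1.0 if (xj - xi) * (yj - yi) > 0 else -1.0
        for (xi, yi), (xj, yj) in combinations(points, 2)
    ]
    return sum(signs) / len(signs)

# 4 concordant and 2 discordant pairs out of 6, so tau-hat = 1/3.
data = [(1, 2), (2, 3), (3, 1), (4, 4)]
tau_hat = kendall_tau_hat(data)
```
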
Examples of two-sample U-statistics

Mean comparison

Suppose we have two independent random samples of size $n$ and size $m$,

$$X_1, \ldots, X_n \overset{iid}{\sim} F$$

and

$$Y_1, \ldots, Y_m \overset{iid}{\sim} G$$

We wish to compare the means of the two groups. The obvious choice for our kernel is,

$$\phi(x, y) = x - y$$

so that $\theta = E[\phi(X_1, Y_1)] = \mu_X - \mu_Y$ and our corresponding U-statistic is,

$$U = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} (X_i - Y_j) = \bar{X} - \bar{Y}$$

Based on our previous derivation of the distribution of two-sample U-statistics, we have

$$\text{Var}(U) = \frac{\sigma_{10}^2}{n} + \frac{\sigma_{01}^2}{m}$$

For the first variance component, we need to take the expectation of $\phi$ conditional on a single $X_1 = x_1$ such that,

$$\phi_{10}(x_1) = E[x_1 - Y_1] = x_1 - \mu_Y$$

Similarly, for the second variance component, we need to condition on a single $Y_1 = y_1$ such that,

$$\phi_{01}(y_1) = E[X_1 - y_1] = \mu_X - y_1$$

Since $\mu_X$ and $\mu_Y$ are just constants, it is easy to see that,

$$\sigma_{10}^2 = \text{Var}(X_1 - \mu_Y) = \text{Var}(X) = \sigma_X^2$$

and,

$$\sigma_{01}^2 = \text{Var}(\mu_X - Y_1) = \text{Var}(Y) = \sigma_Y^2$$

Finally, plugging these variance components into our formula for $\text{Var}(U)$, we obtain the variance we would expect for a comparison of two means,

$$\text{Var}\big(\bar{X} - \bar{Y}\big) = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}$$

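As an exact check (my own toy example, not from the post), we can verify $\text{Var}(\bar{X} - \bar{Y}) = \sigma_X^2/n + \sigma_Y^2/m$ by enumerating all equally likely samples from two small discrete distributions with rational arithmetic:

```python
from fractions import Fraction
from itertools import product

# Toy choice: X uniform on {0, 1} (variance 1/4), Y uniform on {0, 2}
# (variance 1), with n = m = 2; enumerate all 16 equally likely samples.
n, m = 2, 2
x_support, y_support = [0, 1], [0, 2]

u_values = []
for xs in product(x_support, repeat=n):
    for ys in product(y_support, repeat=m):
        u_values.append(Fraction(sum(xs), n) - Fraction(sum(ys), m))

mean_u = sum(u_values) / len(u_values)
var_u = sum((u - mean_u) ** 2 for u in u_values) / len(u_values)

sigma2_x = Fraction(1, 4)
sigma2_y = Fraction(1, 1)
formula = sigma2_x / n + sigma2_y / m  # = 5/8, matching var_u exactly
```
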
Wilcoxon Mann-Whitney rank-sum test

Suppose we have two independent random samples of size $n$ and size $m$,

$$X_1, \ldots, X_n \overset{iid}{\sim} F$$

and

$$Y_1, \ldots, Y_m \overset{iid}{\sim} G$$

We assume that $X$ and $Y$ are continuous so that no tied values are possible. Let $R(X_i)$ represent the full-sample ranks of the $X_i$ and $R(Y_j)$ represent the full-sample ranks of the $Y_j$.

Then, the Wilcoxon Mann-Whitney (WMW) rank-sum statistic is,

$$W = \sum_{j=1}^{m} R(Y_j)$$

which can be shown to be equivalent, up to an additive constant, to the number of pairs for which $X_i < Y_j$. That is, we can re-express the WMW statistic as,

$$W = \sum_{i=1}^{n} \sum_{j=1}^{m} I(X_i < Y_j) + \frac{m(m+1)}{2}$$

If we divide $W - m(m+1)/2$ by the total number of pairs $nm$, we obtain

$$U = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} I(X_i < Y_j)$$

which is exactly the form of a two-sample U-statistic with $r = s = 1$ and kernel,

$$\phi(x, y) = I(x < y)$$

so that $\theta = E\big[I(X_1 < Y_1)\big] = P(X < Y)$. $\theta$ is commonly referred to as the probabilistic index.
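
The relationship between $W$ and $U$ is easy to confirm numerically. Here is a small sketch (mine, not from the original post; the data are made up) computing both from scratch:

```python
def wmw_stats(x, y):
    """Rank-sum W for the y sample and the corresponding U-statistic
    (the estimated probabilistic index), assuming no ties."""
    combined = sorted(x + y)
    w = sum(combined.index(v) + 1 for v in y)  # full-sample ranks of the y's
    u = sum(a < b for a in x for b in y) / (len(x) * len(y))
    return w, u

x = [1.2, 3.4, 2.2]
y = [2.9, 4.1]
w, u = wmw_stats(x, y)
# Identity from the text: W = n * m * U + m * (m + 1) / 2
```
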

For more information on the probabilistic index for two continuous outcomes, check out The probabilistic index for two normally distributed outcomes.

Our previous work tells us that

$$\text{Var}(U) = \frac{\sigma_{10}^2}{n} + \frac{\sigma_{01}^2}{m}$$

The first variance component can be expressed as,

$$\sigma_{10}^2 = \text{Cov}\big(\phi(X_1, Y_1),\ \phi(X_1, Y_2)\big)$$

Recall that covariance can be expressed in terms of expectation as,

$$\text{Cov}(A, B) = E[AB] - E[A]\,E[B]$$

so that,

$$\sigma_{10}^2 = E\big[I(X_1 < Y_1)\,I(X_1 < Y_2)\big] - E\big[I(X_1 < Y_1)\big]\,E\big[I(X_1 < Y_2)\big]$$

By definition,

$$E\big[I(X_1 < Y_1)\big] = E\big[I(X_1 < Y_2)\big] = P(X < Y) = \theta$$

Now, notice that

$$I(X_1 < Y_1)\,I(X_1 < Y_2) = I(X_1 < Y_1,\ X_1 < Y_2)$$

so that,

$$\sigma_{10}^2 = P(X_1 < Y_1,\ X_1 < Y_2) - \theta^2$$

Following similar logic for $\sigma_{01}^2$, it should be clear that we have

$$\sigma_{01}^2 = \text{Cov}\big(\phi(X_1, Y_1),\ \phi(X_2, Y_1)\big)$$

and

$$\sigma_{01}^2 = P(X_1 < Y_1,\ X_2 < Y_1) - \theta^2$$

Under the null hypothesis $H_0: F = G$, $X$ and $Y$ have the same (continuous) distribution so that either $X_1 < Y_1$ or $Y_1 < X_1$, each with probability $\frac{1}{2}$, implying $\theta = \frac{1}{2}$ under $H_0$.

Similarly, there are 6 equally likely orderings of $X_1$, $Y_1$, and $Y_2$ under $H_0$: (1) $X_1 < Y_1 < Y_2$, (2) $X_1 < Y_2 < Y_1$, (3) $Y_1 < X_1 < Y_2$, (4) $Y_1 < Y_2 < X_1$, (5) $Y_2 < X_1 < Y_1$, and (6) $Y_2 < Y_1 < X_1$. Then,

$$P(X_1 < Y_1,\ X_1 < Y_2) = \frac{2}{6} = \frac{1}{3}$$

since only orderings (1) and (2) place $X_1$ below both $Y_1$ and $Y_2$; the same argument applied to $X_1$, $X_2$, and $Y_1$ gives $P(X_1 < Y_1,\ X_2 < Y_1) = \frac{1}{3}$.

Noting that $\theta = \frac{1}{2}$ under $H_0$, plugging these values into our expressions for $\sigma_{10}^2$ and $\sigma_{01}^2$ gives us,

$$\sigma_{10}^2 = \sigma_{01}^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}$$

Finally,

$$\text{Var}(U) = \frac{1}{12n} + \frac{1}{12m} = \frac{n + m}{12\,nm}$$

Consequently, since $W = nm\,U + \frac{m(m+1)}{2}$, we have

$$\text{Var}(W) = (nm)^2\,\text{Var}(U) = \frac{nm(n + m)}{12}$$

In summary, our multiple-sample U-statistic theory tells us that under the null hypothesis $H_0: F = G$,

$$U \overset{\cdot}{\sim} N\!\left(\frac{1}{2},\ \frac{n + m}{12\,nm}\right)$$

and

$$W \overset{\cdot}{\sim} N\!\left(\frac{nm}{2} + \frac{m(m+1)}{2},\ \frac{nm(n + m)}{12}\right)$$