# Nonparametric neighbours: U-statistic structural components and jackknife pseudo-observations for the AUC

Two of my recent blog posts focused on two different, but as we will see related, methods which essentially transform observed responses into a summary of their contribution to an estimate: structural components resulting from Sen’s (1960) decomposition of U-statistics and pseudo-observations resulting from application of the leave-one-out jackknife. As I note in this comment, I think the real value of deconstructing estimators in this way results from the use of these quantities, which in special (but common) cases are asymptotically uncorrelated and identically distributed, to: (1) simplify otherwise complex variance estimates and construct interval estimates, and (2) apply regression methods to estimators without an existing regression framework.

As discussed by Miller (1974), pseudo-observations may be treated as approximately independent and identically distributed random variables when the quantity of interest is a function of the mean or variance, and more generally, any function of a U-statistic. Several other scenarios where these methods are applicable are also outlined. Many estimators of popular “parameters” can actually be expressed as U-statistics. Thus, these methods are quite broadly applicable. A review of basic U-statistic theory and some common examples, notably the difference in means or the Wilcoxon Mann-Whitney test statistic, can be found within my blog post: One, Two, U: Examples of common one- and two-sample U-statistics.

As an example of use case (1), Delong et al. (1988) used structural components to estimate the variances and covariances of the areas under multiple, correlated receiver operator curves or multiple AUCs. Hanley and Hajian-Tilaki (1997) later referred to the methods of Delong et al. (1988) as “the cleanest and most elegant approach to variances and covariances of AUCs.” As an example of use case (2), Andersen & Pohar Perme (2010) provide a thorough summary of how pseudo-observations can be used to construct regression models for important survival parameters like survival at a single time point and the restricted mean survival time.

Now, structural components are restricted to U-statistics while pseudo-observations may be used more generally, as discussed. But, if we construct pseudo-observations for U-statistics, one of several “valid” scenarios, what is the relationship between these two quantities? Hanley and Hajian-Tilaki (1997) provide a lovely discussion of the equivalence of these two methods when applied to the area under the receiver operating characteristic curve or simply the AUC. This blog post follows their discussion, providing concrete examples of computing structural components and pseudo-observations using R, and demonstrating their equivalence in this special case.

# Area under the curve (AUC)

The area under the curve, or AUC, is a parameter of interest for many fields, including predictive modelling, machine learning, and medicine. The area under the curve is of particular interest to the field of radiology, as it provides an overall summary of discrimination or both the true positive rate (sensitivity) and the true negative rate (specificity). Hanley and McNeil (1982) provide a nice, though slightly outdated, discussion of the properties of the AUC and its use in radiology. Zou et al. (2023a) recently identified a dozen terms for this parameter across a wide range of research fields, including the Mann-Whitney probability, concordance or the c-index, and the relative effect.

I know the AUC best as a nonparametric treatment effect for two-arm randomized controlled trials. I’ve discussed this quantity previously in the limited context of continuous responses, referring to it as “the probabilistic index” in alignment with the work of Thas et al. (2012). Recent work by my research group has referred to the parameter as “the win probability,” see Zou et al. (2022) and Zou et al. (2023b), as it represents the probability that a randomly selected treatment individual has a better response than, or “wins over,” a randomly selected control individual. “Win probability” also aligns with other emerging win-based treatment effects, notably the “win ratio” discussed by Pocock et al. (2012) and the “win odds” discussed by Dong et al. (2019) and Brunner et al. (2021). Anyways, that’s enough of a literature review. But, we will proceed in this context, simply referring to this parameter as the AUC for now.

Consider a parallel, individually randomized trial with two intervention arms: experimental treatment ($i=1$) or control ($i=0$). Let represent the response of the individual within the arm where . Assume that responses in each arm are independent and identically distributed within each arm, and the two arms are independent. That is, let represent the responses in the treatment arm distributed according to distribution function . Similarly, let represent the responses in the control arm. Dropping the second subscript due to the nature of the responses, the area under the curve, or win probability,
etc. etc., is formally defined as,

Assume that a greater response corresponds to better health. Then, is equal to the probability that a randomly selected treatment individual is in better health, or has a greater response, at the end of the trial compared to a randomly selected control individual, plus half the probability the two individuals have the same response. Thus, suggests no treatment effect, or that responses in the two arms are the same, while suggests treatment benefit relative to the control. When responses are truly continuous, the probability of a “tie”, i.e. , is zero, and reduces to , as discussed in this post.

If we let represent the normalized treatment distribution function in order to permit ties,

or the average of the left- and right-continuous distributions, and , as employed by Brunner and Munzel (2000) in their solution to the nonparametric Behrens-Fisher problem. Define the control distribution function similarly. Then, we can re-express the AUC as the Riemann-Stieltjes integral,

That is, is equal to the expectation of the control distribution function with respect to the treatment distribution function.

The integral form of may help make its relationship with the area under the curve a bit more obvious. If we plot vs. , the resulting area under the curve is equal to the parameter . If both distribution functions are equal so that , the curve will fall on the dashed line, representing no treatment effect. Curves above or below the diagonal indicate benefit or harm .

The integral form of may also remind you of the form of regular or expectation functionals, which can be unbiasedly estimated using U-statistics, discussed in further detail within my blog post, U-, V-, and Dupree Statistics. Indeed, unbiased estimators of take the form of a U-statistic.

# U-statistic estimator and its structural components

A review of U-statistics for one and two samples can be found within this blog post. The U-statistic estimator for the area under the curve takes the form,

where is an indicator function taking the value when its argument is true and otherwise. Equivalently,

where is the Heaviside function, taking the value +1 when , +0.5 when , or +0 when .
Note that is indeed related to the Wilcoxon Mann-Whitney test, and thus further details on this U-statistic can be found here.

Following Simplifying U-statistic variance estimation with Sen’s structural components, the structural component of the observation within the treatment arm is provided by,

or the mean of all kernel terms, or comparisons, involving observation . Here, represents the proportion of “wins” experienced, or responses in the control arm exceeded, by treatment response . Similarly, the structural component of the observation within the control arm is provided by,

or the mean of all kernel terms, or comparisons, involving observations . Here, represents the proportion of “losses” experienced, or responses in the treatment arm subceeded, by control response . Thus, and both range between and , as does .

The estimator $\hat{\theta}$ can be recovered as the mean of the structural components in *either* arm. That is, using the treatment arm,

or, using the control arm,

Since the structural components are asymptotically uncorrelated within and between arms, we can approximate the variance of $\hat{\theta}$ as,

where $S_{1}^2$ represents the unbiased sample variance of the structural components within the treatment arm,

and $S_{0}^2$ represents the unbiased sample variance of the structural components within and the control arm. Equivalently,

Since U-statistics are unbiased and asymptotically normal, we can conduct inference in the usual way. For example, we might construct large-sample confidence intervals as,

# Jackknife estimator and its pseudo-observations

The leave-one-out jackknife deletes one observation from a sample at a time, creating many resamples. Note there are as many resamples as there are observations within the sample. Each resample is then used to obtain an estimate of the parameter of interest, referred to as a jackknife replicate. The mean of the jackknife replicates yields the jackknife estimate of the parameter of interest. When the parameter of interest is estimated in the usual way, using all sample observations, the estimate may be referred to as the full-sample estimate.

Differences between each jackknife replicate and the full sample estimate summarize the contribution of the deleted observation to the full sample estimate and are referred to as pseudo-observations. The mean of the pseudo-observations also estimates the parameter of interest, and pseudo-observations for U-statistics are approximately independent and identically distributed. As a result, the variance of the estimator may also be approximated using the sample variance of the pseudo-observations.

Here, with two samples, we delete one observation at a time from the combined sample of size . Following Resampling, the jackknife, and pseudo-observations, define the pseudo-observation corresponding to the treatment response as,

where $\hat{\theta}$ is the full-sample or U-statistic estimator defined previously and $\hat{\theta}_{(1j)}$ is the jackknife replicate resulting from the deletion of treatment observation $1j$. Similarly for the $j^{th}$ control response,

The mean of all $N$ pseudo-observations provides the jackknife estimate of $\theta$,

and their sample variance of the pseudo-observation mean estimates the variance of $\hat{\theta}_{jack}$,

Again, we can then construct large-sample interval estimators as,

# Equivalence

Unlike the structural components, the pseudo-observations are not bounded between 0 and 1, and do not feature a straight-forward interpretation. However, as demonstrated by Hanley and Hajian-Tilaki (1997), the resulting structural components and pseudo-observations for the AUC are one-to-one functions of each other.

Recall that,

Expanding this based on the explicit form of $\hat{\theta}$,

or equivalently,

Collecting like terms,

and recognizing the first term as a function of $\hat{\theta}$ and the second as a function of the structural component $Y_{1j}$,

Finally, with some simplification of the first term, it follows that,

and

Or generally,

and when the two arms are of equal size so that $N_1 = N_0 = N/2$,

Now, what about their corresponding variance estimators? Using the structural components, $\widehat{\text{Var}}(\hat{\theta}) = S_1^2/N_1 + S_0^2/N_0$, where $S_i^2$ is the sample variance of the structural components in the $i^{th}$ arm of size $N_i$. Using the pseudo-observations, $\widehat{\text{Var}}(\hat{\theta}) = S^2 / N$ where $S^2$ is the sample variance of pseudo-observations for the combined sample of size $N$.

Hanley and Hajian-Tilaki (1997) claim that the pseudovalues fluctuate twice as much around the point estimate compared to the point estimates so that should be expected to be equal to approximately or . Thus,

I’m not entirely sure where this claim comes from, perhaps the form of when sample sizes are equal? Anyhow, we can check out these results using simulation, wahoo!!

# Simulated example

For a binary response, the AUC is a function of the risk difference. That is, if , , and the treatment event or success probability is greater than that of the control , then,

Thus, to simulate an AUC , let and . We’ll start by simulating a single with two equally sized arms such that .

# Simulation parameters and helper functions -----------------------------------

# Random seed
set.seed(123)

# Sample size
N1 = N0 = 30
N = N1 + N0

# Event probabilities
Pi1 = 0.7
Pi0 = 0.42

# Heaviside function for comparing all treatment-control pairs
H <- function(X1j, X0j){
ifelse(X1j > X0j, 1, ifelse(X1j < X0j, 0, 0.5))
}


To start, we generate our desired responses from the specified Bernoulli distributions independently for each arm. Then, all pairs consisting of one treatment individual and one control individual are constructed and their responses are compared. Based on the outcome of comparison, a score is assigned based on the Heaviside function previously defined within the U-statistic section using outer.

# Simulate data independently for each arm from Binom(n=1, Pi) -----------------
X1 = rbinom(N1, 1, Pi1)
X0 = rbinom(N0, 1, Pi0)

# Construct all treatment-control comparisons, and score according to H
scores <- outer(X1, X0, H)
print(scores[1:5, 1:5])


##      [,1] [,2] [,3] [,4] [,5]
## [1,]  0.5  0.5  0.5  0.5  1.0
## [2,]  0.0  0.0  0.0  0.0  0.5
## [3,]  0.5  0.5  0.5  0.5  1.0
## [4,]  0.0  0.0  0.0  0.0  0.5
## [5,]  0.0  0.0  0.0  0.0  0.5


Each row within scores corresponds to a single treatment response while each column corresponds to a single control response. Thus, we can obtain the treatment and control structural components by taking the row and column means of the score matrix, respectively.

# U-statistic Estimation -------------------------------------------------------
# Structural components for treat and control arm, respectively
Y1 <- rowMeans(scores)
Y0 <- colMeans(scores)


Next, the U-statistic estimator of the AUC is equal to the mean of all scores and its variance is obtained using the sample variances of the structural components in each arm.

# U-statistic estimator of AUC and structural component var est.
U <- mean(scores)
S1 <- var(Y1); S0 <- var(Y0)
VarU <- var(Y1) / N1 + var(Y0) / N0

U; VarU


## [1] 0.6833


## [1] 0.003726


Using the structural components, the estimate of the AUC is 0.68 and its corresponding variance estimate is 0.0037.

We can construct the pseudo-observations using the same score matrix. Once again, the full-sample estimate is equal the mean of all scores while the jackknife replicate is equal to the mean of all scores, excluding the relevant row (treatment) or column (control).

# Jackknife Estimation ---------------------------------------------------------
# Treatment pseudo-observations
P1 <- c()
for (j in 1:N1){
P1[j] <- N * mean(scores) - (N-1) * mean(scores[-j,])
}

# Control pseudo-observations
P0 <- c()
for(j in 1:N0){
P0[j] <- N * mean(scores) - (N-1) * mean(scores[,-j])
}


Focusing on the treatment pseudo-observations, we can see that they are indeed equal to a linear transformation of the corresponding structural components.

cat('First 5 pseudo-observations:', head(P1))


## First 5 pseudo-observations: 1.022 0.005172 1.022 0.005172 0.005172 1.022


cat('First 5 transformed structural components:', (2*N-2)/(N-2) * head(Y1) - N/(N-2) * U)


## First 5 transformed structural components: 1.022 0.005172 1.022 0.005172 0.005172 1.022


Next, the jackknife estimator of the AUC is equivalent to the mean of all pseudo-observations and its variance is obtained using the sample variances of the pseudo-observations within the combined sample.

# Jackknife estimator of AUC and pseudoobs var est.
J <- mean(c(P1, P0))
S <- var(c(P1, P0))
VarJ <- var(c(P1, P0)) / N

J; VarJ


## [1] 0.6833


## [1] 0.00379


Using pseudo-observations, the estimate of the AUC is 0.68 and its corresponding variance estimate is 0.0038.

The point estimators for both methods are identical while their variance estimates differ only slightly. However, = 0.2274, = 0.0575, and = 0.0543. That is, does appear to be approximately 4 times larger than both and as noted by Hanley and Hajian-Tilaki (1997).

Now, what if we repeat our simulation times?

# Simulation repetitions
B <- 1000

# Create empty data frame to store results for multiple repetitions
results <- data.frame(matrix(ncol=7, nrow=B))
colnames(results) <- c('U', 'S1', 'S0', 'VarU', 'J', 'S', 'VarJ')


for (b in 1:B){
# Simulate data independently for each arm from Binom(n=1, Pi)
X1 = rbinom(N1, 1, Pi1)
X0 = rbinom(N0, 1, Pi0)

# Construct all treatment-control comparisons, and score according to H
scores <- outer(X1, X0, H)

# Structural components for treat and control arm, respectively
Y1 <- rowMeans(scores)
Y0 <- colMeans(scores)

# U-statistic estimator of AUC and structural component var est.
results[b, 'U'] <- mean(scores)
results[b, 'S1'] <- var(Y1)
results[b, 'S0'] <- var(Y0)
results[b, 'VarU'] <- var(Y1) / N1 + var(Y0) / N0

# Construct treatment pseudo-observations
P1 <- c()
for (j in 1:N1){
P1[j] <- N * mean(scores) - (N-1) * mean(scores[-j,])
}

# Construct control pseudo-observations
P0 <- c()
for(j in 1:N0){
P0[j] <- N * mean(scores) - (N-1) * mean(scores[,-j])
}

# Jackknife estimator of AUC and pseudoobs var est.
results[b, 'J'] <- mean(c(P1, P0))
results[b, 'S'] <- var(c(P1, P0))
results[b, 'VarJ'] <- var(c(P1, P0)) / N
}


options(digits=4)
colMeans(results)


##        U       S1       S0     VarU        J        S     VarJ
## 0.642483 0.052562 0.060823 0.003780 0.642483 0.230681 0.003845


Once again, we obtain exactly the same point estimate from both methods and very similar variance estimates. Also, .

# Conclusion

Structural components derived by Sen (1960) for U-statistics and pseudo-observations resulting from the jackknife are based on very similar ideas: deconstruct a point estimator into approximately independent and identically distributed measures of individual contribution. As a result, it’s not unnatural to expect both methods to provide similar results. In this blog post, we investigated similarities between the results of these two methods for the area under the curve, an important parameter across many diverse research areas. Following Hanley and Hajian-Tilaki (1997), we demonstrated how structural components and pseudo-observations for the AUC are equivalent quantities and discussed expected similarities in their variance estimators. Finally, we conducted a small simulation study to confirm these properties. Of particular note, results demonstrated that point estimates are the same for both methods while variance estimates differ very slightly. I’m not entirely sure which of the two variance estimates should be preferred. My gut tells me the structural component variance estimate, but only because I like that it resembles the variance of two sample means. 😉 I also feel that since responses are only identically distribution within each arm, it makes the most sense to estimate the variance within each arm and then combine.

As I noted at the start of this blog post, I think there are a lot of possibilities for these “estimate decomposition”-type methods. Here, we’ve focused on just one application: variance estimation. In the future, I hope to write about regression applications. Hopefully you find this as interesting as I do!

1. Sen, P. K. (1960). On some convergence properties of U-statistics. Calcutta Statistical Association Bulletin, 10(1-2), 1-18. doi.org/10.1177/0008068319600101
2. Miller, R. G. (1974). The jackknife-a review. Biometrika, 61(1), 1-15. doi.org/10.2307/2334280
3. DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 837-845. doi.org/10.2307/2531595
4. Hanley, J. A., & Hajian-Tilaki, K. O. (1997). Sampling variability of nonparametric estimates of the areas under receiver operating characteristic curves: an update. Academic radiology, 4(1), 49-58. doi.org/10.1016/s1076-6332(97)80161-4
5. Andersen, P. K., & Pohar Perme, M. (2010). Pseudo-observations in survival analysis. Statistical methods in medical research, 19(1), 71-99. doi.org/10.1177/0962280209105020
6. Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36. doi.org/10.1148/radiology.143.1.7063747
7. Thas, O., Neve, J. D., Clement, L., & Ottoy, J. P. (2012). Probabilistic index models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 74(4), 623-671. https://doi.org/10.1111/j.1467-9868.2011.01020.x
8. Zou, G., Zou, L., & Qiu, S. F. (2023a). Parametric and nonparametric methods for confidence intervals and sample size planning for win probability in parallel‐group randomized trials with Likert item and Likert scale data. Pharmaceutical Statistics, 22(3), 418-439. https://doi.org/10.1002/pst.2280
9. Zou, G., Zou, L., & Choi, Y. H. (2022). Distribution-free approach to the design and analysis of randomized stroke trials with the modified Rankin scale. Stroke, 53(10), 3025-3031. https://doi.org/10.1161/STROKEAHA.121.037744
10. Zou, G., Smith, E. J., Zou, L., Qiu, S. F., & Shu, D. (2023b). A rank-based approach to design and analysis of pretest-posttest randomized trials, with application to COVID-19 ordinal scale data. Contemporary Clinical Trials, 126, 107085. https://doi.org/10.1016/j.cct.2023.107085
11. Pocock, S. J., Ariti, C. A., Collier, T. J., & Wang, D. (2012). The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. European heart journal, 33(2), 176-182. https://doi.org/10.1093/eurheartj/ehr352
12. Dong, G., Hoaglin, D. C., Qiu, J., Matsouaka, R. A., Chang, Y. W., Wang, J., & Vandemeulebroecke, M. (2019). The win ratio: on interpretation and handling of ties. Statistics in Biopharmaceutical Research. https://doi.org/10.1080/19466315.2019.1575279
13. Brunner, E., Vandemeulebroecke, M., & Mütze, T. (2021). Win odds: An adaptation of the win ratio to include ties. Statistics in Medicine, 40(14), 3367-3384. https://doi.org/10.1002/sim.8967
14. Brunner, E., & Munzel, U. (2000). The nonparametric Behrens‐Fisher problem: asymptotic theory and a small‐sample approximation. Biometrical Journal, 42(1), 17-25.https://doi.org/10.1002/(SICI)1521-4036(200001)42:1%3C17::AID-BIMJ17%3E3.0.CO;2-U