## U-, V-, and Dupree statistics

To start, I apologize for this blog’s title, but I couldn’t resist referencing the Owen Wilson classic You, Me, and Dupree. Wow! The other gold-plated candidate was U-statistics and You. Please, please, hold your applause.

My previous blog post defined statistical functionals as any real-valued function of an unknown CDF, $T(F)$, and explained how plug-in estimators could be constructed by substituting the empirical cumulative distribution function (ECDF) $\hat{F}_n$ for the unknown CDF $F$. Plug-in estimators of the mean and variance were provided and used to demonstrate that plug-in estimators can be biased.

Statistical functionals that meet the following two criteria represent a special family of functionals known as expectation functionals:

1) the functional $T(F)$ is the expectation of a function with respect to the distribution function $F$; and

2) that function takes the form of a symmetric kernel.
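Written out, the two criteria can be sketched as follows. The symbols $h$ (the kernel) and $a$ (the number of arguments the kernel takes) are notation assumed here for illustration:

```latex
\begin{aligned}
T(F) &= \mathbb{E}_F\left[ h(X_1, \ldots, X_a) \right],
  \qquad X_1, \ldots, X_a \overset{\text{iid}}{\sim} F, \\
h(x_1, \ldots, x_a) &= h(x_{\pi(1)}, \ldots, x_{\pi(a)})
  \quad \text{for every permutation } \pi \text{ of } \{1, \ldots, a\}.
\end{aligned}
```

For example, the mean corresponds to $a = 1$ with $h(x) = x$, and the variance to $a = 2$ with $h(x_1, x_2) = (x_1 - x_2)^2 / 2$.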

Expectation functionals encompass many common parameters and are well-behaved. Plug-in estimators of expectation functionals, named V-statistics after von Mises, can be obtained but may be biased. It is, however, always possible to construct an unbiased estimator of an expectation functional regardless of the underlying distribution function $F$. These estimators are named U-statistics, with the “U” standing for unbiased.
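To make the contrast concrete, here is a minimal sketch for the variance, assuming the symmetric kernel $(x_1 - x_2)^2 / 2$ and a small made-up sample. The names `kernel_var`, `v_stat`, and `u_stat` are illustrative only. The V-statistic averages the kernel over all $n^2$ ordered pairs, while the U-statistic averages over the distinct pairs only.

```r
# Symmetric kernel whose expectation is the variance: h(x1, x2) = (x1 - x2)^2 / 2
kernel_var <- function(x1, x2) (x1 - x2)^2 / 2

# Hypothetical toy sample, for illustration only
set.seed(1)
x <- rnorm(10)
n <- length(x)

# Kernel evaluated on every ordered pair (i, j), including i == j
h_mat <- outer(x, x, kernel_var)

# V-statistic: average over all n^2 pairs (the plug-in estimator)
v_stat <- mean(h_mat)

# U-statistic: average over the choose(n, 2) pairs with i < j
u_stat <- mean(h_mat[upper.tri(h_mat)])

# Compare with R's built-in (unbiased) sample variance
c(V = v_stat, U = u_stat, var = var(x))
```

With this kernel the U-statistic matches `var(x)`, while the V-statistic equals `(n - 1) / n * var(x)`, so it underestimates the variance on average.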

This blog post provides 1) the definitions of symmetric kernels and expectation functionals; 2) an overview of plug-in estimators of expectation functionals, or V-statistics; and 3) an overview of unbiased estimators of expectation functionals, or U-statistics.

## The Probabilistic Index for Two Normally Distributed Outcomes

Consider a two-armed study comparing a placebo and a treatment, and let $X$ and $Y$ denote the outcome of a subject in the placebo and treatment group, respectively. In general, the probabilistic index (PI) is defined as

$$\text{PI} = P(X < Y) + \frac{1}{2} P(X = Y)$$

and is interpreted as the probability that a subject in the treatment group will have an increased response compared to a subject in the placebo group. The probabilistic index is a particularly useful effect measure for ordinal data, where effects can be difficult to define and interpret owing to the absence of a meaningful difference. However, it can also be used for continuous data: when the outcome is continuous, $P(X = Y) = 0$ and the PI reduces to $P(X < Y)$.

$\text{PI} = 0.5$ suggests an increased outcome is equally likely for subjects in the placebo and treatment groups, while $\text{PI} > 0.5$ suggests an increased outcome is more likely for subjects in the treatment group than in the placebo group, and the opposite is true when $\text{PI} < 0.5$.
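As a rough illustration of the definition, the PI can be estimated by comparing every placebo subject with every treatment subject and counting ties as one half. The function name `prob_index` and the ordinal scores below are hypothetical, chosen only to show how the tie term enters.

```r
# Empirical probabilistic index: share of (placebo, treatment) pairs with x < y,
# plus half the share of tied pairs
prob_index <- function(x, y) {
  mean(outer(x, y, "<")) + 0.5 * mean(outer(x, y, "=="))
}

# Hypothetical ordinal scores (1 = worst, 5 = best)
placebo   <- c(1, 2, 2, 3)
treatment <- c(2, 3, 4, 5)

# 12 of the 16 pairs favour treatment and 3 are tied: 12/16 + 0.5 * 3/16
prob_index(placebo, treatment)
# → 0.84375
```

Swapping the arguments gives the complementary value, since every pair either favours one group or is tied.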

## Simulation

Suppose $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ represent the independent outcomes in the placebo and treatment groups, respectively, and that an increased value of the outcome is the desired response.

We simulate $n_X = n_Y = 50$ observations from each group with $\mu_X = 5$ and $\mu_Y = 7$, so that treatment truly increases the outcome, and with equal variances within each group, $\sigma_X^2 = \sigma_Y^2 = 1$.
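Under this normal model the true PI has a closed form, which gives a benchmark for the simulation. Because $X - Y \sim N(\mu_X - \mu_Y, \sigma_X^2 + \sigma_Y^2)$ for independent normal outcomes, $P(X < Y) = \Phi\left((\mu_Y - \mu_X) / \sqrt{\sigma_X^2 + \sigma_Y^2}\right)$. A minimal sketch, where `true_pi` is a name introduced here for illustration:

```r
# Simulation settings: means 5 and 7, unit standard deviations
mu_X <- 5; mu_Y <- 7
sigma_X <- 1; sigma_Y <- 1

# P(X < Y) = P(X - Y < 0), with X - Y normal with mean mu_X - mu_Y
# and variance sigma_X^2 + sigma_Y^2
true_pi <- pnorm((mu_Y - mu_X) / sqrt(sigma_X^2 + sigma_Y^2))
true_pi
```

With these settings the true PI is roughly 0.92.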

```r
# Loading required libraries
library(tidyverse)
library(gridExtra)

# Setting seed for reproducibility
set.seed(12345)

# Simulating data
n_X = n_Y = 50
sigma_X = sigma_Y = 1
mu_X = 5; mu_Y = 7

outcome_X = rnorm(n = n_X, mean = mu_X, sd = sigma_X)
outcome_Y = rnorm(n = n_Y, mean = mu_Y, sd = sigma_Y)

df <- data.frame(Group = c(rep('Placebo', n_X), rep('Treatment', n_Y)),
                 Outcome = c(outcome_X, outcome_Y))
```


Examining side-by-side histograms and boxplots of the outcomes within each group, there appears to be strong evidence that treatment increases the outcome as desired. Thus, we would expect a probabilistic index close to 1 as most outcomes in the treatment group appear larger than those of the placebo group.

```r
# Histogram by group
hist_p <- df %>%
  ggplot(aes(x = Outcome, fill = Group)) +
  geom_histogram(position = 'identity', alpha = 0.75, bins = 10) +
  theme_bw() +
  labs(y = 'Frequency')

# Boxplot by group (horizontal, so the vertical axis is not a frequency)
box_p <- df %>%
  ggplot(aes(x = Outcome, fill = Group)) +
  geom_boxplot() +
  theme_bw() +
  labs(y = NULL)

# Combine plots, histogram above boxplot
grid.arrange(hist_p, box_p, nrow = 2)
```
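To put a number on the visual impression from the plots, the PI can be estimated as the proportion of (placebo, treatment) pairs in which the treatment outcome is larger. This is a sketch rather than a definitive implementation: it redraws the data with the same seed and settings so that it runs on its own, and `pi_hat` is a name introduced here for illustration. Ties are ignored because they occur with probability zero for continuous outcomes.

```r
# Redraw the simulated outcomes with the same seed and settings as above
set.seed(12345)
n_X <- 50; n_Y <- 50
outcome_X <- rnorm(n = n_X, mean = 5, sd = 1)
outcome_Y <- rnorm(n = n_Y, mean = 7, sd = 1)

# Compare all n_X * n_Y pairs and take the share with placebo < treatment
pi_hat <- mean(outer(outcome_X, outcome_Y, "<"))
pi_hat
```

The same quantity can be recovered from the Mann-Whitney statistic reported by `wilcox.test(outcome_Y, outcome_X)`, divided by `n_X * n_Y`.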