Statisticelle • Page 3 of 6 • Girl Meets Stats

Pride and Probability

To celebrate pride month, my husband Ethan’s workplace Desire2Learn organized virtual Drag Queen BINGO hosted by the fabulous Astala Vista. Even within the confines of a Zoom meeting, Astala Vista put on a great show!

Astala Vista, previously a self-proclaimed cat *lady* of drag, now a cat *cougar* at 30, demonstrating her “roar” on Zoom!

To keep things interesting (and I’m sure to reduce the odds of winning to keep the show going), different BINGO patterns besides the traditional “5 across” BINGO were used. This included a “4 corners” BINGO and a “cover-all” BINGO. To obtain a cover-all BINGO, all numbers on a traditional 5 by 5 BINGO card must be called (noting that of the 25 spaces on the card, 1 is a free space).

Up until this point, probability had not entered the discussion between my husband and I. However, with the cover-all BINGO, Ethan began wondering how many draws it would take to call a cover-all BINGO.

I became quiet, and my husband thought he had perhaps annoyed me with all of his probability questions. In fact, I was thinking about how I could easily simulate the answer to his question (and the corresponding combinatorics answer)!

First, we need to randomly generate a BINGO card. A BINGO card features five columns, with five numbers each. The exception is the N column which features a FREE space, given to all players. The B column features the numbers 1 through 15, the I column 16 through 30, etc. The numbers in each column are drawn without replacement for each card.

Continue reading Pride and Probability

Simplifying U-statistic variance estimation with Sen’s structural components

Sen (1960) proved that U-statistics could be decomposed into identically distributed and asymptotically uncorrelated “structural components.”

The mean of these structural components is equivalent to the U-statistic and the variance of the structural components can be used to estimate the variance of the U-statistic, bypassing the need for often challenging derivation of conditional variance components.

Continue reading Simplifying U-statistic variance estimation with Sen’s structural components

One, Two, U: Examples of common one- and two-sample U-statistics

My previous two blog posts revolved around derivation of the limiting distribution of U-statistics for one sample and multiple independent samples.

For derivation of the limiting distribution of a U-statistic for a single sample, check out Getting to know U: the asymptotic distribution of a single U-statistic.

For derivation of the limiting distribution of a U-statistic for multiple independent samples, check out Much Two U About Nothing: Extension of U-statistics to multiple independent samples.

The notation within these derivations can get quite complicated and it may be a bit unclear as to how to actually derive components of the limiting distribution.

In this blog post, I provide two examples of both common one-sample U-statistics (Variance, Kendall’s Tau) and two-sample U-statistics (Difference of two means, Wilcoxon Mann-Whitney rank-sum statistic) and derive their limiting distribution using our previously developed theory.

Continue reading One, Two, U: Examples of common one- and two-sample U-statistics

Much Two U About Nothing: Extension of U-statistics to multiple independent samples

Thank you very much to the lovely Feben Alemu for pointing me in the direction of https://pungenerator.org/ as a means of ensuring we never have to go without a brilliant title! With great power comes great responsibility.

Season 2 Crying GIF by Pose FX

Review

Statistical functionals are any real-valued function of a distribution function $F$ , $\theta = T(F)$ . When $F$ is unknown, nonparametric estimation only requires that $F$ belong to a broad class of distribution functions $\mathcal{F}$ , typically subject only to mild restrictions such as continuity or existence of specific moments.

For a single independent and identically distributed random sample of size $n$ , $X_1, …, X_n \stackrel{i.i.d}{\sim} F$ , a statistical functional $\theta = T(F)$ is said to belong to the family of expectation functionals if:

$T(F)$ takes the form of an expectation of a function $\phi$ with respect to $F$ ,
$T(F) = \mathbb{E}_F~ \phi(X_1, …, X_a)$
$\phi(X_1, …, X_a)$ is a symmetric kernel of degree $a \leq n$ .

A kernel is symmetric if its arguments can be permuted without changing its value. For example, if the degree $a = 2$ , $\phi$ is symmetric if $\phi(x_1, x_2) = \phi(x_2, x_1)$ .

If $\theta = T(F)$ is an expecation functional and the class of distribution functions $\mathcal{F}$ is broad enough, an unbiased estimator of $\theta = T(F)$ can always be constructed. This estimator is known as a U-statistic and takes the form,

$U_n = \frac{1}{{n \choose a}} \mathop{\sum … \sum} \limits_{1 \leq i_1 < ... < i_a \leq n} \phi(X_{i_1}, ..., X_{i_a})$

such that $U_n$ is the average of $\phi$ evaluated at all ${n \choose a}$ distinct combinations of size $a$ from $X_1, …, X_n$ .

For more detail on expectation functionals and their estimators, check out my blog post U-, V-, and Dupree statistics.

Since each $X_i$ appears in more than one summand of $U_n$ , the central limit theorem cannot be used to derive the limiting distribution of $U_n$ as it is the sum of dependent terms. However, clever conditioning arguments can be used to show that $U_n$ is in fact asymptotically normal with mean

$\mathbb{E}_F~ U_n = \theta = T(F)$

and variance

$\text{Var}_F~U_n = \frac{a^2}{n} \sigma_1^{2}$

where

$\sigma_1^{2} = \text{Var}_F \Big[ \mathbb{E}_F [\phi(X_1, …, X_a)|X_1] \Big].$

The sketch of the proof is as follows:

Express the variance of $U_n$ in terms of the covariance of its summands,

$\text{Var}_{F}~ U_n = \frac{1}{{n \choose a}^2} \mathop{\sum \sum} \limits_{\substack{1 \leq i_1 < ... < i_{a} \leq n \\ 1 \leq j_1 < ... < j_{a} \leq n}} \text{Cov}\left[\phi(X_{i_1}, ..., X_{i_a}),~ \phi(X_{j_1}, ..., X_{j_a})\right].$

Recognize that if two terms share $c$ common elements such that,
$\text{Cov} [\phi(X_1, …, X_c, X_{c+1}, …, X_a), \phi(X_1, …, X_c, X'_{c+1}, …, X'_a)]$

conditioning on their $c$ shared elements will make the two terms independent.
For $0 \leq c \leq n$ , define
$\phi_c(X_1, …, X_c) = \mathbb{E}_F \Big[\phi(X_1, …, X_a) | X_1, …, X_c \Big]$

such that

$\mathbb{E}_F~ \phi_c(X_1, …, X_c) = \theta = T(F)$

and

$\sigma_{c}^2 = \text{Var}_{F}~ \phi_c(X_1, …, X_c).$

Note that when $c = 0$ , $\phi_0 = \theta$ and $\sigma_0^2 = 0$ , and when $c=a$ , $\phi_a = \phi(X_1, …, X_a)$ and $\sigma_a^2 = \text{Var}_F~\phi(X_1, …, X_a)$ .
Use the law of iterated expecation to demonstrate that
$\sigma^{2}_c = \text{Cov} [\phi(X_1, …, X_c, X_{c+1}, …, X_a), \phi(X_1, …, X_c, X'_{c+1}, …, X'_a)]$

and re-express $\text{Var}_{F}~U_n$ as the sum of the $\sigma_{c}^2$ ,

$\text{Var}_F~U_n = \frac{1}{{n \choose a}} \sum_{c=1}^{a} {a \choose c}{n-a \choose a-c} \sigma^{2}_c.$

Recognizing that the first variance term dominates for large $n$ , approximate $\text{Var}_F~ U_n$ as

$\text{Var}_F~U_n \sim \frac{a^2}{n} \sigma^{2}_1.$
Identify a surrogate $U^{*}_n$ that has the same mean and variance as $U_n-\theta$ but is the sum of independent terms,
$U_n^{*} = \sum_{i=1}^{n} \mathbb{E}_F [U_n - \theta|X_i]$

so that the central limit may be used to show

$\sqrt{n} U_n^{*} \rightarrow N(0, a^2 \sigma_1^2).$
Demonstrate that $U_n - \theta$ and $U_n^{*}$ converge in probability,
$\sqrt{n} \Big((U_n - \theta) - U_n^{*}\Big) \stackrel{P}{\rightarrow} 0$

and thus have the same limiting distribution so that

$\sqrt{n} (U_n - \theta) \rightarrow N(0, a^2 \sigma_1^2).$

For a walkthrough derivation of the limiting distribution of $U_n$ for a single sample, check out my blog post Getting to know U: the asymptotic distribution of a single U-statistic.

This blog post aims to provide an overview of the extension of kernels, expectation functionals, and the definition and distribution of U-statistics to multiple independent samples, with particular focus on the common two-sample scenario.

Continue reading Much Two U About Nothing: Extension of U-statistics to multiple independent samples