Plug-in estimators of statistical functionals

Consider a sequence of n independent and identically distributed random variables X_1, X_2, …, X_n \sim F. The distribution function F is unknown but belongs to a known set of distribution functions \mathcal{F}. In parametric estimation, \mathcal{F} may represent a family of distributions specified by a vector of parameters, such as (\mu, \sigma) in the case of the location-scale family. In nonparametric estimation, \mathcal{F} is much more broad and is subject to milder restrictions, such as the existence of moments or continuity. For example, we may define \mathcal{F} as the family of distributions for which the mean exists or all distributions defined on the real line \mathbb{R}.

As mentioned in my previous blog post comparing nonparametric and parametric estimation, a statistical functional is any real-valued function of the cumulative distribution function F, denoted \theta = T(F). Statistical functionals can be thought of as characteristics of F, and include moments

    \[T(F) = \mathbb{E}_{F}[X^{k}]\]

and quantiles

    \[T(F) = F^{-1}(p)\]

as examples.

An infinite population may be considered as completely determined by its distribution function, and any numerical characteristic of an infinite population with distribution function F that is used in statistics is a [statistical] functional of F.

Wassily Hoeffding. “A Class of Statistics with Asymptotically Normal Distribution.” Ann. Math. Statist. 19 (3) 293 – 325, September, 1948.

This blog post aims to provide insight into estimators of statistical functionals based on a sample of n independent and identically random variables, known as plug-in estimators or empirical functionals.

Continue reading Plug-in estimators of statistical functionals

Using a DAG to simulate data with the dagR library

Directed acyclic graphs (DAGs), and causal graphs in general, provide a framework for making assumptions explicit and identifying confounders or mediators of the relationship between the exposure of interest and outcome that need to be adjusted for in analysis. Recently, I ran into the need to generate data from a DAG for a paper I am writing with my peers Kevin McIntyre and Joshua Wiener. After a quick Google search, I was pleasantly surprised to see there were several options to do so. In particular, the dagR library provides “functions to draw, manipulate, [and] evaluate directed acyclic graphs and simulate corresponding data”.

Besides dagR‘s reference manual, a short letter published in Epidemiology, and a limited collection of examples, I couldn’t find too many resources regarding how to use the functionality provided by dagR. The goal of this blog post is to provide an expository example of how to create a DAG and generate data from it using the dagR library.

To simulate data from a DAG with dagR, we need to:

  1. Create the DAG of interest using the dag.init function by specifying its nodes (exposure, outcome, and covariates) and their directed arcs (directed arrows to/from nodes).
  2. Pass the DAG from (1) to the dag.sim function and specify the number of observations to be generated, arc coefficients, node types (binary or continuous), and parameters of the node distributions (Normal or Bernoulli).

For this tutorial, we are going to try to replicate the simple confounding/common cause DAG presented in Figure 1b as well as the more complex DAG in Figure 2a of Shier and Platt’s (2008) paper, Reducing bias through directed acyclic graphs.


Continue reading Using a DAG to simulate data with the dagR library

Parametric vs. Nonparametric Approach to Estimations

Parametric statistics assume that the unknown CDF F belongs to a family of CDFs characterized by a parameter (vector) \theta. As the form of F is assumed, the target of estimation is its parameters \theta. Thus, all uncertainty about F is comprised of uncertainty about its parameters. Parameters are estimated by \hat{\theta}, and estimates are be substituted into the assumed distribution to conduct inference for the quantities of interest. If the assumed distribution F is incorrect, inference may also be inaccurate, or trends in the data may be missed.

To demonstrate the parametric approach, consider n = 100 independent and identically distributed random variables X_1, …, X_n generated from an exponential distribution with rate \lambda = 2. Investigators wish to estimate the 75^{th} percentile and erroneously assume that their data is normally distributed. Thus, F is assumed to be the Normal CDF but \mu and \sigma^2 are unknown. The parameters \mu and \sigma are estimated in their typical way by \bar{x} and \sigma^2, respectively. Since the normal distribution belongs to the location-scale family, an estimate of the p^{th} percentile is provided by,

    \[x_p = \bar{x} + s\Phi^{-1}(p)\]

where \Phi^{-1} is the standard normal quantile function, also known as the probit.

library(tidyverse, quietly = T)
# Generate data from Exp(2)
x <- rexp(n = 100, rate = 2)

# True value of 75th percentile with rate = 2
true <- qexp(p = 0.75, rate = 2) 
## [1] 0.6931472
# Estimate mu and sigma
xbar <- mean(x)
s    <- sd(x)

# Estimate 75th percentile assuming mu = xbar and sigma = s
param_est <- xbar + s * qnorm(p = 0.75)
## [1] 0.8792925

The true value of the 75^{th} percentile of \text{Exp}(2) is 0.69 while the parametric estimate is 0.88.

Nonparametric statistics make fewer distributions about the unknown distribution F, requiring only mild assumptions such as continuity or the existence of specific moments. Instead of estimating parameters of F, F itself is the target of estimation. F is commonly estimated by the empirical cumulative distribution function (ECDF) \hat{F},

    \[\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i \leq x).\]

Any statistic that can be expressed as a function of the CDF, known as a statistical functional and denoted \theta = T(F), can be estimated by substituting \hat{F} for F. That is, plug-in estimators can be obtained as \hat{\theta} = T(\hat{F}).

Continue reading Parametric vs. Nonparametric Approach to Estimations

The Probabilistic Index for Two Normally Distributed Outcomes

Consider a two-armed study comparing a placebo and treatment. In general, the probabilistic index (PI) is defined as,

    \[\text{PI} = P(X < Y) + \frac{1}{2} P(X = Y)\]

and is interpreted as the probability that a subject in the treatment group will have an increased response compared to a subject in the placebo group. The probabilistic index is a particularly useful effect measure for ordinal data, where effects can be difficult to define and interpret owing to absence of a meaningful difference. However, it can also be used for continuous data, noting that when the outcome is continuous, P(X = Y) = 0 and the PI reduces to P(X < Y).

PI = 0.5 suggests an increased outcome is equally likely for subjects in the placebo and treatment group, while PI > 0.5 suggests an increased outcome is more likely for subjects in the treatment group compared to the placebo group, and the opposite is true when PI < 0.5.


Suppose X \sim N(\mu_X, \sigma^{2}_{X}) and Y \sim N(\mu_Y, \sigma^{2}_{Y}) represent the independent outcomes in the placebo and treatment groups, respectively and an increased value of the outcome is the desired response.

We simulate n_X = n_Y = 50 observations from each group such that treatment truly increases the outcome and the variances within each group are equal such that \sigma^{2}_{X} = \sigma^{2}_{Y}.

# Loading required libraries

# Setting seed for reproducibility
# Simulating data
n_X = n_Y = 50
sigma_X = sigma_Y = 1
mu_X = 5; mu_Y = 7

outcome_X = rnorm(n = n_X, mean = mu_X, sd = sigma_X)
outcome_Y = rnorm(n = n_Y, mean = mu_Y, sd = sigma_Y)

df <- data.frame(Group = c(rep('Placebo', n_X), rep('Treatment', n_Y)),
                 Outcome = c(outcome_X, outcome_Y))

Examining side-by-side histograms and boxplots of the outcomes within each group, there appears to be strong evidence that treatment increases the outcome as desired. Thus, we would expect a probabilistic index close to 1 as most outcomes in the treatment group appear larger than those of the placebo group.

# Histogram by group
hist_p <- df %>%
  ggplot(aes(x = Outcome, fill = Group)) +
    geom_histogram(position = 'identity', alpha = 0.75, bins = 10) + 
    theme_bw() +
    labs(y = 'Frequency')

# Boxplot by group
box_p <- df %>%
  ggplot(aes(x = Outcome, fill = Group)) +
    geom_boxplot() + 
    theme_bw() +
    labs(y = 'Frequency')

# Combine plots
grid.arrange(hist_p, box_p, nrow = 2)

plot of chunk unnamed-chunk-3

Continue reading The Probabilistic Index for Two Normally Distributed Outcomes