Parametric vs. Nonparametric Approach to Estimations

Parametric statistics assume that the unknown CDF F belongs to a family of CDFs characterized by a parameter (vector) \theta. Since the form of F is assumed, the target of estimation is its parameter \theta, and all uncertainty about F reduces to uncertainty about \theta. The parameters are estimated by \hat{\theta}, and the estimates are substituted into the assumed distribution to conduct inference for the quantities of interest. If the assumed form of F is incorrect, inference may be inaccurate, or trends in the data may be missed.

To demonstrate the parametric approach, consider n = 100 independent and identically distributed random variables X_1, …, X_n generated from an exponential distribution with rate \lambda = 2. Investigators wish to estimate the 75^{th} percentile and erroneously assume that their data are normally distributed. Thus, F is assumed to be the normal CDF, but \mu and \sigma^2 are unknown. The parameters \mu and \sigma are estimated in the usual way by the sample mean \bar{x} and the sample standard deviation s, respectively. Since the normal distribution belongs to the location-scale family, an estimate of the p^{th} percentile is provided by,

    \[x_p = \bar{x} + s\Phi^{-1}(p)\]

where \Phi^{-1} is the standard normal quantile function, also known as the probit.
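This expression follows from inverting the assumed normal CDF: if X \sim N(\mu, \sigma^2), then

    \[p = P(X \leq x_p) = \Phi\left(\frac{x_p - \mu}{\sigma}\right) \implies x_p = \mu + \sigma\Phi^{-1}(p),\]

and substituting \bar{x} for \mu and s for \sigma yields the estimator.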

set.seed(12345)
library(tidyverse, quietly = TRUE)
# Generate data from Exp(2)
x <- rexp(n = 100, rate = 2)

# True value of 75th percentile with rate = 2
true <- qexp(p = 0.75, rate = 2) 
true
## [1] 0.6931472
# Estimate mu and sigma
xbar <- mean(x)
s    <- sd(x)

# Estimate 75th percentile assuming mu = xbar and sigma = s
param_est <- xbar + s * qnorm(p = 0.75)
param_est
## [1] 0.8792925

The true value of the 75^{th} percentile of \text{Exp}(2) is 0.69 while the parametric estimate is 0.88.

Nonparametric statistics make fewer assumptions about the unknown distribution F, requiring only mild conditions such as continuity or the existence of specific moments. Instead of estimating parameters of F, F itself is the target of estimation. F is commonly estimated by the empirical cumulative distribution function (ECDF) \hat{F},

    \[\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(X_i \leq x).\]

Any statistic that can be expressed as a function of the CDF, known as a statistical functional and denoted \theta = T(F), can be estimated by substituting \hat{F} for F. That is, plug-in estimators can be obtained as \hat{\theta} = T(\hat{F}).
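As a quick illustration (a sketch, reusing the sample x from the example above): the mean \mu = T(F) = \int x \, dF(x) is a statistical functional, and its plug-in estimator T(\hat{F}) sums each observation weighted by the mass \frac{1}{n} that the ECDF assigns to it, which is exactly the sample mean.

```r
# Plug-in estimate of the mean: the ECDF puts mass 1/n on each X_i,
# so T(Fhat) = sum_i X_i * (1/n), i.e., the sample mean
x <- rexp(n = 100, rate = 2)
plugin_mean <- sum(x * (1 / length(x)))
all.equal(plugin_mean, mean(x))  # TRUE
```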

Returning to the previous example, if nonparametric estimation is used, F is estimated by the ECDF \hat{F}. Then, the 75^{th} percentile is estimated as the observation corresponding to the 75^{th} percentile of \hat{F}.

# Estimate CDF 
Fhat <- ecdf(x)

# Find 75th percentile
nonparam_est <- quantile(Fhat, probs = 0.75)
nonparam_est
##       75% 
## 0.6174771

The nonparametric estimate of 0.62 is much closer to the true value 0.69 than the parametric estimate of 0.88.

A plot comparing the three distribution functions (true, empirical/nonparametric, and normal/parametric) is presented below to provide insight into why the nonparametric approach performed better in this scenario. Note that the exponential and normal distribution functions are continuous while the ECDF is discrete, assigning mass \frac{1}{n} to each X_i.

The magnitude of the discrepancy between the parametric estimate and true value of the 75^{th} percentile can be attributed to the incorrect assumption that the data is normally distributed when it is, in fact, exponentially distributed.

The exponential distribution cannot yield negative realizations, so its CDF (blue) rises steeply starting at x = 0 and then flattens as it approaches 1. On the other hand, the normal distribution can yield negative realizations, with its CDF (green) assigning non-zero probability to values below 0.

Comparing the two distribution functions, it is apparent that the "S"-shaped normal CDF overestimates P(X \leq x) when x < 0 or x > 1 while underestimating the probability for values in the middle, x \in [0, 1]. The true 75^{th} percentile, 0.69, happens to fall within this range.
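To make this concrete, the two CDFs can be compared at a few points. Here the normal parameters are set to the population mean and standard deviation of \text{Exp}(2), both equal to 0.5, as stand-ins for \bar{x} and s (the sample estimates will be close, but not identical):

```r
# Compare the (approximate) fitted normal CDF to the true Exp(2) CDF
pts        <- c(-0.5, 0.5, 2)
normal_cdf <- pnorm(pts, mean = 0.5, sd = 0.5)
true_cdf   <- pexp(pts, rate = 2)
round(normal_cdf - true_cdf, 4)
# Positive differences at x = -0.5 and x = 2 (overestimation);
# negative difference at x = 0.5 (underestimation)
```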

The nonparametric estimate is based solely on the n = 100 observations. The ECDF, therefore, 1) does not assign probability to the impossible negative values and 2) better approximates the shape of the true distribution. As a result, the nonparametric approach yields a better estimate of the 75^{th} percentile in this scenario.

# Generate sequence between -1 and 3 with step size of 0.001 for plotting.
xx <- seq(-1, 3, 0.001)

# Get values of CDFs for each xx
cdf_vals <- data.frame(Type = c(rep('True', length(xx)),
                                rep('Normal (parametric)', length(xx)),
                                rep('Empirical (nonparametric)', length(xx))),
                       Support = rep(xx, 3),
                       CDF     = c(pexp(xx, rate = 2),
                                   pnorm(xx, mean = xbar, sd = s),
                                   Fhat(xx)))

# Plot comparison of CDFs
cdf_vals %>%
  ggplot(aes(x = Support, y = CDF, col = Type)) +
  geom_step(lwd = 1.25, alpha = 0.75) +
  geom_hline(yintercept = 0.75, lty = 2) +
  labs(x = 'x', y = 'Pr(X <= x)', col = 'CDF') +
  theme_bw()

Comparison of true, empirical (nonparametric), and normal (parametric) distribution functions.

What would the result have been if the investigators correctly assumed an exponential distribution?

If the rate parameter \lambda is estimated using its MLE as \hat{\lambda} = \frac{1}{\bar{x}}, the p^{th} percentile can be estimated as,

    \[ x_p = -\frac{\ln(1-p)}{\hat{\lambda}}.\]
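This estimator comes from inverting the exponential CDF, F(x) = 1 - e^{-\lambda x} for x \geq 0:

    \[p = 1 - e^{-\lambda x_p} \implies x_p = -\frac{\ln(1-p)}{\lambda},\]

with the MLE \hat{\lambda} substituted for \lambda.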

# Estimate lambda by its MLE
lambdahat <- 1 / mean(x)
lambdahat
## [1] 2.000546
# Estimate 75th percentile assuming rate lambda = lambdahat
param_est2 <- -log(1 - 0.75) / lambdahat
param_est2
## [1] 0.6929579

The parametric estimate is now much closer to the true value of 0.69 than the nonparametric estimate! In fact, it is essentially identical to the true value, differing only in the fourth decimal place.

Parametric and nonparametric methods both have their pros and cons. If the wrong distribution is assumed, parametric methods can provide flawed or misleading estimates and miss trends in the data. If the correct distribution is assumed, parametric methods can leverage knowledge of the distribution to provide precise and accurate estimates. If the investigator is unsure which assumptions to make, nonparametric methods offer a safety net by relying only on the observed data and often provide reasonable estimates. However, small or noisy samples can yield a poor estimate of the ECDF and, in turn, poor estimates of the quantities of interest.

If you have a strong hunch about the actual distribution of the data, parametric estimation is likely the way to go. But, if not, it never hurts to explore nonparametric estimation.

Published by

Emma Smith

