Resampling, the jackknife, and pseudo-observations

Resampling methods approximate the sampling distribution of a statistic or estimator. In essence, the sample taken from the population is treated as a population itself. A large number of new samples, or resamples, are drawn from this “new population”, commonly with replacement, and the estimate of interest is recomputed within each resample. The resulting replicate estimates can then be used to construct an empirical sampling distribution from which confidence intervals, bias, and variance may be estimated. These methods are particularly advantageous for statistics or estimators for which standard methods do not apply or are difficult to derive.
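As a minimal sketch of that recipe (my own illustration, not code from the post), suppose we want the sampling distribution of a sample median: resample the observed data with replacement, recompute the median each time, and summarize the replicates.

# Hypothetical observed sample; a skewed distribution makes the point nicely.
set.seed(1)
x <- rexp(50)

# Resample with replacement and recompute the statistic many times.
replicates <- replicate(5000, median(sample(x, replace = TRUE)))

# The replicates form an empirical sampling distribution:
sd(replicates)                        # estimated standard error
quantile(replicates, c(0.025, 0.975)) # simple 95% percentile interval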

The jackknife is a popular resampling method, first introduced by Quenouille in 1949 as a method of bias estimation. In 1958, jackknifing was both named by Tukey and expanded to include variance estimation. The name refers to a jackknife, a multipurpose tool, similar to a Swiss Army knife, that can get its user out of tricky situations. Efron later developed arguably the most popular resampling method, the bootstrap, in 1979, after being inspired by the jackknife.

In Efron’s (1982) book The jackknife, the bootstrap, and other resampling plans, he states,

Good simple ideas, of which the jackknife is a prime example, are our most precious intellectual commodity, so there is no need to apologize for the easy mathematical level.

Despite existing since the 1940s, resampling methods were long infeasible because of the computational effort required to draw many resamples and recalculate the estimate each time. With today’s computing power, the uncomplicated yet powerful jackknife, and resampling methods more generally, should be a tool in every analyst’s toolbox.
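To make the jackknife itself concrete before diving into the post, here is a small sketch of my own (not the post’s code): recompute the estimate leaving one observation out at a time, then use the leave-one-out replicates to form the usual bias and variance estimates and the pseudo-observations of the title.

set.seed(1)
x <- rexp(30)
n <- length(x)
theta_hat <- median(x)

# Leave-one-out (jackknife) replicates of the estimate.
theta_loo <- sapply(seq_len(n), function(i) median(x[-i]))

# Standard jackknife bias and variance estimates.
bias_jack <- (n - 1) * (mean(theta_loo) - theta_hat)
var_jack  <- (n - 1) / n * sum((theta_loo - mean(theta_loo))^2)

# Pseudo-observations: n * full-sample estimate - (n - 1) * leave-one-out estimate.
pseudo <- n * theta_hat - (n - 1) * theta_loo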

Continue reading Resampling, the jackknife, and pseudo-observations

Pride and Probability

To celebrate Pride Month, my husband Ethan’s workplace, Desire2Learn, organized a virtual Drag Queen BINGO hosted by the fabulous Astala Vista. Even within the confines of a Zoom meeting, Astala Vista put on a great show!

Astala Vista, previously a self-proclaimed cat *lady* of drag, now a cat *cougar* at 30, demonstrating her “roar” on Zoom!

To keep things interesting (and, I’m sure, to reduce the odds of winning so the show keeps going), different BINGO patterns besides the traditional “5 across” BINGO were used. These included a “4 corners” BINGO and a “cover-all” BINGO. To obtain a cover-all BINGO, all numbers on a traditional 5 by 5 BINGO card must be called (noting that of the 25 spaces on the card, 1 is a free space).

Up until this point, probability had not entered the discussion between my husband and me. However, with the cover-all BINGO, Ethan began wondering how many draws it would take to call a cover-all BINGO.

I became quiet, and my husband thought he had perhaps annoyed me with all of his probability questions. In fact, I was thinking about how I could easily simulate the answer to his question (and the corresponding combinatorics answer)!

First, we need to randomly generate a BINGO card. A BINGO card has five columns of five numbers each, with the exception of the N column, which contains a FREE space given to all players. The B column features the numbers 1 through 15, the I column 16 through 30, and so on. The numbers in each column are drawn without replacement for each card.
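As a quick sketch of that step (my own code, not from the post, with generate_card as a hypothetical helper name and the standard 75-ball column ranges assumed), each column can be filled by sampling five numbers without replacement from its range, with the centre of the N column left free:

# Each column samples 5 numbers without replacement from its range:
# B: 1-15, I: 16-30, N: 31-45, G: 46-60, O: 61-75.
generate_card <- function() {
  card <- sapply(0:4, function(j) sample(1:15 + 15 * j, 5))
  colnames(card) <- c("B", "I", "N", "G", "O")
  card[3, "N"] <- NA  # the centre FREE space
  card
}

set.seed(2021)
generate_card()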

Continue reading Pride and Probability

Using a DAG to simulate data with the dagR library

Directed acyclic graphs (DAGs), and causal graphs in general, provide a framework for making assumptions explicit and identifying confounders or mediators of the relationship between the exposure of interest and outcome that need to be adjusted for in analysis. Recently, I ran into the need to generate data from a DAG for a paper I am writing with my peers Kevin McIntyre and Joshua Wiener. After a quick Google search, I was pleasantly surprised to see there were several options to do so. In particular, the dagR library provides “functions to draw, manipulate, [and] evaluate directed acyclic graphs and simulate corresponding data”.

Besides dagR's reference manual, a short letter published in Epidemiology, and a limited collection of examples, I couldn't find many resources on how to use the functionality provided by dagR. The goal of this blog post is to provide an expository example of how to create a DAG and generate data from it using the dagR library.

To simulate data from a DAG with dagR, we need to:

  1. Create the DAG of interest using the dag.init function by specifying its nodes (exposure, outcome, and covariates) and the directed arcs (arrows to/from nodes) connecting them.
  2. Pass the DAG from (1) to the dag.sim function and specify the number of observations to be generated, arc coefficients, node types (binary or continuous), and parameters of the node distributions (Normal or Bernoulli).

For this tutorial, we are going to try to replicate the simple confounding/common cause DAG presented in Figure 1b, as well as the more complex DAG in Figure 2a, of Shrier and Platt’s (2008) paper, Reducing bias through directed acyclic graphs.

library(dagR)
set.seed(12345)
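
Before working through the full examples, here is a rough sketch of what those two steps could look like for the simple confounding DAG of Figure 1b, where a single covariate causes both the exposure and the outcome. This is my own illustration rather than code from the paper or the dagR documentation; the argument names and the node coding in the arcs vector (exposure = 0, outcome = -1, covariates numbered from 1) reflect my reading of the reference manual and should be checked against it.

# Step 1: one covariate (1) with arcs to the exposure (0) and the outcome (-1),
# plus the exposure-outcome arc itself.
dag_1b <- dag.init(covs = c(1),
                   arcs = c(1, 0,    # covariate -> exposure
                            1, -1,   # covariate -> outcome
                            0, -1))  # exposure  -> outcome

# Step 2: simulate data from the DAG; b gives one coefficient per arc (in the
# order the arcs were specified), mu and stdev the node distribution parameters,
# and binary flags which nodes are Bernoulli rather than Normal.
sim_1b <- dag.sim(dag_1b, b = c(0.4, 0.4, 0.3),
                  mu = c(0, 0, 0), stdev = c(1, 1, 1),
                  binary = c(0, 0, 0), n = 1000, seed = 12345)
head(sim_1b)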

Continue reading Using a DAG to simulate data with the dagR library

Advent of Code 2017 in R: Day 2

Day 2 of the Advent of Code provides us with a tab-delimited input consisting of numbers 2-4 digits long and asks us to calculate its “checksum”. The checksum is defined as the sum of the differences between each row’s largest and smallest values. Awesome! This is a problem that is well-suited for base R.

I started by reading the file in using read.delim, specifying header = F in order to ensure that numbers within the first row of the data are not treated as variable names.

When working with short problems like this, where I know I won’t be rerunning my code or reloading my data often, I will use file.choose() in my read.whatever functions for speed. file.choose() opens an interactive file-selection dialog (Windows Explorer on Windows), allowing you to navigate to your file instead of typing out its path.

input <- read.delim(file.choose(), header = F)

# Check the dimensions of input to ensure the data read in correctly.
dim(input)

A check of the dimensions shows the input read in correctly. As suspected, this is a perfect opportunity to operate on each row at once using the apply function.

# For each row, take the difference between its largest and smallest values.
row_diff <- apply(input, 1, function(x) max(x) - min(x))

# The checksum is the sum of the row differences.
checksum <- sum(row_diff)
checksum

Et voilà, the answer is 45,972! Continue reading Advent of Code 2017 in R: Day 2