Confidence Interval Estimation with cdfinv()

Peter E. Freeman

Introduction

There are three “classical” techniques for determining confidence intervals for distribution parameters: the pivotal method (taught at scale in mathematical statistics classes), inverting likelihood-ratio tests, and inverting the cumulative distribution function of the sampling distribution of a statistic. All three are described by Casella & Berger (2002; Chapter 9). The last of these, which we dub “cdf inversion,” involves solving the equation \[\begin{align*} F_Y(y_{\rm obs} \vert \theta) - q = 0 \end{align*}\] for \(\theta\). Here, \(Y\) is our chosen statistic (e.g., \(\bar{X}\)), \(F_Y(\cdot)\) is the cumulative distribution function (or cdf) for \(Y\)’s sampling distribution, \(y_{\rm obs}\) is the observed statistic value, and \(q\) is a quantile that we determine given the confidence coefficient \(1-\alpha\) and the type of interval we are constructing (one-sided lower bound, one-sided upper bound, or two-sided). In the pre-computer era, inverting the cdf to determine \(\theta\) was intractable in all but the simplest cases, but by utilizing numerical root-finders we can now determine bounds in a wide variety of situations.
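To make the root-finding concrete before introducing the package's interface, here is a minimal base-R sketch (using uniroot() directly, not cdfinv() itself) for a toy case in which \(Y \sim \text{Normal}(\theta,1)\), \(y_{\rm obs} = 2\), and \(q = 0.975\):

```r
# Minimal sketch of cdf inversion with base R (not the package itself):
# solve F_Y(y.obs | theta) - q = 0 for theta, where Y ~ Normal(theta, 1).
y.obs <- 2
q     <- 0.975
f     <- function(theta) pnorm(y.obs, mean = theta) - q
lower <- uniroot(f, interval = c(-10, 10), tol = 1e-6)$root
lower  # approximately 0.04, i.e., y.obs - qnorm(0.975)
```

cdfinv() wraps exactly this kind of computation, handling the mapping from \(\alpha\) to \(q\) and the bookkeeping for one- versus two-sided intervals.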

Before continuing, we note that we are focusing on a particular setting in which there is one freely varying parameter for the distribution we are working with. (There are a couple of exceptions to this, as we will see below, but in the end, the functions in this package are built to compute confidence intervals and not confidence regions.)

The current package supports three basic use cases.

  1. The case where the sampling distribution is known and where functions associated with it already exist in R (either in the base packages or a contributed package). The examples below deal with this case.
  2. The case where the sampling distribution is known but where functions associated with it (apparently) do not exist. In this case, we would define our own cdf function and pass it to cdfinv(). See the vignette Confidence Interval Estimation with Hand-Crafted CDFs. (See also Example 2 below, specifically those parts about using the pre-defined helper functions ptmean() and pnormvar().)
  3. The case where the sampling distribution is unknown but where we can sample individual data from the assumed underlying distribution. (For instance, we sample \(n = 5\) iid data from a \(\text{Beta}(1.4,b)\) distribution…we don’t know the sampling distribution for any sufficient statistic for \(b\), but we can sample the individual data.) In this case, we utilize simulations (and the function cdfinv.sim(); see the vignette Confidence Interval Estimation via Simulations).

In this vignette, we will demonstrate how one would use the cdfinv() function to numerically estimate confidence intervals for a range of (basic) situations.

cdfinv(): Function Arguments

The nine core arguments are the following.

  1. DISTR: this is a string representing the stem of previously coded R functions (e.g., "norm", which “maps” to pnorm(), qnorm(), etc.) or a function that we define ourselves (see the vignette Confidence Interval Estimation with Hand-Crafted CDFs).
  2. PARAM: this is a string representing the name of the parameter of interest, \(\theta\), as it is known to R. (Examples include "mean" for the normal \(\mu\), "shape1" for the beta \(a\), etc.)
  3. STAT: the observed value of the adopted statistic (i.e., \(y_{\rm obs}\)).
  4. lpb: the lower bound on the allowed range of values for \(\theta\). By default, this is \(-10000\) (meaning, effectively \(-\infty\)). If we are working with, e.g., a binomial distribution, we would set this to 0, since \(0 \leq p \leq 1\).
  5. upb: same as lpb, but the upper bound. In the binomial case, we would set this to 1. By default, its value is \(10000\) (effectively \(\infty\)).
  6. bound: set this to “two-sided” (default), or “lower” or “upper”.
  7. alpha: the confidence coefficient is \(1-\alpha\), so if, e.g., we want to compute a 90% interval estimate, we would note that \(1 - \alpha = 0.9\) and thus that \(\alpha = 0.1\). By default, \(\alpha = 0.05\).
  8. tolb: this is an offset value for lpb and upb to mitigate issues that sometimes appear when, e.g., we set lpb to zero (for some distributions, it is OK to have an exact zero value, and for some it isn’t).
  9. tol: this is the tolerance that uniroot() uses as a stopping criterion. The default value, \(10^{-6}\), is smaller than uniroot()’s own default of \(\sim 10^{-4}\).

We must specify values for the first three arguments.

The ... included at the end is for extra arguments that the sampling distribution cdf requires. See, e.g., Example 2 below.

Example 1: Normal Population Mean, Variance Known

Let’s suppose we sample \(n = 6\) independent and identically distributed (iid) data from a \(\text{Normal}(\mu,\sigma^2=4)\) distribution and we wish to determine a 95% two-sided confidence interval on \(\mu\), using \(Y = \bar{X}\) as our statistic. Let’s additionally suppose that the observed statistic value is \(y_{\rm obs} = \bar{x}_{\rm obs} = 3\).

Using the method of moment-generating functions, we determine that \(\bar{X} \sim \text{Normal}(\mu,\sigma^2/n)\).

The first three arguments to cdfinv() specify the name of the distribution as it is known in R (here, norm), the name of the parameter of interest also as it is known in R (here, mean), and the observed statistic value. We need not specify any additional arguments.

require(cdfinv)
## Loading required package: cdfinv
alpha  <- 0.05 # the confidence coefficient is 1-alpha
n      <- 6
sigma2 <- 4/n
y.obs  <- 3

cdfinv("norm","mean",y.obs)
## $DISTR
## [1] "norm"
## 
## $PARAM
## [1] "mean"
## 
## $STAT
## [1] 3
## 
## $bound
## [1] 1.040036 4.959964
## 
## $q
## [1] 0.975 0.025

The output shows the input values ("norm", "mean", and the value of y.obs) as well as the lower and upper bounds for the interval (1.040 and 4.960) and the values of \(q\) associated with each bound (see the equation at the beginning of the vignette; the conversion from \(\alpha\) to \(q\) is done inside cdfinv()).
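We can check the defining equation directly in base R: at each returned bound, the cdf evaluated at \(y_{\rm obs}\) should equal the corresponding value of \(q\). (Since no additional arguments were supplied in the call above, pnorm()'s own defaults apply in this check.)

```r
# Check: at each bound, the normal cdf evaluated at y.obs recovers the
# matching quantile q reported by cdfinv().
y.obs <- 3
pnorm(y.obs, mean = 1.040036)  # approximately 0.975
pnorm(y.obs, mean = 4.959964)  # approximately 0.025
```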

We can visualize the interval by passing the output of cdfinv() to the function ci_plot().

ci_plot(cdfinv("norm","mean",y.obs))

The vertical long-dashed line represents \(y_{\rm obs}\), the observed statistic value, while the two horizontal short-dashed lines represent the cdf quantiles \(q = 0.025\) and \(q = 0.975\) (i.e., \(\alpha/2\) and \(1-\alpha/2\), where \(\alpha = 0.05\) since we are computing a 95% confidence interval). The solid lines are the cdfs for normal distributions with variance 4/6 and with means 1.040 (the left curve) and 4.960 (the right curve). For these values of \(\mu\), the cdf curves pass through the intersections of the horizontal lines with the vertical line. If we repeat the experiment of collecting \(n = 6\) iid data, then \(y_{\rm obs}\) will change, shifting the vertical line and thus shifting the values of \(\mu\) that comprise the interval endpoints. Thus confidence intervals are random intervals that, in the long run, have probability \(1-\alpha\) of overlapping the true parameter value.
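The “random interval” interpretation can be demonstrated with a quick simulation (a base-R sketch, independent of the package): repeatedly draw \(n = 6\) values from a \(\text{Normal}(3,4)\) population, construct the known-variance interval \(\bar{x} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}\) each time, and record how often it covers \(\mu = 3\).

```r
# Coverage check: the two-sided known-variance interval should cover the
# true mean in roughly 95% of repeated experiments.
set.seed(1)
mu <- 3; sigma <- 2; n <- 6; B <- 10000
z  <- qnorm(0.975)
covered <- replicate(B, {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  (xbar - z * sigma / sqrt(n) <= mu) && (mu <= xbar + z * sigma / sqrt(n))
})
mean(covered)  # close to 0.95
```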

Example 2: Normal Population Mean, Variance Unknown

We assume the same setting as for Example 1, but here \(\sigma^2\) is unknown; let’s suppose that we observe the sample variance \(S^2 = 4.5\). Here, we utilize the statistic \[\begin{align*} T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1} \,, \end{align*}\] where \(t_{n-1}\) is a \(t\)-distribution for \(n-1\) degrees of freedom. Working with the \(t\)-distribution necessitates the use of a helper function that is included as part of the cdfinv package, ptmean():

ptmean <- function(q,mean,s2,n)
{
  ... # detail redacted
  pt((q-mean)/sqrt(s2/n),n-1)
}

We utilize this function directly when calling cdfinv():

cdfinv("tmean","mean",3,s2=4.5,n=6) # here we pass extra arguments
## $DISTR
## [1] "tmean"
## 
## $PARAM
## [1] "mean"
## 
## $STAT
## [1] 3
## 
## $bound
## [1] 0.7738108 5.2261892
## 
## $q
## [1] 0.975 0.025

The lower and upper bounds are 0.774 and 5.226, respectively. (We note that these bounds match those output by the base-R function t.test(), although this function requires that the data vector be input rather than, e.g., the sample mean.)
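As a base-R cross-check, these bounds can be reproduced directly from the \(t\) quantile: the interval is \(\bar{x}_{\rm obs} \pm t_{n-1,1-\alpha/2}\sqrt{s^2/n}\).

```r
# Reproduce the bounds from the t quantile directly.
y.obs <- 3; s2 <- 4.5; n <- 6; alpha <- 0.05
half  <- qt(1 - alpha/2, df = n - 1) * sqrt(s2 / n)
c(y.obs - half, y.obs + half)  # approximately 0.774 and 5.226
```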

Note that if we wish to estimate an interval for \(\sigma^2\), we would do something similar to what we do immediately above, but we would utilize the helper function pnormvar() instead.

Example 3: Binomial Probability

Let’s suppose we sample \(n = 5\) iid data from a \(\text{Binom}(k=10,p)\) distribution. (Imagine we flip a coin 10 times, and record the number of heads, then repeat the process four additional times…our data are thus \(\{X_1,\ldots,X_5\}\).) We wish to determine a 95% two-sided confidence interval on \(p\).

Using the method of moment-generating functions, we determine that \(Y = \sum_{i=1}^n X_i \sim \text{Binom}(nk,p)\). Thus we utilize \(Y\) as our statistic. Let the observed value be \(y_{\rm obs} = 18\).

Computing interval bounds given discrete sampling distributions is slightly tricky. When the expected value of the statistic \(Y\), \(E[Y]\), increases as \(\theta\) increases (as it does here: the larger the value of \(p\), the larger, on average, the value of \(Y\)), then when we compute lower bounds, we need to subtract 1 from the observed statistic value. (This is because the lower bound solves \(P(Y \geq y_{\rm obs} \vert \theta) = q\), and for a discrete statistic \(P(Y \geq y_{\rm obs} \vert \theta) = 1 - F_Y(y_{\rm obs}-1 \vert \theta)\).) (If \(E[Y]\) decreases with \(\theta\), as is the case for, e.g., the negative binomial distribution, then we subtract 1 from \(y_{\rm obs}\) when computing upper bounds.) Because of this “trickiness,” we advise computing the two bounds separately.

alpha <- 0.025 # appropriate for two-sided interval bounds computed separately
y.obs <- 18
n     <- 5
k     <- 10
cdfinv("binom","prob",y.obs-1,lpb=0,upb=1,bound="lower",alpha=alpha,size=n*k)$bound
## [1] 0.2291573
cdfinv("binom","prob",y.obs  ,lpb=0,upb=1,bound="upper",alpha=alpha,size=n*k)$bound
## [1] 0.5080686

The 95% confidence interval for \(p\) is thus \([0.229,0.508]\). Note the specification of values for the arguments lpb and upb: these are lower and upper bounds for the parameter value. Since \(0 \leq p \leq 1\), we manually set lpb to 0 and upb to 1. (The default values are, effectively, \(-\infty\) and \(\infty\).) (We note that our bounds match those output by the base-R function binom.test()…what we derive are Clopper-Pearson interval bounds.)
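As a sanity check, the same bounds follow from base R alone by solving the two defining equations with uniroot():

```r
# Reproduce the Clopper-Pearson bounds by direct root-finding with base R.
y.obs <- 18; size <- 50; alpha <- 0.025   # size = n*k
lower <- uniroot(function(p) pbinom(y.obs - 1, size, p) - (1 - alpha),
                 interval = c(1e-6, 1 - 1e-6), tol = 1e-8)$root
upper <- uniroot(function(p) pbinom(y.obs, size, p) - alpha,
                 interval = c(1e-6, 1 - 1e-6), tol = 1e-8)$root
c(lower, upper)  # approximately 0.229 and 0.508
```

Equivalently, these are qbeta(0.025, 18, 33) and qbeta(0.975, 19, 32), the usual closed-form Clopper-Pearson expressions.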

Example 4: Poisson Lambda

Let’s suppose we sample \(n = 10\) iid data from a \(\text{Poisson}(\lambda)\) distribution. We wish to determine a 90% lower bound on \(\lambda\).

Using the method of moment-generating functions, we determine that \(Y = \sum_{i=1}^n X_i \sim \text{Poisson}(n\lambda)\). Thus we utilize \(Y\) as our statistic. Let the observed value be \(y_{\rm obs} = 28\).

As in Example 3, we need to make a “discreteness correction” to \(y_{\rm obs}\): we will input 27 into cdfinv() rather than 28. Additionally, we note that what cdfinv() will return is a lower bound for \(n\lambda\); we need to divide this number by \(n\).

alpha <- 0.1
y.obs <- 28
n     <- 10
cdfinv("pois","lambda",y.obs-1,lpb=0,bound="lower",alpha=alpha)$bound / n
## [1] 2.146867

The 90% lower bound on \(\lambda\) is thus 2.147.
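As a cross-check, the same bound follows from base R alone, either by root-finding on ppois() or via the closed-form gamma-Poisson relation (the exact lower bound for \(n\lambda\) is \(q_{\Gamma}(\alpha; y_{\rm obs})\)):

```r
# Reproduce the 90% lower bound for lambda with base R alone:
# solve ppois(y.obs - 1, n*lambda) = 1 - alpha for n*lambda, then divide by n.
y.obs <- 28; n <- 10; alpha <- 0.1
root <- uniroot(function(lam) ppois(y.obs - 1, lam) - (1 - alpha),
                interval = c(1e-6, 1000), tol = 1e-8)$root
root / n                  # approximately 2.147
qgamma(alpha, y.obs) / n  # equivalent closed form via the gamma-Poisson relation
```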

Example 5: Half-Normal Sigma

We noted in some of our examples above how the bounds we derive match the output from existing base-R functions…which might motivate the question of “why does this package exist?” To answer that, let’s go outside base-R functionality to put a two-sided 95% confidence interval on the parameter \(\sigma\) of a half-normal distribution. Let’s assume that we sample a single datum \(x_{\rm obs} = 5\). (The reason we assume a single datum here is that the sampling distribution for our statistic is simply the half-normal distribution itself. If \(n > 1\), then we don’t know the sampling distribution of, e.g., \(\bar{X}\)…and so we would have to utilize simulations to estimate interval bounds. Which we can do. See the vignette Confidence Interval Estimation via Simulations.)

require(extraDistr)
## Loading required package: extraDistr
cdfinv("hnorm","sigma",5,lpb=0)$bound
## [1]   2.230746 159.550797
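Because the half-normal cdf has a simple closed form, \(F(x \vert \sigma) = 2\Phi(x/\sigma) - 1\), we can sanity-check these bounds with base R alone (no extraDistr needed): setting \(F(x_{\rm obs} \vert \sigma) = q\) and solving gives \(\sigma = x_{\rm obs}/\Phi^{-1}((1+q)/2)\).

```r
# Closed-form check: the half-normal cdf is F(x | sigma) = 2*pnorm(x/sigma) - 1,
# so F(x.obs | sigma) = q implies sigma = x.obs / qnorm((1 + q) / 2).
x.obs <- 5
lower <- x.obs / qnorm((1 + 0.975) / 2)  # q = 0.975 gives the lower bound
upper <- x.obs / qnorm((1 + 0.025) / 2)  # q = 0.025 gives the upper bound
c(lower, upper)  # approximately 2.231 and 159.551
```

(Note that \(F\) decreases as \(\sigma\) increases, so the larger quantile, \(q = 0.975\), yields the lower bound, matching the $q output above.)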