Confidence Interval Estimation with Hand-Crafted CDFs

Peter E. Freeman

Introduction

Recall that to determine confidence interval bounds, we solve the equation \[\begin{align*} F_Y(y_{\rm obs} \vert \theta) - q = 0 \end{align*}\] for \(\theta\). Here, \(Y\) is our chosen statistic (e.g., \(\bar{X}\)), \(F_Y(\cdot)\) is the cumulative distribution function (or cdf) for \(Y\)’s sampling distribution, \(y_{\rm obs}\) is the observed statistic value, and \(q\) is a quantile that we determine given the confidence coefficient \(1-\alpha\) and the type of interval we are constructing (one-sided lower bound, one-sided upper bound, or two-sided).

In this vignette, we examine the situation in which the sampling distribution cdf is analytically derivable, but (a) is not (easily) invertible by hand, and (b) is (to our knowledge!) not coded in an existing R package. In this situation, we would develop our own cdf-calculating function, and then pass it to cdfinv().

Example

Let’s suppose we sample \(n = 16\) data from a \(\text{Uniform}(0,\theta)\) distribution and we wish to compute a 95% upper bound for \(\theta\). Further, let’s suppose that the \(10^{\rm th}\) datum is \(x_{(10),\rm obs} = 0.45\), and that we will use this datum to determine the confidence interval. (This is obviously an academic exercise: we would normally utilize \(x_{(16),\rm obs}\) and solve for the upper bound by hand.)

Deriving the CDF

For the \(\text{Uniform}(0,\theta)\) distribution, the cdf within the domain is \[\begin{align*} F_X(x) = \int_0^x \frac{1}{\theta - 0} dy = \frac{x}{\theta} \,. \end{align*}\] We utilize a result found in Casella & Berger (2002; page 229): the cdf for the \(j^{\rm th}\) order statistic is \[\begin{align*} F_{(j)}(x) = \sum_{k=j}^n \binom{n}{k} [F_X(x)]^k [1-F_X(x)]^{n-k} \,. \end{align*}\] (Our notation slightly differs from that of Casella & Berger.) Here, that cdf would be \[\begin{align*} F_{(j)}(x) = \sum_{k=j}^n \binom{n}{k} \left(\frac{x}{\theta}\right)^k \left[1-\left(\frac{x}{\theta}\right)\right]^{n-k} \,. \end{align*}\] This is obviously not a cdf that we can easily invert by hand when \(j < n\).

Coding the CDF

Below we show how one might create one’s own code that returns the cdf value as a function of a coordinate \(x\) and a parameter \(\theta\). Note that the first argument must be q…the observed statistic value passed to cdfinv() will be matched to this argument name.

# give the function a name that will not conflict with others in the namespace
punif.ord <- function(q,theta,j,n)
{
  sum(choose(n,j:n)*(q/theta)^(j:n)*(1-q/theta)^(n-j:n))
}

Applying the CDF

Below we show how we can pass the function, now defined in R’s global environment, to cdfinv(). The first three arguments are standard: the distribution “name” is unif.ord (cdfinv() will tack a p on the front), the parameter of interest is theta, and the observed statistic value is 0.45. The next argument specifies that we wish to compute a (95% by default) upper bound on \(\theta\), instead of the default two-sided interval. As we know that \(\theta \geq x_{(10),\rm obs}\), we specify a lower parameter bound (lpb) of 0.45. The last two arguments are extra ones specifically required for punif.ord().

require(cdfinv)
cdfinv("unif.ord","theta",0.45,bound="upper",lpb=0.45,j=10,n=16)
## $DISTR
## [1] "unif.ord"
## 
## $PARAM
## [1] "theta"
## 
## $STAT
## [1] 0.45
## 
## $bound
## [1] 1.150861
## 
## $q
## [1] 0.05

The 95% upper bound on \(\theta\) determined using the \(10^{\rm th}\) ordered datum is \(\hat{\theta}_U = 1.151\).