\documentclass{article} \usepackage{amsmath} \usepackage{fullpage} \usepackage{url} \usepackage{xspace} \newcommand{\R}{\texttt{R}\xspace} \title{The binomial distribution} \author{Patrick Breheny} \date{September 15, 2016} <>= require(knitr) opts_chunk$set(prompt = TRUE) opts_chunk$set(comment = NA) opts_template$set(fig = list(fig.height=5, fig.width=5, out.width='.5\\linewidth', fig.align='center')) @ \begin{document} \maketitle So, today we're going to learn about \R's suite of functions for the binomial distribution. In addition, we'll see how to carry out hypothesis tests and construct confidence intervals for binomially distributed data. \section{The binomial distribution} \R has four functions related to the binomial distribution: \begin{itemize} \item {\tt dbinom}: Probability mass function \item {\tt pbinom}: Cumulative distribution function \item {\tt qbinom}: Quantile function (inverse of CDF) \item {\tt rbinom}: Generates random variables from a binomial distribution \end{itemize} This is a common pattern in \R; as we will see, there is are {\tt dnorm}, {\tt pnorm}, {\tt qnorm}, and {\tt rnorm} functions that do the same thing for the normal distribution, and so on for many other distributions. The {\tt dbinom} function evaluates the binomial formula, returning $P(X=k)$: <>= dbinom(31, size=39, prob=0.7) ## Same as: choose(39,31) * 0.7^31 * 0.3^8 @ As an aside, note the use of the {\tt choose} function; you could use factorials directly, but this can lead to overflow inaccuracies if the numbers are large enough: <>= ## This is OK: print(choose(39,31), digits=15) print(factorial(39)/(factorial(31)*factorial(8)), digits=15) ## This is a problem: print(choose(139,131), digits=15) print(factorial(139)/(factorial(131)*factorial(8)), digits=15) @ The {\tt pbinom} function evaluates the binomial CDF: <>= pbinom(31, size=39, prob=0.7) ## Same as: sum(dbinom(0:31, size=39, prob=0.7)) @ Note that in the second line of code, I passed a vector of values to {\tt dbinom}; this is fine, and is much more convenient than writing a for loop. For example, it allows you to plot the whole distribution with a single line of code: <>= barplot(dbinom(0:39, 39, 0.7), names=0:39, border=FALSE, las=1) @ Finally, {\tt rbinom} generates a vector of random variables from the binomial distribution (yes, I know I skipped {\tt qbinom}; it's not as useful, although we will use other ``q'' functions, such as {\tt qnorm}, many times): <>= x <- rbinom(100, size=39, prob=0.7) x barplot(table(x), border=FALSE, las=1) @ \section{Constructing the Clopper-Pearson interval} As we discussed in class, we need a way to find the values $\theta_L$ and $\theta_U$ such that we would reject the hypothesis concerning $\theta$ at the $\alpha=0.025$ level for all the values outside $[\theta_L, \theta_U]$. Let's draw a picture of what's going on: <>= p <- function(theta) pbinom(31, 39, theta) x <- seq(0.85, 1, 0.01) plot(x, p(x), type="l", xlab=expression(theta), ylab=expression(p(theta)), las=1) abline(h=0.025, col="red") @ This is our first example of a user-created function in \R. \R of course comes with a lot of functions, but here, on the first line, we created a new function {\tt p} that didn't exist before. Once we create it, we can call it, as we do on the third line with {\tt p(x)}. This is a pretty simple function; we'll create more complicated ones later in the course. The illustration gives us a good idea of where $\theta_U$ is, but how can we find the exact value of $\theta_U$? This is what is known in numerical analysis as a {\em root-finding problem}; i.e., it can be restated as finding the zero, or root, of a function. Strictly speaking, we want to find where our function intersects 0.025, not where it intersects zero, but this can be trivially fixed by subtracting off 0.025 from {\tt p(theta)}. \R's function for root-solving in one dimension is called {\tt uniroot}. Here's what it looks like in action: <>= p.u <- function(theta) pbinom(31, 39, theta) - 0.025 uniroot(p.u, 0:1)$root @ Note that we need to supply an interval to {\tt uniroot}; here, $[0,1]$ is an obvious interval for $\theta$. Now let's find $\theta_L$: <>= p.l <- function(theta) 1 - pbinom(30, 39, theta) - 0.025 uniroot(p.l, 0:1)$root @ So, by inverting the hypothesis test, we have obtained the 95\% confidence interval $[0.635, 0.907]$ for $\theta$. Of course, this is a pretty common task in statistics, so it shouldn't be surprising that \R has a function to calculate the confidence interval directly: <>= binom.test(31,39) @ Note that we get the same interval from {\tt binom.test} that we got ``manually'' -- in practice, of course, you'd use {\tt binom.test}, but it's important to see where that interval comes from. \end{document}