\documentclass{article}

\usepackage{amsmath}
\usepackage{fullpage}
\usepackage{url}
\usepackage{xspace}
\newcommand{\R}{\texttt{R}\xspace}


\title{Contingency table and count data}
\author{Patrick Breheny}
\date{November 3, 2016}

<<include=FALSE, purl=FALSE>>=
require(knitr)
opts_chunk$set(prompt = TRUE)
opts_chunk$set(comment = NA)
opts_template$set(fig = list(fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'))
@

\begin{document}
\maketitle

Today's lab will introduce the \R functions for analyzing contingency tables as well as analyzing count data using the Poisson distribution.

\section{Contingency tables}

\subsection{Fisher's exact test}

The \R function for carrying out Fisher's exact test is called {\tt fisher.test}.  To illustrate how it works, let's use one of the examples we saw in class when discussing cross-sectional studies.  Below, we create a matrix {\tt X} containing the data (here, counts of hospitalized patients according to whether they have a respiratory disease, a circulatory disease, both, or neither) in a $2 \times 2$ table, then analyze it using Fisher's exact test.  First, I use the built-in \R function, then I demonstrate where this $p$-value is coming from using the equations we derived in class.

<<fisher_test, opts.label='fig'>>=
X <- rbind(c( 7,  29),
           c(13, 208))
fisher.test(X)

## By hand
M <- sum(X[,1])
N <- sum(X)
n <- sum(X[1,])
x <- 0:20
Prob <- choose(M,x)*choose(N-M,n-x)/choose(N,n)
names(Prob) <- x
barplot(Prob, border="white", las=1)
sum(Prob[Prob <= Prob["7"]])
@

Notice that {\tt fisher.test} provides an estimate of the odds ratio along with a confidence interval.  We didn't discuss this in class, but it is also possible to construct a confidence interval for the odds ratio by inverting Fisher's exact test.  However, being able to test hypotheses other than indepdendence involves analysis of noncentral hypergeometric distributions, which are a bit beyond the scope of our course.

Still, it is worthwhile knowing that we can obtain these exact confidence intervals from {\tt fisher.test}.  Let's see how they compare with the approximate confidence interval that we derived in class.

<<Approximate_interval>>=
SE <- sqrt(sum(1/X))
OR <- X[1,1]*X[2,2]/(X[1,2]*X[2,1])
log(OR) + qnorm(c(0.025, 0.975))*SE      ## CI for log(OR)
exp(log(OR) + qnorm(c(0.025, 0.975))*SE) ## CI for OR
@

So, reasonably close, although clearly there are some differences.  Differences between exact and approximate methods are to be expected here, since low counts are present in several cells.  We see much better agreement, for example, in the CDC breast cancer data:

<<Confidence_interval_agreement>>=
Y <- rbind(c(4475, 65),
           c(1597, 31))
fisher.test(Y)$conf.int
SE <- sqrt(sum(1/Y))
OR <- Y[1,1]*Y[2,2]/(Y[1,2]*Y[2,1])
exp(log(OR) + qnorm(c(0.025, 0.975))*SE)
@

\subsection{The $\chi^2$ test}

\R also offers a function for carrying out the $\chi^2$ test, called {\tt chisq.test}.  As before, I'll use the built-in \R function, then demonstrate where the $p$-value is coming from using the equations we derived in class.

<<chisq_test, opts.label='fig'>>=
chisq.test(X)
chisq.test(X, correct=FALSE)

E <- outer(apply(X, 1, sum), apply(X, 2, sum))/sum(X) ## Expected cell counts
sum((X-E)^2/E)
1-pchisq(sum((X-E)^2/E), 1)

x <- seq(0, 10, 0.01)
plot(x, dchisq(x, 1), type="l")
abline(v=sum((X-E)^2/E), col="red")
@

Two remarks here:
\begin{itemize}
\item First, note that we get a warning here for the $\chi^2$ test.  As you can verify by looking at {\tt E}, one of the expected cell counts is only 2.8; {\tt chisq.test} will warn you if any of the $E_i$ values is below 5.
\item Second, we didn't discuss the idea of a continuity correction in class.  The most important thing to know in terms of practical usage is that a continuity ``correction'' doesn't necessarily make the test more accurate, but it always makes the test more conservative (increases the $p$-value).  For example, although for the hospital data the continuity-corrected test was in closer agreement with Fisher's exact test than the regular $\chi^2$ test, this is not the case for the CDC breast cancer data:

<<Test_agreement, opts.label='fig'>>=
fisher.test(Y)$p.value
chisq.test(Y)
chisq.test(Y, correct=FALSE)
@
\end{itemize}

Note that the {\tt chisq.test} function does not provide an estimate or confidence interval for the odds ratio (and there is no option to request one).

\subsection{Larger tables}

Before moving on to Poisson/count data, let's take a quick look at larger tables (i.e., not $2 \times 2$ tables).  For example, here are the results from the 2000 General Social Survey in terms of gender and party identification:

<<Party_identification_data>>=
Z <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(Z) <- list(gender = c("Female","Male"),
                    party = c("Democrat","Independent", "Republican"))
Z
@

The $\chi^2$ test proceeds in exactly the same way for larger tables, with the exception of a change in the degrees of freedom:

<<chisq.test___Two-by-three_table>>=
chisq.test(Z, correct=FALSE)
E <- outer(apply(Z, 1, sum), apply(Z, 2, sum))/sum(Z)
1-pchisq(sum((Z-E)^2/E), 2)
@

The syntax of {\tt fisher.test} is the same for larger tables as it is for smaller ones.  The internal calculations, however, are quite different as the distribution of counts no longer follows a simple hypergeometric distribution.

<<fisher.test___Two-by-three_table>>=
fisher.test(Z)
@

Note that (a) the $p$-value is nearly identical to that of the $\chi^2$ test, and (b) we no longer get an odds ratio.  There is an obvious reason for this: we cannot take a ratio when we have more than two categories we are comparing.  We can only discuss the odds ratios for smaller comparisons within the larger table:

<<Odds_ratios___Two-by-three_table>>=
fisher.test(Z[,-2])
fisher.test(Z[,-3])
@

Thus, women are 60\% more likely to identify as Democrats (vs. Republicans) compared to men, but there is no clear evidence from this survey that women are any more likely to identify as Independents (vs. Republicans) than men are.

\section{Count data}

The function in \R for analyzing one- and two-sample count data using the Poisson distribution is called {\tt poisson.test}.  To get some added variety, let's consider a different sort of experimental setup than what we considered in class.  The Poisson distribution is widely used to model plate counts in laboratory settings.  One common example is the Ames test for mutagenicity:

\begin{center}
  \includegraphics[width=0.6\textwidth]{ames.png}
\end{center}

Let's analyze data from an Ames test concerning the mutagenicity of a substance called harmine:

<<Harmine data>>=
x <- c(16, 21, 22)     ## Control
y <- c(22, 23, 33, 37) ## 100 micrograms of harmine
X <- sum(x)
Y <- sum(y)
@

Here, due to the additivity of the Poisson distribution, we have $X \sim \textrm{Pois}(3\lambda)$ and $Y \sim \textrm{Pois}(4\mu)$.  To analyze using {\tt poisson.test}:

<<poisson.test___Two_sample>>=
poisson.test(c(Y, X), c(4, 3))
@

Thus, the data are incompatible with the hypothesis that harmine does not affect mutation rate; this study indicates that harmine increases the natural mutation rate by 46\%, with a 95\% confidence interval for the rate ratio of $[1.06, 2.04]$.  Note that a $t$-test of the same data is considerably less powerful:

<<Comparison___t-test>>=
t.test(x, y)
@

Finally, if we wanted to obtain separate intervals for each rate, we can use {\tt poisson.test} for that also:

<<poisson.test___One_sample>>=
poisson.test(Y, 4)
poisson.test(X, 3)

## Equivalent to
f.u <- function(lam) ppois(X, 3*lam) - 0.025
uniroot(f.u, c(20,30))$root
f.l <- function(lam) 1-ppois(X-1, 3*lam) - 0.025
uniroot(f.l, c(1, 20))$root
@

\end{document}
