\documentclass{article}

\usepackage{amsmath}
\usepackage{fullpage}
\usepackage{url}
\usepackage{xspace}
\newcommand{\R}{\texttt{R}\xspace}


\title{Simulations, the central limit theorem, and robustness}
\author{Patrick Breheny}
\date{October 13, 2016}

<<include=FALSE, purl=FALSE>>=
require(knitr)
opts_chunk$set(prompt = TRUE)
opts_chunk$set(comment = NA)
opts_template$set(fig = list(fig.height=5, fig.width=5, out.width='.5\\linewidth', fig.align='center'))
set.seed(1)
@

\begin{document}

\maketitle

Last week we carried out our first simulations, investigating the actual coverage of various methods for constructing intervals for binomial proportions.  We continue with simulations today, using them to investigate the central limit theorem as well as how the methods we derived in class (based on the $t$-distribution for the mean and the $\chi^2$ distribution for the variance) perform when applied to non-normal data.

\section{Central limit theorem}

As part of the NHANES study, the triglyceride levels of 3,026 adult women were measured.  Triglycerides, the main constituent of both vegetable oil and animal fat, have been linked to atherosclerosis, heart disease, and stroke.  Let's consider this whole group of 3,026 women the ``population'' for the purposes of our simulation, and that we are going to conduct a study of this population by taking a small sample of, say, 25 women from it.  So let's do this and take a look at the distribution of triglycerides in our population and in the sample:

<<NHANES_lipid_data, opts.label="fig">>=
lipids <- read.delim("http://myweb.uiowa.edu/pbreheny/data/lipids.txt")
pop <- lipids$TRG
sam <- sample(pop, 25)
hist(pop, col="gray", border="white", breaks=seq(0, 400, 5), las=1)
hist(sam, col="gray", border="white", breaks=seq(0, 400, 50), las=1)
mean(pop)
mean(sam)
@

It's worth noting that (a) the distribution of triglycerides in the population is clearly right-skewed, (b) the sample looks reasonably representative of the population (as it should, since it's a random sample), and (c) the sample mean and population mean are reasonably close, but the sample mean is clearly off by a bit in terms of estimating the population mean.  Of course, this is just one sample; the means of other random samples might be much further away from 116.9, or much closer.

What the central limit theorem deals with is the distribution of the sample mean.  To see that distribution, we'll have to repeat the above sampling process many times and obtain many sample means.  This can be done in \R using a {\tt for} loop:

<<Simulation, opts.label="fig">>=
N <- 10000          ## Number of simulations to run
n <- 25             ## Sample size
means <- numeric(N) ## Setting up an empty vector
for (i in 1:N) {
  sam <- sample(pop, n)
  means[i] <- mean(sam)
}
hist(means, col="gray", border="white", breaks=seq(0, 400, 5), las=1)
@

Note that, at least qualitatively, the central limit theorem seems to be holding up:

\begin{itemize}
  \item The distribution of the means seems centered around the population mean of 117
  \item The spread of the distribution of the means is clearly much smaller than the spread in the original distribution of TRG values
  \item The shapes of the distributions are not the same; in particular, the distribution of means looks much less skewed and more normal-like
\end{itemize}

Let's put it through a more quantitative check, though, to see how exactly the CLT is working out:

<<CLT_accuracy, opts.label="fig">>=
mean(means)

SD <- sd(pop)
SE <- SD/sqrt(n)
sd(means)
SE

z <- sqrt(n) * (means-mean(pop)) / sd(pop)
hist(z, col="gray", border="white", freq=FALSE, breaks=99, las=1)
zz <- seq(-4, 4, length=101)
lines(zz, dnorm(zz))
@

So the approximation seems pretty good -- the distribution isn't {\em exactly} normal, but it's pretty close, and the expectation and SD calculations match up with our in-class derivations.  Let's look at some specific distributional predictions with respect to probability:

<<CLT_accuracy_part_2>>=
## Probability that a sample mean is less than 100 mg/dL
mean(means <= 100)
pnorm(100, mean(pop), SE)  ## Using the location-scale normal
pnorm((100-mean(pop))/SE) ## Using the standard normal

## 90th percentile
quantile(means, .9)
qnorm(.9, mean(pop), SE) ## Using the location-scale normal
qnorm(.9)*SE + mean(pop) ## Using the standard normal
@

Questions for discussion:

\begin{itemize}
  \item Try checking the accuracy with respect to: ``What's the probability that the sample mean will be between 100 and 150 mg/dL?''
  \item Each of the above answers is an approximation.  Why?  Which should you trust?  What does the accuracy of each approximation depend on?
\end{itemize}

Additional exercises:

\begin{itemize}
  \item Try re-running the above experiment(s) with different values for {\tt N}, such as 100 and 1,000,000.  What changes?
  \item Try re-running the above experiment(s) with different values for {\tt n}, such as 5 and 1,000.  What changes?
\end{itemize}

\section{$t$-tests and confidence intervals}

Carrying out one-sample $t$-tests and obtaining the corresponding confidence intervals is fairly straightforward in \R using the function {\tt t.test}.  To illustrate with the cystic fibrosis study we discussed in class:

<<Cystic_fibrosis_study, opts.label="fig">>=
cf <- read.delim("http://myweb.uiowa.edu/pbreheny/data/cystic-fibrosis.txt")
Diff <- cf$Placebo - cf$Drug
hist(Diff, col="gray", border="white", las=1)
t.test(Diff)
@

As we saw in class, the $p$-value for testing the null hypothesis that amiloride has no effect on the average reduction in FVC is 0.04, and the confidence interval for the average difference is $[8, 265]$.  As we see from the histogram, the distribution of differences doesn't look exactly normal, but with only 14 observations, it's difficult to tell.

\section{Robustness}

As we said in class, the derivation of the $t$ distribution (and, therefore, its resulting confidence intervals) is based on an assumption of normality.  How well do these intervals work when the data is not normally distributed?  Or to put it a different way, how robust are these results to departures from normality?  As we did earlier, let's investigate how well this works by resampling the NHANES triglyceride data.  Recall that the underlying distribution here is somewhat skewed to the right.  Does this cause problems for the $t$ intervals?

<<NHANES_sim_mean, opts.label="fig">>=
N <- 10000          ## Number of simulations to run
n <- 25             ## Sample size
covered <- numeric(N)
for (i in 1:N) {
  sam <- sample(pop, n)
  covered[i] <- t.test(sam, mu=mean(pop))$p.value > 0.05
}
mean(covered)
@

The coverage is slightly under 95\%, but the $t$ intervals aren't doing too bad here.  What if we lower the sample size to 15?  To 5?

Now let's look at the variance.  As we discussed in class, there is a nice pivotal quantity for the normal distribution that allows us to obtain confidence intervals for the variance.  How about its coverage in this situation?  Unfortunately there is no standard \R function for constructing this interval, but it's pretty easy to code:

<<NHANES_sim_variance, opts.label="fig">>=
N <- 10000          ## Number of simulations to run
n <- 25             ## Sample size
covered <- numeric(N)
for (i in 1:N) {
  sam <- sample(pop, n)
  ci <- var(sam)*(n-1)/qchisq(c(0.975,0.025), n-1)
  covered[i] <- (ci[1] < var(pop)) & (var(pop) < ci[2])
}
mean(covered)
@

The coverage here is far worse than we had for the mean.  As we said in class, the central limit theorem works for {\em any} distribution.  For that reason, the $t$ test is often a reasonable approximation even if the data departs somewhat from the normal distribution.  There is no corresponding theorem for the variance: the pivotal CI works for the normal distribution and is not guaranteed to be accurate -- even approximately -- for other distributions.

This is a specific instance of an important general point in statistics: even though both of the above confidence interval methods are derived based on the assumption of normality, the two methods are not equally sensitive to that assumption.  The $t$ test is quite robust to departures from normality, while the variance interval is very sensitive to such departures.  This is one reason why simulations are so important and useful in statistics: to examine the performance of various methods when assumptions are not met (this is typically difficult to address analytically).

\end{document}