\documentclass{article}

\usepackage{amsmath}
\usepackage{fullpage}
\usepackage{url}
\usepackage{xspace}
\newcommand{\R}{\texttt{R}\xspace}

\title{Summary statistics and graphics: \\ Categorical Data}
\author{Patrick Breheny}
\date{September 1, 2016}

<<include=FALSE, purl=FALSE>>=
require(knitr)
opts_chunk$set(prompt = TRUE)
opts_chunk$set(comment = NA)
opts_template$set(fig = list(fig.height=5, fig.width=5, out.width='.5\\linewidth', fig.align='center'))
@

\begin{document}
\maketitle

The next two labs will cover summary statistics and statistical graphics.  Today's lab will focus on categorical data; next week's will address continuous data.  Our data set for today is the {\tt titanic} data set, which has 4 variables, all of which are categorical:

<<Read_in_data>>=
titanic <- read.delim("http://myweb.uiowa.edu/pbreheny/data/titanic.txt")
head(titanic)
nrow(titanic)
ncol(titanic)
dim(titanic)
@

\section{Tables}

\subsection{One-way tables}

As we discussed in lecture, the basic summary statistic for categorical data is the {\em count}.  For example, we might want to know the number of 1st/2nd/3rd class passengers and crew aboard the ship.  This is most easily accomplished using the {\tt table} function:

<<table>>=
table(titanic$Class)
@

An alternative way of expressing this information is the fraction of all people on board who fell into each of these categories.  This can be done either directly, by dividing the counts by the total number of people, or automatically with the {\tt prop.table} function:

<<Proportions>>=
tab <- table(titanic$Class)
tab/nrow(titanic)
prop.table(tab)
@

Yet another way is as a percentage, or rate per 100 passengers:

<<Percentages>>=
round(100*prop.table(tab), 1)
@
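
As a quick check on how counts, proportions, and percentages relate, the following sketch repeats the calculation by hand.  It assumes that the {\tt Titanic} table that ships with \R (in the {\tt datasets} package) holds the same counts as {\tt titanic.txt}; the object names are made up for this illustration.

<<Builtin_one_way_check>>=
## Assumption: the built-in Titanic table stands in for titanic.txt
class_counts <- margin.table(Titanic, 1)         ## Counts by Class
class_counts
class_props <- class_counts / sum(class_counts)  ## Proportions by hand
all.equal(as.vector(class_props), as.vector(prop.table(class_counts)))
sum(class_props)                                 ## Proportions sum to 1
round(100 * class_props, 1)                      ## Percentages
@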

\subsection{Two- and three-way tables}

The above approaches allow us to look at one variable at a time.  If we want to look at multiple variables at the same time, we need to construct multi-way tables.  This is easily accomplished by adding more variables to the {\tt table} function:

<<Two_way_table_of_class_and_survival>>=
with(titanic, table(Class, Survived))
@

Fractions are still interesting -- perhaps even more so -- in a multi-way table, but there are several ways to calculate them: (a) as a fraction of all people on board, (b) as a fraction of the people in that row ({\tt Class}, in the above example), and (c) as a fraction of the people in that column ({\tt Survived}, in the above example).  All of these can be calculated with {\tt prop.table}:

<<Two_way_proportions>>=
tab <- with(titanic, table(Class, Survived))
prop.table(tab)    ## Overall proportion
prop.table(tab, 1) ## Row-wise proportion
prop.table(tab, 2) ## Column-wise proportion
@

In other words, 24\% of people on board were 3rd class passengers who died, 75\% of the 3rd class passengers died, and 45\% of the people who died were members of the crew.
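
The three percentages differ only in their denominators.  A minimal sketch of the arithmetic, again assuming the built-in {\tt Titanic} table matches {\tt titanic.txt} (the name {\tt cs} is hypothetical):

<<Three_denominators>>=
## Assumption: the built-in Titanic table stands in for titanic.txt
cs <- margin.table(Titanic, c(1, 4))   ## Class x Survived counts
cs["3rd", "No"] / sum(cs)              ## (a) out of everyone on board
cs["3rd", "No"] / sum(cs["3rd", ])     ## (b) out of 3rd class passengers
cs["Crew", "No"] / sum(cs[, "No"])     ## (c) out of everyone who died
@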

The same logic can be extended to higher-way tables as well:

<<Three_way_table_of_class_survival_and_sex>>=
tab <- with(titanic, table(Class, Survived, Sex))
tab
prop.table(tab, c(1,3))
@

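The \verb|c(1,3)| argument tells \R to calculate proportions separately within each combination of {\tt Class} and {\tt Sex} (dimensions 1 and 3 of the table), so that the proportions in each such group sum to 1 across the levels of {\tt Survived}.  The output tells us, for instance, that 97\% of female 1st class passengers survived (141/145).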
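
One way to confirm what the margin argument does is to sum the resulting proportions over the remaining dimension.  The sketch below assumes the built-in {\tt Titanic} table matches {\tt titanic.txt}; {\tt csx} and {\tt p} are names invented for the illustration.

<<Margins_check>>=
## Assumption: the built-in Titanic table stands in for titanic.txt
csx <- margin.table(Titanic, c(1, 4, 2))  ## Class x Survived x Sex
p <- prop.table(csx, c(1, 3))
p["1st", "Yes", "Female"]                 ## 141/145
apply(p, c(1, 3), sum)                    ## Each Class/Sex group sums to 1
@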

\subsection{Questions}

\begin{itemize}
\item How many 2nd class male passengers died?
\item How many children were in second class?
\item What fraction of 3rd class children survived?
\item What fraction of adults who survived were crew members?
\end{itemize}

\section{Graphs}

As the examples above make clear, multi-way tables are informative but quickly become cumbersome.  Graphs are often superior to tables at conveying information quickly.

\subsection{Basic bar plots}

The basic plot for categorical data is the bar plot, which is pretty self-explanatory:

<<barplot, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
tab <- table(titanic$Class)
barplot(tab)
@

An extension of the bar chart that allows us to plot two variables at once is the {\em stacked bar plot}:

<<stacked_barplot, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
tab <- with(titanic, table(Survived, Class))
barplot(tab, legend=TRUE, args.legend=list(x="topleft"))
@

With this plot, we can see that crew and 3rd class passengers were much more plentiful on the ship than 1st and 2nd class passengers, and also that a higher percentage of 1st class passengers survived than of any other group.
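
Survival percentages can be hard to judge when the bars have different heights.  One possible variation, sketched here under the assumption that the built-in {\tt Titanic} table matches {\tt titanic.txt}, is to plot column-wise proportions so that every bar has height 1.  The name {\tt sc} and the legend placement are choices made for this illustration.

<<stacked_barplot_proportions, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
## Assumption: the built-in Titanic table stands in for titanic.txt
sc <- margin.table(Titanic, c(4, 1))   ## Survived x Class counts
## Column-wise proportions; extra headroom keeps the legend off the bars
barplot(prop.table(sc, 2), ylim=c(0, 1.25), legend=TRUE,
        args.legend=list(x="top", horiz=TRUE))
@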

\subsection{\R Packages}

It is essential to know the basic \R plotting functions; however, more complicated plots are often difficult or tedious to make using standard \R graphics.  To make such plots easier to construct, several individuals have developed {\em packages}.  One of the most widely used is the {\tt lattice} package.  This package is installed by default when you install \R, but still needs to be loaded with:

<<Loading_the_lattice_package>>=
require(lattice) ## or library(lattice)
@

The lattice equivalent to {\tt barplot} is {\tt barchart}; simple plots are very similar for both functions:

<<lattice_barchart, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
tab <- table(titanic$Class)
barchart(tab)
@

Other than differences in the defaults (which we can change with {\tt horizontal=FALSE, col="gray"}), this is the same plot that we got from {\tt barplot}.
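
For instance, the following sketch applies those two options.  It assumes the built-in {\tt Titanic} table matches {\tt titanic.txt}; {\tt by\_class} is a name invented for the illustration.

<<lattice_barchart_vertical, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
require(lattice)
## Assumption: the built-in Titanic table stands in for titanic.txt
by_class <- margin.table(Titanic, 1)   ## Counts by Class
barchart(by_class, horizontal=FALSE, col="gray")
@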

\subsection{Grouping and conditioning}

So why bother with {\tt lattice}?  The big advantage of lattice is that it allows you to easily create plots that take advantage of grouping and conditioning.  {\em Grouping} is simply the use of an aesthetic property such as color or shape to represent a variable.  We have already seen an example of this with the stacked barplot, but this is a little nicer in {\tt lattice} since the legend doesn't get in the way of the plot:

<<lattice_stacked_barplot, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
tab <- with(titanic, table(Class, Survived))
barchart(tab, auto.key=TRUE)
@

A more significant advantage is {\em conditioning}, which creates multiple small plots (panels) of different subsets of the data, with the subsets determined by the conditioning variables.  For example, the following creates separate panels for males and females:

<<Conditioning, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
tab <- with(titanic, table(Class, Sex, Survived))
barchart(tab, auto.key=TRUE)
@

It is hard to see what is going on in the female panel above because, by default, the scales in each panel are constrained to be equal, and there were a lot more men on board than women.  We can allow the scales to differ in each panel with {\tt scales="free"}:

<<Unequal_scales, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
barchart(tab, auto.key=TRUE, scales="free")
@

\subsection{Questions}

Add {\tt Age} to the above plot and then consider the following questions:

\begin{itemize}
\item The general policy in evacuation was ``women and children first''.  How well did this policy hold up across the various classes?
\item Overall, there was a striking class bias in terms of survival (62\% of 1st class passengers survived compared with only 25\% of 3rd class passengers).  Does this trend hold up once you start making comparisons in smaller groups?  If not, what explains the discrepancy?
\item Overall, a (slightly) higher percentage of crew died than 3rd class passengers.  Does this trend hold up once you start making comparisons in smaller groups?  If not, what explains the discrepancy?
\end{itemize}

\subsection{Installing new packages}

Another popular graphics package is {\tt ggplot2}.  Unlike {\tt lattice}, {\tt ggplot2} must be installed:

<<Install_ggplot2, eval=FALSE>>=
install.packages("ggplot2")
@
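
Because a package only has to be installed once, a script can guard the installation so that it runs only when the package is missing.  The following is one possible sketch, not a required step; the choice of CRAN mirror is an assumption.

<<Install_if_missing, eval=FALSE>>=
## Install ggplot2 only if it is not already available
if (!requireNamespace("ggplot2", quietly=TRUE)) {
  ## Assumption: the cloud mirror is an acceptable CRAN repository
  install.packages("ggplot2", repos="https://cloud.r-project.org")
}
library(ggplot2)   ## Loading is still needed in every session
@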

Once installed, it can be loaded using {\tt require} or {\tt library}.  Note that you need to load packages like {\tt lattice} and {\tt ggplot2} every time you start \R, but you only need to install a package once.  Basic plots in {\tt ggplot2} can be constructed using the {\tt qplot} (for ``quick plot'') function.  To illustrate using the {\tt titanic} data:

<<Plotting_with_ggplot2, message=FALSE, fig.height=4, fig.width=6, out.width='.5\\linewidth', fig.align='center'>>=
require(ggplot2)
qplot(Class, data=titanic, fill=Survived) + facet_grid(Age~Sex, scales="free")
@

Here, {\tt fill=Survived} specifies that the color used to fill in the bars should depend on survival status.  This sets up the basic plot, while \verb|facet_grid| controls the conditioning.  The formula \verb|Age~Sex| lists the conditioning variables and their orientation: the variable to the left of the tilde ({\tt Age}) defines the rows of panels, and the variable to the right ({\tt Sex}) defines the columns.  {\tt scales="free"} means the same thing in {\tt ggplot2} as it did in {\tt lattice}, although note that the scales are not completely free, in that all panels in a row must share the same vertical scale.  We can obtain another interesting plot by adding {\tt position="fill"} to {\tt qplot()}.
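
The {\tt position="fill"} version rescales every bar to height 1, so that the bars show proportions rather than counts.  The sketch below writes it with the full {\tt ggplot()} interface rather than {\tt qplot}, since recent releases of {\tt ggplot2} mark {\tt qplot} as deprecated.  It assumes the built-in {\tt Titanic} table matches {\tt titanic.txt}; {\tt counts} and {\tt people} are names invented for the illustration.

<<ggplot2_position_fill, message=FALSE, fig.height=4, fig.width=6, out.width='.5\\linewidth', fig.align='center'>>=
require(ggplot2)
## Assumption: the built-in Titanic table stands in for titanic.txt
counts <- as.data.frame(Titanic)   ## One row per cell, with a Freq column
## Expand to one row per person, like the titanic data frame
people <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                 c("Class", "Sex", "Age", "Survived")]
ggplot(people, aes(x=Class, fill=Survived)) +
  geom_bar(position="fill") +      ## Bars rescaled to proportions
  facet_grid(Age~Sex) +
  ylab("Proportion")
@

Note that there were no children among the crew, so the {\tt Crew} bar is absent from the child panels.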

Whether you use {\tt lattice}, {\tt ggplot2}, or basic \R graphics is up to you (personally, I use all three, depending on the task at hand).  What is important is to appreciate how much information a plot like the one above communicates.  As the saying goes, a picture is worth a thousand words, and this one captures the relationships among all four variables in the data set.

\end{document}
