\documentclass{article}

\usepackage{amsmath}
\usepackage{fullpage}
\usepackage{url}
\usepackage{xspace}
\newcommand{\R}{\texttt{R}\xspace}

\title{Summary statistics and graphics: \\ Continuous Data}
\author{Patrick Breheny}
\date{September 8, 2016}

<<include=FALSE, purl=FALSE>>=
require(knitr)
opts_chunk$set(prompt = TRUE)
opts_chunk$set(comment = NA)
@

\begin{document}
\maketitle

In last week's lab, we explored the Titanic data set and how to describe, graph, and explore categorical data.  In this lab, our data set will have both continuous and categorical variables, and we'll learn the tools in \R for describing, graphing, and exploring the distribution of continuous variables as well as relationships between two continuous variables, and between continuous and categorical variables.

The data set we will use comes from a waiter who recorded information about 244 tips he received over a period of a few months working in a restaurant (\texttt{tips.txt}).

<<Read_in_data>>=
tips <- read.delim("http://myweb.uiowa.edu/pbreheny/data/tips.txt")
@

\section{Summary statistics}

Total bill ({\tt TotBill}) is an important continuous variable in our data set.  As we discussed in class, a common approach to summary statistics for continuous variables is the two-number summary mean $\pm$ SD:

<<Mean_and_SD>>=
mean(tips$TotBill)
sd(tips$TotBill)
@
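
To see what {\tt sd} is computing, here is a minimal sketch on a small made-up vector ({\tt x} is hypothetical and not part of the tips data).  \R's {\tt sd} takes the square root of the sum of squared deviations from the mean divided by $n-1$, so the two results below should agree:

<<SD_by_hand>>=
x <- c(10, 12, 15, 20, 43)             # made-up bills, for illustration only
n <- length(x)
sqrt(sum((x - mean(x))^2) / (n - 1))   # SD computed from its definition
sd(x)                                  # built-in function
@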

We can get percentile-based summaries as well:

<<Percentiles>>=
median(tips$TotBill)
IQR(tips$TotBill)
fivenum(tips$TotBill)
quantile(tips$TotBill) ## Nearly the same (fivenum uses hinges, not quartiles)
quantile(tips$TotBill, seq(0,1,.1)) ## By tenths
@
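
The IQR is simply the distance between the 75th and 25th percentiles.  As a quick check, here is a sketch on a small made-up vector ({\tt x} is again hypothetical):

<<IQR_by_hand>>=
x <- c(10, 12, 15, 20, 43)        # made-up bills, for illustration only
q <- quantile(x, c(0.25, 0.75))   # 25th and 75th percentiles
unname(q[2] - q[1])               # their difference
IQR(x)                            # built-in function
@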

Now, these summary statistics are for the entire data set.  We might be interested in summaries for various subsets instead.  This can be accomplished either directly with brackets or by using the {\tt by} function:

<<Subsets>>=
with(tips, mean(TotBill[Time=="Night"]))
with(tips, by(TotBill, Time, mean))
with(tips, by(TotBill, Time, sd))
@

Note the double equal sign (\verb|Time=="Night"|): this tests whether {\tt Time} is equal to the string \verb|"Night"|.  A single equal sign (\verb|Time="Night"|) does not test anything; \R uses it for assignment and for naming arguments, which is not what we want here.  Note also that nighttime meals have a higher average bill, which makes sense -- dinner is usually more expensive than lunch at American restaurants.
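
Group summaries like these can also be produced with {\tt tapply} or {\tt aggregate}, which return the results in a more compact form than {\tt by}.  The sketch below uses a small made-up data frame ({\tt toy} and its values, including the \verb|"Day"| label, are assumptions for illustration); the same calls should work with {\tt tips} in place of {\tt toy}:

<<Subsets_alternatives>>=
toy <- data.frame(TotBill = c(12, 18, 25, 31, 40),
                  Time = c("Day", "Day", "Night", "Night", "Night"))
with(toy, tapply(TotBill, Time, mean))              # named vector of group means
aggregate(TotBill ~ Time, data = toy, FUN = mean)   # same means, as a data frame
@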

\section{Histograms}

What does this look like when we plot it?  Let's make histograms first.  As we've already seen, histograms are created with {\tt hist}.  They can also be constructed in {\tt lattice} using {\tt histogram}:

<<Histograms, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
hist(tips$TotBill, col="gray", border="white", main="", xlab="Total Bill")
require(lattice)
histogram(~TotBill, data=tips, col="gray", border="white", xlab="Total Bill")
@

We can see that most bills were around \$15, but that some were as high as \$50.  As we did last week, we can break this plot down by conditioning:

<<Histograms_with_conditioning, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
histogram(~TotBill|Time, data=tips, col="gray", border="white", xlab="Total Bill")
histogram(~TotBill|Time, data=tips, col="gray", border="white", xlab="Total Bill", layout=c(1,2))
@

Note that dinners tend to be more expensive and more highly variable; this agrees with our numerical summaries from earlier.

\section{Box plots}

Box plots are pretty straightforward (again, here are base graphics and lattice versions):

<<Box_plots, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
boxplot(TotBill~Time, data=tips, ylab="Total bill")
bwplot(TotBill~Time, data=tips, ylab="Total Bill")
@

Once again, dinner bills are a little higher and more spread out than lunch bills.  Note that {\tt lattice} draws a dot for the median instead of a line.

\section{Scatter plots}

Scatter plots (which we'll discuss more in class when we get to regression and correlation) are made using simply {\tt plot} (base graphics) or {\tt xyplot} ({\tt lattice}):

<<Scatter_plots, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
with(tips, plot(TotBill, Tip, xlab="Total bill"))
xyplot(Tip~TotBill, data=tips, xlab="Total Bill")
@

The plot illustrates several trends:
\begin{itemize}
\item As we would expect, there is a positive association between bill and tip
\item There is plenty of variation, however (big tips on small bills, small tips on big bills)
\item There are more points in the lower right of the plot than the upper left -- cheap tippers are more common than generous tippers?
\item There seem to be some horizontal ``stripes'' in the plot -- why?
\end{itemize}

Note that none of the earlier summaries showed these stripes -- all summaries risk concealing features of the distribution, and each plot illustrates something new about the data.

Finally, recall that conditioning helps us see how the relationship between bill and tip differs for different subcategories of dining parties.  For example, let's compare smokers and nonsmokers:

<<Scatter_plots_with_conditioning, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
xyplot(Tip~TotBill|Smoker, data=tips, xlab="Total Bill", pch=19)
@

The relationship between tip and bill seems to be much stronger in the nonsmoking section than in the smoking section.

\section{Tip rate}

The most interesting ``outcome'' in the {\tt tips} data set is probably the tipping rate.  In the United States, tip rates usually vary between 10\% and 20\%, depending on factors such as the quality of the service and the generosity of the customer.  Here, since all tips involve the same waiter and restaurant, it would be reasonable to assume that variations in tipping behavior primarily reflect differences in the customers' attitudes.

To analyze tip rate, however, we first need to calculate it.  You could either create a new variable outside of the {\tt tips} object in \R, or a new column in the {\tt tips} data frame.  I'll create a separate variable called {\tt TipRate}; either way is fine.  Note that I converted it to a percent by multiplying by 100, but you can call it whatever you like and leave it as a fraction if you wish.

<<Tip_rate>>=
TipRate <- with(tips, 100*Tip/TotBill)
@
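
The other option mentioned above, storing the rate as a column of the data frame, might look like the following sketch.  It uses a small made-up data frame ({\tt toy} and its values are hypothetical) so that the result is easy to check by hand; the same assignment should work with {\tt tips} in place of {\tt toy}:

<<Tip_rate_as_column>>=
toy <- data.frame(TotBill = c(10, 20, 40), Tip = c(2, 3, 4))   # made-up values
toy$TipRate <- with(toy, 100*Tip/TotBill)   # new column, in percent
toy
@

Keeping the rate inside the data frame makes it available to functions that take a {\tt data} argument, such as the {\tt lattice} plotting functions used earlier.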

Let's check out what our new variable looks like:

<<Histogram_of_tip_rate, fig.height=4, fig.width=5, out.width='.5\\linewidth', fig.align='center'>>=
hist(TipRate, col="gray", border="white", main="")
hist(TipRate, col="gray", border="white", main="", breaks=seq(0,75,5))
@

As we would expect, most tips are between 10 and 20 percent, although there are certainly exceptions.

\section{Questions}

Simple questions:

\begin{itemize}
\item What percent of tips are above 20\%?
\item What is the average tip rate that this waiter received?
\end{itemize}

A more interesting question is how tip rate varies depending on various other factors.  In addition, there are many interesting questions one can ask about how the other variables relate to each other.  The following list is by no means exhaustive, but contains some questions that I found interesting and looked at.  For each question, think about how you would answer the question graphically as well as what numbers you could report that would summarize the trend.

\begin{itemize}
\item How does tip rate change with total bill?  Do small bills have more variation in tip rate than large bills?  Are people proportionally more generous with smaller bills?
\item Do smokers tip differently than nonsmokers?
\item Suppose that an equal number of men and women dine at the restaurant.  Are men more likely to pick up the check than women?  Does this depend on whether the meal is lunch or dinner?
\item Does tipping behavior change at lunch versus dinner?
\item Does tipping behavior differ by days of the week?
\end{itemize}

There is nothing special about this list; if additional or different questions interest you, feel free to explore them as well.

This is a relatively simple data set, yet it provides a wealth of information about a lot of complicated relationships.  Just think about how much information is contained in more complicated biomedical studies.  Exploring your data to gain an understanding of these relationships is very important!

\end{document}
