\documentclass{article}
\usepackage{amsmath}
\usepackage{fullpage}
\usepackage{url}
\usepackage{xspace}
\newcommand{\R}{\texttt{R}\xspace}
\input{commands}
\title{Introduction to \R}
\author{Patrick Breheny}
\date{August 25, 2016}

<<include=FALSE>>=
require(knitr)
opts_chunk$set(prompt = TRUE)
opts_chunk$set(comment = NA)
opts_template$set(fig = list(fig.height=5, fig.width=5, out.width='.5\\linewidth', fig.align='center'))
@

\begin{document}
\maketitle

Our goal for today is to introduce \R, an open-source computing language designed to allow for fluid, interactive data manipulation, analysis, and visualization.

\R is installed on all computers throughout the College of Public Health Building. If you are interested in installing it at home, go to \url{www.r-project.org}. You can run \R directly, but it is often more convenient to run \R through an integrated development environment; by far the most well-developed of these is RStudio (\url{www.rstudio.com}), which is also (a) installed throughout the building and (b) open-source and easy to install.

Much of the material here is adapted from {\em S Programming}, by Venables and Ripley (2000), an excellent book on the \texttt{S} and \texttt{R} languages that goes into far more detail than I do here.

\section{\R objects}

Commands in \R are either {\em expressions}, which are evaluated and printed, or {\em assignments}, which store the result of an evaluation as an object. Arithmetic operations work, for the most part, much as they do on any calculator (note that \verb|#| marks the rest of the line as a comment):

<<>>=
(5^2)*(10-8)/3 + 1 ## An expression
x <- (5^2)*(10-8)/3 + 1 ## An assignment
x
@

Note that the value of the expression is now stored in an {\em object} called \texttt{x}. This allows us to use it again in further calculations:

<<>>=
x+1
n <- 50
x/n
@

Objects can be named using any combination of upper- and lower-case letters, the digits 0--9, the period, and the underscore, provided the name begins with a letter (or with a period that is not followed by a digit).
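As a brief illustration of these naming rules (the object names below are hypothetical, chosen only for this sketch), each of the following is a valid assignment, whereas a name that begins with a digit, such as \texttt{2x}, would be a syntax error:

<<>>=
bill.total <- 20                 ## Period in the name
tip_rate <- 0.15                 ## Underscore in the name
Tip2 <- bill.total * tip_rate    ## Digit, but not in the initial position
Tip2
@
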
Note that \R is case sensitive (\texttt{x} and \texttt{X} refer to two different objects).

All objects in \R have a {\em class}, which describes the kind of thing that is stored in the object. For instance,

<<>>=
class(x)
@

tells us that \texttt{x} is storing a numeric object at the moment.

\subsection{Functions}

\R is said to be a functional language, meaning that it is built around calling functions to accomplish tasks:

<<>>=
x <- 1:9 ## Creates a vector of numbers 1, 2, ..., 9
mean(x)
median(x)
sd(x)
min(x)
sum(x)
x^2
sum(x^2)
@

To get more information about any function in \R, just type \texttt{help('sd')} or, more compactly, \texttt{?sd}. To search the help files for pages mentioning, say, regression, type \texttt{help.search("regression")} or \texttt{??regression}.

Over time in this course, we'll see a number of functions in \R and how they are used. Functions typically have a number of options, each of which may either be specified or left at its default value:

<<>>=
x <- runif(1000)
y <- runif(1000, min=10, max=20)
hist(x)
hist(y, col="gray", border="white", breaks=40)
@

\subsection{Vectors and lists}

We have already seen several {\em vectors} in \R; a vector is a set of elements that all share the same type. Vectors are typically either numeric, character, or logical:

<<>>=
x <- runif(10) ## A numeric vector
y <- letters[1:10] ## A character vector
z <- x > 0.5 ## A logical vector
x
y
z
@

As mentioned earlier, a vector cannot contain elements of different types. We can combine vectors with {\tt c} (for concatenate), but if we try to combine, say, character and numeric vectors, problems may arise:

<<>>=
a <- c(x, y) ## Probably not what you want
a[1] + a[2]
@

So is there a way to combine elements of different types into a single object? Yes, this is what a {\em list} is for.
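A minimal sketch of the difference, using small made-up vectors (\texttt{num}, \texttt{chr}, \texttt{combined}, and \texttt{kept} are names assumed for this illustration only): {\tt c} silently converts the numbers to character strings, which is why arithmetic on the combined vector fails, whereas {\tt list} keeps each element's original type:

<<>>=
num <- c(1.5, 2.5)
chr <- c("a", "b")
combined <- c(num, chr)           ## Everything is coerced to character
class(combined)
kept <- list(num=num, chr=chr)    ## Each element keeps its own type
class(kept$num)
class(kept$chr)
@
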
Elements of a list may be accessed by number or by name, as in the following example:

<<>>=
a <- list(x=x, y=y, z=z)
a$x
a$x[1] + a$x[2]
toupper(a$y[3])
a[[3]]
@

Note that to extract a single element of a list by number, you have to put double brackets around the index, as in {\tt myList[[1]]}. Also, note the use of {\tt \$} as a separator between the name of the list and the name of the element.

\subsection{Data frames}

A {\em data frame} is a special kind of list in \R, in which each element has the same length and thus the list can be structured as a systematic grid of rows and columns. This is the typical structure of a data set: each row represents a separate observation on a collection of variables (which make up the columns). Note that a list really is necessary here, as a data set could easily contain a mix of different types of variables (continuous, categorical, etc.). In a data frame, each column has its own type (i.e., class).

Data can of course come in a wide variety of formats, but in this class, all data sets will be provided as tab-delimited text files. Let's see an example:

<<>>=
## stringsAsFactors=TRUE is needed in R >= 4.0.0 for text columns to be read as factors
tips <- read.delim("http://myweb.uiowa.edu/pbreheny/data/tips.txt", stringsAsFactors=TRUE)
head(tips)
class(tips$TotBill)
class(tips$Sex)
@

\noindent (What's a ``factor'', you ask? We'll cover factors in the next section.) Here, we're reading a data set directly from a web address (obviously, you need an internet connection for this to work). Local file paths can be used as well, either relative to the current working directory (see {\tt getwd}) or as an absolute path.

It is often cumbersome to type \verb|tips$| repeatedly to access the elements of the data frame. There are two ways around this: \texttt{attach} and \texttt{with}.
The former stays in effect until it is undone with \texttt{detach} and thus can sometimes lead to unintended side effects, while the latter acts only temporarily:

<<>>=
with(tips, mean(TotBill))
mean(TotBill) ## Error: TotBill is not visible outside of with()
attach(tips)
mean(TotBill)
detach(tips)
@

Note that we can add columns to the data frame after it has been created:

<<>>=
tips$Rate <- with(tips, Tip/TotBill)
@

\subsection{Factors}

A factor is a special type of vector used to encode the levels of a categorical variable (such as \verb|tips$Sex| above).

<<>>=
table(tips$Sex)
levels(tips$Sex)
barplot(table(tips$Day))
@

It may seem pointless to have \verb|factor| objects as a separate class from \verb|character| objects, but as you will see, it is often very useful. For example, with \verb|factor| objects, you can specify an ordering of the levels:

<<>>=
tips$Day <- factor(tips$Day, levels=c("Thu", "Fri", "Sat", "Sun"))
barplot(table(tips$Day))
@

\section{Indexing}

One of the most common tasks in data analysis is subsetting data -- picking out certain rows or columns to look at more closely.
Because this is such a common task, and because it is often much easier to use one way of subsetting in one circumstance and a different way in another, \R provides five different ways to access elements of a vector, which is extremely convenient:
\begin{itemize}
\item A logical vector: Specifies, for each element, whether or not to include it
\item A vector of positive integers: Lists the elements to include
\item A vector of negative integers: Lists the elements to exclude
\item A vector of names: Lists the elements to include by name
\item Empty: Select all components
\end{itemize}

<<>>=
Students <- c("Monica", "Michael", "Ming", "Mayra", "Sean", "Jeanette",
              "Lydia", "Emily", "Devon", "Evan", "Jinli", "Caitlin")
x <- runif(length(Students), 50, 100)
names(x) <- Students
x[x > 75] ## Logical vector
x[1:3] ## Positive integers
x[-(5:10)] ## Negative integers
x[c("Lydia", "Evan")] ## Names
x[] ## Empty
@

The last option may seem pointless, but it is necessary (among other places) when accessing portions of a matrix or data frame (note that we are specifying a subset of rows, but all the columns):

<<>>=
tips[tips$Tip >= 7, ]
@

\section{And much more}

There is much more to \R than this, of course: a wide variety of character and mathematical operations, probability distributions, add-on packages, how to write your own functions, control structures (e.g., \verb|if| statements and \verb|for| loops), how to conduct simulations, and so on. We'll cover these in future labs.

As with any programming language, the only way to learn \R is to use \R, and you'll certainly grow more familiar with \R and learn many more functions as the semester progresses. Hopefully, this document serves as a useful introduction and reference.

\end{document}