```{r, include = FALSE} library(magrittr) library(kableExtra) ``` # Objectives * Quiz 2 review * Binomial distribution and its functions in R # Quiz Review ## Problem 1 - Summary Statistics Below is a sample of 15 patients' systolic blood pressures prior to having surgery and 15 patients' systolic blood pressures after having surgery: ```{r, echo = FALSE} # boxplot bp <- data.frame(blpr = c(118, 144, 134, 110, 119, 128, 132, 136, 125, 160, 190, 140, 150, 160, 220,100, 70, 90, 75, 110, 115, 108, 50, 111, 110, 60, 80, 70, 40, 60), time = rep(c("before", "after"), each = 15)) boxplot(blpr ~ time, data = bp, horizontal = T, main = "Systolic Blood Pressures", col = "orchid") ``` a) Estimate the median of blood pressure both before and after surgery.
Answer

The median blood pressure before surgery is about 130 and the median blood pressure after surgery is approximately 80.

b) Are there any outliers? If so, identify them and state how they impact the mean and the median?
Answer

The "before surgery" blood pressures has 1 outlier of about 220. This does not affect the median, but it has a large effect on the mean.

c) What percentage of individuals had a systolic blood pressure lower than 80 post surgery?
Answer

About 50% (since the median is at about 80)

d) 75 percent of individuals had a pre-surgery systolic blood pressure lower than what?
Answer

About 160 (the 3rd quartile is at about 160)


## Problem 2 - Histograms ```{r, echo=FALSE} set.seed(3516) dat <- rgamma(300, 8, 1) hist(rgamma(300, 8, 1), main = "Sleeping Habits of College Students", xlab = "Average Time Slept (Hours)", col = "skyblue1", breaks = seq(0, 20, by = 2), xaxt = "n", ylim = c(0, 100)) axis(side = 1, at = seq(0, 20, by = 4)) axis(side = 2, at = seq(0, 100, by = 10)) ``` a) Are the data normally distributed, right-skewed, or left-skewed?
Answer

The data are skewed right (the tail trails off to the right).

b) How does the mean compare to the median in this data?
Answer

The mean will be higher than the median (since the data are right-skewed).

c) Estimate the number of college students that got between 10 and 12 hours of sleep, on average.
Answer

Approximately 40 college students slept between 10 and 12 hours, on average, in this data set. The y-axis denotes the frequency (number of college students) and we can see that the bars of the histogram each contain 2 hours of sleep. This means we can find the number of college students that got between 10 and 12 hours of sleep by looking at the height (frequency) of the 5th bar from the left.


## Problem 3 - Probability Suppose the probability that a potato is a Yukon Gold is 1/3. Suppose the probability that a potato is mashed, given that it was Yukon Gold, is 3/4. Suppose the probability that a potato is mashed, given that it was NOT Yukon Gold, is 1/2. a) What is the probability that a potato is both Yukon Gold AND mashed?
Answer

Multiplication Rule:

$P(Mashed \cap Yukon) = P(Mashed | Yukon)*P(Yukon)$

= (3/4)(1/3) = 1/4

b) What is the probability that a potato is mashed?
Answer

Law of Total Probability & Multiplication Rule:

$P(Mashed) = P(Mashed | Yukon)*P(Yukon) + P(Mashed | Yukon^C)*P(Yukon^C)$

= (3/4)(1/3) + (1/2)(1-1/3) = 7/12

c) What is the probability that a potato is Yukon Gold, given that it is mashed?
Answer

$P(Yukon | Mashed) = \frac{P(Yukon \cap Mashed)}{P(Mashed)}$

= (1/4) / (7/12) = 3/7

d) What is the probability that a potato is Yukon Gold OR mashed?
Answer

Addition Rule:

$P(Yukon \cup Mashed) = P(Yukon) + P(Mashed) - P(Yukon \cap Mashed)$

= (1/3) + (7/12) - (1/4) = 2/3

e) Assuming picking potatoes involves independent events, what is the probability that I pick two Yukon Golds in a row (with replacement)?
Answer

$P(Yukon)*P(Yukon) = (1/3)^2 = 1/9$


## Problem 4 - Correlation and Regression Below is information about a study concerned with the effect that listening to classical music has on students' test scores. Time spent listening to classical music is given in minutes per week and test scores are the percent that students scored on their biostatistics exams. ```{r,echo=FALSE,results='hide'} set.seed(17) classicaltime <- rnorm(20,50,10) testscores <- classicaltime + rnorm(20, 30, 9) mod <- lm(testscores~classicaltime) ``` ```{r,echo=FALSE,results='hold'} cat("Mean Time Listening to Classical Music:", signif(mean(classicaltime),4)) cat("\nMean Test Scores:", signif(mean(testscores),4)) cat("\nStd Dev of Time Listening to Classical Music:", signif(sd(classicaltime), 3)) cat("\nStd Dev of Test Scores:", signif(sd(testscores),3)) cat("\nCorrelation:", signif(cor(classicaltime,testscores),4)) cat("\nRegression Line Slope:", signif(mod$coefficients[[2]])) ``` a) If we have a student who listens to classical music 1 standard deviation less than the average student, how many standard deviations below average would we predict their exam score to be?
Answer

$Z_y = r*Z_x$

$Z_y = 0.3919*-1 = -0.3919$ standard deviations above average (ie 0.3919 below average).

b) If we have a student who scores 3 standard deviations below average on their exam, how much time do we predict that they spend listening to classical music per week?
Answer

$Z_y = r*Z_x$

$Z_y = 0.3919*-3 = -1.1739$ standard deviations above average for music

$y = \overline{y} + SD_Y * Z_Y$

$y = 54.44 + 7.3 * -1.1739 = 45.87$ mins

**Note that the slope does not work here because we are now using exam scores to predict music

c) If a student listens to 5 minutes more classical music per week than average, what is their predicted test score?
Answer **Method 1:**

$Z_x = \frac{x - \overline{x}}{SD_x}$

$Z_x = \frac{5}{7.3} = 0.685$

$Z_y = r*Z_x$

$Z_y = 0.3919*0.685 = 0.268$

$y = \overline{y} + SD_Y * Z_Y$

$y = 80.36 + 9.9*0.268 = 83.017$

**Method 2:**

$y = \overline{y} + \hat{\beta} *(x-\overline{x})$

$y = 80.36 + 0.531768*5 = 83.019$



## Problem 5 - Contingency Table Below is data about a fictitious HIV rapid-diagnostic test. Please fill in the rest of the table and answer the following questions. Assume the prevalence in this population is 0.1%. ```{r, echo = FALSE, comment = "", results = "asis"} tab4 <- data.frame(Disease = c("Present", "Absent", "Total"), Positive = c(2970, 11000, 13970), Negative = c(30,539000, 539030), Total = c(3000, 550000, 553000)) kable(tab4) %>% kable_styling(bootstrap_options = c("striped", "hover")) %>% add_header_above(c(" " = 1, "Test Result" = 2, " " = 1)) ``` a) Find the sensitivity of the test.
Answer Sensitivity = $P(Test+ | D+) = \frac{2970}{3000} = 0.99$
b) Find the specificity of the test.
Answer Specificity = $P(Test- | D-) = \frac{539000}{550000} = 0.98$
c) Find the false positive rate.
Answer $P(Test+ | D-) = 1 - P(Test- | D-) = 1 - 0.98 = 0.02$
d) Find the positive predictive value. $P(D^+|T^+)$
Answer Using Bayes' rule, we find the positive predictive value: $P(D+|Test+) = \frac{P(Test+ | D+)P(D+)}{P(Test+ | D+)P(D+) + P(Test+ | D-)P(D-)} = \frac{(0.99)(0.001)}{(0.99)(0.001) + (0.02)(0.999)} = 0.04721$
e) Find the negative predictive value. $P(D^-|T^-)$
Answer Neg Predictive Value = $P(D-|T-) = \frac{P(Test-| D-)P(D-)}{P(Test- | D-)P(D-) + P(Test- | D+)P(D+)} = \frac{(0.98)(1-0.001)}{(0.98)(1-0.001) + (1-0.99)(0.001)} = 0.999$


# Binomial Distribution (Not on Quiz) ### Binomial Distribution formula From lecture, we know that when there are two possible outcomes (success/failure) in n trials, the number of ways of one event occurring x times is $\frac{n!}{x!(n-x)!}$, which can also be denoted as ${n \choose x}$. We also know that, given independence, the probability of an intersection of events is $p^x(1-p)^{1-x}$. Combining these, we get the formula for the binomial distribution: * $\frac{n!}{x!(n-x)!}p^x(1-p)^{n-x}$ ### Calculating probability using formula Using the information about Yukon Gold potatoes from the Probability Review section, let's find the probability that if 3 potatoes are picked, 2 are Yukon Gold. We can calculate this probability using the formula in R: ```{r} # Setting our values n <- 3 x <- 2 p <- 1/3 # Manually using the binomial formula: factorial(n)/(factorial(x)*factorial(n-x)) * p^x * (1-p)^(n-x) # You can also do this using the 'choose' function: choose(n, x)* p^x * (1-p)^(n-x) ``` We can also use R's built-in 'dbinom' function to answer this question. dbinom takes the following arguments: * x - the number of successes * size - the total number of trials * prob - the probability of success ```{r} dbinom(x = 2, size = 3, prob = 1/3) ``` R's built-in functions can also help us answer other questions of interest. For example, consider our quiz review problem 3 involving Yukon Gold Potatoes. Suppose we pick 10 potatoes and get 5 Yukon Golds. We can find the probability of this event just like we did earlier with the dbinom function: ```{r} # Manually using the binomial formula: factorial(10)/(factorial(5)*factorial(10-5)) * (1/3)^5 * (1-(1/3))^(10-5) # Using the dbinom function: dbinom(x = 5, size = 10, prob = 1/3) ``` ### Calculating p-value manually We may also be interested in finding the probability of seeing an event *as extreme or more extreme* than the one we observed (this corresponds to the p-value). Since the probability of picking a Yukon Gold is 1/3 and we picked a total of 10 potatoes, we would expect to see about (10)(1/3) = 3.33 Yukon Gold potatoes. What we observed (5) is 5 - 3.33 = 1.67 potatoes greater than what we'd expect. So in order to be *as extreme or more extreme*, we are interested in anything greater than or equal to 5, OR anything less than or equal to 1.67 (3.33 - 1.67). Since the data is discrete, it cannot take on decimal values (only whole potatoes), so this is the same thing as $P(x\leq1 \cup x\geq5)$. This is what we called a two-sided hypothesis test ($H_0: p = p_0$ ;$H_1: p \neq p_0$) because we are finding the probability of observing values "as or more extreme" in both directions (greater than or equal to 5 or less than or equal to 1) to compute the p-value. (Note that we round down from 1.67 to 1 instead of rounding to 2. This is because 2 is not more extreme than 1.67, but 1 is, so we use 1.) We can calculate this using pbinom(). The 'pbinom' function finds the probability of being LESS THAN or equal to a value. If we want to find the probability of being GREATER or equal to a number, we tell R to calculate 1-pbinom() of one LESS than what we're interested in. Alternatively, you can using the 'lower.tail' argument to manually tell R that you want the 'greater than' probability. ```{r} # The probability of picking 1 or less potatoes (1 potato or 0 potatoes) pbinom(1, size = 10, prob = 1/3) # The probability of picking 5 or more potatoes (5, 6, 7, 8, 9, or 10 potatoes) # Note that instead of x = 5, we use x = 4 1 - pbinom(4, size = 10, prob = 1/3) # Alternatively, using the lower.tail argument: pbinom(4, size = 10, prob = 1/3, lower.tail = F) # Total of the Extremes (p-value): pbinom(1, size = 10, prob = 1/3) + (1 - pbinom(4, size = 10, prob = 1/3)) ``` Sometimes the pbinom function can be hard to keep track of, especially remembering whether it includes the number you input or not. To avoid this, you can simply add up your desired values using the 'dbinom' function, though it will require a few extra lines of code. ```{r} # The probability of picking 5 or more potatoes (5, 6, 7, 8, 9, or 10 potatoes) (five_or_more <- dbinom(5, size = 10, prob = 1/3) + dbinom(6, size = 10, prob = 1/3) + dbinom(7, size = 10, prob = 1/3) + dbinom(8, size = 10, prob = 1/3) + dbinom(9, size = 10, prob = 1/3) + dbinom(10, size = 10, prob = 1/3)) # The probability of picking 0 or 1 potatoes (zero_or_one <- dbinom(0, size = 10, prob = 1/3) + dbinom(1, size = 10, prob = 1/3)) # P-value is the sum of picking one and picking 5 or more: (p_val <- zero_or_one + five_or_more) # this matches the value from above ``` ### Calculating p-value using 'binom.test' The binom.test function gives us a variety of information. It's core purpose is to test whether the true probability of success is equal to some value, but you can also use it as an easy way to get our p-values from earlier. For example, if we were looking for the p-value of getting a result as extreme or more extreme than our 5 Yukon Gold potatoes out of 10, we can use the following syntax. Note that this function not only gives us a p-value, but also a confidence interval, hypothesis, and sample estimate. ```{r} binom.test(x=5, n=10, p=1/3) ```
### Binomial Practice Problems #### Problem 1 For which of the following scenarios could we apply a binomial distribution: Recall: The binomial distribution has the following characteristics: * There are a specific number of trials (n), each with a binary outcome * The n trials are independent * The probability of success (p) is constant with each trial a) The number of jackpots in 1,000 pulls of a slot machine
Answer Yes
b) The number of people who get sick in a 5-person household
Answer No, since the probability of one person getting sick will likely affect the probability that others will get sick, this fails the 3rd condition.
c) The number of free throws Caitlin Clark makes in 10 attempts
Answer Yes. We assume Caitlin's probability of a successful throw won't change for each throw.
d) The number of questions a student answers correctly on a multiple choice test (they are not randomly guessing).
Answer No, if a student is not randomly guessing, the probability of success for each question changes depending on their level of knowledge for that question.

#### Problem 2 Suppose we have a standard 6 sided die, and we roll the die 10 times. Consider the following questions a) What's the probability that we get at least two rolls of 4?
Answer We can find this probability by taking 1 - the probability that we get strictly less than 2 rolls of 4. ```{r} # Recall that the pbinom function calculates the probability of getting less than or equal to x. Hence, for our first argument we put 1 instead of 2. 1 - pbinom(1, size = 10, prob = 1/6) ```
b) Suppose that we roll five 4's. Using the above techniques, find the p-value. Do you believe that the die is biased?
Answer We can simply doing this using the binom.test function. Note that we get a p-value of 0.015, which means there is a very low chance that we would roll five 4's out of 10 trials. Hence, the die may be biased. ```{r} binom.test(x=5,n=10,p=1/6)$p.value ```