homework 3: logistic regression
- homework is due in Dropbox on Avenue on Weds 26 September.
- your homework should be saved as R code with comments (.R), R markdown (.Rmd), or Sweave (.Rnw).
- rm(list=ls())
- the TA or I should be able to run your code from scratch without any problems.
logistic regression on beetles
- create a plot displaying the data; use stat_sum (with ggplot) or plotrix::sizeplot() so that the graph shows the number of data values at each point. It’s up to you whether to distinguish between series="I" and series="II" in the data.
- use aggregate (base R) or group_by + summarise (dplyr) to compute the proportion killed for each unique dosage value/series combination. Optionally, add another column with the total number of individuals for each dosage value/series combination.
- Create a plot showing these aggregated values; add a smooth line showing the general trend. If you’re feeling ambitious, make the size of the points proportional to the total number of individuals.
- Fit a logistic model including the interaction of the predictors series and log10(dose) to the original (disaggregated) data.
- Explain the meaning of the four parameters in words, as they relate to the expected survival, the effects of dose on survival, and the differences in these quantities between series.
Answers will depend on whether you used treatment or sum-to-zero contrasts.
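As a sketch of the treatment-contrast case (assuming series I is the reference level, and writing \(x_i = \log_{10}(\text{dose}_i)\) and \(s_i = 1\) for series II, \(0\) for series I; these symbols are introduced here for illustration), the model can be written as

\[
\operatorname{logit}(p_i) = \beta_0 + \beta_1 s_i + \beta_2 x_i + \beta_3 s_i x_i ,
\]

where \(p_i\) is the probability of whichever outcome is coded as 1 in the response. Under this parameterization \(\beta_0\) is the log-odds in series I at \(x = 0\) (i.e. a dose of 1), \(\beta_1\) is the difference in that intercept between series II and series I, \(\beta_2\) is the change in log-odds per unit of \(\log_{10}(\text{dose})\) (a ten-fold increase in dose) in series I, and \(\beta_3\) is the difference in that slope between series II and series I. With sum-to-zero contrasts, \(\beta_0\) and \(\beta_2\) would instead be the intercept and slope averaged over the two series, and \(\beta_1\), \(\beta_3\) the deviations of a series from those averages.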
- Test the null hypothesis that the two series have identical dose-response curves. Explain whether you are using a Wald test or a likelihood ratio test, and what that means. Is there evidence that the intercepts differ, the slopes, or neither?
Likelihood ratio test of combination of intercepts and slopes:
(i.e. no evidence that either slopes or intercepts differ)
Wald tests of differences in intercept alone (seriesII), differences in slope alone (seriesII:log10(dosage)):
- Fit a model that uses only log10(dose), ignoring series.
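The two kinds of test can be sketched as follows. This is a minimal illustration on simulated data, not the beetle data: the data frame `sim`, the column name `dead`, and the simulation coefficients are all made up for the example, and both series are simulated from the same curve so the null hypothesis is true by construction.

```r
# Simulated stand-in for the beetle data (NOT the real data set):
# one row per individual, a binary outcome, two series, eight doses.
set.seed(101)
n <- 400
sim <- data.frame(
  series = factor(rep(c("I", "II"), each = n / 2)),
  dosage = rep(c(50, 55, 60, 65, 70, 75, 80, 85), times = n / 8)
)
# Both series share one dose-response curve (arbitrary coefficients)
sim$dead <- rbinom(n, size = 1,
                   prob = plogis(-62 + 34 * log10(sim$dosage)))

full    <- glm(dead ~ series * log10(dosage), family = binomial, data = sim)
reduced <- glm(dead ~ log10(dosage), family = binomial, data = sim)

# Likelihood ratio test: compares the two nested fits, so it tests the
# intercept difference and the slope difference jointly (2 df)
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)

# Wald tests: one coefficient at a time, estimate / std. error
# compared with a standard Normal distribution
wald <- coef(summary(full))[c("seriesII", "seriesII:log10(dosage)"), ]
print(wald)
```

The likelihood ratio test needs both models to be fitted, while the Wald tests use only the full fit and its estimated covariance matrix, which is why they can disagree in small samples.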
We already did this, more or less:
- Compute and compare the Wald, likelihood profile, and bootstrap confidence intervals for the dose effect.
Also very similar. (Bootstrap CIs have an extra stochastic component as well; we might get slightly different answers if we reran everything with a different random-number seed.)
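One way to compute the three intervals, sketched on simulated data rather than the beetle data (the data frame `sim`, the column `dead`, and the number of bootstrap replicates are assumptions made for this example):

```r
# Simulated stand-in for the beetle data (NOT the real data set)
set.seed(101)
n <- 400
sim <- data.frame(
  dosage = rep(c(50, 55, 60, 65, 70, 75, 80, 85), times = n / 8)
)
sim$dead <- rbinom(n, size = 1,
                   prob = plogis(-62 + 34 * log10(sim$dosage)))

fit <- glm(dead ~ log10(dosage), family = binomial, data = sim)
slope <- "log10(dosage)"

# Wald interval: estimate +/- Normal quantile * standard error
wald_ci <- confint.default(fit)[slope, ]

# Profile-likelihood interval (in R versions before 4.4.0 this method
# is supplied by the MASS package, which must be installed)
prof_ci <- confint(fit)[slope, ]

# Nonparametric bootstrap: resample rows with replacement, refit,
# and take percentiles of the refitted slopes
boot_slopes <- replicate(500, {
  idx <- sample(nrow(sim), replace = TRUE)
  coef(glm(dead ~ log10(dosage), family = binomial,
           data = sim[idx, ]))[[slope]]
})
boot_ci <- quantile(boot_slopes, c(0.025, 0.975))

print(rbind(wald = wald_ci, profile = prof_ci, bootstrap = boot_ci))
```

Changing the `set.seed()` value changes the simulated data and the bootstrap draws, which is the stochastic component mentioned above.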
- Compute and display quantile residual-based diagnostics: what do you conclude?
The residual plots look more or less perfect (which they more or less have to, since the Bernoulli conditional distribution necessarily holds for binary data); the quantile lines (red vs. dotted black) agree well, suggesting little bias. (I didn’t use pch="." here; it isn’t really useful unless we have a very large data set.)
- Compute predicted survival probabilities and confidence intervals for the minimum, mean, and maximum log10(dose)
- The LD50 (dose that is expected to kill 50% of individuals) is defined as the point where the log-odds of survival are equal to zero, i.e. \(x_{0.5} = -\beta_0/\beta_1\). Compute the LD50 based on your fit.
- Compute confidence intervals for the LD50 using (1) the delta method and (2) bootstrapping.
Delta method:
Lazy method:
Delta-method and bootstrap results are very similar.
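For reference, a minimal sketch of both calculations on simulated data (not the beetle data; `sim`, `dead`, and the simulation coefficients are made up for the example). The delta method approximates the variance of \(x_{0.5} = -\beta_0/\beta_1\) by \(g^\top V g\), where \(V\) is the estimated covariance matrix of the coefficients and \(g = (-1/\beta_1,\ \beta_0/\beta_1^2)\) is the gradient of \(x_{0.5}\) with respect to \((\beta_0, \beta_1)\).

```r
# Simulated stand-in for the beetle data (NOT the real data set)
set.seed(101)
n <- 400
sim <- data.frame(
  dosage = rep(c(50, 55, 60, 65, 70, 75, 80, 85), times = n / 8)
)
sim$dead <- rbinom(n, size = 1,
                   prob = plogis(-62 + 34 * log10(sim$dosage)))

fit <- glm(dead ~ log10(dosage), family = binomial, data = sim)
b <- coef(fit)
V <- vcov(fit)

# LD50 on the log10(dose) scale, because the predictor is log10(dosage)
ld50 <- -b[[1]] / b[[2]]

# Delta method: gradient of -b0/b1 with respect to (b0, b1)
grad <- c(-1 / b[[2]], b[[1]] / b[[2]]^2)
se_ld50 <- sqrt(drop(t(grad) %*% V %*% grad))
delta_ci <- ld50 + c(-1, 1) * qnorm(0.975) * se_ld50

# Bootstrap: resample rows, refit, recompute the LD50 each time
boot_ld50 <- replicate(500, {
  idx <- sample(nrow(sim), replace = TRUE)
  bb <- coef(glm(dead ~ log10(dosage), family = binomial,
                 data = sim[idx, ]))
  -bb[[1]] / bb[[2]]
})
boot_ci <- quantile(boot_ld50, c(0.025, 0.975))

print(rbind(delta = delta_ci, bootstrap = boot_ci))
# Back-transform to the original dose scale
print(10^ld50)
```

Both intervals here are on the log10(dose) scale; applying `10^` to the endpoints gives intervals on the dose scale.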
Homework 05: Logistic regression
Due: Wednesday, April 21, 11:59pm ET.
Demonstrate a thorough understanding of logistic regression
Practice using logistic regression models to make predictions
General guidelines
For this assignment you must have at least three meaningful commits and all of your code chunks must have informative names.
All code should follow the tidyverse style guidelines, including not exceeding the 80 character limit.
Getting started
- Accept and create your private repository of the assignment at https://classroom.github.com/a/37meAqER
In this assignment you will be working with a dataset containing information on individuals from the Donner party. The Donner party was a group of pioneers traveling to California from Missouri on the Oregon Trail by wagon train. They were trapped in the Sierra Nevada mountains by extremely heavy snowfall during the winter of 1846-1847 and eventually ran out of food supplies. Of the 90 members of the party, only 48 survived. We will use logistic regression to model the probability of survival based on age and sex. Relevant data is contained in donner.csv.
What is the relationship between sex and survival? Effectively visualize the relationship and summarize what you observe in a brief sentence.
What is the relationship between age and survival? Effectively visualize the relationship and summarize what you observe in a brief sentence.
Fit a logistic regression model to predict survival based on sex and age. You do not need to include an interaction. Report the model output in tidy format.
Write out the logistic regression model.
Provide an interpretation of \(e^{\hat{\beta}_0}\) in the context of the problem.
Provide an interpretation of \(e^{\hat{\beta}_\text{age}}\) in the context of the problem.
Provide an interpretation of \(e^{\hat{\beta}_\text{sex}}\) in the context of the problem.
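As a reminder of why exponentiated coefficients are read as odds ratios (a general sketch, assuming a model with no interaction and a 0/1 indicator for sex; which sex is coded 1 depends on the reference level in your fit): holding sex fixed and increasing age by one year gives

\[
\frac{\text{odds}(\text{age} + 1)}{\text{odds}(\text{age})}
= \frac{e^{\hat{\beta}_0 + \hat{\beta}_\text{age}(\text{age} + 1) + \hat{\beta}_\text{sex}\,\text{sex}}}
       {e^{\hat{\beta}_0 + \hat{\beta}_\text{age}\,\text{age} + \hat{\beta}_\text{sex}\,\text{sex}}}
= e^{\hat{\beta}_\text{age}} ,
\]

so \(e^{\hat{\beta}_\text{age}}\) is the multiplicative change in the odds of survival per additional year of age. The same cancellation with age held fixed shows that \(e^{\hat{\beta}_\text{sex}}\) is the ratio of the odds for the non-reference sex to the odds for the reference sex, and \(e^{\hat{\beta}_0}\) is the odds of survival when age is 0 and sex is at its reference level.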
What is the predicted probability of survival for a 60 year old man? For a 20 year old man? For a female newborn?
Create a predicted probability plot showing the effect of age and sex on survival. Comment on what you observe.
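The mechanics of getting predicted probabilities from a fitted model can be sketched in base R as below. The data are simulated, not read from donner.csv: the column names `age`, `sex`, and `survived`, the level labels, and the simulation coefficients are all assumptions for illustration, so the numbers printed say nothing about the real Donner party.

```r
# Simulated stand-in data (NOT donner.csv); column names are hypothetical
set.seed(2021)
n <- 90
donner_sim <- data.frame(
  age = round(runif(n, min = 1, max = 65)),
  sex = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
# Arbitrary coefficients used only to generate example outcomes
eta_sim <- 1.5 - 0.04 * donner_sim$age - 1 * (donner_sim$sex == "Male")
donner_sim$survived <- rbinom(n, size = 1, prob = plogis(eta_sim))

fit <- glm(survived ~ age + sex, family = binomial, data = donner_sim)

# New observations to predict for; factor levels must match the fit
new_people <- data.frame(
  age = c(60, 20, 0),
  sex = factor(c("Male", "Male", "Female"),
               levels = levels(donner_sim$sex))
)

# type = "response" returns probabilities rather than log-odds
new_people$p_hat <- predict(fit, newdata = new_people, type = "response")
print(new_people)

# The same values by hand: inverse logit of the linear predictor
b <- coef(fit)
eta <- b[["(Intercept)"]] + b[["age"]] * new_people$age +
  b[["sexMale"]] * (new_people$sex == "Male")
p_by_hand <- plogis(eta)
```

Predicting over a grid of ages for each sex with the same `predict()` call gives the values needed for a predicted probability plot.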
How young or old must a female member of the Donner party be in order to have a predicted probability of survival greater than 0.75 based on your logistic regression model? Use algebra (not code) to answer.
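The algebra follows this general pattern (a sketch with symbols only; \(s_F\) denotes the value of the sex indicator for females, which is 0 or 1 depending on the reference level in your fit). A predicted probability above 0.75 is equivalent to predicted log-odds above \(\log(0.75/0.25) = \log 3 \approx 1.0986\):

\[
\hat{p} > 0.75
\iff
\hat{\beta}_0 + \hat{\beta}_\text{age}\,\text{age} + \hat{\beta}_\text{sex}\, s_F > \log 3
\iff
\hat{\beta}_\text{age}\,\text{age} > \log 3 - \hat{\beta}_0 - \hat{\beta}_\text{sex}\, s_F .
\]

Dividing both sides by \(\hat{\beta}_\text{age}\) gives a bound on age; the inequality flips direction if \(\hat{\beta}_\text{age}\) is negative, which is what determines whether the answer is a maximum or a minimum age.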
What are some limitations of your model given the data? Answer in a brief paragraph.
Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.
Only upload your PDF document to Gradescope. Before you submit the uploaded document, mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages. Associate the “Overall” section with the first page.