STAT 420: Homework 12
Spring 2019, R. Zhu
Due: Tuesday, April 23 by 11:30 PM CT
Contents
Directions
Assignment
  Exercise 1 (Simulating Wald and Likelihood Ratio Tests)
  Exercise 2 (Surviving the Titanic)
  Exercise 3 (Breast Cancer Detection)
Directions
Students are encouraged to work together on homework. However, sharing, copying or providing any part of
a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation
will be punished as severely as possible.
Final submissions must be uploaded to our Compass 2g site on the Homework page. No email, hardcopy, or
late submissions will be accepted.
Your assignment must be submitted through the submission link on Compass 2g. You have two
options to submit your file:
– [PDF format] Attach two files:
    Your RMarkdown file, saved as hw12_yourNetID.Rmd. For example, hw12_rqzhu.Rmd.
    The result of knitting your RMarkdown file, saved as hw12_yourNetID.pdf. For example,
    hw12_rqzhu.pdf.
– [HTML format] Attach one .zip file, named hw12_yourNetID.zip, which contains:
    Your RMarkdown file, saved as hw12_yourNetID.Rmd.
    The result of knitting your RMarkdown file, saved as hw12_yourNetID.html.
Your resulting .html/.pdf file will be considered a “report” which is the material that will determine
the majority of your grade. Be sure to visibly include all R code and output that is relevant to answering
the exercises. (You do not need to include irrelevant code you tried that resulted in an error or did not
answer the question correctly.)
You are granted an unlimited number of submissions, but only the last submission before the deadline
will be viewed and graded.
If you use this .Rmd file as a template, be sure to remove the directions section. Consider removing
eval = FALSE from any code chunks provided in the template, if you would like to run that code as
part of your assignment.
Your .Rmd file should be written such that, if it is placed in a folder with any data you are asked to
import, it will knit properly without modification.
Unless otherwise stated, you may use R for each of the exercises.
Be sure to read each exercise carefully!
Include your Name and NetID in the final document, not only in your filenames.
Assignment
Exercise 1 (Simulating Wald and Likelihood Ratio Tests)
In this exercise we will investigate the distributions of hypothesis tests for logistic regression. For this exercise,
we will use the following predictors.
sample_size = 150
set.seed(114)
x1 = rnorm(n = sample_size)
x2 = rnorm(n = sample_size)
x3 = rnorm(n = sample_size)
Recall that
p(x) = P[Y = 1 | X = x]
Consider the true model
(a) To investigate the distributions, simulate from this model 2500 times. To do so, calculate
P[Y = 1 | X = x]
for an observation, and then make a random draw from a Bernoulli distribution with that success probability.
(Note that a Bernoulli distribution is a Binomial distribution with parameter n = 1. There is no dedicated
function in R for a Bernoulli distribution.)
Each time, fit the model:
Store the test statistics for two tests:
The Wald test for H0 : β2 = 0, which we say follows a standard normal distribution for “large” samples
The likelihood ratio test for H0 : β2 = β3 = 0, which we say follows a χ² distribution (with some degrees
of freedom) for “large” samples
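A single replicate of the simulation in (a) might look like the sketch below. The values of beta_0 and beta_1 are placeholder assumptions, because the coefficients of the true model are not reproduced in this text, and the fitted model is assumed to be the logistic regression on x1, x2, and x3 implied by the hypotheses on β2 and β3. Repeating the lines from the rbinom() call onward 2500 times, and storing both statistics each time, would give the empirical distributions.

```r
# Placeholder coefficients for the true model (assumption, not from the assignment text)
beta_0 = 0.4
beta_1 = -0.35

sample_size = 150
set.seed(114)
x1 = rnorm(n = sample_size)
x2 = rnorm(n = sample_size)
x3 = rnorm(n = sample_size)

# Success probability for each observation under the assumed true model
eta = beta_0 + beta_1 * x1
p = 1 / (1 + exp(-eta))

# One Bernoulli draw per observation (a Binomial draw with size = 1)
y = rbinom(n = sample_size, size = 1, prob = p)

# Full model, and the reduced model implied by H0: beta_2 = beta_3 = 0
fit_full = glm(y ~ x1 + x2 + x3, family = binomial)
fit_null = glm(y ~ x1, family = binomial)

# Wald statistic for x2: the "z value" entry of the coefficient table
wald_stat = coef(summary(fit_full))["x2", "z value"]

# Likelihood ratio statistic: drop in deviance between the nested fits
lrt_stat = deviance(fit_null) - deviance(fit_full)
```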
(b) Plot a histogram of the empirical values for the Wald test statistic. Overlay the density of the true
distribution assuming a large sample.
(c) Use the empirical results for the Wald test statistic to estimate the probability of observing a test statistic
larger than 1. Also report this probability using the true distribution of the test statistic assuming a large
sample.
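The comparison in (b) and (c) can be sketched as follows. The vector wald_stats here is a stand-in generated with rnorm() so that the example runs on its own; in the exercise it would hold the 2500 stored Wald statistics instead.

```r
set.seed(1)
# Stand-in for the stored Wald statistics (assumption: replace with simulation output)
wald_stats = rnorm(2500)

# Histogram on the density scale so the normal curve is directly comparable
hist(wald_stats, freq = FALSE, breaks = 30,
     main = "Empirical Wald statistics", xlab = "z")
curve(dnorm(x), add = TRUE, lwd = 2)

# Empirical estimate of P[Z > 1] and the large-sample (standard normal) value
emp_prob = mean(wald_stats > 1)
true_prob = pnorm(1, lower.tail = FALSE)
```

Setting freq = FALSE matters here: with counts on the vertical axis, the density curve would sit almost flat along the bottom of the plot.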
(d) Plot a histogram of the empirical values for the likelihood ratio test statistic. Overlay the density of the
true distribution assuming a large sample.
(e) Use the empirical results for the likelihood ratio test statistic to estimate the probability of observing
a test statistic larger than 5. Also report this probability using the true distribution of the test statistic
assuming a large sample.
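Parts (d) and (e) follow the same pattern with the χ² distribution. The sketch below assumes 2 degrees of freedom, one for each parameter set to zero under H0 : β2 = β3 = 0, and again uses a stand-in vector (lrt_stats, generated with rchisq()) in place of the stored simulation results.

```r
set.seed(2)
# Stand-in for the stored likelihood ratio statistics (assumption)
lrt_stats = rchisq(2500, df = 2)

# Histogram on the density scale with the chi-square density overlaid
hist(lrt_stats, freq = FALSE, breaks = 40,
     main = "Empirical LRT statistics", xlab = "Test statistic")
curve(dchisq(x, df = 2), add = TRUE, lwd = 2)

# Empirical estimate of P[X > 5] and the large-sample chi-square value
emp_prob_lrt = mean(lrt_stats > 5)
true_prob_lrt = pchisq(5, df = 2, lower.tail = FALSE)
```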
(f) Repeat (a)-(e) but with simulation using a smaller sample size of 10. Based on these results, is this
sample size large enough to use the standard normal and χ² distributions in this situation? Explain.
sample_size = 10
set.seed(114)
x1 = rnorm(n = sample_size)
x2 = rnorm(n = sample_size)
x3 = rnorm(n = sample_size)
Exercise 2 (Surviving the Titanic)
For this exercise use the ptitanic data from the rpart.plot package. (The rpart.plot package depends on
the rpart package.) Use ?rpart.plot::ptitanic to learn about this dataset. We will use logistic regression
to help predict which passengers aboard the Titanic will survive based on various attributes.
# install.packages("rpart")       # run once if the package is not installed
# install.packages("rpart.plot")  # run once if the package is not installed
library(rpart)
library(rpart.plot)
data("ptitanic")
For simplicity, we will remove any observations with missing data. Additionally, we will create a test and
train dataset.
ptitanic = na.omit(ptitanic)
set.seed(114)
trn_idx = sample(nrow(ptitanic), 300)
ptitanic_trn = ptitanic[trn_idx, ]
ptitanic_tst = ptitanic[-trn_idx, ]
(a) Consider the model
log( p(x) / (1 − p(x)) ) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x3x4
where
p(x) = P[Y = 1 | X = x]
is the probability that a certain passenger survives given their attributes and
x1 is a dummy variable that takes the value 1 if a passenger was 2nd class.
x2 is a dummy variable that takes the value 1 if a passenger was 3rd class.
x3 is a dummy variable that takes the value 1 if a passenger was male.
x4 is the age in years of a passenger.
Fit this model to the training data and report its deviance.
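A model of this form can be fit with glm(), as in the sketch below. To keep the example self-contained it uses a small simulated data frame named toy (a hypothetical stand-in, not ptitanic) whose columns mimic a class factor, a sex factor, and a numeric age. R creates the dummy variables from the factors automatically, using the first level (here "1st" and "female") as the reference.

```r
set.seed(42)
n = 200
# Hypothetical stand-in data; column names and coefficients are illustrative only
toy = data.frame(
  pclass = factor(sample(c("1st", "2nd", "3rd"), n, replace = TRUE)),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = round(runif(n, 1, 70))
)
eta = 1 - 0.8 * (toy$pclass == "3rd") - 1.5 * (toy$sex == "male") - 0.01 * toy$age
toy$survived = factor(
  ifelse(rbinom(n, 1, plogis(eta)) == 1, "survived", "died"),
  levels = c("died", "survived")  # first level is treated as "failure" by glm()
)

# Two class dummies, a sex dummy, age, and the sex-by-age interaction
fit = glm(survived ~ pclass + sex + age + sex:age, data = toy, family = binomial)

# Residual deviance of the fitted model
model_dev = deviance(fit)
```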
(b) Use the model fit in (a) and an appropriate statistical test to determine if class played a significant role
in surviving on the Titanic. Use α = 0.01. Report:
The null hypothesis of the test
The test statistic of the test
The p-value of the test
A statistical decision
A practical conclusion
(c) Use the model fit in (a) and an appropriate statistical test to determine if an interaction between age
and sex played a significant role in surviving on the Titanic. Use α = 0.01. Report:
The null hypothesis of the test
The test statistic of the test
The p-value of the test
A statistical decision
A practical conclusion
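The two kinds of test used in (b) and (c) can be sketched as below, again on a simulated stand-in data frame (toy, hypothetical) rather than the Titanic data. Dropping a factor with three levels removes two parameters at once, so a likelihood ratio test between nested fits is a natural choice; a single coefficient such as an interaction term can be read from the Wald row of the coefficient table.

```r
set.seed(42)
n = 300
# Hypothetical stand-in data; names and coefficients are illustrative only
toy = data.frame(
  pclass = factor(sample(c("1st", "2nd", "3rd"), n, replace = TRUE)),
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  age    = round(runif(n, 1, 70))
)
eta = 1 - 1.2 * (toy$pclass == "3rd") - 1.5 * (toy$sex == "male") - 0.01 * toy$age
toy$survived = rbinom(n, 1, plogis(eta))

fit_full     = glm(survived ~ pclass + sex + age + sex:age, data = toy, family = binomial)
fit_no_class = glm(survived ~ sex + age + sex:age, data = toy, family = binomial)

# Likelihood ratio test of H0: both class coefficients are zero (2 df)
lrt = anova(fit_no_class, fit_full, test = "LRT")
lrt_stat = lrt$Deviance[2]
lrt_pval = lrt$`Pr(>Chi)`[2]

# Wald test for the single interaction coefficient
wald_row = coef(summary(fit_full))["sexmale:age", ]
```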
(d) Use the model fit in (a) as a classifier that seeks to minimize the misclassification rate. Classify each of
the passengers in the test dataset. Report the misclassification rate, the sensitivity, and the specificity of this
classifier. (Use survived as the positive class.)
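A classifier that minimizes the misclassification rate predicts the positive class whenever the estimated probability exceeds 0.5. The helper below, calc_metrics, is a hypothetical name introduced for illustration; it computes the three requested quantities from a vector of actual labels and a vector of predicted probabilities, shown here on six made-up observations.

```r
# Hypothetical helper: misclassification rate, sensitivity, and specificity
calc_metrics = function(actual, prob, cutoff = 0.5,
                        positive = "survived", negative = "died") {
  predicted = ifelse(prob > cutoff, positive, negative)
  tp = sum(predicted == positive & actual == positive)  # true positives
  tn = sum(predicted == negative & actual == negative)  # true negatives
  fp = sum(predicted == positive & actual == negative)  # false positives
  fn = sum(predicted == negative & actual == positive)  # false negatives
  c(misclass    = (fp + fn) / length(actual),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Made-up labels and probabilities, for illustration only
actual = c("survived", "survived", "died", "died", "survived", "died")
prob   = c(0.9, 0.4, 0.2, 0.7, 0.8, 0.1)
metrics = calc_metrics(actual, prob)
```

With a fitted glm, the probabilities would come from predict(fit, newdata = ..., type = "response").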
Exercise 3 (Breast Cancer Detection)
For this exercise we will use data found in wisc-train.csv and wisc-test.csv, which contain train and test
data, respectively. wisc.csv is provided but not used. This is a modification of the Breast Cancer Wisconsin
(Diagnostic) dataset from the UCI Machine Learning Repository. Only the first 10 feature variables have
been provided. (And these are all you should use.)
UCI Page
Data Detail
You should consider coercing the response to be a factor variable if it is not stored as one after importing the
data.
(a) The response variable class has two levels: M if a tumor is malignant, and B if a tumor is benign. Fit
three models to the training data.
An additive model that uses radius, smoothness, and texture as predictors
An additive model that uses all available predictors
A model chosen via backwards selection using AIC. Use a model that considers all available predictors
as well as their two-way interactions for the start of the search.
For each, obtain a 5-fold cross-validated misclassification rate using the model as a classifier that seeks to
minimize the misclassification rate. Based on this, which model is best? Relative to the best, are the other
two underfitting or overfitting? Report the test misclassification rate for the model you picked as the best.
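One way to obtain a cross-validated misclassification rate is cv.glm() from the boot package with a custom cost function, sketched below on simulated stand-in data (toy, hypothetical; the real predictors and files are not used here). The cost function receives the observed 0/1 responses and the predicted probabilities and returns the proportion misclassified at a cutoff of 0.5.

```r
library(boot)

set.seed(420)
n = 200
# Hypothetical stand-in data; names and coefficients are illustrative only
toy = data.frame(radius = rnorm(n), texture = rnorm(n))
toy$class = factor(ifelse(rbinom(n, 1, plogis(2 * toy$radius)) == 1, "M", "B"))

# Levels sort as B, M, so glm() models the probability of M
fit = glm(class ~ radius + texture, data = toy, family = binomial)

# Proportion misclassified when predicting M for probabilities above 0.5
misclass_cost = function(y, p_hat) mean(y != as.numeric(p_hat > 0.5))

# 5-fold cross-validated misclassification rate (first element of delta)
cv_err = cv.glm(toy, fit, cost = misclass_cost, K = 5)$delta[1]
```

Because the folds are assigned at random, setting a seed before each cv.glm() call keeps the comparison between models reproducible.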
(b) In this situation, simply minimizing misclassifications might be a bad goal since false positives and false
negatives carry very different consequences. Consider the M class as the “positive” label. Consider each of the
probabilities stored in cutoffs in the creation of a classifier using the additive model fit in (a).
cutoffs = seq(0.01, 0.99, by = 0.01)
That is, consider each of the values stored in cutoffs as c. Obtain the sensitivity and specificity in the test
set for each of these classifiers. Using a single graphic, plot both sensitivity and specificity as a function of
the cutoff used to create the classifier. Based on this plot, which cutoff would you use? (0 and 1 have not
been considered for coding simplicity. If you like, you can instead
