- What is predictive modeling?
- Training and validation matrices
- Confusion matrices
- ROC curves
- Classification and Regression Trees (CART)
- Ensemble methods
- Issues and limitations of predictive modeling
2019-10-10
“Predictive modelling uses statistics to predict outcomes. Most often the event one wants to predict is in the future, but predictive modelling can be applied to any type of unknown event, regardless of when it occurred. For example, predictive models are often used to detect crimes and identify suspects, after the crime has taken place.” (Wikipedia)
Data is typically split into two data sets:

- Training set: used to estimate the model parameters.
- Validation (or testing) set: used to estimate how well the model performs on data it has not seen.
Although there are no established guidelines as to the sizes of these data sets, randomly selecting between 20% and 25% of the data to "set aside" for validation is typical. The more data we spend on training, the better the model estimates will (often) be.
Stratified Sampling
It is generally advisable to stratify the selection of the training and validation data sets to ensure they are representative of the total sample.
set.seed(2112)
titanic <- read.csv('data/titanic3.csv', stringsAsFactors = FALSE)
titanic.split <- rsample::initial_split(titanic, strata = 'survived')
titanic.train <- rsample::training(titanic.split)
titanic.valid <- rsample::testing(titanic.split)

calif <- read.table('data/cadata.dat', header = TRUE)
price.quintiles <- quantile(calif$MedianHouseValue, probs = seq(0, 1, 0.2))
calif$cut.prices <- cut(calif$MedianHouseValue, price.quintiles, include.lowest = TRUE)
calif.split <- rsample::initial_split(calif, strata = 'cut.prices')
calif.train <- rsample::training(calif.split)
calif.valid <- rsample::testing(calif.split)
p1 <- ggplot(titanic.train, aes(x = survived, y = (..count..)/sum(..count..))) +
    geom_bar(fill = 'red') + ylab('Percent Training')
p2 <- ggplot(titanic.valid, aes(x = survived, y = (..count..)/sum(..count..))) +
    geom_bar(fill = 'blue') + ylab('Percent Testing')
p1 + p2 + plot_layout(ncol = 2)
ggplot(calif.train, aes(x = MedianHouseValue)) +
    geom_density(color = 'red') +
    geom_density(data = calif.valid, color = 'blue')
We split the data into V distinct blocks, then:

1. Set aside the first block of data and fit a model using the remaining V - 1 blocks.
2. Use that model to predict the held-out block and estimate performance.
3. Repeat, holding out each block in turn, until every block has been predicted once.
We use the average across all models as the final performance for this model.
V is typically 5 or 10, depending on the data size.
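A minimal sketch of this procedure, assuming the titanic.train data frame from above and the rsample package; the model formula matches the logistic regression fit in the next section:

```r
library(rsample)

# Create 10 folds, stratified on the outcome.
folds <- rsample::vfold_cv(titanic.train, v = 10, strata = 'survived')

# For each fold: fit on the analysis set, predict the held-out assessment set.
accuracy <- sapply(folds$splits, function(split) {
    fit <- glm(survived ~ pclass + sex + age + sibsp,
               data = rsample::analysis(split),
               family = binomial(logit))
    holdout <- rsample::assessment(split)
    pred <- predict(fit, newdata = holdout, type = 'response')
    mean((pred > 0.5) == (holdout$survived == 1))  # proportion correct
})

mean(accuracy)  # the average across all V models
```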
titanic.lr <- glm(survived ~ pclass + sex + age + sibsp, data=titanic.train, family=binomial(logit))
summary(titanic.lr)
## 
## Call:
## glm(formula = survived ~ pclass + sex + age + sibsp, family = binomial(logit), 
##     data = titanic.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2247  -0.6838  -0.4459   0.6614   2.4390  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.552446   0.433349  10.505  < 2e-16 ***
## pclass      -1.082424   0.113106  -9.570  < 2e-16 ***
## sexmale     -2.461698   0.174020 -14.146  < 2e-16 ***
## age         -0.032267   0.006705  -4.813 1.49e-06 ***
## sibsp       -0.276552   0.091306  -3.029  0.00245 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1306.01  on 981  degrees of freedom
## Residual deviance:  921.22  on 977  degrees of freedom
## AIC: 931.22
## 
## Number of Fisher Scoring iterations: 4
lr.valid <- predict(titanic.lr, newdata = titanic.valid, type = 'response')
tab <- table(lr.valid > 0.5, titanic.valid$survived) %>% prop.table * 100
tab
##        
##                0        1
##   FALSE 50.76453 10.39755
##   TRUE  11.00917 27.82875
Our total prediction accuracy is 78.6% (as a reminder, 38.2% of passengers survived).
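Since tab is expressed as percentages, the overall accuracy is the sum of its diagonal (the correct predictions for both classes):

```r
# True negatives (50.8%) plus true positives (27.8%) from the table above.
sum(diag(tab))  # ~78.6 percent correctly classified
```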
The goal of CART methods is to find the best predictor in X of some outcome, y. CART methods do this recursively using the following procedure:

1. Find the single predictor and split point that best separates the data into two groups (partitions).
2. Within each resulting partition, repeat step 1.
3. Stop when a stopping criterion is reached.
There are a number of possible stopping criteria, including:

- Only one data point remains in a node.
- All data points in a node have the same outcome value.
- A pre-specified minimum node size or maximum tree depth has been reached.
Consider a scatter plot of an outcome y against a predictor: a regression tree partitions the predictor space into regions (the leaves of the tree), and within each leaf predicts the mean of the observations that fall in it.
The sum of squared errors for a tree T is:
\[S = \sum_{c \in \text{leaves}(T)} \sum_{i \in c} (y_i - m_c)^2\]
Where \(m_c = \frac{1}{n_c} \sum_{i \in c} y_i\) is the prediction for leaf \(c\).
Or, alternatively written as:
\[S = \sum_{c \in \text{leaves}(T)} n_c V_c\]
Where \(V_c\) is the within-leaf variance of leaf \(c\).
Our goal then is to find splits that minimize S.
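A toy illustration of the two equivalent formulations of S; the data and the single split here are made up for demonstration:

```r
# Six observations assigned to two leaves by a single split.
y <- c(1, 2, 3, 10, 11, 12)
leaf <- c(1, 1, 1, 2, 2, 2)

m_c <- tapply(y, leaf, mean)               # per-leaf predictions, m_c
S <- sum((y - m_c[leaf])^2)                # sum of squared errors, S

n_c <- as.vector(table(leaf))              # observations per leaf, n_c
V_c <- tapply(y, leaf, function(v) mean((v - mean(v))^2))  # within-leaf variance, V_c
all.equal(S, sum(n_c * V_c))               # TRUE: the two formulations agree
```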
In this example we will predict the median California house price from the house’s longitude and latitude.
str(calif)
## 'data.frame':    20640 obs. of  10 variables:
##  $ MedianHouseValue: num  452600 358500 352100 341300 342200 ...
##  $ MedianIncome    : num  8.33 8.3 7.26 5.64 3.85 ...
##  $ MedianHouseAge  : num  41 21 52 52 52 52 52 52 42 52 ...
##  $ TotalRooms      : num  880 7099 1467 1274 1627 ...
##  $ TotalBedrooms   : num  129 1106 190 235 280 ...
##  $ Population      : num  322 2401 496 558 565 ...
##  $ Households      : num  126 1138 177 219 259 ...
##  $ Latitude        : num  37.9 37.9 37.9 37.9 37.9 ...
##  $ Longitude       : num  -122 -122 -122 -122 -122 ...
##  $ cut.prices      : Factor w/ 4 levels "[1.5e+04,1.2e+05]",..: 4 4 4 4 4 4 4 3 3 3 ...
treefit <- tree(log(MedianHouseValue) ~ Longitude + Latitude, data=calif)
plot(treefit); text(treefit, cex=0.75)
summary(treefit)
## 
## Regression tree:
## tree(formula = log(MedianHouseValue) ~ Longitude + Latitude, 
##     data = calif)
## Number of terminal nodes:  12 
## Residual mean deviance:  0.1662 = 3429 / 20630 
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  -2.75900  -0.26080  -0.01359   0.00000   0.26310   1.84100
Here "deviance" is the mean squared error; taking the square root gives a root-mean-square error of \(\sqrt{0.166} = 0.41\).
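This conversion can be verified directly, assuming the summary object exposes dev and df components (as its printed "Residual mean deviance: dev / df" line suggests):

```r
s <- summary(treefit)
sqrt(s$dev / s$df)  # residual mean deviance -> RMSE, ~0.41 on the log-price scale
```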
We can improve the fit by changing the stopping criteria with the mindev parameter.
treefit2 <- tree(log(MedianHouseValue) ~ Longitude + Latitude, data=calif, mindev=.001)
summary(treefit2)
## 
## Regression tree:
## tree(formula = log(MedianHouseValue) ~ Longitude + Latitude, 
##     data = calif, mindev = 0.001)
## Number of terminal nodes:  68 
## Residual mean deviance:  0.1052 = 2164 / 20570 
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  -2.94700  -0.19790  -0.01872   0.00000   0.19970   1.60600
With the larger tree we now have a root-mean-square error of 0.32.
However, we can get a better-fitting model by including the other variables.
treefit3 <- tree(log(MedianHouseValue) ~ ., data=calif)
summary(treefit3)
## 
## Regression tree:
## tree(formula = log(MedianHouseValue) ~ ., data = calif)
## Variables actually used in tree construction:
## [1] "cut.prices"
## Number of terminal nodes:  4 
## Residual mean deviance:  0.03608 = 744.5 / 20640 
## Distribution of residuals:
##       Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.718000 -0.127300  0.009245  0.000000  0.130000  0.358600
With all the available variables, the root-mean-square error is \(\sqrt{0.036} = 0.19\). Note, however, that the only variable actually used is cut.prices, which was derived directly from MedianHouseValue, so this apparent improvement reflects the outcome leaking into the predictors.
- pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
- survival: A Boolean indicating whether the passenger survived or not (0 = No; 1 = Yes); this is our target
- name: A field rich in information as it contains title and family names
- sex: male/female
- age: Age; a significant portion of values are missing
- sibsp: Number of siblings/spouses aboard
- parch: Number of parents/children aboard
- ticket: Ticket number
- fare: Passenger fare (British Pounds)
- cabin: Does the location of the cabin influence chances of survival?
- embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- boat: Lifeboat; many missing values
- body: Body identification number
- home.dest: Home/destination

rpart
(titanic.rpart <- rpart(survived ~ pclass + sex + age + sibsp, data=titanic.train))
## n= 982 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 982 231.7974000 0.38187370  
##    2) sex=male 641 101.2324000 0.19656790  
##      4) age>=9.5 601  86.6555700 0.17470880  
##        8) pclass>=1.5 477  53.9413000 0.12997900 *
##        9) pclass< 1.5 124  28.0887100 0.34677420 *
##      5) age< 9.5 40   9.9750000 0.52500000  
##       10) sibsp>=2.5 13   0.9230769 0.07692308 *
##       11) sibsp< 2.5 27   5.1851850 0.74074070 *
##    3) sex=female 341  67.1788900 0.73020530  
##      6) pclass>=2.5 152  37.9407900 0.48026320 *
##      7) pclass< 2.5 189  12.1058200 0.93121690 *
rpart
plot(titanic.rpart); text(titanic.rpart, use.n=TRUE, cex=1)
ctree
(titanic.ctree <- ctree(survived ~ pclass + sex + age + sibsp, data=titanic.train))
## 
##   Conditional inference tree with 7 terminal nodes
## 
## Response:  survived 
## Inputs:  pclass, sex, age, sibsp 
## Number of observations:  982 
## 
## 1) sex == {female}; criterion = 1, statistic = 268.259
##   2) pclass <= 2; criterion = 1, statistic = 80.466
##     3)*  weights = 189 
##   2) pclass > 2
##     4) sibsp <= 2; criterion = 0.976, statistic = 7.522
##       5)*  weights = 138 
##     4) sibsp > 2
##       6)*  weights = 14 
## 1) sex == {male}
##   7) pclass <= 1; criterion = 1, statistic = 17.804
##     8)*  weights = 127 
##   7) pclass > 1
##     9) age <= 9; criterion = 0.999, statistic = 13.497
##       10) sibsp <= 2; criterion = 0.994, statistic = 10.174
##         11)*  weights = 24 
##       10) sibsp > 2
##         12)*  weights = 13 
##     9) age > 9
##       13)*  weights = 477
ctree
plot(titanic.ctree)
In a classification model, outcomes are labeled either positive (p) or negative (n). There are then four possible outcomes:

- True positive (TP): a positive case correctly predicted as positive.
- False positive (FP): a negative case incorrectly predicted as positive.
- True negative (TN): a negative case correctly predicted as negative.
- False negative (FN): a positive case incorrectly predicted as negative.

An ROC curve plots the true positive rate, \(TP / (TP + FN)\), against the false positive rate, \(FP / (FP + TN)\), as the classification threshold varies.
ROCR Package

titanic.pred <- predict(titanic.ctree)
pred <- prediction(titanic.pred, as.integer(titanic.train$survived))
perf <- performance(pred, measure="tpr", x.measure="fpr")
plot(perf, colorize=TRUE, yaxis.at=c(0,0.5,0.8,0.9,1), yaxis.las=1)
lines(c(0,1), c(0,1), col="grey")
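The area under the ROC curve (AUC) summarizes the curve in a single number (1 = a perfect classifier, 0.5 = the grey chance line). ROCR computes it with the same performance() function used above:

```r
# Extract the AUC from the prediction object built above.
performance(pred, measure = "auc")@y.values[[1]]
```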
Ensemble methods use multiple models that are combined by weighting, or averaging, each individual model to provide an overall estimate. Each model is fit to a random sample of the data. Common ensemble methods include:

- Bagging (bootstrap aggregating)
- Boosting
- Random forests
The random forest algorithm works as follows:
1. Draw \(n_{tree}\) bootstrap samples from the original data.
2. For each bootstrap sample, grow an unpruned tree. At each node, randomly sample \(m_{try}\) predictors and choose the best split from among those predictors.
3. Predict new data by aggregating the predictions of the \(n_{tree}\) trees (majority vote for classification, average for regression).
Error rates are obtained as follows:
1. At each bootstrap iteration, predict the data not in the bootstrap sample (what Breiman calls "out-of-bag", or OOB, data) using the tree grown from that bootstrap sample.
2. Aggregate the OOB predictions. On average, each data point is out-of-bag about 36% of the time. The error rate calculated from these aggregated predictions is called the OOB estimate of the error rate.
titanic.rf <- randomForest(factor(survived) ~ pclass + sex + age + sibsp, data = titanic.train, ntree = 5000, importance = TRUE)
print(titanic.rf)
## 
## Call:
##  randomForest(formula = factor(survived) ~ pclass + sex + age + sibsp, 
##      data = titanic.train, ntree = 5000, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 5000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 21.38%
## Confusion matrix:
##     0   1 class.error
## 0 539  68   0.1120264
## 1 142 233   0.3786667
importance(titanic.rf)
##                0          1 MeanDecreaseAccuracy MeanDecreaseGini
## pclass  87.38497 130.789396            139.42443         49.51720
## sex    225.28947 277.908481            291.07598        119.18735
## age     76.02532  57.306800            101.84695         58.05450
## sibsp   80.42652  -4.659423             65.09917         17.84692
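The randomForest package can also plot these two importance measures (mean decrease in accuracy and mean decrease in Gini) directly:

```r
# Dot plots of the variable importance measures shown in the table above.
varImpPlot(titanic.rf)
```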
min_depth_frame <- min_depth_distribution(titanic.rf)  # randomForestExplainer package
## Warning: Factor `split var` contains implicit NA, consider using
## `forcats::fct_explicit_na`
plot_min_depth_distribution(min_depth_frame)
caret Package

The caret package (short for **C**lassification **A**nd **RE**gression **T**raining) is a set of functions that attempt to streamline the process of creating predictive models.
The caret package creates a unified interface for using many modeling packages.
| Function | Package | Code |
|---|---|---|
| lda | MASS | predict(obj) |
| glm | stats | predict(obj, type = "response") |
| gbm | gbm | predict(obj, type = "response", n.trees) |
| mda | mda | predict(obj, type = "posterior") |
| rpart | rpart | predict(obj, type = "prob") |
| Weka | RWeka | predict(obj, type = "probability") |
| logitboost | LogitBoost | predict(obj, type = "raw", nIter) |
data("tutoring") tutoring$treat2 <- as.factor(tutoring$treat != 'Control') inTrain <- createDataPartition( y = tutoring$treat2, ## the outcome data are needed p = .75, ## The percentage of data in the ## training set list = FALSE ) tutoring.train <- tutoring[inTrain,] tutoring.valid <- tutoring[-inTrain,]
rfFit <- train(
    treat2 ~ Gender + Ethnicity + Military + ESL + EdMother + EdFather +
        Age + Employment + Income + Transfer + GPA,
    data = tutoring.train,
    method = "parRF"
)
rfFit
## Parallel Random Forest 
## 
## 857 samples
##  11 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 857, 857, 857, 857, 857, 857, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa     
##    2    0.8082169  0.07968598
##    7    0.7995694  0.16117684
##   12    0.7918287  0.16048895
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
rf.valid.pred <- predict(rfFit, newdata = tutoring.valid)
confusionMatrix(rf.valid.pred, tutoring.valid$treat2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   226   49
##      TRUE      3    7
## 
##                Accuracy : 0.8175
##                  95% CI : (0.7677, 0.8606)
##     No Information Rate : 0.8035
##     P-Value [Acc > NIR] : 0.3048
## 
##                   Kappa : 0.1622
## 
##  Mcnemar's Test P-Value : 4.365e-10
## 
##             Sensitivity : 0.9869
##             Specificity : 0.1250
##          Pos Pred Value : 0.8218
##          Neg Pred Value : 0.7000
##              Prevalence : 0.8035
##          Detection Rate : 0.7930
##    Detection Prevalence : 0.9649
##       Balanced Accuracy : 0.5559
## 
##        'Positive' Class : FALSE
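One benefit of caret's unified interface: the same predict() call returns class probabilities for any classification model, regardless of the underlying package:

```r
# Class probabilities from the parallel random forest fit above.
head(predict(rfFit, newdata = tutoring.valid, type = "prob"))
```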
bayesFit <- train(
    treat2 ~ Gender + Ethnicity + Military + ESL + EdMother + EdFather +
        Age + Employment + Income + Transfer + GPA,
    data = tutoring.train,
    method = "naive_bayes"
)
bayes.valid.pred <- predict(bayesFit, newdata = tutoring.valid)
confusionMatrix(bayes.valid.pred, tutoring.valid$treat2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   229   56
##      TRUE      0    0
## 
##                Accuracy : 0.8035
##                  95% CI : (0.7526, 0.848)
##     No Information Rate : 0.8035
##     P-Value [Acc > NIR] : 0.5357
## 
##                   Kappa : 0
## 
##  Mcnemar's Test P-Value : 1.987e-13
## 
##             Sensitivity : 1.0000
##             Specificity : 0.0000
##          Pos Pred Value : 0.8035
##          Neg Pred Value : NaN
##              Prevalence : 0.8035
##          Detection Rate : 0.8035
##    Detection Prevalence : 1.0000
##       Balanced Accuracy : 0.5000
## 
##        'Positive' Class : FALSE
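caret can also compare the resampled performance of the two models side by side with resamples(). A minimal sketch using the fits from above; for a strict comparison both models should be trained with the same resampling indices, e.g. via a shared trainControl object:

```r
# Collect the bootstrap results from both train() fits and summarize them.
results <- resamples(list(RandomForest = rfFit, NaiveBayes = bayesFit))
summary(results)  # accuracy and kappa distributions across the 25 bootstrap reps
```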
Feature Engineering
Are your predictions actually any good?
Loh, Soo, and Xing (2016) were able to predict one's sexual orientation with approximately 90% accuracy. However, according to Gallup, only 4.1% of Americans identify as LGBTQ. If I were to guess that every American was heterosexual, I would be correct 95.9% of the time, doing better than Loh, Soo, and Xing's predictive models. (This majority-class baseline is the "no information rate" reported in the confusion matrices above.)
Are your predictions reinforcing social biases?
- Kuhn and Johnson's Applied Predictive Modeling
- caret package: http://topepo.github.io/caret
- randomForestExplainer package: https://cran.rstudio.com/web/packages/randomForestExplainer/vignettes/randomForestExplainer.html
- SuperLearner package: https://github.com/ecpolley/SuperLearner
- Fernández-Delgado, Cernadas, Barro, & Amorim (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15, 3133-3181.
Email: jason@bryer.org
Twitter: [@jbryer](https://twitter.com/@jbryer)
Website: https://www.bryer.org
Github: https://github.com/jbryer