注ï¼åä½ããªãã³ã¼ããæ¸ãæããä¸è¶³ãã¦ãããã®ã«é¢ãã¦æ³¨ã¨ããå½¢ã§è£å®ãã¦ããã
é åºãã¸ã¹ãã£ãã¯å帰ã¯ã1ã¤ã¾ãã¯è¤æ°ã®äºæ¸¬å¤æ°ï¼æ°å¤ã¾ãã¯ã«ãã´ãªï¼ã¨é åºçµæã®éã®é¢ä¿ãã¢ãã«ããå帰åæã®ä¸ç¨®ã ãé åºçµæã¯ã次ã®ãããªè«ççé åºãæã¤2ã¤ä»¥ä¸ã®ã«ãã´ãªãæã¤å¤æ°ã§ããï¼
- çã®ã¹ãã¼ã¸ï¼0ã1ã2ã3ã4
- æå¾ã¬ãã«ï¼ä½ãä¸ãé«
ãã®ãã¥ã¼ããªã¢ã«ã§ã¯ãGitHubã®ãªã³ã¯ãããã¦ã³ãã¼ãã§ãã(https://github.com/datasciencedojo/datasets/blob/master/titanic.csv)ã
注ï¼ããã§ã¯titanicããã±ã¼ã¸ã®titanic_trainãã¼ã¿ã使ç¨ããã
titanicãã¼ã¿ã»ããã§é åºãã¸ã¹ãã£ãã¯å帰ã使ããäºæ¸¬å¤æ°Pclassï¼ä¹å®¢ã¯ã©ã¹ï¼ãäºæ¸¬ããï¼Pclassã¯ã3ã¤ã®å¤ï¼1: first class, 2: second class, 3: third classï¼ã®ãã¡ã®1ã¤ãåããã¨ãã§ããï¼ äºæ¸¬å¤æ°ï¼å¹´é½¢ï¼Ageï¼ãæ§å¥ï¼Sexï¼ãçåï¼Survivedï¼ã
library(titanic) titanic.raw <-titanic_train
## ãµãã»ãããä½æ titanic <- subset(titanic.raw, select = c(Pclass, Age, Sex, Survived)) ## Pclassã«ã¬ãã«ãã¤ãã titanic$Pclass <- factor(titanic$Pclass, levels = c(3, 2, 1), labels = c("1st", "2nd", "3rd"), ordered = TRUE) titanic$Sex <- factor(titanic$Sex) titanic$Survived <- factor(titanic$Survived, labels = c("Died", "Survived")) str(titanic)
'data.frame': 891 obs. of 4 variables: $ Pclass : Ord.factor w/ 3 levels "1st"<"2nd"<"3rd": 1 3 1 3 1 1 3 1 1 2 ... $ Age : num 22 38 26 35 35 NA 54 2 27 14 ... $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ... $ Survived: Factor w/ 2 levels "Died","Survived": 1 2 2 2 1 1 1 1 2 2 ...
2. é åºãã¸ã¹ãã£ãã¯å帰ã¢ãã«ã®ãã£ããã£ã³ã°
æã ã¯ãé åºãã¸ã¹ãã£ãã¯å帰ã®æãä¸è¬çãªã¿ã¤ãã§ããæ¯ä¾ãªããºã»ãã¸ã¹ãã£ãã¯å帰ã使ç¨ããã
ãã®ã¢ãã«ã¯ãordinal ããã±ã¼ã¸ã®clm()é¢æ°ãç¨ãã¦ãä¸è¨ã®ããã«å®è¡ã§ããï¼
library(ordinal) model <- clm(Pclass ~ Age + Sex + Survived, data = titanic, na.action = na.omit) summary(model)
Coefficients: Estimate Std. Error z value Pr(>|z|) Age 0.063208 0.005857 10.792 <2e-16 *** Sexmale 0.315159 0.199921 1.576 0.115 SurvivedSurvived 1.969870 0.204830 9.617 <2e-16 *** --- Signif. codes: 0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1 Threshold coefficients: Estimate Std. Error z value 1st|2nd 2.8120 0.2838 9.91 2nd|3rd 4.2024 0.3091 13.60
注
na.actionã¯ãªã¹ãã¯ã¤ãºãæå³ããna.omitãna.excludeã¯ï¼äºæ¸¬å¤ãæ¨æºèª¤å·®ãåºééçã«ããã¦ï¼æ®å·®å¤NAã¨è¡¨ç¤ºãããã
åäºæ¸¬å¤æ°ã¨çµæã®é¢ä¿ã®è§£é
äºæ¸¬å¤æ° X1 ã«é¢é£ã¥ããããé åºãã¸ã¹ãã£ãã¯å帰ä¿æ° β1 ã®ä¸è¬çãªè§£éã¯ï¼æ¬¡å¼ã§ããï¼
äºæ¸¬å¤æ°X1ã1åä½å¢å ããã¨ï¼çµæYãããé«ãã«ãã´ãªã¼ã«ãããã¨ã®å¯¾æ°ãªããºãβ1å¤åããï¼
äºæ¸¬å¤æ°ã®ä¿æ°ã¯å¯¾æ°ãªããºæ¯ãªã®ã§ããªããºæ¯ãå¾ãããã«ã¯ææ°åããªããã°ãªããªãã
Rã§ã¯ãä¿æ°ã¨på¤ãæ½åºããããã«ãã¢ãã«ã»ãã£ããã®é¢æ°tidy()ããå¼æ°exponentiate = TRUEã¨conf.int = TRUEï¼95%ä¿¡é ¼åºéã表示ããï¼ã§å¼ã³åºããã¨ãã§ãããããã¦ã表ãå°ããããããã«ãåºåããæ¨æºèª¤å·®ã¨zã¹ã³ã¢ãåãé¤ãï¼
library(tidymodels) tidy(model, exponentiate = TRUE, conf.int = TRUE) |> select(-c(std.error, statistic))
term estimate p.value conf.low conf.high coef.type <chr> <dbl> <dbl> <dbl> <dbl> <chr> 1 1st|2nd 16.6 3.77e-23 NA NA intercept 2 2nd|3rd 66.8 4.17e-42 NA NA intercept 3 Age 1.07 3.77e-27 1.05 1.08 location 4 Sexmale 1.37 1.15e- 1 0.930 2.04 location 5 SurvivedSurvived 7.17 6.77e-22 4.83 10.8 location
æã ã¯ãäºæ¸¬å¤æ°ã®ä¿æ°ï¼æå¾ã®3è¡ï¼ã®ã¿ãè¦ã¦ãåçã¯è¦ãªãã以ä¸ã¯ããã®è§£éæ¹æ³ ã§ããï¼
å¹´é½¢ã®ä¿æ°ã®è§£éï¼
å¹´é½¢ã«é¢ããææ°ä¿æ°ï¼ãªããºæ¯ï¼ã¯1.07ã§ãããå¹´é½¢ãé«ããã¨ã¯ãããé«ã客室ã§ãããã¨ã¨æ£ã®é¢ä¿ãããã解éã¯æ¬¡ã®ããã«ãªãï¼
å¹´é½¢ãé«ãä¹å®¢ã¯ãããé«ãã¯ã©ã¹ã®ãã±ãããæã¤å¾åãããï¼p < 0.001ï¼ãå ·ä½çã«ã¯ã1æ³ä¸ã®ä¹å®¢ã¯ã1.07åé«ã確çã§ã1ã¯ã©ã¹é«ã客室ã®ãã±ãããæã¤ã
Sexã®ä¿æ°ã®è§£éï¼
ç·æ§ã®Sexã«é¢ããææ°ä¿æ°ï¼ãªããºæ¯ï¼ã¯çµ±è¨çã«ææã§ã¯ãªãï¼p = 0.115ï¼ã
çåçã®ä¿æ°ã®è§£é
Survivedã®ææ°ä¿æ°ï¼ãªããºæ¯ï¼7.17ã§ããã
çåããä¹å®¢ã¯ãããé«ãã¯ã©ã¹ã®ãã±ãããæã£ã¦ããå¾åãããï¼p < 0.001ï¼ãå ·ä½çã«ã¯ãçåããä¹å®¢ã¯ãçåããªãã£ãä¹å®¢ããã7.17åã®é«ã確çã§1ã¯ã©ã¹ä¸ã®èªç©ºå¸ãæã£ã¦ããã
ã¢ãã«é©åã®ãã§ãã¯
rcompanionããã±ã¼ã¸ã®é¢æ°nagelkerke()ã¯ãæ¬ä¼¼R2ä¹ãè¨ç®ãã尤度æ¯æ¤å®ãå®è¡ããï¼
library(rcompanion) nagelkerke(model)
$Models Model: "clm, Pclass ~ Age + Sex + Survived, titanic" Null: "clm, Pclass ~ 1, titanic" $Pseudo.R.squared.for.model.vs.null Pseudo.R.squared McFadden 0.155862 Cox and Snell (ML) 0.277186 Nagelkerke (Cragg and Uhler) 0.316640 $Likelihood.ratio.test Df.diff LogLik.diff Chisq p.value -3 -115.88 231.77 5.738e-50 $Number.of.observations Model: 714 Null: 714 $Messages [1] "Note: For models fit with REML, these statistics are based on refitting with ML" $Warnings [1] "None"
ç¹ã«éè¦ãªã®ã¯ã
Nagelkerkeã®R2ä¹: ãã¸ã¹ãã£ãã¯å帰ã¢ãã«ã®é©å度ã測å®ãã0ãã1ã®éã®æ°å¤ã§ããã
尤度æ¯æ¤å®: ãã«ã»ã¢ãã«ï¼ãã¹ã¦ã®äºæ¸¬å¤æ°ãå«ãã¢ãã«ï¼ããã«ã»ã¢ãã«ï¼å¤æ°ãå«ã¾ãªãã¢ãã«ï¼ããããããã¼ã¿ã«é©åãããã©ãããæ¤å®ãããæã
ã®äºä¾ã§ã¯ãLogLik.diffã¯-115.88ã§ãp < 0.001ã§ãããããã¯äºæ¸¬å¤æ°ã追å ãããã¨ãäºæ¸¬å¤æ°ãå«ã¾ãªã帰ç¡ã¢ãã«ããããããã¨ãæå³ããã
ã注
nagelkerkeã®çä¼¼Räºä¹ãç¨ãã尤度æ¯æ¤å®ã¯è¡ããªãæ¹ãè¯ãã¨æããããä¸ã¤ã¯nagelkerkeã®çä¼¼Räºä¹ãé常ã®Räºä¹ã¨ã¯ç°ãªãã¨ãããã¨ãããä¸ã¤ã¯Räºä¹ã¯ãã¢ãã«ã®ãã£ããã£ã³ã°ã示ãææ¨ã¨ãã¦ã¯ä¸é©åã§ããããã§ããã
åèï¼
https://ides.hatenablog.com/entry/2022/05/06/165738
https://ides.hatenablog.com/entry/2022/06/19/030458
ã¾ããlipsitzæ¤å®ãå®è¡ãã¦é©å度ããã§ãã¯ãããã¨ãã§ããï¼
library(generalhoslem) lipsitz.test(model)
Lipsitz goodness of fit test for ordinal response models data: formula: Pclass ~ Age + Sex + Survived LR statistic = 5.2796, df = 9, p-value = 0.8093
帰ç¡ä»®èª¬ã¯ãæå®ãããã¢ãã«ããã¼ã¿ã«ããé©åããã5%æ°´æºä»¥ä¸ã®å°ããªPå¤ã¯ãã¢ãã«ãä¸å®å ¨ã§ãããã¨ã示åãããã¤ã¾ããæã ã®ã¢ãã«ã¯ãã¼ã¿ã«é©åãã¦ããã
注
ãªãã·ããæ¤å®ã¯ãé åºå¯¾å¿ãã¸ã¹ãã£ãã¯å帰ã¢ãã«ã®é©å度æ¤å®ã§ãããããã¯ãé åºå¿çã¹ã³ã¢ã«åºã¥ãã¦ã観å¯ããããã¼ã¿ãçãã大ããã®g群ã«åå²ãããã¨ãå«ãããã®ã¹ã³ã¢ã¯ãåçµææ°´æºã§ã®å被é¨è ã®äºæ¸¬ç¢ºçã«çééã®æ´æ°ã®éã¿ãæãã¦åè¨ãããã¨ã«ããè¨ç®ããããã¦ã¼ã¶ã¼ã¯ãããã©ã«ãã§ã¯10ã§ããgã«æ´æ°å¤ãä»£å ¥ãã¦ãã°ã«ã¼ãã®æ°ãæå®ã§ããã
ãã¼ã¿ã®ãã®åå²ãä¸ããããã¨ãåã°ã«ã¼ãã«ã¤ãã¦ã被é¨è ãé ågã«ããã°I = 1ããªããã°I = 0ã¨ãªããããªããã¼å¤æ°Iãå°ããããããã¦ãã¢ãã«ããããã®ããã¼å¤æ°ã§åãã£ãããããLipsitz ã (1996) ã¯ã尤度æ¯æ¤å®ãWald æ¤å®ãã¹ã³ã¢æ¤å®ã使ç¨ã§ãããã¨ã示åãã¦ãã; lipsitz.test ã¯ãèªç±åº¦ g-1 ã®å°¤åº¦æ¯æ¤å®ã使ç¨ããã ãã§ããã
ã¢ãã«ãå®è¡ããåã«ãçµæå¤æ°ãå åã«å¤æããªããã°ãªããªããã¨ã«æ³¨æãã¦ã»ãããã¢ãã«é¢æ°å 㧠as.factor() ã使ç¨ããã¨ã lipsitz.test ãã¢ãã«ãåé©åããããã« update() é¢æ°ã使ç¨ããæ¹æ³ã®ãããã¨ã©ã¼ãçºçããã
Lipsitz æ¤å®ã¯ãé åºHosmer-Lemeshow æ¤å®ï¼logitgofï¼ã¨Pulkstenis-Robinson æ¤å®ï¼pulkrob.chisq 㨠pulkrob.devianceï¼ã¨ä¸ç·ã«å®è¡ãããã¨ãæ¨å¥¨ã ããï¼Fagerland and Hosmer, 2016ï¼ã
https://www.rdocumentation.org/packages/generalhoslem/versions/1.3.4/topics/lipsitz.test
注ï¼Pulkstenis-Robinson æ¤å®
帰ç¡ä»®èª¬ã¯ãæå®ãããã¢ãã«ããã¼ã¿ã«ããé©åããã5%æ°´æºä»¥ä¸ã®å°ããªPå¤ã¯ãã¢ãã«ãä¸å®å ¨ã§ãããã¨ã示åããã c("Sex","Survived")ã®ã¨ããã¯ã«ãã´ãªã«ã«å¤æ°ãæå®ããå¿ è¦ãããã
pulkrob.chisq(model, c("Sex","Survived"))
Pulkstenis-Robinson chi-squared test data: formula: Pclass ~ Age + Sex + Survived X-squared = 28.637, df = 11, p-value = 0.002583
pulkrob.deviance(model, c("Sex","Survived"))
Pulkstenis-Robinson deviance test data: formula: Pclass ~ Age + Sex + Survived Deviance-squared = 28.448, df = 11, p-value = 0.002763
注ï¼Hosmer-Lemeshowæ¤å®
帰ç¡ä»®èª¬ã¯ãæå®ãããã¢ãã«ããã¼ã¿ã«ããé©åããã5%æ°´æºä»¥ä¸ã®å°ããªPå¤ã¯ãã¢ãã«ãä¸å®å ¨ã§ãããã¨ã示åããã
## æ¬ æå°ãåé¤ãã¦titanic.na.omitã«æ ¼ç´ titanic.na.omit <- na.omit(titanic) ## å¾å±å¤æ°ã®ã¿ã®ãã¼ã¿ãpredprobã«æ ¼ç´ predprob <- data.frame(Age = titanic.na.omit$Age, Sex = titanic.na.omit$Sex, Survived = titanic.na.omit$Survived) ## ãã£ããããããªãã¸ã§ã¯ãmodelããå¿ è¦ãªãã®ãfvã«æ ¼ç´ fv <- predict(model, newdata = predprob, type = "prob")$fit ## Hosmer-Lemeshowæ¤å®ãå®è¡ï¼å¾å±å¤æ°ããã£ããããããªãã¸ã§ã¯ããå¾å±å¤æ°ã®ã«ãã´ãªã¼æ°ãé åºå¤æ°ãã©ãããæå®ããã logitgof(titanic.na.omit$Pclass, fv, g = 3, ord = TRUE)
Hosmer and Lemeshow test (ordinal model) data: titanic.na.omit$Pclass, fv X-squared = 3.5784, df = 3, p-value = 0.3107
é åºãã¸ã¹ãã£ãã¯å帰ã¢ãã«ã®ç²¾åº¦
次ã«ã3ã¤ã®ã¹ãããã§ã¢ãã«ã®ç²¾åº¦ãè¨ç®ããï¼
ã¹ããã#1: ãã£ããå¤ãåå¾ããpredsã«ä¿åããï¼
注ï¼ããã§ã¯broomããã±ã¼ã¸ã®augment()ãç¨ããã
preds <- augment(model, type = "class", na.action="na.omit") preds
# A tibble: 714 Ã 6 .rownames Pclass Age Sex Survived .fitted <chr> <ord> <dbl> <fct> <fct> <fct> 1 1 1st 22 male Died 1st 2 2 3rd 38 female Survived 3rd 3 3 1st 26 female Survived 3rd 4 4 3rd 35 female Survived 3rd 5 5 1st 35 male Died 1st 6 7 3rd 54 male Died 3rd 7 8 1st 2 male Died 1st 8 9 1st 27 female Survived 3rd 9 10 2nd 14 female Survived 1st 10 11 1st 4 female Survived 1st
ã¹ããã2ï¼æ··åè¡åãè¦ãï¼
注ï¼yardstickã®conf_mat()ãç¨ããã
conf_mat(preds, truth = Pclass, estimate = .fitted)
Truth Prediction 1st 2nd 3rd 1st 306 114 73 2nd 0 0 0 3rd 49 59 113
ãã®ã¢ãã«ã¯ã2çå¸ã®ãã±ãããæã£ã¦ãã人ãåé¡ãã¦ããªããã¨ã«æ³¨æãã¦ã»ããã
ããã§åæ¢ãã¦ãæ··åè¡åãç¨ãã¦æåã§ç²¾åº¦ãè¨ç®ãããã¨ãã§ãã¾ãããyardstickã®accuracy()é¢æ°ãç¨ããã¨ããéãè¨ç®ã§ããã§ãããã
ã¹ããã3: ã¢ãã«ã®ç²¾åº¦ãè¨ç®ããï¼
yardstick::accuracy(preds, truth = Pclass, estimate = .fitted)
.metric .estimator .estimate <chr> <chr> <dbl> 1 accuracy multiclass 0.587
ã¤ã¾ããå¹´é½¢ãæ§å¥ãçåçã«åºã¥ãã¦ãã¢ãã«ã¯ä¹å®¢ã®ã¯ã©ã¹ã58.7ï¼ ã®ç¢ºçã§æ£ããæ¨æ¸¬ãã¦ãããã¨ã«ãªãã
æ¯ä¾ãªããºã®ä»®å®ã®ãã§ãã¯
ä¸è¨ã®é åºãã¸ã¹ãã£ãã¯å帰ã¢ãã«ã®ç¹å®ã®ã¿ã¤ãï¼æ¯ä¾ãªããºã»ãã¸ã¹ãã£ãã¯å帰ï¼ã¯ãçµæã§ã®åäºæ¸¬å¤æ°ã®å¹æãçµæã®ãã¹ã¦ã®ã¬ãã«ã«ã¤ãã¦åãã§ããã¨ä»®å®ãã¦ããããã¨ãã°ã2çå¸ã®ãã±ãã vs 1çå¸ã®ãã±ãããæã¤ãªããºã§ã®å¹´é½¢ã®1æ³å¢å ã®å¹æã¯ã3çå¸ã®ãã±ãã vs 2çå¸ã®ãã±ãããæã¤ãªããºã§ã®å¹´é½¢ã®1æ³å¢å ã®å¹æã¨åãã¨ãã£ããã®ã ã
ããã§å¸°ç¡ä»®èª¬ã¯ãæ¯ä¾ãªããºã®ä»®å®ãæãç«ã¤ã¨ãããã®ã§ãã帰ç¡ä»®èª¬ã¯ï¼æ¯ä¾ãªããºã®ä»®å®ãæãç«ã¤ã¨ãããã®ã§ããï¼ãªã ããã¹æ¤å®ã§p < 0.05ã¨å¤æ°ã®å°ãªãã¨ã1ã¤ãéãªãã¨ï¼ä»®å®ãç ´ãããã¨ã¿ãªããã[sourceï¼ McNulty K. Handbook of Regression Modeling in People Analyticsï¼ With Examples in R and Python. 第1çãChapman and Hall/CRC; 2021.].
library(gofcat) brant.test(model)
Brant Test: chi-sq df pr(>chi) Omnibus 3.699 3 0.30 Age 2.027 1 0.15 Sexmale 1.554 1 0.21 SurvivedSurvived 0.679 1 0.41 H0: Proportional odds assumption holds
ããã§ã¯ããã¹ã¦ã®på¤ã0.05ãè¶ ãã¦ããã®ã§ãæ¯ä¾ãªããºã®ä»®å®ãæãç«ã¤ã