DEMO: Logistic regression and ML modelling with SHAP analysis

STATA.si

25 February, 2022

Titanic data

Survived  Pclass  Sex     Age  SibSp  Parch  Fare    Embarked
0         3       male    22   1      0      7.250   S
1         1       female  38   1      0      71.283  C
1         3       female  26   0      0      7.925   S
1         1       female  35   1      0      53.100  S
0         3       male    35   0      0      8.050   S
0         3       male    NA   0      0      8.458   Q

Multiple logistic regression

Final model

The final model was selected with a stepwise procedure. This section reports the statistical analysis of that final model.

Variable  Odds ratio  95% CI          p-value
Sexmale   0.082       [0.052, 0.130]  <0.001
Pclass2   0.218       [0.118, 0.404]  <0.001
Pclass3   0.060       [0.032, 0.112]  <0.001
Age       0.955       [0.938, 0.972]  <0.001
SibSp     0.749       [0.582, 0.962]  0.024
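
The odds ratios in the final-model table are exponentiated logistic-regression coefficients, so each one is a multiplicative change in the odds of survival. The following is a minimal Python sketch of that reading. The odds ratios are copied from the table; the helper names are hypothetical, and Python is used only for illustration, since the report does not show the code that produced it.

```python
import math

# Odds ratios (OR) copied from the final-model table.
odds_ratios = {
    "Sexmale": 0.082,
    "Pclass2": 0.218,
    "Pclass3": 0.060,
    "Age": 0.955,
    "SibSp": 0.749,
}

def log_odds_coefficient(odds_ratio):
    """Logistic-regression coefficient that corresponds to an odds ratio."""
    return math.log(odds_ratio)

def percent_change_in_odds(odds_ratio):
    """Percent change in the odds of survival per one-unit increase."""
    return (odds_ratio - 1.0) * 100.0

def odds_ratio_for_change(odds_ratio, delta):
    """Odds ratio for a change of `delta` units in a numeric predictor."""
    return odds_ratio ** delta

for name, value in odds_ratios.items():
    beta = log_odds_coefficient(value)
    change = percent_change_in_odds(value)
    print(f"{name:8s} beta = {beta:+.3f}  change in odds = {change:+.1f}%")

# Age enters per year, so a ten-year difference compounds the yearly OR.
age_10_years = odds_ratio_for_change(odds_ratios["Age"], 10)
print(f"OR for +10 years of age: {age_10_years:.3f}")
```

Read this way, a ten-year age difference corresponds to an odds ratio of about 0.63, holding the other predictors fixed.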

Validation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 72 19
##          1  6 36
##                                           
##                Accuracy : 0.812           
##                  95% CI : (0.7352, 0.8745)
##     No Information Rate : 0.5865          
##     P-Value [Acc > NIR] : 2.564e-08       
##                                           
##                   Kappa : 0.5985          
##                                           
##  Mcnemar's Test P-Value : 0.0164          
##                                           
##             Sensitivity : 0.6545          
##             Specificity : 0.9231          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.7912          
##               Precision : 0.8571          
##                  Recall : 0.6545          
##                      F1 : 0.7423          
##              Prevalence : 0.4135          
##          Detection Rate : 0.2707          
##    Detection Prevalence : 0.3158          
##       Balanced Accuracy : 0.7888          
##                                           
##        'Positive' Class : 1               
## 
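
The summary statistics above all follow from the four cell counts of the confusion matrix. As a rough check, the Python sketch below recomputes the main ones, with class 1 as the positive class. The function name `classification_metrics` is hypothetical, and this is not the routine that produced the output.

```python
def classification_metrics(tn, fn, fp, tp):
    """Derive common classification metrics from confusion-matrix counts."""
    total = tn + fn + fp + tp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # recall for the positive class
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    # Cohen's kappa: agreement beyond what the row and column totals
    # would produce by chance.
    expected = ((tn + fn) * (tn + fp) + (fp + tp) * (fn + tp)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": kappa,
    }

# Counts read from the matrix above: rows are predictions, columns the reference.
metrics = classification_metrics(tn=72, fn=19, fp=6, tp=36)
for name, value in metrics.items():
    print(f"{name:18s} {value:.4f}")
```

The 19 false negatives against 6 false positives explain why sensitivity is much lower than specificity.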

ROC curve

ML model

## Created from 581 samples and 19 variables
## 
## Pre-processing:
##   - centered (19)
##   - ignored (0)
##   - 5 nearest neighbor imputation (19)
##   - scaled (19)
## [12:23:52] WARNING: amalgamation/../src/objective/regression_obj.cu:188: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:0.750784 
## [2]  train-rmse:0.577628 
## [3]  train-rmse:0.463238 
## [4]  train-rmse:0.386368 
## [5]  train-rmse:0.338873 
## [6]  train-rmse:0.306742 
## [7]  train-rmse:0.288635 
## [8]  train-rmse:0.278576 
## [9]  train-rmse:0.263796 
## [10] train-rmse:0.258578
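
The pre-processing summary above lists centering, scaling and 5-nearest-neighbour imputation. The Python sketch below illustrates that idea on the three numeric columns (Pclass, Age, Fare) of the six sample rows shown at the top. It assumes that distances are computed on the standardised, observed columns and that the neighbours come from complete rows only. The function names are hypothetical, `k = 2` is used because only five complete rows are available, and the back-transformation at the end is for readability only.

```python
import math

def center_scale(rows):
    """Z-score every column, skipping missing values (None)."""
    n_cols = len(rows[0])
    means, sds = [], []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        mean = sum(observed) / len(observed)
        # Sample standard deviation (n - 1 in the denominator).
        sd = math.sqrt(sum((v - mean) ** 2 for v in observed) / (len(observed) - 1))
        means.append(mean)
        sds.append(sd)
    scaled = [
        [None if r[j] is None else (r[j] - means[j]) / sds[j] for j in range(n_cols)]
        for r in rows
    ]
    return scaled, means, sds

def knn_impute(scaled, k):
    """Fill each missing cell with the column mean over the k nearest complete rows."""
    complete = [r for r in scaled if None not in r]
    imputed = []
    for row in scaled:
        if None not in row:
            imputed.append(list(row))
            continue
        observed_cols = [j for j, v in enumerate(row) if v is not None]

        def distance(candidate):
            # Euclidean distance over the columns observed in `row`.
            return math.sqrt(sum((row[j] - candidate[j]) ** 2 for j in observed_cols))

        nearest = sorted(complete, key=distance)[:k]
        imputed.append([
            v if v is not None else sum(c[j] for c in nearest) / len(nearest)
            for j, v in enumerate(row)
        ])
    return imputed

# Columns: Pclass, Age, Fare (the six sample rows; None marks the missing Age).
rows = [
    [3, 22, 7.250],
    [1, 38, 71.283],
    [3, 26, 7.925],
    [1, 35, 53.100],
    [3, 35, 8.050],
    [3, None, 8.458],
]
scaled, means, sds = center_scale(rows)
imputed = knn_impute(scaled, k=2)

# Back-transform the imputed Age to years for readability.
age_imputed = imputed[5][1] * sds[1] + means[1]
print(f"Imputed age for the sixth passenger: {age_imputed:.1f}")
```

In this toy run the two nearest neighbours are the other third-class passengers with the closest fares, so the missing age is filled with the mean of their ages.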

Validation

Classification

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 101  25
##          1   8  43
##                                           
##                Accuracy : 0.8136          
##                  95% CI : (0.7483, 0.8681)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 1.058e-08       
##                                           
##                   Kappa : 0.5865          
##                                           
##  Mcnemar's Test P-Value : 0.005349        
##                                           
##             Sensitivity : 0.6324          
##             Specificity : 0.9266          
##          Pos Pred Value : 0.8431          
##          Neg Pred Value : 0.8016          
##               Precision : 0.8431          
##                  Recall : 0.6324          
##                      F1 : 0.7227          
##              Prevalence : 0.3842          
##          Detection Rate : 0.2429          
##    Detection Prevalence : 0.2881          
##       Balanced Accuracy : 0.7795          
##                                           
##        'Positive' Class : 1               
## 
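
Two of the reported p-values can be reproduced from the counts alone. The Python sketch below assumes that `P-Value [Acc > NIR]` is a one-sided exact binomial test of the accuracy against the no-information rate, and that McNemar's test uses the continuity correction. Both are assumptions about how the output was produced, and the function names are hypothetical.

```python
import math

def mcnemar_p_value(b, c):
    """McNemar's chi-squared test with continuity correction (1 degree of freedom)."""
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Chi-squared survival function with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))

def binomial_upper_tail(successes, trials, p):
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        math.comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Counts read from the matrix above: rows are predictions, columns the reference.
tn, fn, fp, tp = 101, 25, 8, 43
total = tn + fn + fp + tp

# No-information rate: share of the larger reference class.
nir = max(tn + fp, fn + tp) / total

# Is the observed accuracy better than always predicting the larger class?
p_acc = binomial_upper_tail(tn + tp, total, nir)

# Are the two kinds of error (false negatives, false positives) balanced?
p_mcnemar = mcnemar_p_value(fn, fp)

print(f"NIR             : {nir:.4f}")
print(f"P [Acc > NIR]   : {p_acc:.3e}")
print(f"McNemar p-value : {p_mcnemar:.6f}")
```

A small McNemar p-value here indicates that the model misses survivors (25 false negatives) more often than it raises false alarms (8 false positives).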

ROC curve

SHAP analysis
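
SHAP attributes a single prediction to its input features using Shapley values. As a toy illustration only, the Python sketch below computes exact Shapley values by enumerating all feature coalitions. It does not reproduce the analysis of the ML model: it uses the log-odds of the logistic model from the first part, because those coefficients are the only model parameters given in this report. The intercept is omitted since it cancels in differences, the first sample passenger is explained against the second as baseline, and all function names are hypothetical.

```python
import math
from itertools import combinations

feature_names = ["Sexmale", "Pclass2", "Pclass3", "Age", "SibSp"]
# Coefficients recovered from the reported odds ratios: beta = ln(OR).
coefficients = [math.log(v) for v in (0.082, 0.218, 0.060, 0.955, 0.749)]

def log_odds(x):
    """Linear predictor without the intercept (it cancels in differences)."""
    return sum(b * v for b, v in zip(coefficients, x))

def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(coalition):
        mixed = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# First sample passenger: male, 3rd class, age 22, one sibling/spouse.
x = [1, 0, 1, 22, 1]
# Baseline, second sample passenger: female, 1st class, age 38, one sibling/spouse.
baseline = [0, 0, 0, 38, 1]

phi = shapley_values(log_odds, x, baseline)
for name, contribution in zip(feature_names, phi):
    print(f"{name:8s} {contribution:+.3f}")
print(f"sum = {sum(phi):+.3f}, "
      f"difference in log-odds = {log_odds(x) - log_odds(baseline):+.3f}")
```

The contributions add up to the difference in log-odds between the explained passenger and the baseline. For a linear predictor each contribution reduces to the coefficient times the feature difference. Enumerating every coalition is feasible only for a handful of features, so practical tools rely on faster approximations or model-specific algorithms.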