Introduction to sensitivity analysis for random forest regression, binary classification and multi-class classification using the {forestFloor} package
1. Sensitivity analysis with random forests +
"sensitivity analysis using forestFloor"
An introduction to the forestFloor package
(S. H. Welling, et al., ArXiv e-prints, June 2016.)
55th R Study Group @ Tokyo (#TokyoR)
feature contribution
Trying out forestFloor
6. After computing variable importance, what comes next…
• In data mining applications the input predictor variables are seldom equally relevant. Often
only a few of them have substantial influence on the response; the vast majority are
irrelevant and could just as well have not been included. It is often useful to learn the relative
importance or contribution of each input variable in predicting the response.
• After the most relevant variables have been identified, the next step is to attempt to
understand the nature of the dependence of the approximation f(X) on their joint values.
from Hastie, Tibshirani, Friedman (2008), The Elements of Statistical Learning (2nd ed.), pp. 367-
We want to know how each variable affects the prediction
⇒ sensitivity analysis
7. A multivariate regression example
Demo data
Y = 2·X1 + 2·sin(π·X2) + 3·(X3 + X2)² + 0·(X4 + X5 + X6) + ε
# simulate data
obs <- 1500
vars <- 6
X <- data.frame(replicate(vars, runif(obs))) * 2 - 1
Y <- with(X, X1*2 + 2*sin(X2*pi) + 3*(X3+X2)^2)  # X4, X5, X6 are noise variables
Yerror <- 1 * rnorm(obs)
var(Y) / var(Y + Yerror)  # fraction of variance explained by the signal
Y <- Y + Yerror
8. Partial Dependency Plot (PDP) using randomForest
library(randomForest)
library(forestFloor)
#grow a forest, remember to include inbag
multireg.rfo=randomForest::randomForest(X,Y,
keep.inbag=TRUE,
ntree=1000, sampsize=500,
replace=TRUE, importance=TRUE)
names.X <- c("X2","X3","X1","X5","X6","X4")
# randomForest::partialPlot()
par(mfrow=c(2, 3))
for (i in seq_along(names.X)) {
partialPlot(multireg.rfo, X, names.X[i], xlab=names.X[i],
main = names.X[i], ylim=c(-4,10))
}
par(mfrow=c(1, 1))
Visualizes the "average predicted value" when the variable of
interest is held at a given value
9. PDP focuses on marginal averages
① When 𝑥i is held at a given value, average y over the variation
in the remaining variables
② Compute the average y for every value of 𝑥i and connect the
points with a line
19. # (continued)
library(rgl)
rgl::plot3d(ff$X[,2], ff$X[,3], apply(ff$FCmatrix[,2:3], 1, sum),
  # add some colour gradients to ease visualization
  # box.outliers squeezes all observations into a 2-std.-dev box,
  # univariately for a vector or matrix, and normalizes to [0;1]
  col = fcol(ff, 2, orderByImportance = FALSE))
# add grid convolution/interpolation
# make grid with current function
grid23 = convolute_grid(ff, Xi = 2:3, userArgs.kknn = alist(k = 25, kernel = "gaussian"),
  grid = 50, zoom = 1.2)
# apply grid on 3d-plot
rgl::persp3d(unique(grid23[,2]), unique(grid23[,3]), grid23[,1], alpha = 0.3,
  col = c("black", "grey"), add = TRUE)
Feature Contribution
Passing the feature contributions to the rgl package also lets you observe interactions between variables
Plotted with a colour gradient by the value of X2
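The shape of that joint surface can also be checked against the simulation's true function directly; a base-R sketch with persp() (no forestFloor or rgl objects needed, grid range assumed to match the demo data):

```r
# Base-R check of the joint X2,X3 effect surface the rgl plot approximates,
# computed from the demo data's true function rather than the fitted forest.
x2 <- seq(-1, 1, length.out = 50)
x3 <- seq(-1, 1, length.out = 50)
z <- outer(x2, x3, function(a, b) 2 * sin(a * pi) + 3 * (a + b)^2)
persp(x2, x3, z, theta = 40, phi = 25,
      xlab = "X2", ylab = "X3", zlab = "joint effect")
```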
20. For a 2-class classification tree:
The probability that an instance Xi belongs to class A is P(Xi)
The probability that it belongs to class B is 1 - P(Xi)
[figure: probability of belonging to class A, on an axis from P = 0 to P = 1]
sklearn's RandomForestClassifier, for instance, passes on the probabilistic vote from the terminal nodes
Think of each weak learner as a function that maps an instance to its class-membership probability
21. For a multi-class classification tree:
The probability that an instance Xi belongs to class m is Pm(Xi)
[figure: probability axis from Pm = 0 to Pm = 1]
Think of each weak learner as a function that maps an instance to its class-membership probability
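The idea that the forest's class probability is the average of per-tree terminal-node probabilities can be illustrated with a toy base-R sketch (random stand-in votes, not an actual fitted forest):

```r
# Toy sketch: each tree's terminal node yields class proportions Pm(x);
# the forest averages them into class-membership probabilities.
set.seed(42)
ntree <- 5; nclass <- 3
votes <- t(replicate(ntree, { p <- runif(nclass); p / sum(p) }))  # one row per tree
P <- colMeans(votes)                # forest-level probabilities for classes 1..m
stopifnot(abs(sum(P) - 1) < 1e-12)  # still a valid probability distribution
```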
25. Visualizing membership probabilities (3-class only)
# continued
pars = plot_simplex3(ff.test42, Xi = c(1:3), restore_par = FALSE, zoom.fit = NULL,
  var.col = NULL, fig.cols = 2, fig.rows = 1, fig3d = FALSE, includeTotal = TRUE,
  auto.alpha = .4, set_pars = TRUE)
pars = plot_simplex3(ff.test42, Xi = 0, restore_par = FALSE, zoom.fit = NULL,
  var.col = alist(alpha = .3, cols = 1:4), fig3d = FALSE, includeTotal = TRUE,
  auto.alpha = .8, set_pars = FALSE)
for (I in ff.test42$imp_ind[1:4]) {
  # plotting partial OOB-CV separation (including interaction effects),
  # coloured by true class
  pars = plot_simplex3(ff.test42, Xi = I, restore_par = FALSE, zoom.fit = NULL,
    var.col = NULL, fig.cols = 4, fig.rows = 2, fig3d = TRUE, includeTotal = FALSE,
    label.col = 1:3, auto.alpha = .3, set_pars = (I == ff.test42$imp_ind[1]))
  # coloured by variable value
  pars = plot_simplex3(ff.test42, Xi = I, restore_par = FALSE, zoom.fit = TRUE,
    var.col = alist(order = FALSE, alpha = .8), fig3d = FALSE, includeTotal = (I == 4),
    auto.alpha = .3, set_pars = FALSE)
}
33. References
• randomForest
• Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
• Hastie, Tibshirani, Friedman (2008). Partial dependence plots. In: The Elements of Statistical Learning (2nd ed.), pp. 367-
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
• forestFloor
• CRAN
• https://cran.r-project.org/web/packages/forestFloor/index.html
• Official site
• http://forestfloor.dk/
• Welling et al. (2016). Forest Floor Visualizations of Random Forests. ArXiv e-prints, June 2016.
• http://arxiv.org/abs/1605.09196
• Palczewska et al. (2014). Interpreting random forest classification models using a feature contribution method.
• http://arxiv.org/abs/1312.1121
• ICEbox
• CRAN
• https://cran.r-project.org/web/packages/ICEbox/index.html
• Goldstein et al. (2015). Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation. Journal of Computational and Graphical Statistics, 24(1): 44–65.
• http://arxiv.org/abs/1309.6392