Things I thought about when using RandomForest in practice
This is the day-21 entry of the Machine Learning Advent Calendar 2012.
I normally do data analysis for a living, and I have had several opportunities to use RandomForest in real projects, so today I would like to introduce some questions I was asked when presenting to clients, together with my answers. I use machine learning and data mining from a practitioner's standpoint and am not a specialist in the methods themselves, so if you spot any mistakes, please point them out.
Now, RandomForest is a famous algorithm, so I expect many readers already know it. It is an ensemble learning algorithm based on decision trees, proposed in 2001 by Leo Breiman, who also developed CART. In a word, it builds a large number of decision trees and classifies an input into the class that receives the most votes among the trees' outputs. (For regression, it returns the average.)
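The vote-then-classify step can be sketched in a few lines of Python (the tree outputs here are made-up toy values, purely for illustration):

```python
from collections import Counter

import numpy as np

# Hypothetical outputs of 7 trees for one classification input
votes = [0, 2, 1, 2, 2, 0, 2]
majority = Counter(votes).most_common(1)[0][0]  # class with the most votes
print(majority)  # -> 2

# For regression, the tree outputs are averaged instead
preds = [2.1, 1.9, 2.4, 2.0]
print(np.mean(preds))  # average of the tree outputs
```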
For those not yet familiar with RandomForest, id:hamadakoichi's slides are a very accessible introduction.
Now, on to the main topic. The questions I have received in the past are the following:
- Why does RandomForest achieve high accuracy?
- How does it differ from bagging?
- How should the parameters be tuned?
The first two are similar questions, but I will answer each to the best of my knowledge.
Why does RandomForest achieve high accuracy?
It is often said that weak learners such as decision trees (not only in RandomForest) are well suited to ensemble learning. To understand why, we can look at it from the bias-variance perspective.
Bias-Variance - the machine-learning wiki "朱鷺の杜Wiki" (ibisforest)
According to bias-variance theory, the generalization error decomposes as follows:

generalization error = bias + variance + noise

Here, bias is the error due to the model's expressive power, variance is the error due to the choice of dataset, and noise is the error that is inherently irreducible.
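For squared loss this decomposition can be written out explicitly. With the true relation y = f(x) + ε, where E[ε] = 0 and Var(ε) = σ², and with f̂ denoting the learned model, the expected error over training sets is:

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\bigl(f(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

The bias term depends only on what the model class can express, the variance term measures how much the fitted model jumps around as the training set changes, and σ² cannot be reduced by any model.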
By the nature of the algorithm, a decision tree is a model that is strongly affected by its training data, i.e., a high-variance learner. Ensemble algorithms such as RandomForest and bagging improve accuracy by reducing this variance.
Incidentally, the reason you rarely hear about ensembles of SVMs, another famously accurate algorithm, is that SVM is a low-variance model.
How does it differ from bagging?
Bagging is a kind of ensemble learning: it creates many datasets by bootstrap sampling, trains a classifier on each of them, and classifies by majority vote among those classifiers. The difference between RandomForest and bagging is that RandomForest additionally samples the features.
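As a point of comparison, plain bagging of decision trees is only a few lines with scikit-learn's BaggingClassifier (shown here on iris; the class and its defaults are scikit-learn's, not something from this post):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagging: bootstrap-sample the rows only; every tree sees all features.
# RandomForest would additionally subsample the features.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        random_state=0)
score = cross_val_score(bag, X, y, cv=5).mean()
print(score)
```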
確çå¤æ°å士ãç¸é¢ãä¿ã¤å ´åãå¹³åã®åæ£ã¯ä»¥ä¸ã®å¼ã§è¡¨ç¾ããã¾ãã
ããã§ãã¯çæãã決å®æ¨ã®æ°ãã¯åæ£ãã¯å¤æ°éã®ç¸é¢ã§ããbaggingã§æ½åºãã決å®æ¨ã¯ãã¼ã¿ã«ãã£ã¦ã¯ãã¼ãã¹ãã©ãããµã³ããªã³ã°ã§ä½æããåã
ã®æ±ºå®æ¨å士ã®ç¸é¢ãé«ããã¨ãããã¾ããããã«å¯¾ãã¦ä½¿ç¨ããç¹å¾´éãéãæ¨ãããããçæãã¦ããRandomForestã¯æ±ºå®æ¨éã®ç¸é¢ãä½ãçºãä¸ã®å¼ã®ç¬¬äºé
ãå°ãããªãbaggingãããããªã¢ã³ã¹ãä¸ãããbaggingããRandomForestã®æ¹ãçæããå¤æ§æ§ãé«ããªãã¾ãã
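This variance formula for correlated averages is easy to sanity-check numerically. The sketch below (my own, not from the original post) constructs B unit-variance variables with pairwise correlation ρ and compares the empirical variance of their mean against (1 - ρ)/B · σ² + ρσ²:

```python
import numpy as np

rng = np.random.default_rng(0)
B, rho, sigma, n = 10, 0.5, 1.0, 200_000

# X_i = sqrt(rho)*Z + sqrt(1-rho)*E_i has Var(X_i) = 1 and Corr(X_i, X_j) = rho
z = rng.standard_normal((n, 1))   # shared component, reused by all B variables
e = rng.standard_normal((n, B))   # independent components
x = sigma * (np.sqrt(rho) * z + np.sqrt(1.0 - rho) * e)

empirical = x.mean(axis=1).var()
theory = (1.0 - rho) / B * sigma**2 + rho * sigma**2
print(empirical, theory)  # both close to 0.55
```

Even with B = 10 trees, the ρσ² term dominates: lowering ρ is the only way to push the variance of the average much below σ², which is exactly what the feature sampling is for.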
How should the parameters be tuned?

The main parameters of RandomForest are the following two:
- the number of decision trees to build
- the number of features used when building each individual tree
Deciding the number of trees is easy: just keep increasing the number of trees used for prediction and pick a number at which the results stabilize. As mentioned above, RandomForest samples the features used to build each tree in order to lower the correlation between trees. How many features to use is a parameter, and for classification trees the recommended value is √N when there are N features.
In practice, however, the optimal number of features is data-dependent. When there are many features, or when the meaningful features are a small fraction of the whole, setting a value larger than the recommended one tends to give better results, so I recommend deciding it by grid search.
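With scikit-learn's RandomForestClassifier the feature count is the max_features parameter, and a grid search over it looks like this (a minimal sketch on iris, where N = 4, so the √N default is 2):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Try every possible feature count; iris has only 4 features
param_grid = {'max_features': [1, 2, 3, 4]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

On a real dataset with many features you would search a coarser grid (e.g. √N, 2√N, N/4, ...) rather than every value.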
Implementation

Regretting that I wrote in the Machine Learning Advent Calendar comment thread that I would implement something, I implemented the core part of RandomForest in Python. Out-Of-Bag estimation didn't make it in time, so it is not included. Each individual tree is computed with the scikit-learn library. (Incidentally, scikit-learn has its own RandomForest implementation.)
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
Created on 2012/12/21

@author: shakezo_
'''
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np


def feature_sampling(data, feature_num, mtry):
    # Pick mtry feature indices at random and keep only those columns
    idx = np.random.choice(feature_num, mtry, replace=False)
    return data[:, idx], idx


def predict(clf_list, data):
    # Majority vote over all trees; each tree sees only its own features
    votes = {}
    for clf, idx in clf_list:
        pid = int(clf.predict(data[idx].reshape(1, -1))[0])
        votes[pid] = votes.get(pid, 0) + 1
    return max(votes, key=votes.get)


if __name__ == '__main__':
    # Fetch the iris dataset and split it into train/test
    iris = load_iris()
    x_train, x_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.33, random_state=42)

    # parameters
    tree_num = 500
    train_num = int(len(x_train) * (2.0 / 3))
    feature_num = x_train.shape[1]
    mtry = 2

    # Run the random forest: build each tree from a bootstrap sample
    # of the rows and a random subset of the features
    clf_list = []
    for _ in range(tree_num):
        sample = np.random.choice(len(x_train), train_num, replace=True)
        data, target = x_train[sample], y_train[sample]
        partial_data, idx = feature_sampling(data, feature_num, mtry)
        clf = tree.DecisionTreeClassifier()
        clf.fit(partial_data, target)
        clf_list.append((clf, idx))

    # Predict the test data and report accuracy
    correct_num = sum(1 for x, y in zip(x_test, y_test)
                      if predict(clf_list, x) == y)
    print("Accuracy =", correct_num / float(len(x_test)))
```
Results
Accuracy = 0.98
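As noted above, scikit-learn ships its own RandomForestClassifier; on the same train/test split it gives a comparable score (exact numbers vary with the random seed):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42)

clf = RandomForestClassifier(n_estimators=500, max_features=2, random_state=0)
clf.fit(x_train, y_train)
acc = clf.score(x_test, y_test)
print("Accuracy =", acc)
```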
That's all for now.