downsampling+bagging with LightGBM
Introduction
This is a technical article from a few years ago.
From the year-end holidays through recently, I had been prioritizing input such as studying PyTorch. As part of that, I studied how to handle imbalanced data.
I tried downsampling on artificial imbalanced data with a 1:99 class ratio, and it turns out that even throwing away negative examples wholesale is surprisingly fine. Since computation time drops dramatically, it seems you could use that time for ensembling or the like to secure accuracy.
— u++ (@upura0) January 8, 2019
Prompted by the tweet above, I received a lot of information in replies and elsewhere. The "downsampling+bagging" approach, which was a topic of discussion a while back, looked promising, so in this article I try "downsampling+bagging" with LightGBM on a synthetically generated dataset.
I've been studying how to deal with imbalanced data, and [Wallace et al. ICDM'11] https://t.co/ltQ942lKPm … concludes that undersampling + bagging works well.
— • (@tmaehara) July 29, 2017
Creating the dataset
In creating the dataset, I referred to the following article.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import StratifiedShuffleSplit

args = {
    'n_samples': 7000000,
    'n_features': 80,
    'n_informative': 3,
    'n_redundant': 0,
    'n_repeated': 0,
    'n_classes': 2,
    'n_clusters_per_class': 1,
    'weights': [0.99, 0.01],
    'random_state': 42,
}

X, y = make_classification(**args)
```
The target is a binary {0, 1} classification. Out of 7 million rows in total, positives (label 1) make up roughly 1%, giving an imbalanced dataset.
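As a quick sanity check, the class ratio produced by this recipe can be verified directly. The sketch below shrinks `n_samples` to 70,000 (instead of the 7 million used above) so it runs in a moment; the ratio behaves the same way.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same recipe as above, but with 70,000 rows so it finishes quickly
X_small, y_small = make_classification(
    n_samples=70000, n_features=80, n_informative=3,
    n_redundant=0, n_repeated=0, n_classes=2,
    n_clusters_per_class=1, weights=[0.99, 0.01], random_state=42,
)

pos_ratio = y_small.mean()  # fraction of label-1 rows
print(pos_ratio)            # roughly 1% positives
```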

Next, split the data into training, validation, and test sets so that the label proportions stay the same in each split.
```python
def imbalanced_data_split(X, y, test_size=0.2):
    # Stratified split keeps the label ratio identical across splits
    # (note: pass test_size through instead of hard-coding 0.2)
    sss = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=0)
    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = imbalanced_data_split(X, y, test_size=0.2)

# for validation
X_train2, X_valid, y_train2, y_valid = imbalanced_data_split(X_train, y_train, test_size=0.2)
```
LightGBM
First, let's try plain LightGBM.
```python
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

lgbm_params = {
    'learning_rate': 0.1,
    'num_leaves': 8,
    'boosting_type': 'gbdt',
    'reg_alpha': 1,
    'reg_lambda': 1,
    'objective': 'binary',
    'metric': 'auc',
}

def lgbm_train(X_train_df, X_valid_df, y_train_df, y_valid_df, lgbm_params):
    lgb_train = lgb.Dataset(X_train_df, y_train_df)
    lgb_eval = lgb.Dataset(X_valid_df, y_valid_df, reference=lgb_train)
    # Train a model with the parameters above
    model = lgb.train(lgbm_params,
                      lgb_train,
                      # Pass the validation data for evaluation
                      valid_sets=lgb_eval,
                      # Train for at most 1000 rounds
                      num_boost_round=1000,
                      # Stop if the metric does not improve for 10 rounds
                      early_stopping_rounds=10)
    return model
```
Training the model took 2min 21s.
```python
%%time
model_normal = lgbm_train(X_train2, X_valid, y_train2, y_valid, lgbm_params)
```
```
(earlier output omitted)
[62]	valid_0's auc: 0.831404
Early stopping, best iteration is:
[52]	valid_0's auc: 0.831614
CPU times: user 2min 16s, sys: 4.87 s, total: 2min 21s
Wall time: 58.7 s
```
Predicting on the test data gave an AUC of 0.829287295077.
```python
y_pred_normal = model_normal.predict(X_test, num_iteration=model_normal.best_iteration)

# Compute AUC
auc = roc_auc_score(y_test, y_pred_normal)
print(auc)
```
downsampling
Next, let's try downsampling.
Downsampling randomly discards rows of the majority class until its count matches that of the minority class. In this case, that means throwing away a large share of the negative examples (label 0).
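Before reaching for a library, the operation itself can be sketched in a few lines of NumPy (the toy arrays below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy imbalanced data: 95 negatives, 5 positives
y = np.array([0] * 95 + [1] * 5)
X = rng.normal(size=(100, 3))

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
# Randomly keep only as many negatives as there are positives
kept_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
keep = np.concatenate([kept_neg, pos_idx])

X_down, y_down = X[keep], y[keep]
print(len(y_down), int(y_down.sum()))  # 10 5 -> balanced
```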
With the imbalanced-learn library, this processing can be written concisely.
```python
from imblearn.under_sampling import RandomUnderSampler

sampler = RandomUnderSampler(random_state=42)

# downsampling
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

# for validation
X_train2, X_valid, y_train2, y_valid = imbalanced_data_split(X_resampled, y_resampled, test_size=0.2)
```

Since the majority class is cut down to the number of positive examples (in the training data), the dataset becomes much smaller.
Training with LightGBM exactly as before, the training time shrank to 5.24 s.
```python
%%time
model_under_sample = lgbm_train(X_train2, X_valid, y_train2, y_valid, lgbm_params)
```
```
(earlier output omitted)
[38]	valid_0's auc: 0.83336
Early stopping, best iteration is:
[28]	valid_0's auc: 0.833389
CPU times: user 5.02 s, sys: 229 ms, total: 5.24 s
Wall time: 2.76 s
```
Predicting on the test data gave an AUC of 0.828820480993, slightly worse than before.
| Method | AUC | Runtime |
|---|---|---|
| LightGBM | 0.829287295077 | 2min 21s |
| LightGBM + downsampling | 0.828820480993 | 5.24 s |
downsampling+bagging
Finally, let's add bagging.
Bagging builds multiple datasets from the original imbalanced data by sampling with replacement, trains a model on each, and ensembles the resulting models.
In imbalanced-learn's RandomUnderSampler(), setting the replacement argument to True performs sampling with replacement.
Here I train 10 models, changing the random seed each time.
```python
def bagging(seed):
    sampler = RandomUnderSampler(random_state=seed, replacement=True)
    X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)
    X_train2, X_valid, y_train2, y_valid = imbalanced_data_split(X_resampled, y_resampled, test_size=0.2)
    model_bagging = lgbm_train(X_train2, X_valid, y_train2, y_valid, lgbm_params)
    return model_bagging
```
Training the 10 models took 1min 24s.
```python
%%time
models = []
for i in range(10):
    models.append(bagging(i))
```
```
(earlier output omitted)
CPU times: user 1min 17s, sys: 6.4 s, total: 1min 24s
Wall time: 47.9 s
```
For this ensemble, the average of the predictions from the individual models is taken as the overall prediction.
Computing the AUC gives 0.829094611662, a bit higher than the single downsampled model.
```python
y_preds = []
for m in models:
    y_preds.append(m.predict(X_test, num_iteration=m.best_iteration))

# Average the 10 models' predictions
y_preds_bagging = sum(y_preds) / len(y_preds)

# Compute AUC
auc = roc_auc_score(y_test, y_preds_bagging)
print(auc)
```
| Method | AUC | Runtime |
|---|---|---|
| LightGBM | 0.829287295077 | 2min 21s |
| LightGBM + downsampling | 0.828820480993 | 5.24 s |
| LightGBM + downsampling + bagging (10 models) | 0.829094611662 | 1min 24s |
Conclusion
In this article, as a study of handling imbalanced data, I tried the "downsampling+bagging" approach with LightGBM.
Naturally, much depends on characteristics such as how imbalanced the data is, but for the dataset used here, my impressions were as follows:
- Even after boldly throwing away data via downsampling, performance does not degrade that much
- Using the saved runtime to add features or to ensemble models seems sufficient to maintain performance
Much of the data handled in the real world is imbalanced, so this is an approach I would like to keep trying on various datasets.
The implementation is published on GitHub.
github.com