This is the Day 6 article of the "BASE Advent Calendar 2018".
Saito (@pigooosuke) from DataStrategy is in charge of today's post.
Introduction
If you are a machine learning engineer working on classification or regression problems, have you ever been asked by colleagues or by the team adopting your model, "How far off can that prediction be?" or "How should we assess the risk of the model's predictions?" This post is about Quantile Regression, which may come in handy in exactly those situations.
Evaluating regression models
In classification problems, where we predict a category, we can check the accuracy for each class. In regression problems, however, where we predict sales or some other numeric value, the model outputs only a single predicted value at any given point, so grasping the distribution of that prediction is not straightforward.
```python
# scikit-learn linear regression sample
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])  # input samples
>>> y = np.dot(X, np.array([1, 2])) + 3  # target values
>>> reg = LinearRegression().fit(X, y)
>>> reg.predict(np.array([[3, 5]]))  # one input [3, 5] yields exactly one predicted value
array([16.])
```
If I used the word "certainty" of a prediction here, the definition would be vague, so let me break it down a little. When it comes to capturing the range of a regression prediction, the following two notions are the ones that come up. Let's review them first so we don't confuse them.
- Confidence interval
  - e.g. a 95% confidence interval means: if we measured 100 times under the same conditions, the fitted regression curve would fall inside this interval in 95 of those measurements.
- Prediction interval
  - e.g. a 95% prediction interval means: out of 100 data points observed in the future, 95 would fall inside the prediction interval.
The figure above shows a regression curve fitted to a sample.
The regression curve falls within the confidence interval, and the sample points fall within the prediction interval.
In general, the prediction interval lies outside the confidence interval.
As for how to use them:
- if you want to measure the accuracy of the model itself, look at the
confidence interval
- if you want to examine the spread of the predicted values at a specific point, look at the
prediction interval
That is a reasonable rule of thumb. The Quantile Regression discussed in this post is used to describe the prediction interval.
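To make the difference concrete, here is a minimal sketch that computes both intervals for a simple linear fit using the textbook formulas, with plain NumPy and SciPy (the synthetic data and the evaluation point x0 are made up for this example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=n)

# ordinary least-squares fit of y = a + b*x
b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
s2 = resid @ resid / (n - 2)                 # residual variance estimate
sxx = ((x - x.mean()) ** 2).sum()
t = stats.t.ppf(0.975, df=n - 2)             # two-sided 95% critical value

x0 = 5.0                                     # point at which to evaluate both intervals
y0 = a + b * x0
se_mean = np.sqrt(s2 * (1 / n + (x0 - x.mean()) ** 2 / sxx))      # for the confidence interval
se_pred = np.sqrt(s2 * (1 + 1 / n + (x0 - x.mean()) ** 2 / sxx))  # for the prediction interval

ci = (y0 - t * se_mean, y0 + t * se_mean)    # 95% confidence interval of the regression line
pi = (y0 - t * se_pred, y0 + t * se_pred)    # 95% prediction interval for a new observation
print(ci, pi)  # the prediction interval is the wider of the two
```

The extra `1 +` inside `se_pred` is the variance of a new observation itself, which is why the prediction interval always contains the confidence interval.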
Quantile Regression
A quantile marks where a data point sits when the data are sorted and divided into equal-sized groups, i.e. within which top percentage of the data it falls. Dividing into four groups gives the quartiles, and the 2nd quartile coincides with the median.

0th quartile = 0th percentile
1st quartile = 25th percentile
2nd quartile = 50th percentile = median
3rd quartile = 75th percentile
4th quartile = 100th percentile
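As a quick sanity check, NumPy's `quantile` function reproduces this table on a toy dataset:

```python
import numpy as np

data = np.arange(1, 101)  # the integers 1..100
for q in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{q:.2f} quantile -> {np.quantile(data, q)}")
# q = 0.5 returns 50.5, the median of the data
```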
Quantile Regression extends the loss function of linear regression: instead of minimizing the squared error and thereby fitting the mean as usual, it optimizes a loss function defined for a quantile (percentile) that you set in advance. It is used, for example, when you want to look at a skewed distribution such as annual income through the median rather than the mean. Beyond simply comparing the predicted values at each quantile, it can also deepen your understanding of the explanatory variables, for example by comparing how much an inequality indicator changes with the level of income or rent, and there are many applications in sociology and economics.
Incidentally, Quantile Regression is estimated with what is called a check function, a function built from an indicator so that it always takes non-negative values,
and the following loss is optimized:

$\rho_\tau(u) = u\,(\tau - I(u < 0))$, where $u$ is the residual (actual minus predicted value) and $I$ is the indicator function.

Put briefly, the difference between the actual and predicted value is weighted by $\tau$ when it is positive and by $(1 - \tau)$ when it is negative (the sign is flipped, so strictly speaking the contribution is again positive), and the sum of these weighted terms is taken as the loss.
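This weighted sum can be written in a few lines of NumPy (a sketch for illustration; `quantile_loss` is a name chosen here, not a library function):

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau):
    """Mean check-function (pinball) loss at quantile tau."""
    diff = y_true - y_pred
    # positive residuals weighted by tau, negative ones by (1 - tau)
    return np.mean(np.where(diff >= 0, tau * diff, (tau - 1.0) * diff))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([2.0, 2.0, 2.0])
print(quantile_loss(y_true, y_pred, 0.5))  # at tau = 0.5 this is half the mean absolute error
```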
If you want the detailed derivation, these links cover it well.
- Reference (pdf): QUANTILE REGRESSION : ROGER KOENKER
- Wikipedia, Quantile Regression
- Statistics/Numerical Methods/Quantile Regression
Walking through the theory
Even the linked materials skip many of the finer steps, so as a study exercise I worked through the derivation myself. Feel free to skip this section if you are not interested.
Let $X$ be a random variable. If we define the "center" of its distribution via the mean, the median, or a quantile (percentile): just as the mean is the value minimizing the expected squared error,

$\mu = \arg\min_c E[(X - c)^2]$,

the median can be expressed as the value minimizing the expected absolute deviation,

$\mathrm{median}(X) = \arg\min_c E[\,|X - c|\,]$.

For $0 < \tau < 1$, the $\tau$-quantile $q_\tau$ is defined via the cumulative distribution function $F$ as the value satisfying $F(q_\tau) = \tau$. Expressed with the inverse function, this becomes

$q_\tau = F^{-1}(\tau) = \inf\{\, x : F(x) \geq \tau \,\}$.

Above we characterized the median as a minimizer; generalizing that to an arbitrary quantile gives

$q_\tau = \arg\min_c E[\rho_\tau(X - c)]$,

where $\rho_\tau$ is the non-negative check function that appeared at the start. Expressed in a single line with the indicator function, it is

$\rho_\tau(u) = u\,(\tau - I(u < 0))$.

Writing $f$ for the probability density function and $F$ for the cumulative distribution function of $X$, the expected loss, whether $X$ is expressed as a discrete variable (a sum) or a continuous one (an integral), becomes

$E[\rho_\tau(X - c)] = (\tau - 1)\int_{-\infty}^{c} (x - c)\,dF(x) + \tau \int_{c}^{\infty} (x - c)\,dF(x)$.

Differentiating this with respect to $c$ and setting the left-hand side to 0,

$0 = (1 - \tau)\int_{-\infty}^{c} dF(x) - \tau \int_{c}^{\infty} dF(x) = F(c) - \tau$,

so $F(c) = \tau$, and the minimizer $c$ equals the $\tau$-quantile.
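This conclusion can also be checked numerically: minimizing the empirical check-function loss over a constant recovers the sample quantile. A small sketch (a grid search, so only NumPy is needed):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=10_000)   # a standard-normal sample
tau = 0.9

def loss(c):
    # empirical expected check-function loss for the constant predictor c
    u = x - c
    return np.mean(np.where(u >= 0, tau * u, (tau - 1.0) * u))

grid = np.linspace(-3, 3, 2001)
c_star = grid[np.argmin([loss(c) for c in grid])]
print(c_star, np.quantile(x, tau))  # both approximate the 0.9-quantile of N(0, 1), about 1.28
```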
Trying it out in Python
Several Python libraries support this Quantile Regression loss function, so let me introduce a few examples.
Everyone's favorite, scikit-learn
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor

# try it on a dataset of a sine wave plus noise
n = 300       # number of samples
noise = 0.2   # noise strength
np.random.seed(1)
x = np.linspace(0, 2 * np.pi, n)
y = np.sin(x) + noise * np.random.randn(n)
x = x.reshape(-1, 1)

# to get a 95% prediction interval, split the remaining 5% evenly
# between the two tails and predict at the 0.025 and 0.975 quantiles
alpha = 0.975

# model definition
clf = GradientBoostingRegressor(loss='quantile', alpha=alpha,
                                n_estimators=250, max_depth=3,
                                learning_rate=.1, min_samples_leaf=9,
                                min_samples_split=9)

# predict the upper bound
clf.fit(x, y)
y_upper = clf.predict(x)

# flip alpha and predict the lower bound
clf.set_params(alpha=1.0 - alpha)
clf.fit(x, y)
y_lower = clf.predict(x)

# set the loss to least squares and predict
clf.set_params(loss='ls')  # 'squared_error' in newer scikit-learn versions
clf.fit(x, y)
y_pred = clf.predict(x)

# visualization
fig = plt.figure(figsize=(8, 4))
plt.plot(x, y, 'b.', markersize=10, label="samples")
plt.plot(x, y_upper, 'k-')
plt.plot(x, y_lower, 'k-')
plt.plot(x, y_pred, 'r-', label='prediction')
plt.fill(np.concatenate([x, x[::-1]]),
         np.concatenate([y_upper, y_lower[::-1]]),
         alpha=.5, fc='b', ec='None', label='95% prediction interval')
plt.legend(loc='upper right')
plt.show()
```
At the moment, this loss function can only be set on the ensemble model GradientBoostingRegressor.
Incidentally, there has been an effort for a while to add Quantile Regression for linear models as well. Let's wait for further news. https://github.com/scikit-learn/scikit-learn/issues/3148
LightGBM, the gradient boosting library
```python
import lightgbm as lgb

# to get a 95% prediction interval, split the remaining 5% evenly
# between the two tails and predict at the 0.025 and 0.975 quantiles
alpha = 0.975

# model definition
# (min_child_samples is LightGBM's counterpart to scikit-learn's min_samples_leaf)
clf = lgb.LGBMRegressor(objective='quantile', alpha=alpha,
                        n_estimators=250, max_depth=3,
                        learning_rate=.1, min_child_samples=9)

# predict the upper bound
clf.fit(x, y)
y_upper = clf.predict(x)

# flip alpha and predict the lower bound
clf.set_params(alpha=1.0 - alpha)
clf.fit(x, y)
y_lower = clf.predict(x)

# set the objective to ordinary least-squares regression and predict
clf.set_params(objective='regression')
clf.fit(x, y)
y_pred = clf.predict(x)

# visualization (x and y are the sine-wave dataset from the previous example)
fig = plt.figure(figsize=(8, 4))
plt.plot(x, y, 'b.', markersize=10, label="samples")
plt.plot(x, y_upper, 'k-')
plt.plot(x, y_lower, 'k-')
plt.plot(x, y_pred, 'r-', label='prediction')
plt.fill(np.concatenate([x, x[::-1]]),
         np.concatenate([y_upper, y_lower[::-1]]),
         alpha=.5, fc='b', ec='None', label='95% prediction interval')
plt.legend(loc='upper right')
plt.show()
```
In fact, the body of the loss function implemented in scikit-learn boils down to
```python
pred = pred.ravel()
diff = y - pred
alpha = self.alpha
mask = y > pred
if sample_weight is None:
    loss = (alpha * diff[mask].sum() -
            (1.0 - alpha) * diff[~mask].sum()) / y.shape[0]
```
which is a very simple structure. For what it's worth, LightGBM's implementation looks like this:
```cpp
class QuantileMetric : public RegressionMetric<QuantileMetric> {
 public:
  explicit QuantileMetric(const Config& config)
      : RegressionMetric<QuantileMetric>(config) {}

  inline static double LossOnPoint(label_t label, double score, const Config& config) {
    double delta = label - score;
    if (delta < 0) {
      return (config.alpha - 1.0f) * delta;
    } else {
      return config.alpha * delta;
    }
  }

  inline static const char* Name() { return "quantile"; }
};
```
Both just apply a weight to the difference between the actual value and the prediction at each leaf_value, so fundamentally they are the same.
However, XGBoost, the other major gradient boosting library, does not ship this loss out of the box, so you apparently have to define the loss function yourself. If you are interested, it might be fun to write your own.
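As a hint of what such a custom loss would involve, here is a sketch of the gradient and Hessian of the check loss, written on plain NumPy arrays (the check loss has zero second derivative almost everywhere, so a small constant Hessian is a common workaround; `quantile_grad_hess` and the wrapper in the comment are illustrative names, and hooking it up to XGBoost's custom-objective interface is left to the reader):

```python
import numpy as np

def quantile_grad_hess(y_true, y_pred, tau):
    """Gradient and Hessian of the check loss with respect to the prediction."""
    u = y_true - y_pred
    # d/dpred of rho_tau(u): -tau where u >= 0, (1 - tau) where u < 0
    grad = np.where(u >= 0, -tau, 1.0 - tau)
    # the true second derivative is 0 almost everywhere;
    # a constant keeps the boosting updates well-defined
    hess = np.ones_like(grad)
    return grad, hess

# With XGBoost this could be wrapped roughly as:
#   def obj(preds, dtrain):
#       return quantile_grad_hess(dtrain.get_label(), preds, tau=0.9)
# and passed to xgb.train(params, dtrain, obj=obj)
```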
In this post I introduced Quantile Regression as a way of explaining the output of a regression model. It seems worth remembering as one technique for when you are asked to explain what a regression model's output means.
Tomorrow, id:lllitchi will talk about design. Stay tuned!