[Continuously updated] Tips that should come in handy for Kaggle tabular data competitions
What is this?
- A collection of tips that (I think) help when entering Kaggle tabular data competitions, built by expanding my class notes from Kaggle Coursera.
- It mainly covers material I actually understand. There is still a lot in the source materials that I have not digested, so I will keep updating this post (and keep studying so that I can).
- How I put these tips to use is written up in my Kaggle competition diary, so please read that as well.
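References
Most of the content is picked up from the materials below. Anything without an explicit citation comes (almost 100%) from Kaggle Coursera. Numbers in parentheses in the text, such as (2), refer to this list.
- (1) Kaggle Coursera
- (2) kaggle_memo by nejumi
- (3) Data analysis methods that reached 11th in the world on Kaggle: learning from the model code of Sansan's 高際睦起
- (4) Summary of solutions to the Kaggle TalkingData Fraud Detection competition (advanced part)
- (5) Kaggle Memo by amaotone
- (6) Feature engineering for tabular data, learned from recent Kaggle competitions
- (7) The Santander Product Recommendation approach and some XGBoost odds and ends
- (8) Tips for data science competitions
- (9) Feature Engineering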
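(Added 2020/02/09) A book that should come in handy for tabular competitions
- After I wrote this blog post, a book packed with techniques useful for tabular competitions was published.
- This post draws heavily on Kaggle Coursera, but as of February 2020 that book is the better place to start. In fact, I think a Japanese-speaking Kaggler who has not read it is at a real disadvantage, so I am mentioning it here.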
EDA
General
- Look at the data properly. It is sometimes wrong (which is why you should understand what each column means).
- If you have no foothold for EDA, starting from the feature_importance of XGBoost or LightGBM is fine too (2)
- Do not delete a column that contains wrong values; tell the algorithm that the column is wrong instead. (Sometimes the algorithm is smarter than you.)
- Visualization matters.
- If one axis tells you nothing, make a scatter plot on two axes.
  - Scatter plots also show whether Train and Test are homogeneous, which is handy.
  - One line of code does it:
    pd.plotting.scatter_matrix(df)
- Draw histograms to find spikes (this one is a must).
- Try color-coding the data in Excel or similar and looking it over.
- Reverse-engineering anonymous features occasionally pays off.
  - There are cases where a year of birth turned up and proved useful.
- Memorize these handy expressions:
  df.dtypes, df.info(), x.value_counts(), x.isnull()
- For time series and the like in particular, sort into a meaningful order before looking (5)
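The "handy expressions" above can be bundled into one quick summary. This is only a minimal sketch: the helper name `quick_look` and the toy columns are made up for illustration.

```python
import numpy as np
import pandas as pd


def quick_look(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and cardinality for every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_null": df.isnull().sum(),
        "n_unique": df.nunique(dropna=True),
    })


# Hypothetical toy frame; the column names are illustrative only.
df = pd.DataFrame({
    "ts": pd.to_datetime(["2020-01-03", "2020-01-01", "2020-01-02"]),
    "price": [4980.0, np.nan, 398.0],
    "seat": ["S", "A", "S"],
})

summary = quick_look(df)
seat_counts = df["seat"].value_counts()

# Sort into a meaningful (here: chronological) order before eyeballing rows.
df_sorted = df.sort_values("ts").reset_index(drop=True)
```

For the plots themselves, `df.hist()` and `pd.plotting.scatter_matrix(df)` need matplotlib, so they are left out of the sketch.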
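Data cleaning
- Drop columns that carry zero information
  - Columns containing only one distinct value
  - Columns that are exact duplicates of another column
  - Columns whose correlation coefficient with another column is 1
  - Reference kernel
- Drop columns in which new values appear only in Test
- Drop columns that become identical to another column once the labels are renamed
  - pd.factorize makes this easy
- Train and Test sometimes share rows, so try to guess why, just in case.
- Taking a moving average may reveal that the data was not properly shuffled.
- Plotting the data shows things like whether it has been shuffled.
  - If a line shows up, you also learn that the value occurs frequently.
- Memorize the expression for picking out the categorical columns:
  df.select_dtypes(include=['object']).columns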
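The "zero information" checks above might look like the following sketch. The function name `find_useless_columns` is hypothetical. `pd.factorize` is used so that two columns that differ only by relabeling compare as equal.

```python
import pandas as pd


def find_useless_columns(df: pd.DataFrame):
    """Return (constant columns, columns duplicating an earlier one up to relabeling)."""
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    seen = {}
    duplicated = []
    for c in df.columns:
        if c in constant:
            continue
        # factorize numbers the levels in order of first appearance,
        # so relabeled copies of a column yield identical codes.
        codes = tuple(pd.factorize(df[c])[0])
        if codes in seen:
            duplicated.append(c)
        else:
            seen[codes] = c
    return constant, duplicated


# Hypothetical toy data.
df = pd.DataFrame({
    "a": [1, 2, 1, 3],
    "b": ["x", "y", "x", "z"],  # same as "a" after relabeling
    "c": [0, 0, 0, 0],          # constant
    "d": [5, 5, 6, 7],
})
constant, duplicated = find_useless_columns(df)
```

Building a tuple per column is fine for a sketch; on wide or long data a hash of the codes would be lighter.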
Numerical Data
Overview
- The right preprocessing depends on the model you use.
  - Linear models and decision-tree models need different data.
- Blending models trained with different preprocessing can help.
Scaling
- Scaling basically does not matter for decision trees.
- MinMaxScaler ≈ StandardScaler
  - For linear models it makes no difference which one you use.
  - kNN is affected quite a lot.
Outlier removal / transformations
clipping
- Clipping at the 99th percentile (and the 1st, as in the code) works.
  lowerbound, upperbound = np.percentile(x, [1, 99])
  y = np.clip(x, lowerbound, upperbound)
rank transformation
- When outliers cannot be handled well, convert the values to ranks.
- This transformation works for kNN and NN.
  scipy.stats.rankdata
log transformation
- Works dramatically well for NN.
  np.log(1+x)
  np.sqrt(x+2/3)
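To see the three transformations side by side, here is a small sketch on a toy series with one extreme value. The data is made up. `Series.rank` stands in for `scipy.stats.rankdata` only to keep the imports short.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0])  # one extreme outlier

# Clipping: winsorize to the 1st and 99th percentiles.
lower, upper = np.percentile(x, [1, 99])
x_clipped = x.clip(lower, upper)

# Rank transformation: only the ordering survives, so the outlier
# ends up one step away from its neighbor.
x_rank = x.rank(method="average")

# Log transformation: log1p(x) equals log(1 + x) and is safe at x = 0.
x_log = np.log1p(x)
```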
Feature generation for numerical data
- Just try making lots of features
  - count, sum, max, min, rolling
  - is_null, is_zero, is_not_zero, over_0.5
- Summarize per-row statistics as well (9)
  - Number of NaNs, number of zeros, number of negative values, and so on
- Multiply and divide elements
  - The difference between X and Y
  - The mean of XX within a category
  - Interactions between variables can be computed with PolynomialFeatures
- Make features that keep only the fractional part
  - Like the 980 in 4980 yen
  - $4 and $3.98 feel quite different psychologically
  - Rounding down and rounding up are of course worth making too
  - Treating these as categorical variables may also be good (9)
- Tree models need the following data processing
  - Convert cumulative values back to plain values (not needed for linear models)
  - Taking sums, differences, and products matters, so put in the effort
- For tree models, working hard on operations across columns pays off
  - Because they cannot directly express differences or ratios
  - Owen says so too (8)
    GBM only APPROXIMATE interactions and non-linear transformations.
    Strong interactions benefit from being explicitly defined.
- Depending on the data, converting Numerical → Categorical is an option
  - Example: age bracket (early thirties)
- Operations that make business sense are more efficient than exhaustive ones
  - Examples from HomeCredit
    - Ratio of the loan amount to the monthly repayment
    - Ratio of the credit limit to the monthly usage
    - Difference in days between the due date and the actual payment date
    - Ratio of the loan amount to the down payment, etc.
  - Examples from Avito
    - Difference/ratio between an item's price and the mean price within the same category
      - Example: cheap for an iPhone
    - Features using the mean price grouped by the nouns in the product name
  - Kaggle Days Paris by CPMP
    - 2Sigma Appt Rental: predict the market price from the apartment description → use the difference between that prediction and the rent as a feature
    - TalkingData: someone who downloads the app will not click the ad again, right? → predict the time until the next click from the same device on the same ad, and use the difference from the actual time (and so on)
  - Avito 9th
- Binning sometimes helps (9)
  - For example when the numeric range of Test differs from that of Train
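A few of the features listed above, sketched with pandas. All column names and values are invented for illustration: the "980 in 4980" part, per-row counts, and the price relative to the category mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "category": ["phone", "phone", "phone", "book"],
    "price": [4980.0, 3000.0, 1020.0, 3.98],
    "f1": [0.0, np.nan, 2.0, -1.0],
    "f2": [0.0, 1.0, np.nan, 0.0],
})

# "The 980 in 4980": remainder below the thousands, plus the fractional part.
df["price_mod_1000"] = df["price"] % 1000
df["price_frac"] = df["price"] - np.floor(df["price"])

# Row-wise counts over a block of numeric columns.
block = df[["f1", "f2"]]
df["n_nan"] = block.isnull().sum(axis=1)
df["n_zero"] = (block == 0).sum(axis=1)
df["n_negative"] = (block < 0).sum(axis=1)

# Price relative to the mean price of the same category.
cat_mean = df.groupby("category")["price"].transform("mean")
df["price_diff_cat"] = df["price"] - cat_mean
df["price_ratio_cat"] = df["price"] / cat_mean
```

Differences and ratios like the last two are exactly what a tree model cannot express directly from the raw columns.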
Categorical / Ordinal Data
Encodings
LabelEncoding
- Works for tree models
  - Linear models additionally need OneHotEncoding
    - Tree models often do not need OneHotEncoding
    - It slows training down, so it may even be a disadvantage
  - For ordered levels such as S seats and A seats, converting them to numbers can help
FrequencyEncoding
- Encode each level by its frequency of occurrence
- Works for tree models as well as linear models
- Sensitive to outliers (9)
- Taking the log can help (9)
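Frequency encoding is short enough to show in full. In this sketch the frequencies are learned on train only and unseen test levels are mapped to 0. Both choices are assumptions; counting over train and test together is another common variant.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
train = pd.Series(["a", "a", "a", "b", "c"], name="cat")
test = pd.Series(["a", "b", "d"], name="cat")  # "d" never appears in train

# Relative frequency of each level, learned on train only.
freq = train.value_counts(normalize=True)

train_freq = train.map(freq)
# Unseen levels get frequency 0 here; that fallback is an assumption.
test_freq = test.map(freq).fillna(0.0)

# Log of the raw count, which compresses very frequent levels.
counts = train.value_counts()
train_log_count = np.log1p(train.map(counts))
```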
TargetEncoding
- Powerful, but watch out for data leakage
- The slides by the H2O.ai people explain in detail how to prevent leakage
  - Smoothing: there is also an article in Japanese
  - leave-one-out
  - Weight of Evidence
- Other devices
  - out-of-fold (Kaggle Meetup #4 LT by Jack)
  - Add random noise (Porto Seguro Kernel)
  - For time series and other cases where overfitting is a concern, encoding an important feature rather than the target itself is also an option (HomeCredit 2nd)
- In cases where leakage is a concern, beginners are basically better off not using it (2)
- For time series
  - CatBoost offers a way to use the mean of the target over the entire past (kaggler-ja, mamas)
    - Specify Timestamp in the Column descriptions
    - Also set has_time in the training parameters
  - Using the target mean over the past N periods is also good
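As one illustration of the leakage-avoidance ideas above, here is a minimal sketch of out-of-fold target encoding with additive smoothing. It is not taken from any of the cited solutions. The function name, the fold scheme (a plain random split), and the smoothing formula are all assumptions chosen for brevity.

```python
import numpy as np
import pandas as pd


def oof_target_encode(cat, y, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold target mean, shrunk toward the prior of the training folds."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(cat)), n_splits)
    for val_idx in folds:
        train_mask = np.ones(len(cat), dtype=bool)
        train_mask[val_idx] = False
        prior = y[train_mask].mean()
        stats = y[train_mask].groupby(cat[train_mask]).agg(["sum", "count"])
        smooth = (stats["sum"] + prior * smoothing) / (stats["count"] + smoothing)
        # Levels unseen in the training folds fall back to the prior.
        encoded.iloc[val_idx] = cat.iloc[val_idx].map(smooth).fillna(prior).to_numpy()
    return encoded


# Hypothetical toy data.
cat = ["a"] * 6 + ["b"] * 4
y = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]
enc = oof_target_encode(cat, y, n_splits=5, smoothing=2.0)
```

Because each row is encoded from folds that exclude it, changing a row's own target leaves its encoded value untouched. For time series the split would have to respect time order instead of being random.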
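Tips
- Multiple categorical variables are sometimes concatenated and used as one
  - For example, sex × cabin class
  - Effective for linear models
- Crossing many variables at once can also be worthwhile
  - Owen used 7-way crossed variables in the Amazon competition (8)
- Ridiculous-looking crosses sometimes work, so try weird interactions (9)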
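Datetime / Coordinates Data
Date and time
- Dates are often important, so build features from them thoroughly
- Besides the usual Datetime parts, "N minutes since X" can be used
- Likewise "the difference between the day X happened and the day Y happened"
- Morning / afternoon / late night, weekday / weekend / public holiday / day before a holiday, and so on
- National holidays, big sporting events, the first Saturday of the month, etc. (9)
- Try giving larger weights to records close to an important date
- Conversely, decay the weights of records far from it
- For periodic features such as day of week, time of day, and day of month, placing them on a circle and decomposing them into cos and sin to express cyclic continuity is effective (2)(9)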
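The circle trick in the last bullet can be sketched as follows. The helper name `cyclical_encode` is made up. The point is that 23:00 and 00:00 end up close together, which a raw hour column cannot express.

```python
import numpy as np
import pandas as pd


def cyclical_encode(values, period):
    """Map a periodic integer feature onto the unit circle as (sin, cos)."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)


# Hypothetical timestamps: 23:00, 00:00, 12:00.
ts = pd.Series(pd.to_datetime([
    "2020-01-05 23:00", "2020-01-06 00:00", "2020-01-06 12:00",
]))
hour_sin, hour_cos = cyclical_encode(ts.dt.hour, 24)
dow_sin, dow_cos = cyclical_encode(ts.dt.dayofweek, 7)


def circle_distance(i, j):
    """Euclidean distance between two rows in (sin, cos) space."""
    return np.hypot(hour_sin[i] - hour_sin[j], hour_cos[i] - hour_cos[j])
```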
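Coordinates (geographic) Data
- Distance from important landmarks or cities
- Adding useful external data such as surrounding land prices also works
- For cities, try building geographic features and experimenting
- Nearby information is usable too: how many houses are nearby, whether there is a school, and so on
- Attaching addresses to GPS coordinates or postal codes may be worth trying (9)
- Spot fake location information (9)
  - Travel at impossible speed
  - Spending in a place different from where the person currently is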
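Distance to a landmark and the "impossible speed" check both reduce to a great-circle distance. Below is a sketch using the haversine formula. The coordinates are rough values for two city centers and the 300 km/h threshold is an arbitrary assumption.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * r * np.arcsin(np.sqrt(a))


# Rough, illustrative coordinates (approximately Tokyo and Osaka).
dist_km = haversine_km(35.681, 139.767, 34.702, 135.496)

# Two events half an hour apart at those two places imply an implausible speed.
MAX_SPEED_KMH = 300.0  # hypothetical threshold
speed_kmh = dist_km / 0.5
is_impossible_move = speed_kmh > MAX_SPEED_KMH
```

The function also accepts NumPy arrays, so it can be applied to whole columns at once.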
Other feature-generation tips
Dimensionality reduction
- NMF gives a reduction that tree models can use more easily than PCA
- Why PCA works
  - Reportedly because decision-tree algorithms such as XGBoost are poor at expressing boundaries oblique to the axes (they can express them, but rotating with PCA is more efficient) (2)
- Consider topic models (LDA etc.) as well (4)
Time-series data
- Training on recent data is basically better, but check whether there is a characteristic periodicity, for example relative to one year earlier (5)
- Time-weighted counts and the like can make effective features (5)
- Split "cost" into last week's cost, last month's cost, last year's cost, and so on (9)
- In competitions where all the data is handed over at once, features built from a user's future behavior are strong (TalkingData 1st)
  - Features based on future behavior are far stronger than ones based on past behavior
    - Time elapsed since the previous click
    - Number of times that category value appears within the next hour
- Treat the concatenated string of past flag transitions as a categorical variable (7)
  - Often effective when the number of levels exceeds 100
  - For some reason it often does not work with DeepLearning
  - String histograms were also used in bosch 15th
  - Even outside time series, concatenating strings into a categorical variable seems to be used in other competitions (kaggle: Porto Seguro's Safe Driver Prediction summary)
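The click-based features above (time since the previous click, time to the next click, and the count within the next hour) can be sketched with a sorted groupby. The column names and toy data are assumptions. This is not the TalkingData winners' code.

```python
import numpy as np
import pandas as pd

# Hypothetical click log.
clicks = pd.DataFrame({
    "device": ["d1", "d1", "d1", "d2", "d2"],
    "click_time": pd.to_datetime([
        "2020-01-01 00:00:00", "2020-01-01 00:00:10", "2020-01-01 02:00:00",
        "2020-01-01 00:00:00", "2020-01-01 00:30:00",
    ]),
}).sort_values(["device", "click_time"]).reset_index(drop=True)

g = clicks.groupby("device")["click_time"]

# Past-looking and future-looking gaps, in seconds (NaN at group edges).
clicks["sec_since_prev"] = (clicks["click_time"] - g.shift(1)).dt.total_seconds()
clicks["sec_to_next"] = (g.shift(-1) - clicks["click_time"]).dt.total_seconds()


def count_next_hour(times: pd.Series) -> pd.Series:
    """For each click, count later clicks of the same group within one hour."""
    t = times.to_numpy()  # already sorted within the group
    end = np.searchsorted(t, t + np.timedelta64(1, "h"), side="right")
    return pd.Series(end - np.arange(len(t)) - 1, index=times.index)


parts = [count_next_hour(times) for _, times in g]
clicks["n_next_hour"] = pd.concat(parts).sort_index()
```

Future-looking features like these are only legitimate when the test rows come with their own future, as in the competition described above.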
Handling huge data
- When the data is large, validating on just one of the folds is also an option (Kaggle bosch competition retrospective)
Techniques
A collection of techniques I could not categorize well
- DenoisingAutoEncoder
- Add the number of neighbors observed by kNN as a feature (2: 1st place winner's solution of Facebook Predict check-ins)
- Cluster with KMeans or similar, then use the cluster ID and/or the distance to the cluster centers as features (Malware '19 kernel)
- Put the predictions in as a feature and train again (Home Credit)
- Build a model that predicts an important feature (not the target) and make use of it (HomeCredit 2nd)
  - The difference between the predicted and actual values can also be used (Elo 18th: difference between the predicted date and the actual date)
  - (6) covers use cases in detail
- When the main data and sub data are 1:many (as in Home Credit and Elo)
  - If the sub data is a time series, taking the last value of a categorical variable and similar tricks are usable
  - Merging the main data into the sub data and training on the sub data alone can help (Home Credit 17th / Elo 1st)
  - Aggregations over only the first/last N rows are also usable
    - Trying several values of N is even better
- Add the mean target of the 500 nearest neighbors, found by kNN computed on representative features only, as a feature (Home Credit 1st)
- Process the features so that Adversarial Validation cannot tell Train from Test, to avoid overfitting to Train (Malware '19 6th)
- Feature extraction using K nearest neighbors
- Feature generation using FM
- Kaggle Past Solutions
  - Should be a treasure trove if you dig through it properly
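The "mean target of the nearest neighbors" feature can be sketched with brute-force NumPy. This is an assumption-heavy toy: the function name is invented, distances are plain Euclidean, and the full distance matrix is held in memory. A real run over 500 neighbors would want a proper index and out-of-fold computation on the training rows.

```python
import numpy as np


def knn_target_mean(X_train, y_train, X_query, k, exclude_self=False):
    """Mean target of the k nearest training rows for every query row."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    if exclude_self:
        # Only meaningful when X_query is X_train: a row must not see itself.
        np.fill_diagonal(d, np.inf)
    idx = np.argsort(d, axis=1)[:, :k]
    return y_train[idx].mean(axis=1)


# Hypothetical one-feature data with two well separated groups.
X_train = np.array([[0.0], [1.0], [10.0], [11.0]])
y_train = np.array([0.0, 0.0, 1.0, 1.0])
X_test = np.array([[0.5], [10.5]])

test_feat = knn_target_mean(X_train, y_train, X_test, k=2)
train_feat = knn_target_mean(X_train, y_train, X_train, k=2, exclude_self=True)
```

Excluding each row's own target on the training side is the minimum needed to keep the feature from leaking the label.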
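Feature selection
- A stepwise procedure that checks variables one at a time with Ridge or similar is fast (5)
  - Doing it with LightGBM or the like is better if you can, but you may decide based on the computation time.
- boruta, eli5, NullImportances
  - (None of these has worked for me so far... orz)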
Missing values
Finding them & fillna approaches
- Finding them: draw a histogram
- Fill NA approaches
  - How to treat missing values depends on whether the variable is categorical or numerical.
  - Replace with a suitable outlier
    - Effective for tree models
    - -99999, MAX value + 1, min value - 1, and so on
  - Replace with the mean or median
    - Effective for linear models
    - Negative for tree models
  - For categorical variables
    - Treating missing as "a new category" is also good (3)
      - If there are K categories, regard missing as a category too and do One Hot Encoding with K+1 columns
  - Guess the value
- Cautions
  - When replacing with -99999 or the like, take care that it does not affect categorical encoding and similar steps
  - It is better not to fillna before feature generation
Tips
- A binary category_x_isnull feature is also effective
- What if a category that never appears in Train shows up in Test?
  - Count the occurrences across Train and Test and use the count as the category
- XGBoost/LightGBM can handle NaN directly
- Count the number of explanatory variables whose value is 0 and add it as an extra explanatory variable (2)
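The fill strategies above, side by side on toy data. The variable names are made up. The order matters: the isnull flag is taken before any filling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns.
num = pd.Series([1.0, np.nan, 3.0, 0.0], name="x")
cat = pd.Series(["a", None, "b", "a"], name="c")

# Keep the "was missing" signal before filling anything.
is_null = num.isnull().astype(int)

# Tree models: push missing values outside the observed range (min - 1).
filled_for_trees = num.fillna(num.min() - 1)

# Linear models: fill with a central value such as the median.
filled_for_linear = num.fillna(num.median())

# Categorical: treat missing as its own level, giving K + 1 one-hot columns.
cat_filled = cat.fillna("missing")
onehot = pd.get_dummies(cat_filled)
```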
Handling imbalanced data
- Over-/Under-sampling
  - SMOTE often does not work
  - Even with under-sampling, accuracy is fine as long as you bag properly (TalkingData 1st)
  - Try the implementations in imbalanced-learn
    - Usage example: Porto Seguro Kernel
- Changing the weights
  - LightGBM: adjusting class_weight, for example
Validation
Good Validation is MORE IMPORTANT than Good Model. (8)
- Mimic the organizer's split in your validation split
  - Once mimicking gives you a good validation: TRUST LOCAL CV
  - Proceeding without a correct validation is like sailing without a compass (you get nowhere)
  - A concrete example of "mimicking": Malware '19 6th
    - As a bonus, it also clarifies the premises behind how to set up validation for time series and the like
- Normal-KFold or Stratified-KFold?
  - For multi-class classification Stratified-KFold is a must (5)
  - Even for regression, you can run k-means on a single variable and stratify on the result (5)
- Adversarial Validation
  - Validating on the Train rows that are close to Test is also an option
- Whether or not to trust the LB (8)
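Stratifying a regression problem needs some discretization of a continuous variable. The text mentions k-means on one variable. The sketch below swaps in a simpler stand-in, assuming the variable is the target: sort by the target and deal consecutive rows out to different folds. The function name is hypothetical.

```python
import numpy as np


def stratified_folds_for_regression(y, n_splits=5, seed=0):
    """Assign fold ids so that every fold covers the whole range of y."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = np.argsort(y, kind="stable")
    fold = np.empty(len(y), dtype=int)
    # Each block of n_splits consecutive (sorted) rows is spread over all folds.
    for start in range(0, len(y), n_splits):
        block = order[start:start + n_splits]
        fold[block] = rng.permutation(n_splits)[:len(block)]
    return fold


# Hypothetical target: 20 evenly spaced values in reverse order.
y = np.arange(20.0)[::-1]
fold = stratified_folds_for_regression(y, n_splits=5)
fold_sizes = np.bincount(fold, minlength=5)
fold_means = np.array([y[fold == k].mean() for k in range(5)])
```

With scikit-learn available, binning the target and passing the bins to `StratifiedKFold` would achieve much the same thing.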
Ensemble / Stacking
- The KAGGLE ENSEMBLING GUIDE is a useful reference
- Early on, hold back from easy score gains through Ensemble/Stacking (2)
  - While you still have time and energy, explore original approaches that most other people will not try
  - Approaches that everyone else uses can be adopted later
- Just blending models with different parameters raises the score
  - random seed average is by now common sense (4)
- Averaging is a simple but powerful technique (2)
  - Power means, log means, and the like may be worth trying (2)
  - For small data, an average ensemble is enough
- Do not obsess over single-model accuracy
  - Ensembling is a given, so build diverse (low-correlation) models and features
  - Even a weak model can come alive in an ensemble, so keep it just in case
  - Otto example: from around p. 57
  - Owen (8):
    The strongest individual model does not necessarily make the best blend.
  - That said, the view that "A great model is better than ensemble of weak models" has been making a comeback recently
    - CPMP introduced this at Kaggle Days Paris as something bestfitting had said
    - CPMP reportedly beat the ensemble models of the large majority of teams in TalkingData with a single LightGBM model using 48 features
- Models that pair well with GBM
  - RandomForest (2)
  - Neural Network (reason: tree-based algorithms produce rectangular decision boundaries parallel to the feature axes, whereas NNs and the like produce curves (curved surfaces) (2))
  - Glmnet (Ridge and LASSO) (8)
- Mixing shallow and deep models can be effective
- Decide the ensemble weights with emphasis on the data that Adversarial Validation misclassified (6)
- There are no examples where ensembling LightGBM with XGBoost raised accuracy much (7)
- Stacking implementation example: Porto Seguro Kernel
- When the metric is AUC, convert predictions to ranks before ensembling (KAGGLE ENSEMBLING GUIDE)
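Rank averaging and the alternative means mentioned above fit in a few lines. The helper names are invented. Reading "log mean" as the geometric mean (averaging in log space) is an assumption.

```python
import numpy as np
import pandas as pd


def rank_average(preds):
    """Average the normalized ranks of several prediction vectors."""
    ranks = [pd.Series(p).rank(method="average") / len(p) for p in preds]
    return np.mean(ranks, axis=0)


def power_mean(preds, n):
    """Element-wise N-th power mean across models."""
    return np.mean(np.power(preds, n), axis=0) ** (1.0 / n)


def geometric_mean(preds):
    """Element-wise mean in log space (requires positive predictions)."""
    return np.exp(np.mean(np.log(preds), axis=0))


# Two hypothetical models with the same ordering but different calibration.
model_a = [0.10, 0.20, 0.90]
model_b = [0.80, 0.85, 0.99]
blended = rank_average([model_a, model_b])
```

AUC depends only on the ordering, so ranking first stops the model with the wider score range from dominating the blend.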
Various models
- Regularized Greedy Forest: somewhat slow, but worth trying
  - There is also a (relatively) fast implementation: FastRGF
- CatBoost: the default parameters are good, so this one saves time too
- tSNE
  - Powerful, but tuning the perplexity is very hard
  - sklearn's TSNE is painfully slow, so the tsne package is recommended
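Training tips
- You do not have to be bound by the evaluation metric
- (For time series and the like) If there is a characteristic trend, narrowing the training period is also an option (7)
- XGBoost odds and ends (7)
  - For a given dataset, the optimal number of boosting rounds is roughly linear in the number of samples, so it can be estimated (just note that the intercept is not 0)
  - A single model loses accuracy with either too few or too many rounds, but random seed averaging loses little accuracy even with somewhat too many rounds
  - The larger max_depth is, the greater the benefit of averaging, so if accuracies are close in single-model validation, choosing the deeper one is better
  - A smaller learning rate improves accuracy, but that effect shrinks a lot under averaging, so not making it smaller than necessary is better for training time
  - Lowering colsample_by* reduces training time linearly. It often improves accuracy as well, so smaller values are worth trying.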
Tuning
- This article explains what the parameters mean very clearly
- XGBoost
  - Starting from around max_depth=7 is good
  - LightGBM can control the number of leaves of the whole tree in addition to the depth, which gives useful flexibility
  - bagging_fraction: how much of the data may be used to grow one tree. Green (if overfitting, lower it)
  - feature_fraction: how many of the features may be used to grow one tree. Green
  - min_data_in_leaf: the most important parameter for generalization. Red (if overfitting, raise it)
  - lambda_l1, lambda_l2: quite important. Try values around 0, 5, 15, 300
  - num_round: = how many trees to build. Sometimes the first few trees already explain enough.
  - learning_rate: if too high, training may not converge at all. A suitably small learning rate helps generalization.
    - trick: doing the following usually raises accuracy (though training takes longer)
      - step * alpha
      - eta / alpha
  - boosting_type: dart, which imitates dropout, is also effective
  - Owen's way (8)
  - CPMP's way (Kaggle Days Paris)
    - Start from subsample=0.7 with everything else at defaults
    - min_child_weight: increase it if the train/val gap is large
    - Then adjust max_depth or the number of leaves
    - If the LB is lower than the CV, strengthen the regularization
- LightGBM
  - First of all, the official Parameters Tuning guide
  - A largish num_leaves with a fairly small feature_fraction seems to work well (5)
  - feature_fraction = sqrt(n_features)/n_features or thereabouts is less sensitive to the number of features, which is nice (5)
  - With dart, early_stopping behaves oddly, so be careful
- Random Forest
  - Trees deeper than in gbm are fine, so give it a try
  - gini is usually good, but entropy occasionally wins
- There is also the approach of tuning twice, once early and once in the middle to late stage of the competition (Malware '19 Discussion)
- Optuna can pause and resume training, among other things, and is handy in many ways
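Two of the rules of thumb above are pure arithmetic and easy to encode. The parameter names are real LightGBM parameters, but every value in `base_params` is an illustrative assumption, not a recommendation from the sources. The helper names are made up.

```python
import math


def suggested_feature_fraction(n_features):
    """sqrt(n_features) / n_features, the heuristic quoted above."""
    return math.sqrt(n_features) / n_features


def slow_down(params, num_round, alpha):
    """The 'trick': alpha times more rounds at 1/alpha of the learning rate."""
    scaled = dict(params)
    scaled["learning_rate"] = params["learning_rate"] / alpha
    return scaled, int(round(num_round * alpha))


# Hypothetical starting point for a dataset with 100 features.
base_params = {
    "num_leaves": 127,
    "feature_fraction": suggested_feature_fraction(100),
    "bagging_fraction": 0.7,
    "bagging_freq": 1,       # bagging_fraction only takes effect when this is > 0
    "min_data_in_leaf": 50,
    "lambda_l1": 0.0,
    "lambda_l2": 5.0,
    "learning_rate": 0.1,
}

params, num_round = slow_down(base_params, num_round=500, alpha=5)
```

The resulting dictionary would be passed to the training call together with `num_round`. That step is omitted so the sketch runs without LightGBM installed.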
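Other tips
- semi-supervised learning on the Test data is currently one of the main differentiators (2)
- An extremely strong kernel can appear right before the deadline, so you need to be ready for it (5)
  - Especially in competitions you joined late
  - Keeping 2 submissions for the final day, so that you can ensemble the kernel with your own solution, gives peace of mind
    - Keeping only 1 is mentally tough, so 2 is said to be the recommendation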
Black magic
Things I have not used yet but that look useful
- Linear Quiz Blending
- Feature generation by genetic programming
  - Usage example: HomeCredit 2nd
  - Porto Seguro Kernel
Unsorted: useful libraries and such
To be organized later (maybe)
- pandas-profiling
- Copy-and-paste ready: small tricks for making Kaggle experiments more efficient
- Jupyter notebook tips to boost your productivity
- Jupyter Notebook Viewer
- How to use kaggle-api, the official Kaggle API
- 7 recommended JupyterLab extensions
- Introduction to Vaex / visualization, XGBoost
- A script to use when a Preemptive Instance shuts down
- Gently improving a pandas.DataFrame for loop for a 300x speedup
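In closing
As I wrote at the top, this post only covers material that I personally understand (or think I understand). There is still a lot I have not digested, so I will keep updating it.
Also, when you get tired of experimenting, reading a collection of Thomas Edison quotes can occasionally cheer you up. Use it for reference if you like (lol).
I have not failed. I've just found 10,000 ways that won't work.
Our greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time.
Nearly every man who develops an idea works it up to the point where it looks impossible, and then he gets discouraged. That's not the place to become discouraged.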