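Kaggle Data Science Bowl 2019 Participation Report: The Time I Dreamed of $100,000
What is this?
- This is a record of my participation in the Data Science Bowl 2019 competition, held on Kaggle from 10/24 to 1/23.
- The task was to estimate, from the log data of an educational app for children, how accurately the children would be able to solve the assessments.
- It was a lavish competition with a $100,000 first prize, and when I climbed as high as 5th place as a solo participant I got to enjoy quite a nice dream.
- However, because of the nature of the evaluation metric and the small amount of data used to compute the public LB (provisional ranking), the provisional ranking (public LB) and the final ranking (private LB) were shuffled heavily.
- After flailing around at the mercy of the metric, I dropped from public 17th to private 56th. It was a disappointing result, but I am leaving notes on what I did, partly as self-reflection.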
A tweet from when I was still dreaming
A submission I threw in as a throwaway has somehow taken me this far... pic.twitter.com/2zudPLf4ut
ML_Bear (@MLBear2), January 11, 2020
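What I did
Before joining the competition
- I won my first gold medal in the IEEE competition last October, but troubles such as the deadline being extended right before the end left me thoroughly worn out.
- So I decided I was done with Kaggle for a while and took a short break, but the offline competition at Kaggle Days Tokyo in December was so much fun that my appetite for tabular competitions came back. (Kaggle Days Tokyo offline competition report)
- I had free time over the New Year holidays, so I skimmed the Kernels and Discussions. None of the following ideas had been mentioned, so I figured that doing just these might get me reasonably far and decided to join.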
  - Use the data in the test set whose AccuracyGroup can be determined as training data
  - Use models trained on different targets
    - accuracy itself
    - whether the first attempt is correct
    - how many 4100 (4110) events occur
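To make the alternative targets above concrete, here is a minimal sketch of how they could be derived from the submission events of a single assessment session. The function name `alternative_targets` and its inputs are hypothetical and not taken from the original code. The AccuracyGroup rule follows the competition's definition: 3 = solved on the first attempt, 2 = second attempt, 1 = three or more attempts, 0 = never solved.

```python
def alternative_targets(num_correct: int, num_incorrect: int) -> dict:
    """Derive several training targets from one assessment session.

    num_correct / num_incorrect count the correct and incorrect submission
    events (event code 4100, or 4110 for some titles) in the session.
    This is an illustrative helper, not the author's implementation.
    """
    attempts = num_correct + num_incorrect
    accuracy = num_correct / attempts if attempts else 0.0

    if num_correct == 0:
        group = 0  # never solved
    elif num_incorrect == 0:
        group = 3  # solved on the first attempt
    elif num_incorrect == 1:
        group = 2  # solved on the second attempt
    else:
        group = 1  # solved after three or more attempts

    return {
        "accuracy_group": group,               # the competition label
        "accuracy": accuracy,                  # mean of correct
        "first_try_correct": int(group == 3),  # binary target
        "n_attempts": attempts,                # number of submission events
    }
```

Each of the last three values could serve as the target of a separate model whose predictions are then fed back in as features.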
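Note: I also considered QWK a metric that swings a lot, so it goes without saying that I was quietly hoping a shake-up might hand me a solo gold, lol.
A tweet from back when I thought I might luck into a solo gold. Hard to believe it was only three weeks ago.
I don't know how much time it will take, but I'm free over the New Year holidays, so I've started DSB.
This is my first code competition, and managing code and features looks pretty painful. I might spend the whole time just getting used to it, lol.
ML_Bear (@MLBear2), January 2, 2020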
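Right after joining
- First I skimmed the data, then built basic features on top of a public Kernel.
  - Base kernel
  - Naturally, I dropped features such as those that take a SUM over the entire installation_id. (They may already be gone in the latest version.)
  - I did not really understand the adjust_factor part either, so I ignored it.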
Building the CV
- Compared with test, train contained a noticeable number of users with far more plays, and I thought these had to be removed.
For that reason, I used Adversarial Validation to identify roughly the top 30% of logs that looked most unlike test, and used them as follows.
- Use in QWK threshold optimization
  - From the train data that did not deviate from test, I sampled about 500 times to match the test distribution, built a group of evaluation datasets, and optimized the QWK thresholds on them.
  - I picked 500 arbitrarily, but my impression is that many people sampled in a similar way.
  - I did not know how far apart the train and test distributions were, and the threshold optimizer seemed unstable, so I wanted to average things out.
- Use in model evaluation
  - Using the same group of datasets, I checked model accuracy by taking the mean RMSE.
- Exclusion from early_stopping
  - I removed these logs from the validation set during training so that they would not influence early_stopping.
  - I also tested removing them from the training set, but that performed much worse, so I kept them for training.
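I decided on the policies above by experimenting locally on a rough hold-out split of train.
- It is questionable whether this reproduced test well, but I trusted it would be better than doing nothing.
- I had learned this from what a teammate did in the IEEE competition, so I followed that example.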
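The CV scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the adversarial classifier itself is omitted and its output (`adv_scores`, the estimated probability that a row comes from train) is taken as given, "matching the test distribution" is approximated by drawing one assessment per installation_id, and all function and key names are hypothetical.

```python
import math
import random
from collections import defaultdict


def flag_test_like(adv_scores, drop_fraction=0.3):
    """Mark the rows that do NOT fall in the most train-like fraction.

    adv_scores: assumed output of an adversarial classifier, P(row is train).
    Returns one boolean per row (True = looks like test, keep for evaluation).
    """
    n_drop = int(len(adv_scores) * drop_fraction)
    order = sorted(range(len(adv_scores)), key=lambda i: adv_scores[i], reverse=True)
    dropped = set(order[:n_drop])
    return [i not in dropped for i in range(len(adv_scores))]


def sample_eval_sets(rows, n_sets=500, seed=0):
    """Build n_sets evaluation sets with one random test-like row per user.

    rows: dicts with keys installation_id, y_true, y_pred, is_test_like.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        if row["is_test_like"]:
            by_user[row["installation_id"]].append(row)
    users = sorted(by_user)
    return [[rng.choice(by_user[u]) for u in users] for _ in range(n_sets)]


def rmse(rows):
    """Root mean squared error of y_pred against y_true for one evaluation set."""
    return math.sqrt(sum((r["y_true"] - r["y_pred"]) ** 2 for r in rows) / len(rows))


def mean_rmse(eval_sets):
    """Average the RMSE over all sampled evaluation sets."""
    return sum(rmse(s) for s in eval_sets) / len(eval_sets)
```

The same sampled sets could be reused for threshold optimization by running the optimizer on each set and averaging the resulting thresholds.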
Data Augmentation
- I used the data in the test set whose AccuracyGroup can be determined as training data.
- My local scores improved across the board, but the public LB score dropped considerably.
- So I removed it at the very last moment, which cost about 0.004. I should have kept it in one of my two final submissions.
Private Dataset Probing
- I did a little probing because I wanted to know how many users had never taken a single Assessment.
- There were considerably more of them than in public, so I decided not to rely much on public. (In the end, I relied on it anyway.)
Feature Engineering
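With the work below, plus what I took from the base kernel, I ended up with about 1,150 features. That said, I effectively spent only three days on this, so I wish I had done more here.
- I added features to the base Kernel that I expected to be effective.
  - Past results and event counts for the same title
  - The same for the same world
  - More detailed evaluation of the games played along the way
  - correct rate per task
  - Aggregates of 4020/4025-type events
  - Aggregates of correct-type fields
  - clip length
- I also added several features of my own.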
  - target encoding
    - title (roughly, the difficulty of the title)
    - title x attempt number
  - Feeding the predictions of separate models back in as features
    - What I predicted
      - accuracy (mean of correct) itself
      - how many 4100 (4110) events occur
      - whether the first attempt is correct
    - I used these as features, but it might also have been good to use them as one of the models in the final Stacking ensemble.
    - Apparently keeeeei79 used them as Stacking models (without any particular normalization).
  - word2vec: treat event_id (+correct) as a word → concatenate per user into a sentence → vectorize event_id → SWEM
    - This worked relatively well.
- Things I dropped
  - PageRank
    - Build a graph of title and event_id transitions → propagate the AccuracyGroup of Assessments to compute the importance of each title and so on → aggregate per user
    - These ranked high in feature importances but left the (public LB) score almost unchanged, so I dropped them to keep the code from getting messy.
  - LDA
    - Apply LDA to title and event_id transitions → aggregate per user
    - This one also had low importance.
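As an illustration of the final SWEM step of the word2vec pipeline listed above, here is a minimal sketch that pools the event vectors of one user's sequence into a fixed-length feature vector. It assumes the embeddings have already been trained (for example with a word2vec implementation) and are available as a plain dict. The function name `swem_features`, the token format, and the choice of mean plus max pooling are assumptions, not details confirmed by the original.

```python
def swem_features(event_tokens, embeddings, dim):
    """Pool the embedding vectors of one user's event sequence.

    event_tokens: the user's "sentence", e.g. ["<event_id>_<correct>", ...]
    embeddings:   dict mapping token -> list of floats of length dim
    Returns mean pooling followed by max pooling (2 * dim values).
    Tokens without an embedding are skipped.
    """
    vectors = [embeddings[t] for t in event_tokens if t in embeddings]
    if not vectors:
        # No known token: fall back to an all-zero feature vector.
        return [0.0] * (2 * dim)
    mean_pool = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    max_pool = [max(v[i] for v in vectors) for i in range(dim)]
    return mean_pool + max_pool


# Usage with toy two-dimensional embeddings (hypothetical tokens).
toy_embeddings = {"ev1_1": [1.0, 0.0], "ev2_0": [3.0, 2.0]}
user_features = swem_features(["ev1_1", "ev2_0", "unseen"], toy_embeddings, dim=2)
```

Because the pooled vector has the same length for every user regardless of how many events they logged, it can be appended directly to the other tabular features.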
Feature Selection
- I cut about 600 features with Null Importances and ended up with about 550.
  - I used the formula on p. 18 of senkin-san's slides from Kaggle Days Tokyo.
  - Keeping features down to those with a very slightly negative gain_score turned out to be just right.
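The formula from the slides is not reproduced in this post, so the sketch below uses a widely shared variant of the null importance score as a stand-in: the log ratio of the actual importance to the 75th percentile of the importances obtained with shuffled targets. The functions `gain_score` and `select_features` and the threshold value are assumptions for illustration only.

```python
import math
import statistics


def gain_score(actual_importance, null_importances):
    """Log ratio of the real importance to the 75th percentile of the null runs.

    null_importances: importances of the same feature from models trained
    on shuffled targets (at least two values are required).
    """
    q75 = statistics.quantiles(null_importances, n=4, method="inclusive")[2]
    return math.log(1e-10 + actual_importance / (1.0 + q75))


def select_features(actual, null, threshold=-0.1):
    """Keep features whose gain_score exceeds a (slightly negative) threshold.

    actual: dict feature -> importance from the model trained on real targets
    null:   dict feature -> list of importances from shuffled-target models
    """
    return [name for name in actual if gain_score(actual[name], null[name]) > threshold]
```

A threshold just below zero keeps features that are roughly on par with their null distribution, which matches the observation above that including slightly negative scores worked best.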
QWK threshold optimization
- I did the same thing as the kernel.
- I tried optimizing per title several times, but it never worked out.
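For readers unfamiliar with the setup, the sketch below shows what QWK threshold optimization involves: a regression output is cut into the four accuracy groups by three thresholds, and the thresholds are tuned to maximize quadratic weighted kappa. Note that this is not the kernel's optimizer. A simple coordinate-wise grid search is swapped in here so the example has no dependencies, and all function names are hypothetical.

```python
import bisect


def quadratic_weighted_kappa(y_true, y_pred, n_classes=4):
    """Quadratic weighted kappa for integer labels in range(n_classes)."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_true[i] * hist_pred[j] / n
    # den is zero only when truth and prediction are the same single class.
    return 1.0 if den == 0 else 1.0 - num / den


def apply_thresholds(scores, thresholds):
    """Map continuous scores to classes 0..3 using three sorted thresholds."""
    return [bisect.bisect_right(thresholds, s) for s in scores]


def optimize_thresholds(scores, y_true, init=(0.5, 1.5, 2.5), n_grid=60, n_rounds=3):
    """Coordinate-wise grid search over [0, 3] for each threshold in turn."""
    best = list(init)
    best_score = quadratic_weighted_kappa(y_true, apply_thresholds(scores, best))
    grid = [3 * i / n_grid for i in range(n_grid + 1)]
    for _ in range(n_rounds):
        for k in range(3):
            for candidate in grid:
                trial = best.copy()
                trial[k] = candidate
                if trial != sorted(trial):
                    continue  # thresholds must stay in ascending order
                score = quadratic_weighted_kappa(y_true, apply_thresholds(scores, trial))
                if score > best_score:
                    best, best_score = trial, score
    return best, best_score
```

Because the objective is piecewise constant in the thresholds, any optimizer can land on quite different values for similar data, which is one reason averaging thresholds over many sampled sets is attractive.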
Models
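- I built the following models, all on the same features (normalized for everything except LightGBM).
  - LightGBM x3 (many leaves / normal / few leaves)
    - Parameters were tuned with optuna.
  - NN
    - Same as the kernel I referred to for feature creation.
    - As an aside, its BaseModel was quite cleanly written, so I used it as a reference for my implementation.
  - Random Forest
    - depth=6 or so, with otherwise arbitrary parameters.
    - I was surprised that its accuracy was only slightly behind LightGBM.
    - Apparently Jack dropped LightGBM and chose XGBoost, so maybe LightGBM was not a great fit?
  - Ridge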
- I was surprised that the models other than LightGBM were unexpectedly strong.
- I applied Seed Averaging to every model and then ensembled them.
Ensemble
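- I was torn between Stacking and WeightedAverage. Stacking was stronger in my local experiments, but I went with WeightedAverage in the end for the two reasons below. (This cost about 0.002.)
  - WeightedAverage also scored better on the public LB.
  - I was afraid of a catastrophic failure if train and private turned out to differ.
- The WeightedAverage weights were searched with optuna.
- The Stacking I had in mind was two-stage.
  - LightGBM x3 + NN + Ridge + RF → Ridge
  - Looking at the 4th place solution, they did it in three stages, which surprised me. (I will try that someday.)
- I also tried the rank transformation that appeared in petfinder solutions, but it (probably) made little difference.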
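The weight search for a WeightedAverage ensemble can be sketched as below. The original used optuna for this. To keep the example dependency-free, a plain random search over normalized weights is swapped in, and RMSE on out-of-fold predictions is assumed as the objective. The function names and the objective are assumptions, not the author's code.

```python
import math
import random


def blend(preds_by_model, weights):
    """Weighted average of per-model predictions (lists of equal length)."""
    names = sorted(preds_by_model)
    n_rows = len(preds_by_model[names[0]])
    return [sum(weights[m] * preds_by_model[m][i] for m in names) for i in range(n_rows)]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def search_weights(preds_by_model, y_true, n_trials=200, seed=0):
    """Random search for non-negative weights that sum to 1 and minimize RMSE.

    Starts from equal weights and keeps the best candidate seen so far.
    """
    rng = random.Random(seed)
    names = sorted(preds_by_model)
    best_weights = {m: 1.0 / len(names) for m in names}
    best_score = rmse(y_true, blend(preds_by_model, best_weights))
    for _ in range(n_trials):
        raw = [rng.random() + 1e-12 for _ in names]  # small offset avoids a zero total
        total = sum(raw)
        weights = {m: r / total for m, r in zip(names, raw)}
        score = rmse(y_true, blend(preds_by_model, weights))
        if score < best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```

Compared with Stacking, this has only one free parameter per model, which limits how badly it can overfit if train and private differ.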
Final result
- public: 17th (0.570) → private: 56th (0.551) (3500 teams)
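Thoughts on the competition
- Shaking down was exhausting
  - At one point, by sheer luck, I climbed to 5th place.
  - It was a throwaway submission, so I knew it was a fluke, but I could not help holding on to dreams and hopes.
  - In the end I was blown off the first page of the leaderboard, which wore me out.
- Trusting the public LB over myself was exhausting
  - Right after the competition started I swore to myself not to care about the public LB, but near the end, when I was struggling with decisions, I wanted something to lean on and ended up looking at it anyway. As a result, I bent several of my principles, which cost about 0.006. (In hindsight, that is.)
  - It might be better to write down a policy and pin it up while I still have breathing room in the middle of a competition.
  - Being thrown around also made my feature engineering sloppy, which hurt, but above all I felt very sad, asking myself why I could not trust the me who had worked so hard.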
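- Struggling so much with QWK was exhausting
  - I tried to hack or stabilize QWK in various ways, but most of that effort was wasted.
  - Still, I was glad that I found on my own techniques that, as it turned out, everyone was using, such as sampling the oof predictions and averaging to get the thresholds.
  - Apparently Jack optimized QWK directly, which blew my mind.
- I got used to code competitions, but it was exhausting
  - Until now I had been the guy who just throws SQL at BQ, so I am glad my pandas skills improved.
  - Late in the competition I learned to iterate with FastSubmission, and my PDCA cycle became dramatically faster. GCP's big instances really are the best.
  - I also wrote proper tests, which caught mistakes in several submissions and saved me.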
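The fate of someone who could not trust themselves
Looking back over my submission history, the two points below are where I was misled by the public LB and went against my local test results and my own convictions. Without them I would have scored around 0.556+, so gold was apparently out of reach anyway, but it is very frustrating because it exposed my weakness of will.
・Dropped Data Augmentation (-0.004)
・Chose WeightedAverage instead of Stacking for the Ensemble (-0.002)
ML_Bear (@MLBear2), January 23, 2020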
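Summary
All three of my takeaways ended up being "exhausting", lol. Including the various troubles, this three-week challenge was simply exhausting, but I suppose it was fun.
A Walmart competition apparently starts in March, so I hope to study and prepare until then and spend more exhausting days again.
Reading this back calmly, these are "what am I even saying" kind of thoughts, lol.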