Practical Introduction to Machine Learning with Hivemall (Part 1: Forecasting Drugstore Sales)
This series introduces practical machine learning with Hivemall, a machine learning library available in the Treasure Data environment. We work through practical problems taken from Kaggle, the data science competition site where data scientists from around the world compete.
1. Introduction
Part 1 uses the Rossmann Store Sales competition, a retail sales forecasting task, as its problem. For the algorithm, we use Random Forest regression, an ensemble learning method built on decision trees*1.
Rossmann is a drugstore chain operating more than 3,000 stores in 7 European countries. Each store manager is tasked with forecasting the store's sales up to six weeks ahead. Store sales are influenced by many factors, including promotions, competition, school and public holidays, seasonality, and locality.
In the Rossmann Store Sales competition, the goal is to predict daily sales over a six-week period for the 1,115 stores Rossmann operates in Germany, and data is provided for that purpose. The training data covers each store from January 1, 2013 to July 31, 2015, and the test data covers August 1, 2015 to September 17, 2015.
Let's now walk through how the data is actually handled on Treasure Data.
2. Preparing the Data
2.1 Loading the Data
The Rossmann Store Sales task provides training data (train.csv), test data (test.csv), and the store information (store.csv) used with both, all in CSV format. Store information is linked to the training and test data by Store ID.
First, we load the data into Treasure Data through its GUI. Tables can be created from the provided CSV files simply by drag and drop. Treasure Data's file import creates the table definition automatically from the header in the first line of the CSV and the actual data types*2.
The data consists of the following files.
Training data (train.csv)
Test data (test.csv)
Store information (store.csv)
Each data field is described below.
Column | Type | Description |
Id | Test only | Assigned only in the test data; differs for each store and day |
Sales | Target variable / quantitative | Sales of a given store on a single day (the prediction target) |
Store | Categorical | Unique Id assigned to each store |
Customers | Quantitative | Number of customers who visited the store on a given day*3 |
Open | Categorical | 0 = closed, 1 = open |
StateHoliday | Categorical | a = public holiday, b = Easter holiday, c = Christmas, 0 = None |
SchoolHoliday | Categorical | Whether the store's business day was affected by public school closures |
StoreType | Categorical | Store format, one of four types (a, b, c, d) |
Assortment | Categorical | Assortment level (a = basic, b = extra, c = extended) |
CompetitionDistance | Quantitative | Distance to the nearest competitor store |
CompetitionOpenSince[Month/Year] | Categorical | Month/year when the nearest competitor store opened |
Promo | Categorical | Whether the store is running a promotion |
Promo2 | Categorical | A promotion spanning multiple stores: 0 = not participating, 1 = participating |
Promo2Since[Year/Week] | Categorical | Year/week when the store started Promo2 |
PromoInterval | Categorical | When Promo2 starts; for example, "Feb,May,Aug,Nov" means a new promotion round starts in February, May, August, and November |
2.2 Visualizing Trends in the Data
For reference, let's look at sales for Store ID=1 over the training period. We visualize it with this script, using TD-pandas and Jupyter Notebook*4.
Visualizing the data reveals the periodicity of sales and the sales spikes. It also suggests that features capturing this periodicity (day of week, weekend, weekday, and so on) would be useful.
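As a rough illustration of such periodicity features, the sketch below derives day-of-week and weekend flags from a yyyy-mm-dd date string using only Python's standard library. The function name `calendar_features` and the output field names are made up for this example; they are not part of the article's queries.

```python
from datetime import date


def calendar_features(date_str):
    """Derive simple periodicity features from a yyyy-mm-dd date string."""
    d = date.fromisoformat(date_str)
    day_of_week = d.isoweekday()  # 1 = Monday ... 7 = Sunday
    return {
        "day_of_week": day_of_week,
        "is_weekend": int(day_of_week >= 6),  # Saturday or Sunday
        "is_weekday": int(day_of_week <= 5),  # Monday through Friday
    }


# Example with the date format used in the data set
print(calendar_features("2014-05-25"))
```

Features like these give a tree-based model a direct handle on weekly cycles that a raw date string does not expose.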
2.3 Preprocessing the Data
2.3.1 Joining the Data
First, we join the store information (which includes promotions, competing stores, and so on) to the training data. Treasure Data runs this kind of join efficiently through distributed processing.
Because dates are given in yyyy-mm-dd format (for example, 2014-05-25), we extract the year, month, and day as substrings. Also, cases with zero sales, such as days when a store is closed, are excluded from evaluation, so we remove them from the training data beforehand as well.
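The logic of this step (join on the store ID, split the date, drop zero-sales rows) can be sketched in plain Python. This is only an illustration of the idea, not the article's Hive query: the function name `build_training_rows`, the lowercase field names, and the toy rows are assumptions made for the example.

```python
def build_training_rows(train_rows, store_rows):
    """Join store info onto training rows, split the date, drop zero-sales rows."""
    stores_by_id = {s["store"]: s for s in store_rows}
    result = []
    for row in train_rows:
        if row["sales"] == 0:
            continue  # zero-sales days are excluded from evaluation, so skip them
        store_info = stores_by_id.get(row["store"], {})
        merged = {**store_info, **row}
        # Dates are yyyy-mm-dd, so year, month, and day are fixed-position substrings
        merged["year"] = int(row["date"][0:4])
        merged["month"] = int(row["date"][5:7])
        merged["day"] = int(row["date"][8:10])
        result.append(merged)
    return result


# Toy rows for illustration only (not taken from the real CSV files)
train = [
    {"store": 1, "date": "2014-05-25", "sales": 0},
    {"store": 1, "date": "2014-05-26", "sales": 5000},
    {"store": 2, "date": "2014-05-26", "sales": 7200},
]
stores = [
    {"store": 1, "storetype": "c", "competitiondistance": 1270},
    {"store": 2, "storetype": "a", "competitiondistance": 570},
]

training2 = build_training_rows(train, stores)
for r in training2:
    print(r)
```

On Treasure Data the same steps are expressed as a single query, so the join and the filter run distributed rather than row by row as here.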
The training2 table created by the Hive query for this step looks as follows.
2.3.2 Generating Feature Vectors
The training2 table created in the previous step contains non-numeric data. In this step we use a query like the following to convert the non-numeric data into numeric data that RandomForest can handle, and then build feature vectors (arrays of numeric feature values)*5.
The quantify function inside train_quantified passes numeric values through unchanged and assigns numeric IDs to non-numeric values*6. In addition, missing values are filled with 0, and the target variable sales is converted to log scale*7.
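To make the conversion concrete, here is a minimal pure-Python sketch of the same three ideas: numbering non-numeric values, filling missing values with 0, and taking the log of the target. It is not Hivemall's quantify implementation. The helper names `assign_ids` and `to_feature_row` are hypothetical, and numbering values in sorted order starting from 1 is an assumption chosen for illustration.

```python
import math


def assign_ids(values):
    """Number distinct values 1, 2, 3, ... in sorted order (an assumed rule)."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


def to_feature_row(row, id_maps, columns):
    """Build a numeric feature vector and a log-scaled label from one row."""
    features = []
    for col in columns:
        value = row.get(col)
        if value is None:
            features.append(0)  # missing values are filled with 0
        elif col in id_maps:
            features.append(id_maps[col][value])  # non-numeric value -> numeric ID
        else:
            features.append(value)  # numeric values pass through unchanged
    label = math.log(1 + row["sales"])  # corresponds to LN(1 + sales)
    return features, label


# Toy rows for illustration only
rows = [
    {"storetype": "c", "competitiondistance": 1270, "sales": 5000},
    {"storetype": "a", "competitiondistance": None, "sales": 7200},
]
columns = ["storetype", "competitiondistance"]
id_maps = {"storetype": assign_ids(r["storetype"] for r in rows)}

vectors = [to_feature_row(r, id_maps, columns) for r in rows]
for features, label in vectors:
    print(features, label)
```

Numbering values in a fixed order matters because the IDs become part of the feature vector: the same category has to receive the same ID every time the conversion runs.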
The table produced by this conversion looks as follows.
2.3.3 Preparing the Test Data
The test data is converted in the same way. First, we join the store information table to the test data.
The resulting table looks as follows.
Next, we process the test data, taking care that categorical variables are converted to numeric values with the same mapping used for the training data. For example, StoreType has four values: a, b, c, and d. If the training data maps a → 1, b → 2, c → 3, d → 4, it would be a problem if the test data ended up mapping StoreType as b → 1, c → 2, d → 3*8.
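The pitfall is easy to reproduce in a few lines of Python. The sketch below numbers the categories of a toy test set on its own, then reuses the mapping built from the training set instead. Reusing the training mapping is one simple way to keep the IDs aligned and is shown here as an assumption, not necessarily how the article's query achieves it. The helper `assign_ids` is hypothetical.

```python
def assign_ids(values):
    """Number distinct values 1, 2, 3, ... in sorted order (an assumed rule)."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


train_types = ["a", "b", "c", "d", "a"]
test_types = ["b", "c", "d"]  # no rows with StoreType a

# Numbering the test data on its own shifts every ID by one.
independent = assign_ids(test_types)

# Reusing the mapping built from the training data keeps the IDs aligned.
train_map = assign_ids(train_types)
encoded_test = [train_map[t] for t in test_types]

print("independent mapping:", independent)
print("training mapping:   ", train_map)
print("encoded test values:", encoded_test)
```

With the independent mapping, a model trained on "b = 2" would read the test value 1 as "a", so every prediction for those rows would use the wrong branch of the trees.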
The final converted test table looks as follows.
3. Training with Random Forest
Using the training data prepared in the previous steps, we train a model with Random Forest. Random Forest is an ensemble learning method that uses multiple decision trees built under different conditions; here we build 100 decision trees.
The training query is as follows. Using UNION ALL, the decision trees are trained in 5 parallel tasks*9. In the "-attr" option, specify C for categorical variables and Q for quantitative variables*10. Here only the competitiondistance column is specified as quantitative.
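3.1 Variable Importance
The importance of each explanatory variable can be obtained from the RandomForest training result. Here we visualize it by combining the Treasure Data query results with Jupyter notebook/Pandas*11.
The bar chart shows that the store ID and the distance to the nearest competitor are the variables with the greatest influence on store sales. Sales naturally differ greatly from store to store, so it matches intuition that the store ID is the most dominant explanatory variable. Perhaps stores that are close to a competitor eat into each other's sales*12.
See the following for the Jupyter notebook used for the visualization.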
4. Prediction
We now make predictions with the model built above. Because the target variable was transformed as LN(1 + t1.sales) during training, the prediction is transformed back to the original scale with EXP(predicted) - 1.
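The scale conversion and its inverse can be checked with a short round trip in Python. The function names `to_log_scale` and `from_log_scale` are made up for this sketch; the point is only that the inverse of ln(1 + x) exponentiates first and subtracts 1 afterwards.

```python
import math


def to_log_scale(sales):
    """Forward transform applied to the target at training time: ln(1 + sales)."""
    return math.log(1 + sales)


def from_log_scale(predicted):
    """Inverse transform applied to predictions: exp(predicted) - 1."""
    return math.exp(predicted) - 1  # subtract 1 after exponentiating


sales = 5000.0
predicted = to_log_scale(sales)   # value on the scale the model is trained on
restored = from_log_scale(predicted)

print(predicted, restored)
```

The `1 +` inside the logarithm keeps the transform defined for a sales value of 0, which is why the inverse needs the matching `- 1`.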
For submission to Kaggle, the prediction results are sorted in ascending order of ID.
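5. Evaluation
The evaluation metric is RMSPE (Root Mean Square Percentage Error), that is,
$$\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}$$
where $y_i$ is the sales of a store on a single day, $\hat{y}_i$ is the corresponding prediction, and $n$ is the number of store-days evaluated.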
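For local checks, RMSPE is simple to compute by hand. The sketch below assumes the standard definition (the square root of the mean squared relative error) and skips days with zero actual sales, which are excluded from evaluation as noted earlier. The function name `rmspe` and the toy numbers are chosen for this example.

```python
import math


def rmspe(actual, predicted):
    """Root Mean Square Percentage Error over days with non-zero actual sales."""
    terms = [
        ((y - y_hat) / y) ** 2
        for y, y_hat in zip(actual, predicted)
        if y != 0  # zero-sales days are excluded from evaluation
    ]
    if not terms:
        raise ValueError("no non-zero actual values to evaluate")
    return math.sqrt(sum(terms) / len(terms))


# Toy values for illustration only
actual = [100.0, 200.0, 0.0]
predicted = [90.0, 220.0, 50.0]
score = rmspe(actual, predicted)
print(score)
```

Because each error is divided by the actual value, a miss of 10 on a day with sales of 100 counts as much as a miss of 20 on a day with sales of 200.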
The code used for cross-validation is here.
5.1. Submitting the Results to Kaggle
Kaggle's Public Leader Board (LB) is evaluated on 39% of the test data, and the Private LB on the remaining 61%. Public LB scores are visible during the competition, while the final evaluation uses the Private LB, which stays hidden until then. This model scored Public LB: 0.12771 and Private LB: 0.13763*13.
As its name suggests, RandomForest uses random numbers, so please note that predictive performance can vary somewhat from run to run*14.
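6. Conclusion
Kaggle is an ideal place to learn the basics of data analysis. The material covered here should also apply to tasks such as sales forecasting for e-commerce and advertising campaigns. Following this article should give you the basics of data analysis and machine learning on Treasure Data, so please give it a try if you are interested.
Treasure Data is also developing a workflow engine called Digdag. By defining this workflow in a YAML format like the following, it can be run in one batch or on a schedule. Data analysis workflows tend to become complicated, and expressing them this way makes them easier to follow.
Coming next: Part 2 of this series will be an introductory article on ad click-through rate estimation using the dataset from Criteo, a well-known company in the display advertising industry.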
*1: We chose Random Forest because it can be expected to give relatively high accuracy on this task without spending time on feature engineering.
*2: However, the stateholiday column was inferred as long because its first few rows contained only digits. Since it actually contains non-numeric data, we manually changed the column type to string.
*3: This column exists only in the training data, so it is not used as an explanatory variable here.
*4: Here we visualized the sales of a single store, but things like the overall average of sales can also be visualized easily with queries.
*5: This corresponds to DictVectorizer in scikit-learn. The equivalent of StringIndexer in Spark MLlib is expressed here as a HiveQL query.
*6: Here, train_ordered orders the input to the quantify function by a fixed rule so that IDs are assigned consistently. Note that categorical columns must either be changed to a string type beforehand or converted with CAST(xxx as string) when passed to the quantify function.
*7: We used cross-validation to decide whether to use sales as the target variable as-is or to take its log. Converting the target to log scale gave better accuracy, so we use log scale.
*8: This can happen, for example, when no row with the value a exists.
*9: Depending on the resource availability of the Hadoop cluster, fewer than 5 tasks may run concurrently.
*10: By default, all variables are treated as quantitative. Scikit-learn also builds decision trees treating all variables as quantitative; if sufficiently deep trees are built, treating variables as quantitative works without problems in practice.
*11: Treasure Data tables and Pandas DataFrames can be converted in both directions, so please make use of this.
*12: Perhaps the same holds for convenience stores in Japan.
*13: The Scikit-learn RandomForest implementation scored Public LB: 0.12490, Private LB: 0.13529, so the results are roughly comparable.
*14: In Hivemall, specifying the "-seed" option makes the behavior deterministic.