(Image by Pixabay)
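I learned via KDnuggets about an article with the provocative title "Top 10 Statistics Mistakes Made by Data Scientists". As the title says, it is a top 10 of the statistical mistakes data scientists tend to make, and it is a fun read full of "yes, that happens all the time" examples.
So this time I'd like to give an abridged translation of the article, stopping short of a full translation, and examine its content. (Many passages were hard to follow when translated literally, so a fair portion is abridged or paraphrased. If you have comments such as "this rendering would be better," please do send them.) It goes without saying, but when I introduce an overseas article it means I have run out of material, so please bear with me...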
- Contents of the original article
- 1. Not fully understand objective function (Not properly understanding what the objective function is)
- 2. Not have a hypothesis why something should work (Having no hypothesis that explains why something should work)
- 3. Not looking at the data before interpreting results (Not looking at the data itself before interpreting results)
- 4. Not having a naive baseline model (Not setting up a suitable baseline model)
- 5. Incorrect out-sample testing (Incorrect out-of-sample testing: improper CV)
- 6. Incorrect out-sample testing: applying preprocessing to full dataset (Improper CV: preprocessing everything in one go before splitting)
- 7. Incorrect out-sample testing: cross-sectional data & panel data (Improper CV: treating cross-sectional data and panel/time series data the same way)
- 8. Not considering which data is available at point of decision (Not considering which data will be usable when the model is actually deployed)
- 9. Subtle Overtraining (Subtle overfitting)
- 10. "need more data" fallacy (The mistaken belief that "we need more data")
- Impressions
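Contents of the original article
So, to begin with, I'll quickly summarize the main points of the original article. Rather than a faithful translation, I have filled in "what the original probably meant to say" at my own discretion, so if you care about the original wording I recommend reading the article at the link above. Note that all figures in the text are embedded directly from the image links of the original article hosted on GitHub.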
1. Not fully understand objective function (Not properly understanding what the objective function is)
Here the article warns that if you build a model without knowing what the target (roughly, the KPI) should be, you end up doing meaningless work. A common case is throwing away a model as "no good" because its "accuracy" is unimpressive, even though it would contribute a great deal to improving a business metric. To avoid that outcome, the advice is to stay focused on the "objective": if you want to improve a business metric, translate it into an appropriate mathematical or statistical objective function.
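To make the gap between "accuracy" and a business metric concrete, here is a minimal sketch. The confusion-matrix counts and the per-case values (`VALUE_PER_TP`, `COST_PER_FP`) are hypothetical numbers I made up for illustration; none of them come from the original article.

```python
# Hypothetical confusion-matrix counts for two classifiers on the same
# 1,000 cases (50 of which are true positives). All numbers are made up.
models = {
    "model_a": {"tp": 5, "fp": 5, "fn": 45, "tn": 945},
    "model_b": {"tp": 40, "fp": 150, "fn": 10, "tn": 800},
}

# Assumed business values: a caught positive earns 100, a false alarm costs 10.
VALUE_PER_TP = 100
COST_PER_FP = 10


def accuracy(c):
    """Share of cases classified correctly."""
    total = c["tp"] + c["fp"] + c["fn"] + c["tn"]
    return (c["tp"] + c["tn"]) / total


def business_value(c):
    """Toy business objective: gains from hits minus the cost of false alarms."""
    return c["tp"] * VALUE_PER_TP - c["fp"] * COST_PER_FP


for name, counts in models.items():
    print(name, f"accuracy={accuracy(counts):.3f}", f"value={business_value(counts)}")

best_by_accuracy = max(models, key=lambda m: accuracy(models[m]))
best_by_value = max(models, key=lambda m: business_value(models[m]))
print("best by accuracy:", best_by_accuracy)
print("best by business value:", best_by_value)
```

Under these assumed numbers the two criteria pick different models, which is exactly the situation where discarding a model on accuracy alone would be a mistake.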
2. Not have a hypothesis why something should work (Having no hypothesis that explains why something should work)
In a word: don't start modeling blindly without properly looking at the data and without any picture of what would fit the data in front of you well. Otherwise you risk an all-around sloppy workflow in which you casually fit several models to the data, casually pick whichever one happened to score best, and casually put it to use.
In the left-hand example of this figure, drawing a scatter plot shows in an instant that a linear model is enough, and in the right-hand example it shows just as quickly that a linear model will not do. Not doing even that much is foolish, the article says.
3. Not looking at the data before interpreting results (Not looking at the data itself before interpreting results)
What the article has in mind here is dealing with outliers and imbalanced data. In the end, these too are things you cannot notice unless you look at the data itself.
This is the same figure again: in the left-hand example, merely adding outliers makes the regression coefficient jump from 0.906 to -0.375. In any case, look at the data itself properly first!
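The effect of a single outlier on a regression slope is easy to reproduce. The sketch below uses hypothetical toy data of my own (ten points on the line y = x plus one extreme point), not the data behind the figure, so it will not reproduce the 0.906 and -0.375 quoted above; it only shows the same kind of sign flip.

```python
def ols_slope(xs, ys):
    """Slope b of the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical toy data: ten points lying exactly on y = x.
xs = [float(i) for i in range(1, 11)]
ys = list(xs)

slope_clean = ols_slope(xs, ys)

# Add one extreme point far below the trend and refit.
slope_with_outlier = ols_slope(xs + [20.0], ys + [-40.0])

print(f"slope without outlier:  {slope_clean:.3f}")
print(f"slope with one outlier: {slope_with_outlier:.3f}")
```

A scatter plot would reveal the extra point immediately, whereas the fitted coefficient alone gives no hint of why the sign changed.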
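4. Not having a naive baseline model (Not setting up a suitable baseline model)
Anyone who does experimental science will immediately think of positive and negative controls. The article says that data scientists who build models in order to make predictions should likewise set up an appropriate baseline. Taking time series modeling as the example, it comments along the lines of: "You report such-and-such CV MSE for linear regression and such-and-such CV MSE for random forest, but have you considered what happens if you simply autoregress on the previous period's value?"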
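A minimal sketch of such naive baselines for a time series follows. The series itself (a linear trend plus a slow cycle) is a hypothetical example I made up, and "persistence" (predicting each value with the previous one) is my reading of "autoregress on the previous period's value"; the original article may have meant a fitted AR(1) model instead.

```python
import math


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical series: linear trend plus a slow cycle (made up for illustration).
series = [0.1 * t + math.sin(t / 5) for t in range(200)]
split = 150
train, test = series[:split], series[split:]

# Baseline 1: persistence, i.e. predict each test value with the value before it.
persistence_pred = series[split - 1:-1]
persistence_mse = mse(test, persistence_pred)

# Baseline 2: always predict the mean of the training period.
train_mean = sum(train) / len(train)
mean_mse = mse(test, [train_mean] * len(test))

print(f"persistence baseline MSE: {persistence_mse:.4f}")
print(f"training-mean baseline MSE: {mean_mse:.4f}")
```

Whatever MSE a more elaborate model reports, it only means something once it is compared against numbers like these.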
5. Incorrect out-sample testing (Incorrect out-of-sample testing: improper CV)
"The model built in R&D was wonderful, but once we pushed it into production it was completely useless..." is a story you still hear often, even now that cutting-edge neural nets are sweeping the field. The point is not to stop at CV inside the training data (train/dev) but to validate performance properly on test data (test or private) that lies outside the training data (train/dev); good CV performance may be nothing more than overfitting. The example given asks: "Suppose random forest has a CV MSE of 0.04 and linear regression a CV MSE of 0.183, but on new test data the RF MSE is 0.261 and the linear regression MSE is 0.187. Which one would you use?"
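The arithmetic behind that question can be laid out in a few lines. The four MSE values below are the ones quoted above; the test-to-CV ratio is just one simple indicator I chose for illustration, not something the original article prescribes.

```python
# MSE values quoted in the example above.
scores = {
    "random_forest": {"cv_mse": 0.04, "test_mse": 0.261},
    "linear_regression": {"cv_mse": 0.183, "test_mse": 0.187},
}

for name, s in scores.items():
    # How much worse the model gets once it leaves the training data.
    ratio = s["test_mse"] / s["cv_mse"]
    print(f"{name}: CV={s['cv_mse']}, test={s['test_mse']}, test/CV={ratio:.2f}")

best_by_cv = min(scores, key=lambda m: scores[m]["cv_mse"])
best_by_test = min(scores, key=lambda m: scores[m]["test_mse"])
print("best by CV MSE:  ", best_by_cv)
print("best by test MSE:", best_by_test)
```

Random forest wins on CV but degrades badly on new data, while linear regression barely changes, which is the signature of overfitting the article is warning about.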
6. Incorrect out-sample testing: applying preprocessing to full dataset (Improper CV: preprocessing everything in one go before splitting)
Put simply, this is about making sure no leakage occurs between train/dev and test. As one example, the article cites preprocessing the whole dataset in one go before splitting it into train/dev vs. test. Preprocessing should be done after the split into train/dev vs. test; doing it before the split by mistake can introduce some form of leakage.
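Standardization is a typical case. In the sketch below (hypothetical data, with two deliberately extreme values in the test portion), the leaky version computes the mean and standard deviation on the full dataset, so test information seeps into the training features; the correct version fits the statistics on the training split only and reuses them for the test split.

```python
import statistics


def fit_standardizer(values):
    """Return the (mean, standard deviation) used for scaling."""
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Scale values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical data; the last two values play the role of the test set.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0, 120.0]
train, test = data[:8], data[8:]

# Leaky: statistics computed BEFORE the split, so they depend on the test set.
leaky_mean, leaky_std = fit_standardizer(data)

# Correct: statistics computed AFTER the split, on the training portion only.
train_mean, train_std = fit_standardizer(train)
train_scaled = standardize(train, train_mean, train_std)
test_scaled = standardize(test, train_mean, train_std)

print(f"mean fitted on full data:  {leaky_mean:.2f}")
print(f"mean fitted on train only: {train_mean:.2f}")
```

The same rule applies to any fitted preprocessing step (imputation, feature selection, encoding), and inside CV it has to be refitted within each fold.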
7. Incorrect out-sample testing: cross-sectional data & panel data (Improper CV: treating cross-sectional data and panel/time series data the same way)
This topic sparked a big debate somewhere a little while ago: cross-validation on time series data is sometimes done with a random split, just as it would be for cross-sectional data. Naturally, time series often contain non-stationary components such as unit-root processes, trends, and seasonality, and even randomly drawn samples remain strongly affected by autocorrelation (serial correlation) with their neighbors in time, so CV with a random split is off limits.
As I have covered on this blog before, CV on time series data is in principle carried out only in the direction from the past toward the future.
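One common way to implement "past to future only" is an expanding-window split. The helper below, `forward_splits`, is a hypothetical function written for this illustration (its name and parameters are my own); libraries such as scikit-learn offer a similar splitter under the name `TimeSeriesSplit`.

```python
def forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    Every training window ends before its test window starts, and the
    training window grows by one fold each time (expanding window).
    """
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


# Example: 10 time-ordered observations, 3 folds, at least 4 training points.
splits = list(forward_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, test_idx in splits:
    print("train:", train_idx, "-> test:", test_idx)
```

No test index ever precedes a training index, so the model is never evaluated on observations older than the ones it was fitted on.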
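8. Not considering which data is available at point of decision (Not considering which data will be usable when the model is actually deployed)
Here the discussion suddenly turns very practical. A familiar story in the data analysis business is that "data with the same tendencies as when the machine learning model was built cannot be obtained when the machine learning system is actually deployed," and that is exactly what this item is about. The remedy the article advocates is to keep obtaining new test data and repeating the validation.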
9. Subtle Overtraining (Subtle overfitting)
To be honest, I found it a little hard to tell what this one was trying to say. That overfitting can progress as the data grows is fair enough, but "what to do when overfitting has progressed as the data increased" strikes me as quite a difficult topic. Perhaps the point is that how you measure the degree of it is what matters. If the CV MSE doubles as a result of adding data, what should you do? If anything, I think whether that MSE is acceptable matters more.
Addendum (June 18, 2019)
"Top 10 Mistakes Data Scientists Tend to Make (Overseas Article Introduction)" - 六本木で働くデータサイエンティストのブログ
Isn't #9 saying "don't keep fiddling obsessively with one single dataset"...?
2019/06/17 20:59
I agree that it can indeed be read that way. Thank you for pointing it out m(_ _)m
10. "need more data" fallacy (The mistaken belief that "we need more data")
To people versed in classical statistics, with its emphasis on sampling, this is actually obvious, but counter to intuition, data analysis often works like this: a small amount of data is fine, arguably better than a large amount, as long as you have a sample that properly represents the (in practice unobservable) population. And of course, the less data there is, the easier it is to grasp by eye, and the easier it is to bring your "sense" to bear. Yet data scientists all too often end up saying "we need more data."
That is precisely why the "the more, the better" attitude has to change. As long as your sampling yields a small sample that properly represents the population, then when an action derived from analyzing that data does not work out, the thing to consider should be "change to a more appropriate approach," not "get more data," the article says.
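A toy comparison shows why representativeness matters more than volume. Everything in the sketch below is hypothetical: the population is simply the integers 0 to 9,999, and the "large but biased" sample is a cartoon of a collection process that only ever sees the low end of the population.

```python
import random
import statistics

random.seed(42)  # fixed seed so the draw is reproducible

# Hypothetical population: 10,000 values (made up for illustration).
population = list(range(10_000))
true_mean = statistics.fmean(population)

# Large but biased sample: only the 5,000 smallest values,
# e.g. only the cases that happen to be easiest to observe.
biased_sample = population[:5_000]

# Small but representative sample: 100 values drawn uniformly at random.
random_sample = random.sample(population, 100)

biased_error = abs(statistics.fmean(biased_sample) - true_mean)
random_error = abs(statistics.fmean(random_sample) - true_mean)

print(f"error of the mean, 5,000 biased values: {biased_error:.1f}")
print(f"error of the mean, 100 random values:   {random_error:.1f}")
```

Collecting more data through the same biased process would not shrink the first error at all, which is the sense in which "need more data" is a fallacy.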
Impressions
The article consistently writes "overtraining" rather than "overfitting" and "in-sample vs. out-sample" rather than train / dev / test (private), so the terminology felt a little different from what I am used to, and I found myself puzzled in various places while reading...
My personal opinion: if this article is mainly trying to make a point about how to manage and operate machine learning (statistical) modeling in business practice, then the question of "what matters in business practice" should come first. For example, do you want to put the weight on "explanation" or on "prediction," and so on.
Beyond that, I think the key is to observe the general precautions for modeling. Time series modeling in particular is like a minefield with mines buried everywhere, so I felt that one should stick firmly to a few fairly simple rules, while still taking conscious care not to fall into the pitfalls.