I'm Leaving Mercari

What is this?

This is a so-called resignation post. As the title says, I will be leaving Mercari at the end of this month.

Counting around one departure in between, I worked there for five years in total, and today is my last day in the office. The company did a great deal for me, so I'd like to leave behind a personal retrospective with my gratitude.

Note: there is almost no negative content here. To everyone who saw the word "resignation" and opened this article expecting drama: sorry to disappoint 😅
Who are you?

My name is ML_Bear. These days I call myself a machine learning engineer while also working as a generative-AI hype man.

When I joined Mercari I was a digital marketer and data scientist, but as described below, I was given the opportunity to change careers to machine learning engineer during my time there.
What did you do at Mercari?

Roughly speaking, I worked as a digital marketer / data scientist in the first half and as a machine learning engineer in the second half.

I joined Mercari in September 2018, while the excitement of the IPO (June 2018) was still in the air. I remember, like it was yesterday, being surprised that about 50 people joined in the same month as me.

Listing every job in detail would take forever, so I'll leave that to my portfolio page; the things that left the strongest impression are below.
- Digital marketer era
  - Together with a very aggressive manager and colleagues, I worked on projects that used Mercari's enormous data to cut costs in digital marketing and CRM.
  - Having access to user logs on a scale I had never seen before joining, and to large advertising budgets, was extremely stimulating.
- Wannabe-Kaggler era
  - I've forgotten exactly what got me started, but from around 2018 I was hooked on Kaggle and studying machine learning.
  - I wanted to use machine learning at work too, so I applied what I had learned on Kaggle to quickly build a coupon-distribution model, and was lucky enough to get good results.
  - Thankfully, that led to an opportunity to keep working as a machine learning engineer, and the career change went through.
- Mercari, 2nd season
  - About a year after leaving Mercari once, circumstances brought me an opportunity to return.
  - I worked on improving product recommendations and learned how to build recommendation systems that hold up in production using various Google Cloud products.
  - At the time I was, by any generous estimate, only half a machine learning engineer, so it felt like being paid to study.
  - Those were days when I could greatly expand my possibilities as an ML engineer, and I was glad I had come back.
  - Example project from that period
- Generative-AI hype man era
  - My interests shift easily, and once the generative AI boom arrived, they moved to generative AI.
  - Luckily a generative AI team had just been set up internally, so I pushed a little (a lot?) and got transferred to it.
  - I worked on improving internal generative AI products and on applying generative AI to projects that are hard to tackle with conventional ML models. (Example project: large-scale product category classification using LLMs)
  - Getting so many chances to work with generative AI in real projects from its earliest days was nothing short of lucky.
What was great about Mercari

- A culture that embraces challenges
  - True to the "Go Bold" value, I was allowed to take on all sorts of stimulating challenges boldly; I think the sections above make that clear.
  - That someone as fickle as me could keep working for five years without getting bored, in a half-work, half-study state, was a small miracle. To all my managers over the years: thank you.
  - I am grateful beyond words for being given the chance to attempt a career change to machine learning engineer while employed.
- Diversity
  - People of many nationalities and backgrounds worked there, and I learned a lot from them. My vocabulary is too limited to put it well (haha), but the world came to feel much closer.
  - Since English was the official language, the company supported my English study and I became able to speak a little. I still can't handle complicated discussions, so I keenly feel I need more practice.
- A culture of assuming good faith
  - The various rules were fundamentally operated on the assumption of good faith.
  - There were few pointless rules, and the sense that employees were trusted felt good.
  - Occasionally someone caused an incident, but overall I feel that having few pointless rules is the more efficient way.
- Investment in tech PR
  - A small point, but the company put real effort into technical public relations.
  - I was happy to get support, again and again, for publicizing projects we did internally.
- Compensation and benefits
  - I won't give concrete figures, but everyone loves salary talk, so let me write a little. (All rosy platitudes would be boring, right?)
  - Up front: I think the compensation is on the generous side. The figures on sites like OpenSalary feel roughly in the right ballpark. (I can't judge whether the ranges for the higher grades are accurate.)
  - Beyond salary there's sick leave, your choice of PC, a fully remote setup, and so on; as an engineer the environment was so comfortable I have nothing to complain about (high praise).
Why are you quitting?

Ask me when we meet 😅

...writing it like that sounds cryptic, but there's no deep meaning. After five years of running at full speed inside the same company, my thinking had gotten a bit rigid, so I simply want a refresh, that kind of thing.

If I touch on just one thing: the changes that come with a company's growth were one factor in the decision. I'm glad I got to witness Mercari mature as an organization, but at the same time, the peculiar chaotic energy of a startup gradually settling down, and the friends I worked with leaving one by one, did make me lonely; it would be a lie to say otherwise.

What will you do after quitting?

I haven't decided yet.

For now I plan to do Kaggle again for the first time in a while, finish the book I'm writing, and generally do the things that are only possible when you have time to spare.

I figure I'll take about half a year to settle on the next challenge. Sooner would be better, of course, but there's no need to rush either, so I intend to take it easy.
"You're writing a book?" (PR)

Right. I hadn't mentioned it, because abandoning it halfway would have been embarrassing, but I'm rewriting the book I released as a web edition last year into a proper printed book for real bookstores.

I started light-heartedly, thinking I'd just add a chapter on AI agents to the web edition. But once I started writing, LangChain changed its syntax, OpenAI kept shipping new features, and to top it off, multiple models rivaling GPT-4 (which I had assumed was unbeatable on this planet) appeared; it was change after change, and it's been rough.

It should reach bookstores around summer; please pick it up if you're interested.
In closing

Thank you for reading this rambling entry, and sorry for dragging you through the promotion, haha.

Five years of memories can't possibly fit into a single article, so I'd be happy to talk more when we next meet.

See you!
How to write pandas code that doesn't get slow

What is this?

- This article is the 7th-day entry of the Kaggle Advent Calendar 2021.
- pandas is extremely convenient as a data-analysis library, but it has the drawback that the wrong coding style easily makes processing slow. In this article I introduce several points to watch so your code doesn't end up slow.
- The article is an export of the execution results of this Colab Notebook, with some unnecessary parts trimmed. If you copy and run the colab notebook you should be able to reproduce everything. (I probably can't reply to comments left on the colab, sorry.)
Assumptions

- This article sticks to "natural" ways of writing that simply avoid being slow. The following kinds of optimizations are therefore out of scope here:
  - Parallelization libraries
  - Compilation with numba
  - GPU acceleration (with cudf, etc.)
  - Using BigQuery
  - Using other languages (C++, etc.)
Aside

- If you google "pandas speedup" you'll find plenty of introductions to parallelization libraries
- Basically I don't think you need to research that area much
- Every such library is subtly incompatible with pandas, and if you're going to learn something subtly incompatible anyway, I think cudf is the one to pick.
- It isn't bundled with colab by default today, but maybe it will be eventually... (that includes my wishful thinking)
Table of contents

- What is this?
- Table of contents
- Data preparation
- Loading data
- Never use iterrows (nor apply)
- Notes on dtype specification
- Leveraging numpy operations
- Fast joins
- Other exotic tools
- References
- Promotion
Data preparation

- First, let's prepare the data used in the examples
- Any data would have done, but I took the data used in this article and processed it a bit.
- Since the dataset was small, I multiplied the number of columns by 20 and the number of rows by 100.
```python
import gc
import string
import random

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

tqdm.pandas()

def make_dummy_location_name(num=10):
    chars = string.digits + string.ascii_lowercase + string.ascii_uppercase
    return ''.join([random.choice(chars) for i in range(num)])

def make_dummy_data(df, location_name):
    for i in range(20):
        df[f'energy_kwh_{i}'] = df[f'energy_kwh'] * random.random()
    df['location'] = make_dummy_location_name()
    return df

!wget https://raw.githubusercontent.com/realpython/materials/master/pandas-fast-flexible-intuitive/tutorial/demand_profile.csv

df_tmp = pd.read_csv('demand_profile.csv')
df_dummy = pd.concat([
    make_dummy_data(df_tmp.copy(), x) for x in range(100)
]).reset_index(drop=True)
df_dummy = df_dummy.sample(frac=1).reset_index(drop=True)
df_dummy.to_csv('data.csv')

display(df_dummy.info())
display(df_dummy[['date_time', 'location', 'energy_kwh_0', 'energy_kwh_1']].head(3))
```
| | date_time | location | energy_kwh_0 | energy_kwh_1 |
|---|---|---|---|---|
| 0 | 14/7/13 3:00 | DOymwZfkoV | 0.696740 | 0.419453 |
| 1 | 24/7/13 21:00 | smOT74HjRq | 0.213311 | 0.317483 |
| 2 | 4/6/13 9:00 | nKYmHeR2ov | 0.322995 | 0.413001 |
- Let's also make some data to demonstrate join / merge
- It's simply a list of location_ids
```python
locations = df_dummy['location'].drop_duplicates().values
df_locations = pd.DataFrame({
    'location': locations,
    'location_id': range(len(locations))
})
df_locations.head(3)
```
| | location | location_id |
|---|---|---|
| 0 | DOymwZfkoV | 0 |
| 1 | smOT74HjRq | 1 |
| 2 | nKYmHeR2ov | 2 |
Loading data

Now, let's get into the actual content. First, points to watch when loading data.
Using usecols

- When the data is large and you'll discard many columns, always specify usecols
- The loading speed changes dramatically

```python
usecols = ['date_time', 'energy_kwh_0', 'energy_kwh_1', 'energy_kwh_2', 'location']
```

```python
%%time
# without usecols
df = pd.read_csv('data.csv')
```

```
CPU times: user 4.84 s, sys: 126 ms, total: 4.97 s
Wall time: 4.96 s
```

```python
%%time
# with usecols
df = pd.read_csv('data.csv', usecols=usecols)
```

```
CPU times: user 2.86 s, sys: 97.1 ms, total: 2.95 s
Wall time: 2.93 s
```
Specifying dtypes

- Unless you really can't spare the effort, specify dtypes
- Making the columns you will group by `category` dtype has the benefit of faster aggregation (described later)
- It doesn't affect load speed, but it greatly reduces memory usage, so by preventing needless out-of-memory crashes it improves the efficiency of your trial-and-error loop.

```python
%%time
# specifying dtypes is good manners
# if working out the types yourself is a chore, using reduce_mem_usage (next section) is also fine
df = pd.read_csv(
    'data.csv', usecols=usecols,
    dtype={
        'date_time': str,
        'energy_kwh_0': float,
        'energy_kwh_1': float,
        'energy_kwh_2': float,
        'location': 'category'
    }
)
```

```
CPU times: user 2.79 s, sys: 73.1 ms, total: 2.87 s
Wall time: 2.86 s
```
cudf

- For files with tens of millions of rows, I think cudf is the way to go
- For the Kaggle Riiid competition data (roughly 100 million rows), there was a post reporting that loading took over a minute with pandas but finished in 3 seconds with cudf.
- Some downstream processing changes, so caution is needed, but as an option for huge datasets the performance is worth the learning cost.

```python
# import cudf
# cdf = cudf.read_csv('data.csv')
```
Never use iterrows (nor apply)

This has been covered in many articles (1)(2), and those who already know may think "why bring that up now"...

But given this article's title, there is no way around covering this topic.

- There are many ways to avoid being slow...
- If thinking about it is too much trouble, just remember: "convert to numpy arrays and then loop, and you get a passing grade"
- Also remember that eliminating loops with np.where or np.vectorize is fast; it's a handy fallback when you're stuck.
- ref

Feel the slowness of iterrows
Basically, just writing your code without iterrows avoids 90% of the catastrophic slowness (my impression).

Experts in the field tweet things like this:

"If you want to write fast pandas code, don't use iterrows, apply, or transform. Listen: if you want to write fast pandas code, don't use iterrows, apply, or transform." https://t.co/DLmWsBvAUG

— (@mamas16k) November 2, 2021

First, let's experience how slow iterrows is.

(The computation itself could be anything, so I wrote something arbitrary; it has no particular meaning.)
```python
# catastrophically slow
patterns = []
for idx, row in tqdm(df.iterrows(), total=len(df)):
    if row['energy_kwh_0'] > row['energy_kwh_1']:
        pattern = 'a'
    elif row['energy_kwh_0'] > row['energy_kwh_2']:
        pattern = 'b'
    else:
        pattern = 'c'
    patterns.append(pattern)
df['pattern_iterrows'] = patterns
```

```
CPU times: user 1min 23s, sys: 661 ms, total: 1min 23s
Wall time: 1min 24s
```
apply looks a bit better, but compared with the "non-slow" styles described later it is still incomparably slow.

```python
%%time
def func_1(energy_kwh_0, energy_kwh_1, energy_kwh_2):
    if energy_kwh_0 > energy_kwh_1:
        return 'a'
    elif energy_kwh_0 > energy_kwh_2:
        return 'b'
    else:
        return 'c'

df['pattern_iterrows'] = df.progress_apply(
    lambda x: func_1(x['energy_kwh_0'], x['energy_kwh_1'], x['energy_kwh_2']),
    axis=1
)
```

```
CPU times: user 18 s, sys: 494 ms, total: 18.4 s
Wall time: 18.5 s
```
The no-brainer style (convert to numpy arrays)

- Conclusion first: write it this way and you won't die
- On this example data it is roughly 100x faster than looping with iterrows and roughly 20x faster than apply
- It looks very similar to writing a for loop with iterrows, so it should be easy to remember

```python
%%time
# converting to numpy arrays before looping is fast enough
patterns = []
for idx, (energy_kwh_0, energy_kwh_1, energy_kwh_2) in tqdm(enumerate(
    zip(
        df["energy_kwh_0"].values,
        df["energy_kwh_1"].values,
        df["energy_kwh_2"].values
    )
), total=len(df)):
    # from here on, just write the same code as with iterrows
    if energy_kwh_0 > energy_kwh_1:
        pattern = 'a'
    elif energy_kwh_0 > energy_kwh_2:
        pattern = 'b'
    else:
        pattern = 'c'
    patterns.append(pattern)

df['pattern_by_np_array'] = patterns
assert np.array_equal(df['pattern_iterrows'], df['pattern_by_np_array'])  # sanity check (takes ~10ms)
```

```
CPU times: user 1.12 s, sys: 18.9 ms, total: 1.14 s
Wall time: 1.17 s
```
Various other styles

- The style above still runs a loop, so it falls somewhat short of the best possible performance.
- Below I introduce two ways of avoiding the loop entirely
- Combined with numpy operations they are fast, so they are worth knowing

Using np.where

```python
%%time
# for simple logic, consider expressing it with np.where and friends
df['pattern_np_where'] = 'c'
df['pattern_np_where'] = np.where(df['energy_kwh_0'] > df['energy_kwh_2'], 'b', df['pattern_np_where'])
df['pattern_np_where'] = np.where(df['energy_kwh_0'] > df['energy_kwh_1'], 'a', df['pattern_np_where'])
assert np.array_equal(df['pattern_np_where'], df['pattern_by_np_array'])  # sanity check (takes ~10ms)
```

```
CPU times: user 68.2 ms, sys: 0 ns, total: 68.2 ms
Wall time: 66.6 ms
```
Using np.vectorize

- Logic expressible with np.where is fast, but for more complex logic it takes real thought to design the equivalent vectorized operations.
- If that thinking is too much trouble, there is the numpy function np.vectorize, so let me introduce it.
- It carries inference and other overhead, so it is somewhat slower, but still far faster than the "numpy arrays + for loop" approach above.

```python
%%time
def func_1(energy_kwh_0, energy_kwh_1, energy_kwh_2):
    if energy_kwh_0 > energy_kwh_1:
        return 'a'
    elif energy_kwh_0 > energy_kwh_2:
        return 'b'
    else:
        return 'c'

df['pattern_np_vectorize'] = np.vectorize(func_1)(
    df["energy_kwh_0"], df["energy_kwh_1"], df["energy_kwh_2"]
)
assert np.array_equal(df['pattern_np_vectorize'], df['pattern_by_np_array'])  # sanity check (takes ~10ms)
```

```
CPU times: user 296 ms, sys: 35 ms, total: 331 ms
Wall time: 333 ms
```
åæå®ãããã
ãiterrows
使ããªãã§ãã®è¨äºã§è¨ããããã¨ã®90ï¼
ãããçµãã£ã¦ããã®ã§ãããä»ã«ãç´°ã
ã¨ããç¹ãå°ãããã®ã§ä»¥ä¸å°ãæãã¦ããã¾ããã¾ãã¯åæå®ã®è©±ã§ãã
- groupbyããã¨ãã«ã¯ã«ãã´ãªåããªãã¹ã使ã
groupby
ããã¨ãã®ãã¼ã(objectåã§ã¯ãªã) categoryåã ã¨æ©ã- 1åããéè¨ããªããªãã«ãã´ãªåã«å¤æããæéãç¡é§ãªã®ã§å¤æä¸è¦ã ãã大æµã®å ´åã¯ä½åº¦ãéè¨å¦çãããã®ã§categoryåã«ãã¦ããã¨è¯ã
- ãã®ä»ã®ã«ã©ã ãå¿
è¦ãªç²¾åº¦ã«å¿ãã¦ã«ãã´ãªå¤æãã¦ããã¨ã¡ã¢ãªä½¿ç¨éãåæ¸ã§ãã¦è¯ã
- èªåã§åãèããã®ãé¢åãªæã¯Kaggleã³ã¼ãéºç£ã® reduce_mem_usage ã使ãæããã
- é«éåããããã®ãç´¹ä»ããã¦ãã è¨äº
groupby aggregation keys should be category dtype

```python
# the dtype spec at load time already made this a category column,
# so convert back to object once to demonstrate the effect
df['location'] = df['location'].astype('object')
```

```python
%%time
hoge = df.groupby('location')['energy_kwh_0'].mean()
```

```
CPU times: user 59.8 ms, sys: 11 µs, total: 59.8 ms
Wall time: 59.1 ms
```

```python
%%time
# converting to category dtype takes a moment
df['location'] = df['location'].astype('category')
```

```
CPU times: user 59.4 ms, sys: 1.01 ms, total: 60.4 ms
Wall time: 64.2 ms
```

```python
%%time
# but once converted, subsequent aggregations are fast
hoge = df.groupby('location')['energy_kwh_0'].mean()
```

```
CPU times: user 11.6 ms, sys: 996 µs, total: 12.6 ms
Wall time: 16.7 ms
```

With pd.to_datetime, explicitly passing the format is wise

```python
%%time
# an extreme example, admittedly...
pd.to_datetime(df['date_time'])
```

```
CPU times: user 1min 21s, sys: 223 ms, total: 1min 21s
Wall time: 1min 21s
```

```python
%%time
# specifying the format skips inference, so it's fast
pd.to_datetime(df['date_time'], format='%d/%m/%y %H:%M')
```

```
CPU times: user 2.37 s, sys: 8.13 ms, total: 2.37 s
Wall time: 2.36 s
```

Specify columns when aggregating

```python
# %%time
# careful: this takes forever
# df.mean()
```

```python
%%time
df.mean(numeric_only=True)
```

```
CPU times: user 8.44 ms, sys: 37 µs, total: 8.48 ms
Wall time: 9.53 ms
```
Leveraging numpy operations

- (It goes without saying, but) operations implemented in numpy are faster written in numpy
- Unlike the examples above you won't see dozens-of-times speedups, but it's 10-30% faster, so it's worth the habit

```python
%%timeit
df[['energy_kwh_0', 'energy_kwh_1', 'energy_kwh_2']].sum(axis=1)
```

```
10 loops, best of 5: 43.1 ms per loop
```

```python
%%timeit
df[['energy_kwh_0', 'energy_kwh_1', 'energy_kwh_2']].values.sum(axis=1)
```

```
100 loops, best of 5: 11 ms per loop
```
Fast joins

Finally, a slightly more advanced speedup.

- When you want to merge pandas DataFrames and a certain condition holds, there is a technique: concat after reindex, which is fast.
- The condition is that the join key is unique in the DataFrame being joined.
- The syntax is fiddly, so it may be worth keeping a helper function that checks this constraint and performs the fast join.
- With the data size used in this article the difference isn't dramatic, but in the Kaggle Riiid competition notebook where (to my knowledge) this method first appeared, it was over 350x faster(!)

```python
%%time
df_merge = pd.merge(df, df_locations, how='inner', on='location')
```

```
CPU times: user 909 ms, sys: 9.94 ms, total: 918 ms
Wall time: 920 ms
```

```python
%%time
df_concat = pd.concat([
    df,
    df_locations.set_index('location')
        .reindex(df['location'].values)
        .reset_index(drop=True)
], axis=1)
```

```
CPU times: user 148 ms, sys: 3.77 ms, total: 151 ms
Wall time: 150 ms
```
Other exotic tools

- Parallelization libraries
  - If you use one, pandarallel is handy and easy
  - I use parallel_apply for workloads that can't be vectorized anyway, like tokenizing with mecab
  - I don't know the various other libraries well, and if you're going to invest study time anyway, learning cudf is the better medium-term bet.
- cudf
  - A few functions are unavailable, but it's basically blazing fast
  - Combines well with cuml
  - It would be great if colab bundled it so GPUs were casually usable
  - Installing it on colab is a pain
- numba
  - Fast if you write code it can compile
  - But if you're going to fight numba that hard, you might as well use cudf or BQ
References

- Articles I consulted while writing this one:
- How to make your Pandas operation 100x faster
- Do You Use Apply in Pandas? There is a 600x Faster Way
- Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects
- Dunder Data Challenge #2 — Explain the 1,000x Speed Difference when taking the Mean
- How to simply make an operation on pandas DataFrame faster
- Make Pandas Run Blazingly Fast
- Comparing blazing-fast cuDF with pandas (Japanese)
- Most of the pandas operations I use are collected in my own blog post
Promotion

Let me end with some promotion. (To be deleted after the event.)

- The team I belong to is holding a study session, so please join if you're interested
- In my current job I build the logic behind the recommendation section shown on the Mercari app's home screen.
- Analyzing enormous logs and building recommendation logic, using methods like the ones in this article, is quite fun
- The access volume is extraordinary, so designing logic that also accounts for how it will be embedded in the app is honestly back-breaking work (laughs). Even someone as fickle as I am doesn't get bored, which is wonderful.
- Applications close on 12/14, so if you're interested, sign up from the link below
A rough summary of what I learned from the Shopee competition solutions

Preface

- I read the solutions to the Shopee competition and learned a lot, so this is a rough collection of notes.
- I did not participate in the Shopee competition myself, so this is armchair knowledge; using it in practice would surely require various adjustments.
- Many parts are near-excerpts from the reference materials.
- If anything is wrong, please don't hesitate to tell me.
References

Solutions

- I read and summarized mainly the 1st and 2nd place solutions
- I pored over the 2nd place code until I nearly wore holes in it
- But I still don't understand the GCN part at all... orz

Blog posts, YouTube, etc.

- shimacos's article
  - On top of careful explanations of the competition and the solution, it describes the train of thought behind the solution and the things that were tried but didn't end up working.
  - If you haven't read it, I strongly recommend closing my article and reading that one first, haha
- Kaggle Shopee competition: private LB watch party & retrospective
  - Archived video of the solution walkthrough by the top Japanese finishers, held right after the competition ended
  - I understood less than half of it live, but listening again after doing my own research deepened my understanding a lot.
- asteriam's article
  - Summarizes the top solutions in one place
  - I used it as an aid when reading the top solutions, comparing the English and Japanese side by side.
Competition overview

- The description in shimacos's article was very concise and clear, so let me quote it:

"The competition was hosted by Shopee, one of the largest e-commerce platforms in Southeast Asia. As data you are given product images uploaded by users and product titles. As labels you are given the product type registered by users. Because these labels are user-registered they are noisy: the same image or the same title can carry different labels. The types are also finer-grained than you'd expect; the same cosmetic in 50ml and 100ml can have different labels. The task is to build a model that, using such user-assigned labels as supervision, extracts identical products from the product set using the image and the title text."

- So it was a competition where you vectorize the given images and text with NNs and then run retrieval with those vectors. According to the 1st place solution, pushing the (embedding) model itself to the limit was not really the essence; the key was how cleverly you use the extracted vectors for retrieval.
Example solution (2nd)

- 1st stage: Train metric learning models to obtain cosine similarities of image, text, and image+text data
  - Obtain vectors with timm / huggingface backbones
  - Concatenate them
  - Index with faiss and run nearest-neighbor search
  - Apply query expansion and concat (described below)
    - weights are the sqrt of the similarity (similar to αQE)
- 2nd stage: Train "meta" models to classify whether a pair of items belongs to the same label group or not.
  - Used LightGBM and GAT (Graph Attention Networks)
Bullet notes on what I learned

Using pre-trained models

- For images timm, and for NLP transformers, are more or less the de facto standard
  - From well-worn models to the latest ones, all sorts of models are easily available.
  - Thinking about how to fine-tune these models appears to be the standard first move.
  - Both 1st and 2nd place used this setup.
  - The models they used were fairly similar, too.

timm (Github)
- Image NN models are updated remarkably frequently
- Examples of models used in the top solutions
  - 1st
    - eca_nfnet_l1
      - a lightweight variant of nfnet?
  - 2nd
    - vit_deit_base_distilled_patch16_384
      - a transformer for images
    - dm_nfnet_f0
      - does not use batch normalization / a new model that appeared in 2021/02
- Example inference code

```python
# https://www.kaggle.com/lyakaap/2nd-place-solution
import timm

backbone = timm.create_model(
    model_name='vit_deit_base_distilled_patch16_384',
    pretrained=False)
model1 = ShopeeNet(backbone, num_classes=0, fc_dim=768)
model1 = model1.to('cuda')
model1.load_state_dict(checkpoint1['model'], strict=False)
model1.train(False)
model1.p = 6.0
```
huggingface (ref)

- The super-famous library of the NLP world
- With the AutoModel mechanism(?), you can switch pre-trained models just by changing the path you load.
  - It looks incredibly convenient, yet I had never heard of it until now...
- Many minor-language and multilingual models exist
  - This EC site was Indonesian, so Indonesian BERT models were strong.
- models
  - 1st
    - xlm-roberta-large
    - xlm-roberta-base
    - cahya/bert-base-indonesian-1.5G
    - indobenchmark/indobert-large-p1
    - bert-base-multilingual-uncased
  - 2nd
    - cahya/bert-base-indonesian-522M
    - Multilingual-BERT (I haven't looked up the huggingface model name)
    - Paraphrase-XLM embeddings (same)
- Example inference code

```python
# https://www.kaggle.com/lyakaap/2nd-place-solution
from transformers import AutoTokenizer, AutoModel, AutoConfig

model_name = params_bert2['model_name']
tokenizer = AutoTokenizer.from_pretrained('../input/bertmultilingual/')
bert_config = AutoConfig.from_pretrained('../input/bertmultilingual/')
bert_model = AutoModel.from_config(bert_config)

model2 = BertNet(
    bert_model,
    num_classes=0,
    tokenizer=tokenizer,
    max_len=params_bert['max_len'],
    simple_mean=False,
    fc_dim=params_bert['fc_dim'],
    s=params_bert['s'],
    margin=params_bert['margin'],
    loss=params_bert['loss']
)
model2 = model2.to('cuda')
model2.load_state_dict(checkpoint2['model'], strict=False)
model2.train(False)
```
Deep metric learning

- yu4u's article covers the overview in detail
- The appealing points (an ArcFace-style head sketch follows this list):
  - You get metric learning just by training an ordinary classification problem
  - Add a single custom layer to an easy-to-train classification model and you can train it as a plain classification task; the loss can stay cross entropy.
- Most of the top teams used it
- Tuning seems to have been a struggle
  - The 1st place team apparently had considerable trouble tuning ArcFace and used these tricks:
    - increase margin gradually while training
    - use large warmup steps
    - use larger learning rate for cosine head
    - use gradient clipping
  - 4th place team:
    - "we also saw batch size to matter during training"
    - "Some models seemed to be very sensitive to the learning rate"
    - "gradient clipping may have also helped to stabilize the training."
  - shimacos:
    - In the early phase, training often barely progressed, and parameter tuning was hard
    - Raising the learning rate and lengthening warmup made training progress more easily
    - Early in training, the differences in predicted values between classes are less pronounced than with plain softmax, which may be why training is hard
- 2nd place used CurricularFace
  - It outperformed ArcFace, which most teams used
  - It automatically adjusts the relative importance of easy and hard samples according to the training stage?
  - The image I have: ArcFace plus a mechanism that cleverly picks training samples?
- 1st: class-size-adaptive margin was also usable to some degree
  - It was used in Google Landmark as well
  - There weren't as many classes as Landmark, so its effect was limited
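To make "add one layer and keep cross entropy" concrete, here is a minimal ArcFace-style margin head, simplified from commonly shared open-source implementations (my illustration of the technique, not any team's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginHead(nn.Module):
    """Cosine classifier with an additive angular margin (ArcFace-style)."""
    def __init__(self, in_features, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # cosine similarity between L2-normalized embeddings and class centers
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the margin m to the target-class angle only
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).bool()
        logits = torch.where(one_hot, torch.cos(theta + self.m), cosine)
        return logits * self.s  # scaled logits, fed to ordinary cross entropy

# usage: loss = F.cross_entropy(head(embeddings, labels), labels)
# at inference time, discard the head and use the embeddings for retrieval
```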
QueryExpansion / DataBase-side feature Augmentation

- Techniques from information retrieval for augmenting the query and the DB
- QueryExpansion: keep expanding the query based on retrieval results
  - Search with the original vector
  - Add the weighted vectors of retrieved items back into the original vector

```python
# https://www.kaggle.com/lyakaap/2nd-place-solution
def query_expansion(feats, sims, topk_idx, alpha=0.5, k=2):
    # formula (from the paper) for weighting the similar items that were retrieved
    weights = np.expand_dims(sims[:, :k] ** alpha, axis=-1).astype(np.float32)
    # add up the weighted vectors to obtain the new query vector
    feats = (feats[topk_idx[:, :k]] * weights).sum(axis=1)
    return feats

# img_D / img_I are the vectors and indices retrieved in the first-stage search
img_feats_qe = query_expansion(img_feats, img_D, img_I)
```
- DBA (DataBase-side feature Augmentation)
  - From lyakaap's memo:
    - Each database sample is refined by taking a weighted average of the descriptors of its neighbors.
    - Similar to QE; QE refines the query side, while DBA refines the DB side. DBA only needs to run once offline, so it doesn't affect query-time speed, which is its strength.
    - Why does it work? → Because descriptors move closer to their class centers. DBA pulls samples of the same class (assuming neighbors mostly belong to the same class) toward their class center.
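A tiny numpy sketch of the DBA idea, assuming `neighbor_idx` and `neighbor_sims` come from a first-pass faiss search (my illustration, not code from the solutions):

```python
import numpy as np

def dba_refine(feats, neighbor_idx, neighbor_sims, k=3):
    """Refine each database descriptor with a similarity-weighted
    average of its k nearest neighbors' descriptors."""
    w = neighbor_sims[:, :k, None]                       # (N, k, 1) weights
    refined = (feats[neighbor_idx[:, :k]] * w).sum(axis=1) / w.sum(axis=1)
    # re-normalize so cosine / inner-product search still works
    return refined / np.linalg.norm(refined, axis=1, keepdims=True)
```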
Tricks for ensembling embeddings

- According to shimacos (a sketch follows):

"What worked best was to L2-normalize each embedding and then concat them. Since embedding scales differ across models it's obvious in hindsight, but concatenating without L2 normalization did not give nearly this much improvement. Little techniques like this are often mentioned in passing in past competition solutions, so they are worth remembering."
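The trick itself is a couple of lines; a sketch with hypothetical feature matrices:

```python
import numpy as np

def concat_normalized(embeddings):
    """L2-normalize each embedding matrix row-wise, then concatenate along features."""
    normed = [e / np.linalg.norm(e, axis=1, keepdims=True) for e in embeddings]
    return np.concatenate(normed, axis=1)

# img_feats / text_feats are hypothetical (n_items, dim) matrices from different models
combined = concat_normalized([img_feats, text_feats])
```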
faiss

- A nearest-neighbor search library from Facebook Research (Github)
- faiss can apparently make full use of GPUs to speed up search, which presumably made it a good fit for this competition.
- A catchy Japanese article about it
  - It's said to be used at Mercari as well, in search, I believe.
- Example code for adding vectors and searching with them

```python
# https://www.kaggle.com/lyakaap/2nd-place-solution
import faiss

res = faiss.StandardGpuResources()
index_img = faiss.IndexFlatIP(params1['fc_dim'] + params2['fc_dim'])
index_img = faiss.index_cpu_to_gpu(res, 0, index_img)
index_img.add(img_feats)
similarities_img, indexes_img = index_img.search(img_feats, k)
```
Forest Inference

- A tool that makes GBM inference blazing fast on GPU
- From the RAPIDS official page:
  - "Using FIL (Forest Inference Library), a single V100 GPU can deliver up to 35x more inference throughput than a CPU-only node with 40 cores."

```python
# probably used roughly like this
import treelite
from cuml import ForestInference

clf = ForestInference()
clf.load_from_treelite_model(
    treelite.Model.load(
        '/tmp/tmp.lgb', model_format='lightgbm'
    )
)
clf.predict(X_test).get()
```
Other small points

Generalized Mean (GeM) Pooling

- lyakaap's memo
  - Equal to mean pooling at p=1 and max pooling at p=∞; the paper recommends p=3. More accurate than SPoC/MAC.
  - https://amaarora.github.io/2020/08/30/gempool.html
- The GeM pooling layer is trainable, so p can also be learned automatically
  - https://paperswithcode.com/method/generalized-mean-pooling#

tokenizer

- TweetTokenizer
  - Well suited to tokenizing casual text? (a small example below)
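For reference, a minimal usage example of NLTK's TweetTokenizer (output shown approximately):

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
tokenizer.tokenize("This product is waaay too cool!! :-) #shopee")
# => ['This', 'product', 'is', 'waaay', 'too', 'cool', '!', '!', ':-)', '#shopee']
```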
NVIDIA DALI

- Speeds up image loading and resizing.
- Reference: https://xvideos.hatenablog.com/entry/nvidia_dali_report
LightGBM features (2nd)

- For each product's top-50 candidate pairs, attach similarities and edit distance → train to predict whether the pair shared the same category (a sketch follows this list)
- Features
  - Similarity between the two products
  - Edit distance
  - Length and word count of each product's title
  - The average similarity of each product's top-N similar products
    - (I can vaguely feel why this one works)
  - Image size of each product
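As an illustration of what such a pairwise feature table could look like (my own sketch with hypothetical column names, not the 2nd place code):

```python
import editdistance
import pandas as pd

def make_pair_features(df, pairs, sims):
    """Build simple pairwise features for candidate pairs.

    pairs: list of (i, j) index pairs from the top-50 neighbor search
    sims:  cosine similarity for each pair
    """
    rows = []
    for (i, j), sim in zip(pairs, sims):
        t1, t2 = df.loc[i, 'title'], df.loc[j, 'title']
        rows.append({
            'similarity': sim,
            'edit_distance': editdistance.eval(t1, t2),
            'title_len_1': len(t1), 'title_len_2': len(t2),
            'n_words_1': len(t1.split()), 'n_words_2': len(t2.split()),
        })
    return pd.DataFrame(rows)
```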
Preprocessing punctuation

- To replace multiple characters (strings of length 1) at once, use the string translate() method; the translation table passed to translate() is built with str.maketrans().

```python
title.translate(str.maketrans({_: ' ' for _ in string.punctuation}))
```

- string: https://docs.python.org/ja/3/library/string.html
  - string.punctuation: "String of ASCII characters which are considered punctuation characters in the C locale"
- TfidfVectorizer: setting token_pattern=u'(?u)\\b\\w+\\b' stops single-character tokens from being excluded
Libraries that compute edit distance in one shot

- There are apparently several
  - editdistance
  - Levenshtein

```python
# https://github.com/roy-ht/editdistance
import editdistance
editdistance.eval('banana', 'bahama')  ## 2

# https://qiita.com/inouet/items/709eca4d8172fec85c31
# (the original sample was Python 2; updated here for Python 3)
import Levenshtein
string1 = "井上泰治"
string2 = "井上泰次"
print(Levenshtein.distance(string1, string2))  ## 1
```
stemmer

```python
import Stemmer
stemmer = Stemmer.Stemmer('indonesian')
```

LangID

Identifies the language for you

```python
import langid
result = langid.classify('これは日本語です')
print(result)
# => ('ja', -197.7628321647644)
```
Kaggle Riiid! competition participation report

What is this?

- A record of my participation in Riiid! Answer Correctness Prediction, held on Kaggle from Oct 2020 to Jan 2021
- I finished public 51st (0.801) → private 52nd (0.802), so the placing was nothing to boast about, but the competition design that mirrors real-world prediction tasks (described below) and the chance to handle over 100 million rows of rich data made it extremely educational.
- It overlaps with what I posted in the Discussion, but I'm writing it down as a personal memo (and a memorial service for my solution).
Competition overview

In a nutshell

- A competition to predict users' probability of answering correctly in a TOEIC study app
- Code Competition (a format where you submit code)
- The train data is about 100 million rows, the test data about 2.5 million
- However, as the next section describes, you cannot look at the test data

The subject was an app called SANTA TOEIC.

"The Riiid competition subject looks like a TOEIC prep app?" — ML_Bear (@MLBear2) October 9, 2020

"Santa reached No. 1 in sales among education apps in Japan and Korea." — so it has apparently topped the App Store in Japan and Korea. https://t.co/Fm1t5Owq2k
App LP → https://t.co/ZJhOH9Urfk
The competition's hallmark: a design faithful to real-world prediction tasks

- Because the competition used an unusual submission mechanism, you could not build features from future information, the Kaggle-typical (and hard-to-productionize) approach of throwing enormous numbers of features at the problem was blocked, and you could incrementally update models and features as new data arrived; all of which made it a very practical competition.
- Submission mechanism (a sketch of the loop follows this list)
  - Your submitted code calls the API specified at submission time
  - At execution time after submission, responses come back in batches of roughly 30-50(?) rows
  - When you predict and submit a batch, the next batch of data is sent
    - Hence you cannot generate features using future information
  - Each batch includes the ground-truth labels of the previous batch
    - Hence you can update features, retrain models, and so on
  - However, unless you process each batch in about 0.55 sec or less, the total processing time exceeds 9 hours and the submission fails with a Timeout Error
    - Hence models with enormous numbers of features are hard to use (too much time would be spent processing each incoming test batch)
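From memory, the submission loop looked roughly like this (riiideducation is the competition's official API module; the helper functions are hypothetical and details of the sketch may be off):

```python
import riiideducation

env = riiideducation.make_env()
model = load_model()  # hypothetical helper

for test_df, sample_prediction_df in env.iter_test():
    # 1. test_df carries the ground-truth labels of the previous batch,
    #    so update user/question features here
    update_features(test_df)  # hypothetical helper

    # 2. predict the current batch and hand it back (question rows only)
    test_df = test_df[test_df['content_type_id'] == 0].copy()
    test_df['answered_correctly'] = model.predict(build_features(test_df))
    env.predict(test_df[['row_id', 'answered_correctly']])
```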
- It should raise the value of hosting a competition, and personally I hope this format becomes mainstream.

"Complete agreement. Between the kaggle notebook memory limits and the feature-generation time limits, Riiid is one finished form of a competition. Tabular competitions could well be more like this. (Though it must be tough for beginners...)" https://t.co/zzRwOFAbka — ML_Bear (@MLBear2) December 7, 2020

I also think the strict time limits at submission time gave birth to many speedup tips.

"A kaggle notebook that speeds up pandas left join by more than 300x. It requires the join key of the joined table to be unique, but a light rewrite giving a 300x speedup is astonishing...! concat being faster is intuitive, but I didn't know reindex, so I learned something." https://t.co/NFugt6Nisr — ML_Bear (@MLBear2) November 24, 2020
Overview

- Period: 2020/10/06 - 2021/01/08
- Participating teams: 3,406
- Target variable: whether the user answers the question correctly (1 or 0)
- Provided data
  - Users' question-answering logs and lecture logs
  - Question ID, question group ID, question tag IDs (tag contents undisclosed), whether the user viewed the explanation after answering, time taken to answer, etc.
- Evaluation metric: AUC
Solution

- Weighted-average ensemble of LightGBM x2 + CatBoost x1
- About 130 features
- A teammate tried an NN (transformer) near the end, but it didn't make it in time.
Features that worked

Slightly clever ones

A metric of how "lame" an answer choice is

- At +0.006, by far the brightest star among the features we built 😂
- How it's made (a pandas sketch follows):
  - Using all of train, compute how often each choice of each question was selected
    - Example: selection rates of choices 1/2/3/4 of content_id=XXXX: 9% / 5% / 1% / 85%
  - Within each question, accumulate the selection rates to get each choice's percentile
    - Example: for the case above
      - Choice 1: 15% (=1+5+9)
      - Choice 2: 6% (=1+5)
      - Choice 3: 1%
      - Choice 4: 100% (=1+5+9+85)
  - Aggregate each user's past choice percentiles (std, avg, min, etc.)
- Intuition
  - Someone who picks lame choices that almost nobody picks is probably lame
  - A smart person, even when wrong, shouldn't pick the lame choices
  - Someone who picks choice 3 in the example above is probably lame, and will likely stay lame.
  - This intuition was probably right: the std aggregation worked amazingly well
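A pandas sketch of the computation (content_id / user_answer follow the competition's column names; the rest is my reconstruction):

```python
# 1. selection rate of each choice, per question, over all of train
counts = train.groupby(['content_id', 'user_answer']).size()
rates = (counts / train.groupby('content_id').size()).rename('rate').reset_index()

# 2. cumulative percentile of each choice within its question
#    (sort rare choices first, then cumulatively sum the rates;
#     choice 3 in the example above gets 1%, choice 4 gets 100%)
rates = rates.sort_values(['content_id', 'rate'])
rates['choice_percentile'] = rates.groupby('content_id')['rate'].cumsum()

# 3. join back to the log and aggregate each user's past percentiles
train = train.merge(rates, on=['content_id', 'user_answer'], how='left')
user_lameness = train.groupby('user_id')['choice_percentile'].agg(['std', 'mean', 'min'])
```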
A slightly clever Word2Vec, by a teammate

- It appeared to leak slightly, yet it contributed about +0.004 and was also very effective
- How it's made (a sketch follows):
  - For each user, line up tokens of the form question_(correct|incorrect) and vectorize those tokens with word2vec
  - Average the vectors of the user's last N question_(correct|incorrect) tokens to vectorize the user
  - Compute the cosine similarity between the user vector and the next question's question_(correct) / question_(incorrect) vectors
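A gensim sketch of the idea (hypothetical column names, gensim 4.x API):

```python
from gensim.models import Word2Vec

# tokens like "1234_1" (question 1234 answered correctly) / "1234_0"
train['token'] = (train['content_id'].astype(str) + '_'
                  + train['answered_correctly'].astype(str))
sentences = train.groupby('user_id')['token'].apply(list).tolist()
w2v = Word2Vec(sentences, vector_size=32, window=5, min_count=1)

# user vector = mean of the vectors of the user's last N tokens;
# the feature is the cosine similarity between that vector and the
# "next question answered correctly" / "... incorrectly" token vectors
```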
trueskill, by a teammate

- Score the strength of each question and each user with trueskill (a sketch follows)
- From those scores, compute the probability that the user beats (answers correctly) the question
- Its importance was always at the top, yet it gained only about +0.001
  - We added it at the very end, so it may have been cannibalized by other features
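A sketch with the trueskill package, treating every attempt as a 1-vs-1 match between user and question; the win-probability helper is a widely used recipe (not the teammate's exact code):

```python
import math
import trueskill

def win_probability(user, question, env=trueskill.global_env()):
    delta_mu = user.mu - question.mu
    denom = math.sqrt(2 * env.beta ** 2 + user.sigma ** 2 + question.sigma ** 2)
    return env.cdf(delta_mu / denom)

user_r, question_r = trueskill.Rating(), trueskill.Rating()
p = win_probability(user_r, question_r)                       # feature: P(correct)
user_r, question_r = trueskill.rate_1vs1(user_r, question_r)  # user answered correctly
```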
- Weighted tally of correct answers, by a teammate
  - Count correct answers weighted by the reciprocal of each question's correct-answer rate

Basic ones (excerpt)

- Various TargetEncodings (question correct rate, share of users who can answer the question, etc.)
- Various user logs (correct rate over the last 400 questions, correct rate within the same part over the last 400 questions, etc.)
  - We limited aggregation to the last 400 questions for memory reasons, but extending to 800 didn't change accuracy. Would keeping unlimited logs have changed anything?
- Lag features on timestamp (a sketch covering these follows the list)
  - Elapsed time since the user last solved the same question
  - Elapsed time since the previous question
- Features derived from timestamp
  - Using elapsed time
    - Elapsed time since the previous question / the average time for that task_container
    - Elapsed time since the previous question / the average time for that task_container (aggregated over correct answers only)
    - Since these worked, I suspected the timestamp was "the timestamp when the question was answered"; I wonder if that's right?
  - Reproducing, as closely as possible, the lag time from the SAINT paper
    - Elapsed time since the previous question minus time spent on the previous question
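Typical lag features like these are one-liners with groupby + diff (a sketch with hypothetical column names):

```python
df = df.sort_values(['user_id', 'timestamp'])

# elapsed time since the user's previous question
df['lag_prev'] = df.groupby('user_id')['timestamp'].diff()

# elapsed time since the same user last saw the same question
df['lag_same_q'] = df.groupby(['user_id', 'content_id'])['timestamp'].diff()

# normalized by the average gap for that task_container
df['lag_ratio'] = df['lag_prev'] / \
    df.groupby('task_container_id')['lag_prev'].transform('mean')
```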
- Plain Word2Vec
  - For each user, line up the questions → treat each question as a word and vectorize with word2vec
  - We built many variants: per part, correct-only, wrong-only, various window sizes, and so on
  - My impression was that every added variant kept raising the score
Things that didn't work

There are still plenty more, but here are just the puzzling ones.

- tag
  - From the distribution of parts each tag appeared in, we could sort out the tags
    - We could tell grammar tags from intonation tags, and so on
  - Despite that, no matter how we processed them and fed them to the model, they had no effect at all
    - We tried target encoding, counting a user's past correct answers per tag, etc.
- Lecture
  - We tried simple counts and the like, but could not make any use of it
Other things we did well

- stickytape
  - We managed our code on Github as a team.
  - A teammate set up stickytape to generate a script that bundles the feature-generation code and its dependencies into a form we could paste into a kaggle notebook.
- BigQuery
  - During training we generated data with BigQuery, and only rewrote the features that worked into the submission pipeline.

"Is this competition my day job? — I wrote a LOT of SQL while wondering that. The SQL a teammate wrote was so well thought out and polished that I caught myself getting a little jealous, haha. I'd love the people pushing the 'Kaggle is useless at work' theory to see this." — ML_Bear (@MLBear2) December 6, 2020
Regrets

These are reflections from before reading the top solutions, so they may well change once I do.

- We didn't tackle NNs until the very end (couldn't)
  - My impression from DSB2019, where the top teams placed highly with almost pure LightGBM, was strong, so I kept putting NNs off
  - We lost a lot of time building the submission code (porting local features).
- Features we had built separately cannibalized each other badly
  - Merging everyone's features into one model barely raised the score, which was, to put it mildly, a shock.
  - We expected about +0.006 from simple addition but gained only about +0.001
  - I felt somewhat heartbroken at that point...
Top solutions

- They should be getting published one after another, so I plan to collect them into a separate article later
A practical pandas introduction for people who want to fight on Kaggle

Introduction

- I was originally bad at pandas; in Kaggle competitions I basically built features in SQL on BigQuery and did only the bare minimum of pandas data handling.
- But I ended up joining a code competition and needed to process data nimbly in python, so I studied.
- Based on my study notes from that time, I've summarized the main pandas features that I believe are enough to hold your own on Kaggle.
Notes

- I intended a practical introduction, but it ended up mostly a dictionary orz
- This does not cover "what is pandas"-type content (import pandas, what a DataFrame is, and so on)
- I tried to write it so it also works on pandas 1.x; apologies if anything is wrong
Table of contents

- Introduction
- Table of contents
- Options
- Reading and writing DataFrames
- Data cleaning
- DataFrame manipulation
- Assorted computations
- Categorical variable encoding
- String operations
- Date handling
- Visualization
- Parallel processing
- Bonus: reading and writing Excel
- How to internalize pandas?
- In closing
Options

The incantation that stops jupyter notebook from truncating DataFrame display. Somehow I keep forgetting how to write it.

```python
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
```
Reading and writing DataFrames

CSV files

Reading

read_csv has a surprising number of options, and they are hard to remember.

```python
# basics
df = pd.read_csv('train.csv')

# when there is no header (columns are numbered sequentially)
df = pd.read_csv('train.csv', header=None)

# no header, and you want to name the columns yourself
df = pd.read_csv('train.csv', names=('col_1', 'col_2'))

# specifying which columns to use
df = pd.read_csv('train.csv', usecols=['col_1', 'col_3'])
# a lambda is also usable
df = pd.read_csv('train.csv', usecols=lambda x: x != 'col_2')

# column names: renaming after load
df = df.rename(columns={'c': 'col_1'})

# loading with dtypes specified (unspecified columns are auto-inferred)
## unless memory is tight, it's also common to read_csv without dtypes
## and then apply `reduce_mem_usage` (described later)
df = pd.read_csv('train.csv', dtype={'col_1': str, 'col_3': str})

## dtypes: changing after load
df['col_1'] = df['col_1'].astype(int)  # float / str / np.int8 ...

# parsing datetime columns
df = pd.read_csv('train.csv', parse_dates=['created_at', 'updated_at'])
```

- Reading csv/tsv files with pandas
- Parsing columns as datetime while doing read_csv in pandas
Writing

```python
# basics
df.to_csv('file_name.csv')

# when the index is unneeded (kaggle submission files don't need it; easy to forget)
submission.to_csv('submission.csv', index=False)
```
Pickle files

```python
# basics
df = pd.read_pickle('df.pickle')
df.to_pickle('df.pickle')

# when the data is heavy you can zip it (a bit too slow for practical use, though)
## writing: just change the extension to zip or gzip
df.to_pickle('df.pickle.zip')
## reading: read_pickle looks at the extension and decompresses automatically
df = pd.read_pickle('df.pickle.zip')
```

- Saving and loading pandas.DataFrame / Series with pickle
- An investigation of pandas persistence formats
  - An article comparing pickle / feather / parquet
  - Aside from file size at save time, pickle reportedly wins on most fronts
  - parquet reportedly gives the smallest files
Tricks for reducing memory usage

If you make a habit of reducing memory right after loading a file, many things go more smoothly.

Changing dtypes

```python
# reduce memory with `reduce_mem_usage`, a staple on kaggle
## internally it downcasts each column to fit its value range
## see ref for a `reduce_mem_usage` implementation (one common variant is also sketched below)
df = reduce_mem_usage(df)

# in practice, memory reduction is often applied right after read_csv
df = pd.read_csv('train.csv') \
       .pipe(reduce_mem_usage)

# aside: pipe often improves readability
# f(g(h(df), arg1=1), arg2=2, arg3=3)
df.pipe(h) \
  .pipe(g, arg1=1) \
  .pipe(f, arg2=2, arg3=3)
```
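For completeness, one common community variant of reduce_mem_usage (many versions circulate on Kaggle; this sketch covers the int/float downcasting core and assumes numpy/pandas imported as np/pd):

```python
def reduce_mem_usage(df):
    """Downcast numeric columns to the smallest dtype that fits their value range."""
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtype
        if col_type == object or str(col_type) == 'category':
            continue
        c_min, c_max = df[col].min(), df[col].max()
        if str(col_type).startswith('int'):
            for np_type in (np.int8, np.int16, np.int32, np.int64):
                if np.iinfo(np_type).min <= c_min and c_max <= np.iinfo(np_type).max:
                    df[col] = df[col].astype(np_type)
                    break
        else:
            if np.finfo(np.float32).min < c_min and c_max < np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print(f'Memory usage: {start_mem:.2f} MB -> {end_mem:.2f} MB')
    return df
```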
Dropping unneeded columns

```python
import gc

# drop also works: df.drop('col_1', axis=1, inplace=True)
del df['col_1']; gc.collect();
```
Data cleaning

Handling missing data

```python
# drop rows that contain any missing value
df1.dropna(how='any')

# ignore rows where a specific column is missing
df = df[~df['col_1'].isnull()]

# fill
df1.fillna(value=0)
```

Deduplication

```python
# basics
df2.drop_duplicates()

# specifying the columns to deduplicate on
df2.drop_duplicates(['col_1'])

# specifying which row to keep
df2.drop_duplicates(['col_1'], keep='last')  # keep='first' / False (drop all)
```

Interpolation (interpolate)

- Probably rarely used on Kaggle, but likely useful in real work (a one-liner example below).
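A minimal example:

```python
# linear interpolation of missing values (method='linear' is the default)
df['col_1'] = df['col_1'].interpolate()
```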
DataFrame manipulation

Displaying DataFrame information

```python
# show row count, column count, memory usage, dtypes, non-null counts
df.info()

# get rows x columns
df.shape

# get the row count
len(df)

# show the first / last N rows
df.head(5)
df.tail(5)

# get the list of column names
df.columns

# summary statistics per column
## min/max/mean/std etc. for numeric columns
df.describe()
## count/unique/freq etc. for categorical columns
df.describe(exclude='number')
## specifying which percentiles to display
df.describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99])
```

Slicing (iloc / loc / (ix))

```python
# basics
df.iloc[3:5, 0:2]
df.loc[:, ['col_1', 'col_2']]

# specify rows by number and columns by name
# (depending on the version ix also works, but it's deprecated)
df.loc[df.index[[3, 4, 8]], ['col_3', 'col_5']]
```

Selecting columns by dtype

```python
# exclusion is also possible
df.select_dtypes(
    include=['number', 'bool'],
    exclude=['object'])
```

Selecting rows by condition

```python
# basics
df[df.age >= 25]

# OR condition
df[(df.age <= 19) | (df.age >= 30)]

# AND condition
df[(df.age >= 25) & (df.age <= 34)]
## between also works (rarely seen, though)
df[df['age'].between(25, 34)]

# IN
df[df.user_id.isin(target_user_list)]

# query syntax: opinions differ, but I personally like it
df.query('age >= 25') \
  .query('gender == "male"')
```

Resetting the index

```python
# basics
df = df.reset_index()

# destructive change
df.reset_index(inplace=True)

# with drop=False the old index is added as a column
df.reset_index(drop=False, inplace=True)
```

Dropping columns

```python
# basics
df = df.drop(['col_1'], axis=1)

# destructive change
df.drop(['col_1'], axis=1, inplace=True)
```

Converting to a numpy array

```python
# keeping df['col_1'] as-is retains the index, which can silently
# misalign values when attaching to another df,
# so downstream processing is often done on a numpy array instead
df['col_1'].values
```
Concatenation and joins

Concatenation

```python
# concat
## basics (stack vertically: the columns become the union across DataFrames)
df = pd.concat([df_1, df_2, df_3])
## join horizontally
df = pd.concat([df_1, df_2], axis=1)
## stack using only the columns common to all DataFrames
df = pd.concat([df_1, df_2, df_3], join='inner')
```

- pandas cheatsheet
  - The "Reshaping Data" section has easy-to-grasp color diagrams
- Concatenating pandas.DataFrame / Series with concat
  - I'm so bad at concat that I must have looked at this article 100+ times, haha

Joins

merge: joining on specified keys

```python
# basics (inner join)
df = pd.merge(df, df_sub, on='key')

# multiple columns as the key
df = pd.merge(df, df_sub, on=['key_1', 'key_2'])

# left join
df = pd.merge(df, df_sub, on='key', how='left')

# when the column names differ between left and right
df = pd.merge(df, df_sub, left_on='key_left', right_on='key_right') \
       .drop('key_left', axis=1)  # both keys remain, so drop one
```

join: joining on the index

```python
# basics (left join: differs from merge, so careful)
df_1.join(df_2)

# inner join
df_1.join(df_2, how='inner')
```
Random sampling

```python
# sample 100 rows
df.sample(n=100)

# sample 25%
df.sample(frac=0.25)

# fix the seed
df.sample(frac=0.25, random_state=42)

# allow duplicates: the default is replace=False
df.sample(frac=0.25, replace=True)

# sample columns
df.sample(frac=0.25, axis=1)
```

Sorting

```python
# basics
df.sort_values(by='col_1')

# sort by index
df.sort_index(axis=1, ascending=False)

# multiple keys & per-key ascending/descending
df.sort_values(by=['col_1', 'col_2'], ascending=[False, True])
```

argmax / TOP-N style operations

```python
# find the row/column with the largest value
df['col1'].idxmax()

# find the column with the smallest sum
df.sum().idxmin()

# TOP-N: take the top 5 by col_1 → break ties by col_2
df.nlargest(5, ['col_1', 'col_2'])  # .nsmallest: bottom N
```
Assorted computations

Frequently used basics

```python
# aggregation
df['col_1'].sum()  # mean / max / min / count / ...

# get unique values
df['col_1'].unique()

# number of unique values (count distinct)
df['col_1'].nunique()

# percentile
df['col_1'].quantile([0.25, 0.75])

# clipping
df['col_1'].clip(-4, 6)

# clipping at the 99th percentile
df['col_1'].clip(0, df['col_1'].quantile(0.99))
```

Frequency counts (value_counts)

```python
# (excluding NaN)
df['col_1'].value_counts()

# frequency counts (including NaN)
df['col_1'].value_counts(dropna=False)

# frequency counts (normalized to sum to 1)
df['col_1'].value_counts(normalize=True)
```
Rewriting values (apply / map)

Rewriting each element of a Series: map

```python
# apply some operation to each element
f_brackets = lambda x: '[{}]'.format(x)
df['col_1'].map(f_brackets)
# 0    [11]
# 1    [21]
# 2    [31]
# Name: col_1, dtype: object

# pass a dict to substitute values
df['priority'] = df['priority'].map({'yes': True, 'no': False})
```

Rewriting each row/column of a DataFrame: apply

```python
# basics
df['col_1'].apply(lambda x: max(x))

# of course your own function works too
df['col_1'].apply(lambda x: custom_func(x))

# to show a progress bar:
# from tqdm._tqdm_notebook import tqdm_notebook
df['col_1'].progress_apply(lambda x: custom_func(x))
```

- Applying functions to elements, rows, and columns in pandas: map, applymap, apply
- Showing progress for pandas map/apply methods in Jupyter notebooks

Other rewrites (replace / np.where)

```python
# replace
df['animal'] = df['animal'].replace('snake', 'python')

# np.where
df['logic'] = np.where(df['AAA'] > 5, 'high', 'low')

# np.where: a more complex version
condition_1 = (
    (df.title == 'Bird Measurer (Assessment)') & \
    (df.event_code == 4110)
)
condition_2 = (
    (df.title != 'Bird Measurer (Assessment)') & \
    (df.type == 'Assessment') & \
    (df.event_code == 4100)
)
df['win_code'] = np.where(condition_1 | condition_2, 1, 0)
```
Aggregation (agg)

```python
# basics
df.groupby(['key_id'])\
  .agg({
      'col_1': ['max', 'mean', 'sum', 'std', 'nunique'],
      'col_2': [np.ptp, np.median]  # np.ptp: max - min
  })

# when aggregating all columns the same way, a comprehension is fine
df.groupby(['key_id_1', 'key_id_2'])\
  .agg({
      col: ['max', 'mean', 'sum', 'std'] for col in cols
  })
```

Using aggregation results

This is mostly idiom, but it trips you up until you're used to it, so here's an example.

```python
# aggregate
agg_df = df.groupby(['key_id']) \
    .agg({'col_1': ['max', 'min']})

# the column names become max / min and you can't tell which key they belong to,
# and they form a MultiIndex, so flatten and rename
agg_df.columns = [
    '_'.join(col) for col in agg_df.columns.values]

# the aggregation result has key_id in the index, so push it out with reset_index
agg_df.reset_index(inplace=True)

# join back to the original DataFrame on key_id
df = pd.merge(df, agg_df, on='key_id', how='left')
```

Aggregation with pivot tables

```python
pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
               aggfunc={'D': np.mean, 'E': [min, max, np.mean]})
#                 D    E
#              mean  max      mean  min
# A   C
# bar large  5.500000  9.0  7.500000  6.0
#     small  5.500000  9.0  8.500000  8.0
# foo large  2.000000  5.0  4.500000  4.0
#     small  2.333333  6.0  4.333333  2.0
```
Array-wise operations without loops

Handy when computing, e.g., the difference from each column's mean.

```python
# does `df['{col}_diff_to_col_mean'] = df['{col}'] - df['{col}'].mean()` for all columns at once
df.sub(df.mean(axis=0), axis=1)

# besides sub there are add / div / mul (multiplication)
# the following does `df['{col}_div_by_col_max'] = df['{col}'] / df['{col}'].max()` in one shot
df.div(df.max(axis=0), axis=1)
```

Binning (cut / qcut)

```python
# split the range between min and max of df['col_1'] into 4 → bin using those boundaries
# i.e., the count per bin varies
pd.cut(df['col_1'], 4)

# split df['col_1'] into 4 equal-sized groups → the boundaries follow from that
# i.e., the bin widths vary
pd.qcut(df['col_1'], 4)
```

- Histogram visualization is covered in detail later
- Binning with pandas cut and qcut
Operations common with time-series data

shift: shifting values along rows/columns

```python
# shift down by 2 rows
df.shift(periods=2)

# shift up by 1 row
df.shift(periods=-1)

# shift by 2 columns (rarely used)
df.shift(periods=2, axis='columns')
```

rolling: moving averages and the like

```python
# sum over a window of width 3
df['col_1'].rolling(3).sum()

# multiple aggregations
df['col_1'].rolling(3) \
    .agg([sum, min, max, 'mean'])
```

cumsum: cumulative sums

There are analogous functions cummax and cummin.

```python
# df
#      A    B
# 0  2.0  1.0
# 1  3.0  NaN
# 2  1.0  0.0

# cumulative sum of the df above
df.cumsum()
#      A    B
# 0  2.0  1.0
# 1  5.0  NaN
# 2  6.0  1.0
```

diff, pct_change: row/column differences and rates of change

```python
# the dataframe used in the examples
#    col_1  col_2
# 0      1      2
# 1      2      4
# 2      3      8
# 3      4     16

# basics: difference from 1 row earlier
df.diff()
#    col_1  col_2
# 0    NaN    NaN
# 1    1.0    2.0
# 2    1.0    4.0
# 3    1.0    8.0

# difference from 2 rows earlier
df.diff(2)
#    col_1  col_2
# 0    NaN    NaN
# 1    NaN    NaN
# 2    2.0    6.0
# 3    2.0   12.0

# negative numbers also allowed
df.diff(-1)
#    col_1  col_2
# 0   -1.0   -2.0
# 1   -1.0   -4.0
# 2   -1.0   -8.0
# 3    NaN    NaN

# for rates of change, use `pct_change`
df.pct_change()
#       col_1  col_2
# 0       NaN    NaN
# 1  1.000000    1.0
# 2  0.500000    1.0
# 3  0.333333    1.0

# when the index is datetime, frequency codes can be used
# the following computes the rate of change versus `2 days earlier`
df.pct_change(freq='2D')
```

Aggregating by time unit

```python
# aggregate mean and max every 5 minutes
# see ref.2 for details on frequency codes like `min` and `H`
funcs = {'Mean': np.mean, 'Max': np.max}
df['col_1'].resample("5min").apply(funcs)
```

- Resampling time-series data with pandas: resample, asfreq
- How to specify frequency (the freq argument) for pandas time series
Categorical variable encoding

This deck covers the kinds of categorical encodings in detail.

One-Hot Encoding

```python
# process this DataFrame
#    name  gender
# 0  hoge    male
# 1  fuga     NaN
# 2  hage  female

# adding a prefix makes it clear which column the one-hot came from
tmp = pd.get_dummies(df['gender'], prefix='gender')
#    gender_female  gender_male
# 0              0            1
# 1              0            0
# 2              1            0

# after joining, drop the original column
df = df.join(tmp).drop('gender', axis=1)
#    name  gender_female  gender_male
# 0  hoge              0            1
# 1  fuga              0            0
# 2  hage              1            0
```

Label Encoding

```python
from sklearn.preprocessing import LabelEncoder

# an example of LabelEncoding data split into train and test in one go
cat_cols = ['category_col_1', 'category_col_2']
for col in cat_cols:
    # conventionally abbreviated `le`, I believe
    le = LabelEncoder().fit(list(
        # take the union of the labels in train & test
        set(train[col].unique()).union(
            set(test[col].unique()))
    ))
    train[f'{col}'] = le.transform(train[col])
    test[f'{col}'] = le.transform(test[col])

# label encoding reduces memory usage, so don't forget
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
```

- Note
  - The method above also encodes labels that appear only in test
  - If that bothers you, rewrite everything not in train to -1 or similar (personally I don't worry about it much, so I'm unsure what the correct practice is...)
- The Kaggle book's implementation
  - The Kaggle book LabelEncodes using only what appears in train

Frequency Encoding

```python
for col in cat_cols:
    freq_encoding = train[col].value_counts()
    # replace each label with its occurrence count
    train[col] = train[col].map(freq_encoding)
    test[col] = test[col].map(freq_encoding)
```

Target Encoding

```python
# the extremely crude way (not recommended)
## for each label of col_1, compute the mean and count of target(correct)
## labels below some count (say 1000) are ignored in the aggregation
target_encoding = (
    df.groupby('col_1')
      .agg({'correct': ['mean', 'count']})
      .reset_index()
      # rare labels cause leakage, so drop them
      .query('count >= 1000')
      .rename(columns={'correct': 'target_encoded_col_1'})
      # count was only used for the cutoff, so drop it
      .drop('count', axis=1)
)

train = pd.merge(
    train, target_encoding, on='col_1', how='left')
test = pd.merge(
    test, target_encoding, on='col_1', how='left')
```

- The example above is a very crude implementation. When doing it properly, read the Kaggle book's implementation and compute it per fold.
String operations

Most of these appear in the pandas official method list, so it's worth skimming once.

Basics

```python
# string length
series.str.len()

# replacement
series.str.replace(' ', '_')

# whether it starts (ends) with 'm'
series.str.startswith('m')  # endswith

# whether it contains a pattern
pattern = r'[0-9][a-z]'
series.str.contains(pattern)
```

Cleaning

```python
# upper/lower case
series.str.lower()  # .upper()

# capitalize (male → Male)
series.str.capitalize()

# extract alphanumerics: first match only
## a DF is returned when there are multiple groups
## extractall: returns all matches with a MultiIndex
series.str.extract('([a-zA-Z\s]+)', expand=False)

# strip surrounding whitespace
series.str.strip()

# character conversion
## before: Qiitaは、プログラミングに関する知識を記録・共有するためのサービスです。
## after:  Qiitaは,プログラミングに関する知識を記録共有するためのサービスです.
table = str.maketrans({
    '、': ',',
    '。': '.',
    '・': '',
})
result = text.translate(table)
```

str.translate() is convenient for character conversion
Date handling

Basics

```python
# basics: when you forgot to convert at load time, etc.
df['timestamp'] = pd.to_datetime(df['timestamp'])

# create a list of dates
dates = pd.date_range('20130101', periods=6)

# create a list of dates: 100 entries at 1-second intervals
pd.date_range('20120101', periods=100, freq='S')

# filter by date
df['20130102':'20130104']

# convert to unixtime
df['timestamp'].astype('int64')
```

Advanced date extraction

- pandas implements a remarkably sophisticated date-extraction machinery; extractions like "the 4th Saturday of every month" or "the first business day of the month" are instant. (Japanese holidays are not supported, so some adjustment with jpholiday (below) is needed.)
- "How to specify frequency (the freq argument) for pandas time series" covers this in detail; if you need date-related logic, I highly recommend reading it.

```python
# extract the last day of each month
pd.date_range('2020-01-01', '2020-12-31', freq='M')
# DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
#                '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
#                '2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31'],
#               dtype='datetime64[ns]', freq='M')

# extract the 4th Saturday of each month in 2020
pd.date_range('2020-01-01', '2020-12-31', freq='WOM-4SAT')
# DatetimeIndex(['2020-01-25', '2020-02-22', '2020-03-28', '2020-04-25',
#                '2020-05-23', '2020-06-27', '2020-07-25', '2020-08-22',
#                '2020-09-26', '2020-10-24', '2020-11-28', '2020-12-26'],
#               dtype='datetime64[ns]', freq='WOM-4SAT')
```

Holiday detection

- Not pandas, and I've (probably) never used it on kaggle, but it's handy in real work so I'll include it.
- jpholiday official

```python
import jpholiday
import datetime

# check whether a given date is a holiday
jpholiday.is_holiday(datetime.date(2017, 1, 1))  # True
jpholiday.is_holiday(datetime.date(2017, 1, 3))  # False

# get the holidays of a given month
jpholiday.month_holidays(2017, 5)
# [(datetime.date(2017, 5, 3), '憲法記念日'),
#  (datetime.date(2017, 5, 4), 'みどりの日'),
#  (datetime.date(2017, 5, 5), 'こどもの日')]
```
Visualization

The incantation that makes plots pretty

Adding the incantation from this Qiita article makes plots much prettier, so I highly recommend it.

```python
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
font = {'family': 'meiryo'}
matplotlib.rc('font', **font)
```

Simple plots

```python
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

# basics
df['col_1'].plot()

# plot several columns as 2x2 tiles
# (raises if the column count exceeds the tile count)
df.plot(subplots=True, layout=(2, 2))

# same, sharing the X and Y axes
df.plot(subplots=True, layout=(2, 2), sharex=True, sharey=True)
```

Histograms

```python
# histogram
df['col_1'].plot.hist()

# increase bins to 20 / narrow the bars to leave gaps
df['col_1'].plot.hist(bins=20, rwidth=.8)

# specify the X-axis range
## image: ages 0-100 in 5-year steps
df['col_1'].plot.hist(bins=range(0, 101, 5), rwidth=.8)

# make overlapping histograms translucent
df['col_1'].plot.hist(alpha=0.5)

# fix the Y-axis min/max
df['col_1'].plot.hist(ylim=(0, 0.25))
```

Box plots

```python
df['col_1'].plot.box()
```

Scatter plots

```python
df.plot.scatter(x='col_1', y='col_2')
```
Parallel processing

- pandas processing is, unfortunately, not fast. Compared with BigQuery and the like it's at a sorry level. (Though comparing raw processing speed is unfair...)
- When normalizing a huge number of features or mapping over a huge number of elements, putting parallelism to work helps.

```python
from multiprocessing import Pool, cpu_count

def parallelize_dataframe(df, func, columnwise=False):
    num_partitions = cpu_count()
    num_cores = cpu_count()
    pool = Pool(num_cores)
    if columnwise:
        # split by columns and process in parallel
        df_split = [df[col_name] for col_name in df.columns]
        df = pd.concat(pool.map(func, df_split), axis=1)
    else:
        # split by rows and process in parallel
        df_split = np.array_split(df, num_partitions)
        df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

# throw the DataFrame at some function, parallelized by columns
df = parallelize_dataframe(df, custom_func, columnwise=True)
```

- Make your Pandas apply functions faster using Parallel Processing (I believe the original source was a different article, but I couldn't find it...)

'20/07/28 addendum

- Libraries that handle the parallelism for you, like pandarallel and swifter, seem to be maturing.
- Going forward it may be better to use those.
- Reference: "Two libraries that speed up pandas in just a few lines (pandarallel/swifter)"
Bonus: reading and writing Excel

Not used on kaggle, but presumably a fair number of people use it at work? (I never have)

```python
# write
df.to_excel('foo.xlsx', sheet_name='Sheet1')

# read
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
```
How to internalize pandas?

First, I think the fastest route is to work through the materials in the official Tutorials, roughly in order (visualization aside).

When you want practical problems, the book "前処理大全" (Data Preprocessing Compendium) looked good. If you're entering Kaggle competitions, practicing while reading public Notebooks is probably sufficient.
In closing

I write various Kaggle-related articles, so have a read if you like.

A collection of practical tips

Competition reports

Kaggle Data Science Bowl 2019: top solutions summary

Edit history

- '20/01/28: added the 3rd place solution

What is this?

- A summary of the top solutions of the Data Science Bowl 2019 competition, held on Kaggle from 10/24 to 1/23.
- I compiled the top-10 solutions that were public as of 1/27.
- It was a competition with violent shake-up/down, yet there was much to learn from the solutions of the top finishers, most of whom seemed to have earned their places.
- I compiled this from skim-reading, so if you notice mistakes, please point them out.
- About four more gold-medal solutions have been published, so I plan to add them later.
1st

Stats

private 0.568 / public 0.563

Summary

- A single LightGBM model (!)
- 5 folds with a different seed per fold

Details

Validation

- The LB was unstable, so they ignored it
- They used the following two validation sets:
  - GroupK CV (installation_id / 5x5Fold)
    - QWK was unstable, so they adopted a weighted RMSE
    - weight: "the weight is the sample prob for each sample (We use full data, for the test part, we calculate the expectation of the sample prob as weight)." (Taking the reciprocal of how many Assessments there were?)
  - Nested CV
    - The CV above sometimes produced counterintuitive results
    - So they split Train locally and used it for checking
      - pseudo-train: 1400 users with full logs
      - pseudo-test: 2200 users with partially truncated logs
    - They repeated this 50-100 times and used the average (of the test evaluation?) as validation
Feature Engineering

Quantity

- They built about 20,000 features and cut them down to 500 with null importance

Content

- Features relating the target Assessment to similar games were very important. (They basically mapped which Assessment each game resembles, based on in-game order.)
- They computed mean/sum/last/std/max/slope over true attempt, correct true, and correct feedback.

They split the log data as follows when building features:

- Full history
- The last 5/12/48 hours
- From the previous Assessment to the present

They built event-interval features, computing mean/last grouped by event_id and event_code.

- Several event-interval features worked quite well
- They built video-skip features
  - Computed as clip event interval / clip duration
  - Clip durations were provided by the organizer
- Features over event_id / event_code combinations
  - event_code2030_misses_mean
Feature Selection

- Removed duplicated features
- Removed features so that the adversarial AUC approaches 0.5
- null importance (down to the TOP 500)

Model

- Test rows whose accuracy_group could be recovered were added to train
- RMSE for training, weighted RMSE for validation

Ensemble

- None
- A 0.8xLightGBM + 0.2xCatBoost ensemble scored private 0.570, but was not used for the final submission (its local CV was worse)
2nd

Stats

private 0.563 / public 0.563

Summary

- Ensemble of LightGBM / CatBoost / NN
- On top of basic features, used time-decayed features and word2vec.
- Used features that predict each row of the pre-aggregation logs.

Details

Validation

- Resampled so each user contributes one sample
- StratifiedGroupKFold, 5-fold

Feature Engineering

Basic features

- Counts of session, world, types, title, event_id, event_code, per world and overall
  - Counts decayed by halving per session
  - Counts decayed by elapsed days
- Many statistics over num_correct, num_incorrect, accuracy, accuracy_group
- Elapsed time since the previous Assessment

Word2Vec

- Treat the title history up to the Assessment as a sentence → vectorize titles with word2vec → aggregate

Meta Features

- Attach the Assessment outcome to each log row → predict it → aggregate (source)

Feature Selection

- Removed duplicated features
- Removed highly correlated features
- null importance (TOP 300)

Model / Ensemble

- Ensemble: 0.5 x LightGBM + 0.2 x CatBoost + 0.3 x NN
- Each model with 5-seed averaging
3rd

- 5-fold TRANSFORMER model (single model)
- private LB 0.562 / public LB 0.576
- They said they were busy with the Lunar New Year holidays in Korea and would write it up later ('20/01/28 addendum: they did, so I've added it.)

3rd solution - single TRANSFORMER model

Preamble (loose translation)

- "I like solving problems with DNNs, and I solve as many problems as possible with them"
- "Rather than understanding the data itself, I focus on the structure of the data and take care to feed the model with as little information loss as possible"
- "In other words, I focus less on feature engineering and more on discovering network designs that fit the data better"
Details

Points worth noting

- Positional information lowered CV, so models with position embeddings like BERT/ALBERT/GPT2 did not perform well.
- They therefore built a Transformer model without position embeddings

Pre-processing

- Per game_session, aggregated counts of event_code/event_id/accuracy/max_round etc.
- Treated each game_session as a word and fed the sequence to the model

Model

- Fed 100 sessions (short sequences padded with PAD)
- How the embeddings are built
  - Categorical variables: ['title', 'type', 'world']
    - Embed individually → concat → reduce dimension with nn.Linear
  - Continuous variables: ['event_count', 'game_time', 'max_game_time'] (+accuracy/max_round?)
    - Normalize with np.log1p → embed directly with nn.Linear
- params
  - optimizer: AdamW
  - scheduler: WarmupLinearSchedule
  - learning_rate: 1e-04
  - dropout: 0.2
  - number of layers: 2
  - embedding_size: 100
  - hidden_size: 500
Loss function
- accuracy_groupã®å®ç¾©ã以ä¸ã®ããã«åæ§æãã
new_accuracy_group = 3 * num_correct - num_incorrect (num_incorrect: contrained not to exceed 2) (new_accuracy_group >= 0 ã®å¶ç´ãå ¥ãã¦ããã¯ãâ¦ï¼)
- [num_correct, num_incorrect] ãã¿ã¼ã²ããã¨ãã¦ãmseãmodified_lossã¨ãã¦æ±ã£ãã
- accuracy_group ãtargetã«å ¥ãã¦ãæçµçã«ä»¥ä¸ã®ããã«ãã¦accuracy_groupã®äºæ¸¬ã¨ãã
new_accuracy_group = 3 * num_correct_pred - num_incorrect_pred final_accuracy_group = (accuracy_group_pred + new_accuracy_group) / 2
Data Augmentation
- Randomly removed sessions from users with many (30+) game sessions
  - Train: up to 50% randomly removed, oldest first
  - Test: 60% randomly removed from each
Data Augmentation (for pre-training)
- Generated pre-training data from Game-type sessions
  - Built num_correct / num_incorrect / accuracy_group(-like) labels from the correct events of Game sessions to augment the data
  - This added 41,194 samples to train
- Training procedure
  - Pre-train: original labels + the generated data above
  - Fine-tune: original labels only
4th
Stats
- private 0.561 / public 0.572 (NN blend)
- private 0.560 / public 0.566 (3-level stacking)
詳細
Validation
Tried various things, but none of them worked, so just did the standard thing
(nothing fancier than a grouped K-fold on installation_id)
Feature Engineering
- Test-set rows whose accuracy_group could be determined were used in train
  - They made the tree-based models worse, so they weren't used there
- Several Clip titles were very important
- Used TF-IDF over event sequences (a sketch follows this list)
  - Converted each event_id into title + event_code + correct_flag + incorrect_flag → treated each user's sequence as a document → TF-IDF
  - TF-IDF over only Assessments, their titles, and their outcomes didn't help
- Took care to craft NN-friendly features for the NN:
  - Title embeddings (7 dimensions)
  - Correct/incorrect counts per title and their ratio
  - Elapsed time (seconds) since the title was started
  - Correct/incorrect counts and ratios for earlier titles
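A minimal sketch of that event-sequence TF-IDF with scikit-learn; the dataframe layout and the exact token format are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# log: hypothetical frame with installation_id, title, event_code and 0/1 flags
log["token"] = (log["title"].str.replace(" ", "") + "_" + log["event_code"].astype(str)
                + "_c" + log["correct_flag"].astype(str)
                + "_i" + log["incorrect_flag"].astype(str))
docs = log.groupby("installation_id")["token"].agg(" ".join)  # one document per user

vec = TfidfVectorizer(token_pattern=r"\S+", lowercase=False, max_features=500)
tfidf = vec.fit_transform(docs)  # (n_users, n_tokens) sparse feature matrix
```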
Model / Ensemble
3-level stacking (slightly weaker than the NN blend):
- Level 1: RNN ×3, LightGBM, CatBoost
  - with nested validation (5×5) at this level
- Level 2: MLP (×100 seed averaging), LightGBM (×100 seed averaging)
- Level 3: Ridge
All models were trained as regression.
- code (NN)
7th
Stats
Private 0.559 / Public 0.559 / CV 0.575
Summary
- Feature engineering was the key; the final model used 51 features (narrowed down from 150).
- The ensemble was 0.3 LGB + 0.3 CatBoost + 0.4 NN.
  - 20-fold bagging was applied to every model; the NN additionally used 3-seed averaging.
- Test-set rows whose accuracy_group could be determined were used in train.
- For validation, one Assessment was randomly sampled per user to mirror the structure of the test set.
8th
Stats
Private 0.558 / Public 0.556
Summary
- A simple 3-layer MLP (256×256×256)
詳細
Validation
- 5 GroupKFold
- Watched an inversely weighted OOF QWK
  - The details are apparently written up in a discussion post
Feature Engineering
- Preprocess
  - Log transform → standardization
  - fillna with zeros
  - Added indicator features marking where values had been NaN
- One of the two final subs used test-set rows whose accuracy_group could be determined in train
  - The sub that used them: 0.559 private
  - The sub that didn't: 0.552 private
- Key features (omitting those already covered by higher-ranked solutions)
  - Title durations longer than 16 minutes were clipped, with a flag set
    - It's hard to believe kids were really on the same title for 16 minutes straight
  - The per-title average misses divided by round_duration
  - Repeat features
Feature Selection
- null importance
- Built 1,100 features and selected 216 of them
Model
- A simple 3-layer MLP (256×256×256) × 9 models (apparently differing only in seed; sketched below)
  - Batch Normalization / Dropout 0.3 at every layer
  - 3× LeakyReLU + 1 final linear layer
  - Epochs varied slightly: 63/65/68
  - Optimizer: Adam / batch size: 128
  - learning_rate: 0.0003 with cyclic decay
    - cyclic decay: the code for it has been shared
- In addition to accuracy_group, 3 × sqrt(accuracy) was used as a target
  - so the model could learn more than the discrete labels convey
  - though reportedly this didn't have much impact
- Ensemble
  - Blending of 9 models × 2 outputs = 18 predictions
- code
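A minimal PyTorch sketch of the MLP described above (width, normalization, dropout, and activations from the list; the two-output head matching accuracy_group and 3·sqrt(accuracy) is my reading):

```python
import torch.nn as nn

def make_mlp(n_features, n_outputs=2, width=256, p_drop=0.3):
    layers, d = [], n_features
    for _ in range(3):  # three 256-unit blocks
        layers += [nn.Linear(d, width), nn.BatchNorm1d(width),
                   nn.LeakyReLU(), nn.Dropout(p_drop)]
        d = width
    layers.append(nn.Linear(d, n_outputs))  # the single final linear layer
    return nn.Sequential(*layers)
```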
Other notes
- Thresholds: computed them over 25 different splits and picked the set with the best CV
- Optimizing the thresholds worked better than matching the train target distribution
9th
Summary
- Feature engineering almost entirely with aggregation features
- Built several diverse models and stacked them
- Threshold search via a massive random search
詳細
Model
- Stacking was extremely effective
- LightGBM ×7 + NN ×1 → Ridge
- LightGBM
- gbdt/goss/dart
- Used several different targets:
- accuracy_group
- accuracy
- accuracy_group > 2
- accuracy_group > 1
- accuracy_group > 0
Threshold tuning
- The public kernel's OptimizedRounder depends on its initial values and often fell into local optima.
- They therefore switched to a random search over truncated train data (sketched below).
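A minimal sketch of such a random search (a generic version, not the 9th-place code): sample threshold triples and keep the one with the best QWK.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def random_search_thresholds(y_true, y_pred, n_iter=10_000, seed=0):
    rng = np.random.default_rng(seed)
    best_qwk, best_t = -1.0, None
    for _ in range(n_iter):
        t = np.sort(rng.uniform(y_pred.min(), y_pred.max(), size=3))
        labels = np.digitize(y_pred, t)  # continuous predictions -> 0..3
        qwk = cohen_kappa_score(y_true, labels, weights="quadratic")
        if qwk > best_qwk:
            best_qwk, best_t = qwk, t
    return best_t, best_qwk
```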
10th
詳細
Validation
- StratifiedKFold 10fold
- One sample randomly drawn per installation_id (?)
- 51 validation sets used per fold
  - One for early stopping
  - The QWK averaged over the remaining 50 gave the validation score
Feature Engineering
- Built roughly 3,000-5,000 features and used 300 of them
- There probably was no magic feature
- Key features (omitting those already covered by higher-ranked solutions)
  - Normalized accuracy features (see the sketch after this list)
    - Difficulty differs per title, so normalized versions of the accuracy features were used
    - e.g. (Accuracy - Accuracy_mean_per_title) / Accuracy_std_per_title
  - Per-title features
    - e.g. target_distances length in Air Show
    - There were too many; he built them for 10 titles and gave up
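As the example above shows, the normalization is just a per-title z-score; a tiny pandas sketch (df is a hypothetical frame with title and accuracy columns):

```python
# z-score accuracy within each title, since difficulty differs per title
df["accuracy_norm"] = df.groupby("title")["accuracy"].transform(
    lambda s: (s - s.mean()) / (s.std() + 1e-9)
)
```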
Feature Selection
- Selected 300 features based on LightGBM feature importance
- Created 50 datasets per fold and switched between them every 5 iterations using LightGBM's init_model parameter (I didn't quite follow this part…)
Model
- LightGBM × 6-seed averaging
- feature_fraction was set to 1.0
  - Probably because mean accuracy differs per title, letting every tree see all the features worked best
Threshold
- Fixed the threshold values that maximized local CV and used them for private as well
Kaggle Data Science Bowl 2019 participation log: "The time I dreamed of $100,000"
What is this?
- A record of my run in the Data Science Bowl 2019 competition, held on Kaggle from 10/24 to 1/23.
- The task was to estimate, from the log data of an educational app for children, how accurately the kids would solve assessments.
- With a lavish $100,000 first prize, I got to dream a little when my solo ranking peaked at 5th place.
- However, owing to the nature of the evaluation metric and the small amount of data behind the public LB (the provisional ranking), the provisional ranking (public LB) and the final ranking (private LB) shuffled violently.
- After being jerked around by the metric, I ended up dropping hard, from public 17th to private 56th. It was a disappointing finish, but as a note to my future self I'm leaving a record of what I did.
Tweet from back when I was "dreaming"
"A sub I threw in as a throwaway has carried me all the way here…" pic.twitter.com/2zudPLf4ut
— ML_Bear (@MLBear2) January 11, 2020
What I did
Before joining the competition
- I had won my first gold medal in the IEEE competition last October, but troubles like the deadline being extended right before the finish had left me thoroughly drained.
- So I took a break from Kaggle for a while, but the offline competition at Kaggle Days Tokyo in December was so much fun that my appetite for tabular competitions came back. (See my Kaggle Days Tokyo offline competition write-up.)
- With free time over the New Year holidays, I skimmed the Kernels and Discussions and noticed that none of the following tricks had been mentioned, so I joined, figuring these alone might take me a long way:
  - Use test-set rows whose AccuracyGroup can be determined as training data
  - Use models trained on alternative targets:
    - accuracy itself
    - whether they solved it at all
    - how many times the 4100 (4110) event fires
Note: I did regard QWK as a volatile metric, but it goes without saying that I was also calculating that a lucky shake might even land me a solo gold. :)
A tweet from back when I was thinking "maybe, just maybe, a solo gold". Hard to believe this was only three weeks before the end.
"Not sure how much time I can put in, but I'm free over the holidays anyway, so I'll give DSB a try."
— ML_Bear (@MLBear2) January 2, 2020
"First code competition for me; managing code and features is pretty rough, and I might spend the whole thing just getting used to it lol"
Right after joining
- After a quick look at the data, I built basic features on top of a public Kernel.
- Base kernel
  - I naturally dropped things like features summed over the entire installation_id. (They may be gone in its latest version.)
  - I didn't really understand the adjust_factor part either, so I ignored it.
Building the CV
- Train seems to contain users with far more plays than anything in test, so I felt I had to align the two.
For that, I used Adversarial Validation to identify the ~30% of train logs that diverged most from test (a sketch follows this section) and used them in the processing below.
- For QWK threshold optimization
  - From the train data that did not diverge from test, I sampled ~500 evaluation datasets matching the test distribution, and optimized the QWK thresholds over them.
    - I picked 500 arbitrarily, but my impression is that many people were sampling in roughly this way.
    - I didn't know how far the train and test distributions diverged, and the threshold optimizer's behavior felt unstable, so I wanted to average things out.
- For model evaluation
  - Using the same group of datasets, I averaged RMSE to check model quality.
- Exclusion from early stopping
  - I removed those logs from the validation sets used in training so that early stopping would not look at them.
  - I also tried deleting them from the train set, but that was hopeless, so I kept them for training.
I settled on the policy above by experimenting locally on a roughly cut hold-out of train.
- It's debatable whether this reproduced test well, but I trusted it was better than doing nothing.
- I had seen a teammate do this in the IEEE competition, so I used that as my reference.
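A minimal sketch of the Adversarial Validation step referenced above, under assumed train_feats / test_feats feature frames (the flagged rows then get excluded from threshold tuning and validation as described):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict

X = pd.concat([train_feats, test_feats], ignore_index=True)
y = np.r_[np.zeros(len(train_feats)), np.ones(len(test_feats))]  # 1 = test

clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
# out-of-fold P(row looks like test) for every row
p = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

p_train = p[: len(train_feats)]
# the ~30% of train that looks least like test gets flagged as divergent
divergent = p_train < np.quantile(p_train, 0.30)
```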
Data Augmentation
- I used test-set rows whose AccuracyGroup could be determined as training data.
- My local numbers improved across the board, but for some reason the public LB dropped.
- So at the very last moment I removed it. That was a roughly -0.004 mistake; I should have kept it in one of my two final subs.
Private Dataset Probing
- I probed a little because I wanted to know how many users had never taken a single Assessment.
- It differed quite a bit from public, so I tried not to lean on the public LB much. (Though in the end I did anyway.)
Feature Engineering
Doing the following, on top of what I took from the base kernel, I ended up with about 1,150 features. I effectively spent only three days on this, though, and wish I had done more here.
- I kept adding things to the base Kernel that seemed like obvious wins:
  - Past results and event counts for the same title
  - The same for the same world
  - More detailed evaluation of the games played along the way:
    - correct rate per task
    - aggregations of 4020/4025-type events
    - aggregations of correct-type events
    - clip length
- I also added several ideas of my own:
  - Target encoding (a sketch follows this list)
    - title (→ effectively the title's difficulty)
    - title × attempt number
  - Feeding values predicted by a separate model back in as features
    - What was predicted:
      - accuracy (mean of correct) itself
      - how many times the 4100 (4110) event fires
      - whether they solved it at all
    - I used these as features, but they might also have worked as one more model in the final stacking
    - keeeeei79-san apparently used this as a stacking model (without any particular normalization)
  - word2vec: treat each event_id (+correct) as a word → concatenate them per user into a document → vectorize the event_ids → SWEM
    - This worked comparatively well
- Things I discarded:
  - PageRank
    - Graph the transitions between titles/event_ids → propagate the Assessments' AccuracyGroup through the graph to score the importance of titles etc. → aggregate per user
    - These ranked high in feature importance but moved the (public LB) score almost not at all, so I dropped them to keep the code from getting cluttered.
  - LDA
    - LDA over title/event_id transitions → aggregate per user
    - These had low importance to boot
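For reference, a minimal out-of-fold target-encoding sketch for the title feature mentioned above (a generic version with hypothetical names, not my competition code):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(df, col, target, n_splits=5, seed=0):
    enc = pd.Series(np.nan, index=df.index)
    for tr_idx, va_idx in KFold(n_splits, shuffle=True, random_state=seed).split(df):
        means = df.iloc[tr_idx].groupby(col)[target].mean()  # fold-wise means
        enc.iloc[va_idx] = df.iloc[va_idx][col].map(means).to_numpy()
    return enc.fillna(df[target].mean())

train["title_te"] = target_encode(train, "title", "accuracy_group")
```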
Feature Selection
- Cut about 600 features with Null Importances, narrowing down to roughly 550 in the end (a sketch follows).
  - I used the formula from p.18 of senkin-san's slides at Kaggle Days Tokyo.
  - Using features down to a gain_score just barely below zero turned out to be about right.
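A minimal null-importance sketch: compare each feature's real gain with gains obtained after shuffling the target, and keep features whose real gain stands out. The score formula below is a common variant, not necessarily the exact one from senkin-san's slide:

```python
import lightgbm as lgb
import numpy as np

def null_importance_scores(X, y, n_runs=20, seed=0):
    rng = np.random.default_rng(seed)

    def gain(target):
        m = lgb.LGBMRegressor(n_estimators=100, importance_type="gain")
        m.fit(X, target)
        return m.feature_importances_

    actual = gain(y)
    null = np.array([gain(rng.permutation(y)) for _ in range(n_runs)])
    # gain_score: actual importance relative to the null runs' 75th percentile
    return np.log1p(actual) - np.log1p(np.percentile(null, 75, axis=0))
```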
QWK threshold optimization
- I used the same optimizer as the public kernel.
- I tried optimizing the thresholds per title several times, but it never worked out.
Models
- I built the following models, all on the same features (normalized for everything except LightGBM):
  - LightGBM ×3 (many / normal / few leaves)
    - Parameters tuned with Optuna
  - NN
    - Same as the kernel I referenced for feature creation
    - As an aside, its BaseModel class is built quite cleanly, so I used it as a reference for my implementation
  - Random Forest
    - depth of about 6; I didn't tune the parameters seriously
    - I was surprised it came out only slightly behind LightGBM
    - I recall Jack-san dropped LightGBM in favor of XGBoost, so maybe LightGBM really was iffy on this data?
  - Ridge
- I was surprised that the non-LightGBM models were unexpectedly strong.
- Every model used seed averaging before being ensembled.
Ensemble
- I wavered between stacking and a weighted average. Stacking was stronger in my local experiments, but for the two reasons below I went with the weighted average in the end. (A roughly -0.002 mistake.)
  - The public LB numbers favored the weighted average.
  - I was afraid of blowing up if train and private diverged.
- The weights of the weighted average were searched with Optuna (a sketch follows this list).
- The stacking was meant to be two levels:
  - LightGBM ×3 + NN + Ridge + RF → Ridge
  - Seeing that the 4th-place solution stacked three levels surprised me (I want to try that next time).
- I also tried the rank transformation from the PetFinder solution, but it (probably) made little difference.
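A minimal Optuna sketch of that weight search, assuming a hypothetical oof_preds array of shape (n_samples, n_models) and the true labels y_true:

```python
import numpy as np
import optuna
from sklearn.metrics import mean_squared_error

def objective(trial):
    w = np.array([trial.suggest_float(f"w{i}", 0.0, 1.0)
                  for i in range(oof_preds.shape[1])])
    w /= w.sum() + 1e-12  # normalize into a convex blend
    return mean_squared_error(y_true, oof_preds @ w)  # QWK thresholds come later

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=300)
best_w = np.array([study.best_params[f"w{i}"] for i in range(oof_preds.shape[1])])
```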
Final result
- public: 17th (0.570) → private: 56th (0.551) (3,500 teams)
Thoughts on the competition
- Tired from the shake-down
  - At one point I fluked my way up to 5th place.
  - It was a throwaway sub, and I knew myself it was a fluke, but it still gave me dreams and hope for a while.
  - In the end I was blown right off the first page of the leaderboard. Exhausting.
- Tired from trusting the public LB over myself
  - Right after the start I swore I would ignore the public LB, but near the end, stuck on decisions, I wanted something to lean on and looked anyway. Bending my convictions a few times as a result cost me about -0.006. (Hindsight, of course.)
  - It might be better to write my policy down explicitly and pin it up while there is still breathing room mid-competition.
  - Letting it throw me off my feature engineering hurt, and above all it made me genuinely sad that I couldn't trust the self who had worked so hard.
- Tired from wrestling with QWK
  - I tried to hack/stabilize QWK in all sorts of ways; most of it ended in wasted effort.
  - Still, I'm glad I found, on my own, techniques that everyone turned out to be using, like sampling the OOF predictions and averaging the optimized thresholds.
  - Jack-san optimized QWK directly, and that idea floored me.
- Got used to code competitions, but that was tiring too
  - In past competitions I was the guy who just threw SQL at BigQuery, so leveling up my pandas was a nice side effect.
  - Late in the competition I learned to run things FastSubmission-style, which made my PDCA loop dramatically faster. GCP high-memory instances really are the best.
  - I wrote proper tests, which saved me by catching bugs in a few of my subs.
The fate of someone who couldn't trust himself
"Looking back through my sub history, the two points below are where the public LB swayed me into bending my convictions away from my local best. Without them I'd have been around 0.556+. Gold was out of reach either way, but my weakness of heart showed, and that really stings."
— ML_Bear (@MLBear2) January 23, 2020
- Dropped the data augmentation (-0.004)
- Chose a weighted average over stacking for the ensemble (-0.002)
Wrap-up
All three of my takeaways ended up being "tired", ha. Between the various troubles, this three-week challenge wore me out, but on the whole it was pretty fun.
A Walmart competition apparently starts around March, so I'd be happy to study and prepare until then and spend some more exhausting days.
Reading that back calmly, it's a real "what am I even saying (ry" of a takeaway. :)