Hello. I'm Kawai (@vaaaaanquish) from the M3 Engineering Group, still over the moon because my previous article, "突撃！隣のキーボード M3 2019", was tweeted by the official HHKB Twitter account.
In this post I'd like to explain "gokart", a Python library for machine learning projects that the M3 AI team develops and operates, and to introduce its companion libraries "cookiecutter-gokart", "thunderbolt", and "redshells".
- Introduction
- Advantages and disadvantages of pipelines
- gokart
- cookiecutter-gokart
- thunderbolt
- redshells
- My development and operations workflow
- Conclusion

Introduction
In recent years it has become common for machine learning projects to use a Pipeline library and to treat data collection, processing, model training, and inference as a single workflow*1*2. This is a topic that many machine learning engineers and data engineers across companies and teams are paying attention to, and in Japan as well there are study groups on data pipelines where operational practices are shared and discussed.
Among Pipeline libraries, scikit-learn Pipeline and luigi are the representative examples, and we now also hear about production use of tools such as Digdag and Airflow, which are designed with cloud and distributed environments in mind. More recently, cloud services such as Google Cloud AutoML and Amazon SageMaker have appeared that automate part of a machine learning task and turn it into a pipeline*3.
Against this background, the M3 AI team develops and operates gokart, a wrapper library around luigi, as OSS.

Advantages and disadvantages of pipelines
I believe that using a Pipeline library comes with several advantages and disadvantages. As mentioned above, Pipeline libraries are becoming the de facto standard in machine learning projects, but before adopting one you need to consider carefully whether your team and workplace have the right environment and whether you can actually make use of the characteristics of each library. Below I give my own view of the advantages and disadvantages of Pipeline libraries such as luigi and Airflow.

Advantages of pipelines
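The benefits of using a small Pipeline library such as luigi in a machine learning project have been discussed in many places, but personally I think they come down to the following three points.
- Reproducibility of data and models
- Sharing of common tasks
- Ease of moving from development to production operation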
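Compared with ordinary software development, most machine learning projects constantly face the problem that bugs are very hard to reproduce. This comes from the nature of machine learning models: even with the same parameters, a change in the source data or in the environment can prevent you from getting an equivalent result. In addition, the "preprocessing" stage that generates the features fed into a model contains many operations that are hard to manage, for example ones where merely changing the order of the steps changes the result. Many Pipeline libraries for machine learning projects place great weight on reproducibility, and address this problem by supporting, in various ways, the storage of data, logs, and models as well as the execution order.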
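To make the point about step order concrete, here is a minimal sketch using only the standard library. The two helpers `fill_missing` and `clip` and the sample values are hypothetical and chosen purely for illustration; they are not part of gokart or luigi.

```python
from statistics import mean


def fill_missing(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def clip(values, upper):
    """Cap every observed value at `upper`; None is passed through unchanged."""
    return [v if v is None else min(v, upper) for v in values]


# Hypothetical raw feature column with one missing value and one outlier.
raw = [1.0, 2.0, None, 100.0]

# Same two steps, different order.
fill_then_clip = clip(fill_missing(raw), upper=10.0)
clip_then_fill = fill_missing(clip(raw, upper=10.0))

print(fill_then_clip)
print(clip_then_fill)
```

In the first order the missing value is filled with a mean that still includes the outlier and is then capped; in the second it is filled with the mean of already-capped values. The two feature columns differ, which is why recording the execution order matters.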
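Also, in the development phase of a machine learning model, steps such as fetching, processing, tuning, and testing data tend to be repeated many times. It is clear that turning each task into a class, and sharing and templating the internal systems and scripts used in operation, raises overall development speed. In fact, in the context of MLOps, the style that is becoming mainstream is to organize the data platform and, on top of that, share tasks through a Pipeline library so that everything from data use to model development, release, operation, and business application becomes seamless.

Disadvantages of pipelines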
On the other hand, I see the following two disadvantages in using a Pipeline library.
- Resources for data retention, computation, security, and so on pile up
- Choosing infrastructure and tools, and designing tasks, is difficult
While a Pipeline library secures the reproducibility of data, it needs resources to do so, and awareness of those costs tends to fade. For example, keeping models and dictionary files as they were at each task execution requires a lot of storage. And in a pipeline that mixes tasks with small and large computational cost, compute resources will also pile up unless you use machine resources that autoscale. There are many real cases, such as the story of receiving a bill of 260,000 yen from Google AutoML or of trying Apache Airflow but never getting it into operation, where having things in a pipeline dulled cost awareness while the actual cost went up. Likewise, even for a process that merely "builds a model from a single DB", the number of places that hold data often grows: storage for past models, cloud services for logs, and so on. When you use a Pipeline library you also have to allocate people to keep all of these secure.
In addition, because everything is shared as a pipeline, there will be cases where you cannot use the development or operation style you would consider ideal. It is not unusual for a strictly enforced pipeline to spend compute resources on a task that could have been operated at low cost. The execution times of machine learning tasks tend to have a large variance. Moreover, the machine learning model developer ends up having to think about which Pipeline tool to use and how the pipeline should be built. When the model developer has many roles and can see data management, analysis, infrastructure development, and post-release operation, this is manageable, but in an organization where work is divided between data engineers and machine learning engineers it becomes harder to keep in mind.

gokart
As described above, when introducing a Pipeline library into a machine learning project you have to consider whether the state of development, the team, and the timing of the business allow you to absorb the disadvantages while keeping the advantages. Besides the advantages and disadvantages of Pipeline libraries themselves, there are also those that come from plain software engineering, the development organization, and the environment. Having weighed these, M3 came to develop and operate a library called gokart.
gokart is a Python library developed as a wrapper around "luigi", which Spotify develops as OSS.
As written in earlier posts on this blog (www.m3tech.blog), M3 has been developing workflows with luigi.

Based on the advantages and disadvantages that became apparent through that work, gokart adds the following features on top of luigi.
- Restriction and extension of output files, for sharing tasks
- Powerful yet simple data retention, for reproducibility
- Support for cloud services and Slack notifications

Restriction and extension of output file formats for sharing
Even when a Pipeline library is in use, if each machine learning engineer manages data in a different format (for example, A uses feather while B mainly uses pickle, so the dump formats differ, or the library versions differ), the barrier to sharing tasks becomes needlessly high. gokart restricts the file formats that can be used for a task's output and load. These are defined by the FileProcessor classes linked below, and by default pickle, npz, gz, txt, csv, tsv, json, and xml are supported*4.
gokart/file_processor.py at master · m3dev/gokart · GitHub
Of course, to make the restriction worthwhile, gokart treats pickle, which is widely used in machine learning modeling, as the de facto standard, and adds extensions such as automatically splitting a file when it becomes large. Thanks to this combination of restriction and extension, sharing tasks and exchanging code between machine learning engineers and data engineers goes smoothly.
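The idea of restricting output formats can be sketched as a small registry that maps a file extension to a dump/load pair and rejects everything else. This is only an illustration under my own assumptions, not gokart's actual FileProcessor implementation: the `PROCESSORS` table and the `dump` and `load` helpers are hypothetical names, and only two formats are registered.

```python
import json
import os
import pickle
import tempfile

# Hypothetical registry: extension -> (dump function, load function, binary mode?).
PROCESSORS = {
    ".pkl": (pickle.dump, pickle.load, True),
    ".json": (json.dump, json.load, False),
}


def _processor_for(path):
    """Look up the processor for a path, rejecting unregistered extensions."""
    ext = os.path.splitext(path)[1]
    if ext not in PROCESSORS:
        raise ValueError(f"unsupported output format: {ext!r}")
    return PROCESSORS[ext]


def dump(obj, path):
    """Write obj using the processor selected by the file extension."""
    dump_fn, _, binary = _processor_for(path)
    with open(path, "wb" if binary else "w") as f:
        dump_fn(obj, f)


def load(path):
    """Read an object back using the processor selected by the file extension."""
    _, load_fn, binary = _processor_for(path)
    with open(path, "rb" if binary else "r") as f:
        return load_fn(f)


# Round trip through a temporary directory.
workdir = tempfile.mkdtemp()
pkl_path = os.path.join(workdir, "sample.pkl")
dump({"a": 1}, pkl_path)
restored = load(pkl_path)
print(restored)
```

Because every task goes through the same two entry points, a format that is not registered (feather, in this sketch) fails immediately instead of silently introducing a second convention.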

Powerful yet simple data retention for reproducibility
For any format defined by a FileProcessor, you can set up the output easily with the make_target method. As shown below, make_target determines the format from the file extension and saves the output to the specified path.
    import gokart
    from luigi import IntParameter
    import pandas as pd


    class SampleTask1(gokart.TaskOnKart):
        task_namespace = 'sample'
        num_param = IntParameter()

        def output(self):
            return self.make_target('output/sample.pkl')

        def run(self):
            df = pd.DataFrame([1]*self.num_param)
            self.dump(df)
This resembles FileSystemTarget in luigi, but gokart computes a string hash from the parameters of the upstream tasks run via requires and from the task's own parameters, and automatically appends it to the output file name. When the task above is run, files like the following are written, named according to the hash generated from the parameters.
    $ tree .
    ./
    ├── log
    │   ├── module_versions
    │   │   └── SampleTask1_6d384b6bdcd078fb13b025966f537692.txt
    │   ├── processing_time
    │   │   └── SampleTask1_6d384b6bdcd078fb13b025966f537692.pkl
    │   ├── task_log
    │   │   └── SampleTask1_6d384b6bdcd078fb13b025966f537692.pkl
    │   └── task_params
    │       └── SampleTask1_6d384b6bdcd078fb13b025966f537692.pkl
    └── output
        └── sample_6d384b6bdcd078fb13b025966f537692.pkl
Each log and the output are saved with a hash that corresponds to the parameters. Because the parameters are turned into a hash, you can avoid the error-prone practice of managing files by parameter values or dates in their names. By default the following files are dumped.
- output/sample_.pkl: the file dumped by SampleTask1
- module_versions: the versions of all modules used when the task was run
- processing_time: the time the task took to run
- task_log: the log the task wrote through the logger
- task_params: the parameters used to run the task
The design aims to ensure that a task can always be reproduced by referring to these logs and the dumped output. For tasks that download files from a DB, for preprocessing, and for model training, the fact that each parameter set maps to a unique hash not only makes data reproduction reliable, it also makes it easy to repeat experiments while changing parameters, which is the most important activity in machine learning modeling.
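The following sketch shows how a parameter-derived suffix of this kind could be produced. It is written under stated assumptions rather than taken from gokart: the use of MD5 is a guess that merely matches the 32-hex-character suffix in the listing above, and `task_hash` and `hashed_path` are hypothetical helper names.

```python
import hashlib
import json
import os


def task_hash(task_name, params, upstream_hashes=()):
    """Hash the task name, its parameters, and the hashes of required tasks.

    Assumption: MD5 over a canonical JSON string; gokart's real scheme may differ.
    """
    payload = json.dumps(
        {"task": task_name, "params": params, "requires": sorted(upstream_hashes)},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def hashed_path(path, unique_id):
    """Insert the id before the extension: 'output/sample.pkl' -> 'output/sample_<id>.pkl'."""
    root, ext = os.path.splitext(path)
    return f"{root}_{unique_id}{ext}"


# Two runs of the same task with different parameters get different file names.
h10 = task_hash("sample.SampleTask1", {"num_param": 10})
h20 = task_hash("sample.SampleTask1", {"num_param": 20})

# A downstream task's hash also depends on which upstream run it consumed.
downstream = task_hash("sample.SampleTask2", {"sample_param": 2}, [h10])

print(hashed_path("output/sample.pkl", h10))
print(hashed_path("output/sample.pkl", h20))
```

Including the upstream hashes means that changing a parameter early in the pipeline changes every file name downstream of it, so stale intermediate results cannot be picked up by accident.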

When you use pickle, you can also omit the output method.
    import gokart
    from luigi import IntParameter
    from luigi.util import requires
    from sample_task1 import SampleTask1


    class Sample(gokart.TaskOnKart):
        task_namespace = 'sample'


    @requires(SampleTask1)
    class SampleTask2(Sample):
        sample_param = IntParameter()

        def run(self):
            df = self.load()
            df = df.sample(self.sample_param)
            self.dump(df)
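If you prepare a class that defines task_namespace in advance, as in the example above, a script that "runs a task and dumps the result" reduces to three things: inherit from the class that carries task_namespace, name SampleTask1 as a requirement with the requires decorator, and provide a run method that loads that file and processes it. The result is the quite simple code shown above. Being able to define a complete task with this little code, while gokart saves the various pieces of information needed to guarantee reproducibility, is gokart's strength, and it lets you experience the benefits of a Pipeline library even more easily than with luigi.

To run the tasks above, it is a good idea to prepare a separate script, say main.py, that calls gokart.run, together with a config file.

An example config file is shown below.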
    [TaskOnKart]
    workspace_directory=./resources
    local_temporary_directory=./resources/tmp

    [sample.SampleTask1]
    num_param=10

    [sample.SampleTask2]
    sample_param=2
workspace_directory is the directory that make_target uses as its output location. Here you can specify AWS S3 or Google Cloud Storage in the form "s3://hoge" or "gs://piyo", and you can also read the value from an environment variable with the notation "${WORK_SPACE}" or "%(WORK_SPACE)s"*5. The cloud services currently available are GCP and AWS. As noted under the disadvantages, managing these cloud services requires care. We have chosen services for which M3 has engineers experienced in security management, but a KVS-style store such as S3 is inexpensive and, generally speaking, simple to operate.
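The "%(WORK_SPACE)s" style corresponds to the basic interpolation of Python's standard `configparser`, which substitutes values supplied as parser defaults. The sketch below shows that mechanism in isolation; it is an illustration, not gokart's loading code. The `load_config` helper is a hypothetical name, and the environment is passed in as a plain dict so the example does not depend on the real process environment.

```python
import configparser

CONFIG_TEXT = """
[TaskOnKart]
workspace_directory=%(WORK_SPACE)s/resources
local_temporary_directory=%(WORK_SPACE)s/resources/tmp
"""


def load_config(text, env):
    """Parse `text`, exposing only the WORK_SPACE variable to interpolation."""
    # Passing the whole environment would also work, but a whitelist keeps
    # unrelated variables out of every section of the parsed config.
    defaults = {"WORK_SPACE": env["WORK_SPACE"]}
    parser = configparser.ConfigParser(defaults=defaults)
    parser.read_string(text)
    return parser


# In real use `env` would be os.environ; a literal dict keeps the sketch self-contained.
config = load_config(CONFIG_TEXT, {"WORK_SPACE": "/data/work"})
workspace = config.get("TaskOnKart", "workspace_directory")
print(workspace)
```

Keeping the workspace root in an environment variable lets the same config file point at a local directory during development and at a bucket in production.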

Below is an example of the main.py that is actually executed.
    import luigi
    import gokart
    import sample_task1
    import sample_task2

    if __name__ == '__main__':
        luigi.configuration.LuigiConfigParser.add_config_path('param.ini')
        gokart.run()
It uses luigi's add_config_path method to read the config file shown earlier.

Run it with a command like the following. You can also change parameters with command-line arguments instead of the config file.
    python main.py sample.SampleTask2 --local-scheduler --sample-param=3
This is the usual way to run gokart. It is built so that you can repeat experiments under various conditions and still reproduce every one of them easily.

Support for cloud services and Slack notifications
As mentioned above, you can choose AWS S3 or Google Cloud Storage as the place to save task results. Because the hash suffix can be turned off, it is also easy to build a batch that, for example, "places a model trained with certain parameters in storage".
gokart also has a feature that sends notifications to Slack in response to luigi.Event events such as the start, completion, and abnormal termination of each task. All you need to do is specify the token, the channel, and the user to mention in SlackConfig, as follows.
    [SlackConfig]
    # read from an environment variable
    token=${SLACK_TOKEN}
    channel=sample_notice
    to_user=kawai
If there is no SlackConfig section, no notification is sent. It is a simple implementation, but it is useful as a way to detect problems in gokart tasks running in production.
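The "no configuration, no notification" behavior can be sketched as a callback factory. Everything here is hypothetical and for illustration only: `SlackSettings`, `make_notifier`, and the injected `send` function are my own names, the message format is invented, and no real Slack API call is made.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SlackSettings:
    """Hypothetical mirror of the three values in the [SlackConfig] section."""
    token: str
    channel: str
    to_user: Optional[str] = None


def make_notifier(
    settings: Optional[SlackSettings],
    send: Callable[[str, str], None],
) -> Callable[[str, str], None]:
    """Return an event callback that is a no-op when Slack is not configured."""

    def notify(event: str, task_name: str) -> None:
        if settings is None:
            return  # no [SlackConfig] section: stay silent
        mention = f"@{settings.to_user} " if settings.to_user else ""
        send(settings.channel, f"{mention}{task_name}: {event}")

    return notify


# Stand-in for a real Slack client: record messages instead of sending them.
sent: List[Tuple[str, str]] = []
notify = make_notifier(
    SlackSettings(token="dummy-token", channel="sample_notice", to_user="kawai"),
    send=lambda channel, text: sent.append((channel, text)),
)
notify("FAILURE", "sample.SampleTask2")
print(sent)
```

Injecting the `send` function keeps the event handling testable without network access; in a real setup it would wrap whatever Slack client the project uses.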
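
Advantages and disadvantages of gokart
Given the features above, gokart can be described as a library that pushes two of the advantages of Pipeline libraries to the fore: "reproducibility of data and models" and "ease of moving from development to production operation". To raise the development speed of machine learning projects, gokart cleanly solves the problems that are specific to machine learning and come up again and again.
On the other hand, the disadvantage that "resources for data retention, computation, security, and so on pile up" becomes even more pronounced. This is only natural, since gokart saves the results, parameters, module versions, and logs of every task for the sake of reproducibility, and is designed so that tasks can be written as small and as simply as possible.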

At M3 today, given the makeup of the AI team and the nature of the business, I think the advantages outweigh these disadvantages even as they grow. The main reasons are that one person owns a project from start to finish, so sharing code and managing module versions matter for delegation and handover, and that our data volume is small enough that saving every task result causes no problems for compute resources or storage*6.
If a new service were to make our data grow rapidly, to the point where something like hadoop had to be used on the back end, I think the long-term benefit of using gokart would become small.

It is important to use a Pipeline library that matches your data volume, team structure, and business, and among the options I think gokart suits small to medium-sized teams. Used when an ML team is being set up, in a project of 20 people or fewer, or within a team in a competition such as Kaggle, it should speed up the start of machine learning modeling.

cookiecutter-gokart
To speed up getting started with gokart even further, we publish a cookiecutter template as OSS.
With the cookiecutter command you can start a gokart project interactively, as shown below. As with any cookiecutter template, press Enter on an empty line when the default is fine.
    cookiecutter https://github.com/m3dev/cookiecutter-gokart
    project_name [project_name]: m3sample  # name of the project root directory
    package_name [package_name]: sample  # package name when used as a Python module
    python_version [3.6]:  # Python version to use
    author [your name]: m3dev  # name of the author
    package_description [What is this project?]: this is sample  # description of the project
    license [MIT License]:  # license to use
This command creates a gokart project directory that contains the following.
- A script for a sample task
- A sample config
- A main.py for running the sample task
- A unittest script for the sample task
- A setup.py for using the project as a module
- GitHub Actions CI/CD for checking the unit tests
- LICENSE and README.md
With cookiecutter-gokart you can start a gokart project on GitHub right away. Inside M3 we make heavy use of a version with company-specific settings added, which shows how valuable cookiecutter is for gokart. The template also includes a sample of test code, which is easy to forget in a pipeline, so please use it as a reference.

thunderbolt
In machine learning projects you often run experiments several times with different parameters for each task, or need to fetch the model produced by a batch that runs routinely. For gokart development we publish an OSS tool called thunderbolt, which makes it easy to bring those files to hand, inspect them, and load them from Python.
thunderbolt reads the "task_log" and "task_params" written by gokart and lets you browse that information as a pandas.DataFrame. From that information it can also directly load the files that the tasks dumped.
The quickest way to understand thunderbolt is probably to look at the examples in the Jupyter notebook.
thunderbolt/example.ipynb at master · m3dev/thunderbolt · GitHub
As in the example, you can display and load data produced by gokart from Python.
    import os

    from thunderbolt import Thunderbolt

    tb = Thunderbolt(os.environ['TASK_WORKSPACE_DIRECTORY'])

    # show the parameters of each task
    print(tb.get_task_df())

    # load data by specifying the task_id assigned by thunderbolt
    data = tb.load(task_id=1)
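By specifying the root directory of the output, this lets you manage everything from Python, from experimental tasks you have tried several times to tasks run on a server or by someone else. It supports filtering by task name and gokart's automatic file splitting, and is meant to enable flows such as running many experiments with different parameters in gokart and then visualizing the results in a Jupyter notebook, or writing a Python script that deletes experiment results you no longer need.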
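The core idea, reading the per-task parameter files under the workspace and turning them into a table that can be filtered, can be sketched with the standard library alone. This is not thunderbolt's implementation: `build_task_index` is a hypothetical helper, the assumption that each file under log/task_params is a pickled dict is mine, and the short fake hashes are placeholders.

```python
import os
import pickle
import tempfile


def build_task_index(workspace):
    """Collect one row per file in <workspace>/log/task_params.

    Assumption: files are named '<TaskName>_<hash>.pkl' and hold a pickled dict.
    """
    params_dir = os.path.join(workspace, "log", "task_params")
    rows = []
    for task_id, filename in enumerate(sorted(os.listdir(params_dir))):
        stem, _ = os.path.splitext(filename)
        task_name, task_hash = stem.rsplit("_", 1)
        with open(os.path.join(params_dir, filename), "rb") as f:
            params = pickle.load(f)
        rows.append({"task_id": task_id, "task_name": task_name,
                     "task_hash": task_hash, "params": params})
    return rows


# Build a tiny fake workspace so the sketch runs on its own.
workspace = tempfile.mkdtemp()
os.makedirs(os.path.join(workspace, "log", "task_params"))
for fake_hash, num_param in [("aaa111", 10), ("bbb222", 20)]:
    path = os.path.join(workspace, "log", "task_params", f"SampleTask1_{fake_hash}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"num_param": num_param}, f)

index = build_task_index(workspace)

# Filter experiments by parameter value instead of decoding hashes by hand.
large_runs = [row for row in index if row["params"]["num_param"] > 10]
print(large_runs)
```

Once the parameters and hashes sit side by side in one table, finding the output file of a particular experiment is a lookup rather than a search through hashed file names.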

redshells
Many of the gokart tasks for building machine learning models at M3 are published as OSS under the name redshells.
redshells is code for machine learning models that actually run in production at M3. From basic techniques such as TF-IDF, text embedding, xgboost, and Matrix Factorization to relatively new ones such as Graph Convolutional Neural Networks, many models used inside M3 have been turned into gokart tasks and published.
It also partly supports Optuna, published by Preferred Networks, and is basically a library where you can obtain a model by assembling a pipeline without thinking about complicated matters such as parameters.
For redshells, I suggest getting a little familiar with how gokart works and then looking at the examples.
redshells/examples at master · m3dev/redshells · GitHub

My development and operations workflow
Using the tools introduced so far, I follow a development cycle of gokart, redshells -> thunderbolt -> cookiecutter-gokart so that I can move seamlessly from data analysis to model building, experiments such as parameter tuning, and production code.
Below I walk through a sample Jupyter Notebook modeled on my development flow.
luigi is mainly run from the CLI, but to run gokart (luigi) inside an ipynb I set up a script like the following so that I can develop with gokart in the notebook.
    # prevent luigi from terminating the process via sys.exit
    import sys

    import gokart


    def ipy_exit(*args):
        # `exit` here is the one IPython provides inside a notebook
        exit(keep_kernel=True)


    sys.exit = ipy_exit

    # ~~~ task definitions go here ~~~

    # gokart tasks take a process lock, so --no-lock is required
    gokart.run(['sample.LoadIrisData', '--local-scheduler', '--no-lock'])
The good thing about developing in an ipynb is how easy visualization and data wrangling are, but the code tends to get messy as a result. The code quality of data scientists and machine learning engineers is a frequent topic of discussion, and this is a familiar story. I use the ipynb's features for visualization, while splitting preprocessing and data wrangling into small tasks and running them with gokart.run.

For task results I use the hash-suffixed files generated according to the parameters, and I use thunderbolt to match parameters with hashes. I build up gokart tasks while loading data into memory with thunderbolt and checking the results. The standard flow, EDA with sns.pairplot and pandas_profiling followed by training XGBoost with Optuna optimization through redshells, is completed entirely in the ipynb.

Once the code has been written with gokart up to this point, porting it to the template created by cookiecutter-gokart is easy. Also, because an ipynb makes it easy to leave comments and notes, recording the intent of the code there lets me write appropriate test code for the model, and moving to production has become much smoother.

Conclusion
In this article I covered the advantages and disadvantages of Pipeline libraries for machine learning, explained gokart, the Pipeline library that M3 develops, operates, and publishes, and introduced its companion libraries.
Much of gokart's origin is also described in the slides from a past tech-event talk by Nishiba (@m_nishiba), the AI team lead, so those are worth reading as well.
As those slides mention, the AI team itself is only a few years old, and gokart and its companion libraries still have plenty of rough edges, but I expect them to grow together with the M3 AI team. Your Pull Requests and issues are always welcome.
At M3, publishing OSS like this, and carrying out and publishing technical investigations and discussions, are recognized as "technical improvement" in evaluations. In addition to Stars on gokart and the other libraries and posts about your experience using them, we sincerely look forward to your applications to join M3, Inc. Thank you.
ã
*1:https://towardsdatascience.com/build-a-pipeline-for-harvesting-medium-top-author-data-c4d7ed73729f
*2:https://databricks.com/session/netflixs-recommendation-ml-pipeline-using-apache-spark
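*3: This depends on how you define a pipeline; here I use the term in a broad sense.
*4: Formats such as feather and Apache Arrow are also widely used, but they change often between versions, so they are not supported from the standpoint of reproducibility.
*5: The latter is gokart's own notation.
*6: There are about 350,000 physicians in Japan, so the total data volume is naturally smaller than in e-commerce and similar fields.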