This article is the day 12 entry for both the M3 Advent Calendar 2023 and the MLOps Advent Calendar 2023.
I'm Kitagawa from the AI & Machine Learning team. Lately my cat has been demanding a lot of attention and often gets in the way of my work.
The AI & Machine Learning team currently runs its ML batches on Google Kubernetes Engine (GKE). A recent count showed more than 240 batches running on GKE.
The team has been running ML batches on GKE for about four years, since around 2019. Over that time we have put effort into optimizing cost and keeping operations stable. This article mainly covers scale-in and cost optimization.
If you want an overall picture of the team's ML work, the following article covers it in detail.
- Reviewing GKE terminology
- About Kubernetes Eviction
- Problems caused by avoiding Eviction
- Is Autopilot a silver bullet?
- Summary
- We are hiring!!
Reviewing GKE terminology
First, let's review some GKE (Kubernetes) terminology. If you already know Kubernetes, feel free to skip ahead to the next section.
- Pod
- Node
- Node Pool
Pod
A Pod is a Kubernetes resource that represents a group of one or more containers. Because it can run multiple containers, it may help to think of it as something like docker compose. The containers in a Pod share storage and network resources, so they can communicate with each other and share data.
Node
A Node represents a single VM or physical machine. Multiple Pods are placed on each Node. In GKE Standard mode, a Node corresponds to one Compute Engine instance. In Amazon Elastic Kubernetes Service (EKS), a Node likewise appears to correspond to an EC2 instance.
For each container in a Pod, you can specify lower and upper bounds on CPU and memory (requests and limits). When you set the lower bound for memory, the Kubernetes scheduler uses that information to decide which Pod to place on which Node.
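To make the lower and upper bounds concrete, here is a minimal sketch of a Pod manifest. The Pod name, image, command, and the specific CPU and memory values are illustrative placeholders. `resources.requests` is the lower bound that the scheduler looks at when choosing a Node, and `resources.limits` is the upper bound.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-batch            # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: python:3.11-slim    # placeholder image
      command: ["python", "-c", "print('hello')"]
      resources:
        requests:                # lower bound, used for scheduling decisions
          cpu: "500m"
          memory: "1Gi"
        limits:                  # upper bound the container may not exceed
          cpu: "1"
          memory: "2Gi"
```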
Node Pool
A Node Pool is a group of Nodes within a cluster that share the same configuration. When a new Pod is about to be created and there is not enough CPU, memory, or other capacity to satisfy its requests, the Node Pool scales out by adding Nodes. Conversely, when Nodes are underused, it scales them in.
GKE Standard mode and EKS bill per Node, so automatic scale-in is a very welcome feature.
About Kubernetes Eviction
Kubernetes has a mechanism called Eviction. As used in this article, it means that when a Node's resources are underused, its Pods are moved to other Nodes so that the Node can be scaled in. This keeps the bill down to some extent.
However, this behavior is fatal for workloads such as ML batches that compute for a long time and need to hold on to their state. Many ML batches these days use a pipeline library to split the whole batch into multiple steps and cache each step. The AI & Machine Learning team uses a pipeline library called gokart for this. Even so, some of the actual computations take several hours, and if an Eviction happens while one is running, all the state computed up to that point is lost.
Batches are not the only thing that suffers from Eviction. For example, an API with only one replica will have a service interruption while it is being started again on another Node. There are many such cases where you want to avoid Eviction, and Google Cloud provides a way to do so. You can avoid it by adding an annotation like the following.
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
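For reference, here is a minimal sketch of where the annotation goes when the workload is a Job. The cluster autoscaler reads the annotation from the Pod, so in a Job it belongs under `spec.template.metadata` and not on the Job's own `metadata`. The Job name, image, and command below are hypothetical placeholders.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: long-running-ml-batch    # hypothetical name
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        # Ask the cluster autoscaler not to evict this Pod during scale-in
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: python:3.11-slim          # placeholder image
          command: ["python", "-c", "print('training...')"]
```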
However, avoiding eviction brings a different problem.
Problems caused by avoiding Eviction
Let's look at a problem we actually ran into by avoiding Eviction. Our team has all kinds of batches, from ones that need huge amounts of memory to ones that use little memory but run for a long time. Consider the following sequence of events.
- A huge Pod runs on a huge Node
- A long-running Pod with low memory usage wanders onto the same Node
- The huge Pod finishes
- The long-running Pod stays, even though it needs hardly any memory
- Because it refuses eviction, the huge Node stays up the whole time
As a result of avoiding eviction like this, an expensive Node stays up for a long time and the bill jumps.
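As a rough back-of-the-envelope model of this scenario (all symbols are introduced here purely for illustration): let P_big and P_small be the hourly prices of the huge Node and of a small Node that would have been enough for the long-running Pod, let t_1 be the time in hours at which the huge Pod finishes, and let t_2 > t_1 be the time at which the long-running Pod finishes. Assuming the long-running Pod could otherwise have run on the small Node, the extra cost of keeping the huge Node alive is approximately:

```latex
\[
  C_{\text{extra}} \approx \left(P_{\text{big}} - P_{\text{small}}\right)\,\left(t_2 - t_1\right)
\]
```

The longer the small Pod outlives the huge one, and the wider the price gap between the two Node types, the more money is wasted.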
So let's look at how our team worked around this. A Pod can specify a value called nodeSelector, which selects the kinds of Node the Pod is allowed to run on. We then prepared Node Pools ranging from ones for Nodes with few resources to ones for Nodes with many resources. This let us control cost and Node size to some extent! (By hand.)
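Here is a minimal sketch of pinning a Pod to a particular Node Pool with nodeSelector. GKE labels each Node with `cloud.google.com/gke-nodepool`, set to the name of its Node Pool. The pool name `highmem-pool`, the Pod name, the image, and the resource values are hypothetical. This is one possible way to do it, not necessarily the exact selector we use.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-hungry-batch      # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    # Only schedule onto Nodes that belong to this Node Pool
    cloud.google.com/gke-nodepool: highmem-pool   # hypothetical pool name
  containers:
    - name: main
      image: python:3.11-slim    # placeholder image
      command: ["python", "-c", "print('crunching...')"]
      resources:
        requests:
          cpu: "4"
          memory: "64Gi"
```

Small, long-running Pods would point at a pool of small Nodes in the same way, which keeps them from landing on the expensive Nodes.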
Requests for Nodes vary: some people want more memory, others want Nodes with GPUs. We also prepared cheaper Nodes such as Spot VMs and made them the default. As a result, the number of Node Pools to manage kept growing, and so did the number of Nodes that humans had to think about...
Besides Standard mode, GKE has an Autopilot mode, in which users do not have to manage Node Pools themselves. Using it looks like a way out of this Node management hell!
Is Autopilot a silver bullet?
Now let's see what Autopilot is like. Whereas GKE Standard bills for the time each Node's VM is running, Autopilot bills for each Pod's resources over time. In other words, the things we used to worry about, such as free capacity on Nodes or a small Pod wandering onto a Node and driving up its cost, no longer matter, because only Pods are billed. To anyone who thought, "Great, that solves everything":
safe-to-evict cannot be used in Autopilot
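Yes, we are back to square one. Given what happens behind the scenes with safe-to-evict, it presumably runs against the direction of Google Cloud's own resource optimization, so it is not hard to imagine why offering it on Autopilot would be difficult. For that reason, the AI & Machine Learning team had not been able to commit to using Autopilot mode for long-running Jobs.
I was going to end this post there, but while doing some fact-checking I came across the following article.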
It turns out safe-to-evict is now available on Autopilot too!!!! This is huge news for us. It came out around July, and it took us five months to notice. One caveat is that eviction is prevented for only seven days.
For our purposes only a handful of Jobs run for seven days or longer, so we decided Autopilot was viable and gave it a try. And then...
safe-to-evict cannot be used with Spot VMs
Spot VMs are instances that are far cheaper than regular VMs in exchange for having no availability guarantee. They are discounted by at least 60% and at most 91%. Even without an availability guarantee, scale-in does not actually happen that often, so in Standard mode we ran many Jobs on Spot VMs with safe-to-evict. Fully guaranteeing that safe-to-evict works on Spot VMs does look difficult, and at present it is not available on Google Cloud either*1. My personal impression is that eviction happens more often than Spot VM scale-in. So, to be honest, I would be happy to use it even with semantics like "safe-to-evict cannot be fully guaranteed, but internal evictions will not occur". That said, I can see that drawing the line for such a guarantee would be hard, so I am keeping an eye on announcements.
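Here is a sketch of the Standard-mode combination described above: a Job that selects Spot VM Nodes and also sets safe-to-evict. GKE puts the label `cloud.google.com/gke-spot: "true"` on Nodes backed by Spot VMs, which is what the nodeSelector matches. The Job name, image, and command are hypothetical placeholders.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: spot-ml-batch            # hypothetical name
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        # Prevents cluster autoscaler eviction only
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      nodeSelector:
        # Run only on Nodes backed by Spot VMs
        cloud.google.com/gke-spot: "true"
      containers:
        - name: main
          image: python:3.11-slim          # placeholder image
          command: ["python", "-c", "print('training on spot...')"]
```

Note that the annotation only stops the cluster autoscaler from evicting the Pod. It does nothing about Compute Engine reclaiming the Spot VM itself, so the Job still has to tolerate being interrupted.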
So, do we stay on Standard for the sake of cost optimization, or move to Autopilot (without Spot VMs) to be freed from Node Pools? Our MLOps journey on GKE has only just begun! And with that, I will wrap up.
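Summary
Running ML on Kubernetes requires thinking about things that differ from API-style workloads, and it is quite difficult. On the other hand, consolidating everything on Kubernetes has plenty of benefits, such as being able to automate monitoring. Above all, having an environment where you get to tinker with Kubernetes is a lot of fun. We will keep watching how Autopilot evolves and aim to build our infrastructure so that we can switch over at any time.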
We are hiring!!
The AI & Machine Learning team runs its own GKE clusters and is actively developing its MLOps platform. We are looking for MLOps engineers who want to build that platform with a view of the whole system!
*1: As of the time of writing, 2023/12/11