Hello, this is Ishikawa, a Solution Architect at Red Hat.
I have covered OpenShift Data Science on this blog several times before. Starting with v2.4, the most recent release, a feature for distributed training of machine learning models has been added as a Tech Preview. access.redhat.com
In this post I will try the feature out and look at the mechanism it uses to achieve distributed training. For the restrictions that apply to Tech Preview features, please see the linked Red Hat page.
In addition, the official product name has changed from Red Hat OpenShift Data Science (RHODS) to Red Hat OpenShift AI (RHOAI).
Some places, such as the documentation, have not been updated yet, but I will use the new name from here on.
For background on what kind of product RHOAI is, please also see my earlier posts on this blog:
rheb.hatenablog.com
About the CodeFlare Project
The distributed training functionality is being developed in an OSS project called CodeFlare.
The CodeFlare project consists of several components.
・CodeFlare SDK: An SDK for using the CodeFlare tools from Python.
・MCAD (Multi-Cluster App Dispatcher): Queues and dispatches jobs so that training can run as batch workloads.
・InstaScale: Adds GPU nodes to the cluster on demand in response to job requests.
・KubeRay: Builds Ray clusters on K8s for distributed processing.
Below, I will go through things in order, from installation to how each component works.
Installing the Operator
First, install the latest Operator from OperatorHub in the OpenShift console.
The CodeFlare features are covered by Tech Preview (TP) starting with the latest release, v2.4, so confirm that the correct version has been installed.
Once the Operator is installed, create a DataScienceCluster (DSC) CR to install the required components. With Operator v1.x, you created a KFDef CR and selected the components to install there; from v2 onward, this is configured in the DSC instead.
In the default configuration, .spec.components.codeflare.managementState is set to Removed, so change it to Managed as shown in the screenshot. Set .spec.components.ray.managementState to Managed in the same way.
Creating the DSC with these settings installs the required components on OpenShift.
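To make the DSC setting concrete, here is a minimal sketch that builds the relevant part of a DataScienceCluster manifest as a Python dictionary and prints it as JSON, which `oc apply -f` also accepts. Only the two `managementState` paths come from this post. The `apiVersion` value and the resource name `default-dsc` are assumptions, so check them against the CR that the Operator console generates for you.

```python
import json


def build_dsc_manifest(name="default-dsc"):
    """Build a minimal DataScienceCluster manifest enabling CodeFlare and Ray.

    The apiVersion and the default name are assumptions for illustration.
    """
    return {
        "apiVersion": "datasciencecluster.opendatahub.io/v1",  # assumed value
        "kind": "DataScienceCluster",
        "metadata": {"name": name},  # hypothetical resource name
        "spec": {
            "components": {
                # Both components default to "Removed" and must be enabled.
                "codeflare": {"managementState": "Managed"},
                "ray": {"managementState": "Managed"},
            }
        },
    }


manifest = build_dsc_manifest()
print(json.dumps(manifest, indent=2))
```

A real DSC usually lists other components as well; this sketch shows only the two fields discussed here.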
Defining and Starting a Ray Cluster
CodeFlare starts the OSS framework Ray on OpenShift as the foundation for distributed training. Ray is a general-purpose OSS framework for distributed processing. By pooling the resources of multiple servers, it can make each machine learning task (preprocessing, hyperparameter search, training, inference, and so on) more efficient.
Ray itself can also be used outside container environments, and the KubeRay Operator makes it available on Kubernetes as well. When you use CodeFlare, the KubeRay Operator is installed automatically.
The first thing the user does is define the required Ray Cluster resources using the CodeFlare SDK.
The SDK can be installed with pip install codeflare-sdk.
You define the Ray Cluster in a Python script and send that definition to MCAD.
The linked repository contains sample Notebook files for trying out CodeFlare.
The code below is an excerpt from one of them, with some modifications.
# Import the CodeFlare SDK
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
from codeflare_sdk.cluster.auth import TokenAuthentication

# Define the OpenShift login information and log in
auth = TokenAuthentication(
    token="SHA256...",  # output of `oc whoami -t`
    server="https://api.sample.openshiftapps.com:6443",  # OpenShift API server URL
    skip_tls=False
)
auth.login()

# Define the Ray Cluster
cluster = Cluster(ClusterConfiguration(
    name='raytest',
    namespace='default',  # Namespace in which to start the Ray Cluster
    num_workers=2,  # Number of Worker Pods that make up the Ray Cluster
    # Resources assigned to each Ray Worker (CPU/Memory/GPU)
    min_cpus=1,
    max_cpus=1,
    min_memory=4,
    max_memory=4,
    num_gpus=0,
    image="quay.io/project-codeflare/ray:latest-py39-cu118",  # Worker Pod image
    instascale=False  # Whether to use InstaScale
))

# Send the cluster information to MCAD
cluster.up()
cluster.wait_ready()

# Show the Ray Cluster that was created
cluster.details()
Through the SDK, the code logs in to OpenShift and defines the required resources in the cluster object.
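As a rough aid for sizing, the sketch below adds up what the Worker Pods in the configuration above request in total. It is plain arithmetic, not an SDK call: `WorkerSpec` and `total_worker_request` are hypothetical helper names, memory is assumed to be in GB, and the head Pod's own resources are left out because this post does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSpec:
    """Per-worker resource request (hypothetical helper, not part of the SDK)."""
    cpus: int
    memory_gb: int  # assumption: the SDK memory values are in GB
    gpus: int


def total_worker_request(num_workers: int, spec: WorkerSpec) -> dict:
    """Sum the resources requested by all Worker Pods (head Pod excluded)."""
    return {
        "cpus": num_workers * spec.cpus,
        "memory_gb": num_workers * spec.memory_gb,
        "gpus": num_workers * spec.gpus,
    }


# Values taken from the ClusterConfiguration above (the max_* settings).
request = total_worker_request(2, WorkerSpec(cpus=1, memory_gb=4, gpus=0))
print(request)  # {'cpus': 2, 'memory_gb': 8, 'gpus': 0}
```

If the OpenShift cluster cannot provide at least this much free capacity, the request is likely to wait rather than start immediately.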
For details on how to use the SDK, see the following page.
project-codeflare.github.io
The Ray Cluster now starts to come up. If OpenShift has enough resources for the Ray Cluster defined above, creation begins immediately; if not, the request to start the cluster stays queued in MCAD. This is where InstaScale, one of the CodeFlare components, comes in. If InstaScale is enabled in the definition of the cluster object in the code above, Nodes covering the missing resources can be added to the OpenShift cluster automatically.
The Worker Nodes needed to create the Ray Cluster are added to OpenShift dynamically, and the added Nodes can be removed automatically once the job has finished. This is most useful in public cloud environments, where additional Nodes can be provisioned on demand.
In RHOAI v2.4, InstaScale is not included in the TP scope, so I will only introduce the concept here. access.redhat.com
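The queuing behaviour described above can be pictured with a toy model. This is only a conceptual sketch of "dispatch when enough capacity is free, otherwise wait in the queue". It is not MCAD's actual scheduling algorithm, and every name in it (`ToyDispatcher`, `submit`, `add_capacity`) is hypothetical. Calling `add_capacity` stands in for what InstaScale would do by adding a Node.

```python
from collections import deque


class ToyDispatcher:
    """Toy FIFO dispatcher: a request runs only when enough CPUs are free."""

    def __init__(self, free_cpus: int):
        self.free_cpus = free_cpus
        self.queue = deque()  # pending (name, cpus) requests
        self.running = []     # names of dispatched requests

    def submit(self, name: str, cpus: int) -> None:
        """Queue a request, then dispatch whatever fits."""
        self.queue.append((name, cpus))
        self._dispatch()

    def add_capacity(self, cpus: int) -> None:
        """Simulate a new Node joining the cluster."""
        self.free_cpus += cpus
        self._dispatch()

    def _dispatch(self) -> None:
        # Strict FIFO: stop at the first request that does not fit.
        while self.queue and self.queue[0][1] <= self.free_cpus:
            name, cpus = self.queue.popleft()
            self.free_cpus -= cpus
            self.running.append(name)


dispatcher = ToyDispatcher(free_cpus=2)
dispatcher.submit("raytest", cpus=4)  # not enough capacity, stays queued
queued_before = [name for name, _ in dispatcher.queue]
dispatcher.add_capacity(4)            # capacity grows, request is dispatched
print(queued_before, dispatcher.running)
```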
Once the Ray Cluster has been created, you can reach the UI from the output of cluster.details().
Running Distributed Training
There are broadly two ways to run training on the Ray Cluster you created.
a. Define and run a DDPJob through the CodeFlare SDK
b. Run training using the Ray libraries
This time I use the first approach and define the job with the DDPJobDefinition class provided by the CodeFlare SDK.
Let's look at how the job is defined in the linked Notebook file.
from codeflare_sdk.job.jobs import DDPJobDefinition

# Define the job
jobdef = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",  # Training script
    scheduler_args={"requirements": "requirements.txt"}  # Arguments used at training time
)

# Run the job
job = jobdef.submit(cluster)
DDPJobDefinition is given mnist.py, the training script, and requirements.txt, which lists the additional libraries to install on each Ray Worker Pod.
The job then runs after the required dependencies have been installed.
DDPJobDefinition builds on the DDP (Distributed Data Parallel) mechanism provided by TorchX: data is sharded across the Workers of the Ray Cluster to achieve data-parallel distributed training.
pytorch.org
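To illustrate what sharding data across Workers means in data-parallel training, here is a minimal pure-Python sketch. It mimics the strided split used by samplers such as PyTorch's `DistributedSampler` (each worker takes every `world_size`-th index starting at its rank) but omits shuffling and padding. `shard_indices` is a hypothetical helper, not a TorchX or CodeFlare API.

```python
def shard_indices(num_samples: int, world_size: int, rank: int) -> list:
    """Return the sample indices handled by one worker (strided split)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(range(rank, num_samples, world_size))


# Ten samples split across the two Workers of the example cluster.
shards = [shard_indices(10, world_size=2, rank=r) for r in range(2)]
print(shards)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```

Each worker computes gradients on its own shard, and DDP averages the gradients across workers so that every model replica stays in sync.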
You can check the progress of the job from the Ray dashboard or with job.status().
When the job has finished, delete the Ray Cluster you created with cluster.down().
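Because a Ray Cluster keeps holding resources until it is removed, it can help to guarantee that `cluster.down()` runs even if the job fails. The sketch below is one possible pattern, not something the SDK provides: `ray_cluster` is a hypothetical helper that works with any object exposing the `up()`, `wait_ready()`, and `down()` methods used earlier, and `FakeCluster` is a stand-in so that the example runs without an OpenShift cluster.

```python
from contextlib import contextmanager


@contextmanager
def ray_cluster(cluster):
    """Bring a cluster up, yield it, and always tear it down afterwards."""
    cluster.up()
    try:
        cluster.wait_ready()
        yield cluster
    finally:
        cluster.down()  # runs even if the body raises


class FakeCluster:
    """Stand-in that records calls instead of talking to OpenShift."""

    def __init__(self):
        self.calls = []

    def up(self):
        self.calls.append("up")

    def wait_ready(self):
        self.calls.append("wait_ready")

    def down(self):
        self.calls.append("down")


fake = FakeCluster()
try:
    with ray_cluster(fake):
        raise RuntimeError("simulated job failure")
except RuntimeError:
    pass
print(fake.calls)  # ['up', 'wait_ready', 'down']
```

With the real SDK, the job would be submitted inside the `with` block, and the block should not be left until the job has finished, since leaving it deletes the cluster.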
This completes the whole process of distributed training with CodeFlare.
For the other approach, which uses the libraries provided by Ray, see the linked example.
Summary
This post introduced distributed training with CodeFlare, which is now a Tech Preview in OpenShift AI. As the use of large models such as LLMs and the need to fine-tune them grow, demand for this kind of functionality is likely to increase. I plan to keep covering feature updates in future posts.