Easy, cheap machine learning with cloud services!
Summary
Use a service that can run machine learning jobs on dirt-cheap instances that may be interrupted.
@2020-11-17
service | scheduling | auto-scaling | auto-retry | preemptible | e.g. |
---|---|---|---|---|---|
AWS Batch | â | â | â | â | |
AWS SageMaker | â | â | â | â¡ | |
GCP AI Platform | â | â | â | â¡ | |
kubeflow on GCP | â | â | â | â | link |
Service overview summary
Preparing container-based training
- data
- how to pass params
- checkpoints/resume
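The three preparation items above can be sketched as a small training entrypoint. This is only a minimal sketch under assumptions: the environment variable names `TRAIN_EPOCHS` and `TRAIN_CKPT_PATH` are hypothetical, the "training" is a placeholder, and JSON stands in for a real checkpoint format such as a PyTorch state dict. The point is the shape: params come in from outside the container, and every run first tries to resume from the last checkpoint so an interruption only costs one epoch.

```python
import json
import os
from pathlib import Path


def load_params(env=os.environ):
    """Read hyperparameters from environment variables (hypothetical names)."""
    return {
        "epochs": int(env.get("TRAIN_EPOCHS", "3")),
        "ckpt_path": Path(env.get("TRAIN_CKPT_PATH", "checkpoint.json")),
    }


def load_checkpoint(path):
    """Return the saved state, or a fresh state when no checkpoint exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"epoch": 0, "loss": None}


def save_checkpoint(path, state):
    """Write to a temp file and rename, so an interruption never leaves a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def train(params, stop_after=None):
    """Run or resume training; `stop_after` simulates a preemption for demonstration."""
    state = load_checkpoint(params["ckpt_path"])
    for epoch in range(state["epoch"], params["epochs"]):
        if stop_after is not None and epoch >= stop_after:
            return state  # simulated interruption
        # Placeholder for one real training epoch.
        state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}
        save_checkpoint(params["ckpt_path"], state)
    return state


if __name__ == "__main__":
    print(train(load_params()))
```

In a real job the checkpoint path would presumably point at storage that survives the instance (object storage or a mounted volume), since the local disk of a preempted instance is gone.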
AWS-Batch GPU
Type | GPU | GPU memory | vCPU | Memory | Bandwidth |
---|---|---|---|---|---|
g4dn.xlarge | 1 | 16 GiB | 4 | 16 GiB | up to 25 Gbps |
AWS-Batch environment @2020-12-30
- HostOS: Amazon ECS GPU-optimized AMI version 20201209 @2020-12-30
- CUDA driver: 450.80.02 (CUDA Toolkit 11.1 compatible)
- Docker: 19.03.13-ce
- PyTorch image: 1.7.0-cuda11.0-cudnn8-runtime
- CUDA Toolkit: 11.0
- cuDNN: 8
For GPU container details, compatibility, etc., see "GPU深層学習 in Container" - たれぱんのびぼーろく
AWS Batch GPU
If you specify g4dn, instances are launched automatically with the Amazon ECS GPU-optimized AMI.
In managed compute environments, if the compute environment specifies any p2, p3, g3, g3s, or g4 instance types or instance families, then AWS Batch uses an Amazon ECS GPU-optimized AMI.
A compute environment that handles GPU jobs must contain GPU instances only.
All instance types in a compute environment that will run GPU jobs should be from the p2, p3, g3, g3s, or g4 instance families. https://docs.aws.amazon.com/batch/latest/userguide/gpu-jobs.html
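Allocated memory
If you request the instance's full memory, the job can't be placed, because the system already occupies part of it. This seems to be a fiddly area: apparently you have to check the available amount by hand and pick a value that fits. https://docs.aws.amazon.com/batch/latest/userguide/memory-management.html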
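As a rough sketch of the two points above (request a GPU, and leave memory headroom), here is how the container part of a job definition could be put together. Assumptions, loudly: the 10% headroom is a guess and not an AWS-documented value, the `pytorch/pytorch` repository name and `train.py` are placeholders, and both helper functions are hypothetical. Only the dict keys follow the `containerProperties` shape of the Batch job definition API.

```python
def job_memory_mib(instance_memory_gib, reserved_fraction=0.1):
    """Memory to request for a job, leaving headroom for the OS and the ECS agent.

    reserved_fraction is a guess; check what the instance actually registers.
    """
    total_mib = instance_memory_gib * 1024
    return int(total_mib * (1 - reserved_fraction))


def gpu_container_properties(image, command, vcpus, memory_mib, gpus=1):
    """Build the containerProperties dict for a GPU job definition."""
    return {
        "image": image,
        "command": command,
        "vcpus": vcpus,
        "memory": memory_mib,  # MiB
        "resourceRequirements": [{"type": "GPU", "value": str(gpus)}],
    }


# g4dn.xlarge has 16 GiB, so ask for a bit less than 16384 MiB.
props = gpu_container_properties(
    image="pytorch/pytorch:1.7.0-cuda11.0-cudnn8-runtime",  # assumed repository name
    command=["python", "train.py"],  # hypothetical entrypoint
    vcpus=4,
    memory_mib=job_memory_mib(16),
)
```

With boto3, a dict like `props` would be passed as `containerProperties` to `register_job_definition`. If the job stays in RUNNABLE, lowering the memory value further is the first thing to try.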
AWS Batch auto-scaling
Batch manages the compute environment in units of vCPUs and adjusts the number of instances running behind the scenes for you.
For [Minimum vCPUs], choose the minimum number of EC2 vCPUs that your compute environment should maintain, regardless of job queue demand. https://docs.aws.amazon.com/ja_jp/batch/latest/userguide/Batch_GetStarted.html
.
Minimum vCPUs: By always keeping this set to 0, you avoid instances wasting time when there is no work to run. If you set this above 0, that number of vCPUs must be maintained at all times. (from console)
.
As demand decreases, AWS Batch can decrease the desired number of vCPUs in your compute environment and remove instances, down to the minimum vCPUs. https://docs.aws.amazon.com/batch/latest/userguide/create-compute-environment.html
.
When you create a managed compute environment, AWS Batch manages the instances in the environment based on your specifications. When you create an unmanaged compute environment, you handle the instance configuration in the environment yourself.
.
AWS Batch manages the capacity and instance types of the compute resources within the environment, based on the compute resource specification that you define when you create the compute environment. https://docs.aws.amazon.com/ja_jp/batch/latest/userguide/compute_environments.html
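Putting the quotes above together, a managed Spot environment that scales down to zero could be described roughly like this. It is a sketch, not a tested setup: the function name and all argument values (subnet, security group, role names) are hypothetical placeholders, and the choice of `SPOT_CAPACITY_OPTIMIZED` is my assumption. The dict keys follow the request shape of the Batch `CreateComputeEnvironment` API.

```python
def spot_gpu_compute_environment(name, subnets, security_group_ids,
                                 instance_role, service_role, max_vcpus=16):
    """Build a request for a managed, Spot-backed, GPU-only compute environment."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",  # Batch manages the instances for us
        "state": "ENABLED",
        "serviceRole": service_role,
        "computeResources": {
            "type": "SPOT",  # cheap, interruptible instances
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,  # scale to zero when the queue is empty
            "maxvCpus": max_vcpus,
            "instanceTypes": ["g4dn.xlarge"],  # GPU instances only
            "subnets": subnets,
            "securityGroupIds": security_group_ids,
            "instanceRole": instance_role,
        },
    }


# All identifiers below are placeholders, not real resources.
request = spot_gpu_compute_environment(
    name="gpu-spot-env",
    subnets=["subnet-EXAMPLE"],
    security_group_ids=["sg-EXAMPLE"],
    instance_role="ecsInstanceRole",
    service_role="AWSBatchServiceRole",
)
```

With boto3 this would be sent as `boto3.client("batch").create_compute_environment(**request)`. Keeping `minvCpus` at 0 is what makes idle time free.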
Batch auto-retry
retryStrategy: { attempts?: int (1<=x<=10), evaluateOnExit?: [ { "action"?: RETRY | EXIT, "onExitCode"?: glob, "onReason"?: glob, "onStatusReason"?: glob } ] }
You can either retry unconditionally, or control it by condition: Exit or Retry when a condition is met.
Automated job retries https://docs.aws.amazon.com/ja_jp/batch/latest/userguide/job_retries.html
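To make the conditional form concrete, here is a sketch of a retry strategy that retries when the host instance disappears and exits on anything else, plus a tiny local simulator of how the rules get picked. Assumptions: the `"Host EC2*"` pattern is my guess at the status reason Batch reports when a Spot instance is reclaimed, `decide` is a hypothetical helper that only approximates the service's matching with `fnmatch`, and "no rule matched means retry" reflects my reading of the docs rather than something verified here.

```python
from fnmatch import fnmatchcase

RETRY_STRATEGY = {
    "attempts": 5,  # must be between 1 and 10
    "evaluateOnExit": [
        # Retry when the instance itself went away (e.g. Spot reclaim).
        {"onStatusReason": "Host EC2*", "action": "RETRY"},
        # Anything else is treated as a real failure: stop retrying.
        {"onReason": "*", "action": "EXIT"},
    ],
}

# Rule key -> name of the observed value it is matched against.
_FIELDS = {
    "onExitCode": "exit_code",
    "onReason": "reason",
    "onStatusReason": "status_reason",
}


def decide(strategy, exit_code="", reason="", status_reason=""):
    """Local approximation: the first rule whose patterns all match wins."""
    observed = {
        "exit_code": str(exit_code),
        "reason": reason,
        "status_reason": status_reason,
    }
    for rule in strategy.get("evaluateOnExit", []):
        patterns = {k: v for k, v in rule.items() if k in _FIELDS}
        if all(fnmatchcase(observed[_FIELDS[k]], p) for k, p in patterns.items()):
            return rule["action"].upper()
    return "RETRY"  # assumed default when no rule matches
```

For example, `decide(RETRY_STRATEGY, status_reason="Host EC2 (instance i-0abc) terminated.")` gives `"RETRY"`, while a plain non-zero exit of the container falls through to the second rule and gives `"EXIT"`. The dict itself would go into the `retryStrategy` field of a job definition or job submission.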
Alternatives
Use kubeflow. That's it.
kubeflow is, roughly, k8s Jobs-based training (the PyTorchJob CRD) wrapped into Argo-based workflows/pipelines.
If you aren't used to container orchestration, I think it will be a lot of work.
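For a feel of what the PyTorchJob side looks like, here is a sketch that builds a manifest as a Python dict and dumps it as JSON. Assumptions: the `kubeflow.org/v1` API version, the `cloud.google.com/gke-preemptible` node label for landing on preemptible GKE nodes, the image repository, and `train.py` are all things I am filling in for illustration. `pytorch_job` is a hypothetical helper, not part of any kubeflow SDK.

```python
import copy
import json


def pytorch_job(name, image, command, workers=1, preemptible=True):
    """Build a PyTorchJob manifest with one master and `workers` workers."""
    pod_spec = {
        "containers": [{
            "name": "pytorch",  # container name the PyTorchJob operator looks for
            "image": image,
            "command": command,
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    }
    if preemptible:
        # Assumed GKE label for preemptible node pools.
        pod_spec["nodeSelector"] = {"cloud.google.com/gke-preemptible": "true"}

    def replica(count):
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",  # restart the pod after a preemption
            "template": {"spec": copy.deepcopy(pod_spec)},
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {"pytorchReplicaSpecs": {"Master": replica(1), "Worker": replica(workers)}},
    }


job = pytorch_job(
    "train-demo",
    "pytorch/pytorch:1.7.0-cuda11.0-cudnn8-runtime",  # assumed repository name
    ["python", "train.py"],  # hypothetical entrypoint
)
manifest = json.dumps(job, indent=2)
```

The resulting JSON can be written to a file and applied with `kubectl apply -f`. Compared with Batch, everything around it (cluster, GPU node pools, the operator itself) is yours to set up, which is where the orchestration experience comes in.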
references
- Google Cloud Japan Team. (2019). プリエンプティブル VM と GPU による ML ワークフローのコスト削減 [Reducing the cost of ML workflows with preemptible VMs and GPUs]. Google Cloud.
- "AWS Batch: Batch processing, ML model training, and analysis at any scale". Official website.