Trying out Kubeflow for distributed TensorFlow training on Kubernetes
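Kubeflow has been released as a machine learning platform for Kubernetes.
Roughly speaking, Kubeflow is open-source software that supports the following series of workloads in machine learning application development.
- Model development platform
Provides a server environment running Jupyter, so that data scientists and machine learning engineers can develop models with Jupyter Notebook
- Distributed TensorFlow training platform
Automatically builds a cluster on Kubernetes for distributed TensorFlow training of the developed model
- Application serving platform
Provides a platform for serving trained models on a Kubernetes cluster (TensorFlow Serving)
It is still under development and I have not followed it in depth, but here is a summary of how to set it up and what it offers.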
0. Setting up the Kubeflow environment
To run Kubeflow, you first need a Kubernetes cluster. Since this is only a trial, I used minikube to keep things simple.
The steps are as follows.
Installing minikube
On macOS, install minikube with the following command.
$ brew cask install minikube
Building the Kubernetes cluster
Run the following command to build the Kubernetes cluster.
$ minikube start
Starting local Kubernetes v1.8.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Loading cached images from config file.
The versions are as follows.
$ minikube version
minikube version: v0.24.1
$ kubectl get node
NAME       STATUS    AGE       VERSION
minikube   Ready     10m       v1.8.0
That completes the preparation.
If your PC falls short of basic human rights, or you are wasting your life behind a proxy, please use a VM or a managed Kubernetes service on a public cloud such as AWS, Azure, or GCP.
Installing Kubeflow
Kubeflow is published on GitHub, so clone it with the following command.
$ git clone https://github.com/google/kubeflow
This article walks through an example of distributed training on CPUs. Run the following commands to set up the environment.
$ cd kubeflow
$ kubectl apply -f components/ -R
Kubeflow also supports GPUs. More precisely, Kubernetes has offered GPU support as an alpha feature since 1.8, so I would like to try it with Nvidia CUDA at a later date.
The manifest files for the environment setup are under kubeflow/components.
1. Model development platform
Kubeflow provides an environment running Jupyter Notebook for model development. It uses JupyterHub, which manages authenticated access for multiple users and relies on pluggable components called "spawners".
Jupyter Notebook runs in a Kubernetes Pod.
$ kubectl get pod
NAME                               READY     STATUS    RESTARTS   AGE
(snip)
tf-hub-0                           1/1       Running   0          2h
tf-job-operator-5c648c58bc-vrtcx   1/1       Running   0          2h
The Pod that serves Jupyter Notebook is exposed through a Service named "tf-hub-lb" of type LoadBalancer.
$ kubectl get svc
NAME        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
(snip)
tf-hub-0    None            <none>        8000/TCP       2h
tf-hub-lb   10.107.227.23   <pending>     80:32089/TCP   2h
With minikube, you can look up the public URL with the following command.
$ minikube service tf-hub-lb --url
http://192.168.99.100:32089
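The URL printed above is the minikube VM's IP address combined with the node port from the PORT(S) column of tf-hub-lb. As a rough illustration only (the helper name node_port_url is hypothetical, and the values are the ones from the outputs above), the mapping could be sketched in Python like this:

```python
def node_port_url(node_ip, ports_column):
    """Build the URL for a Service reachable through a node port.

    ports_column is the PORT(S) value printed by `kubectl get svc`,
    e.g. "80:32089/TCP" (service port : node port / protocol).
    """
    port_pair, _protocol = ports_column.split("/")
    _service_port, node_port = port_pair.split(":")
    return "http://{}:{}".format(node_ip, node_port)


# Values taken from the outputs shown above.
url = node_port_url("192.168.99.100", "80:32089/TCP")
print(url)
```

This is why EXTERNAL-IP can stay at <pending> on minikube while the Service is still reachable from the host.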
If you want to use a custom Jupyter Notebook or change the configuration, edit the manifests under kubeflow/components/jupyterhub/manifests/
2. Distributed TensorFlow training platform
For models on the scale of tutorial or textbook samples, a single instance such as your own PC or one cloud VM is enough. For the large-scale models used in production, the amount of computation becomes enormous, so the compute platform has to scale out (while keeping an eye on cost).
Distributed training in TensorFlow uses three kinds of nodes.
- Master
- Parameter Server
- Worker
Kubeflow provides an environment that forms a distributed cluster on top of a Kubernetes cluster and runs TensorFlow's distributed training code on it.
One or two nodes each run as Master and Parameter Server, while as many Worker nodes run as the computation requires.
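To make the three roles concrete: distributed TensorFlow describes a cluster as a mapping from job name to a list of host:port addresses. The sketch below builds such a mapping in plain Python for 1 Master, 1 Parameter Server, and 3 Workers. The helper build_cluster, the host names, and port 2222 are made up for illustration and are not what Kubeflow actually generates.

```python
def build_cluster(replicas, port=2222):
    """Return {job_name: ["host:port", ...]} for the given replica counts.

    Host names follow a hypothetical "<job>-<index>" pattern.
    """
    return {
        job: ["{}-{}:{}".format(job, i, port) for i in range(count)]
        for job, count in replicas.items()
    }


cluster = build_cluster({"master": 1, "ps": 1, "worker": 3})
for job, hosts in sorted(cluster.items()):
    print(job, hosts)
```

A dictionary of this shape is the kind of input that tf.train.ClusterSpec accepts in TensorFlow 1.x.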
Kubeflow ships with sample code, which we run with the commands below. The sample trains a convolutional neural network using tf_cnn_benchmarks.
The source is here:
benchmarks/scripts/tf_cnn_benchmarks at master · tensorflow/benchmarks · GitHub
Submit the Job to the Kubernetes cluster with the following commands.
$ cd kubeflow/tf-controller-examples/tf-cnn/
$ kubectl create -f tf_job_cpu_distributed.yaml
tfjob "inception-171202-163257-cpu-3" created
After a while, Pods like the following are created.
Checking with kubectl shows one Pod each for inception--master- and inception--ps-, and three for inception--worker-.
$ kubectl get pod
NAME                                                READY     STATUS              RESTARTS   AGE
inception-171202-163257-cpu-3-master-vjo4-0-hc2kl   0/1       ContainerCreating   0          18s
inception-171202-163257-cpu-3-ps-vjo4-0-728vm       0/1       ContainerCreating   0          18s
inception-171202-163257-cpu-3-worker-vjo4-0-h66bn   0/1       ContainerCreating   0          18s
inception-171202-163257-cpu-3-worker-vjo4-1-lcljx   0/1       ContainerCreating   0          18s
inception-171202-163257-cpu-3-worker-vjo4-2-j59bh   0/1       ContainerCreating   0          18s
model-server-6598c6486d-b8rxg                       1/1       Running             1          6m
model-server-6598c6486d-clwz9                       0/1       Pending             0          6m
model-server-6598c6486d-dkmsk                       0/1       Pending             0          6m
tf-hub-0                                            1/1       Running             1          6m
tf-job-operator-5c648c58bc-47c7f                    1/1       Running             1
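The role of each Pod can be read off its name. As an illustration only, a few lines of Python can tally the Pods per role. The regular expression assumes the naming seen in the listing above (a role followed by an id, an index, and a random suffix), which may well differ in other Kubeflow versions.

```python
import re
from collections import Counter

# Pod names copied from the `kubectl get pod` output above.
POD_NAMES = [
    "inception-171202-163257-cpu-3-master-vjo4-0-hc2kl",
    "inception-171202-163257-cpu-3-ps-vjo4-0-728vm",
    "inception-171202-163257-cpu-3-worker-vjo4-0-h66bn",
    "inception-171202-163257-cpu-3-worker-vjo4-1-lcljx",
    "inception-171202-163257-cpu-3-worker-vjo4-2-j59bh",
    "tf-hub-0",
]

# Assumed pattern: "-<role>-<id>-<index>-<suffix>" at the end of the name.
ROLE_PATTERN = re.compile(r"-(master|ps|worker)-[a-z0-9]+-\d+-[a-z0-9]+$")


def count_roles(pod_names):
    """Count job Pods per role, ignoring names that do not match."""
    roles = Counter()
    for name in pod_names:
        match = ROLE_PATTERN.search(name)
        if match:
            roles[match.group(1)] += 1
    return roles


role_counts = count_roles(POD_NAMES)
print(dict(role_counts))
```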
The manifest file used with Kubeflow looks like the following. Only the key parts are excerpted.
apiVersion
apiVersion is set to "tensorflow.org/v1alpha1" and kind to "TfJob". Note that kind is not "Job".
apiVersion: tensorflow.org/v1alpha1
kind: TfJob
Master Pod
The Master Pod is specified under replicaSpecs. The command and parameters of the job to run are given the same way as in an ordinary Kubernetes manifest file. tfReplicaType is set to "MASTER", and since replicas is 1, a single Pod is started.
spec:
  replicaSpecs:
  - replicas: 1
    template:
      spec:
        containers:
        - args:
          - python
          - tf_cnn_benchmarks.py
          - --batch_size=32
          - --model=resnet50
          - --variable_update=parameter_server
          - --flush_stdout=true
          - --num_gpus=1
          - --local_parameter_device=cpu
          - --device=cpu
          - --data_format=NHWC
          image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
          name: tensorflow
          workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        restartPolicy: OnFailure
    tfReplicaType: MASTER
Parameter Server
The Parameter Server Pod is defined in the same way. Its tfReplicaType is "PS" and replicas is 1, so, as with the Master, a single Pod is started.
  - replicas: 1
    template:
      spec:
        containers:
        - args:
          (snip)
          name: tensorflow
          workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        restartPolicy: OnFailure
    tfReplicaType: PS
  tfImage: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
Worker Pod
Finally, the compute Pods. The Worker has tfReplicaType "WORKER" and replicas 3, so three compute Pods are started.
  - replicas: 3
    template:
      spec:
        containers:
        - args:
          (snip)
          name: tensorflow
          workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        restartPolicy: OnFailure
    tfReplicaType: WORKER
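Putting the three excerpts together, the number of Pods the job should create is the sum of the replicas values. The sketch below mirrors only the replicas and tfReplicaType fields of the manifest as a Python structure (everything else is omitted, and the helper name pods_per_type is hypothetical) and adds them up:

```python
# Only the fields relevant to Pod counts, copied from the excerpts above.
REPLICA_SPECS = [
    {"tfReplicaType": "MASTER", "replicas": 1},
    {"tfReplicaType": "PS", "replicas": 1},
    {"tfReplicaType": "WORKER", "replicas": 3},
]


def pods_per_type(replica_specs):
    """Map each tfReplicaType to the number of Pods it requests."""
    counts = {}
    for spec in replica_specs:
        replica_type = spec["tfReplicaType"]
        counts[replica_type] = counts.get(replica_type, 0) + spec["replicas"]
    return counts


counts = pods_per_type(REPLICA_SPECS)
total_pods = sum(counts.values())
print(counts, total_pods)
```

The total of five matches the five inception-171202-163257-cpu-3 Pods seen in the kubectl get pod output earlier.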
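With this configuration, training runs in a distributed fashion.
This time the Kubernetes cluster was built with minikube, but when using GKE, for example, I think it would be enough to configure a NodePool and bring up nodes with suitable specs in the pool.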
For the implementation of distributed training in TensorFlow, enakai's blog is easy to follow.
3. Application serving platform
Kubeflow uses TensorFlow Serving to serve applications. I verified things on minikube this time, so I have not tried this part. The flow is to build a cluster on GCP or similar, upload the trained model to GCS, and use it from the application.
At DevFest2017 I gave a session titled "A Super Introduction to Google Cloud Platform for Programmers", a brief introduction to GKE. There I ran a platform for image inference with TensorFlow as a demo.
The architecture is almost the same, so I mention it for reference.
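Impressions
The deep learning scene, which is causing quite a stir in the media and the world at large, overflows with fantasies such as "AI will take jobs away from humans", talk of cool algorithms, "I implemented this paper" bulletins, "I tried this tool/framework" diaries, and "look what fun things deep learning can do" contests. It is all dazzling, and far too vast for me to keep up with.
Even so, building platforms like this and working through unglamorous methodology is a lot of fun, so I would like to keep looking into it bit by bit while getting my hands dirty.
It still seems to be under development, but I will keep watching it.
The end.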