Checkpoint
Premise: checkpoint IO is small enough to ignore (nobody reads or writes checkpoints at blazing speed).
Training needs a checkpoint pull when it starts and a checkpoint push when it finishes or is interrupted.
Without any abstraction, that means a remote->local checkpoint download, plus a local checkpoint save followed by a remote upload.
However, if both are viewed as reads and writes against a file system (fs), the download/upload can be abstracted away.
An fs does not care where the data physically lives (c.f. ファイルシステム - Wikipedia).
To read a checkpoint, just read; to write one, just write (transparent remote access).
All it takes is an fs that supports remote storage (c.f. fsspec).
The drawback is IO performance, but per the premise checkpoints never become IO-bound, so this is not a problem.
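The read/write view above can be sketched in Python roughly as follows. This is a minimal sketch under assumptions, not a definitive implementation: save_checkpoint, load_checkpoint, and the opener parameter are hypothetical names introduced here. The default opener is the built-in open, so the sketch runs with the standard library alone; passing fsspec.open instead (assuming fsspec and a backend for the target storage are installed) should let the same two functions take a remote URL.

```python
import pickle
from typing import Any, Callable


def save_checkpoint(state: Any, url: str, opener: Callable = open) -> None:
    """Write a checkpoint with a plain file write; the opener decides where it lands."""
    with opener(url, "wb") as f:
        pickle.dump(state, f)


def load_checkpoint(url: str, opener: Callable = open) -> Any:
    """Read a checkpoint with a plain file read; local and remote look the same here."""
    with opener(url, "rb") as f:
        return pickle.load(f)


# Local path, standard library only:
#   save_checkpoint({"step": 100}, "ckpt.pkl")
# Remote URL through fsspec (the bucket name is hypothetical, and s3:// needs s3fs):
#   import fsspec
#   save_checkpoint({"step": 100}, "s3://my-bucket/ckpt.pkl", opener=fsspec.open)
```

The training code never mentions download or upload; only the URL and the opener change between local and remote storage.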
Checkpoint/state object
This is about what should be loadable.
In machine learning, you specify fixed parameters (hparams) and learn variable parameters (weights).
At inference time both hparams and weights are fixed, and inference runs after loading them.
So hparams and weights should be packaged into one unit that can be read from a checkpoint in one shot.
Loading hparams and weights separately can then be abstracted into a single "state load" call.
That said, there are often settings you want to switch only at inference time, so supporting overrides is better.
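A minimal sketch of such a state object, assuming a JSON file as the checkpoint format and plain lists as weights. State, save_state, load_state, and the overrides parameter are hypothetical names for illustration, not taken from any particular framework.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class State:
    """Hypothetical container bundling fixed hparams with learned weights."""
    hparams: Dict[str, Any]
    weights: Dict[str, List[float]]


def save_state(state: State, path: str) -> None:
    """Persist hparams and weights together as one checkpoint file."""
    with open(path, "w") as f:
        json.dump(asdict(state), f)


def load_state(path: str, overrides: Optional[Dict[str, Any]] = None) -> State:
    """Load hparams and weights in one call, then apply inference-time overrides."""
    with open(path) as f:
        raw = json.load(f)
    # Overrides win over the stored hparams; the weights are left untouched.
    hparams = {**raw["hparams"], **(overrides or {})}
    return State(hparams=hparams, weights=raw["weights"])
```

At inference time, a call such as load_state(path, overrides={"batch_size": 1}) would return the trained weights together with the stored hparams, with only batch_size replaced.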
run argument
Container-based case:
Set ENTRYPOINT to ["python", "main.py"] and CMD to ["--arg_a", "A", "--arg_b", "B"].
This abstracts a run down to just passing settings (whether it is Python, and which file gets executed, become transparent).
The result is a "training application" style abstraction (the user does not need to know what language the app is written in).
Google AI Platform training
# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "task.py"]
https://cloud.google.com/ai-platform/training/docs/using-containers#dockerfile-basics
Google AI Platform training
The training service runs your Docker image, passing through any command-line arguments you specify when you create the training job.
https://cloud.google.com/ai-platform/training/docs/containers-overview
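On the application side, the arguments passed through CMD can be received with argparse. A minimal sketch of what main.py might look like, assuming the two flags --arg_a and --arg_b from the example above; parse_args and main are hypothetical names, and the training step is only a placeholder.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the settings that the container receives through CMD."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument("--arg_a", default=None)
    parser.add_argument("--arg_b", default=None)
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


def main(argv: Optional[List[str]] = None) -> None:
    args = parse_args(argv)
    # Hypothetical placeholder: a real trainer would start training here.
    print(f"training with arg_a={args.arg_a}, arg_b={args.arg_b}")
```

In a real main.py, main() would be called under the usual if __name__ == "__main__": guard, so that ENTRYPOINT plus CMD ends up running python main.py --arg_a A --arg_b B. The caller only ever supplies the CMD part.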
Authentication
Handled transparently from the training container's point of view.
Via a gateway/proxy sidecar, or AuthZ granted to the instance.