Multi-GPU fine-tuning fails with a multithreading error: Exception in thread #352
Comments
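Solved. Checkpoint saving was not restricted to a single process, so every GPU was saving a checkpoint and writing to the same file, which Python does not allow. Saving checkpoints.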
|
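First, I don't understand what you are actually asking. In multi-GPU training only the main process should save the checkpoint; having several processes write to the same file corrupts it. Besides, parameters are synchronized across GPUs in multi-GPU training, so why would the other GPUs need to save a checkpoint as well? |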
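To illustrate the point about saving from the main process only, here is a minimal sketch. It is not this project's code: the helper names `is_main_process` and `save_checkpoint` are hypothetical, the state is written as JSON purely to keep the example dependency-free, and it assumes the launcher (for example `torchrun`) exports a `RANK` environment variable. In a PyTorch script the same guard is commonly written with `torch.distributed.get_rank()` around `torch.save`.

```python
import json
import os
import tempfile


def is_main_process() -> bool:
    """Return True only for global rank 0 (or a single-process run)."""
    # Assumption: the launcher sets RANK; when it is absent we treat the
    # process as a standalone run and let it save.
    return int(os.environ.get("RANK", "0")) == 0


def save_checkpoint(state: dict, path: str) -> bool:
    """Write `state` to `path` from the main process only.

    Returns True if this process wrote the file, False if it skipped.
    """
    if not is_main_process():
        # Other ranks skip saving, so only one process ever touches the file.
        return False

    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)

    # Write to a temporary file and rename it afterwards, so a reader never
    # sees a half-written checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)
    return True
```

With this guard every rank can call `save_checkpoint(...)` at the same step, and only rank 0 performs the write.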
Then why does it raise an error without that line? What is the cause? |
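You seem to be the only one who has hit this problem. If you have not changed much of the code, I suggest checking whether the multi-GPU training was launched correctly. |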
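One way to check the launch is to print what each process believes about its place in the job. The sketch below assumes a launcher that exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` (as `torchrun` does); the launch command used in this thread is not shown, and `describe_launch` is a hypothetical helper name.

```python
import os


def describe_launch(env=None) -> dict:
    """Collect the distributed-launch variables visible to this process."""
    env = os.environ if env is None else env
    info = {
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
    }
    # A world size of 1 means this process thinks it is running alone.
    info["distributed"] = info["world_size"] > 1
    return info


if __name__ == "__main__":
    # Run under the same launcher as the training script and compare the
    # line printed by each process.
    print(describe_launch())
```

If every process reports `rank` 0 and `world_size` 1, the processes were probably started independently rather than as one distributed job, which would also explain several of them trying to write the same checkpoint.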
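I have not changed much of the code. I have seen other people report this problem in the issues as well. |
@fanshuaiyao I ran into this problem too, and adding that line did not help. |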
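Problem: when fine-tuning on multiple GPUs, only the first GPU can save checkpoints and none of the others can, even though the logs show that training did run on the other GPUs.
Screenshot:
Could someone please explain why this happens and how to fix it?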