Multi-GPU fine-tuning, multi-thread error: Exception in thread #352

Open
fanshuaiyao opened this issue Aug 31, 2024 · 6 comments

Comments

@fanshuaiyao

Problem: when fine-tuning on multiple GPUs, only the first GPU is able to save checkpoints; the other GPUs cannot, even though the logs show that training did run on them.

Screenshot: [image]

Could someone explain why this happens and how to fix it?

@fanshuaiyao
Author

Solved.

When saving checkpoints there was no rank restriction, so every GPU process was saving a checkpoint and writing to the same file at the same time, which leads to conflicting writes and the error above.
Solution:
Add a guard at the checkpoint-saving location in the main file.
Below is the modified code, with an if dist.get_rank() == 0: check added so that only rank 0 saves checkpoints.

Saving checkpoints.

    if args.should_save and num_steps_this_epoch > 0:
        # Ensure only the main process (Rank 0) saves checkpoints
        if dist.get_rank() == 0:
            if (epoch + 1) == args.max_epochs or (
                args.save_epoch_frequency > 0 and ((epoch + 1) % args.save_epoch_frequency) == 0
            ):
                t1 = time.time()
                save_path = os.path.join(args.checkpoint_path, f"epoch{epoch + 1}.pt")
                torch.save(
                    {
                        "epoch": epoch + 1,
                        "step": steps,
                        "name": args.name,
                        "state_dict": model.state_dict() if not args.use_flash_attention else convert_state_dict(model.state_dict()),
                        "optimizer": optimizer.state_dict(),
                    },
                    save_path,
                )
                logging.info("Saved checkpoint {} (epoch {} @ {} steps) (writing took {} seconds)".format(save_path, epoch + 1, steps, time.time() - t1))
            
            # Save the latest params
            t1 = time.time()
            save_path = os.path.join(args.checkpoint_path, "epoch_latest.pt")
            torch.save(
                {
                    "epoch": epoch + 1,
                    "step": steps,
                    "name": args.name,
                    "state_dict": model.state_dict() if not args.use_flash_attention else convert_state_dict(model.state_dict()),
                    "optimizer": optimizer.state_dict(),
                },
                save_path,
            )

@wanghao14

First, I don't really understand what your question is actually asking. In multi-GPU training only the main process should save checkpoints; multiple processes writing to the same file will corrupt it. And since the parameters are kept in sync across GPUs, why would the other GPUs need to save checkpoints as well?
Second, one of the conditions for args.should_save to be True is being the main process, so the extra check you added is redundant.
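
(For context, here is a minimal sketch of the rank guard being described above. The exact condition the project uses to set args.should_save may differ; the names below are only illustrative.)

    import torch.distributed as dist

    def is_master():
        # "Master" means: distributed is not initialized (single-GPU run)
        # or this process has global rank 0.
        return (not dist.is_initialized()) or dist.get_rank() == 0

    # Illustrative only: if should_save is derived like this in main.py,
    # it already implies rank 0, so another dist.get_rank() == 0 check at
    # save time is redundant (`args` is the script's argparse namespace).
    args.should_save = is_master() and args.checkpoint_path is not None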

@fanshuaiyao
Author

First, I don't really understand what your question is actually asking. In multi-GPU training only the main process should save checkpoints; multiple processes writing to the same file will corrupt it. And since the parameters are kept in sync across GPUs, why would the other GPUs need to save checkpoints as well? Second, one of the conditions for args.should_save to be True is being the main process, so the extra check you added is redundant.

Then why does it throw an error if I don't add that line? What's the cause?

@wanghao14

It seems you're the only one who has run into this. If you haven't changed the code much, I'd suggest checking whether your multi-GPU training launch is set up correctly.
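
(For reference, a minimal sketch of a typical torch.distributed launch and initialization, one process per GPU. The script name and arguments are illustrative, not necessarily this project's exact entry point.)

    # Illustrative launch command (run from a shell):
    #   torchrun --nproc_per_node=4 main.py <your-args>
    import os
    import torch
    import torch.distributed as dist

    def init_distributed():
        # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
        # in the environment, so init_process_group can read them via env://.
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")
        return dist.get_rank(), local_rank

    rank, local_rank = init_distributed()
    # Only rank 0 should write shared files such as checkpoints.
    is_master = (rank == 0)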

@fanshuaiyao
Copy link
Author

It seems you're the only one who has run into this. If you haven't changed the code much, I'd suggest checking whether your multi-GPU training launch is set up correctly.

I haven't changed much of the code, and I've seen other people in the issues run into this problem as well.

@ctgushiwei

@fanshuaiyao I ran into this problem too, and adding that line didn't help.
