val loss in distributed training #674

@LiuSiQi-TJ

I am training DCCRN on the LibriMix dataset with 8 GPUs.
I enabled early stopping in the conf.
The model always stops at a very early stage, around 10 or 20 epochs.
In the log, I see that val_loss is calculated separately on each GPU, while early stopping appears to be decided only by GPU 0. I think this is the reason for the very early stop. The log is as follows:

[rank: 5] Metric val_loss improved by 0.433 >= min_delta = 0.0. New best score: -11.178
[rank: 0] Metric val_loss improved by 0.333 >= min_delta = 0.0. New best score: -11.104
[rank: 7] Metric val_loss improved by 0.530 >= min_delta = 0.0. New best score: -10.551
[rank: 4] Metric val_loss improved by 0.408 >= min_delta = 0.0. New best score: -10.931
[rank: 1] Metric val_loss improved by 0.287 >= min_delta = 0.0. New best score: -10.971
[rank: 3] Metric val_loss improved by 0.415 >= min_delta = 0.0. New best score: -11.321
[rank: 2] Metric val_loss improved by 0.418 >= min_delta = 0.0. New best score: -10.858
[rank: 6] Metric val_loss improved by 0.504 >= min_delta = 0.0. New best score: -11.375
Epoch 2, global step 1587: 'val_loss' reached -11.10351 (best -11.10351),
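
To make the discrepancy concrete, here is a minimal sketch in plain Python that uses only the eight per-rank scores from the log above. It compares the value seen by rank 0 with the mean over all ranks, which is what a cross-rank reduction would report. The helper `reduce_mean` is hypothetical and only stands in for such a reduction. Note that the final log line (-11.10351) matches the rank 0 score, which is consistent with the metric not being synchronized. In PyTorch Lightning, passing `sync_dist=True` to `self.log` is the usual way to request this reduction; whether that alone fixes the early stop here is an assumption, not something the log confirms.

```python
from statistics import mean

# Per-rank "New best score" values copied from the log above (rank -> val_loss).
per_rank_val_loss = {
    0: -11.104, 1: -10.971, 2: -10.858, 3: -11.321,
    4: -10.931, 5: -11.178, 6: -11.375, 7: -10.551,
}


def reduce_mean(values):
    """Hypothetical stand-in for an all-reduce mean across ranks."""
    return mean(list(values))


# Value every rank would agree on if the metric were averaged across GPUs.
synced = reduce_mean(per_rank_val_loss.values())

# Gap between the best and worst rank when nothing is synchronized.
spread = max(per_rank_val_loss.values()) - min(per_rank_val_loss.values())

print(f"rank 0 only      : {per_rank_val_loss[0]:.3f}")
print(f"mean of all ranks: {synced:.3f}")
print(f"max - min spread : {spread:.3f}")
```

If each rank monitors only its own shard of the validation set, the ranks can disagree about whether val_loss improved in a given epoch. A synchronized metric gives every rank the same number to compare against.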

Labels: bug, help wanted
