Scripts for developing MMagic

1. Check UT

Please check your unit tests (UT) with the following script:

cd mmagic/
python .dev_scripts/update_ut.py

The script will report redundant, missing, and blank UTs. Please create the missing UTs according to your package's code implementation.
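For reference, a UT is typically a small pytest file under tests/ that mirrors the package layout and checks the forward behaviour and output shapes of the new component. A minimal sketch (the toy model here is a hypothetical stand-in; replace it with your real MMagic module):

import pytest
import torch
from torch import nn


def build_toy_model():
    # Stand-in for the component you implemented in MMagic.
    return nn.Conv2d(3, 3, kernel_size=3, padding=1)


def test_toy_model_forward():
    model = build_toy_model()
    inputs = torch.rand(1, 3, 16, 16)
    outputs = model(inputs)
    assert outputs.shape == (1, 3, 16, 16)


@pytest.mark.skipif(not torch.cuda.is_available(), reason='requires CUDA')
def test_toy_model_forward_cuda():
    model = build_toy_model().cuda()
    inputs = torch.rand(1, 3, 16, 16).cuda()
    assert model(inputs).shape == (1, 3, 16, 16)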

2. Test all the models

Please follow these steps to test all the models in MMagic:

First, you will need to download all the pre-trained checkpoints:

python .dev_scripts/download_models.py

Then, you can start testing all the benchmarks:

python .dev_scripts/test_benchmark.py

3. Train all the models

3.1 Train for debugging

In order to test the whole pipeline (training, visualization, etc.) quickly, you may want to reduce the total iterations of all the models to a small number of steps (e.g., 100). You can use the following steps:

First, since our datasets are stored on Ceph, you need to create Ceph configs.

# create configs
python .dev_scripts/create_ceph_configs.py \
        --target-dir configs_ceph_debug \
        --gpus-per-job 2 \
        --iters 100 \
        --save-dir-prefix work_dirs/benchmark_debug \
        --work-dir-prefix work_dirs/benchmark_debug

If you only want to update a specific config file, you can specify it with --test-file, e.g. --test-file configs/aot_gan/aot-gan_smpgan_4xb4_places-512x512.py.

Here, --target-dir is the directory where the newly created configs are saved, --gpus-per-job is the number of GPUs used for each job, --iters is the total number of training iterations for each model, and --save-dir-prefix and --work-dir-prefix set the working directory, where you can find the training logs.
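For illustration, the --iters override is conceptually similar to the following mmengine snippet (a sketch only, assuming the config defines an iteration-based train_cfg; the actual script additionally rewrites the dataset paths to Ceph):

import os

from mmengine.config import Config

# Example config from the repo; any iteration-based MMagic config works the same way.
cfg = Config.fromfile('configs/aot_gan/aot-gan_smpgan_4xb4_places-512x512.py')

# Shrink the schedule for a quick debug run.
cfg.train_cfg.max_iters = 100

# Dump the modified config into the target directory.
os.makedirs('configs_ceph_debug', exist_ok=True)
cfg.dump('configs_ceph_debug/aot-gan_smpgan_4xb4_places-512x512.py')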

Then, you will need to submit all the jobs by running train_benchmark.py.

python .dev_scripts/train_benchmark.py mm_lol \
    --config-dir configs_ceph_debug \
    --run \
    --gpus-per-job 2 \
    --job-name debug \
    --work-dir work_dirs/benchmark_debug \
    --resume \
    --quotatype=auto

Here, --config-dir specifies the config files used for training, and --run submits all the jobs. You can set the prefix of the submitted job names with --job-name and the working directory with --work-dir. We suggest using --resume to enable auto-resume during training and --quotatype=auto to fully exploit all available computing resources.

3.2 Train for FP32

If you want to train all the models with FP32 (i.e., the regular settings, the same as in configs/), you can follow these steps:

# create configs for fp32
python .dev_scripts/create_ceph_configs.py \
        --target-dir configs_ceph_fp32 \
        --gpus-per-job 4 \
        --save-dir-prefix work_dirs/benchmark_fp32 \
        --work-dir-prefix work_dirs/benchmark_fp32

Then, submit the jobs to Slurm:

python .dev_scripts/train_benchmark.py mm_lol \
    --config-dir configs_ceph_fp32 \
    --run \
    --resume \
    --gpus-per-job 4 \
    --job-name fp32 \
    --work-dir work_dirs/benchmark_fp32 \
    --quotatype=auto

3.3 Train for FP16

You will also need to train the models with AMP (i.e., FP16). You can use the following steps to achieve this (a short sketch of what AMP does in plain PyTorch is given at the end of this subsection):

python .dev_scripts/create_ceph_configs.py \
        --target-dir configs_ceph_amp \
        --gpus-per-job 4 \
        --save-dir-prefix work_dirs/benchmark_amp \
        --work-dir-prefix work_dirs/benchmark_amp

Then, submit the jobs to run:

python .dev_scripts/train_benchmark.py mm_lol \
    --config-dir configs_ceph_amp \
    --run \
    --resume \
    --gpus-per-job 4 \
    --amp \
    --job-name amp \
    --work-dir work_dirs/benchmark_amp \
    --quotatype=auto
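For reference, --amp switches training to automatic mixed precision. A minimal sketch of the technique in plain PyTorch (an illustration only, not MMagic's training loop; it requires a CUDA device):

import torch
from torch import nn

model = nn.Linear(8, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # rescales the loss to avoid FP16 underflow

for _ in range(10):
    inputs = torch.randn(4, 8, device='cuda')
    targets = torch.randn(4, 1, device='cuda')
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # run the forward pass in mixed precision
        loss = nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()    # backward on the scaled loss
    scaler.step(optimizer)
    scaler.update()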

4. Monitor your training

After submitting jobs following Section 3 (Train all the models), you will find a log file (e.g., xxx.log). This log file lists the job name and job ID of every job you have submitted. With this log file, you can monitor your training by running .dev_scripts/job_watcher.py.

For example, you can run

python .dev_scripts/job_watcher.py --work-dir work_dirs/benchmark_fp32/ --log 20220923-140317.log

Then, you will find 20220923-140317.csv, which reports the status and recent log of each job.
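If you just want a quick snapshot of your submitted jobs without the watcher, a rough sketch that polls Slurm directly is shown below (this assumes squeue is available and that your jobs share the prefix passed to --job-name; it is not how job_watcher.py works internally):

import getpass
import subprocess


def list_jobs(prefix):
    """Print the name and state of the current user's Slurm jobs matching a prefix."""
    # '%j' prints the job name and '%T' the job state; '-h' suppresses the header.
    cmd = ['squeue', '-u', getpass.getuser(), '-h', '-o', '%j %T']
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    for line in result.stdout.splitlines():
        name, state = line.rsplit(maxsplit=1)
        if name.startswith(prefix):
            print(f'{name}: {state}')


list_jobs('fp32')  # e.g. the prefix used with --job-name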

5. Train with a list of models

If you only need to run some of the models, you can list the names of those models in a file and pass it to train_benchmark.py.

For example,

python .dev_scripts/train_benchmark.py mm_lol \
    --config-dir configs_ceph_fp32 \
    --run \
    --resume \
    --gpus-per-job 4 \
    --job-name fp32 \
    --work-dir work_dirs/benchmark_fp32 \
    --quotatype=auto \
    --rerun \
    --rerun-list 20220923-140317.log

Specifically, you need to enable --rerun and specify the list of models to rerun with --rerun-list.

6. Train with skipping a list of models

If you want to train all the models while skipping some of them, you can likewise list the names of the models to skip in a file and pass it to train_benchmark.py.

For example,

python .dev_scripts/train_benchmark.py mm_lol \
    --config-dir configs_ceph_fp32 \
    --run \
    --resume \
    --gpus-per-job 4 \
    --job-name fp32 \
    --work-dir work_dirs/benchmark_fp32 \
    --quotatype=auto \
    --skip \
    --skip-list 20220923-140317.log

Specifically, you need to enable --skip and specify the list of models to skip with --skip-list.

7. Train failed or canceled jobs

If you want to rerun the failed or canceled jobs of the last run, you can combine the --rerun flag with the --rerun-fail and --rerun-cancel flags.

For example, the log file of the last run is train-20221009-211904.log, and now you want to rerun the failed jobs. You can use the following command:

python .dev_scripts/train_benchmark.py mm_lol \
    --job-name RERUN \
    --rerun train-20221009-211904.log \
    --rerun-fail \
    --run

You can combine --rerun-fail and --rerun-cancel with the --models flag to rerun a subset of the failed or canceled models.

python .dev_scripts/train_benchmark.py mm_lol \
    --job-name RERUN \
    --rerun train-20221009-211904.log \
    --rerun-fail \
    --models sagan \
    --run  # only rerun 'sagan' models in all failed tasks

Specifically, --rerun-fail and --rerun-cancel can be used together to rerun both failed and canceled jobs.

8. Deterministic training

Setting torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False removes the randomness of cuDNN operations in PyTorch training. You can add the --deterministic flag when starting your benchmark training to remove the influence of this randomness.

python .dev_scripts/train_benchmark.py mm_lol --job-name xzn --models pix2pix --cpus-per-job 16 --run --deterministic
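For reference, the cuDNN switches mentioned above look like this in plain PyTorch (a sketch of the underlying settings, not MMagic's actual --deterministic implementation; the seed value is arbitrary):

import random

import numpy as np
import torch


def set_deterministic(seed=2022):
    """Reduce run-to-run randomness in PyTorch training."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds all CUDA devices
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable cuDNN auto-tuning


set_deterministic()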

9. Automatically check links

Use the following script to check whether the links in the documentation are valid:

python .dev_scripts/doc_link_checker.py --target docs/zh_cn
python .dev_scripts/doc_link_checker.py --target README_zh-CN.md
python .dev_scripts/doc_link_checker.py --target docs/en
python .dev_scripts/doc_link_checker.py --target README.md

You can set --target to either a file or a directory.

Note: DO NOT use it in CI, because sending too many HTTP requests from CI will cause 503 errors and the CI job will probably fail.
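For illustration, the core idea of such a checker, collecting markdown link targets and verifying that the local ones exist, can be sketched as follows (a simplified stand-in, not the actual doc_link_checker.py; it skips remote links entirely):

import os
import re
import sys

# Capture the target of markdown links such as [text](path/to/file.md#anchor).
LINK_RE = re.compile(r'\[[^\]]*\]\(([^)#]+)')


def check_file(md_path):
    """Return the local link targets in md_path that do not exist on disk."""
    broken = []
    base = os.path.dirname(md_path)
    with open(md_path, encoding='utf-8') as f:
        for target in LINK_RE.findall(f.read()):
            if target.startswith(('http://', 'https://', 'mailto:')):
                continue  # remote links are not checked in this sketch
            if not os.path.exists(os.path.join(base, target)):
                broken.append(target)
    return broken


if __name__ == '__main__':
    for path in sys.argv[1:]:
        for target in check_file(path):
            print(f'{path}: broken link -> {target}')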

10. Calculate flops

To summarize the flops of different models, you can run the following command:

python .dev_scripts/benchmark_valid_flop.py --flops --flops-str
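For a single model, flops can be counted with a tool such as fvcore (a sketch that assumes fvcore is installed and uses a toy model; the dev script may use a different backend and iterates over all MMagic models):

import torch
from torch import nn
from fvcore.nn import FlopCountAnalysis, parameter_count

# Toy stand-in model; in practice you would build an MMagic model here.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1))
inputs = torch.rand(1, 3, 256, 256)

flops = FlopCountAnalysis(model, inputs)
print(f'FLOPs: {flops.total() / 1e9:.3f} G')
print(f'Params: {parameter_count(model)[""] / 1e6:.3f} M')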

11. Update model index

To update the model index according to the README.md files, please run the following command:

python .dev_scripts/update_model_index.py