The official project repository for the EFCP paper from DAS Lab @ Institute of Science and Technology Austria.
Assuming the project is located at ~/EFCP in the home directory, the CUDA kernels can be installed using the following commands:
$ cd ~/EFCP/cuda/mfac_kernel
$ python setup_cuda.py install
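Optionally, a quick sanity check (an illustration only; the compiled extension's module name is specific to this repository) that PyTorch sees a CUDA toolchain before building and using the kernels:
import torch
# prints the PyTorch version, the CUDA version it was built against, and whether a GPU is visible
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())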
We used M-FAC on RTX-3090 and A6000 GPUs.
We provide a shell script to reproduce all our experiments and we recommend using WandB to track the results.
For this experiment we build on top of the FFCV repository and add a few extra parameters (see the custom section).
Dataset generation. The FFCV repository pre-processes the ImageNet dataset to produce the FFCV dataset format. Make sure you set the correct paths in ~/EFCP/ffcv-imagenet/write_imagenet.sh before running this script.
Image scaling. Comment out the resolution section in the yaml config.
Running the experiment. Run the following commands after replacing the parameter values prefixed with @ with your own values.
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP/ffcv-imagenet
$ bash write_imagenet.sh
$ CUDA_VISIBLE_DEVICES=0 python train_imagenet.py \
--data.train_dataset @TRAIN_PATH \
--data.val_dataset @VALIDATION_PATH \
--logging.folder @LOGGING_FOLDER \
--wandb.project @WANDB_PROJECT \
--wandb.group @WANDB_GROUP \
--wandb.job_type @WANDB_JOB_TYPE \
--wandb.name @WANDB_NAME \
--data.num_workers 12 \
--data.in_memory 1 \
--config-file rn18_configs/rn18_88_epochs.yaml \
--training.optimizer kgmfac \
--training.batch_size 1024 \
--training.momentum 0 \
--training.weight_decay 1e-05 \
--lr.lr 0.001 \
--lr.lr_schedule_type linear \
--custom.damp 1e-07 \
--custom.k 0.01 \
--custom.seed @SEED \
--custom.wd_type wd
For this experiment we build on top of the ASDL repository. We integrate our M-FAC implementations in the following files:
- ~/EFCP/asdl/asdl/precondition/mfac.py for Dense M-FAC
- ~/EFCP/asdl/asdl/precondition/sparse_mfac.py for Sparse M-FAC
Features added. We added the following new parameters to the existing repository:
- clip_type - specifies whether clipping is performed by value or by norm (val, norm)
- clip_bound - the value used for clipping. Set it to 0 to disable clipping, regardless of the value of clip_type
- ignore_bn_ln_type - used to perform the BN/LN ablation. Possible values are none, all, modules
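For intuition, here is a minimal sketch of what the two clipping modes correspond to in plain PyTorch; this illustrates the semantics only and is not the repository's implementation:
import torch

def clip_gradient(grad, clip_type, clip_bound):
    # clip_bound == 0 disables clipping, regardless of clip_type
    if clip_bound == 0:
        return grad
    if clip_type == "val":
        # clip every entry of the gradient into [-clip_bound, +clip_bound]
        return grad.clamp(min=-clip_bound, max=clip_bound)
    if clip_type == "norm":
        # rescale the whole gradient so that its L2 norm does not exceed clip_bound
        norm = grad.norm()
        return grad * (clip_bound / norm) if norm > clip_bound else grad
    raise ValueError(f"unknown clip_type: {clip_type}")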
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP/asdl/examples/arxiv_results
$ CUDA_VISIBLE_DEVICES=0 python train.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--folder @LOGGING_FOLDER \
--ngrads 1024 \
--momentum 0 \
--dataset cifar10 \
--optim <kgmfac OR lrmfac> \
--k 0.01 \
--rank 4 \
--epochs 20 \
--batch_size 32 \
--model rn18 \
--weight_decay 0.0005 \
--ignore_bn_ln_type all \
--lr 0.03 \
--clip_type norm \
--clip_bound 10 \
--damp 1e-05 \
--seed 1
We use the HuggingFace repository referenced in the original M-FAC paper and integrate Sparse M-FAC to experiment with Question Answering and Text Classification. The following commands can be used to reproduce our experiments for QA and GLUE using the parameters from Appendix D of our paper.
Instructions for GLUE/MNLI. Run Sparse-MFAC on BERT-Base:
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP/huggingface/examples/MFAC_optim
python run_glue.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--output_dir @OUTPUT_DIR \
--seed @SEED \
--logging_strategy steps \
--logging_steps 10 \
--model_name_or_path bert-base \
--task_name mnli \
--num_train_epochs 3 \
--optim <kgmfac OR lrmfac> \
--k 0.01 \
--rank 4 \
--lr 2e-5 \
--damp 5e-5 \
--ngrads 1024
All available arguments are defined in the following classes:
- ModelArguments
- DataTrainingArguments
- TrainingArguments
- CustomArgs: stores our arguments for the M-FAC optimizers; we use lr from here instead of learning_rate from DataTrainingArguments
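For reference, below is a minimal sketch of how a custom dataclass such as CustomArgs can be parsed alongside the standard HuggingFace classes via HfArgumentParser; the field names are illustrative, not the repository's exact definitions:
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class CustomArgs:
    # illustrative fields only; see the repository for the real definitions
    optim: str = field(default="kgmfac")
    lr: float = field(default=2e-5)   # used instead of the standard learning_rate
    k: float = field(default=0.01)
    rank: int = field(default=4)
    ngrads: int = field(default=1024)
    damp: float = field(default=5e-5)

if __name__ == "__main__":
    # example: python parse_args_sketch.py --output_dir /tmp/out --optim kgmfac --lr 2e-5
    parser = HfArgumentParser((TrainingArguments, CustomArgs))
    training_args, custom_args = parser.parse_args_into_dataclasses()
    print(custom_args)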
Other useful parameters for the run_glue.py script:
--do_train
--do_eval
--do_predict
--max_seq_length 128
--per_device_train_batch_size 32
--overwrite_output_dir
--save_strategy epoch # instead of logging_strategy and logging_steps that we used
--save_total_limit 1
Instructions for QA/SquadV2. Run Sparse-MFAC on BERT-Base:
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP/huggingface/examples/MFAC_optim
python run_qa.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--output_dir @OUTPUT_DIR \
--seed @SEED \
--logging_strategy steps \
--logging_steps 10 \
--model_name_or_path bert-base \
--num_train_epochs 2 \
--optim <kgmfac OR lrmfac> \
--k 0.01 \
--rank 4 \
--ngrads 1024 \
--lr 3e-5 \
--damp 5e-5
We use our own training pipeline to train a small ResNet-20 on CIFAR-10 and to run our linear probing experiment, which uses Logistic Regression on a synthetic dataset. The notations for the hyper-parameters are introduced in the first paragraph of the Appendix.
CIFAR-10 / ResNet-20 (272k params). For these particular experiments, check the parameters in Appendix C of the paper and match them with the ones in ~/EFCP/args/args_mfac.py. Follow these short instructions to run the Top-K or Low-Rank strategies (a small illustrative sketch of the two compression schemes is given after the list):
- S-MFAC (Top-K compression): use --optim kgmfac & --k 0.01 (the parameter --rank will be ignored)
- LR-MFAC (Low-Rank compression): use --optim lrmfac & --rank 1 (the parameter --k will be ignored)
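For intuition, the following is a minimal, illustrative sketch of the two compression schemes applied to a flattened gradient, with error feedback accumulating whatever the compressor discards; it is not the repository's implementation and the helper names are made up:
import torch

def topk_compress(acc, k_ratio=0.01):
    # keep only the largest-magnitude entries (Top-K), zero out the rest
    k = max(1, int(k_ratio * acc.numel()))
    _, idx = torch.topk(acc.abs(), k)
    out = torch.zeros_like(acc)
    out[idx] = acc[idx]
    return out

def lowrank_compress(acc, shape, rank=1):
    # project the gradient, viewed as a matrix, onto its top singular directions
    mat = acc.view(shape)
    U, S, Vh = torch.linalg.svd(mat, full_matrices=False)
    approx = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
    return approx.flatten()

def compress_with_error_feedback(grad, error, compress_fn):
    # error feedback: add back the previously discarded residual, compress,
    # then store the new residual for the next step
    acc = grad + error
    compressed = compress_fn(acc)
    return compressed, acc - compressed

# usage sketch with Top-K compression
g, err = torch.randn(4096), torch.zeros(4096)
c, err = compress_with_error_feedback(g, err, lambda a: topk_compress(a, k_ratio=0.01))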
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ cd ~/EFCP
python main.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--seed @SEED \
--root_folder @EXPERIMENT_FOLDER \
--dataset_path @PATH_TO_DATASET \
--dataset_name cifar10 \
--model rn20 \
--epochs 164 \
--batch_size 128 \
--lr_sched step \
--optim <kgmfac OR lrmfac> \
--k 0.01 \
--rank 4 \
--ngrads 1024 \
--lr 1e-3 \
--damp 1e-4 \
--weight_decay 1e-4 \
--momentum 0 \
--wd_type wd
Logistic Regression / Synthetic Data. For this experiment we use the same main.py script with the hyper-parameters from Appendix A of our paper. The dataset we used is publicly available here. Below we only present the script to run Sparse GGT. To run other optimizers, please have a look at the get_optimizer method in helpers/training.py and at the get_arg_parse method in args/args_mfac.py, which stores the command-line arguments.
$ export EFCP_ROOT=~/EFCP # the root folder will be added as a library path
$ CUDA_VISIBLE_DEVICES=0 python main.py \
--wandb_project @WANDB_PROJECT \
--wandb_group @WANDB_GROUP \
--wandb_job_type @WANDB_JOB_TYPE \
--wandb_name @WANDB_NAME \
--seed @SEED \
--root_folder @EXPERIMENT_FOLDER \
--dataset_path @PATH_TO_RN50x16-openai-imagenet1k \
--dataset_name rn50x16openai \
--model logreg \
--epochs 10 \
--batch_size 128 \
--lr_sched cos \
--optim ksggt \
--k 0.01 \
--ngrads 100 \
--lr 1 \
--weight_decay 0 \
--ggt_beta1 0 \
--ggt_beta2 1 \
--ggt_eps 1e-05
We describe the preconditioning quantification in Section 6 of our paper. We use the quantify_preconditioning method to compute the metrics for scaling and rotation, which requires the raw gradient g and the preconditioned gradient u. Note that calling this method at every time step for large models (such as BERT-Base) slows down training considerably because the operations are performed on large tensors. Moreover, the quantiles are computed in numpy because pytorch raises an error when calling the quantile function on large tensors.
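As a rough illustration of the kind of metrics one can derive from g and u (the exact metrics used in the paper may differ), the sketch below uses the cosine similarity to capture rotation and the quantiles of the element-wise magnitude ratio to capture scaling:
import numpy as np
import torch

def quantify_preconditioning_sketch(g, u, eps=1e-12):
    # rotation: cosine similarity between the raw gradient g and the preconditioned gradient u
    cos = torch.dot(g.flatten(), u.flatten()) / (g.norm() * u.norm() + eps)
    # scaling: quantiles of the element-wise magnitude ratio |u| / |g|,
    # computed in numpy because torch.quantile can fail for very large tensors
    ratio = (u.abs() / (g.abs() + eps)).flatten().cpu().numpy()
    return cos.item(), np.quantile(ratio, [0.25, 0.50, 0.75])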
If you found our work useful, please consider citing:
@misc{modoranu2023error,
title={Error Feedback Can Accurately Compress Preconditioners},
author={Ionut-Vlad Modoranu and Aleksei Kalinov and Eldar Kurtic and Dan Alistarh},
year={2023},
eprint={2306.06098},
archivePrefix={arXiv},
primaryClass={cs.LG}
}