pytorch-optimizer

Build
Quality
Package
Status
License

pytorch-optimizer is optimizer & lr scheduler collections in PyTorch.

I just re-implemented (speed & memory tweaks, plug-ins) the algorithm while based on the original paper. Also, It includes useful and practical optimization ideas.

Currently, 43 optimizers, 6 lr schedulers are supported!

Highly inspired by pytorch-optimizer.

Getting Started

For more, see the documentation.

Most optimizers are under MIT or Apache 2.0 license, but a few optimizers like Fromage have BY-NC-SA 4.0 license, which is non-commercial. So, please double-check the license before using it at your work.

Installation

$ pip3 install -U pytorch-optimizer

If there's a version issue when installing the package, try with --no-deps option.

$ pip3 install -U --no-deps pytorch-optimizer

Simple Usage

from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters())

# or you can use optimizer loader, simply passing a name of the optimizer.

from pytorch_optimizer import load_optimizer

model = YourModel()
opt = load_optimizer(optimizer='adamp')
optimizer = opt(model.parameters())

Also, you can load the optimizer via torch.hub

import torch

model = YourModel()
opt = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt(model.parameters())

If you want to build the optimizer with parameters & configs, there's create_optimizer() API.

from pytorch_optimizer import create_optimizer

optimizer = create_optimizer(
    model,
    'adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
)

Supported Optimizers

You can check the supported optimizers & lr schedulers.

from pytorch_optimizer import get_supported_optimizers, get_supported_lr_schedulers

supported_optimizers = get_supported_optimizers()
supported_lr_schedulers = get_supported_lr_schedulers()

Optimizer	Description	Official Code	Paper
AdaBelief	Adapting Step-sizes by the Belief in Observed Gradients	github	https://arxiv.org/abs/2010.07468
AdaBound	Adaptive Gradient Methods with Dynamic Bound of Learning Rate	github	https://openreview.net/forum?id=Bkg3g2R9FX
AdaHessian	An Adaptive Second Order Optimizer for Machine Learning	github	https://arxiv.org/abs/2006.00719
AdamD	Improved bias-correction in Adam		https://arxiv.org/abs/2110.10828
AdamP	Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights	github	https://arxiv.org/abs/2006.08217
diffGrad	An Optimization Method for Convolutional Neural Networks	github	https://arxiv.org/abs/1909.11015v3
MADGRAD	A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic	github	https://arxiv.org/abs/2101.11075
RAdam	On the Variance of the Adaptive Learning Rate and Beyond	github	https://arxiv.org/abs/1908.03265
Ranger	a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer	github	https://bit.ly/3zyspC3
Ranger21	a synergistic deep learning optimizer	github	https://arxiv.org/abs/2106.13731
Lamb	Large Batch Optimization for Deep Learning	github	https://arxiv.org/abs/1904.00962
Shampoo	Preconditioned Stochastic Tensor Optimization	github	https://arxiv.org/abs/1802.09568
Nero	Learning by Turning: Neural Architecture Aware Optimisation	github	https://arxiv.org/abs/2102.07227
Adan	Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models	github	https://arxiv.org/abs/2208.06677
Adai	Disentangling the Effects of Adaptive Learning Rate and Momentum	github	https://arxiv.org/abs/2006.15815
GSAM	Surrogate Gap Guided Sharpness-Aware Minimization	github	https://openreview.net/pdf?id=edONMAnhLu-
D-Adaptation	Learning-Rate-Free Learning by D-Adaptation	github	https://arxiv.org/abs/2301.07733
AdaFactor	Adaptive Learning Rates with Sublinear Memory Cost	github	https://arxiv.org/abs/1804.04235
Apollo	An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization	github	https://arxiv.org/abs/2009.13586
NovoGrad	Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks	github	https://arxiv.org/abs/1905.11286
Lion	Symbolic Discovery of Optimization Algorithms	github	https://arxiv.org/abs/2302.06675
Ali-G	Adaptive Learning Rates for Interpolation with Gradients	github	https://arxiv.org/abs/1906.05661
SM3	Memory-Efficient Adaptive Optimization	github	https://arxiv.org/abs/1901.11150
AdaNorm	Adaptive Gradient Norm Correction based Optimizer for CNNs	github	https://arxiv.org/abs/2210.06364
RotoGrad	Gradient Homogenization in Multitask Learning	github	https://openreview.net/pdf?id=T8wHz4rnuGL
A2Grad	Optimal Adaptive and Accelerated Stochastic Gradient Descent	github	https://arxiv.org/abs/1810.00553
AccSGD	Accelerating Stochastic Gradient Descent For Least Squares Regression	github	https://arxiv.org/abs/1704.08227
SGDW	Decoupled Weight Decay Regularization	github	https://arxiv.org/abs/1711.05101
ASGD	Adaptive Gradient Descent without Descent	github	https://arxiv.org/abs/1910.09529
Yogi	Adaptive Methods for Nonconvex Optimization		NIPS 2018
SWATS	Improving Generalization Performance by Switching from Adam to SGD		https://arxiv.org/abs/1712.07628
Fromage	On the distance between two neural networks and the stability of learning	github	https://arxiv.org/abs/2002.03432
MSVAG	Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients	github	https://arxiv.org/abs/1705.07774
AdaMod	An Adaptive and Momental Bound Method for Stochastic Learning	github	https://arxiv.org/abs/1910.12249
AggMo	Aggregated Momentum: Stability Through Passive Damping	github	https://arxiv.org/abs/1804.00325
QHAdam	Quasi-hyperbolic momentum and Adam for deep learning	github	https://arxiv.org/abs/1810.06801
PID	A PID Controller Approach for Stochastic Optimization of Deep Networks	github	CVPR 18

Useful Resources

Several optimization ideas to regularize & stabilize the training. Most of the ideas are applied in Ranger21 optimizer.

Also, most of the captures are taken from Ranger21 paper.

Adaptive Gradient Clipping	Gradient Centralization	Softplus Transformation
Gradient Normalization	Norm Loss	Positive-Negative Momentum
Linear learning rate warmup	Stable weight decay	Explore-exploit learning rate schedule
Lookahead	Chebyshev learning rate schedule	(Adaptive) Sharpness-Aware Minimization
On the Convergence of Adam and Beyond	Gradient Surgery for Multi-Task Learning

Adaptive Gradient Clipping

This idea originally proposed in NFNet (Normalized-Free Network) paper.

AGC (Adaptive Gradient Clipping) clips gradients based on the unit-wise ratio of gradient norms to parameter norms.

code : github
paper : arXiv

Gradient Centralization

Gradient Centralization (GC) operates directly on gradients by centralizing the gradient to have zero mean.

code : github
paper : arXiv

Softplus Transformation

By running the final variance denom through the softplus function, it lifts extremely tiny values to keep them viable.

paper : arXiv

Gradient Normalization

Norm Loss

paper : arXiv

Positive-Negative Momentum

code : github
paper : arXiv

Linear learning rate warmup

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/linear_lr_warmup.png

paper : arXiv

Stable weight decay

code : github
paper : arXiv

Explore-exploit learning rate schedule

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/explore_exploit_lr_schedule.png

code : github
paper : arXiv

Lookahead

k steps forward, 1 step back. Lookahead consisting of keeping an exponential moving average of the weights that is

updated and substituted to the current weights every k_{lookahead} steps (5 by default).

code : github
paper : arXiv

Chebyshev learning rate schedule

Acceleration via Fractal Learning Rate Schedules

paper : arXiv

(Adaptive) Sharpness-Aware Minimization

Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value and loss sharpness.

In particular, it seeks parameters that lie in neighborhoods having uniformly low loss.

SAM paper : paper
ASAM paper : paper
A/SAM code : github

On the Convergence of Adam and Beyond

paper : paper

Gradient Surgery for Multi-Task Learning

paper : paper

Citations

AdamP

Adaptive Gradient Clipping

Chebyshev LR Schedules

Gradient Centralization

Lookahead

RAdam

Norm Loss

Positive-Negative Momentum

Explore-Exploit Learning Rate Schedule

On the adequacy of untuned warmup for adaptive optimization

Stable weight decay regularization

Softplus transformation

Sharpness-aware minimization

Adaptive Sharpness-aware minimization

diffGrad

On the Convergence of Adam and Beyond

Gradient surgery for multi-task learning

Citation

Please cite original authors of optimization algorithms. If you use this software, please cite it as below. Or you can get from "cite this repository" button.

@software{Kim_pytorch_optimizer_Optimizer_and_2022,
    author = {Kim, Hyeongchan},
    month = {1},
    title = {{pytorch_optimizer: optimizer and lr scheduler collections in PyTorch}},
    version = {1.0.0},
    year = {2022}
}

Author

Hyeongchan Kim / @kozistr

Name		Name	Last commit message	Last commit date
Latest commit History 2,256 Commits
.github		.github
assets		assets
docs		docs
pytorch_optimizer		pytorch_optimizer
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
CITATION.cff		CITATION.cff
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
Makefile		Makefile
README.rst		README.rst
hubconf.py		hubconf.py
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pytorch-optimizer

Getting Started

Installation

Simple Usage

Supported Optimizers

Useful Resources

Adaptive Gradient Clipping

Gradient Centralization

Softplus Transformation

Gradient Normalization

Norm Loss

Positive-Negative Momentum

Linear learning rate warmup

Stable weight decay

Explore-exploit learning rate schedule

Lookahead

Chebyshev learning rate schedule

(Adaptive) Sharpness-Aware Minimization

On the Convergence of Adam and Beyond

Gradient Surgery for Multi-Task Learning

Citations

Citation

Author

About

Releases

Packages

Languages

License

ChenzhouWeiYu/pytorch_optimizer

Folders and files

Latest commit

History

Repository files navigation

pytorch-optimizer

Getting Started

Installation

Simple Usage

Supported Optimizers

Useful Resources

Adaptive Gradient Clipping

Gradient Centralization

Softplus Transformation

Gradient Normalization

Norm Loss

Positive-Negative Momentum

Linear learning rate warmup

Stable weight decay

Explore-exploit learning rate schedule

Lookahead

Chebyshev learning rate schedule

(Adaptive) Sharpness-Aware Minimization

On the Convergence of Adam and Beyond

Gradient Surgery for Multi-Task Learning

Citations

Citation

Author

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages