
pytorch-optimizer

pytorch-optimizer is a collection of optimizers and learning-rate (lr) schedulers for PyTorch.
Each algorithm is re-implemented from its original paper, with speed and memory tweaks and plug-in features. The library also includes useful and practical optimization ideas.
Currently, 43 optimizers and 6 lr schedulers are supported!

Highly inspired by pytorch-optimizer.

Getting Started

For more, see the documentation.

Most optimizers are under the MIT or Apache 2.0 license, but a few, such as Fromage, are under the CC BY-NC-SA 4.0 license, which is non-commercial. Please double-check the license before using an optimizer in your work.

Installation

$ pip3 install -U pytorch-optimizer

If there's a version issue when installing the package, try the --no-deps option.

$ pip3 install -U --no-deps pytorch-optimizer

Simple Usage

from pytorch_optimizer import AdamP

model = YourModel()
optimizer = AdamP(model.parameters())

# or you can use the optimizer loader by simply passing the name of the optimizer.

from pytorch_optimizer import load_optimizer

model = YourModel()
opt = load_optimizer(optimizer='adamp')
optimizer = opt(model.parameters())

You can also load the optimizer via torch.hub.

import torch

model = YourModel()
opt = torch.hub.load('kozistr/pytorch_optimizer', 'adamp')
optimizer = opt(model.parameters())

If you want to build the optimizer with parameters & configs, there's the create_optimizer() API.

from pytorch_optimizer import create_optimizer

optimizer = create_optimizer(
    model,
    'adamp',
    lr=1e-3,
    weight_decay=1e-3,
    use_gc=True,
    use_lookahead=True,
)

Supported Optimizers

You can check the supported optimizers & lr schedulers.

from pytorch_optimizer import get_supported_optimizers, get_supported_lr_schedulers

supported_optimizers = get_supported_optimizers()
supported_lr_schedulers = get_supported_lr_schedulers()

| Optimizer | Description | Official Code | Paper |
|-----------|-------------|---------------|-------|
| AdaBelief | Adapting Step-sizes by the Belief in Observed Gradients | github | https://arxiv.org/abs/2010.07468 |
| AdaBound | Adaptive Gradient Methods with Dynamic Bound of Learning Rate | github | https://openreview.net/forum?id=Bkg3g2R9FX |
| AdaHessian | An Adaptive Second Order Optimizer for Machine Learning | github | https://arxiv.org/abs/2006.00719 |
| AdamD | Improved bias-correction in Adam | | https://arxiv.org/abs/2110.10828 |
| AdamP | Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | github | https://arxiv.org/abs/2006.08217 |
| diffGrad | An Optimization Method for Convolutional Neural Networks | github | https://arxiv.org/abs/1909.11015v3 |
| MADGRAD | A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization | github | https://arxiv.org/abs/2101.11075 |
| RAdam | On the Variance of the Adaptive Learning Rate and Beyond | github | https://arxiv.org/abs/1908.03265 |
| Ranger | a synergistic optimizer combining RAdam and LookAhead, and now GC in one optimizer | github | https://bit.ly/3zyspC3 |
| Ranger21 | a synergistic deep learning optimizer | github | https://arxiv.org/abs/2106.13731 |
| Lamb | Large Batch Optimization for Deep Learning | github | https://arxiv.org/abs/1904.00962 |
| Shampoo | Preconditioned Stochastic Tensor Optimization | github | https://arxiv.org/abs/1802.09568 |
| Nero | Learning by Turning: Neural Architecture Aware Optimisation | github | https://arxiv.org/abs/2102.07227 |
| Adan | Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models | github | https://arxiv.org/abs/2208.06677 |
| Adai | Disentangling the Effects of Adaptive Learning Rate and Momentum | github | https://arxiv.org/abs/2006.15815 |
| GSAM | Surrogate Gap Guided Sharpness-Aware Minimization | github | https://openreview.net/pdf?id=edONMAnhLu- |
| D-Adaptation | Learning-Rate-Free Learning by D-Adaptation | github | https://arxiv.org/abs/2301.07733 |
| AdaFactor | Adaptive Learning Rates with Sublinear Memory Cost | github | https://arxiv.org/abs/1804.04235 |
| Apollo | An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | github | https://arxiv.org/abs/2009.13586 |
| NovoGrad | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | github | https://arxiv.org/abs/1905.11286 |
| Lion | Symbolic Discovery of Optimization Algorithms | github | https://arxiv.org/abs/2302.06675 |
| Ali-G | Adaptive Learning Rates for Interpolation with Gradients | github | https://arxiv.org/abs/1906.05661 |
| SM3 | Memory-Efficient Adaptive Optimization | github | https://arxiv.org/abs/1901.11150 |
| AdaNorm | Adaptive Gradient Norm Correction based Optimizer for CNNs | github | https://arxiv.org/abs/2210.06364 |
| RotoGrad | Gradient Homogenization in Multitask Learning | github | https://openreview.net/pdf?id=T8wHz4rnuGL |
| A2Grad | Optimal Adaptive and Accelerated Stochastic Gradient Descent | github | https://arxiv.org/abs/1810.00553 |
| AccSGD | Accelerating Stochastic Gradient Descent For Least Squares Regression | github | https://arxiv.org/abs/1704.08227 |
| SGDW | Decoupled Weight Decay Regularization | github | https://arxiv.org/abs/1711.05101 |
| ASGD | Adaptive Gradient Descent without Descent | github | https://arxiv.org/abs/1910.09529 |
| Yogi | Adaptive Methods for Nonconvex Optimization | | NIPS 2018 |
| SWATS | Improving Generalization Performance by Switching from Adam to SGD | | https://arxiv.org/abs/1712.07628 |
| Fromage | On the distance between two neural networks and the stability of learning | github | https://arxiv.org/abs/2002.03432 |
| MSVAG | Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | github | https://arxiv.org/abs/1705.07774 |
| AdaMod | An Adaptive and Momental Bound Method for Stochastic Learning | github | https://arxiv.org/abs/1910.12249 |
| AggMo | Aggregated Momentum: Stability Through Passive Damping | github | https://arxiv.org/abs/1804.00325 |
| QHAdam | Quasi-hyperbolic momentum and Adam for deep learning | github | https://arxiv.org/abs/1810.06801 |
| PID | A PID Controller Approach for Stochastic Optimization of Deep Networks | github | CVPR 18 |

Useful Resources

Several optimization ideas to regularize & stabilize training. Most of these ideas are applied in the Ranger21 optimizer.

Also, most of the figures are taken from the Ranger21 paper.

  • Adaptive Gradient Clipping
  • Gradient Centralization
  • Softplus Transformation
  • Gradient Normalization
  • Norm Loss
  • Positive-Negative Momentum
  • Linear learning rate warmup
  • Stable weight decay
  • Explore-exploit learning rate schedule
  • Lookahead
  • Chebyshev learning rate schedule
  • (Adaptive) Sharpness-Aware Minimization
  • On the Convergence of Adam and Beyond
  • Gradient Surgery for Multi-Task Learning

Adaptive Gradient Clipping

This idea was originally proposed in the NFNet (Normalizer-Free Networks) paper.
AGC (Adaptive Gradient Clipping) clips gradients based on the unit-wise ratio of gradient norms to parameter norms.
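
A minimal sketch of the idea in plain PyTorch (not this library's implementation): the NFNet paper clips with unit-wise norms (e.g. per output row), which this per-tensor version simplifies for brevity; clip_factor and eps are illustrative defaults.

import torch

def adaptive_gradient_clipping(parameters, clip_factor=0.01, eps=1e-3):
    # clip each gradient whose norm exceeds clip_factor times the
    # corresponding parameter norm; eps keeps near-zero parameters clippable
    for p in parameters:
        if p.grad is None:
            continue
        param_norm = p.detach().norm(2).clamp_(min=eps)
        grad_norm = p.grad.detach().norm(2)
        max_norm = param_norm * clip_factor
        if grad_norm > max_norm:
            # rescale in place so the gradient norm equals max_norm
            p.grad.detach().mul_(max_norm / (grad_norm + 1e-6))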

Gradient Centralization

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/gradient_centralization.png

Gradient Centralization (GC) operates directly on gradients by centralizing the gradient to have zero mean.
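
A minimal sketch, following the standard formulation from the GC paper: for any weight tensor with more than one dimension, subtract the mean computed over all dimensions except the first.

import torch

def centralize_gradient(grad: torch.Tensor) -> torch.Tensor:
    # subtract the mean over all dims except the output dim, so each
    # slice of the gradient has zero mean; 1-D tensors (biases) pass through
    if grad.dim() > 1:
        return grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
    return grad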

Softplus Transformation

By running the final variance denominator through the softplus function, extremely tiny values are lifted so the update stays numerically viable.
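
A hedged sketch of the transformation on an Adam-style denominator; exp_avg_sq here is a hypothetical second-moment buffer, and beta=50 is a commonly used value for this transformation.

import torch
import torch.nn.functional as F

exp_avg_sq = torch.rand(10)            # hypothetical second-moment EMA
de_nom = exp_avg_sq.sqrt().add_(1e-8)  # usual Adam-style denominator

# softplus lifts near-zero denominators toward log(2) / beta while
# leaving larger values almost unchanged, bounding the step size
de_nom = F.softplus(de_nom, beta=50.0)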

Gradient Normalization

Norm Loss

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/norm_loss.png

Positive-Negative Momentum

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/positive_negative_momentum.png

Linear learning rate warmup

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/linear_lr_warmup.png

Stable weight decay

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/stable_weight_decay.png

Explore-exploit learning rate schedule

https://raw.githubusercontent.com/kozistr/pytorch_optimizer/main/assets/explore_exploit_lr_schedule.png

Lookahead

k steps forward, 1 step back. Lookahead keeps an exponential moving average of the weights, which is
updated and substituted for the current weights every k_{lookahead} steps (5 by default).
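
A minimal training-loop sketch of that update rule; model, loader, criterion, and optimizer are assumed to exist, and k / alpha mirror the defaults above.

import torch

k, alpha = 5, 0.5  # lookahead period and slow-weight interpolation factor
slow_weights = [p.detach().clone() for p in model.parameters()]

for step, (x, y) in enumerate(loader):
    criterion(model(x), y).backward()
    optimizer.step()
    optimizer.zero_grad()
    if (step + 1) % k == 0:
        with torch.no_grad():
            for slow, fast in zip(slow_weights, model.parameters()):
                slow.add_(fast - slow, alpha=alpha)  # slow += alpha * (fast - slow)
                fast.copy_(slow)                     # substitute slow weights back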

Chebyshev learning rate schedule

Acceleration via Fractal Learning Rate Schedules

(Adaptive) Sharpness-Aware Minimization

Sharpness-Aware Minimization (SAM) simultaneously minimizes loss value and loss sharpness.
In particular, it seeks parameters that lie in neighborhoods having uniformly low loss.
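
A sketch of the two-step usage pattern, assuming the SAM wrapper exported by this package follows the common first_step / second_step API; model, loader, and criterion are placeholders.

import torch
from pytorch_optimizer import SAM

model = YourModel()
criterion = torch.nn.CrossEntropyLoss()
optimizer = SAM(model.parameters(), torch.optim.SGD, lr=0.1, momentum=0.9)

for x, y in loader:
    # first pass: ascend to the worst-case weights in the neighborhood
    criterion(model(x), y).backward()
    optimizer.first_step(zero_grad=True)
    # second pass: descend using gradients taken at the perturbed point
    criterion(model(x), y).backward()
    optimizer.second_step(zero_grad=True)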

On the Convergence of Adam and Beyond

Gradient Surgery for Multi-Task Learning

Citations

  • AdamP
  • Adaptive Gradient Clipping
  • Chebyshev LR Schedules
  • Gradient Centralization
  • Lookahead
  • RAdam
  • Norm Loss
  • Positive-Negative Momentum
  • Explore-Exploit Learning Rate Schedule
  • On the Adequacy of Untuned Warmup for Adaptive Optimization
  • Stable Weight Decay Regularization
  • Softplus Transformation
  • MADGRAD
  • AdaHessian
  • AdaBound
  • AdaBelief
  • Sharpness-Aware Minimization
  • Adaptive Sharpness-Aware Minimization
  • diffGrad
  • On the Convergence of Adam and Beyond
  • Gradient Surgery for Multi-Task Learning
  • AdamD
  • Shampoo
  • Nero
  • Adan
  • Adai
  • GSAM
  • D-Adaptation
  • AdaFactor
  • Apollo
  • NovoGrad
  • Lion
  • Ali-G
  • SM3
  • AdaNorm
  • RotoGrad
  • A2Grad
  • AccSGD
  • SGDW
  • Adaptive SGD
  • Yogi
  • SWATS
  • Fromage
  • MSVAG
  • AdaMod
  • AggMo
  • QHAdam
  • PID

Citation

Please cite the original authors of the optimization algorithms. If you use this software, please cite it as below, or use the "Cite this repository" button on GitHub.

@software{Kim_pytorch_optimizer_Optimizer_and_2022,
    author = {Kim, Hyeongchan},
    month = {1},
    title = {{pytorch_optimizer: optimizer and lr scheduler collections in PyTorch}},
    version = {1.0.0},
    year = {2022}
}

Author

Hyeongchan Kim / @kozistr
