🦁 Lion - Pytorch
Lion, which stands for EvoLved Sign Momentum, is an optimizer discovered by Google Brain through program search, intended to outperform the widely used AdamW optimizer. This repository provides a PyTorch implementation, adapted from the official Google source with slight modifications for accessibility and ease of use when training models.
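To make "sign momentum" concrete, here is a minimal sketch of the Lion update rule from the paper, written in plain PyTorch. This is an illustration of the algorithm, not the library's actual implementation; the function name and arguments are ours:

```python
import torch

def lion_update(param, grad, exp_avg, lr=1e-4, wd=1e-2, beta1=0.9, beta2=0.99):
    # one Lion step for a single parameter tensor (sketch only)
    # decoupled weight decay, as in AdamW
    param.mul_(1 - lr * wd)
    # the update direction is the sign of an interpolation between
    # the momentum buffer and the current gradient
    update = exp_avg.mul(beta1).add(grad, alpha=1 - beta1).sign_()
    param.add_(update, alpha=-lr)
    # the momentum buffer is then updated with a second factor, beta2
    exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)
```

Because only the sign of the update is applied, every coordinate moves by the same magnitude lr (before weight decay), which is why Lion typically wants a smaller learning rate than AdamW.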
Key Insights
- Learning Rate and Weight Decay: The authors suggest a learning rate 3-10 times smaller than AdamW's, with the weight decay λ made 3-10 times larger so the effective regularization strength (the product lr · λ) stays comparable. When porting a learning rate schedule from AdamW, scale its initial, peak, and end values by the same factor (see the sketch after this list).
- Learning Rate Schedule: Lion uses the same schedule type as AdamW, though research shows that a cosine decay schedule yields better results than a reciprocal square-root schedule when training Vision Transformers (ViT).
- β1 and β2 Parameters: In contrast to AdamW's defaults of β1 = 0.9 and β2 = 0.999, Lion defaults to 0.9 and 0.99, respectively. Adjusting these to β1 = 0.95 and β2 = 0.98 may improve stability during training.
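The rules of thumb above can be combined into a single sketch. The AdamW values below are assumptions chosen for illustration, not recommendations from the paper:

```python
import torch
from torch import nn
from lion_pytorch import Lion

model = nn.Linear(10, 1)

# a hypothetical AdamW configuration to port from
adamw_lr, adamw_wd = 3e-4, 1e-2

opt = Lion(
    model.parameters(),
    lr=adamw_lr / 3,              # learning rate 3-10x smaller
    weight_decay=adamw_wd * 3,    # weight decay 3-10x larger
    betas=(0.95, 0.98)            # more conservative betas for stability
)

# cosine decay schedule, as suggested for ViT training
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10_000)
```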
Project Updates
Numerous updates have been documented:
- Initial positive results in language modeling.
- When run at the same learning rate as Adam, Lion underperforms.
- With a reduced learning rate (by a factor of 3), Lion shows promising early results, hinting it could surpass Adam.
- Using a learning rate reduced by a factor of 10 results in poorer performance, suggesting further tuning is necessary.
- Positive outcomes are reported in language modeling when hyperparameters are set appropriately, with similarly positive results in text-to-image training, though tuning is necessary. Challenges persist in domains beyond those evaluated in the paper, such as reinforcement learning or certain neural architectures.
- A particular issue related to training larger models was resolved by adjusting initial temperature settings, improving Lion's outcomes.
- Lion is recommended for settings with large batch sizes (64 or above).
Installation
To install Lion for PyTorch:

```bash
$ pip install lion-pytorch
```
Basic Usage
A minimal example of using Lion with a toy model:

```python
import torch
from torch import nn

from lion_pytorch import Lion

# toy model
model = nn.Linear(10, 1)

opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)

# forward and backward pass; the single-element output stands in for a loss
loss = model(torch.randn(10))
loss.backward()

# optimizer step
opt.step()
opt.zero_grad()
```
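For context, a complete (if toy) training loop with Lion looks like any other PyTorch optimizer loop; the data, batch size, and loss function here are placeholders:

```python
import torch
from torch import nn
from lion_pytorch import Lion

model = nn.Linear(10, 1)
opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)
loss_fn = nn.MSELoss()

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in batch
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```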
To use a fused Triton kernel for the parameter updates, a more efficient option on supported GPUs, first install Triton:

```bash
$ pip install triton -U --pre
```

Then pass `use_triton=True` to the optimizer:
```python
opt = Lion(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-2,
    use_triton=True  # performs the update step with a fused Triton kernel
)
```
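Since the Triton path requires a CUDA-capable GPU, one defensive pattern (our suggestion, not part of the library) is to enable it conditionally:

```python
import torch
from torch import nn
from lion_pytorch import Lion

model = nn.Linear(10, 1)

# fall back to the pure-PyTorch update when no GPU is available
opt = Lion(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-2,
    use_triton=torch.cuda.is_available()
)
```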
Appreciation and Acknowledgments
Thanks go to Stability.ai for their generous support of this cutting-edge artificial intelligence research.
Citations
The Lion optimizer was introduced in the following publication:
```bibtex
@misc{https://doi.org/10.48550/arxiv.2302.06675,
    url       = {https://arxiv.org/abs/2302.06675},
    author    = {Chen, Xiangning and Liang, Chen and Huang, Da and Real, Esteban and Wang, Kaiyuan and Liu, Yao and Pham, Hieu and Dong, Xuanyi and Luong, Thang and Hsieh, Cho-Jui and Lu, Yifeng and Le, Quoc V.},
    title     = {Symbolic Discovery of Optimization Algorithms},
    publisher = {arXiv},
    year      = {2023}
}
```
Lion is presented as a promising alternative to traditional optimizers, warranting exploration and experimentation within various neural network training contexts.