arxiv:2606.12883

The Hidden Power of Scaling Factor in LoRA Optimization

Published on Jun 11

· Submitted by

zicheng zhang on Jun 15

Authors:

Zicheng Zhang

Abstract

The scaling factor α in Low-Rank Adaptation (LoRA) acts as a primary driver of optimization rather than a secondary complement to the learning rate. Theoretical and empirical analysis shows that increasing α improves convergence more than increasing the learning rate does, and that the optimal α scales sublinearly with rank.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

In Low-Rank Adaptation (LoRA), the scaling factor α is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we reveal that the scaling factor α and the learning rate function differently, with α emerging as the dominant driver of effective optimization, delivering gains that cannot be replicated by learning rate scaling alone. Combining extensive empirical analysis with a theoretical Signal-Drift framework, we report three findings about LoRA's scaling mechanism. First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when leveraging this smoothness to accelerate convergence, α outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, which shows that existing rank-tied heuristics scale α insufficiently. Based on these insights, we propose LoRA-α, a minimalist framework that restores α to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-α consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.

Community

zhangzc99

Paper author Paper submitter about 15 hours ago

Maybe the first systematic study (empirical + theoretical) of LoRA's scaling factor $\alpha$ from an optimization perspective!

Recent studies highlight that a large learning rate ($\eta$) is crucial for LoRA optimization. However, this paper points out that this conclusion was drawn while the scaling factor $\alpha$ remained systematically underexplored. Through a joint empirical and theoretical lens, the authors argue for a shift in emphasis: a substantially larger scaling factor $\alpha$ is what matters most, delivering optimization gains that learning rate scaling alone cannot replicate.

Key takeaways:

- LoRA's low-rank nature smooths the optimization landscape (spectral suppression), which makes standard hyperparameters overly conservative and causes an optimization gap.
- $\alpha$ vs $\eta$: Increasing $\alpha$ acts as a "purity-preserving accelerator": it amplifies the task signal without increasing the bilinear drift, and so outperforms learning rate scaling.
- Under standard, small learning rates, the optimal $\alpha$ is large and follows a sublinear relationship with rank ($r$). Popular rank-tied heuristics (like $\alpha = r$ or $2r$) are therefore too small and leave LoRA severely under-scaled.
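
One way to see why $\alpha$ and $\eta$ need not be interchangeable is to expand a single update of the two LoRA factors. The sketch below is only an illustration and is not the paper's Signal-Drift framework. It assumes the weight change is `scale * B @ A` (with `scale` standing in for whatever multiplier $\alpha$ induces), uses a sign-based step as a crude stand-in for Adam-style normalized updates, and loosely treats the first-order part of the change as "signal" and the `dB @ dA` cross term as "bilinear drift". The function name `update_terms`, the parameters `scale` and `lr`, and the random matrices are all hypothetical.

```python
import numpy as np


def update_terms(scale, lr, seed=0):
    """Split one sign-based LoRA step into first-order and bilinear parts.

    Assumption: delta_W = scale * B @ A, and both factors take a
    sign-SGD step (a rough proxy for a normalized optimizer).
    Returns (norm of first-order term, norm of bilinear term).
    """
    rng = np.random.default_rng(seed)
    d_out, d_in, r = 32, 24, 4
    B = rng.normal(size=(d_out, r))     # mid-training state, not the zero init
    A = rng.normal(size=(r, d_in))
    G = rng.normal(size=(d_out, d_in))  # stand-in for dLoss/dW

    # Chain rule through delta_W = scale * B @ A.
    grad_B = scale * G @ A.T
    grad_A = scale * B.T @ G

    # Sign-based step: the step size depends on lr only, not on scale.
    dB = -lr * np.sign(grad_B)
    dA = -lr * np.sign(grad_A)

    # Exact change: scale*((B+dB)@(A+dA) - B@A) = first_order + bilinear.
    first_order = scale * (dB @ A + B @ dA)
    bilinear = scale * (dB @ dA)
    return np.linalg.norm(first_order), np.linalg.norm(bilinear)


if __name__ == "__main__":
    for name, (s, lr) in {
        "baseline": (1.0, 1e-3),
        "scale x8": (8.0, 1e-3),
        "lr x8": (1.0, 8e-3),
    }.items():
        sig, drift = update_terms(s, lr)
        print(f"{name:>9}: first-order={sig:.4e}  bilinear/first-order={drift / sig:.4e}")
```

Under these assumptions, multiplying `scale` by 8 and multiplying `lr` by 8 enlarge the first-order term by the same factor, but only the learning-rate change raises the bilinear-to-first-order ratio. With plain SGD the gradients themselves grow with `scale`, so the picture differs; the paper's own analysis is the place to look for the precise statement.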

Based on these insights, the authors propose LoRA-$\alpha$, which scales $\alpha$ based on a principled square-root law (e.g., using a large base coefficient like $256\sqrt{r}$). This minimalist shift allows LoRA to directly inherit standard, small Full Fine-Tuning (FFT) learning rates while matching or even exceeding FFT performance across NLP, multimodal, and RL tasks.
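
To get a feel for the magnitudes involved, the following sketch compares the rank-tied heuristics mentioned above with a square-root rule. It is an illustration under stated assumptions: the coefficient 256 is just the example value quoted in this comment, the helper names `sqrt_law_alpha` and `compare_heuristics` are hypothetical, and how $\alpha$ enters the forward pass (for example, whether it is divided by $r$) depends on the implementation and is not modeled here.

```python
import math


def sqrt_law_alpha(rank, coeff=256.0):
    """Return coeff * sqrt(rank); coeff=256 is the example value, not a fixed constant."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return coeff * math.sqrt(rank)


def compare_heuristics(ranks=(4, 16, 64, 256)):
    """Tabulate alpha under alpha=r, alpha=2r, and the square-root rule."""
    rows = []
    for r in ranks:
        rows.append(
            {
                "rank": r,
                "alpha_eq_r": float(r),
                "alpha_eq_2r": 2.0 * r,
                "alpha_sqrt_law": sqrt_law_alpha(r),
            }
        )
    return rows


if __name__ == "__main__":
    print(f"{'rank':>6} {'a=r':>8} {'a=2r':>8} {'a=256*sqrt(r)':>14}")
    for row in compare_heuristics():
        print(
            f"{row['rank']:>6} {row['alpha_eq_r']:>8.0f} "
            f"{row['alpha_eq_2r']:>8.0f} {row['alpha_sqrt_law']:>14.0f}"
        )
```

With this example coefficient, the square-root value stays above $2r$ for every rank below $128^2 = 16384$ (the point where $256\sqrt{r} = 2r$), so for commonly used ranks it is far larger than either rank-tied heuristic.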

Bye-bye expensive hyperparameter tuning! 👋
