The Hidden Power of Scaling Factor in LoRA Optimization
Abstract
The Low-Rank Adaptation (LoRA) scaling factor α acts as a primary driver of optimization rather than a secondary complement to the learning rate. Theoretical and empirical analysis shows that it affects convergence more strongly than the learning rate does and characterizes how its optimal value scales with rank.
In Low-Rank Adaptation (LoRA), the scaling factor α is often treated as a mere complement to the learning rate, yet its role in optimization remains poorly understood. In this paper, we reveal that the scaling factor α and the learning rate function differently, with α emerging as the dominant driver of effective optimization, delivering gains that cannot be replicated by learning rate scaling alone. Combining extensive empirical analysis with a theoretical Signal-Drift framework, we report three findings about LoRA's scaling mechanism. First, LoRA's spectral suppression smooths the optimization landscape, rendering standard hyperparameters overly conservative and creating an optimization gap. Second, when leveraging this smoothness to accelerate convergence, α outperforms the learning rate by amplifying the task signal without increasing the drift ratio. Third, the optimal scaling factor follows a sublinear relationship with the rank, well characterized by a square-root law with an unexpectedly large coefficient, which shows that existing rank-tied heuristics scale α insufficiently. Based on these insights, we propose LoRA-α, a minimalist framework that restores α to its principled regime, making LoRA compatible with standard small learning rates. Extensive evaluations across diverse tasks demonstrate that LoRA-α consistently improves performance while streamlining hyperparameter search, unleashing the learning potential of LoRA.
Community
Maybe the first systematic study (empirical + theoretical) of LoRA's scaling factor $\alpha$ from an optimization perspective!
Recent studies highlight that a large learning rate ($\eta$) is crucial for LoRA optimization. However, this paper points out that this conclusion was drawn while the scaling factor $\alpha$ was left systematically underexplored. Through a joint empirical and theoretical lens, the authors argue for a shift in perspective: a sufficiently large scaling factor $\alpha$ is what actually matters most, delivering optimization gains that learning rate scaling alone cannot replicate.
Key takeaways:
- LoRA's low-rank nature smooths the optimization landscape (spectral suppression), making standard hyperparameters overly conservative and causing an optimization gap.
- $\alpha$ vs $\eta$: Increasing $\alpha$ acts as a "purity-preserving accelerator", which amplifies the task signal without increasing the bilinear drift, outperforming learning rate scaling.
- Under standard, small learning rates, the optimal $\alpha$ must be sufficiently large and follows a sublinear relationship with rank ($r$). This reveals that popular rank-tied heuristics (like $\alpha = r$ or $2r$) leave LoRA severely under-scaled because their magnitudes are too small.
Based on these insights, the authors propose LoRA-$\alpha$, which sets $\alpha$ according to a principled square-root law (e.g., $\alpha = 256\sqrt{r}$, with a large base coefficient of 256). This minimalist shift allows LoRA to directly inherit the standard, small learning rates used for Full Fine-Tuning (FFT) while matching or even exceeding FFT performance across NLP, multimodal, and RL tasks.
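To get a feel for how far apart these choices are, here is a minimal Python sketch that compares the rank-tied heuristics ($\alpha = r$, $\alpha = 2r$) with a square-root rule using the example coefficient 256 mentioned above. The function names are hypothetical, and `effective_scale` assumes the common LoRA convention in which the low-rank update is multiplied by $\alpha / r$; whether the paper uses exactly this convention is an assumption here, not something confirmed by the text above.

```python
import math


def lora_alpha_sqrt(rank: int, base: float = 256.0) -> float:
    """Square-root rule: alpha = base * sqrt(rank).

    `base=256.0` is the example coefficient quoted in the comment above;
    treat it as an illustrative default, not a tuned value.
    """
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return base * math.sqrt(rank)


def effective_scale(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update B @ A.

    Assumes the common LoRA convention W = W0 + (alpha / rank) * B @ A.
    """
    return alpha / rank


if __name__ == "__main__":
    print(f"{'rank':>5} {'alpha=r':>8} {'alpha=2r':>9} {'256*sqrt(r)':>12} {'scale':>8}")
    for r in (4, 16, 64, 256):
        alpha = lora_alpha_sqrt(r)
        # Last column: effective multiplier alpha / r under the square-root rule.
        print(f"{r:>5} {r:>8} {2 * r:>9} {alpha:>12.1f} {effective_scale(alpha, r):>8.1f}")
```

For $r = 16$, for instance, the square-root rule gives $\alpha = 1024$, against 16 or 32 for the two rank-tied heuristics. Under the assumed $\alpha / r$ convention, the effective multiplier also shrinks as rank grows (proportionally to $1/\sqrt{r}$) rather than staying fixed at 1 or 2.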
Bye-bye expensive hyperparameter tuning! 👋