LoRA (Low-Rank Adaptation)
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that approximates weight updates in pre-trained neural networks using low-rank decomposition.
LoRA, introduced by Hu et al. at Microsoft in 2021, addresses the computational and storage costs of fine-tuning large pre-trained models. Instead of updating all original model weights during task-specific adaptation, LoRA freezes the pre-trained weights and injects trainable low-rank matrices into specific layers, typically the attention layers of transformer architectures. For a frozen weight matrix W of shape d × k, the weight update is parameterized as ΔW = BA, where B has shape d × r and A has shape r × k, so ΔW has rank at most r, with r much smaller than d and k. The rank r is a hyperparameter that controls the trade-off between adaptation capacity and parameter efficiency.
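As a rough illustration of the decomposition, the following NumPy sketch applies a frozen weight plus a low-rank update to a single input vector. It follows the convention of the original paper, ΔW = BA scaled by alpha / r, with B initialized to zero and A initialized randomly. The layer sizes, the value of alpha, and the helper name lora_forward are assumptions made for this example, and the "trained" B is random stand-in data rather than the result of real training.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 48, 4      # illustrative sizes; r is much smaller than d_out, d_in
alpha = 8.0                     # scaling hyperparameter (assumed value)

W0 = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, random init
B = np.zeros((d_out, r))                     # trainable, zero init so the update starts at 0


def lora_forward(x, W0, A, B, alpha, r):
    """Return W0 @ x + (alpha / r) * B @ (A @ x) without forming the full update."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=(d_in,))

# At initialization B is zero, so the adapted layer matches the frozen layer.
y_init = lora_forward(x, W0, A, B, alpha, r)

# Stand-in for a trained B (random values, not a real training result).
B_trained = rng.normal(scale=0.01, size=(d_out, r))
delta_W = (alpha / r) * (B_trained @ A)          # same shape as W0
update_rank = np.linalg.matrix_rank(delta_W)     # at most r
```

Computing B @ (A @ x) rather than (B @ A) @ x keeps the extra cost proportional to r, which is the reason the two factors are stored separately during training.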
During training, only the low-rank matrices are updated, while the original weights remain unchanged. This reduces the number of trainable parameters by several orders of magnitude. For example, full fine-tuning of GPT-3 175B updates all 175 billion parameters, whereas LoRA with rank r=8 applied to the query and value projections trains about 38 million parameters, roughly 0.02% of that (r=4 gives roughly 0.01%). At inference time, the learned low-rank matrices can be merged into the original weights with no additional latency, or kept separate to allow rapid switching between multiple task-specific adaptations without duplicating the base model.
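The parameter savings can be checked with a few lines of arithmetic. The sketch below assumes the published GPT-3 175B shape (96 transformer layers, hidden size 12288) and assumes that only the query and value projections in each layer receive an adapter. The function names are made up for this example.

```python
def lora_trainable_params(d_out, d_in, r):
    """Parameters in one adapter: B is d_out x r and A is r x d_in."""
    return r * (d_out + d_in)


# Assumed GPT-3 175B configuration.
N_LAYERS = 96
D_MODEL = 12288
ADAPTED_PER_LAYER = 2            # query and value projections only (assumption)
TOTAL_PARAMS = 175_000_000_000


def lora_total(r):
    """Total trainable LoRA parameters across all adapted matrices."""
    per_matrix = lora_trainable_params(D_MODEL, D_MODEL, r)
    return N_LAYERS * ADAPTED_PER_LAYER * per_matrix


for r in (4, 8):
    n = lora_total(r)
    print(f"r={r}: {n:,} trainable parameters ({100 * n / TOTAL_PARAMS:.4f}% of 175B)")
# r=4: 18,874,368 trainable parameters (0.0108% of 175B)
# r=8: 37,748,736 trainable parameters (0.0216% of 175B)
```

Under these assumptions the trainable fraction grows linearly with r, so doubling the rank doubles the adapter size while the frozen base stays the same.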
LoRA is particularly effective for large language models and diffusion models, where full fine-tuning is prohibitively expensive. It has been shown to achieve performance comparable to full fine-tuning on many tasks while using far fewer resources. The method is orthogonal to other parameter-efficient techniques like adapter layers or prefix tuning and can be combined with them. Its simplicity and efficiency have made LoRA a widely adopted standard for adapting large models in both research and production settings.
Why it matters
LoRA makes fine-tuning of large models practical for organizations with limited compute budgets. By drastically reducing the number of trainable parameters and the memory needed for gradients and optimizer state, it enables customization of models such as LLaMA and Stable Diffusion on consumer hardware, and it lowers the hardware requirements for much larger models such as GPT-3, whose frozen weights must still fit in memory. It also facilitates multi-task serving by storing multiple lightweight LoRA modules that can be swapped without loading separate model copies, reducing storage and switching costs in production environments.
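The merge and swap behavior can be sketched with NumPy as follows. One base weight is shared, and each task contributes only a small (B, A) pair. The task names, the matrix sizes, and the merge and unmerge helpers are hypothetical, and the adapter values are random placeholders rather than trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 32, 4

W0 = rng.normal(size=(d_out, d_in))   # shared frozen base weight

# Hypothetical task names; random values stand in for trained adapters.
adapters = {
    "summarize": (rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))),
    "translate": (rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))),
}


def merge(W, B, A, scale=1.0):
    """Fold an adapter into the weight: W + scale * B @ A."""
    return W + scale * (B @ A)


def unmerge(W, B, A, scale=1.0):
    """Remove a previously merged adapter."""
    return W - scale * (B @ A)


x = rng.normal(size=(d_in,))

# Option 1: keep the adapter separate and apply it alongside the base weight.
B, A = adapters["summarize"]
y_separate = W0 @ x + B @ (A @ x)

# Option 2: merge for serving, so inference is a single matrix multiply.
W_merged = merge(W0, B, A)
y_merged = W_merged @ x

# Switch tasks by subtracting one adapter and adding another.
W_restored = unmerge(W_merged, B, A)
B2, A2 = adapters["translate"]
W_translate = merge(W_restored, B2, A2)

# Storage per task is the adapter only, not a full copy of the base weight.
base_bytes = W0.nbytes
adapter_bytes = B.nbytes + A.nbytes
```

Repeated merging and unmerging in floating point can accumulate small rounding errors, so a serving system might instead keep an untouched copy of the base weights and merge from that copy each time.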
First appeared
Hu et al., Microsoft, 2021.
FAQ
How does it work?
LoRA freezes the original pre-trained weights and inserts trainable low-rank decomposition matrices into specific layers, typically the query and value projection matrices in transformer attention. The weight update ΔW is approximated as the product of two smaller matrices A and B, where the rank r is much smaller than the original dimensions. Only these low-rank matrices are updated during fine-tuning, reducing the number of trainable parameters.
What is the typical rank value and how is it chosen?
Common rank values range from 1 to 64, with 8 or 16 being typical starting points. The rank controls the expressiveness of the adaptation: higher ranks allow more capacity but increase parameters and risk overfitting. In practice, many tasks achieve strong performance with surprisingly low ranks (e.g., r=8), and the optimal rank can be determined through validation experiments.
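One simple way to turn validation experiments into a choice of rank is to prefer the smallest rank whose validation score is close to the best observed score. The sketch below shows that rule. The function name pick_rank, the tolerance value, and the scores in the usage example are all assumptions for illustration (the scores are placeholders, not measurements), and other selection rules are equally reasonable.

```python
def pick_rank(val_scores, tolerance=0.005):
    """Return the smallest rank whose score is within `tolerance` of the best.

    val_scores maps rank -> validation metric, where higher is better.
    """
    if not val_scores:
        raise ValueError("val_scores must not be empty")
    best = max(val_scores.values())
    eligible = [r for r, s in val_scores.items() if s >= best - tolerance]
    return min(eligible)


# Placeholder numbers for illustration only, not measured results.
example_scores = {1: 0.81, 4: 0.86, 8: 0.874, 16: 0.876, 64: 0.875}
chosen = pick_rank(example_scores)   # -> 8
```

Preferring the smallest near-best rank keeps the adapter small, which matters when many adapters are stored or swapped.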
How does LoRA compare to full fine-tuning?
LoRA often matches or closely approaches the performance of full fine-tuning on downstream tasks while using orders of magnitude fewer trainable parameters. Because the original weights remain unchanged, the base model can always be recovered exactly by removing the adapter, and LoRA tends to forget less of the base model’s capabilities than full fine-tuning, although forgetting is not eliminated while the adapter is active. However, for tasks requiring very large distribution shifts, full fine-tuning may still outperform LoRA, though the gap is often small.