This is How I Understand LoRA Fine-Tuning

    Posted in AI Insights by baoshi.rao
    Friends who work with AI have likely heard of LoRA, currently the hottest parameter-efficient fine-tuning method. But do you really understand how it works? This article aims to demystify it, covering its principles, workflow, and applicable scenarios to help you build a solid and clear foundational understanding.

    What is LoRA?

    LoRA, short for Low-Rank Adaptation, is an efficient model fine-tuning technique. Its primary goal is to reduce the number of trainable parameters and lower computational and storage costs without compromising model performance.

    Traditional fine-tuning methods involve directly modifying all or most of the model's parameters. While this adapts the model to new tasks, it comes with high computational costs and GPU memory usage.

    LoRA, however, takes a different approach: instead of directly altering the original model's large weight matrices, it introduces a pair of low-rank matrices (A and B) as "patches" for each matrix being fine-tuned. These patches are trained while the original weights remain unchanged.

    To put it simply:
    Imagine the original large model as an engine. LoRA adds small "patches" to selected critical components (the matrices that influence model behavior). These patches "adjust" the behavior of these components to adapt to task requirements—without rebuilding the entire engine.

    What is a Low-Rank Matrix?

    The rank (r) of a matrix is the maximum number of linearly independent rows or columns it contains, i.e., how much independent information it carries.

    For the original model matrix M (an a×b matrix), the maximum possible rank is L = min(a, b).

    Low-rank matrix: When the actual rank r < L, it means the matrix has redundant information and can be reconstructed from a small number of independent rows/columns through linear combinations.

    Example:
    Consider a 10×10 matrix:

    • Maximum rank L = 10;
    • If its rank r = 3, it has only 3 independent pieces of information, with the remaining rows/columns derived from linear combinations of these 3;
    • Thus, this is a typical low-rank matrix.
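    This example is easy to check numerically. A small NumPy sketch (NumPy here is my choice for illustration, not something the article prescribes) that builds a 10×10 matrix from 3 independent directions and confirms its rank:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # A sum of 3 outer products has rank at most 3: every row is a
    # linear combination of the same 3 random direction vectors.
    M = sum(np.outer(rng.normal(size=10), rng.normal(size=10)) for _ in range(3))

    print(M.shape)                    # (10, 10)
    print(np.linalg.matrix_rank(M))   # 3 -- only 3 independent pieces of information
    ```

    The other 7 rows carry no new information; they can all be reconstructed from 3 of them.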

    Question:
    Are you now curious about how to set r? Keep reading for the answer 👇

    How is LoRA Fine-Tuning Implemented?

    Overview of the Process

    1. Select the matrices in the model layers where LoRA will be inserted.
    2. For each selected matrix, add a pair of small "patch" matrices, A and B (with shapes [d, r] and [r, d], where r is the chosen rank and d is the dimension of the original weight matrix, so that the product A × B has the same shape as the original matrix).
    3. Freeze the original model's parameters (i.e., leave them unchanged).
    4. Fine-tune only these two small matrices, A and B.
    5. Merge the fine-tuned model with the original model.
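    The steps above can be sketched in a few lines of NumPy. The class and names below are illustrative (not any real framework's API), assuming a square weight matrix for simplicity:

    ```python
    import numpy as np

    class LoRALinear:
        """Sketch of a LoRA-patched linear layer (illustrative, not a real framework API)."""
        def __init__(self, W, r, alpha=1.0, seed=0):
            rng = np.random.default_rng(seed)
            d = W.shape[0]
            self.W = W                               # frozen pretrained weight, [d, d] (step 3)
            self.A = rng.normal(size=(d, r)) * 0.01  # trainable patch A, [d, r]  (step 2)
            self.B = np.zeros((r, d))                # trainable patch B, [r, d]; zero init
                                                     # so the patch starts as a no-op
            self.scale = alpha / r                   # common scaling convention

        def forward(self, x):
            # frozen path + low-rank patch path; only A and B are trained (step 4)
            return x @ self.W.T + self.scale * x @ (self.A @ self.B).T
    ```

    Because B starts at zero, the patched layer initially behaves exactly like the original one; training then moves only A and B.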

    Key Steps Explained

    1. Selection: Which layers and matrices should LoRA be applied to?
    LoRA isn't added to all layers—it’s strategically inserted into specific target layers and matrices. Practical guidelines include:

    a. Prioritize Q and V matrices

    • Q (Query) and V (Value) are core components of Transformer attention mechanisms.
    • They control which information the model focuses on.
    • Inserting LoRA here effectively adjusts the model's attention behavior, significantly influencing text generation.
    • This is also the mainstream approach on platforms like HuggingFace.

    b. Occasionally apply to linear layers in FeedForward (MLP) networks

    • For certain tasks, MLP layers influence semantic transformation.
    • LoRA can fine-tune this semantic processing.
    • However, this is less common and not standard practice.

    c. Avoid LayerNorm or Embedding layers

    • LayerNorm layers have very few parameters, so there is little for a low-rank patch to adjust; Embedding layers are lookup tables rather than projection matrices, so a low-rank update is a poor fit for them.
    • Adding LoRA here brings little benefit and may cause gradient instability or performance degradation.

    In short: Patch where needed, leave the rest untouched.
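    As one concrete instance of this practice, HuggingFace's `peft` library lets you restrict LoRA to the Q and V projections via `target_modules`. Treat this as an illustrative config sketch: the module names `"q_proj"`/`"v_proj"` are those used by LLaMA-style models and vary by architecture, and `base_model` is a placeholder for a loaded transformer.

    ```python
    from peft import LoraConfig, get_peft_model

    config = LoraConfig(
        r=8,                                   # rank of the patch matrices
        lora_alpha=16,                         # scaling applied to the patch
        target_modules=["q_proj", "v_proj"],   # apply LoRA only to attention Q and V
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    # peft_model = get_peft_model(base_model, config)
    ```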

    2. After selection: What do the patches look like? How is rank r chosen?
    Suppose you’ve selected the Q and V layers in a Transformer for LoRA fine-tuning. Here’s how it works for the Q matrix:

    a. Add trainable patches to the Q matrix
    If the original Q matrix has dimensions 512×512, LoRA inserts a pair of patch matrices:

    • QA: [512, r]
    • QB: [r, 512]
      When multiplied, the resulting Q_patch = QA × QB has the same dimensions as the original Q matrix, [512, 512].

    b. Freeze original weights, train only the patches
    During training, the original Q matrix remains frozen, and only QA and QB (the patch matrices) are updated. This focuses the fine-tuning on "adjusting specific directions" for the task, avoiding modifications to the entire large matrix.
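    A minimal NumPy sketch of this idea, assuming a toy least-squares objective (my assumption for illustration, not the article's training setup): gradients flow only into the patches A and B, while W is never touched.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    d, r = 8, 2
    W = rng.normal(size=(d, d))        # frozen: never updated below
    A = rng.normal(size=(d, r))        # trainable patch
    B = np.zeros((r, d))               # trainable patch, zero init

    x = rng.normal(size=(1, d))
    target = rng.normal(size=(1, d))
    lr = 0.01

    losses = []
    for _ in range(200):
        delta = A @ B                          # [d, d] low-rank patch
        err = x @ (W + delta).T - target       # prediction error, [1, d]
        losses.append(0.5 * float(err @ err.T))
        grad_delta = err.T @ x                 # dL/d(W + delta)
        grad_A = grad_delta @ B.T              # chain rule through delta = A @ B
        grad_B = A.T @ grad_delta
        A -= lr * grad_A                       # only the patches are updated;
        B -= lr * grad_B                       # W receives no gradient step
    ```

    The loss falls as training proceeds, even though only the 2×(8×2) patch parameters move and the 8×8 frozen weight stays fixed.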

    c. Recommended rank r selection
    The rank r is a small dimension you specify, determining the patch's expressive power and parameter count. Empirical guidance:

    • Larger r: stronger adjustment capability, but more parameters and higher computational cost.
    • Smaller r: fewer parameters and greater efficiency, but possibly too little capacity to adapt the model's behavior.
    • In practice, values between 4 and 64 are typical, with 8 or 16 as common defaults; the original LoRA paper found that even very small ranks are often sufficient.
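    To make the trade-off concrete, here is the trainable-parameter arithmetic for patching a single 512×512 matrix at a few candidate ranks:

    ```python
    # Trainable parameters for one 512x512 weight: full fine-tuning vs LoRA patches
    d = 512
    full = d * d                       # 262,144 params to update the whole matrix
    for r in (4, 8, 16, 64):
        lora = d * r + r * d           # A: [d, r] plus B: [r, d]
        print(f"r={r}: {lora} params ({lora / full:.1%} of full)")
    ```

    At r = 8, the patches hold 8,192 parameters, about 3% of what full fine-tuning of that matrix would train.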

    d. Training and testing workflow

    • During training, only the patch matrices QA and QB are updated.
    • Note: r is typically a predefined hyperparameter, not optimized during training.

    3. Final step: Where are the patches merged into the model?
    After LoRA fine-tuning, the patches are usually merged with the original model for deployment. The rule is:

    • Only merge layers where LoRA was inserted.
    • Unmodified layers (e.g., unselected matrices) remain unchanged.
    • The merged model is the same size as the original, since each patch is folded into the existing weight matrix; the parameter structure is unchanged.
    • Effectively, LoRA fine-tunes the critical "screws" in the model.

    Example:

    • Original model: 512 layers.
    • LoRA applied only to Q/V in layer 400.
    • During merging, only layer 400’s Q/V are updated; other layers stay intact.
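    A quick NumPy sketch of the merge itself (dimensions illustrative), showing that folding the patch into the weight leaves the shape, and hence the model size, unchanged:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    d, r = 512, 8
    W = rng.normal(size=(d, d))        # original Q (or V) weight in the chosen layer
    A = rng.normal(size=(d, r))        # trained patch
    B = rng.normal(size=(r, d))        # trained patch

    W_merged = W + A @ B               # fold the low-rank patch into the frozen weight
    # W_merged has exactly the original [512, 512] shape: merging adds no parameters
    ```

    Layers that were never patched are simply left as they are.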

    Where is LoRA Most Suitable?

    From the above process, my understanding is:
    LoRA fine-tuning focuses on patching specific matrices in certain layers, allowing the model to adapt to tasks while preserving its "core knowledge." In other words, LoRA doesn’t rebuild the model but makes lightweight adjustments to refine its behavior.

    Thus, my takeaway: LoRA doesn’t change knowledge—it refines expression.

    Given this, LoRA is well-suited for scenarios like:

    • Controlling the model’s tone, grammar, or output format.
    • Making the model express its existing knowledge in a desired manner.
    • Solidifying prompt effects to avoid repetitive prompt engineering.
    • Quickly adapting general models to specific domains (e.g., healthcare, law).

    However, it has clear limitations and is less suitable for:

    • Teaching the model entirely new knowledge (unseen during pretraining).
    • Establishing new causal reasoning or complex logical chains.
    • Transforming a "knowledge-rich" general model into a "truly expert" model.

    In summary, LoRA is more about fine-tuning expression atop existing knowledge. It won’t make the model smarter but can make its outputs more stable and aligned with your needs.
