LoRA, explained


Sep 21, 2023

LoRA, or Low-Rank Adaptation, has become something of a buzzword recently. We’ve heard LoRA thrown around in conversation frequently (sometimes where it doesn’t even make sense!), so here’s a short explainer.

Ideogram’s interpretation of the low rank adaptation algorithm.

The LoRA paper predates the current LLM craze by over a year. LoRA is an optimization for model fine-tuning that improves memory efficiency by reducing the number of parameters used in fine-tuning.

The motivation in the LoRA paper is that natural language models had been getting larger. Traditional fine-tuning techniques update every single weight, which means that every fine-tuned copy of a model has a completely new set of weights. Deploying these models separately is expensive and resource-intensive. These concerns have only gotten more acute since 2021, as models have gotten larger and GPUs have become difficult (read: impossible) to find.

LoRA relies on an observation from a previous paper (from Facebook), which shows that the information added to an LLM during fine-tuning can be represented in far fewer dimensions than the original LLM uses. The updates to large language models’ weight matrices (which typically have full rank [1]) can be restricted to a much lower rank without losing significant fidelity. As a consequence, LoRA requires training far fewer parameters.

While we won’t explain all the linear algebra here, LoRA works by “injecting” a pair of rank decomposition matrices into each transformer layer [2]. The pre-trained weights are left frozen, and only these additional matrices are updated during the fine-tuning process. As described above, these matrices can be significantly smaller than the Transformer’s original weight matrix. For example, GPT-3 uses 12,288 dimensions, so the original Transformer weight matrix will be 12,288x12,288. The LoRA matrices might instead be 12,288x1 or 12,288x2. For inference, the product of the LoRA matrices can be added to the original LLM’s weights ahead of time, meaning no extra compute is required at serving time [3].
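To make the mechanics concrete, here is a minimal NumPy sketch of a single adapted layer. It is an illustration under stated assumptions, not the paper's or any library's implementation: the names `forward_unmerged` and `merge` are hypothetical, the hidden size is shrunk from 12,288 to 64 so it runs instantly, and the "fine-tuned" values are just random numbers standing in for a real training run.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64   # hidden size (GPT-3 uses 12,288; kept small so the sketch runs quickly)
r = 2    # LoRA rank, an illustrative choice

# Frozen pre-trained weights: never updated during fine-tuning.
W0 = rng.normal(size=(d, d))

# Trainable low-rank pair. One factor starts at zero so the adapted
# layer initially behaves exactly like the pre-trained one.
A = rng.normal(scale=0.01, size=(d, r))
B = np.zeros((r, d))


def forward_unmerged(x, W0, A, B):
    """Fine-tuning-time forward pass: frozen path plus low-rank path."""
    return x @ W0 + (x @ A) @ B


def merge(W0, A, B):
    """Fold the low-rank update into the base weights for inference."""
    return W0 + A @ B


# Stand-in for fine-tuning: pretend training moved B away from zero.
B = rng.normal(scale=0.01, size=(r, d))

x = rng.normal(size=(4, d))             # a batch of 4 input vectors
y_unmerged = forward_unmerged(x, W0, A, B)
y_merged = x @ merge(W0, A, B)          # one matmul, same cost as the original layer

print(np.allclose(y_unmerged, y_merged))  # prints True
```

The two outputs match, which is the reason a merged LoRA model costs no more to serve than the original: after merging, the layer is again a single matrix of the original size.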

Empirically, the LoRA paper reports that it can reduce GPU memory usage during fine-tuning by up to 3x and reduce the number of trainable parameters for fine-tuning GPT-3 by 10,000x.
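For a rough feel of where the savings come from, the arithmetic for a single 12,288x12,288 weight matrix can be sketched as follows. The helper names `full_params` and `lora_params` are hypothetical, and the ranks are illustrative. Note that this per-matrix ratio is not meant to reproduce the 10,000x figure, which is a whole-model number that also depends on which matrices are adapted.

```python
def full_params(d_in, d_out):
    """Number of entries in a dense d_in x d_out weight matrix."""
    return d_in * d_out


def lora_params(d_in, d_out, r):
    """Entries in the two low-rank factors: (d_in x r) plus (r x d_out)."""
    return r * (d_in + d_out)


d = 12_288  # GPT-3's hidden dimension, as quoted above
for r in (1, 2, 8):
    full = full_params(d, d)
    low = lora_params(d, d, r)
    print(f"r={r}: {low:,} trainable vs {full:,} frozen ({full // low:,}x fewer)")
```

At rank 1, the two factors hold 24,576 values against 150,994,944 in the full matrix, so the trainable parameter count grows linearly with the rank rather than quadratically with the hidden dimension.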

This should hopefully give you a rough sense of what LoRA does, but what are the practical implications? At a high level, LoRA’s goal is to make fine-tuning accessible to a much broader swathe of the world than before.

If fine-tuning an LLM required updating every parameter that the LLM has, it would require the same resources as training the model from scratch (albeit for a shorter time period). Very few companies (OpenAI, AWS, Google, Microsoft) would be able to offer fine-tuning in the first place, and they would only have the bandwidth to support a few customers.

Innovation would also be limited in the open-source ecosystem because open-source LLMs would not be readily fine-tunable, which would make Vicuna and Gorilla (disclosure: from my group!) impossible. And of course, LoRA has also led to the explosion of startups offering LLM fine-tuning as a service.

As mentioned above, LoRA predates the current wave of LLM attention (pun intended). There has been follow-on work to continue to improve on LoRA. Most notably, QLoRA introduces an optimized 4-bit memory representation based on quantization to further reduce the memory overhead of fine-tuning, while still maintaining high model quality. We’ll save a more detailed deep dive on follow-on work to LoRA for future posts.

For now, we’ll leave you with these takeaways:

Fine-tuning an LLM still isn’t where you should start with LLMs. Of course, when applied well, fine-tuning can be an extremely powerful tool.
LoRA and follow-on techniques like QLoRA can drastically reduce the memory overhead and GPU requirements (and therefore improve the affordability) of fine-tuning LLMs.
Techniques like LoRA are and will continue to be critical to enabling innovation in LLMs.

If you’re looking to get started with LoRA, Microsoft has a popular library called loralib that’s a good place to start. There’s also some exciting new work on the tradeoffs with LoRA that we’ll talk about next week.

[1] If you’re not familiar, a matrix’s rank is the number of linearly independent columns in the matrix. If some columns are linear combinations of other columns, the matrix can be represented more compactly, for example by singular value decomposition. Matrix rank comes up frequently in machine learning algorithms.
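As a small illustration of rank and the compact representation mentioned here, the following NumPy sketch builds a 3x3 matrix whose third column is the sum of the first two, then rebuilds it from only its top singular components. The matrix values are made up for the example.

```python
import numpy as np

# The third column equals the sum of the first two,
# so only 2 of the 3 columns are linearly independent.
M = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [2.0, 3.0, 5.0],
])
rank = np.linalg.matrix_rank(M)

# Singular value decomposition: M = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(M)

# Keep only the top `rank` singular components and rebuild the matrix.
k = rank
M_k = (U[:, :k] * S[:k]) @ Vt[:k, :]

print(rank, np.allclose(M, M_k))  # prints: 2 True
```

Because the matrix has rank 2, dropping the third singular component loses nothing, which is the same idea LoRA exploits at a much larger scale.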

[2] Note that the original LoRA paper only adapts the attention weights of the transformer architecture. The authors defer adapting MLP, LayerNorm, etc. to future work.

[3] If you’re curious about the matrix dimensions, here’s a rough sketch. Let’s say we have an original set of weights \(W_0\) with dimension 12,288x12,288. We’ll define the LoRA-modified weights \(W\) as:

\(W = W_0 + \Delta W\)

And we’ll construct the LoRA matrix such that:

\(\Delta W = AB\)

The two factor matrices can share a small inner (rank) dimension, here 1, such that:

\(A_{12288 \times 1} \, B_{1 \times 12288} = \Delta W_{12288 \times 12288}\)

The weight update matrix therefore has the same dimensions as the original weight matrix, so the two can be added.
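The shape bookkeeping above can be checked numerically. This is a sketch under assumptions: `d = 256` stands in for 12,288 purely to keep memory use small, and the matrices are filled with random values rather than trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 1   # d stands in for 12,288; r is the LoRA rank

A = rng.normal(size=(d, r))     # tall, thin factor
B = rng.normal(size=(r, d))     # short, wide factor
delta_W = A @ B                 # full-size update built from 2 * d * r numbers

W0 = rng.normal(size=(d, d))    # stand-in for the frozen pre-trained weights
W = W0 + delta_W                # shapes match, so the sum is well defined

print(delta_W.shape, W.shape)
print(np.linalg.matrix_rank(delta_W))   # rank of the update is at most r
```

The update has the full d x d shape, yet its rank never exceeds r, which is what "low-rank adaptation" refers to.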