Microsoft and Beihang release MoRA, an efficient LLM fine-tuning technique

Researchers from Microsoft and Beihang University have introduced a new technique for fine-tuning large language models (LLMs) at a fraction of the usual cost.

The new technique, called MoRA, is a parameter-efficient fine-tuning (PEFT) method that addresses some of the limitations of other popular techniques, such as low-rank adaptation (LoRA). MoRA is especially useful when you want to fine-tune a model on tasks that require it to acquire new knowledge. As PEFT methods become increasingly popular in the enterprise, MoRA could become an important addition to the growing toolkit of LLM application developers.

The limitations of LoRA

Classic fine-tuning requires updating all of an LLM's parameters. When the model contains billions of parameters, full fine-tuning becomes expensive and slow. Parameter-efficient fine-tuning techniques are based on the premise that when adapting LLMs for downstream applications, you do not need to update all of the parameters. PEFT methods find a small subset of parameters to modify to configure the model for the target task.

LoRA has become a popular PEFT technique because it represents the weight update as the product of two low-rank matrices, mapping the change to the full-rank weight matrix into a much smaller subspace. LoRA significantly reduces memory requirements and makes it easy to store and deploy fine-tuned models.
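
For intuition, here is a minimal PyTorch sketch of the idea, a generic LoRA-style layer rather than any specific library's or the paper's implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen

        d_in, d_out = base.in_features, base.out_features
        # Only r * (d_in + d_out) parameters are trained instead of d_in * d_out.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection, zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to replacing the weight W with W + scale * (B @ A).
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because the pretrained weights stay frozen and shared, each fine-tuned variant only needs to store the small A and B matrices, which is why LoRA adapters are so cheap to save and deploy.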

While LoRA performs well on tasks such as text classification and instruction tuning, it struggles with more complex tasks that require expanding the knowledge and capabilities of LLMs, such as mathematical reasoning and continual pre-training. Several studies have shown that LoRA's low-rank updating mechanism can limit the ability of large language models to effectively learn and memorize new knowledge.

Because the rank of the LoRA adapter is significantly smaller than the full rank of the model (the product of a d×r and an r×d matrix can have rank at most r), “this limitation restricts the capacity to store new information through fine-tuning,” the researchers write.

MoRA

LoRA (left) uses low-rank matrices, while MoRA (right) uses a single square matrix for parameter-efficient fine-tuning (source: arXiv)

To address the limitations of LoRA, the researchers introduce MoRA, a PEFT technique that uses a square matrix instead of low-rank matrices. The main idea behind MoRA is to use trainable parameters in a way that achieves the highest possible rank in the space of the model's original dimensions.

Unlike LoRA, the input and output dimensions of the MoRA adapter do not match those of the original model, so it cannot be folded into the same matrix multiplication. To bridge this gap, the researchers developed compression and decompression functions that transform the input and output between the two spaces, which also allows MoRA to be plugged into LLMs of different sizes.
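
The paper evaluates several such compression and decompression schemes. As a rough illustration (a sketch of the concept, not the authors' released code), the simplest variant reshapes the input into chunks whose size matches the square matrix:

```python
import torch
import torch.nn as nn

class MoRAAdapter(nn.Module):
    """Illustrative sketch of MoRA's square-matrix update; its output is added
    to the output of the frozen layer it adapts."""

    def __init__(self, d: int, r: int):
        super().__init__()
        assert d % r == 0, "this sketch assumes the hidden size is divisible by r"
        self.r = r
        # A single trainable r x r square matrix: r * r parameters of rank up to r.
        self.M = nn.Parameter(torch.zeros(r, r))  # zero init: no change at step 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "Compress": reshape the d-dimensional input into d // r chunks of size r.
        chunks = x.reshape(*x.shape[:-1], -1, self.r)
        # Apply the square matrix to each chunk, then "decompress" back to d dims.
        return (chunks @ self.M.T).flatten(-2)
```

The payoff is rank: with a hidden size of 4,096, a rank-8 LoRA adapter trains 2 × 4,096 × 8 = 65,536 parameters per layer, while the same budget buys MoRA a 256 × 256 square matrix whose rank is 256 instead of 8.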

According to the researchers, the square weight matrix gives MoRA a stronger capacity to learn new knowledge than a LoRA adapter of the same size.

MoRA in action

The researchers compared equally sized LoRA and MoRA models on different tasks and settings. On memorization tasks, MoRA significantly outperformed LoRA and came much closer to the performance of a fully fine-tuned model while using fewer parameters and training steps.

MoRA's loss curve closely tracks full fine-tuning on knowledge memorization tasks (source: arXiv)

“Our method shows significant improvements over LoRA with the same number of trainable parameters, benefiting from high-rank updating,” the researchers write.

On instruction tuning and mathematical reasoning tasks, MoRA performed nearly on par with LoRA. But on continual pre-training in the biomedical and financial domains, MoRA outperformed LoRA, benefiting from its high-rank updates to memorize new knowledge.

The researchers also found that increasing the rank of the MoRA adapter can close the performance gap between PEFT and full fine-tuning on mathematical reasoning tasks, although this comes at higher training and storage costs.

PEFT for the enterprise

Fine-tuning is a key use case for enterprise LLM applications. Beyond improving the capabilities and accuracy of LLMs on proprietary knowledge, fine-tuning can allow companies to use smaller models for tasks that previously required expensive frontier models.

Currently, LoRA and its variants are the gold standard for parameter-efficient fine-tuning, and there is a rich ecosystem of tools and platforms for creating LoRA adapters. For example, S-LoRA is a framework that allows developers to run thousands of LoRA adapters on a single GPU, unlocking applications that require many fine-tuned LLMs, such as models customized for each user.
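
S-LoRA itself is a dedicated serving system, but the underlying pattern of sharing one frozen base model across many adapters can be sketched with the Hugging Face peft library (the model name, adapter paths, and adapter names below are illustrative):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model once; its weights are shared by all adapters.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach several fine-tuned LoRA adapters to the same base weights.
model = PeftModel.from_pretrained(base, "adapters/customer-a", adapter_name="customer-a")
model.load_adapter("adapters/customer-b", adapter_name="customer-b")

# Route each request through the right adapter without reloading the base model.
model.set_adapter("customer-a")
# ... serve customer A ...
model.set_adapter("customer-b")
# ... serve customer B ...
```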

The researchers from Microsoft and Beihang have released an open-source implementation of MoRA that is compatible with LoRA. It could prove to be an important tool for enterprise applications that need to add new knowledge to base models.