Recently, I’ve been on a deep dive into simpler ways to put the data I have to work with the help of AI (LLMs). The key goal was to have that data retrieved properly while maintaining comprehension and the desired results.

Easier said than done, at least for an everyday consumer. The reality is that most ordinary PCs don’t come with more than 64 GB of RAM, so by the same token, running anything beyond a 7B to 14B model clearly goes out the window.

Then consider that a single fine-tuning checkpoint of a 175B model can take up to 1 TB of disk space, which makes it beyond unrealistic to even think about tuning your own AI.

Now, that being said, here’s where LoRA (Low-Rank Adaptation) comes in. In simple words, this approach feels like a shortcut to customizing your AI.


What is LoRA?

According to Edward Hu, the LoRA technique lets you customize big AI models efficiently instead of retraining the entire model (where the required resources would be massive). The way LoRA shines is by adding small “adapters” that are a tiny fraction of the original model’s size. Just imagine that it:

  • Reduces checkpoint size dramatically (from 1 TB to just 25 MB for GPT-3)
  • Adds no extra inference latency
  • Lets you switch between different customized versions quickly
  • Preserves the performance of full fine-tuning

In layman’s terms, think of a Swiss Army knife with attachments. Instead of buying a whole new knife for each task, your model (the knife) stays the same, and you add different tools (the adapters) to suit the job at hand.
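Here is a minimal sketch of that idea in plain NumPy (the dimensions and the rank are toy values I picked for illustration): the big weight matrix W is frozen, and all the learning happens in two small matrices A and B whose product forms a low-rank update.

```python
import numpy as np

# Frozen pretrained weight: this is the "knife" and it never changes.
d_out, d_in, r = 4096, 4096, 8           # toy sizes; r is the LoRA rank
W = np.random.randn(d_out, d_in)

# The "attachment": two small trainable matrices whose product B @ A
# is a rank-r update to W. B starts at zero, so training begins from
# the unmodified base model.
A = np.random.randn(r, d_in) * 0.01
B = np.zeros((d_out, r))

def lora_forward(x, alpha=16):
    # Frozen path plus the scaled low-rank adapter path.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = np.random.randn(d_in)
print(lora_forward(x).shape)             # (4096,), same shape as plain W @ x
```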


Key Advantages

Using LoRA in your AI stack is, for sure, a game changer. First off, think about the storage: we’re talking about shrinking model checkpoints from a massive 1 TB down to just 25 MB for GPT-3. The reason is that instead of storing all 175 billion parameters, you only need to store about 4.7 million.
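To make that arithmetic concrete, here is a back-of-the-envelope count for a single 4096 × 4096 weight matrix at rank 8 (both numbers are my own illustrative choices, not figures from the paper):

```python
d, r = 4096, 8
full_params = d * d                # 16,777,216 values trained in full fine-tuning
lora_params = (r * d) + (d * r)    # A is r x d, B is d x r: 65,536 values
print(full_params // lora_params)  # 256, i.e. ~256x fewer trainables per matrix
```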

What’s really neat is that it doesn’t slow things down during actual use - there’s zero added inference latency. The adapters can be merged with the base model before you deploy it, so it runs just as fast as the original model. And when you need to switch between different customized versions? It’s super quick, faster than a single forward pass, and can be done in parallel.
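Merging is just matrix addition, which is why it costs nothing at inference time. A quick NumPy sketch (again with toy sizes of my own choosing):

```python
import numpy as np

d, r, alpha = 1024, 8, 16
W = np.random.randn(d, d)                # frozen base weight
A = np.random.randn(r, d) * 0.01         # trained adapter halves
B = np.random.randn(d, r) * 0.01

# Merge before serving: inference is now a single matmul, like the base model.
W_merged = W + (alpha / r) * (B @ A)

# Unmerge to recover the pristine base weight, e.g. to swap in another adapter.
W_restored = W_merged - (alpha / r) * (B @ A)
print(np.allclose(W_restored, W))        # True
```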

The deployment options are pretty flexible too. You can cache thousands of LoRA modules in RAM and train multiple modules in parallel. Plus, it supports this cool tree structure for gradual specialization.
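Hugging Face’s peft library exposes this multi-adapter pattern directly. In the sketch below, the model id and adapter directories are placeholders I made up, but the from_pretrained / load_adapter / set_adapter calls are the library’s actual API:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder id

# Attach several small adapters to one frozen base model.
model = PeftModel.from_pretrained(base, "adapters/sheet-music",
                                  adapter_name="music")         # hypothetical path
model.load_adapter("adapters/finance", adapter_name="finance")  # hypothetical path

# Switching is cheap bookkeeping over the tiny adapter weights.
model.set_adapter("music")    # ...run music prompts...
model.set_adapter("finance")  # ...run finance prompts...
```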

What’s really impressive is how it handles the computational side of things. You can run this on regular consumer GPUs like the RTX 2080 Ti or 3080, and if you pair it with 8-bit quantization through bitsandbytes, it becomes even more efficient. No more need for those massive GPU clusters.
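As a rough sketch of that setup (the model id is a placeholder of mine, and the target module names vary between architectures), loading the frozen base in 8-bit with bitsandbytes and attaching LoRA adapters via peft looks something like this:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 8-bit so it fits on a consumer GPU.
model = AutoModelForCausalLM.from_pretrained(
    "your-7b-model",                                 # placeholder model id
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)       # housekeeping for k-bit training

# Attach small trainable LoRA adapters to the attention projections.
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],             # names depend on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()                   # typically well under 1%
```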

The performance is where it really shines. You get results that match full fine-tuning, but here’s the kicker - it works great even with smaller datasets and actually generalizes better to new scenarios. It’s like having your cake and eating it too!


When Not to Use LoRA

While LoRA is awesome, it’s not always the best choice. If you’re trying to teach an English model to understand Martian, or the new task is completely different from the original training data, full fine-tuning might be better. Also, if you need extremely precise control over every parameter, or you’re working with very small models (under 1B parameters), LoRA might not be your best bet.


Key Takeaways

  • LoRA is a game-changer for fine-tuning big models.
  • It’s efficient, effective, and surprisingly simple.
  • Perfect for when you need to customize AI models without breaking the bank.

Personal Project

Now, the way I intend to use this method is to fine-tune a simple 7B-parameter model into smaller AI agents that each perform a specific task. Maybe one adapter is better at analyzing sheet music, while another can paint a clear view of the stock ecosystem. Pair this with a good chunking method and a good vector database, and I believe you have a winner. I will try to follow up in the next few weeks with a more in-depth practical example.
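To sketch how that routing could look (every name here, the tasks, the adapter labels, and the retrieve() stub, is hypothetical and purely for illustration):

```python
# Hypothetical routing layer: one frozen 7B base, one adapter per specialty.
ADAPTERS = {"sheet_music": "music", "stocks": "finance"}

def retrieve(task: str, query: str) -> str:
    # Placeholder for the chunking + vector-database lookup step.
    return f"[top chunks for {query!r} from the {task} index]"

def build_prompt(task: str, query: str) -> str:
    adapter = ADAPTERS[task]             # pick the specialist adapter by task
    context = retrieve(task, query)      # ground the prompt in retrieved chunks
    return f"<adapter:{adapter}>\n{context}\n\nQuestion: {query}"

print(build_prompt("stocks", "How did chip stocks move this week?"))
```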


I will leave links to some key materials that will really enhance both the theoretical understanding of this subject and its practical side.


PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware. link here

Tutorial on Fine Tuning by Sam Witteveen. link here

What is Low-Rank Adaptation (LoRA) by Edward Hu. link here

Low-Rank Adaptation for finetuning LLMs EXPLAINED by Letitia Parcalabescu. link here