TGI Multi-LoRA: Deploy Once, Serve 30 Models
TGI Multi-LoRA: Deploy Once, Serve 30 Models
In today's ML world, organizations looking to leverage the value of their data will likely end up in a fine-tuned world, building a multitude of models, each one highly specialized for a specific task. However, managing multiple AI models can be complex and expensive. The answer is Multi-LoRA serving.
Motivation
Building a multitude of models via fine-tuning makes sense for multiple reasons:
- Performance: There is compelling evidence that smaller, specialized models outperform their larger, general-purpose counterparts on the tasks that they were trained on. Predibase showed that you can get better performance than GPT-4 using task-specific LoRAs with a base like Mistralai/Mistral-7B-v0.1.
- Adaptability: Models like Mistral or Llama are extremely versatile. You can pick one of them as your base model and build many specialized models, even when the downstream tasks are very different. Also, note that you aren't locked in as you can easily swap that base and fine-tune it with your data on another base (more on this later).
- Independence: For each task that your organization cares about, different teams can work on different fine tunes, allowing for independence in data preparation, configurations, evaluation criteria, and cadence of model updates.
- Privacy: Specialized models offer flexibility with training data segregation and access restrictions to different users based on data privacy requirements. Additionally, in cases where running models locally is important, a small model can be made highly capable for a specific task while keeping its size small enough to run on device.
In summary, fine-tuning enables organizations to unlock the value of their data, and this advantage becomes especially significant, even game-changing, when organizations use highly specialized data that is uniquely theirs.
Where is the Catch?
Deploying and serving Large Language Models (LLMs) is challenging in many ways. Cost and operational complexity are key considerations when deploying a single model, let alone n models. This means that, for all its glory, fine-tuning complicates LLM deployment and serving even further.
Background on LoRA
LoRA, which stands for Low-Rank Adaptation, is a technique to fine-tune large pre-trained models efficiently. The core idea is to adapt large pre-trained models to specific tasks without needing to retrain the entire model, but only a small set of parameters called adapters. These adapters typically only add about 1% of storage and memory overhead compared to the size of the pre-trained LLM while maintaining the quality compared to fully fine-tuned models.
The obvious benefit of LoRA is that it makes fine-tuning a lot cheaper by reducing memory needs. It also reduces catastrophic forgetting and works better with small datasets.
Multi-LoRA Serving
Now that we understand the basic idea of model adaptation introduced by LoRA, we are ready to delve into multi-LoRA serving. The concept is simple: given one base pre-trained model and many different tasks for which you have fine-tuned specific LoRAs, multi-LoRA serving is a mechanism to dynamically pick the desired LoRA based on the incoming request.
How to Use
Gather LoRAs
First, you need to train your LoRA models and export the adapters. You can find a guide here on fine-tuning LoRA adapters. Do note that when you push your fine-tuned model to the Hub, you only need to push the adapter, not the full merged model. When loading a LoRA adapter from the Hub, the base model is inferred from the adapter model card and loaded separately again. For deeper support, please check out our Expert Support Program. The real value will come when you create your own LoRAs for your specific use cases.
Low Code Teams
For some organizations, it may be hard to train one LoRA for every use case, as they may lack the expertise or other resources. Even after you choose a base and prepare your data, you will need to keep up with the latest techniques, explore hyperparameters, find optimal hardware resources, write the code, and then evaluate. This can be quite a task, even for experienced teams.
AutoTrain can lower this barrier to entry significantly. AutoTrain is a no-code solution that allows you to train machine learning models in just a few clicks. There are a number of ways to use AutoTrain. In addition to locally/on-prem we have:
- AutoTrain Environment
- Hardware Details
- Code Requirement
- Notes
Hugging Face Space
- Variety of GPUs and hardware
- No code
- Flexible and easy to share
DGX cloud
- Up to 8xH100 GPUs
- No code
- Better for large models
Google Colab
- Access to a T4 GPU
- Low code
- Good for small loads and quantized models
Deploy
For our examples, we will use a couple of the excellent adapters featured in LoRA Land from Predibase:
- predibase/customer_support is trained on the Gridspace-Stanford Harper Valley speech dataset, which enhances its ability to understand and respond to customer service interactions accurately. This improves the model's performance in tasks such as speech recognition, emotion detection, and dialogue management, leading to more efficient and empathetic customer support.
- predibase/magicoder is trained on ise-uiuc/Magicoder-OSS-Instruct-75K, which is a code instruction dataset that is synthetically generated.
TGI
There is already a lot of good information on how to deploy TGI. Deploy like you normally would, but ensure that you:
- Use a TGI version newer or equal to v2.1.1
- Deploy your base: mistralai/Mistral-7B-v0.1
- Add the LORA_ADAPTERS env var during deployment
- Example: LORA_ADAPTERS=predibase/customer_support,predibase/magicoder
model=mistralai/Mistral-7B-v0.1
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:2.1.1 \
--model-id $model \
--lora-adapters=predibase/customer_support,predibase/magicoder
Inference Endpoints GUI
Inference Endpoints allows you to have access to deploy any Hugging Face model on many GPUs and alternative Hardware types across AWS, GCP, and Azure all in a few clicks! In the GUI, it's easy to deploy. Under the hood, we use TGI by default for text generation (though you have the option to use any image you choose).
To use Multi-LoRA serving on Inference Endpoints, you just need to go to your dashboard, then:
- Choose your base model: mistralai/Mistral-7B-v0.1
- Choose your Cloud | Region | HW
- Ill use AWS | us-east-1 | Nvidia L4
- Select Advanced Configuration
- You should see text generation already selected
- You can configure based on your needs
- Add LORA_ADAPTERS=predibase/customer_support,predibase/magicoder in Environment Variables
- Finally Create Endpoint!
Inference Endpoints Code
Maybe some of you are musophobic and don't want to use your mouse, we don't judge. It's easy enough to automate this in code and only use your keyboard.
from huggingface_hub import create_inference_endpoint
# Custom Docker image details
custom_image = {
"health_route": "/health",
"url": "ghcr.io/huggingface/text-generation-inference:2.1.1", # This is the min version
"env": {
"LORA_ADAPTERS": "predibase/customer_support,predibase/magicoder", # Add adapters here
"MAX_BATCH_PREFILL_TOKENS": "2048", # Set according to your needs
"MAX_INPUT_LENGTH": "1024", # Set according to your needs
"MAX_TOTAL_TOKENS": "1512", # Set according to your needs
"MODEL_ID": "/repository"
}
}
# Creating the inference endpoint
endpoint = create_inference_endpoint(
name="mistral-7b-multi-lora",
repository="mistralai/Mistral-7B-v0.1",
framework="pytorch",
accelerator="gpu",
instance_size="x1",
instance_type="nvidia-l4",
region="us-east-1",
vendor="aws",
min_replica=1,
max_replica=1,
task="text-generation",
custom_image=custom_image,
)
endpoint.wait()
print("Your model is ready to use!")
Practical Considerations
Cost
We are not the first to climb this summit, as discussed below. The team behind LoRAX, Predibase, has an excellent write-up. Do check it out, as this section is based on their work.
Figure 5: Multi-LoRA Cost For TGI, I deployed mistralai/Mistral-7B-v0.1 as a base on nvidia-l4, which has a cost of $0.8/hr on Inference Endpoints. I was able to get 75 requests/s with an average of 450 input tokens and 234 output tokens and adjusted accordingly for GPT3.5 Turbo.
One of the big benefits of Multi-LoRA serving is that you don’t need to have multiple deployments for multiple models, and ultimately this is much much cheaper. This should match your intuition as multiple models will need all the weights and not just the small adapter layer. As you can see in Figure 5, even when we add many more models with TGI Multi-LoRA the cost is the same per token. The cost for TGI dedicated scales as you need a new deployment for each fine-tuned model.
Usage Patterns
Figure 6: Multi-LoRA Serving Pattern One real-world challenge when you deploy multiple models is that you will have a strong variance in your usage patterns. Some models might have low usage; some might be bursty, and some might be high frequency. This makes it really hard to scale, especially when each model is independent. There are a lot of “rounding” errors when you have to add another GPU, and that adds up fast. In an ideal world, you would maximize your GPU utilization per GPU and not use any extra. You need to make sure you have access to enough GPUs, knowing some will be idle, which can be quite tedious.
When we consolidate with Multi-LoRA, we get much more stable usage. We can see the results of this in Figure 6 where the Multi-LoRA Serving pattern is quite stable even though it consists of more volatile patterns. By consolidating the models, you allow much smoother usage and more manageable scaling. Do note that these are just illustrative patterns, but think through your own patterns and how Multi-LoRA can help. Scale 1 model, not 30!
Changing the Base Model
What happens in the real world with AI moving at breakneck speeds? What if you want to choose a different/newer model as your base? While our examples use mistralai/Mistral-7B-v0.1 as a base model, there are other bases like Mistral’s v0.3 which supports function calling, and altogether different model families like Llama 3. In general, we expect new base models that are more efficient and more performant to come out all the time.
But worry not! It is easy enough to re-train the LoRAs if you have a compelling reason to update your base model. Training is relatively cheap; in fact Predibase found it costs only ~$8.00 to train each one. The amount of code changes is minimal with modern frameworks and common engineering practices:
- Keep the notebook/code used to train your model
- Version control your datasets
- Keep track of the configuration used
- Update with the new model/settings
Conclusion
Multi-LoRA serving represents a transformative approach in the deployment of AI models, providing a solution to the cost and complexity barriers associated with managing multiple specialized models. By leveraging a single base model and dynamically applying fine-tuned adapters, organizations can significantly reduce operational overhead while maintaining or even enhancing performance across diverse tasks. AI Directors we ask you to be bold, choose a base model and embrace




