After fine-tuning your model on Fireworks, deploy it to make it available for inference. Fireworks supports two deployment methods for LoRA fine-tuned models: live merge and multi-LoRA. Each method has different tradeoffs around performance, cost, and flexibility.
You can also upload and deploy LoRA models fine-tuned outside of Fireworks. See importing fine-tuned models for details.
## Choosing a deployment method
Fireworks offers two ways to deploy LoRA fine-tuned models. The right choice depends on how many fine-tuned variants you need to serve and your performance requirements.

| | Live merge | Multi-LoRA |
|---|---|---|
| How it works | LoRA weights are merged into the base model at deployment time, creating a single merged model | Base model is deployed with addon support; LoRA adapters are loaded dynamically at request time |
| Number of LoRAs | One per deployment | Multiple per deployment |
| Inference performance | Matches the base model (no overhead) | Some overhead per request due to dynamic adapter application |
| Throughput | Same as base model | Lower maximum throughput under high concurrency |
| Cost efficiency | One deployment per fine-tune | Share a single deployment across many fine-tunes |
| Best for | Production workloads requiring maximum performance | Experimentation, A/B testing, or serving many variants of the same base model |
## Live merge deployment

Live merge is the simplest way to deploy a fine-tuned model. Fireworks automatically merges the LoRA weights into the base model at deployment time, producing a model that performs identically to a natively fine-tuned model with no inference overhead.

### How it works
When you deploy a LoRA model directly, Fireworks:

- Takes your LoRA adapter weights and the base model
- Merges them into a single set of weights at deployment time
- Serves the merged model as a standalone deployment
### Deploy with live merge

Deploy your LoRA fine-tuned model with a single command:
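As a minimal sketch, assuming the firectl CLI (the account and model IDs below are placeholders; check `firectl create deployment --help` for the current flags):

```bash
# Create a deployment directly from the fine-tuned model. Fireworks
# merges the LoRA adapter weights into the base model as part of
# creating the deployment.
firectl create deployment accounts/<ACCOUNT_ID>/models/<FINE_TUNED_MODEL_ID>
```

Your deployment will be ready to use once it completes, with performance that matches the base model.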
### Sending requests
Send inference requests to your live-merge deployment by referencing the deployment directly.
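For example, a minimal sketch using the OpenAI-compatible Python client (the resource names are placeholders; the Fireworks SDK and curl use the same request shape):

```python
import os

from openai import OpenAI

# Fireworks exposes an OpenAI-compatible API, so the OpenAI client
# works once base_url points at the Fireworks inference endpoint.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    # Reference the deployment directly: the merged model's resource
    # name, suffixed with the deployment that serves it.
    model="accounts/<ACCOUNT_ID>/models/<FINE_TUNED_MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```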
### When to use live merge
- You need maximum inference performance (latency and throughput matching the base model)
- You are serving a single fine-tuned model in production
- You want the simplest possible deployment workflow
## Multi-LoRA deployment

Multi-LoRA lets you load multiple LoRA adapters onto a single base model deployment. This is useful when you have several fine-tuned variants of the same base model and want to share GPU resources across them rather than creating a separate deployment for each.

### How it works
With multi-LoRA:

- You deploy the base model with addon support enabled
- You load one or more LoRA adapters onto the running deployment
- At inference time, the correct adapter is selected and applied dynamically based on the model specified in the request
### Deploy with multi-LoRA

Deploying with multi-LoRA takes two steps: create a base model deployment with addon support enabled, then load your LoRA adapters onto it.
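A minimal sketch, again assuming the firectl CLI (the --enable-addons flag and load-lora subcommand reflect current firectl conventions but may differ in your version; all IDs are placeholders):

```bash
# Step 1: deploy the base model with addon (LoRA) support enabled.
firectl create deployment accounts/fireworks/models/<BASE_MODEL_ID> \
  --enable-addons

# Step 2: load a LoRA adapter onto the running deployment. Repeat
# for each fine-tuned variant you want this deployment to serve.
firectl load-lora <FINE_TUNED_MODEL_ID> --deployment <DEPLOYMENT_ID>
```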
### Sending requests
To route inference requests to a specific LoRA adapter on a multi-LoRA deployment, set the `model` field to `<model_name>#<deployment_name>`. The `#` separator tells Fireworks to route the request to the specified adapter on the given deployment.
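For example, a minimal sketch with the OpenAI-compatible Python client (placeholder names throughout; the same `model` string works from the Fireworks SDK, JavaScript, or curl):

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

# "<model_name>#<deployment_name>" selects the named LoRA adapter
# on the named multi-LoRA deployment.
response = client.chat.completions.create(
    model="accounts/<ACCOUNT_ID>/models/<LORA_MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

To target a different adapter on the same deployment, change only the model portion before the `#`.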
### When to use multi-LoRA
- You need to serve multiple fine-tuned models based on the same base model
- You want to maximize GPU utilization by sharing a single deployment
- You are running experiments or A/B tests across multiple fine-tuned variants
- You can accept some performance overhead compared to live merge
## Performance considerations

Live merge eliminates all LoRA-related inference overhead because the adapter weights are baked into the model at deployment time. The resulting deployment behaves exactly like a natively fine-tuned base model.

Multi-LoRA deployments incur overhead because adapters are applied dynamically:

- Time to first token (TTFT): Increases by roughly 10–30% due to adapter loading and prompt processing overhead
- Generation speed: Overhead grows with higher request concurrency
- Maximum throughput: Lower than a live-merge deployment under sustained load
## Next steps

- On-Demand Deployments: learn about deployment configuration and optimization
- Import Fine-Tuned Models: upload LoRA models fine-tuned outside of Fireworks
- LoRA Performance: understand performance tradeoffs and optimization strategies