After fine-tuning your model on Fireworks, deploy it to make it available for inference. Fireworks supports two deployment methods for LoRA fine-tuned models: live merge and multi-LoRA. Each method has different tradeoffs around performance, cost, and flexibility.
Fine-tuned LoRA models, whether created on the Fireworks platform or imported, can only be deployed to on-demand (dedicated) deployments. Serverless deployment is not supported for LoRA models.
You can also upload and deploy LoRA models fine-tuned outside of Fireworks. See importing fine-tuned models for details.

Choosing a deployment method

Fireworks offers two ways to deploy LoRA fine-tuned models. The right choice depends on how many fine-tuned variants you need to serve and your performance requirements.
| | Live merge | Multi-LoRA |
| --- | --- | --- |
| How it works | LoRA weights are merged into the base model at deployment time, creating a single merged model | The base model is deployed with addon support; LoRA adapters are loaded dynamically at request time |
| Number of LoRAs | One per deployment | Multiple per deployment |
| Inference performance | Matches the base model (no overhead) | Some overhead per request due to dynamic adapter application |
| Throughput | Same as base model | Lower maximum throughput under high concurrency |
| Cost efficiency | One deployment per fine-tune | Share a single deployment across many fine-tunes |
| Best for | Production workloads requiring maximum performance | Experimentation, A/B testing, or serving many variants of the same base model |
If you only need to serve a single fine-tuned model, live merge is the recommended approach. It delivers the best performance with the simplest setup.

Live merge deployment

Live merge is the simplest way to deploy a fine-tuned model. Fireworks automatically merges the LoRA weights into the base model at deployment time, producing a model that performs identically to a natively fine-tuned model with no inference overhead.

How it works

When you deploy a LoRA model directly, Fireworks:
  1. Takes your LoRA adapter weights and the base model
  2. Merges them into a single set of weights at deployment time
  3. Serves the merged model as a standalone deployment
The result is a deployment that is indistinguishable from a fully fine-tuned model in terms of latency, throughput, and memory usage.
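Conceptually, the merge folds the low-rank update into the base weights once, so inference afterwards is a single matmul per layer. A minimal numpy sketch of the arithmetic (the shapes, rank, and scaling value are illustrative assumptions; Fireworks performs this on the actual model weights at deployment time):

import numpy as np

# Illustrative shapes: one linear layer with a rank-8 LoRA adapter.
d_out, d_in, rank = 4096, 4096, 8
alpha = 16  # LoRA scaling hyperparameter (assumed value)

W = np.random.randn(d_out, d_in)  # base model weight
A = np.random.randn(rank, d_in)   # LoRA down-projection
B = np.random.randn(d_out, rank)  # LoRA up-projection

# Live merge: fold the adapter into the base weight once, at deployment time.
W_merged = W + (alpha / rank) * (B @ A)

# After merging, a forward pass costs exactly what the base model costs.
x = np.random.randn(d_in)
y = W_merged @ x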

Deploy with live merge

Deploy your LoRA fine-tuned model with a single command:
firectl deployment create "accounts/<ACCOUNT_ID>/models/<FINE_TUNED_MODEL_ID>"
Once the deployment completes, it is ready to use, with performance that matches the base model.

Sending requests

Send inference requests to your live-merge deployment by referencing the deployment directly:
from fireworks import Fireworks

client = Fireworks()

response = client.chat.completions.create(
  model="accounts/<ACCOUNT_ID>/models/<FINE_TUNED_MODEL_ID>",
  messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
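The same deployment also supports streaming; a minimal sketch, assuming the SDK's OpenAI-compatible stream=True parameter:

from fireworks import Fireworks

client = Fireworks()

stream = client.chat.completions.create(
  model="accounts/<ACCOUNT_ID>/models/<FINE_TUNED_MODEL_ID>",
  messages=[{"role": "user", "content": "Hello!"}],
  stream=True,
)

# Print tokens as they arrive instead of waiting for the full response.
for chunk in stream:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="", flush=True)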

When to use live merge

  • You need maximum inference performance (latency and throughput matching the base model)
  • You are serving a single fine-tuned model in production
  • You want the simplest possible deployment workflow

Multi-LoRA deployment

Multi-LoRA lets you load multiple LoRA adapters onto a single base model deployment. This is useful when you have several fine-tuned variants of the same base model and want to share GPU resources across them rather than creating a separate deployment for each.

How it works

With multi-LoRA:
  1. You deploy the base model with addon support enabled
  2. You load one or more LoRA adapters onto the running deployment
  3. At inference time, the correct adapter is selected and applied dynamically based on the model specified in the request
Because adapters are applied dynamically rather than merged, there is some performance overhead compared to live merge. This overhead increases with higher request concurrency.
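For intuition on where that overhead comes from: the base weights stay untouched, and the adapter's low-rank correction is computed per request for whichever adapter the request names. A minimal numpy sketch (the adapter registry and shapes are illustrative assumptions, not the serving implementation):

import numpy as np

d_out, d_in, rank, alpha = 4096, 4096, 8, 16
W = np.random.randn(d_out, d_in)  # shared base weight, loaded once

# Loaded adapters keyed by model name (illustrative registry).
adapters = {
  "variant-a": (np.random.randn(d_out, rank), np.random.randn(rank, d_in)),
  "variant-b": (np.random.randn(d_out, rank), np.random.randn(rank, d_in)),
}

def forward(x, model_name):
  # Base matmul plus a per-request low-rank correction: y = Wx + (alpha/r) * B(Ax).
  B, A = adapters[model_name]
  return W @ x + (alpha / rank) * (B @ (A @ x))

y = forward(np.random.randn(d_in), "variant-a")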

Deploy with multi-LoRA

1. Create base model deployment with addon support

Deploy the base model with addons enabled:
firectl deployment create "accounts/fireworks/models/<BASE_MODEL_ID>" --enable-addons

2. Load LoRA adapters

Once the deployment is ready, load your LoRA models onto the deployment:
firectl load-lora <FINE_TUNED_MODEL_ID> --deployment <DEPLOYMENT_ID>
Repeat this command for each LoRA adapter you want to load.
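If you have many adapters, you can script this step rather than repeating the command by hand. A convenience sketch that shells out to firectl from Python (the model and deployment IDs are placeholders):

import subprocess

deployment_id = "<DEPLOYMENT_ID>"
lora_model_ids = ["<FINE_TUNED_MODEL_ID_1>", "<FINE_TUNED_MODEL_ID_2>"]

# One firectl call per adapter; check=True stops on the first failure.
for model_id in lora_model_ids:
  subprocess.run(
    ["firectl", "load-lora", model_id, "--deployment", deployment_id],
    check=True,
  )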

Sending requests

To route inference requests to a specific LoRA adapter on a multi-LoRA deployment, set the model field to <model_name>#<deployment_name>. The # separator tells Fireworks to route the request to the specified adapter on the given deployment.
Deprecation notice: The deployedModel request key for routing to LoRA addons is deprecated and will not be supported for any new deployments. Use the model field with the <model_name>#<deployment_name> format shown below.
from fireworks import Fireworks

client = Fireworks()

response = client.chat.completions.create(
  model="accounts/<ACCOUNT_ID>/models/<FINE_TUNED_MODEL_ID>#accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>",
  messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
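Since every adapter shares the same deployment, switching fine-tunes is just a change of the model string. A sketch that sends one prompt to two adapters for a side-by-side comparison (the adapter IDs are placeholders):

from fireworks import Fireworks

client = Fireworks()

deployment = "accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>"
adapters = [
  "accounts/<ACCOUNT_ID>/models/<FINE_TUNED_MODEL_ID_1>",
  "accounts/<ACCOUNT_ID>/models/<FINE_TUNED_MODEL_ID_2>",
]

# Route the same prompt to each adapter on the shared deployment.
for adapter in adapters:
  response = client.chat.completions.create(
    model=f"{adapter}#{deployment}",
    messages=[{"role": "user", "content": "Hello!"}],
  )
  print(adapter, "->", response.choices[0].message.content)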

When to use multi-LoRA

  • You need to serve multiple fine-tuned models based on the same base model
  • You want to maximize GPU utilization by sharing a single deployment
  • You are running experiments or A/B tests across multiple fine-tuned variants
  • You can accept some performance overhead compared to live merge

Performance considerations

Live merge eliminates all LoRA-related inference overhead because the adapter weights are baked into the model at deployment time; the resulting deployment behaves exactly like a natively fine-tuned base model. Multi-LoRA deployments incur overhead because adapters are applied dynamically:
  • Time to first token (TTFT): Increases by roughly 10–30% due to adapter loading and prompt processing overhead (see the measurement sketch after this list)
  • Generation speed: Overhead grows with higher request concurrency
  • Maximum throughput: Lower than a live-merge deployment under sustained load
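To measure the TTFT difference on your own workload, time the first streamed token against each deployment type. A minimal measurement sketch, assuming the streaming interface shown earlier:

import time
from fireworks import Fireworks

client = Fireworks()

def time_to_first_token(model):
  # Seconds from request start until the first content token arrives.
  start = time.perf_counter()
  stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
  )
  for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
      return time.perf_counter() - start

# Compare a live-merge model against an adapter on a multi-LoRA deployment.
print(time_to_first_token("accounts/<ACCOUNT_ID>/models/<FINE_TUNED_MODEL_ID>"))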
For a deeper dive into LoRA performance characteristics and optimization strategies, see Understanding LoRA Performance.

Next steps

  • On-Demand Deployments: Learn about deployment configuration and optimization
  • Import Fine-Tuned Models: Upload LoRA models fine-tuned outside of Fireworks
  • LoRA Performance: Understand performance tradeoffs and optimization strategies