Configure and manage on-demand deployments on dedicated GPUs
New to deployments? Start with our Deployments Quickstart to deploy and query your first model in minutes, then return here to learn about configuration options.
On-demand deployments give you dedicated GPUs for your models, providing several advantages over serverless:
Better performance – Lower latency, higher throughput, and predictable performance unaffected by other users
No hard rate limits – Only limited by your deployment’s capacity
Cost-effective at scale – Cheaper under high utilization. Unlike serverless models (billed per token), on-demand deployments are billed by GPU-second.
Broader model selection – Access models not available on serverless
Custom models – Upload your own models (for supported architectures) from Hugging Face or elsewhere
Need higher GPU quotas or want to reserve capacity? Contact us.
```shell
# This command returns your accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID> - save it for querying
firectl deployment create accounts/fireworks/models/<MODEL_NAME> --wait
```
See Deployment shapes below to optimize for speed, throughput, or cost.
Query your deployment
After creating a deployment, query it using this format:
accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>
You can find your deployment name anytime with firectl deployment list and firectl deployment get <DEPLOYMENT_ID>.
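As a minimal sketch of querying a deployment, the snippet below passes the `accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>` resource name as the `model` field of an OpenAI-compatible chat completions request. The helper names, the `my-account`/`abc123` placeholders, and the use of a `FIREWORKS_API_KEY` environment variable are illustrative assumptions, not part of the official SDK:

```python
import json
import os
import urllib.request


def build_chat_request(account_id: str, deployment_id: str, messages: list) -> dict:
    """Build an OpenAI-compatible chat payload that routes to an
    on-demand deployment via its resource name."""
    return {
        "model": f"accounts/{account_id}/deployments/{deployment_id}",
        "messages": messages,
    }


def query_deployment(payload: dict) -> dict:
    """POST the payload to the OpenAI-compatible chat completions
    endpoint. Expects FIREWORKS_API_KEY in the environment."""
    req = urllib.request.Request(
        "https://api.fireworks.ai/inference/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# "my-account" and "abc123" are hypothetical placeholders; substitute
# the values printed by `firectl deployment create` or `firectl deployment list`.
payload = build_chat_request(
    "my-account", "abc123", [{"role": "user", "content": "Hello"}]
)
```

The same `model` string works with any OpenAI-compatible client pointed at the Fireworks inference base URL.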
Deployment shapes
Deployment shapes are the primary way to configure deployments. They’re pre-configured templates optimized for speed, cost, or efficiency, including hardware, quantization, and other performance factors.
Fast – Low latency for interactive workloads
Throughput – Cost-per-token at scale for high-volume workloads
Minimal – Lowest cost for testing or light workloads
Usage:
```shell
# List available shapes
firectl deployment-shape-version list --base-model <model-id>

# Create with a shape (shorthand)
firectl deployment create accounts/fireworks/models/deepseek-v3 --deployment-shape throughput

# Create with full shape ID
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
  --deployment-shape accounts/fireworks/deploymentShapes/llama-v3p3-70b-instruct-fast

# View shape details
firectl deployment-shape-version get <full-deployment-shape-version-id>
```
Need even better performance with tailored optimizations? Contact our team.
```shell
# List all deployments
firectl deployment list

# Check deployment status
firectl deployment get <DEPLOYMENT_ID>

# Delete a deployment
firectl deployment delete <DEPLOYMENT_ID>
```
By default, deployments scale to zero if unused for 1 hour. Deployments with min replicas set to 0 are automatically deleted after 7 days of no traffic.
When a deployment is scaled to zero, requests return a 503 error immediately while the deployment scales up. Your application should implement retry logic to handle this. See Scaling from zero behavior for implementation details.
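The retry logic mentioned above can be sketched as a small wrapper with exponential backoff. This is an illustrative helper, not an official SDK function; it assumes only that the client returns an object with a `status_code` attribute (as `requests` or `httpx` responses do):

```python
import time


def call_with_retry(send, max_attempts=6, base_delay=1.0):
    """Call `send()` (a zero-argument callable returning a response
    with a .status_code attribute) and retry with exponential backoff
    while the deployment returns 503 during scale-up from zero.

    Returns the first non-503 response, or the last 503 response if
    all attempts are exhausted.
    """
    for attempt in range(max_attempts):
        resp = send()
        if resp.status_code != 503:
            return resp
        if attempt < max_attempts - 1:
            # Back off 1s, 2s, 4s, ... to give the deployment time to scale up.
            time.sleep(base_delay * (2 ** attempt))
    return resp
```

With the default settings this waits up to roughly a minute total, which you can tune to your deployment's observed cold-start time.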