Configure and manage on-demand deployments on dedicated GPUs
New to deployments? Start with our Deployments Quickstart to deploy and query your first model in minutes, then return here to learn about configuration options.
On-demand deployments give you dedicated GPUs for your models, providing several advantages over serverless:
Better performance – Lower latency, higher throughput, and predictable performance unaffected by other users
No hard rate limits – Only limited by your deployment’s capacity
Cost-effective at scale – Cheaper under high utilization. Unlike serverless models (billed per token), on-demand deployments are billed by GPU-second.
Broader model selection – Access models not available on serverless
Custom models – Upload your own models (for supported architectures) from Hugging Face or elsewhere
Need higher GPU quotas or want to reserve capacity? Contact us.
```shell
# This command returns your accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID> - save it for querying
firectl deployment create accounts/fireworks/models/<MODEL_NAME> --wait
```
See Deployment shapes below to optimize for speed, throughput, or cost.
Query your deployment
After creating a deployment, query it using this format:
accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>
You can find your deployment name anytime with firectl deployment list and firectl deployment get <DEPLOYMENT_ID>.
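As a minimal sketch of querying a deployment, the snippet below passes the `accounts/<ACCOUNT_ID>/deployments/<DEPLOYMENT_ID>` resource name as the `model` field of an OpenAI-compatible chat completions request. The helper names, the `my-account`/`abc123` placeholders, and the use of a `FIREWORKS_API_KEY` environment variable are illustrative assumptions, not part of the official SDK:

```python
import json
import os
import urllib.request


def build_chat_request(account_id: str, deployment_id: str, messages: list) -> dict:
    """Build an OpenAI-compatible chat payload that routes to an
    on-demand deployment via its resource name."""
    return {
        "model": f"accounts/{account_id}/deployments/{deployment_id}",
        "messages": messages,
    }


def query_deployment(payload: dict) -> dict:
    """POST the payload to the OpenAI-compatible chat completions
    endpoint. Expects FIREWORKS_API_KEY in the environment."""
    req = urllib.request.Request(
        "https://api.fireworks.ai/inference/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# "my-account" and "abc123" are hypothetical placeholders; substitute
# the values printed by `firectl deployment create` or `firectl deployment list`.
payload = build_chat_request(
    "my-account", "abc123", [{"role": "user", "content": "Hello"}]
)
```

The same `model` string works with any OpenAI-compatible client pointed at the Fireworks inference base URL.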
Deployment shapes
Deployment shapes are the primary way to configure deployments. They’re pre-configured templates optimized for speed, cost, or efficiency, including hardware, quantization, and other performance factors.
Fast – Low latency for interactive workloads
Throughput – Cost-per-token at scale for high-volume workloads
Minimal – Lowest cost for testing or light workloads
Usage:
```shell
# List available shapes
firectl deployment-shape-version list --base-model <model-id>

# Create with a shape (shorthand)
firectl deployment create accounts/fireworks/models/deepseek-v3 --deployment-shape throughput

# Create with full shape ID
firectl deployment create accounts/fireworks/models/llama-v3p3-70b-instruct \
  --deployment-shape accounts/fireworks/deploymentShapes/llama-v3p3-70b-instruct-fast

# View shape details
firectl deployment-shape-version get <full-deployment-shape-version-id>
```
Need even better performance with tailored optimizations? Contact our team.
```shell
# List all deployments
firectl deployment list

# Check deployment status
firectl deployment get <DEPLOYMENT_ID>

# Delete a deployment
firectl deployment delete <DEPLOYMENT_ID>
```
By default, deployments scale to zero if unused for 1 hour. Deployments with min replicas set to 0 are automatically deleted after 7 days of no traffic.
When a deployment is scaled to zero, requests return a 503 error immediately while the deployment scales up. Your application should implement retry logic to handle this. See Scaling from zero behavior for implementation details.
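The retry logic mentioned above can be sketched as a small wrapper with exponential backoff. This is an illustrative helper, not an official SDK function; it assumes only that the client returns an object with a `status_code` attribute (as `requests` or `httpx` responses do):

```python
import time


def call_with_retry(send, max_attempts=6, base_delay=1.0):
    """Call `send()` (a zero-argument callable returning a response
    with a .status_code attribute) and retry with exponential backoff
    while the deployment returns 503 during scale-up from zero.

    Returns the first non-503 response, or the last 503 response if
    all attempts are exhausted.
    """
    for attempt in range(max_attempts):
        resp = send()
        if resp.status_code != 503:
            return resp
        if attempt < max_attempts - 1:
            # Back off 1s, 2s, 4s, ... to give the deployment time to scale up.
            time.sleep(base_delay * (2 ** attempt))
    return resp
```

With the default settings this waits up to roughly a minute total, which you can tune to your deployment's observed cold-start time.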