Deployment & Infrastructure
How does billing and scaling work for on-demand GPU deployments?
On-demand GPU deployments are billed and scaled differently from serverless deployments:
Billing:
- Charges start when the server begins accepting requests.
- You are billed by GPU-second for each active instance.
- Costs accumulate even if there are no active API calls (see the worked example after this list).
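A back-of-the-envelope illustration of GPU-second billing. The hourly rate below is a made-up placeholder, not a published Fireworks price; check the pricing page for actual per-GPU rates:

```bash
# Hypothetical cost check: 1 GPU replica left running for 2 hours.
# $2.90/GPU-hour is an assumed placeholder rate for illustration only.
rate_per_hour=2.90
gpu_count=1
hours=2
echo "scale=2; $rate_per_hour * $gpu_count * $hours" | bc
# Prints 5.80 -- accrued whether or not the deployment served any requests.
```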
Scaling options:
- Supports autoscaling from 0 to multiple GPUs (see the firectl sketch after this list).
- Each additional GPU adds to the billing rate.
- Can handle any number of requests within the GPUs' capacity.
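A minimal sketch of creating an on-demand deployment with scale-to-zero autoscaling via firectl. The model path is a placeholder, and the replica-count flags are assumed from the firectl interface; confirm the exact flags with `firectl create deployment --help`:

```bash
# Create an on-demand deployment that autoscales between 0 and 3 replicas.
# Replace the model path with the model you actually want to deploy.
firectl create deployment accounts/fireworks/models/llama-v3p1-8b-instruct \
  --min-replica-count 0 \
  --max-replica-count 3
```

With `--min-replica-count 0`, idle periods incur no GPU-second charges; the trade-off is a cold start when traffic resumes.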
Management requirements:
- Not fully serverless; some manual management is required.
- Manually delete deployments when they are no longer needed, or
- Configure autoscaling to scale down to 0 during inactive periods (both shown in the sketch below).
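To stop charges on a deployment you no longer need, or to retrofit scale-to-zero onto an existing one, a sketch using firectl (the deployment ID placeholder stands in for your own; verify the command shapes with `firectl --help`):

```bash
# Tear the deployment down entirely; billing stops once it is deleted.
firectl delete deployment <DEPLOYMENT_ID>

# Or keep it, but let it scale to zero replicas when idle.
firectl update deployment <DEPLOYMENT_ID> --min-replica-count 0
```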
Cost control tips:
- Regularly monitor active deployments (see the listing sketch after this list).
- Delete unused deployments to avoid unnecessary costs.
- Consider serverless options for intermittent usage.
- Use autoscaling to 0 to optimize costs during low-demand periods.
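A quick way to audit what is currently accruing charges; the exact output columns may vary by firectl version:

```bash
# List every deployment in your account, then delete any you no longer use.
firectl list deployments
```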