What factors affect the number of simultaneous requests that can be handled?
The request handling capacity is influenced by multiple factors (a quick way to measure their combined effect is sketched after this list):

- Model size and type
- Number of GPUs allocated to the deployment
- GPU type (e.g., A100 vs. H100)
- Prompt size and generation token length
- Deployment type (serverless vs. on-demand)
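
To see how these factors play out for your own workload, you can probe a deployment with a fixed number of simultaneous requests and watch how latency shifts as you vary the concurrency, prompt size, or `max_tokens`. Below is a minimal sketch, assuming the OpenAI-compatible chat completions endpoint, the `httpx` library, and an example model id (`llama-v3p1-8b-instruct`); substitute your own model and adjust the parameters to match your traffic.

```python
"""Concurrency probe: fire N simultaneous chat-completion requests
and report per-request latency. A rough sketch, not a benchmark tool."""
import asyncio
import os
import time

import httpx

URL = "https://api.fireworks.ai/inference/v1/chat/completions"
MODEL = "accounts/fireworks/models/llama-v3p1-8b-instruct"  # example model id


async def one_request(client: httpx.AsyncClient, i: int) -> float:
    """Send one request and return its wall-clock latency in seconds."""
    t0 = time.perf_counter()
    resp = await client.post(
        URL,
        headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"},
        json={
            "model": MODEL,
            "max_tokens": 64,  # generation token length affects capacity
            "messages": [{"role": "user", "content": f"Say hello #{i}"}],
        },
        timeout=120.0,
    )
    resp.raise_for_status()
    return time.perf_counter() - t0


async def main(n: int = 16) -> None:
    # Launch all n requests at once so they contend for the same capacity.
    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(
            *(one_request(client, i) for i in range(n))
        )
    latencies = sorted(latencies)
    print(f"{n} concurrent requests: "
          f"p50={latencies[n // 2]:.2f}s  max={latencies[-1]:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```

Rerunning the probe while changing one variable at a time (e.g., doubling the concurrency, or switching between a serverless model and an on-demand deployment with more GPUs) gives a rough picture of how each factor above affects your effective capacity.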