Models & Inference
What quantization format is used for the Llama 3.1 405B model?
The Llama 3.1 405B model uses the FP8 quantization format, which closely matches Meta’s reference implementation. Further details are available in the model description at fireworks.ai/models/fireworks/llama-v3p1-405b-instruct, and our general quantization methodology is documented in our Quantization blog.
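As a rough illustration of what FP8 quantization involves, the sketch below scales a higher-precision weight tensor into the FP8 E4M3 range using a per-tensor scale factor. This is a generic example, not Fireworks’ documented recipe: the E4M3 maximum of 448 and per-tensor scaling are standard FP8 conventions, and rounding to FP8 mantissa precision is omitted for brevity.

```python
import numpy as np

# Largest finite value representable in the FP8 E4M3 format.
FP8_E4M3_MAX = 448.0

def quantize_fp8(x: np.ndarray):
    """Scale a higher-precision tensor into the FP8 E4M3 range.

    Returns the scaled values and the per-tensor scale factor.
    (Rounding to FP8 mantissa precision is omitted for brevity.)
    """
    scale = np.max(np.abs(x)) / FP8_E4M3_MAX
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def dequantize_fp8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original tensor."""
    return q * scale

# Example: a small weight tensor round-tripped through the scheme.
w = np.array([0.5, -2.0, 3.25, -4.48])
q, s = quantize_fp8(w)
w_hat = dequantize_fp8(q, s)
```

Because the scale maps the tensor’s largest magnitude exactly onto the E4M3 maximum, no clipping occurs here and the round-trip error comes only from the (omitted) mantissa rounding.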
Note: BF16 precision will be available soon for on-demand deployments.