Models & Inference
What quantization format is used for the Llama 3.1 405B model?
The Llama 3.1 405B model uses the FP8 quantization format, which closely matches Meta’s reference implementation. Further details are available in the model description at fireworks.ai/models/fireworks/llama-v3p1-405b-instruct, and our general quantization methodology is documented in our Quantization blog.
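As a rough illustration of what FP8 quantization involves, the sketch below scales a higher-precision weight tensor into the FP8 E4M3 range using a per-tensor scale factor. This is a generic example, not Fireworks’ documented recipe: the E4M3 maximum of 448 and per-tensor scaling are standard FP8 conventions, and rounding to FP8 mantissa precision is omitted for brevity.

```python
import numpy as np

# Largest finite value representable in the FP8 E4M3 format.
FP8_E4M3_MAX = 448.0

def quantize_fp8(x: np.ndarray):
    """Scale a higher-precision tensor into the FP8 E4M3 range.

    Returns the scaled values and the per-tensor scale factor.
    (Rounding to FP8 mantissa precision is omitted for brevity.)
    """
    scale = np.max(np.abs(x)) / FP8_E4M3_MAX
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def dequantize_fp8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original tensor."""
    return q * scale

# Example: a small weight tensor round-tripped through the scheme.
w = np.array([0.5, -2.0, 3.25, -4.48])
q, s = quantize_fp8(w)
w_hat = dequantize_fp8(q, s)
```

Because the scale maps the tensor’s largest magnitude exactly onto the E4M3 maximum, no clipping occurs here and the round-trip error comes only from the (omitted) mantissa rounding.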
Note: BF16 precision will be available soon for on-demand deployments.