By default, models on dedicated deployments are served using 16-bit floating-point (FP16) precision. Quantization reduces the number of bits used to serve the model, improving performance and reducing the cost to serve. However, it can change model numerics, which may introduce small changes to the output. Take a look at our blog post for a detailed treatment of how quantization affects model quality.
Quantizing a model
A model can be quantized to 8-bit floating-point (FP8) precision.
firectl prepare-model <MODEL_ID>
import os
import requests

ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
API_KEY = os.environ.get("FIREWORKS_API_KEY")
MODEL_ID = "<YOUR_MODEL_ID>"  # The ID of the model you want to prepare

response = requests.post(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/models/{MODEL_ID}:prepare",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "precision": "FP8",
    },
)
print(response.json())
This is an additive process that enables creating deployments with additional precisions. The original FP16 checkpoint is still available for use.
You can check on the status of preparation by running:
firectl get model <MODEL_ID>
import os
import requests

ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
API_KEY = os.environ.get("FIREWORKS_API_KEY")
MODEL_ID = "<YOUR_MODEL_ID>"  # The ID of the model you want to get

response = requests.get(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/models/{MODEL_ID}",
    headers={
        "Authorization": f"Bearer {API_KEY}",
    },
)
print(response.json())
and checking whether the state is still PREPARING. A successfully prepared model will have the desired precision added to its Precisions list.
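If preparation takes a while, you can poll the same endpoint until the model leaves the PREPARING state. Below is a minimal sketch that reuses the GET request above; it assumes the response JSON exposes the model's state under a state field and the prepared precisions under a precisions list, so check the actual response from your account for the exact field names:

import os
import time
import requests

ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
API_KEY = os.environ.get("FIREWORKS_API_KEY")
MODEL_ID = "<YOUR_MODEL_ID>"  # The ID of the model being prepared

url = f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/models/{MODEL_ID}"
headers = {"Authorization": f"Bearer {API_KEY}"}

while True:
    model = requests.get(url, headers=headers).json()
    # NOTE: "state" and "precisions" are assumed response field names.
    state = model.get("state")
    if state != "PREPARING":
        break
    time.sleep(30)  # wait before polling again

print("Final state:", state)
print("Available precisions:", model.get("precisions", []))

Adjust the polling interval as needed for your workflow.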
Creating an FP8 deployment
By default, creating a dedicated deployment will use the FP16 checkpoint. To see what precisions are available for a
model, run:
firectl get model <MODEL_ID>
import os
import requests

ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
API_KEY = os.environ.get("FIREWORKS_API_KEY")
MODEL_ID = "<YOUR_MODEL_ID>"  # The ID of the model you want to get

response = requests.get(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/models/{MODEL_ID}",
    headers={
        "Authorization": f"Bearer {API_KEY}",
    },
)
print(response.json())
The Precisions field will indicate what precisions the model has been prepared for. To use the quantized FP8 checkpoint, pass the --precision flag:
firectl create deployment <MODEL> --accelerator-type NVIDIA_H100_80GB --precision FP8
import os
import requests

ACCOUNT_ID = os.environ.get("FIREWORKS_ACCOUNT_ID")
API_KEY = os.environ.get("FIREWORKS_API_KEY")
# The ID of the model you want to deploy.
# The model must be prepared for FP8 precision.
MODEL_ID = "<YOUR_MODEL_ID>"
DEPLOYMENT_NAME = "My FP8 Deployment"

response = requests.post(
    f"https://api.fireworks.ai/v1/accounts/{ACCOUNT_ID}/deployments",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "displayName": DEPLOYMENT_NAME,
        "baseModel": MODEL_ID,
        "acceleratorType": "NVIDIA_H100_80GB",
        "precision": "FP8",
    },
)
print(response.json())
Quantized deployments can only be served using H100 GPUs.