Overview
This guide shows how to deploy Fireworks AI inference on Amazon SageMaker, including:
- Preparing your AWS account and checking SageMaker quotas
- Packaging and uploading your model to Amazon S3
- Pushing the Fireworks Docker image to Amazon ECR
- Creating a SageMaker endpoint (single replica, multi-replica, or sharded)
- Testing the endpoint via SageMaker Runtime
Make sure that you use the same region for all steps in this guide (S3 bucket, ECR repository, and SageMaker endpoint).
Prerequisites
- Fireworks AI Docker image and metering key; please reach out to aws@fireworks.ai for more information
- AWS account with permissions for SageMaker, ECR, and S3
- AWS CLI installed and configured
- Docker Desktop installed and running
- jq for JSON filtering (macOS: brew install jq)
- git-lfs for large model files (macOS: brew install git-lfs, then git lfs install)
- Optional: uv for Python virtual envs and execution (curl -LsSf https://astral.sh/uv/install.sh | sh)
This guide uses placeholder values for AWS account ID, region, and bucket name (denoted in square brackets). Please replace them with your own values.
Step 1: Obtain the Fireworks Docker image and metering key
- Reach out to aws@fireworks.ai to receive a link to the Fireworks AI Docker image and your metering key, and to set up billing.
- Keep the metering key secure; you will set it as an environment variable when deploying.
Store the metering key in a secrets manager or CI/CD secret store.
Step 2: Verify SageMaker GPU service quotas
Check your SageMaker quotas for GPU instances in your target region. Make sure to replace [YOUR_REGION] with your actual region. For best compatibility, instance types with A100, H100, or H200 GPUs are recommended (ml.p4d* for A100, ml.p5* for H100/H200).
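For example, you can list the relevant GPU endpoint quotas with the AWS CLI and filter the output with jq (a sketch; the exact quota names in your account may differ slightly):

```bash
# List SageMaker quotas in the target region and keep only P4/P5 entries
aws service-quotas list-service-quotas \
  --service-code sagemaker \
  --region [YOUR_REGION] \
  --output json \
| jq '.Quotas[] | select(.QuotaName | test("ml.p4d|ml.p5")) | {QuotaName, Value}'
```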
Insufficient quota will cause endpoint deployment failures. Request quota increases in advance if needed.
Step 3: Create an S3 bucket and upload model files
- Create a bucket in your region:
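For example, with the AWS CLI (replace the placeholders with your own bucket name and region):

```bash
# Create the bucket in the same region you will use for ECR and SageMaker
aws s3 mb s3://[BUCKET_NAME] --region [YOUR_REGION]
```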
- Download model files from Hugging Face (example: Qwen/Qwen3-8B):
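One way to do this is with git and git-lfs:

```bash
# git-lfs is required so the large weight files are downloaded, not just LFS pointers
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-8B
cd Qwen3-8B
```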
- Inside the model directory (e.g. in Qwen3-8B), create a fireworks.json describing your model configuration.
- Inside the model directory (e.g. in Qwen3-8B), upload your model files to S3:
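The deployment scripts in Step 6 expect an s3_model_path pointing at a model.tar.gz file (see Troubleshooting), so one way to package and upload is the following sketch; the bucket name and S3 key are placeholders:

```bash
# Package the model directory (including fireworks.json, and draft/ if present) into a tarball
tar -czvf ../model.tar.gz .

# Upload the tarball; note the resulting S3 URI for use as s3_model_path in Step 6
aws s3 cp ../model.tar.gz s3://[BUCKET_NAME]/qwen3-8b/model.tar.gz
```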
- [OPTIONAL] Add a speculator model for speculative decoding: place the draft model files in a draft directory inside the model directory (e.g. Qwen3-8B/draft) and create a fireworks.json for the speculator inside that directory. Include the draft directory when uploading to S3, then deploy with the speculative decoding script (see Step 6).
Step 4: Create an IAM role for SageMaker
- In the AWS Console, open IAM → Roles → Create role.
- Select AWS service → SageMaker → SageMaker - Execution.
- Keep the default policy AmazonSageMakerFullAccess and continue.
- Name the role (for example, SageMakerFireworksRole) and create it.
- Open the role’s Summary → Add permissions → Create inline policy → JSON, then paste the following. Replace [BUCKET_NAME] with your bucket name (no s3:// prefix).
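A minimal inline policy of this shape should work, assuming the container only needs to list the bucket and read the model objects (adjust the actions if your setup requires more):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::[BUCKET_NAME]",
        "arn:aws:s3:::[BUCKET_NAME]/*"
      ]
    }
  ]
}
```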
- Name the policy (for example, S3FireworksModelAccess) and create it.
Step 5: Push the Fireworks Docker image to ECR
- Create an ECR repository in your target region:
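For example (the repository name fireworks-sagemaker is just a placeholder):

```bash
# Create a private ECR repository in the same region as your bucket and endpoint
aws ecr create-repository \
  --repository-name fireworks-sagemaker \
  --region [YOUR_REGION]
```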
- Tag the Fireworks Docker image downloaded in Step 1:
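Assuming the image from Step 1 is available locally as fireworks/sagemaker:latest (adjust to the actual tag you received):

```bash
# Re-tag the local image with the ECR repository URI
docker tag fireworks/sagemaker:latest \
  [AWS_ACCOUNT_ID].dkr.ecr.[YOUR_REGION].amazonaws.com/fireworks-sagemaker:latest
```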
- Push the image to ECR:
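Authenticate Docker to ECR and push (same placeholders as above):

```bash
# Log Docker in to your private ECR registry
aws ecr get-login-password --region [YOUR_REGION] \
| docker login --username AWS --password-stdin [AWS_ACCOUNT_ID].dkr.ecr.[YOUR_REGION].amazonaws.com

# Push the tagged image
docker push [AWS_ACCOUNT_ID].dkr.ecr.[YOUR_REGION].amazonaws.com/fireworks-sagemaker:latest
```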
Step 6: Deploy the SageMaker endpoint
You can deploy a multi-replica or sharded endpoint. Please refer to the scripts below for more details.
- Local environment setup: env_setup.sh
- Multi-replica deployment script: deploy_multi_gpu_replicated.py
- Sharded deployment script: deploy_multi_gpu_sharded.py
- Speculative decoding deployment script: deploy_spec_decode.py
Run the env_setup.sh script to set up your local environment, and add FIREWORKS_METERING_KEY to your environment before running the deployment scripts.

Step 7: Test the endpoint
Once deployed, you can test your SageMaker endpoint with the following script: test_endpoint.py

You should see successful responses for both completions and chat APIs.
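You can also poke the endpoint directly with the AWS CLI. The sketch below uses a placeholder endpoint name and assumes an OpenAI-style completions payload; adjust both to match your deployment and the request format used in test_endpoint.py:

```bash
# Send a simple completions request via SageMaker Runtime and write the response to a file
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name [ENDPOINT_NAME] \
  --region [YOUR_REGION] \
  --content-type application/json \
  --cli-binary-format raw-in-base64-out \
  --body '{"prompt": "Hello from SageMaker", "max_tokens": 32}' \
  response.json

# Inspect the model response
cat response.json
```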
Troubleshooting
Quota or capacity errors
- Symptom: Endpoint creation fails with quota or capacity messages
- Fix: Verify your p4, p5, or other GPU endpoint quotas. Request increases and retry.
ECR authentication failures
- Symptom: docker push fails with permission denied
- Fix: Re-run the ECR login and confirm the repository URI, region, and account ID
S3 access denied
- Symptom: Model fails to download during container startup
- Fix: Ensure the IAM role inline policy includes your bucket and the /* object path, and ensure that your s3_model_path points to the model.tar.gz file
Next steps
- Integrate your application with the SageMaker endpoint via your preferred SDK
- Reach out to your Fireworks AI contact for support with optimizing your deployment to your specific workload