Resource types

The SDK currently supports three types of resources: LLM, Dataset, and SupervisedFineTuningJob.

LLM

class LLM()

Properties:

  • deployment_name str - The full name of the deployment (e.g., accounts/my-account/deployments/my-custom-deployment)
  • deployment_display_name str - The display name of the deployment. Defaults to the filename where the LLM was instantiated unless otherwise specified
  • deployment_url str - The URL to view the deployment in the Fireworks dashboard
  • temperature float - The temperature for generation
  • model str - The model associated with this LLM (e.g., accounts/fireworks/models/llama-v3p2-3b-instruct)
  • base_deployment_name str - If this is a LoRA addon, the deployment name of the base model deployment
  • peft_base_model str - If this is a LoRA addon, the base model identifier (e.g., accounts/fireworks/models/llama-v3p2-3b-instruct)
  • addons_enabled bool - Whether LoRA addons are enabled for this LLM
  • model_id str - The identifier used under the hood to query this model (e.g., accounts/my-account/deployedModels/my-deployed-model-abcdefg)
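
These are plain attributes on an LLM instance. A minimal sketch of reading a few of them (deployment-related values may only be populated once a deployment exists for the LLM):

from fireworks.client import LLM

llm = LLM(
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
    deployment_type="auto"
)

print(llm.model)                    # accounts/fireworks/models/llama-v3p2-3b-instruct
print(llm.deployment_display_name)  # defaults to the instantiating filename
print(llm.deployment_url)           # link to the deployment in the Fireworks dashboard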

Instantiation

The LLM(*args, **kwargs) class constructor initializes a new LLM instance.

from fireworks.client import LLM
from datetime import timedelta

# Basic usage with required parameters
llm = LLM(
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
    deployment_type="auto"
)

# Advanced usage with optional parameters
llm = LLM(
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
    deployment_type="on-demand",
    deployment_name="my-custom-deployment",
    accelerator_type="NVIDIA_H100_80GB",
    min_replica_count=1,
    max_replica_count=3,
    scale_up_window=timedelta(seconds=30),
    scale_down_window=timedelta(minutes=10),
    enable_metrics=True
)

Required Arguments

  • model str - The model identifier to use (e.g., accounts/fireworks/models/llama-v3p2-3b-instruct)
  • deployment_type str - The type of deployment to use. Must be one of:
    • "serverless": Uses Fireworks’ shared serverless infrastructure
    • "on-demand": Uses dedicated resources for your deployment
    • "auto": Automatically selects the most cost-effective option (recommended for experimentation)

Optional Arguments

Deployment Configuration

  • deployment_name str, optional - Name to identify the deployment. If not provided, Fireworks will auto-generate one. If a deployment with the same name already exists, the SDK will try to reuse it.
  • deployment_display_name str, optional - Display name for the deployment. Defaults to the filename where the LLM was instantiated. If a deployment with the same display name and model already exists, the SDK will try to reuse it.
  • base_deployment_name str, optional - Base deployment name for LoRA addons. If not provided, will try to find a base model deployment that can be reused.

Authentication & API

  • api_key str, optional - Your Fireworks API key
  • base_url str, optional - Base URL for API calls. Defaults to “https://api.fireworks.ai/inference/v1”
  • max_retries int, optional - Maximum number of retry attempts. Defaults to 3
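
A minimal sketch of configuring authentication explicitly (FIREWORKS_API_KEY is the conventional environment variable name; adjust to your setup):

import os
from fireworks.client import LLM

llm = LLM(
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
    deployment_type="serverless",
    api_key=os.environ["FIREWORKS_API_KEY"],  # explicit key instead of ambient configuration
    max_retries=5  # retry transient API failures up to 5 times
)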

Scaling Configuration

  • scale_up_window timedelta, optional - Time to wait before scaling up after increased load. Defaults to 1 second
  • scale_down_window timedelta, optional - Time to wait before scaling down after decreased load. Defaults to 1 minute
  • scale_to_zero_window timedelta, optional - Time of inactivity before scaling to zero. Defaults to 5 minutes
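
A sketch combining the three windows (the values are illustrative, not recommendations):

from datetime import timedelta
from fireworks.client import LLM

llm = LLM(
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
    deployment_type="on-demand",
    scale_up_window=timedelta(seconds=5),        # react quickly to load spikes
    scale_down_window=timedelta(minutes=2),      # avoid flapping on brief lulls
    scale_to_zero_window=timedelta(minutes=10)   # release all replicas when idle
)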

Hardware & Performance

  • accelerator_type str, optional - Type of GPU accelerator to use
  • region str, optional - Region for deployment
  • min_replica_count int, optional - Minimum number of replicas
  • max_replica_count int, optional - Maximum number of replicas
  • replica_count int, optional - Fixed number of replicas
  • accelerator_count int, optional - Number of accelerators per replica
  • precision str, optional - Model precision (e.g., “FP16”, “FP8”)
  • max_batch_size int, optional - Maximum batch size for inference
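
A sketch pinning the deployment to a fixed hardware footprint (whether a given accelerator, precision, or region is available depends on your account):

from fireworks.client import LLM

llm = LLM(
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
    deployment_type="on-demand",
    accelerator_type="NVIDIA_H100_80GB",
    accelerator_count=2,   # GPUs per replica
    replica_count=2,       # fixed replica count instead of min/max autoscaling
    precision="FP8",
    max_batch_size=64
)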

Advanced Features

  • enable_addons bool, optional - Enable LoRA addons support
  • draft_token_count int, optional - Number of tokens to generate per step for speculative decoding
  • draft_model str, optional - Model to use for speculative decoding
  • ngram_speculation_length int, optional - Length of previous input sequence for N-gram speculation
  • long_prompt_optimized bool, optional - Optimize for long prompts
  • temperature float, optional - Sampling temperature for generation
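
A hedged sketch of enabling speculative decoding with a draft model (the model identifiers and pairing are illustrative; check which combinations your account supports):

from fireworks.client import LLM

llm = LLM(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    deployment_type="on-demand",
    draft_model="accounts/fireworks/models/llama-v3p2-3b-instruct",  # small model proposes tokens
    draft_token_count=4  # tokens proposed per step, then verified by the main model
)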

Monitoring & Metrics

  • enable_metrics bool, optional - Enable metrics collection. Currently supports time to last token for non-streaming requests.

Additional Configuration

  • description str, optional - Description of the deployment
  • cluster str, optional - Cluster identifier
  • enable_session_affinity bool, optional - Enable session affinity
  • direct_route_api_keys list[str], optional - List of API keys for direct routing
  • direct_route_type str, optional - Type of direct routing

create_supervised_fine_tuning_job()

Creates a new supervised fine-tuning job and blocks until the job is ready. See the SupervisedFineTuningJob section for details on the parameters.

Returns:

  • An instance of SupervisedFineTuningJob.

job = llm.create_supervised_fine_tuning_job(
    name="my-fine-tuning-job",
    dataset_or_id=dataset,
    epochs=3,
    learning_rate=1e-5
)

delete_deployment()

Deletes the deployment associated with this LLM instance if one exists.

Arguments:

  • ignore_checks bool, optional - Whether to ignore safety checks. Defaults to False.

llm.delete_deployment(ignore_checks=True)

get_time_to_last_token_mean()

Returns the mean time to last token for non-streaming requests. If no metrics are available, returns None.

Returns:

  • A float representing the mean time to last token, or None if no metrics are available.

time_to_last_token_mean = llm.get_time_to_last_token_mean()
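
Because the method returns None when no metrics are available (metrics collection must be enabled with enable_metrics=True), guard the result:

if time_to_last_token_mean is not None:
    print(f"Mean time to last token: {time_to_last_token_mean}")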

with_deployment_type()

Returns a new LLM instance with the specified deployment type.

Arguments:

  • deployment_type str - The deployment type to use (“serverless”, “on-demand”, or “auto”)

Returns:

  • A new LLM instance with the specified deployment type

# Create a new LLM with different deployment type
serverless_llm = llm.with_deployment_type("serverless")
on_demand_llm = llm.with_deployment_type("on-demand")

with_temperature()

Returns a new LLM instance with the specified temperature.

Arguments:

  • temperature float - The temperature for generation

Returns:

  • A new LLM instance with the specified temperature

# Create a new LLM with different temperature
creative_llm = llm.with_temperature(1.0)
deterministic_llm = llm.with_temperature(0.0)

chat.completions.create() and chat.completions.acreate()

Creates a chat completion using the LLM. These methods are OpenAI-compatible and follow the interface described in the OpenAI Chat Completions API. Use create() for synchronous calls and acreate() for asynchronous calls.

Note: The Fireworks chat completions API includes additional request and response fields beyond the standard OpenAI API. See the Fireworks Chat Completions API reference for the complete set of available parameters and response fields.

Arguments:

  • messages list - A list of messages comprising the conversation so far
  • stream bool, optional - Whether to stream the response. Defaults to False
  • response_format dict, optional - An object specifying the format that the model must output
  • reasoning_effort str, optional - How much effort the model should put into reasoning
  • max_tokens int, optional - The maximum number of tokens to generate
  • temperature float, optional - Sampling temperature between 0 and 2. If not provided, uses the LLM’s default temperature. Note that temperature can also be set once during LLM instantiation if preferred
  • tools list, optional - A list of tools the model may call
  • extra_headers dict, optional - Additional headers to include in the request
  • **kwargs - Additional parameters supported by the OpenAI API

Returns:

  • ChatCompletion when stream=False (default)
  • Generator[ChatCompletionChunk, None, None] when stream=True (sync version)
  • AsyncGenerator[ChatCompletionChunk, None] when stream=True (async version)

For details on the ChatCompletion object structure, see the OpenAI Chat Completion Object documentation. For the ChatCompletionChunk object structure used in streaming, see the OpenAI Chat Streaming documentation.

import asyncio
from fireworks.client import LLM

llm = LLM(
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
    deployment_type="auto"
)

# Synchronous usage
response = llm.chat.completions.create(
    messages=[
        {"role": "user", "content": "Hello, world!"}
    ]
)
print(response.choices[0].message.content)

# Synchronous streaming
for chunk in llm.chat.completions.create(
    messages=[
        {"role": "user", "content": "Tell me a story"}
    ],
    stream=True
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Asynchronous usage
async def main():
    response = await llm.chat.completions.acreate(
        messages=[
            {"role": "user", "content": "Hello, world!"}
        ]
    )
    print(response.choices[0].message.content)

    # Async streaming
    async for chunk in await llm.chat.completions.acreate(
        messages=[
            {"role": "user", "content": "Tell me a story"}
        ],
        stream=True
    ):
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")

asyncio.run(main())
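
The response_format argument enables structured output. A sketch of JSON mode (Fireworks also supports schema-constrained variants; see the Fireworks Chat Completions API reference):

response = llm.chat.completions.create(
    messages=[
        {"role": "user", "content": "List three primary colors as a JSON array."}
    ],
    response_format={"type": "json_object"}
)
print(response.choices[0].message.content)  # JSON text produced by the model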

Dataset

The Dataset class provides a convenient way to manage datasets for fine-tuning on Fireworks. It offers smart features like automatic naming and uploading of datasets. You do not instantiate a Dataset directly; instead, create one using the class methods below.

Properties:

  • name str - The name of the dataset

from_list()

@classmethod
from_list(data: list)

Creates a Dataset from a list of training examples. Each example should be compatible with OpenAI’s chat completion format.

from fireworks.client import Dataset

# Create dataset from a list of examples
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "Paris."}
        ]
    }
]
dataset = Dataset.from_list(examples)

from_file()

@classmethod
from_file(path: str)

Creates a Dataset from a local JSONL file. The file should contain training examples in OpenAI’s chat completion format.

from fireworks.client import Dataset

# Create dataset from a JSONL file
dataset = Dataset.from_file("path/to/training_data.jsonl")

from_string()

@classmethod
from_string(data: str)

Creates a Dataset from a string containing JSONL-formatted training examples.

from fireworks.client import Dataset

# Create dataset from a JSONL string
jsonl_data = """
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi there!"}]}
{"messages": [{"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "2"}]}
"""
dataset = Dataset.from_string(jsonl_data)

sync()

Uploads the dataset to Fireworks if it doesn’t already exist. This method automatically:

  1. Checks if a dataset with the same content hash already exists
  2. If it exists, skips the upload to avoid duplicates
  3. If it doesn’t exist, creates and uploads the dataset to Fireworks
  4. Validates the dataset after upload

from fireworks.client import Dataset

# Create dataset and sync it to Fireworks
dataset = Dataset.from_file("path/to/training_data.jsonl")
dataset.sync()

# The dataset is now available on Fireworks and ready for fine-tuning

delete()

Deletes the dataset from Fireworks.

dataset = Dataset.from_file("path/to/training_data.jsonl")

dataset.delete()

Data Format

The Dataset class expects data in OpenAI’s chat completion format. Each training example should be a JSON object with a messages array containing message objects. Each message object should have:

  • role: One of "system", "user", or "assistant"
  • content: The message content as a string

Example format:

{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."}
    ]
}

SupervisedFineTuningJob

The SupervisedFineTuningJob class manages fine-tuning jobs on Fireworks. It provides a convenient interface for creating, monitoring, and managing fine-tuning jobs.

class SupervisedFineTuningJob()

Properties:

  • output_model str - The identifier of the output model (e.g., accounts/my-account/models/my-finetuned-model)
  • output_llm LLM - An LLM instance associated with the output model

Instantiation

You do not instantiate a SupervisedFineTuningJob directly. Instead, use the .create_supervised_fine_tuning_job() method on the LLM object and pass the following required and optional arguments.

Required Arguments

  • name str - A unique name for the fine-tuning job
  • llm LLM - The LLM instance to fine-tune (supplied automatically when you call create_supervised_fine_tuning_job() on an LLM)
  • dataset_or_id Union[Dataset, str] - The dataset to use for fine-tuning, either as a Dataset object or dataset ID

Optional Arguments

Training Configuration

  • epochs int, optional - Number of training epochs
  • learning_rate float, optional - Learning rate for training
  • lora_rank int, optional - Rank for LoRA fine-tuning
  • jinja_template str, optional - Template for formatting training examples
  • early_stop bool, optional - Whether to enable early stopping
  • max_context_length int, optional - Maximum context length for the model
  • base_model_weight_precision str, optional - Precision for base model weights
  • batch_size int, optional - Batch size for training

Hardware Configuration

  • accelerator_type str, optional - Type of GPU accelerator to use
  • accelerator_count int, optional - Number of accelerators to use
  • is_turbo bool, optional - Whether to use turbo mode for faster training
  • region str, optional - Region for deployment
  • nodes int, optional - Number of nodes to use

Evaluation & Monitoring

  • evaluation_dataset str, optional - Dataset ID to use for evaluation
  • eval_auto_carveout bool, optional - Whether to automatically carve out evaluation data
  • wandb_config WandbConfig, optional - Configuration for Weights & Biases integration

Job Management

  • id str, optional - Job ID (auto-generated if not provided)
  • api_key str, optional - API key for authentication
  • state JobState, optional - Current state of the job
  • create_time datetime, optional - Time when the job was created
  • update_time datetime, optional - Time when the job was last updated
  • created_by str, optional - User who created the job
  • output_model str, optional - ID of the output model
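
Putting it together, a sketch of launching a job with a few training options (the hyperparameter values are illustrative):

from fireworks.client import LLM, Dataset

llm = LLM(
    model="accounts/fireworks/models/llama-v3p2-3b-instruct",
    deployment_type="auto"
)
dataset = Dataset.from_file("path/to/training_data.jsonl")

job = llm.create_supervised_fine_tuning_job(
    name="my-fine-tuning-job",
    dataset_or_id=dataset,
    epochs=3,
    learning_rate=1e-5,
    lora_rank=8,
    early_stop=True
)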

wait_for_completion()

Polls the job status until it is complete and returns the job object.

job = job.wait_for_completion()
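
Once the job completes, the fine-tuned model can be queried through the job's properties. A sketch (serving the output model may require an appropriate deployment):

print(job.output_model)  # e.g., accounts/my-account/models/my-finetuned-model

response = job.output_llm.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)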

delete()

Deletes the job.

job.delete()