
TL;DR

If you launch training through a cookbook recipe (rl_loop, sft_loop, dpo_loop, orpo_loop, igpo_loop), you don’t have to call any checkpoint APIs yourself. Set two config fields and the recipe handles save, resume, and promote:
  • dcp_save_interval=N — save resumable checkpoints every N steps
  • output_model_id="my-model" — promote the final checkpoint to a deployable Fireworks model
Rerunning with the same log_path resumes from the last saved checkpoint automatically.
from training.recipes.sft_loop import Config, main
from training.utils import InfraConfig, WeightSyncConfig

cfg = Config(
    log_path="./my_training",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="data.jsonl",
    tokenizer_model="Qwen/Qwen3-8B",
    output_model_id="qwen3-8b-finetuned",
    weight_sync=WeightSyncConfig(dcp_save_interval=10),
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ),
)
main(cfg)

# Interrupted? Run again with the same config — it picks up automatically.
main(cfg)
That’s the full surface most users need. The rest of this page covers config knobs, manual promotion via the CLI, and (under Advanced internals) what the recipe is doing under the hood.
dcp_save_interval defaults to 0 (off). Without setting it to a positive value, training cannot be resumed from intermediate steps.

Config fields

| Field | Type | Default | Description |
|---|---|---|---|
| log_path | str | (required) | Directory for the recipe's local bookkeeping (dataloader.json) and logs |
| weight_sync.dcp_save_interval | int | 0 | Save a resumable (DCP) checkpoint every N steps. 0 = off. |
| output_model_id | str \| None | None | If set, promote the final checkpoint to this Fireworks model ID at the end of training |
| init_from_checkpoint | str \| None | None | Load weights from another job ("job-id:checkpoint-name"). Step counter resets to 0. |

Resume

Automatic (same log_path)

Just rerun with the same log_path and the recipe resumes. It queries the control plane for the newest resumable checkpoint on the trainer job and reloads weights and optimizer state. The step counter and the cookbook’s data_consumed counter are restored from dataloader.json in log_path.
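In outline, the startup check looks like the sketch below. This is illustrative, not the recipe's actual code: client stands for a FireworksClient instance, and the attribute names on the returned checkpoint rows (resumable, create_time) are assumptions. See Saving and Loading for the real schema.

# Sketch of the automatic-resume check. Attribute names on the checkpoint
# rows (resumable, create_time) are assumptions, not the documented schema.
ckpts = client.list_checkpoints(trainer_job_id)
resumable = [c for c in ckpts if c.resumable]
if resumable:
    latest = max(resumable, key=lambda c: c.create_time)
    # reload weights + optimizer state from `latest`, then restore the
    # step and data_consumed counters from {log_path}/dataloader.json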

From another job

config = Config(
    log_path="./new_run",
    init_from_checkpoint="i44pvd4syzg8hjfk:step-4",  # job_id:checkpoint_name
    ...
)
Loads weights from the specified job and resets the step counter to 0. This is mutually exclusive with automatic resume.

Promoting a checkpoint manually

If you want to promote an arbitrary checkpoint after training (not just the final one), use the cookbook’s promote script:
export FIREWORKS_API_KEY=...

python promote_checkpoint.py \
    --job-id <trainer-job-id> \
    --output-model-id my-fine-tuned-model \
    --base-model accounts/fireworks/models/qwen3-8b
By default the script promotes the newest promotable checkpoint on the job. Pass --checkpoint-name <name> to promote a specific one. You can also call the API directly — see Saving and Loading — Promoting.
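If you call the API directly, the relevant SDK method is promote_checkpoint (it also appears under Advanced internals below). A rough sketch follows, with the caveat that the keyword argument names are assumptions based on the inputs listed above (checkpoint name, source job ID, base model); see Saving and Loading for the real signature:

# Rough sketch of a direct promotion call. Keyword names are assumptions;
# promotion needs the checkpoint name, source job ID, and base model.
client.promote_checkpoint(
    checkpoint_name="step-400",
    source_job_id="<trainer-job-id>",
    output_model_id="my-fine-tuned-model",
    base_model="accounts/fireworks/models/qwen3-8b",
)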

Advanced internals

Most users can stop reading here. The sections below cover what the recipe does internally — useful only if you’re forking a recipe, calling the SDK directly, or debugging a checkpoint that doesn’t promote. The full SDK-level reference lives in Saving and Loading.

What gets saved, where

The recipe interacts with two surfaces:
| Surface | Owns | Source of truth for |
|---|---|---|
| Control plane (FireworksClient.list_checkpoints(job_id)) | All remote checkpoint blobs (DCP and sampler) | What checkpoints exist, their type, and whether each is promotable |
| {log_path}/dataloader.json | Local file | The cookbook's data_consumed counter per checkpoint name (no server-side representation) |
There is no checkpoints.jsonl registry — the control plane is queried at resume / promote time.
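To make the local side concrete, here is one plausible way to read the counter file. The schema shown in the comment is an assumption; this page only specifies that dataloader.json maps checkpoint names to data_consumed counts.

import json
import os

# Illustrative read of the cookbook's local counter file. The exact schema
# is an assumption; the page only says it stores data_consumed per
# checkpoint name, e.g. {"step-10": {"data_consumed": 1280}}.
log_path = "./my_training"  # matches the TL;DR example
with open(os.path.join(log_path, "dataloader.json")) as f:
    data_consumed = json.load(f)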

Two axes: resumable and promotable

When the recipe saves a checkpoint, it chooses along two independent axes:
| Axis | What it writes | Resumes? | Promotes to a model? |
|---|---|---|---|
| resumable=True | DCP (weights + optimizer) | Yes | No |
| promotable=True | Sampler weights (HF format) | No | Yes |
| Both | DCP + sampler | Yes | Yes |
Periodic saves use resumable=True only. The final save uses both. For LoRA RL runs, WeightSyncer.save_and_hotload already produces a promotable row each step, so the recipe’s final promotion picks that up without an extra sampler write.

Forking a recipe

If you fork rl_loop.py (or another ported recipe) and need to drive checkpointing yourself, instantiate TrainingCheckpoints:
from training.utils.checkpoints import TrainingCheckpoints

ckpt = TrainingCheckpoints(
    policy,           # ReconnectableClient
    rlor_mgr,         # TrainerJobManager (control-plane client)
    trainer_id=policy_job_id,
    log_path=cfg.log_path,
    lora_rank=cfg.lora_rank,
)

# Resume on startup
resume_info = ckpt.resume(
    init_from_checkpoint=cfg.init_from_checkpoint,
    warm_start_from_adapter=cfg.warm_start_from_adapter,
)
step_offset = resume_info.step if resume_info else 0

# Periodic save
ckpt.save(f"step-{step}", resumable=True, promotable=False, data_consumed=count)

# Final save + promote
ckpt.save(f"step-{step}", resumable=True, promotable=True, data_consumed=count)
if cfg.output_model_id:
    ckpt.promote_latest(cfg.output_model_id, cfg.base_model)
The class is intentionally thin — it forwards save_state / save_weights_for_sampler_ext / promote_checkpoint to the SDK and uses the control plane as the source of truth for resume and promotion. The full API surface those calls expose is documented in Saving and Loading.

Checkpoint kinds

This subsection is the canonical reference for checkpoint kinds and promotability; other pages link here. Three separate layers of the stack each have their own notion of checkpoint "type". These types are not synonyms, and confusing them is the usual reason a promotion fails:
| Layer | Where | Values | What it controls |
|---|---|---|---|
| Cookbook | TrainingCheckpoints.save(resumable=, promotable=) | two booleans | Which of DCP / sampler blob (or both) gets saved |
| SDK | save_weights_for_sampler_ext(checkpoint_type=...) | "base", "delta" | Whether the sampler blob is full weights or an arc_v2 delta over the previous base (LoRA ignores this; the full adapter is always saved) |
| Server | checkpointType on each control-plane row | TRAINING, TRAINING_LORA, INFERENCE_BASE, INFERENCE_LORA, INFERENCE_ARC_V2 | Detected from blob contents. The first two are resumable; INFERENCE_BASE and INFERENCE_LORA are promotable; INFERENCE_ARC_V2 (delta on full-param) is not. |
When the cookbook saves with promotable=True, it always calls the SDK with checkpoint_type="base", which the server detects as INFERENCE_BASE (full-param) or INFERENCE_LORA (LoRA). Both are promotable. The non-promotable INFERENCE_ARC_V2 only happens if you bypass the cookbook and call save_weights_for_sampler_ext("delta") on a full-parameter run.
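When a promotion fails, the quickest diagnostic is to list the job's checkpoints and check each row's server-side type. A sketch, assuming the row attributes mirror the field names used on this page (checkpoint_type, promotable); the real attribute names live in Saving and Loading:

# Sketch: inspect server-side checkpoint rows before promoting.
# Attribute names are assumptions based on the fields described above.
for ckpt in client.list_checkpoints(trainer_job_id):
    print(ckpt.name, ckpt.checkpoint_type, ckpt.promotable)
# A row typed INFERENCE_ARC_V2 will show promotable=False; promote a
# checkpoint saved with checkpoint_type="base" instead.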

Promotability cheat sheet

“Promotable” means the server will accept the blob for promotion — i.e. the checkpoint shows promotable=True in list_checkpoints. To actually promote, you need the checkpoint name plus source_job_id and base_model.
| How it was saved | LoRA promotable | Full-param promotable |
|---|---|---|
| TrainingCheckpoints.save(resumable=True, promotable=False) | No (DCP only) | No (DCP only) |
| TrainingCheckpoints.save(promotable=True) | Yes | Yes |
| save_weights_for_sampler_ext(checkpoint_type="base") | Yes | Yes |
| save_weights_for_sampler_ext(checkpoint_type="delta") | Yes (server always stores full adapter) | No |
| WeightSyncer.save_and_hotload() (first save) | Yes | Yes |
| WeightSyncer.save_and_hotload() (later saves) | Yes | No |
For SDK-level details on each row (full method signatures, base-vs-delta semantics, weight-sync lifecycle), see Saving and Loading.