## What this is
This guide walks through GRPO (Group Relative Policy Optimization) training using the cookbook’s rl_loop recipe. GRPO samples multiple completions per prompt, scores them with a reward function, and uses group reward statistics for policy gradient updates.
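The "group relative" part is per-prompt reward normalization: each completion's advantage is its reward measured against the statistics of its own sampling group. A minimal sketch of that idea (illustrative only, not the cookbook's code; the recipe computes advantages inside its loss helpers):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize one prompt's rewards against its own sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, scored by a 0/1 correctness reward:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# ≈ [1.0, -1.0, -1.0, 1.0]: correct completions get positive advantage.
```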
## Architecture
The RL recipe always uses a policy trainer plus an inference deployment. Add a reference trainer when your setup needs reference logprobs:
| Component | Role |
|---|---|
| Policy trainer | Trainable model — runs `forward_backward_custom` + `optim_step` |
| Reference trainer | Optional frozen copy — provides KL/reference logprobs (`--forward-only`) when `infra.ref_training_shape_id` is set |
| Deployment | Samples completions via `DeploymentSampler` (client-side tokenized) |
## Using the recipe

The simplest way to run GRPO is via the cookbook's `Config` + `main`:
```python
from training.recipes.rl_loop import Config, main
from training.utils import DeployConfig, InfraConfig, WeightSyncConfig, WandBConfig

cfg = Config(
    log_path="./grpo_logs",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="/path/to/gsm8k.jsonl",
    max_rows=200,
    epochs=1,
    completions_per_prompt=4,
    max_completion_tokens=1024,
    temperature=1.0,
    max_seq_len=4096,
    policy_loss="grpo",  # or "importance_sampling", "dapo", "dro", "gspo", "cispo"
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
        ref_training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200-forward",
    ),
    deployment=DeployConfig(
        deployment_id="grpo-serving",
        tokenizer_model="Qwen/Qwen3-8B",
    ),
    weight_sync=WeightSyncConfig(weight_sync_interval=1),
    wandb=WandBConfig(entity="my-team", project="grpo-experiment"),
)
main(cfg)
```
The recipe handles resource provisioning, rollout scheduling, reference logprobs, checkpointing, and cleanup automatically.
## Policy loss variants
| `policy_loss` | Description |
|---|---|
| `"grpo"` | REINFORCE + KL penalty (default) |
| `"importance_sampling"` | Off-policy ratio weighting with optional clipping |
| `"reinforce"` | Vanilla REINFORCE |
| `"dapo"` | Dynamic advantage with asymmetric PPO clipping |
| `"dro"` | Distributionally robust off-policy objective |
| `"gspo"` | Sequence-level clipped PPO |
| `"cispo"` | Clipped importance sampling policy optimization |
## Step-by-step (API-level)
For teams that need full control beyond what the recipe provides, here is the API-level flow.
### Provision resources with `setup_infra`

`training.utils.rl.setup_infra` is the cookbook's single entrypoint for shape resolution, trainer/deployment provisioning, weight sync wiring, and trainer/deployment re-attach. It requests the policy trainer first, links the deployment, then waits for readiness in parallel. Recipes pass a config + two booleans (`needs_reference`, `needs_inference`) and get back an `Infra` bundle of wired trainer clients. Teams that fork `training/recipes/rl_loop.py` should reuse `setup_infra` rather than re-wiring the lower-level helpers below.
```python
import os

import transformers
from fireworks.training.sdk import (
    TrainerJobManager, DeploymentManager, DeploymentSampler, WeightSyncer,
    AdaptiveConcurrencyController,
)
from training.utils import (
    InfraConfig, DeployConfig, ResourceCleanup, WeightSyncScope,
)
from training.utils.rl import setup_infra

api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")
rlor_mgr = TrainerJobManager(api_key=api_key, base_url=base_url)
deploy_mgr = DeploymentManager(api_key=api_key, base_url=base_url)

base_model = "accounts/fireworks/models/qwen3-8b"
infra_cfg = InfraConfig(
    training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ref_training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200-forward",
)
deploy_cfg = DeployConfig(
    deployment_id="grpo-serving",
    tokenizer_model="Qwen/Qwen3-8B",
    weight_sync_scope=WeightSyncScope.PER_TRAINER,  # default
)

with ResourceCleanup(rlor_mgr, deploy_mgr) as cleanup:
    infra = setup_infra(
        rlor_mgr=rlor_mgr,
        deploy_mgr=deploy_mgr,
        base_model=base_model,
        infra_cfg=infra_cfg,
        deploy_cfg=deploy_cfg,
        lora_rank=0,
        needs_reference=True,   # KL baseline
        needs_inference=True,   # rollouts
        role_prefix="grpo",
        api_key=api_key,
        cleanup=cleanup,  # scope-exit: cancel trainers, scale deployment to 0
    )
    policy = infra.policy        # ReconnectableClient (policy trainer)
    reference = infra.reference  # ReconnectableClient (forward-only) or LoRA shared handle
    inference_model = infra.inference_model

    tokenizer = transformers.AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
    sampler = DeploymentSampler(
        inference_url=deploy_mgr.inference_url,
        model=inference_model,
        api_key=api_key,
        tokenizer=tokenizer,
        concurrency_controller=AdaptiveConcurrencyController(initial_window=16),
    )
    # The training loop in the next section runs inside this cleanup scope.
```
See Weight sync for `WeightSyncScope.PER_TRAINER` (default) vs `PER_DEPLOYMENT`. For the full `setup_infra` contract, lower-level building blocks, and implementation rationale, see the cookbook's dev skill: `skills/dev/`.
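If you need the non-default scope, it is a one-field change on the deploy config (a sketch reusing the config above; see the Weight sync page for what each scope implies):

```python
deploy_cfg = DeployConfig(
    deployment_id="grpo-serving",
    tokenizer_model="Qwen/Qwen3-8B",
    weight_sync_scope=WeightSyncScope.PER_DEPLOYMENT,  # non-default scope; semantics on the Weight sync page
)
```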
### Training loop
```python
import asyncio

import tinker  # provides AdamParams for optim_step below

tracker = WeightSyncer(
    policy_client=policy.inner,
    deploy_mgr=deploy_mgr,
    deployment_id="grpo-serving",
    base_model=base_model,
    hotload_timeout=600,
    first_checkpoint_type="base",
)

# dataset: an iterable of {"messages": [...]} rows, e.g. loaded from gsm8k.jsonl
for step, row in enumerate(dataset):
    # Drop any gold assistant turns; the policy generates its own completions.
    input_messages = [m for m in row["messages"] if m.get("role") != "assistant"]
    completions = asyncio.run(
        sampler.sample_with_tokens(messages=input_messages, n=4, max_tokens=512)
    )
    rewards = [score(c) for c in completions]  # score() is your reward function
    if len(set(rewards)) == 1:
        continue  # uniform rewards carry no learning signal; skip the prompt

    datums = build_grpo_datums(completions)
    # Reference logprobs for the KL penalty, from the forward-only trainer.
    ref_fwd = reference.forward(datums, "cross_entropy")
    ref_logprobs = [list(x["logprobs"].data) for x in ref_fwd.loss_fn_outputs]

    loss_fn = make_grpo_loss_fn(rewards, ref_logprobs, kl_beta=0.001)
    policy.forward_backward_custom(datums, loss_fn)
    policy.optim_step(
        tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
    )
    tracker.save_and_hotload(f"step-{step:05d}")
```
See Loss Functions for `make_grpo_loss_fn` and `build_grpo_datums` implementations.
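For intuition, the math inside such a loss reduces to a REINFORCE term scaled by group-relative advantages plus a per-token KL penalty. A schematic sketch in plain Python over logprob lists (not the cookbook's actual signature or tensor code):

```python
def grpo_loss_sketch(
    policy_logprobs: list[list[float]],  # per-token logprobs from the policy, one list per completion
    ref_logprobs: list[list[float]],     # aligned per-token logprobs from the reference model
    rewards: list[float],
    kl_beta: float = 0.001,
) -> float:
    # Group-relative advantages, as in the sketch near the top of this page.
    mean_r = sum(rewards) / len(rewards)
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5 + 1e-6
    advantages = [(r - mean_r) / std_r for r in rewards]

    loss = 0.0
    for logps, ref_logps, adv in zip(policy_logprobs, ref_logprobs, advantages):
        for lp, ref_lp in zip(logps, ref_logps):
            loss -= adv * lp                 # REINFORCE term: reinforce high-advantage tokens
            loss += kl_beta * (lp - ref_lp)  # sampled estimate of KL(policy || reference)
    return loss
```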
## Pipeline overlap
Sampling and training overlap within policy windows controlled by weight_sync_interval. All prompts in a window sample concurrently; results train as they arrive. At window boundaries the pipeline drains, weights sync to the deployment, and the next window samples against the updated weights.
| `weight_sync_interval` | Behavior |
|---|---|
| 1 (default) | No overlap — sample, train, sync, repeat |
| N > 1 | N-step windows with overlap inside, sync at boundaries |
| 0 | No syncs — the deployment keeps the base weights for the entire run. Useful for debugging or ablations, not standard RL training. |
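Conceptually the window schedule looks like the sketch below; `sample_and_score`, `train_step`, and `sync_weights` stand in for the recipe's internals and are hypothetical names, not cookbook APIs:

```python
import asyncio
from itertools import islice

def chunked(items, size):
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

async def run_windows(prompts, window_size, sample_and_score, train_step, sync_weights):
    for window in chunked(prompts, window_size):
        # Every prompt in the window starts sampling immediately...
        tasks = [asyncio.create_task(sample_and_score(p)) for p in window]
        # ...and each finished rollout trains as soon as it arrives (the overlap).
        for done in asyncio.as_completed(tasks):
            train_step(await done)
        # Window boundary: all tasks above have drained; sync before the next window.
        sync_weights()
```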
## Operational guidance
- `deployment.tokenizer_model` is required — the API raises `ValueError` if not set.
- Set `infra.training_shape_id` — training shapes are the launch path for cookbook trainers.
- Set `infra.ref_training_shape_id` when you want a reference trainer — if it is unset, the recipe skips reference-model provisioning entirely.
- Skip prompts with uniform rewards (all correct or all wrong) — they provide no learning signal.
- Track reward distributions and KL every step to catch objective drift early.
- When configured, the reference trainer uses `--forward-only` — never call `optim_step` on it.
- Sampling is async under the hood: `DeploymentSampler.sample_with_tokens()` issues `n` concurrent `n=1` requests, so synchronous scripts should wrap it with `asyncio.run(...)`.
- DCP checkpoints are disabled by default (`dcp_save_interval=0`). If you need to resume training from a checkpoint, explicitly set `dcp_save_interval` to a positive value in your `WeightSyncConfig`, as in the sketch below.
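For example, a config that syncs weights every 4 steps and writes a resumable checkpoint every 50 (both field names appear on this page; the values are arbitrary):

```python
weight_sync = WeightSyncConfig(
    weight_sync_interval=4,  # sync policy weights to the deployment every 4 steps
    dcp_save_interval=50,    # write a resumable DCP checkpoint every 50 steps
)
```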
## Common pitfalls
- Reward normalization bugs can destabilize GRPO updates quickly — verify advantage computation.
- Reference/policy tokenizer mismatch invalidates KL estimates — always use the same `base_model`.
- Logprob alignment: the trainer returns N-1 logprobs for N tokens. Inference returns N logprobs where the first is `None`. Use `inference[1:]` to align.
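The alignment rule from the last pitfall, in code (`trainer_logprobs` and `inference_logprobs` are illustrative names):

```python
def align_logprobs(trainer_logprobs: list[float], inference_logprobs: list[float | None]) -> list[float]:
    # For a sequence of N tokens, the trainer returns N-1 logprobs (nothing for
    # the first token), while inference returns N logprobs with a leading None.
    assert inference_logprobs[0] is None
    aligned = inference_logprobs[1:]  # drop the leading None to match the trainer
    assert len(aligned) == len(trainer_logprobs)
    return aligned
```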
This page covers `rl_loop` (GRPO and its `policy_loss` variants — `dapo`, `gspo`, `cispo`, `dro`, `importance_sampling`, `reinforce`). One sibling RL recipe ships in the cookbook alongside it:

- IGPO (`training.recipes.igpo_loop`) — Information Gain Policy Optimization for multi-turn agent trajectories. Adds turn-level IG rewards on top of the GRPO machinery; the same `policy_loss` variants apply.

For runnable examples and rationale, see the recipe sources directly in the public cookbook repo. Implementation depth (RL internals, weight-sync state machine, hotload triage) lives in the `skills/dev/` skill.
## Async RL (experimental)

`training.recipes.async_rl_loop` overlaps rollout and training so sampling and gradient steps run on separate workers concurrently — the trainer no longer waits for a full batch of rollouts before stepping. You write a single async function:
```python
async def rollout_fn(sample_prompt: dict) -> RolloutSample | None:
    ...
```
The recipe handles everything else: the outer loop, batching, weight sync, off-policy staleness bounds, and the rollout/train scheduler. There is no backward-compatibility guarantee — config fields and the rollout protocol may change between releases. For the full `rollout_fn` / `RolloutSample` contract, scheduler details, and tuning guidance, see the cookbook's `skills/dev/references/rl/async-rl.md` skill.
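As a shape-only illustration, a rollout function might look like the sketch below. The `RolloutSample` constructor arguments and the `sampler`/`score` helpers are assumptions here, not the documented contract; consult the skill referenced above for the real one:

```python
async def rollout_fn(sample_prompt: dict) -> RolloutSample | None:
    # Hypothetical sketch: one sampled completion, one scalar reward.
    completions = await sampler.sample_with_tokens(
        messages=sample_prompt["messages"], n=1, max_tokens=512
    )
    if not completions:
        return None  # a None return drops this prompt (assumed scheduler behavior)
    completion = completions[0]
    # RolloutSample fields below are assumptions, not the documented contract.
    return RolloutSample(completion=completion, reward=score(completion))
```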