Cookbook: Reinforcement Learning

What this is

This guide walks through GRPO (Group Relative Policy Optimization) training using the cookbook’s rl_loop recipe. GRPO samples multiple completions per prompt, scores them with a reward function, and uses group reward statistics for policy gradient updates.

Architecture

The RL recipe always uses a policy trainer plus an inference deployment. Add a reference trainer when your setup needs reference logprobs:

Component	Role
Policy trainer	Trainable model — runs `forward_backward_custom` + `optim_step`
Reference trainer	Optional frozen copy — provides KL/reference logprobs (`--forward-only`) when `infra.ref_training_shape_id` is set
Deployment	Sampling completions via `DeploymentSampler` (client-side tokenized)

Using the recipe

The simplest way to run GRPO is via the cookbook’s Config + main:

from training.recipes.rl_loop import Config, main
from training.utils import DeployConfig, InfraConfig, WeightSyncConfig, WandBConfig

cfg = Config(
    log_path="./grpo_logs",
    base_model="accounts/fireworks/models/qwen3-8b",
    dataset="/path/to/gsm8k.jsonl",
    max_rows=200,
    epochs=1,
    completions_per_prompt=4,
    max_completion_tokens=1024,
    temperature=1.0,
    max_seq_len=4096,
    policy_loss="grpo",  # or "importance_sampling", "dapo", "dro", "gspo", "cispo"
    infra=InfraConfig(
        training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
        ref_training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200-forward",
    ),
    deployment=DeployConfig(
        deployment_id="grpo-serving",
        tokenizer_model="Qwen/Qwen3-8B",
    ),
    weight_sync=WeightSyncConfig(weight_sync_interval=1),
    wandb=WandBConfig(entity="my-team", project="grpo-experiment"),
)

main(cfg)

The recipe handles resource provisioning, rollout scheduling, reference logprobs, checkpointing, and cleanup automatically.

Policy loss variants

`policy_loss`	Description
`"grpo"`	REINFORCE + KL penalty (default)
`"importance_sampling"`	Off-policy ratio weighting with optional clipping
`"reinforce"`	Vanilla REINFORCE
`"dapo"`	Dynamic advantage with asymmetric PPO clipping
`"dro"`	Distributionally robust off-policy objective
`"gspo"`	Sequence-level clipped PPO
`"cispo"`	Clipped importance sampling policy optimization

Step-by-step (API-level)

For teams that need full control beyond what the recipe provides, here is the API-level flow.

Provision resources with `setup_infra`

training.utils.rl.setup_infra is the cookbook’s single entrypoint for shape resolution, trainer/deployment provisioning, weight sync wiring, and trainer/deployment re-attach. It requests the policy trainer first, links the deployment, then waits for readiness in parallel. Recipes pass a config + two booleans (needs_reference, needs_inference) and get back an Infra bundle of wired trainer clients. Teams that fork training/recipes/rl_loop.py should reuse setup_infra rather than re-wiring the lower-level helpers below.

import os
import transformers
from fireworks.training.sdk import (
    TrainerJobManager, DeploymentManager, DeploymentSampler, WeightSyncer,
    AdaptiveConcurrencyController,
)
from training.utils import (
    InfraConfig, DeployConfig, ResourceCleanup, WeightSyncScope,
)
from training.utils.rl import setup_infra

api_key = os.environ["FIREWORKS_API_KEY"]
base_url = os.environ.get("FIREWORKS_BASE_URL", "https://api.fireworks.ai")

rlor_mgr = TrainerJobManager(api_key=api_key, base_url=base_url)
deploy_mgr = DeploymentManager(api_key=api_key, base_url=base_url)

base_model = "accounts/fireworks/models/qwen3-8b"
infra_cfg = InfraConfig(
    training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200",
    ref_training_shape_id="accounts/fireworks/trainingShapes/qwen3-8b-128k-h200-forward",
)
deploy_cfg = DeployConfig(
    deployment_id="grpo-serving",
    tokenizer_model="Qwen/Qwen3-8B",
    weight_sync_scope=WeightSyncScope.PER_TRAINER,  # default
)

with ResourceCleanup(rlor_mgr, deploy_mgr) as cleanup:
    infra = setup_infra(
        rlor_mgr=rlor_mgr,
        deploy_mgr=deploy_mgr,
        base_model=base_model,
        infra_cfg=infra_cfg,
        deploy_cfg=deploy_cfg,
        lora_rank=0,
        needs_reference=True,   # KL baseline
        needs_inference=True,   # rollouts
        role_prefix="grpo",
        api_key=api_key,
        cleanup=cleanup,        # scope-exit: cancel trainers, scale deployment to 0
    )

policy = infra.policy          # ReconnectableClient (policy trainer)
reference = infra.reference    # ReconnectableClient (forward-only) or LoRA shared handle
inference_model = infra.inference_model

tokenizer = transformers.AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
sampler = DeploymentSampler(
    inference_url=deploy_mgr.inference_url,
    model=inference_model,
    api_key=api_key,
    tokenizer=tokenizer,
    concurrency_controller=AdaptiveConcurrencyController(initial_window=16),
)

See Weight sync for WeightSyncScope.PER_TRAINER (default) vs PER_DEPLOYMENT. For the full setup_infra contract, lower-level building blocks, and implementation rationale, see the cookbook’s dev skill: skills/dev/.

Training loop

import asyncio

tracker = WeightSyncer(
    policy_client=policy.inner,
    deploy_mgr=deploy_mgr,
    deployment_id="grpo-serving",
    base_model=base_model,
    hotload_timeout=600,
    first_checkpoint_type="base",
)

for row in dataset:
    input_messages = [m for m in row["messages"] if m.get("role") != "assistant"]
    completions = asyncio.run(
        sampler.sample_with_tokens(messages=input_messages, n=4, max_tokens=512)
    )
    rewards = [score(c) for c in completions]
    if len(set(rewards)) == 1:
        continue

    datums = build_grpo_datums(completions)
    ref_fwd = reference.forward(datums, "cross_entropy")
    ref_logprobs = [list(x["logprobs"].data) for x in ref_fwd.loss_fn_outputs]

    loss_fn = make_grpo_loss_fn(rewards, ref_logprobs, kl_beta=0.001)
    policy.forward_backward_custom(datums, loss_fn)
    policy.optim_step(
        tinker.AdamParams(learning_rate=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01)
    )

    tracker.save_and_hotload(f"step-{step:05d}")

See Loss Functions for make_grpo_loss_fn and build_grpo_datums implementations.

Pipeline overlap

Sampling and training overlap within policy windows controlled by weight_sync_interval. All prompts in a window sample concurrently; results train as they arrive. At window boundaries the pipeline drains, weights sync to the deployment, and the next window samples against the updated weights.

`weight_sync_interval`	Behavior
`1` (default)	No overlap — sample, train, sync, repeat
`N > 1`	N-step windows with overlap inside, sync at boundaries
`0`	No syncs — the deployment keeps the base weights for the entire run. Useful for debugging or ablations, not standard RL training.

Operational guidance

deployment.tokenizer_model is required — the API raises ValueError if not set.
Set infra.training_shape_id — training shapes are the launch path for cookbook trainers.
Set infra.ref_training_shape_id when you want a reference trainer — if it is unset, the recipe skips reference-model provisioning entirely.
Skip prompts with uniform rewards (all correct or all wrong) — they provide no learning signal.
Track reward distributions and KL every step to catch objective drift early.
When configured, the reference trainer uses --forward-only — never call optim_step on it.
Sampling is async under the hood: DeploymentSampler.sample_with_tokens() issues n concurrent n=1 requests, so synchronous scripts should wrap it with asyncio.run(...).
DCP checkpoints are disabled by default (dcp_save_interval=0). If you need to resume training from a checkpoint, explicitly set dcp_save_interval to a positive value in your WeightSyncConfig.

Common pitfalls

Reward normalization bugs can destabilize GRPO updates quickly — verify advantage computation.
Reference/policy tokenizer mismatch invalidates KL estimates — always use the same base_model.
Logprob alignment: Trainer returns N-1 logprobs for N tokens. Inference returns N logprobs where the first is None. Use inference[1:] to align.

This page covers rl_loop (GRPO and its policy_loss variants — dapo, gspo, cispo, dro, importance_sampling, reinforce). One sibling RL recipe ships in the cookbook alongside it:

IGPO (training.recipes.igpo_loop) — Information Gain Policy Optimization for multi-turn agent trajectories. Adds turn-level IG rewards on top of the GRPO machinery; same policy_loss variants apply.

For runnable examples and rationale, see the recipe sources directly in the public cookbook repo. Implementation depth (RL internals, weight-sync state machine, hotload triage) lives in the skills/dev/ skill.

Async RL (experimental)

training.recipes.async_rl_loop overlaps rollout and training so sampling and gradient steps run on separate workers concurrently — the trainer no longer waits for a full batch of rollouts before stepping. You write a single async function:

async def rollout_fn(sample_prompt: dict) -> RolloutSample | None:
    ...

The recipe handles everything else: the outer loop, batching, weight sync, off-policy staleness bounds, and the rollout/train scheduler. No backward-compatibility guarantee — config fields and the rollout protocol may change between releases. For the full rollout_fn / RolloutSample contract, scheduler details, and tuning guidance, see the cookbook’s skills/dev/references/rl/async-rl.md skill.

Cookbook DPO — preference optimization
Cookbook Reference — all config classes
Loss Functions — API-level loss function details

Get Started

Serverless

Deployments

Models & Inference

Fine Tuning

Fire Pass

Administration

Security & Compliance

Integrations

Cookbook: Reinforcement Learning

What this is

Architecture

Using the recipe

Policy loss variants

Step-by-step (API-level)

Provision resources with `setup_infra`

Training loop

Pipeline overlap

Operational guidance

Common pitfalls

Async RL (experimental)

Get Started

Serverless

Deployments

Models & Inference

Fine Tuning

Fire Pass

Administration

Security & Compliance

Integrations

Documentation Index

​What this is

​Architecture

​Using the recipe

​Policy loss variants

​Step-by-step (API-level)

​Provision resources with setup_infra

​Training loop

​Pipeline overlap

​Operational guidance

​Common pitfalls

​Related RL recipes

​Async RL (experimental)

​Related guides

What this is

Architecture

Using the recipe

Policy loss variants

Step-by-step (API-level)

Provision resources with `setup_infra`

Training loop

Pipeline overlap

Operational guidance

Common pitfalls

Related RL recipes

Async RL (experimental)

Related guides