What this is
The cookbook’straining.recipes.distillation_loop trains one student from its own rollouts while frozen teacher deployments score those same responses.
Use it when you want recipe-managed trainer provisioning, student sampling, teacher scoring, checkpointing, and cleanup for distillation experiments.
Modes
| Mode | Use when | Teacher signal | Training loss |
|---|---|---|---|
sampled_reverse_kl | You want OPD-style sampled-token distillation | Teacher logprob on each sampled response token | importance_sampling |
topk_forward_kl | You want sparse SDFT soft labels from teacher top-K tokens | Teacher top_logprobs=K per response position | cross_entropy with [N, K] targets |
sampled_reverse_kl is the default. The student samples on policy, the teacher scores the sampled tokens, and the recipe trains on the dense per-token gap:
topk_forward_kl, set distill_mode=DistillMode.TOPK_FORWARD_KL and sdft_top_k.
Current limits and logprobs
The distillation recipe depends on the public inferencelogprobs response:
| Field or request option | Meaning |
|---|---|
top_k | Request-side sampling filter. It limits which next-token logits remain eligible for sampling and redistributes probability mass over that set. |
sampling_mask | Optional request flag for generated tokens. It can return the count or token IDs still eligible after sampling filters such as top_p and top_k. |
logprob | Model logprob for the returned token before sampling-temperature and sampling-filter renormalization. In the legacy response, this is token_logprobs. |
sampling_logprob | Generation-only logprob of the sampled token after temperature and sampling filters are applied. Use this when comparing against the distribution that actually sampled the token. |
top_logprobs | Response option for returning likely alternatives at each position. The public inference API currently caps this at 5, so sdft_top_k must be at most 5. |
top_k and top_logprobs are different knobs: top_k changes sampling; top_logprobs only controls how many alternatives are returned in the response.
Minimal example
teacher_model is a base model resource, the recipe creates a frozen teacher deployment for scoring. If it is already an inference model or deployment resource, the recipe uses it directly.
Multi-teacher runs
Setmulti_teacher=MultiTeacherConfig(...) when you have more than one teacher.
With sampled_reverse_kl, multi-teacher OPD is routed: each dataset row is scored by exactly one teacher selected by the configured route key, defaulting to teacher. With topk_forward_kl, every configured teacher can score the sampled response and the recipe blends sparse top-K probability mass using TeacherConfig.blend_weight.
Dataset contract
Rows are JSONL objects. The only required field ismessages, the student-visible OpenAI-style chat prompt.
Optional fields:
| Field | Use |
|---|---|
teacher | Default route key for routed sampled reverse-KL MOPD. The value must match a configured TeacherConfig.route_value, or the teacher model when route_value is unset. |
teacher_messages | Teacher-side prompt for privileged-context scoring. If omitted, the teacher scores under messages. |
expected_answer | Optional metadata for eval callbacks and smoke checks. |
TeacherConfig.tokenizer_model when you want the recipe to validate teacher tokenizers against DeployConfig.tokenizer_model.
Examples
The cookbook includes distillation examples undertraining/examples/distillation:
| Example | Path | Use |
|---|---|---|
| Privileged-context OPD/SDFT | gsm8k_privileged | Student sees the problem; teacher can see privileged solution context. |
| Routed MOPD smoke | routed_mopd/train_two_teacher_lora.py | Tiny generated dataset with two route labels and a LoRA student. |
Next steps
- Cookbook Reference - config classes and common recipe fields
- Loss Functions - built-in and custom Training API losses
- Weight sync - how updated weights reach serving deployments