What this is
This is the default lifecycle for research loops: bootstrap a trainer and deployment, run iterative updates, export checkpoints, sync weights to the deployment, then sample through it for realistic evaluation.
Workflow
- Create resources: a deployment (DeploymentManager) and a service-mode trainer (TrainerJobManager).
- Connect a training client from your Python loop.
- Run train steps: forward_backward_custom + optim_step in a loop.
- Save checkpoints at regular intervals using the base/delta pattern.
- Weight-sync the checkpoint to your serving deployment.
- Sample and evaluate through the deployment endpoint.
- Record metrics and decide whether to continue or branch experiments.
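The workflow above can be sketched as a small driver loop. This is a minimal illustration, not the SDK's API: LoopHooks and run_research_loop are hypothetical names, and each hook is a placeholder callable that you would back with your own training-client, checkpoint, weight-sync, and sampling calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LoopHooks:
    """Hypothetical callables standing in for real SDK calls."""
    train_step: Callable[[int], float]          # one forward/backward + optimizer step, returns loss
    save_checkpoint: Callable[[int, str], str]  # (step, checkpoint_type) -> checkpoint name
    weight_sync: Callable[[str], None]          # push the named checkpoint to the deployment
    evaluate: Callable[[], float]               # sample through the deployment, return a score


def run_research_loop(hooks: LoopHooks, total_steps: int, checkpoint_every: int) -> List[Dict]:
    """Train, checkpoint at a fixed interval, sync, evaluate, and record metrics."""
    records: List[Dict] = []
    saved_any = False
    for step in range(1, total_steps + 1):
        loss = hooks.train_step(step)
        if step % checkpoint_every != 0:
            continue
        # First checkpoint is a full "base" save; later ones are "delta" saves.
        ckpt_type = "delta" if saved_any else "base"
        name = hooks.save_checkpoint(step, ckpt_type)
        saved_any = True
        hooks.weight_sync(name)
        score = hooks.evaluate()
        records.append(
            {"step": step, "checkpoint": name, "type": ckpt_type, "loss": loss, "score": score}
        )
    return records
```

Keeping the hooks injectable makes it possible to dry-run the control flow with stubs before attaching a real trainer and deployment.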
End-to-end example
The only training-shape input you choose below is the shape ID. The SDK resolves the versioned reference for you before launch.
1. Bootstrap
2. Train step with custom objective
3. Checkpoint, weight sync, and evaluate
Operational guidance
- Service mode supports both full-parameter and LoRA tuning. Set lora_rank=0 for full-parameter or a positive integer (e.g. 16, 64) for LoRA, and match create_training_client(lora_rank=...) accordingly.
- Use checkpoint_type="base" for the first checkpoint, then "delta" for subsequent ones to reduce save/transfer time.
- DeploymentSampler.sample_with_tokens() is async; use await in async code or asyncio.run(...) from synchronous scripts.
- Keep checkpoint intervals predictable so evaluation comparisons are stable.
- Store the exact prompt set used for each evaluation sweep for reproducibility.
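The async guidance above can be illustrated with a stand-in sampler. StubSampler is hypothetical and its sample_with_tokens signature (a single prompt string in, a string out) is an assumption made for this sketch; the real DeploymentSampler method may take different arguments and return a different type. Only the await / asyncio.run pattern is the point here.

```python
import asyncio
from typing import List


class StubSampler:
    """Stand-in for a deployment sampler; NOT the real SDK class."""

    async def sample_with_tokens(self, prompt: str) -> str:
        # Assumed signature for illustration only.
        await asyncio.sleep(0)  # yield to the event loop, as a network call would
        return f"completion for: {prompt}"


async def evaluate(sampler: StubSampler, prompts: List[str]) -> List[str]:
    """Async path: await all sampling calls concurrently, preserving prompt order."""
    return await asyncio.gather(*(sampler.sample_with_tokens(p) for p in prompts))


def evaluate_sync(sampler: StubSampler, prompts: List[str]) -> List[str]:
    """Sync path: drive the coroutine from a plain script with asyncio.run."""
    return asyncio.run(evaluate(sampler, prompts))
```

Call evaluate_sync only from code that is not already running inside an event loop; inside async code, await evaluate(...) directly.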
Common pitfalls
- Sampling from trainer internals instead of deployment endpoints can skew results — always evaluate through the serving path.
- Missing checkpoint-to-deployment traceability makes rollback risky — log checkpoint names alongside metrics.
- Stale deployments: Always verify the weight-synced checkpoint identity matches what you expect before sampling.
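One way to guard against the traceability and stale-deployment pitfalls above is to build each evaluation record from the checkpoint name and a fingerprint of the prompt set, and to refuse to record anything when the deployed checkpoint is not the expected one. The helper names and record fields below are hypothetical, and how you obtain the deployed checkpoint identity depends on your deployment tooling; it is passed in here as a plain string.

```python
import hashlib
import json
from typing import Dict, List


def prompt_set_fingerprint(prompts: List[str]) -> str:
    """Stable SHA-256 fingerprint of the exact prompt list, order included."""
    payload = json.dumps(prompts, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_eval_record(
    step: int,
    expected_checkpoint: str,
    deployed_checkpoint: str,
    prompts: List[str],
    metrics: Dict[str, float],
) -> Dict:
    """Build a log record, failing loudly if the deployment is stale."""
    if deployed_checkpoint != expected_checkpoint:
        raise RuntimeError(
            f"deployment serves {deployed_checkpoint!r}, expected {expected_checkpoint!r}"
        )
    return {
        "step": step,
        "checkpoint": expected_checkpoint,
        "prompt_set_sha256": prompt_set_fingerprint(prompts),
        "metrics": dict(metrics),
    }
```

Appending these records to a JSON Lines file gives a simple mapping from any metric back to the checkpoint and prompt set that produced it.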
Related guides
- Loss Functions — built-in and custom loss function patterns
- Saving and Loading — checkpoint types and weight sync details
- DeploymentSampler reference — sampling API details
- WeightSyncer reference — weight sync lifecycle