- Tinker — flat per-token billing, no cross-turn cache (re-prefill every turn)
- Fireworks Dedicated — on-demand GPU-hour billing; the cache savings show up as more work per hour, not as a discounted token rate
Performance and benchmarking notes
Dedicated trainer vs pooled/serverless resourcing
Tinker runs training jobs on a pooled/serverless GPU fleet, which lets a single job burst onto many more GPUs than you would dedicate to a replica on Fireworks. That burst is what makes individual Tinker steps feel fast, but it also caps the maximum training speed you can buy: you cannot pay to scale beyond the pool’s per-job allocation, and you cannot reserve isolated capacity. Fireworks dedicated trainers take the opposite trade-off: predictable, isolated execution with no shared-pool queueing or noisy-neighbor variance, and the ability to scale wall-clock time and cost independently by adjusting replica count. If you want faster steps on dedicated, increase replica count and parallelize work. For large-model training or longer rollouts, we have consistently found that a dedicated setup like ours is cheaper overall, and it can also be faster depending on the customer’s resourcing needs.
Context-length benchmarking caveat
Benchmark comparisons are only apples-to-apples when truncation policy and effective context length are matched. If one system truncates samples longer than 32k tokens
and another does not, the non-truncating run is doing more work and will
appear slower.
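To make the caveat concrete, here is a minimal sketch of how much token work each policy implies. It assumes that "truncating" means clipping each sample to the limit, and the sample lengths and the `tokens_processed` helper are hypothetical, not part of either provider's tooling.

```python
def tokens_processed(sample_lengths, max_len=None):
    """Total tokens a run processes; max_len=None means no truncation."""
    if max_len is None:
        return sum(sample_lengths)
    # Assumption: truncation clips each sample to max_len tokens.
    return sum(min(n, max_len) for n in sample_lengths)


# Hypothetical token counts for four samples in one batch.
lengths = [8_000, 20_000, 48_000, 96_000]

truncated = tokens_processed(lengths, max_len=32_000)
untruncated = tokens_processed(lengths)

print(f"truncating at 32k: {truncated} tokens")
print(f"no truncation:     {untruncated} tokens")
print(f"extra work:        {untruncated / truncated:.2f}x")
```

If the two runs being compared do not report the same total, their step times are not directly comparable until the truncation settings are aligned.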
Replica count is a speed/cost knob
Users can trade cost and wall-clock time by scaling replicas. A quick back-of-envelope estimate:
Check utilization before scaling
Fireworks Dedicated is billed by GPU-hour, so low rollout traffic can make a job look slow or expensive even when the deployment has spare capacity. Before adding replicas, first confirm whether the inference deployment is saturated or waiting for more work from your rollout client. Useful signals:
- Per-request performance metrics: log Fireworks response metrics such as prompt tokens, cached prompt tokens, time to first token, and total server processing time from your rollout client. Non-streaming requests include these in response headers; for streaming requests, set perf_metrics_in_response to include them in the final response chunk.
- Deployment-level metrics: export Prometheus-style metrics for request rate, prompt and cached-token rates, queue latency, KV-cache usage, and concurrent request count. Low request/concurrency metrics with low queueing usually mean the deployment can accept more traffic.
- Training API efficiency hints: when available, monitor trainer/training_efficiency/.../effective_batch_fill_ratio:last and trainer/training_efficiency/.../trainer_waiting_for_work:last. These are returned in the metrics dict on your forward / forward_backward responses, not on the deployment dashboard. Low batch fill or a trainer-waiting-for-work signal usually points to the rollout side not feeding the trainer fast enough. See Reading Training API efficiency metrics below for how to access and interpret them.
See max_concurrent_rollouts
and the Training API deployment replica guidance.
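The per-request signal can be logged with a small helper like the sketch below. The header names and the assumption that timings are in seconds are illustrative guesses, not confirmed by this page, so check the headers your responses actually carry. `summarize_request` is a hypothetical helper and makes no network calls.

```python
# Header names are ASSUMPTIONS for illustration; verify them against the
# headers your own responses return before relying on them.
PROMPT_TOKENS = "fireworks-prompt-tokens"
CACHED_TOKENS = "fireworks-cached-prompt-tokens"
TTFT_SECONDS = "fireworks-server-time-to-first-token"
SERVER_SECONDS = "fireworks-server-processing-time"


def summarize_request(headers):
    """Turn one response's perf headers into numbers worth logging."""
    prompt = int(headers.get(PROMPT_TOKENS, 0))
    cached = int(headers.get(CACHED_TOKENS, 0))
    return {
        "prompt_tokens": prompt,
        "cached_prompt_tokens": cached,
        # Share of the prompt that was served from the prefix cache.
        "cache_hit_ratio": cached / prompt if prompt else 0.0,
        "ttft_s": float(headers.get(TTFT_SECONDS, 0.0)),
        "server_s": float(headers.get(SERVER_SECONDS, 0.0)),
    }


# Hypothetical header values from one non-streaming rollout request.
example = {
    PROMPT_TOKENS: "12000",
    CACHED_TOKENS: "9000",
    TTFT_SECONDS: "0.42",
    SERVER_SECONDS: "3.10",
}
summary = summarize_request(example)
print(summary)
```

A consistently high cache hit ratio together with short server times suggests the deployment has headroom, in which case the rollout client is the more likely bottleneck.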
Reading Training API efficiency metrics
The two trainer/training_efficiency/... metrics are returned in the metrics
dict on your forward / forward_backward responses. They do not appear on
inference deployment dashboards; the per-request and deployment-level signals
above are separate.
- effective_batch_fill_ratio:last: the number of tokens in a batch divided by the maximum possible. 1.0 means fully saturated; consistently low values across steps indicate under-filling.
- trainer_waiting_for_work:last: how much time the trainer (GPU) sat idle since the last op, i.e. the gap between forward calls. More waiting means the trainer is starved for work.
If either metric indicates a starved trainer, increase rollout concurrency (max_concurrent_rollouts) before adding deployment replicas.
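A minimal sketch of reading the two hints from a metrics dict follows. Because the middle of the metric path is elided on this page, the lookup matches on prefix and suffix only. The thresholds, the assumed unit for waiting time, and the `find_metric` and `rollout_bound` helpers are all hypothetical.

```python
PREFIX = "trainer/training_efficiency/"
FILL_SUFFIX = "/effective_batch_fill_ratio:last"
WAIT_SUFFIX = "/trainer_waiting_for_work:last"


def find_metric(metrics, suffix):
    """Return the first metric whose key has the expected prefix and suffix."""
    for key, value in metrics.items():
        if key.startswith(PREFIX) and key.endswith(suffix):
            return value
    return None


def rollout_bound(metrics, min_fill=0.8, max_wait=1.0):
    """True when either hint suggests the trainer is waiting on rollouts.

    min_fill and max_wait are illustrative thresholds, not documented values;
    the unit of the waiting metric is assumed, so calibrate on your own runs.
    """
    fill = find_metric(metrics, FILL_SUFFIX)
    wait = find_metric(metrics, WAIT_SUFFIX)
    low_fill = fill is not None and fill < min_fill
    long_wait = wait is not None and wait > max_wait
    return low_fill or long_wait


# Hypothetical metrics dict; "example" stands in for the elided path segment.
metrics = {
    "trainer/training_efficiency/example/effective_batch_fill_ratio:last": 0.45,
    "trainer/training_efficiency/example/trainer_waiting_for_work:last": 2.7,
    "loss": 1.23,
}
print(rollout_bound(metrics))
```

Checking this across several consecutive steps, rather than on a single response, avoids reacting to one unusually small batch.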
How the numbers come together
Tinker (the cost customers describe)
Each turn re-prefills the full accumulated context:
$$N_{\text{re-prefill}} = \sum_{t=1}^{T} \bigl(P + (t-1)\,C\bigr) = T\,P + \frac{T(T-1)}{2}\,C$$
where $P$ is the initial prompt (system + tools + task), $C$ is the context added per turn (model response + tool result), and $T$ is the turn count. This is quadratic in $T$.
Fireworks Dedicated (GPU-hour billing)
Dedicated deployments are billed per GPU-second, so the prefix cache shows up as higher effective throughput rather than a discount on per-token rates. Across one episode, each unique token is prefilled at most once; the rest of the prompt is served from the prefix cache and contributes essentially no GPU work. The uncached portion that actually hits prefill is at most:
$$N_{\text{uncached}} = P + (T-1)\,C$$
On a saturated cluster, cached tokens contribute essentially nothing to wall-clock work, so the cluster’s effective cost per million tokens falls as utilization rises. For continuous RL training, where rollouts run at a sustained pace, dedicated is typically the cheapest path at scale.
The calculator’s dedicated path uses saturated throughput estimates as
defaults. A small, lightly-loaded test deployment will look more expensive
per token than these numbers because the cluster is paid for whether it’s
busy or idle. Tune the throughput inputs in the Advanced panel to match
your actual rollout pace.
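The gap between the two prefill models can be sketched in a few lines. This is an illustration under stated assumptions, not the calculator's implementation: `P` is the initial prompt in tokens, `C` the tokens added per turn, `T` the turn count, and turn `t` is assumed to see the initial prompt plus `t - 1` increments. The episode shape below is hypothetical.

```python
def reprefill_tokens(P, C, T):
    """Prefill tokens when every turn re-sends the full context (no cache)."""
    return sum(P + (t - 1) * C for t in range(1, T + 1))


def uncached_tokens(P, C, T):
    """Prefill tokens when each unique token is prefilled only once."""
    return P + (T - 1) * C if T > 0 else 0


# Hypothetical episode: 4k-token prompt, 1.5k tokens added per turn, 20 turns.
P, C, T = 4_000, 1_500, 20

no_cache = reprefill_tokens(P, C, T)
with_cache = uncached_tokens(P, C, T)

print(f"re-prefill every turn: {no_cache} tokens")
print(f"prefix cache:          {with_cache} tokens")
print(f"ratio:                 {no_cache / with_cache:.1f}x")
```

Because the first quantity grows with the square of the turn count and the second grows linearly, the ratio widens as episodes get longer.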
What’s covered
The calculator currently includes the four models for which Tinker publishes per-token rates:

| Model | Tinker prefill / sample (per 1M tokens) |
|---|---|
| Kimi K2.6 (128K) | 12.81 |
| Kimi K2.5 (128K) | 12.81 |
| Qwen3.5-397B-A17B (256K) | 10.00 |
| GPT-OSS-120B (128K) | 1.54 |
Pricing is defined in snippets/multi-turn-cost-calculator.jsx; update it there if
either side’s pricing changes.
FAQ
What is the fastest way to reduce wall-clock time?
Increase replicas and overlap sampling/training where your workflow allows it. Those are usually the most direct levers for shortening end-to-end cycle time.
How should I compare costs between providers?
Use matched assumptions for context length, truncation policy, and effective resource allocation. The calculator at the top of this page handles the math once you plug in your episode shape, but align truncation policy and effective context window between providers before drawing conclusions.
Sources
- Tinker pricing: thinkingmachines.ai/tinker
- Fireworks GPU-hour pricing: fireworks.ai/pricing
- Related: RFT Cost Estimator — same idea, but for the training-side bill (Fireworks GPU-hour, no comparison column).