Every model in Stav's catalogue is served in two tiers. The tier describes the trade-off you're making — not the underlying quantization technique.
client.chat.completions.create(model="qwen3-32b:precision", ...)
client.chat.completions.create(model="qwen3-32b:fast", ...)
client.chat.completions.create(model="qwen3-32b", ...) # implicit :precision
client.chat.completions.create(model="auto:fast", ...) # router, biased to :fast
client.chat.completions.create(model="auto", ...) # router decides
The suffix is part of the model identifier in every Stav-served endpoint. You'll see it in the catalogue (/v1/public/models), in the route-preview response, in routing-log entries, and in inference-log billing rows.
What each tier means
:precision
Full-weight inference. FP16 or BF16 — the original published weights from the creator, no quantization applied. Best for complex reasoning, legal drafting, medical analysis, code generation, anything where subtle quality differences matter.
Higher €/1M tokens. Higher latency. Higher accuracy on benchmarks that probe nuance.
:fast
Quantized weights. Typically FP8 on H100s (W8A8 native), or AWQ / GPTQ INT4 for very large models that don't fit in a single H100 at FP8. Best for summarisation, classification, translation, high-volume batch processing, simple Q&A, anything where throughput and unit economics matter more than the last 0.5% of benchmark score.
Lower €/1M tokens (typically 40–60% lower than the same model's :precision). Higher throughput (1.5–2× tokens/sec on the same hardware). Lower time-to-first-token.
For every catalogue model, Stav publishes a side-by-side comparison: MMLU-Pro, GPQA, HumanEval, MT-Bench scores for both tiers, along with the quality drop observed in our own validation gates. The numbers are visible at /models/<slug> on the public catalogue.
Why two tiers and not three
We considered a :balanced middle tier and rejected it. Two tiers map cleanly to two use cases: "I want maximum quality" and "I want maximum efficiency". A third tier creates ambiguity ("when would I pick balanced over fast?") without adding meaningful customer value. The internal quantization behind :fast can evolve as techniques improve — the customer-facing contract stays stable.
We also considered hiding tiers entirely (always serve quantized, never tell the customer). Rejected — transparency is a brand value, and customers in regulated sectors will mandate full-precision inference for compliance reasons. They need to be able to specify :precision explicitly and audit that it was honoured.
How the router uses the suffix
If you set model="auto" (no suffix), the Smart Router picks the tier from your team's cost / latency / quality weights:
auto:fast and auto:precision are hints, not hard constraints — they constrain the candidate pool to that tier before scoring, but the router still picks the best score within the constrained pool. Useful for known-cheap (auto:fast for background data enrichment) or known-deep (auto:precision for production-critical reasoning) workloads.
Hardware-aware defaults
Stav's GPU pool is heterogeneous. NVIDIA H100 SXM5 (Green Mountain DC2) supports FP8 natively at hardware speed; older GPU types may not. The :fast endpoint of a given model maps to the highest-performance quantization that the model's deployed worker hardware supports.
For most catalogue models on H100, :fast is FP8 W8A8. For very large models (>70B parameters) that don't fit in one H100 at FP8, :fast is AWQ INT4. For Apple Silicon BYO deployments, :fast is MLX INT4 or INT8. The customer-facing label is the same — :fast — and the unit pricing reflects the underlying cost; you never need to know which quantization is running.
Defaults summary
The implicit default for an explicit model name is :precision. The reasoning — if you named a specific model, you probably care about output quality and don't want a silent quantization surprise. The implicit default for auto is whichever tier the router decides serves the team's weights best.
Validation
Every :fast endpoint passes a quality gate before going live. We run MMLU-Pro, GPQA, HumanEval, and a domain-specific subset of MT-Bench against both tiers and reject :fast endpoints whose quality drop exceeds the per-model threshold (typically <1% on MMLU, <2% on coding benchmarks). The validation results are published alongside the catalogue entry.
If a :fast endpoint underperforms its :precision peer in production traffic — measured via the routing log's per-endpoint quality signal — it's flagged for re-quantization. You can prefer :precision for your whole team via Smart Router weights at any time.
Next steps
- Smart Router — how the router picks tier when you send
auto. - Sovereignty — the sovereignty tier is orthogonal to the inference tier; you can have
sovereign:fastorrouted:precisionin any combination.