Tier suffix (:fast vs :precision) — Stav Docs

Two inference tiers per model — :fast for speed and cost, :precision for quality. The suffix is part of the model identifier.

Every model in Stav's catalogue is served in two tiers. The tier describes the trade-off you're making — not the underlying quantization technique.

client.chat.completions.create(model="qwen3-32b:precision", ...)
client.chat.completions.create(model="qwen3-32b:fast", ...)
client.chat.completions.create(model="qwen3-32b", ...)            # implicit :precision
client.chat.completions.create(model="auto:fast", ...)            # router, biased to :fast
client.chat.completions.create(model="auto", ...)                 # router decides

The suffix is part of the model identifier in every Stav-served endpoint. You'll see it in the catalogue (/v1/public/models), in the route-preview response, in routing-log entries, and in inference-log billing rows.

What each tier means

`:precision`

Full-weight inference. FP16 or BF16 — the original published weights from the creator, no quantization applied. Best for complex reasoning, legal drafting, medical analysis, code generation, anything where subtle quality differences matter.

Higher €/1M tokens. Higher latency. Higher accuracy on benchmarks that probe nuance.

`:fast`

Quantized weights. Typically FP8 on H100s (W8A8 native), or AWQ / GPTQ INT4 for very large models that don't fit in a single H100 at FP8. Best for summarisation, classification, translation, high-volume batch processing, simple Q&A, anything where throughput and unit economics matter more than the last 0.5% of benchmark score.

Lower €/1M tokens (typically 40–60% lower than the same model's :precision). Higher throughput (1.5–2× tokens/sec on the same hardware). Lower time-to-first-token.

For every catalogue model, Stav publishes a side-by-side comparison: MMLU-Pro, GPQA, HumanEval, MT-Bench scores for both tiers, along with the quality drop observed in our own validation gates. The numbers are visible at /models/<slug> on the public catalogue.

Why two tiers and not three

We considered a :balanced middle tier and rejected it. Two tiers map cleanly to two use cases: "I want maximum quality" and "I want maximum efficiency". A third tier creates ambiguity ("when would I pick balanced over fast?") without adding meaningful customer value. The internal quantization behind :fast can evolve as techniques improve — the customer-facing contract stays stable.

We also considered hiding tiers entirely (always serve quantized, never tell the customer). Rejected — transparency is a brand value, and customers in regulated sectors will mandate full-precision inference for compliance reasons. They need to be able to specify :precision explicitly and audit that it was honoured.

How the router uses the suffix

If you set model="auto" (no suffix), the Smart Router picks the tier from your team's cost / latency / quality weights:

auto:fast and auto:precision are hints, not hard constraints — they constrain the candidate pool to that tier before scoring, but the router still picks the best score within the constrained pool. Useful for known-cheap (auto:fast for background data enrichment) or known-deep (auto:precision for production-critical reasoning) workloads.

Hardware-aware defaults

Stav's GPU pool is heterogeneous. NVIDIA H100 SXM5 (Green Mountain DC2) supports FP8 natively at hardware speed; older GPU types may not. The :fast endpoint of a given model maps to the highest-performance quantization that the model's deployed worker hardware supports.

For most catalogue models on H100, :fast is FP8 W8A8. For very large models (>70B parameters) that don't fit in one H100 at FP8, :fast is AWQ INT4. For Apple Silicon BYO deployments, :fast is MLX INT4 or INT8. The customer-facing label is the same — :fast — and the unit pricing reflects the underlying cost; you never need to know which quantization is running.

Defaults summary

The implicit default for an explicit model name is :precision. The reasoning — if you named a specific model, you probably care about output quality and don't want a silent quantization surprise. The implicit default for auto is whichever tier the router decides serves the team's weights best.

Validation

Every :fast endpoint passes a quality gate before going live. We run MMLU-Pro, GPQA, HumanEval, and a domain-specific subset of MT-Bench against both tiers and reject :fast endpoints whose quality drop exceeds the per-model threshold (typically <1% on MMLU, <2% on coding benchmarks). The validation results are published alongside the catalogue entry.

If a :fast endpoint underperforms its :precision peer in production traffic — measured via the routing log's per-endpoint quality signal — it's flagged for re-quantization. You can prefer :precision for your whole team via Smart Router weights at any time.

Next steps

Smart Router — how the router picks tier when you send auto.
Sovereignty — the sovereignty tier is orthogonal to the inference tier; you can have sovereign:fast or routed:precision in any combination.

Cost-heavy (e.g. `cost=60, latency=20, quality=20`)	Prefers `:fast` endpoints
Quality-heavy (e.g. `cost=10, latency=20, quality=70`)	Prefers `:precision` endpoints
Balanced	Router uses prompt-complexity classification: simple prompts → `:fast`, complex → `:precision`

`qwen3-32b`	The `:precision` endpoint of `qwen3-32b`
`qwen3-32b:precision`	Same as above, explicit
`qwen3-32b:fast`	The `:fast` endpoint of `qwen3-32b`
`auto`	Router decides model AND tier from team weights
`auto:fast`	Router decides model, constrained to `:fast` endpoints
`auto:precision`	Router decides model, constrained to `:precision` endpoints