Set model="auto" and Stav picks the right model for each request. The decision is policy-driven, transparent, and audit-logged.
Why route at all
Every team has prompts that don't deserve a frontier model. A status-page summary doesn't need Claude Sonnet's reasoning depth; an "extract the date from this string" call doesn't need 70 billion parameters. Always-on frontier-routing is the most common source of overpaid inference in any AI-heavy organisation.
Stav's Smart Router collapses the trade-off. Simple queries go to fast, cheap models. Complex reasoning goes to large frontier models. Sensitive prompts stay on EU-sovereign infrastructure regardless of cost. You pay for what the prompt actually needs, not for what your worst-case prompt might need.
The published research is consistent on the savings: RouteLLM (UC Berkeley / LMSYS, ICLR 2025) reports 85% cost reduction on MT-Bench and 45% on MMLU versus always-on GPT-4, holding 95% of GPT-4's response quality. Stav's router targets the same operating point on a wider catalogue.
How to use it
Replace the model identifier with auto:
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "..."}],
)
That's the whole API surface. Everything else — streaming, tool calling, structured output, embeddings — behaves identically.
How the routing decision is made
The router scores every eligible model on three dimensions, weighted by your team's policy:
- Cost — €/1M tokens (lower is better).
- Latency — typical time-to-first-token + tokens/second for the prompt shape (faster is better).
- Quality — the model's score on the relevant benchmark family for this prompt (higher is better).
Weights are integers that sum to 100, configured per team in the Customer Portal under Connect → Smart Router. A team with weights cost=10, latency=30, quality=60 will lean toward Claude Sonnet for everything; a team with cost=60, latency=30, quality=10 will lean toward Llama 4 Scout or Qwen3. Most teams sit between those poles.
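On top of the scores, the router applies hard filters that remove ineligible endpoints and soft adjustments (the +15 sovereignty bonus and endpoint-lifecycle multipliers) that shift the score. Both are described under Hard constraints versus soft preferences.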
The selected model is whichever survives all the filters and scores highest on the weighted sum.
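The weighted sum can be sketched as follows. This is a minimal illustration, not the router's actual implementation: it assumes each dimension has already been normalized to a 0-100 scale where higher is better (so a cheaper model has a higher cost score), and the Candidate class, the function names, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    slug: str
    cost: float     # assumed normalized 0-100, higher = cheaper
    latency: float  # assumed normalized 0-100, higher = faster
    quality: float  # assumed normalized 0-100, higher = better benchmark score


def weighted_score(c: Candidate, weights: dict) -> float:
    """Weighted sum of the three dimensions; weights are integers summing to 100."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    total = (
        weights["cost"] * c.cost
        + weights["latency"] * c.latency
        + weights["quality"] * c.quality
    )
    return total / 100


def pick(candidates: list, weights: dict) -> Candidate:
    """Return the candidate with the highest weighted score."""
    return max(candidates, key=lambda c: weighted_score(c, weights))


# Hypothetical numbers, for illustration only.
small = Candidate("small-model", cost=90, latency=80, quality=50)
large = Candidate("large-model", cost=20, latency=50, quality=95)

quality_first = {"cost": 10, "latency": 30, "quality": 60}
cost_first = {"cost": 60, "latency": 30, "quality": 10}

print(pick([small, large], quality_first).slug)
print(pick([small, large], cost_first).slug)
```

With these made-up inputs, the quality-heavy weights favour the larger model and the cost-heavy weights favour the smaller one, which mirrors the two poles described above.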
Reading the decision
The response carries the selection rationale in headers:
X-Stav-Selected-Model: meta-llama/llama-4-maverick:precision
X-Stav-Sovereignty-Tier: sovereign
X-Stav-Routing-Reason: Cost weight (30%) ≈ quality weight (40%). Score: 78.4/100.
X-Request-Id: req-7a3f9c2e1b8d4a5f
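For application logs, the headers can be folded into a small record. The sketch below is one way to do that; parse_routing_headers is a hypothetical helper, and the "Score: N/100" pattern is inferred from the single example above, so treat the regex as an assumption rather than a documented format.

```python
import re


def parse_routing_headers(headers: dict) -> dict:
    """Collect the X-Stav-* routing headers into a flat log record."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    reason = h.get("x-stav-routing-reason", "")
    # Assumed format: the reason string ends with "Score: <number>/100."
    match = re.search(r"Score:\s*([\d.]+)/100", reason)
    return {
        "model": h.get("x-stav-selected-model"),
        "sovereignty_tier": h.get("x-stav-sovereignty-tier"),
        "reason": reason,
        "score": float(match.group(1)) if match else None,
        "request_id": h.get("x-request-id"),
    }


example = {
    "X-Stav-Selected-Model": "meta-llama/llama-4-maverick:precision",
    "X-Stav-Sovereignty-Tier": "sovereign",
    "X-Stav-Routing-Reason": "Cost weight (30%) ≈ quality weight (40%). Score: 78.4/100.",
    "X-Request-Id": "req-7a3f9c2e1b8d4a5f",
}
record = parse_routing_headers(example)
print(record["model"], record["score"])
```

Because the reason header is human-readable text, the score falls back to None when the pattern does not match instead of raising.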
If you want the structured form for logging, hit the route preview endpoint with the same prompt:
curl https://api.stav.ai/v1/route-preview \
  -H "Authorization: Bearer $STAV_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"..."}]}'
Response (truncated):
{
  "selected_model_slug": "meta-llama/llama-4-maverick:precision",
  "selected_model_name": "Llama 4 Maverick",
  "model_type": "sovereign_oss",
  "sovereignty_badge": "sovereign",
  "provider": "Stav · Green Mountain DC2",
  "reason": "Cost weight (30%) ≈ quality weight (40%). Score: 78.4/100.",
  "estimated_cost_per_1k_tokens_eur": 0.0015,
  "alternatives": [
    {"slug": "qwen/qwen3-32b:fast", "score": 72.1, "sovereignty_badge": "sovereign"},
    {"slug": "anthropic/claude-sonnet-4-5", "score": 68.9, "sovereignty_badge": "routed"}
  ]
}
The preview endpoint is a dry run — it computes the decision without executing the inference. Use it to debug routing config, to log the chosen model for compliance, or to expose the router decision in your own UI.
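For compliance logging, the preview body can be reduced to the few fields an audit trail usually needs. The sketch below works on a subset of the truncated response shown above; summarize_preview is a hypothetical helper, and it uses only field names that appear in that sample.

```python
import json


def summarize_preview(body: str) -> dict:
    """Reduce a route-preview JSON body to a compact audit record."""
    preview = json.loads(body)
    # Sort alternatives by score so the runner-up is always first.
    alternatives = sorted(
        preview.get("alternatives", []),
        key=lambda alt: alt["score"],
        reverse=True,
    )
    return {
        "selected": preview["selected_model_slug"],
        "sovereignty": preview["sovereignty_badge"],
        "runner_up": alternatives[0]["slug"] if alternatives else None,
        "est_cost_per_1k_eur": preview.get("estimated_cost_per_1k_tokens_eur"),
    }


# Subset of the sample response from the documentation above.
SAMPLE = """
{
  "selected_model_slug": "meta-llama/llama-4-maverick:precision",
  "sovereignty_badge": "sovereign",
  "estimated_cost_per_1k_tokens_eur": 0.0015,
  "alternatives": [
    {"slug": "anthropic/claude-sonnet-4-5", "score": 68.9, "sovereignty_badge": "routed"},
    {"slug": "qwen/qwen3-32b:fast", "score": 72.1, "sovereignty_badge": "sovereign"}
  ]
}
"""

summary = summarize_preview(SAMPLE)
print(summary)
```

Since the documented response is truncated, the helper reads optional fields with .get and ignores anything it does not recognize.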
Tier hints
You can hint a tier preference without locking to a specific model:
client.chat.completions.create(model="auto:fast", ...) # prefer :fast endpoints
client.chat.completions.create(model="auto:precision", ...) # prefer :precision endpoints
auto:fast constrains the candidate pool to :fast quantization tiers (FP8, AWQ INT4) and picks the highest-scoring model within it. auto:precision does the same for full-weight endpoints. Use auto:fast for high-volume background workloads and auto:precision for nuanced reasoning. See Tier suffix for the full mapping.
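The effect of a tier hint on the candidate pool can be sketched like this. It assumes the tier is readable from the slug suffix, as in the example slugs on this page; tier_of and constrain_pool are hypothetical names, and the scores are reused from the preview sample purely for illustration.

```python
from typing import Optional

TIERS = ("fast", "precision")


def tier_of(model: str) -> Optional[str]:
    """Return the tier suffix of 'auto:fast' or an endpoint slug, if it has one."""
    _, sep, suffix = model.rpartition(":")
    return suffix if sep and suffix in TIERS else None


def constrain_pool(requested: str, scored: dict) -> str:
    """Filter candidates to the hinted tier, then return the best-scoring slug."""
    tier = tier_of(requested)
    pool = {
        slug: score
        for slug, score in scored.items()
        if tier is None or tier_of(slug) == tier
    }
    if not pool:
        raise LookupError(f"no candidate endpoint for {requested!r}")
    return max(pool, key=pool.get)


# Scores taken from the preview sample above; illustration only.
scored = {
    "meta-llama/llama-4-maverick:precision": 78.4,
    "qwen/qwen3-32b:fast": 72.1,
    "anthropic/claude-sonnet-4-5": 68.9,
}

print(constrain_pool("auto", scored))
print(constrain_pool("auto:fast", scored))
```

In this sketch a plain auto request considers every candidate, while auto:fast drops anything without the :fast suffix before scores are compared.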
Hard constraints versus soft preferences
Some routing decisions are hard rules, not preferences. The router will never:
- Route a request from a team with sovereignty_required: true to a routed-commercial endpoint, regardless of weights or scores.
- Route to an endpoint in a blocked country.
- Route to a retired endpoint.
- Route to an endpoint whose backing workers are not in ready state.
Soft preferences (cost / latency / quality weights, the +15 sovereignty bonus, endpoint-lifecycle multipliers) only change the score — they don't filter the candidate pool.
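The difference between filtering and scoring can be made concrete with a short sketch. It assumes the +15 sovereignty bonus is added to the weighted score as points, which the text does not spell out, and it leaves out the endpoint-lifecycle multipliers because their values are not given here. The Endpoint class, field names, and example data are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

SOVEREIGNTY_BONUS = 15  # from the text; additive application is an assumption


@dataclass
class Endpoint:
    slug: str
    base_score: float    # weighted cost/latency/quality score
    sovereign: bool
    country: str
    lifecycle: str       # hypothetical values, e.g. "active" or "retired"
    workers_ready: bool


def eligible(e: Endpoint, sovereignty_required: bool, blocked: set) -> bool:
    """Hard rules: any failure removes the endpoint from the pool."""
    if sovereignty_required and not e.sovereign:
        return False
    if e.country in blocked:
        return False
    if e.lifecycle == "retired":
        return False
    return e.workers_ready


def select(endpoints: list, sovereignty_required: bool, blocked: set) -> Optional[Endpoint]:
    """Filter first, then rank survivors by score plus soft bonus."""
    pool = [e for e in endpoints if eligible(e, sovereignty_required, blocked)]
    if not pool:
        return None
    return max(
        pool,
        key=lambda e: e.base_score + (SOVEREIGNTY_BONUS if e.sovereign else 0),
    )


# Hypothetical endpoints, for illustration only.
a = Endpoint("sovereign-a", 70, True, "NO", "active", True)
b = Endpoint("routed-b", 90, False, "US", "active", True)
c = Endpoint("retired-c", 99, True, "NO", "retired", True)

print(select([a, b, c], sovereignty_required=False, blocked=set()).slug)
print(select([a, b, c], sovereignty_required=True, blocked=set()).slug)
```

Here the bonus narrows the gap between the two live endpoints but does not override a large score difference, whereas setting sovereignty_required removes the routed endpoint before any scoring happens.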
Observability
Every routing decision lands in the Routing Log surface, available in the Admin Portal at /admin/routing-log. The log captures the full candidate set, per-model score components, the selected worker, and the reason. Teams' own decisions are visible in the Customer Portal under Monitor → Routing.
If a decision surprises you, the log is the canonical answer. Open a ticket if the log makes the decision look wrong on inspection — surprises here are usually a misconfigured weight set or a stale model-catalogue benchmark, and we'd rather catch both early.
What's coming next
The current router (v1.0) uses the weighted scoring described above. The v2.0 roadmap adds:
- DeBERTa-classified prompt complexity / domain / sensitivity as additional score inputs, so simple prompts route to :fast automatically without needing the team to set extreme weights.
- Semantic cache that short-circuits routing for prompts semantically equivalent to a recently answered prompt.
- Per-tenant calibration that learns the team's quality threshold from production traffic instead of static benchmark scores.
The customer-facing contract (model="auto", the headers, the preview endpoint) stays stable across versions.
Next steps
- Tier suffix: :fast vs :precision and how the router uses tier hints.
- Sovereignty: why sovereignty_preference is on by default.