Smart Router — Stav Docs

Set model="auto" and Stav picks the right model for each request. The decision is policy-driven, transparent, and audit-logged.

Why route at all

Every team has prompts that don't deserve a frontier model. A status-page summary doesn't need Claude Sonnet's reasoning depth; an "extract the date from this string" call doesn't need 70 billion parameters. Always-on frontier-routing is the most common source of overpaid inference in any AI-heavy organisation.

Stav's Smart Router collapses the trade-off. Simple queries go to fast, cheap models. Complex reasoning goes to large frontier models. Sensitive prompts stay on EU-sovereign infrastructure regardless of cost. You pay for what the prompt actually needs, not for what your worst-case prompt might need.

The published research is consistent on the savings: RouteLLM (UC Berkeley / LMSYS, ICLR 2025) reports 85% cost reduction on MT-Bench and 45% on MMLU versus always-on GPT-4, holding 95% of GPT-4's response quality. Stav's router targets the same operating point on a wider catalogue.

How to use it

Replace the model identifier with auto:

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "..."}],
)

That's the whole API surface. Everything else — streaming, tool calling, structured output, embeddings — behaves identically.

How the routing decision is made

The router scores every eligible model on three dimensions, weighted by your team's policy:

Cost — €/1M tokens (lower is better).
Latency — typical time-to-first-token + tokens/second for the prompt shape (faster is better).
Quality — the model's score on the relevant benchmark family for this prompt (higher is better).

Weights are integers that sum to 100, configured per team in the Customer Portal under Connect → Smart Router. A team with weights cost=10, latency=30, quality=60 will lean toward Claude Sonnet for everything; a team with cost=60, latency=30, quality=10 will lean toward Llama 4 Scout or Qwen3. Most teams sit between those poles.
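
As a rough illustration of the weighted sum, the sketch below assumes each dimension has already been normalised to a 0-100 score where higher is better (so a cheap model gets a high cost score). The `Weights` class, the normalisation, and the two sample candidates are hypothetical and are not taken from Stav's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    cost: int
    latency: int
    quality: int

    def __post_init__(self):
        # The docs state that weights are integers summing to 100.
        values = (self.cost, self.latency, self.quality)
        if not all(isinstance(v, int) and v >= 0 for v in values):
            raise ValueError("weights must be non-negative integers")
        if sum(values) != 100:
            raise ValueError("weights must sum to 100")


def weighted_score(weights, cost_score, latency_score, quality_score):
    """Combine per-dimension scores (assumed 0-100, higher is better)."""
    total = (
        weights.cost * cost_score
        + weights.latency * latency_score
        + weights.quality * quality_score
    )
    return total / 100


# Hypothetical per-dimension scores for two made-up candidates.
candidates = {
    "large-model": {"cost_score": 20, "latency_score": 50, "quality_score": 95},
    "small-model": {"cost_score": 90, "latency_score": 85, "quality_score": 60},
}

quality_first = Weights(cost=10, latency=30, quality=60)
cost_first = Weights(cost=60, latency=30, quality=10)

for name, weights in [("quality_first", quality_first), ("cost_first", cost_first)]:
    ranked = sorted(
        candidates,
        key=lambda slug: weighted_score(weights, **candidates[slug]),
        reverse=True,
    )
    print(name, "->", ranked[0])
```

With these made-up numbers the quality-heavy weights rank the large model first and the cost-heavy weights rank the small model first, which mirrors the two poles described above.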

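On top of the scores, the router applies hard filters (sovereignty requirement, blocked countries, retired endpoints, worker readiness) and score adjustments (the +15 sovereignty bonus and endpoint-lifecycle multipliers). Both are described under Hard constraints versus soft preferences.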

The selected model is whichever survives all the filters and scores highest on the weighted sum.

Reading the decision

The response carries the selection rationale in headers:

X-Stav-Selected-Model: meta-llama/llama-4-maverick:precision
X-Stav-Sovereignty-Tier: sovereign
X-Stav-Routing-Reason: Cost weight (30%) ≈ quality weight (40%). Score: 78.4/100.
X-Request-Id: req-7a3f9c2e1b8d4a5f
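
For logging, the headers can be reduced to a small record. The sketch below uses the header names from the example above; the function name `routing_record` and the output field names are hypothetical. It accepts any mapping of header names to values and looks names up case-insensitively, since HTTP header names are case-insensitive.

```python
def routing_record(headers):
    """Extract the Stav routing headers from a mapping of response headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "request_id": lowered.get("x-request-id"),
        "selected_model": lowered.get("x-stav-selected-model"),
        "sovereignty_tier": lowered.get("x-stav-sovereignty-tier"),
        "routing_reason": lowered.get("x-stav-routing-reason"),
    }


# Header values copied from the example above.
example_headers = {
    "X-Stav-Selected-Model": "meta-llama/llama-4-maverick:precision",
    "X-Stav-Sovereignty-Tier": "sovereign",
    "X-Stav-Routing-Reason": "Cost weight (30%) ≈ quality weight (40%). Score: 78.4/100.",
    "X-Request-Id": "req-7a3f9c2e1b8d4a5f",
}

record = routing_record(example_headers)
print(record["selected_model"], record["sovereignty_tier"])
```

If `client` is the OpenAI Python SDK (an assumption; this page only shows the call shape), `client.chat.completions.with_raw_response.create(...)` returns an object whose `.headers` can be passed to this function.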

If you want the structured form for logging, hit the route preview endpoint with the same prompt:

curl https://api.stav.ai/v1/route-preview \
  -H "Authorization: Bearer $STAV_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"..."}]}'

Response (truncated):

{
  "selected_model_slug": "meta-llama/llama-4-maverick:precision",
  "selected_model_name": "Llama 4 Maverick",
  "model_type": "sovereign_oss",
  "sovereignty_badge": "sovereign",
  "provider": "Stav · Green Mountain DC2",
  "reason": "Cost weight (30%) ≈ quality weight (40%). Score: 78.4/100.",
  "estimated_cost_per_1k_tokens_eur": 0.0015,
  "alternatives": [
    {"slug": "qwen/qwen3-32b:fast", "score": 72.1, "sovereignty_badge": "sovereign"},
    {"slug": "anthropic/claude-sonnet-4-5", "score": 68.9, "sovereignty_badge": "routed"}
  ]
}

The preview endpoint is a dry run — it computes the decision without executing the inference. Use it to debug routing config, to log the chosen model for compliance, or to expose the router decision in your own UI.
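
One way to use the preview for compliance logging is to flatten the payload into a short record. The sketch below works on the fields of the truncated response shown earlier; `summarise_preview` and its output keys are hypothetical, and the real payload may carry fields not shown here.

```python
import json

# Sample payload reduced from the truncated response shown above.
preview_json = """
{
  "selected_model_slug": "meta-llama/llama-4-maverick:precision",
  "sovereignty_badge": "sovereign",
  "reason": "Cost weight (30%) ≈ quality weight (40%). Score: 78.4/100.",
  "alternatives": [
    {"slug": "qwen/qwen3-32b:fast", "score": 72.1, "sovereignty_badge": "sovereign"},
    {"slug": "anthropic/claude-sonnet-4-5", "score": 68.9, "sovereignty_badge": "routed"}
  ]
}
"""


def summarise_preview(preview):
    """Reduce a route-preview payload to a flat record for logging."""
    alternatives = preview.get("alternatives", [])
    # Highest-scoring alternative, or None when the list is empty.
    runner_up = max(alternatives, key=lambda alt: alt["score"], default=None)
    return {
        "selected": preview["selected_model_slug"],
        "sovereignty": preview["sovereignty_badge"],
        "reason": preview["reason"],
        "runner_up": runner_up["slug"] if runner_up else None,
        # Alternatives carrying the "routed" badge.
        "routed_alternatives": [
            alt["slug"] for alt in alternatives if alt["sovereignty_badge"] == "routed"
        ],
    }


summary = summarise_preview(json.loads(preview_json))
print(json.dumps(summary, indent=2, ensure_ascii=False))
```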

Tier hints

You can hint a tier preference without locking to a specific model:

client.chat.completions.create(model="auto:fast", ...)        # prefer :fast endpoints
client.chat.completions.create(model="auto:precision", ...)   # prefer :precision endpoints

auto:fast constrains the candidate pool to :fast quantization tiers (FP8, AWQ INT4) and picks the highest-scoring model within it. auto:precision does the same for full-weight endpoints. Use auto:fast for high-volume background workloads and auto:precision for nuanced reasoning. See Tier suffix for the full mapping.
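
The pool restriction can be pictured as a suffix filter over model slugs. This is a sketch under assumptions: it follows the "constrains the candidate pool" reading, it treats slugs without a tier suffix as excluded whenever a hint is given, and the helper names `parse_model_hint` and `candidate_pool` are hypothetical.

```python
def parse_model_hint(model):
    """Split "auto:fast" into ("auto", "fast"); plain "auto" has no tier."""
    base, _, tier = model.partition(":")
    return base, tier or None


def candidate_pool(slugs, model):
    """Keep only slugs whose tier suffix matches the hint, if one is given."""
    base, tier = parse_model_hint(model)
    if base != "auto":
        raise ValueError("this sketch only handles auto and auto:<tier>")
    if tier is None:
        return list(slugs)
    return [slug for slug in slugs if slug.endswith(":" + tier)]


# Slugs taken from the examples on this page.
catalogue = [
    "meta-llama/llama-4-maverick:precision",
    "qwen/qwen3-32b:fast",
    "anthropic/claude-sonnet-4-5",
]

print(candidate_pool(catalogue, "auto:fast"))
print(candidate_pool(catalogue, "auto"))
```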

Hard constraints versus soft preferences

Some routing decisions are hard rules, not preferences. The router will never:

Route a request from a team with sovereignty_required: true to a routed-commercial endpoint, regardless of weights or scores.
Route to an endpoint in a blocked country.
Route to a retired endpoint.
Route to an endpoint whose backing workers are not in ready state.

Soft preferences (cost / latency / quality weights, the +15 sovereignty bonus, endpoint-lifecycle multipliers) only change the score — they don't filter the candidate pool.
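
The split between the two kinds of rule can be sketched as "filter first, then adjust the score". Everything below is illustrative: the `Endpoint` fields, the endpoint names, the order in which the bonus and the lifecycle multiplier are combined, and the assumption that the +15 bonus applies to every sovereign endpoint are not confirmed by this page.

```python
from dataclasses import dataclass

SOVEREIGNTY_BONUS = 15  # the +15 bonus named above


@dataclass
class Endpoint:
    slug: str
    base_score: float  # weighted cost/latency/quality score
    sovereign: bool
    country: str
    retired: bool = False
    workers_ready: bool = True
    lifecycle_multiplier: float = 1.0  # hypothetical value per lifecycle stage


def passes_hard_rules(ep, sovereignty_required, blocked_countries):
    """Hard rules remove an endpoint from the pool outright."""
    if sovereignty_required and not ep.sovereign:
        return False
    if ep.country in blocked_countries:
        return False
    return not ep.retired and ep.workers_ready


def adjusted_score(ep):
    """Soft preferences only move the score (combination order is assumed)."""
    bonus = SOVEREIGNTY_BONUS if ep.sovereign else 0
    return (ep.base_score + bonus) * ep.lifecycle_multiplier


def select(endpoints, sovereignty_required=False, blocked_countries=frozenset()):
    """Return the best surviving endpoint, or None if the pool is empty."""
    pool = [
        ep for ep in endpoints
        if passes_hard_rules(ep, sovereignty_required, blocked_countries)
    ]
    if not pool:
        return None
    return max(pool, key=adjusted_score)


# Made-up endpoints for illustration only.
endpoints = [
    Endpoint("commercial-a", base_score=90.0, sovereign=False, country="US"),
    Endpoint("sovereign-b", base_score=70.0, sovereign=True, country="NO"),
    Endpoint("sovereign-c", base_score=80.0, sovereign=True, country="NO", retired=True),
]

print(select(endpoints).slug)
print(select(endpoints, sovereignty_required=True).slug)
```

In this toy data the bonus narrows the gap between the commercial and sovereign endpoints without removing either, while the retired endpoint and, when `sovereignty_required` is set, the commercial endpoint never reach scoring at all.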

Observability

Every routing decision lands in the Routing Log surface, available in the Admin Portal at /admin/routing-log. The log captures the full candidate set, per-model score components, the selected worker, and the reason. Teams' own decisions are visible in the Customer Portal under Monitor → Routing.

If a decision surprises you, the log is the canonical answer. Open a ticket if the log makes the decision look wrong on inspection — surprises here are usually a misconfigured weight set or a stale model-catalogue benchmark, and we'd rather catch both early.

What's coming next

The current router (v1.0) uses the weighted scoring described above. The v2.0 roadmap adds:

DeBERTa-classified prompt complexity / domain / sensitivity as additional score inputs, so simple prompts route to :fast automatically without needing the team to set extreme weights.
Semantic cache that short-circuits routing for prompts semantically equivalent to a recent answered prompt.
Per-tenant calibration that learns the team's quality threshold from production traffic instead of static benchmark scores.

The customer-facing contract (model="auto", the headers, the preview endpoint) stays stable across versions.

Next steps

Tier suffix — :fast vs :precision and how the router uses tier hints.
Sovereignty — why sovereignty_preference is on by default.