Every time your AI sorts, classifies, routes, or labels — ][whitebox][ runs the same decision through multiple models, measures their agreement, reads the log-probs, and tells your code how sure it actually is. Below threshold, it escalates to a human. Always, it leaves an audit trail.
A single LLM call gives you an answer. It does not tell you whether the model is guessing. ][whitebox][ runs your decision n times across m models, reads logprobs where available, and treats agreement-under-perturbation as the only honest measure of certainty.
When the wave is steady, ship. When it shakes, escalate. No more "the AI said so."
One client, same shape as your existing LLM SDK. Your prompt, your options, our verdict envelope.
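A sketch of what that one call could look like. Nothing below is a published API: the package name, `classify()`, and every field on the returned envelope are illustrative stand-ins for whatever the SDK actually exposes.

```python
# Hypothetical usage sketch: the import, Client(), classify(), and the
# envelope fields are assumed names, not a documented API.
from whitebox import Client

wb = Client(api_key="...")

verdict = wb.classify(
    prompt="Triage this support ticket: {ticket}",
    options=["billing", "bug", "feature_request", "abuse"],
    runs=5,                                     # parallel runs to agree on
    models=["gpt-4o-mini", "claude-3-haiku"],   # your providers, your keys
    threshold=0.85,                             # below this, escalate
)

print(verdict.answer)       # the consensus label
print(verdict.confidence)   # measured agreement, not self-reported
print(verdict.escalated)    # True if it landed in the human queue
```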
We dispatch the same query in parallel — temperature jitter, model rotation, prompt re-orderings. Cheap models on the hot path, expensive ones only when consensus is shaky.
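The mechanics of that fan-out can be shown in plain Python; the stubbed `call_model`, the jitter range, and the model names here are placeholders, not the real dispatcher.

```python
import asyncio
import random

# Stand-in for a real provider call; swap in your SDK of choice.
async def call_model(model: str, prompt: str, options: list[str], temperature: float) -> str:
    await asyncio.sleep(0.01)          # simulate network latency
    return random.choice(options)      # placeholder answer

async def dispatch(prompt: str, options: list[str], models: list[str], runs: int = 5) -> list[str]:
    """Fan one decision out across perturbed runs: jittered temperature,
    rotated models, and shuffled option order so position bias can't hide."""
    tasks = []
    for i in range(runs):
        model = models[i % len(models)]                     # model rotation
        temperature = round(random.uniform(0.2, 0.9), 2)    # temperature jitter
        shuffled = random.sample(options, k=len(options))   # prompt re-ordering
        tasks.append(call_model(model, prompt, shuffled, temperature))
    return await asyncio.gather(*tasks)

answers = asyncio.run(dispatch("Classify: 'refund not received'",
                               ["billing", "bug", "abuse"],
                               models=["cheap-model-a", "cheap-model-b"]))
print(answers)   # five labels to measure agreement over
```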
For OpenAI-compatible endpoints we extract token-level logprobs and weight each run by its own self-reported certainty.
Agreement × log-prob mass = a single confidence figure — and a full distribution your code can branch on.
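The scoring function isn't spelled out beyond that sentence; one plausible reading, with invented numbers, is to weight each run's vote by exp(logprob) and normalize.

```python
import math
from collections import defaultdict

# Each run reports the answer it chose and the total logprob of the tokens
# that expressed it. The numbers are invented for illustration.
runs = [
    {"answer": "billing", "logprob": -0.11},
    {"answer": "billing", "logprob": -0.35},
    {"answer": "billing", "logprob": -0.60},
    {"answer": "bug",     "logprob": -1.90},
    {"answer": "billing", "logprob": -0.25},
]

# Weight each vote by its own self-reported certainty: exp(logprob).
mass = defaultdict(float)
for r in runs:
    mass[r["answer"]] += math.exp(r["logprob"])

total = sum(mass.values())
distribution = {answer: m / total for answer, m in mass.items()}
answer, confidence = max(distribution.items(), key=lambda kv: kv[1])

print(distribution)         # roughly {'billing': 0.95, 'bug': 0.05}
print(answer, confidence)   # ship if confidence clears your bar, else escalate
```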
Above your bar: ship. Below: route to a human queue with the full trace, the disagreement, and the choices laid out.
Every decision is recorded forever: runs, prompts, model versions, latencies, costs, the human's pick. Replayable from any commit.
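As a sketch, a record could carry fields like these; the names mirror the items listed above but are assumptions, not a documented schema.

```python
from dataclasses import dataclass

# Field names are illustrative, mirroring the items listed above;
# this is not a published schema.
@dataclass
class DecisionRecord:
    decision_id: str
    prompt: str
    options: list[str]
    runs: list[dict]        # per run: model, model_version, answer, logprob, latency_ms, cost_usd
    distribution: dict      # answer -> confidence mass
    answer: str             # the answer of record (consensus or human pick)
    confidence: float
    escalated: bool
    human_pick: str | None  # set when a reviewer decided
    created_at: str         # ISO timestamp; pin it for replay
```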
Set a threshold. Anything that lands below it is routed — with the full disagreement, the candidate options, and the model traces — to a queue your team already lives in. Slack, Linear, email, or our review UI. The verdict closes the loop: the human's pick is the answer of record, and your model fleet learns from the divergence.
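Wiring a route could reduce to a threshold plus a destination. The keys and the queue URI below are invented for illustration, not a documented config format.

```python
# Hypothetical route configuration; every key and value here is illustrative.
route = {
    "name": "support-ticket-triage",
    "threshold": 0.85,                   # below this, a human decides
    "queue": "slack://#triage-review",   # or Linear, email, the review UI
    "include": ["distribution", "candidate_options", "model_traces"],
    "close_loop": True,                  # the human's pick becomes the answer of record
}
```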
][whitebox][ is a thin layer over the providers you already pay for. No proxy, no model lock-in. Works on any LLM call where the result space is finite — classifications, routing, extraction, moderation, judging.
We don't train anything. Your stack stays your stack — we orchestrate it.
Decisions live in your warehouse. We hold a hash and a pointer.
Pin model versions. Re-run any historical decision against today's fleet to detect drift.
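A minimal sketch of what that drift check amounts to, assuming a `rerun` callable that re-dispatches a pinned decision against today's fleet; both the helper and the 0.15 delta are invented.

```python
from typing import Callable

def detect_drift(record: dict,
                 rerun: Callable[[str, list[str]], dict],
                 confidence_delta: float = 0.15) -> bool:
    """Re-run a stored decision and flag it when the answer flips or the
    confidence moves by more than the allowed delta."""
    fresh = rerun(record["prompt"], record["options"])        # today's fleet
    changed_answer = fresh["answer"] != record["answer"]
    moved = abs(fresh["confidence"] - record["confidence"]) > confidence_delta
    return changed_answer or moved
```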
Cheap models first. Expensive models only on disagreement. Budget caps per route.
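One way such a policy could be written down; the model names, the bar, and the cap are placeholders.

```python
# Illustrative tiering policy: cheap models vote first, the expensive
# tiebreaker is consulted only when they disagree, and spend is capped.
policy = {
    "first_pass": ["gpt-4o-mini", "claude-3-haiku"],   # cheap, in parallel
    "tiebreaker": "gpt-4o",                            # only on disagreement
    "escalate_below": 0.80,                            # agreement bar for calling the tiebreaker
    "budget_per_route_usd": 50.00,                     # hard cap per route
}
```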
Pipe ground-truth labels back in; ][whitebox][ tracks per-model accuracy by category, over time.
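The bookkeeping behind that scorecard is simple enough to sketch in a few lines; the labels below are invented examples.

```python
from collections import defaultdict

# (model, category) -> [correct, total]; data is invented for illustration.
scorecard: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])

def record_label(model: str, category: str, predicted: str, truth: str) -> None:
    correct, total = scorecard[(model, category)]
    scorecard[(model, category)] = [correct + (predicted == truth), total + 1]

record_label("cheap-model-a", "billing", "billing", "billing")
record_label("cheap-model-a", "billing", "bug", "billing")

for (model, category), (correct, total) in scorecard.items():
    print(model, category, f"{correct}/{total} = {correct / total:.0%}")
```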
Confidence is measured, not asked. Models cannot self-report their way past the threshold.
Free for the first 100k decisions. No credit card. The audit trail starts the moment you install.