observability for ai decisions · v0.4 preview

Don't ship a black box.
Ship an open box.

Every time your AI sorts, classifies, routes, or labels — ][whitebox][ runs the same decision through multiple models, measures their agreement, reads the log-probs, and tells your code how sure it actually is. Below threshold, it escalates to a human. Always, it leaves an audit trail.

drop-in
replace one openai call
model-agnostic
openai · anthropic · google · oss
audit
every decision, every run, forever
whitebox · live decision feed · prod · us-east-1
the method

Confidence is a behavior, not a number.

A single LLM call gives you an answer. It does not tell you whether the model is guessing. ][whitebox][ runs your decision n times across m models, reads logprobs where available, and treats agreement-under-perturbation as the only honest measure of certainty.

When the wave is steady, ship. When it shakes, escalate. No more "the AI said so."

01

Wrap the call

One client, same shape as your existing LLM SDK. Your prompt, your options, our verdict envelope.
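The verdict envelope isn't specified on this page, so here is a minimal sketch of what such an envelope might carry. Every field name is an assumption for illustration, not the real SDK schema:

```python
from dataclasses import dataclass

# Illustrative verdict envelope -- field names are assumptions,
# not the actual ][whitebox][ SDK schema.
@dataclass
class Verdict:
    answer: str          # the consensus answer
    confidence: float    # agreement x log-prob mass, in [0, 1]
    distribution: dict   # candidate -> share of weighted votes
    escalated: bool      # True when confidence fell below threshold
    decision_id: str     # handle for replay and audit

v = Verdict(
    answer="household_cleaning",
    confidence=0.87,
    distribution={"household_cleaning": 0.87, "kitchen": 0.13},
    escalated=False,
    decision_id="9f3a-21d8",
)
assert not v.escalated and v.confidence > 0.8
```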

02

Fan out · n runs · m models

We dispatch the same query in parallel — temperature jitter, model rotation, prompt re-orderings. Cheap models on the hot path, expensive ones only when consensus is shaky.
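The fan-out step can be sketched with a thread pool and a stubbed provider. `fake_model` stands in for a live endpoint; the jitter range and worker count are illustrative:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fake_model(model: str, prompt: str, temperature: float) -> str:
    # Stand-in for a provider call; a real fan-out hits live endpoints.
    return random.choice(["household_cleaning", "kitchen"])

def fan_out(prompt: str, models: list[str], n: int,
            base_temp: float = 0.3) -> list[str]:
    # n runs per model, each with a small temperature jitter.
    jobs = [(m, base_temp + random.uniform(-0.1, 0.1))
            for m in models for _ in range(n)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fake_model, m, prompt, t) for m, t in jobs]
        return [f.result() for f in futures]

runs = fan_out("Categorize: bamboo dish brush", ["cheap-a", "cheap-b"], n=3)
assert len(runs) == 6
```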

03

Read the log-probs

For OpenAI-compatible endpoints we extract token-level logprobs and weight each run by its own self-reported certainty.
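One way to turn token-level logprobs into a per-run weight (a sketch; the actual weighting is not specified here) is the geometric-mean token probability, i.e. exp of the mean log-prob:

```python
import math

def run_certainty(token_logprobs: list[float]) -> float:
    # Self-reported certainty of one run: geometric-mean token
    # probability, exp(mean log-prob). 1.0 means fully confident.
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

confident = run_certainty([-0.01, -0.02, -0.05])  # tokens near p = 1
hedging = run_certainty([-1.2, -0.9, -2.1])       # tokens far from p = 1
assert confident > 0.9 > hedging
```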

04

Score the wave

Agreement × log-prob mass = a single confidence figure — and a full distribution your code can branch on.
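Agreement times log-prob mass can be read as certainty-weighted voting: each run contributes its own certainty to its answer, and the winner's confidence is its share of the total mass. A sketch, with the combination rule assumed:

```python
from collections import defaultdict

def score_wave(runs: list[tuple[str, float]]) -> tuple[str, float, dict]:
    # Each run is (answer, certainty). The winner's confidence is its
    # share of certainty-weighted votes: agreement x log-prob mass.
    mass = defaultdict(float)
    for answer, certainty in runs:
        mass[answer] += certainty
    total = sum(mass.values()) or 1.0
    dist = {a: m / total for a, m in mass.items()}
    winner = max(dist, key=dist.get)
    return winner, dist[winner], dist

answer, conf, dist = score_wave([
    ("household_cleaning", 0.95),
    ("household_cleaning", 0.90),
    ("kitchen", 0.40),
])
# household_cleaning holds 1.85 of 2.25 total mass
assert answer == "household_cleaning"
```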

05

Threshold + escalate

Above your bar: ship. Below: route to a human queue with the full trace, the disagreement, and the choices laid out.
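The threshold branch is the whole integration surface. A sketch, with the return shape assumed and the queue name borrowed from the review example on this page:

```python
def decide(answer: str, confidence: float, threshold: float = 0.8) -> dict:
    # Above the bar: ship. Below: hand off with the candidate attached
    # so the reviewer sees what the models leaned toward.
    if confidence >= threshold:
        return {"status": "shipped", "answer": answer}
    return {"status": "escalated", "answer": None,
            "queue": "ops-cat-04", "candidate": answer}

assert decide("kitchen", 0.91)["status"] == "shipped"
assert decide("kitchen", 0.55)["status"] == "escalated"
```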

06

Audit, always

Every decision — runs, prompts, model versions, latencies, costs, the human's pick — recorded forever. Replayable from any commit.
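One way to make each record tamper-evident as well as replayable is to hash the full payload. A sketch with an assumed schema:

```python
import hashlib
import json
import time

def audit_record(decision_id: str, runs: list, verdict: dict) -> dict:
    # Append-only audit entry: everything needed to replay the decision,
    # plus a content hash so tampering is detectable. Schema illustrative.
    body = {
        "decision_id": decision_id,
        "ts": time.time(),
        "runs": runs,        # prompts, model versions, outputs
        "verdict": verdict,
    }
    payload = json.dumps(body, sort_keys=True, default=str)
    body["sha256"] = hashlib.sha256(payload.encode()).hexdigest()
    return body

rec = audit_record("9f3a-21d8", [("gpt-x", "kitchen")], {"answer": "kitchen"})
assert len(rec["sha256"]) == 64
```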

playground

Run a decision. Watch it doubt itself.

scenario
ambiguous product-support ticket moderation
whitebox sandbox · no auth · runs are simulated client-side
provider
any
median latency
1.2s
cost / decision
$0.0061
audit retention
forever
human-in-the-loop

When the wave shakes, a human picks up.

Set a threshold. Anything that lands below it is routed — with the full disagreement, the candidate options, and the model traces — to a queue your team already lives in. Slack, Linear, email, or our review UI. The verdict closes the loop: the human's pick is the answer of record, and your model fleet learns from the divergence.
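Closing the loop can be sketched like this; the payload and field names are illustrative, not the real review API:

```python
def close_loop(decision: dict, human_pick: str, reviewer: str) -> dict:
    # The reviewer's choice becomes the answer of record; the losing
    # candidates are kept so the fleet can learn from the divergence.
    return {
        **decision,
        "answer_of_record": human_pick,
        "resolved_by": reviewer,
        "divergence": {c: p for c, p in decision["candidates"].items()
                       if c != human_pick},
    }

escalated = {
    "decision_id": "9f3a-21d8",
    "candidates": {"personal_care": 0.40, "household_cleaning": 0.32,
                   "kitchen": 0.20, "grocery": 0.08},
}
resolved = close_loop(escalated, "household_cleaning", "m.alvarez")
assert resolved["answer_of_record"] == "household_cleaning"
```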

median time-to-review
1m 48s
agreement w/ human
93.4%
reviews / decision
0.07
cost saved vs. 100% human
$0.41 ea.
decision · 9f3a-21d8 · escalated 00:42 ago
human review
Categorize: "Exfoliating dish brush, bamboo handle, plant-based bristles, 250g"
personal_care · 40%
household_cleaning · 32%
kitchen · 20%
grocery · 8%
reviewer
m.alvarez
queue
ops-cat-04
sla
04:18
the sdk

One call. Same shape. More truth.

][whitebox][ is a thin layer over the providers you already pay for. No proxy, no model lock-in. Works on any LLM call where the result space is finite — classifications, routing, extraction, moderation, judging.

  • Drop-in for OpenAI, Anthropic, Google, vLLM, Bedrock, Together
  • Logprob ingestion where the API exposes them
  • Replay from a decision id — same models, same seeds, same answer
  • Runs in your VPC if compliance demands
  • OpenTelemetry, Datadog, Sentry — every decision a span
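The pieces above can be sketched end to end. Everything here is a stand-in: `provider_call` is a stub and the field names are illustrative, showing only the shape of a one-call consensus wrapper:

```python
import random

def provider_call(model: str, prompt: str, seed: int) -> str:
    # Stub for the SDK you already use; a real integration passes the
    # same prompt and options straight through.
    random.seed(seed)
    return random.choice(["approve", "reject", "approve"])

def classify(prompt: str, models=("cheap-a", "cheap-b"),
             n: int = 3, threshold: float = 0.8) -> dict:
    # Fan out n runs per model, vote, and flag for escalation.
    expanded = [m for m in models for _ in range(n)]
    runs = [provider_call(m, prompt, seed=i) for i, m in enumerate(expanded)]
    votes = {a: runs.count(a) / len(runs) for a in set(runs)}
    answer = max(votes, key=votes.get)
    return {"answer": answer, "confidence": votes[answer],
            "escalate": votes[answer] < threshold}

verdict = classify("Is this ticket about billing?")
assert set(verdict) == {"answer", "confidence", "escalate"}
```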
01
No new model

We don't train anything. Your stack stays your stack — we orchestrate it.

02
Zero retention by default

Decisions live in your warehouse. We hold a hash and a pointer.

03
Replayable forever

Pin model versions. Re-run any historical decision against today's fleet to detect drift.
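With models and seeds pinned in the record, a drift check reduces to comparing the recorded answer with a fresh run against today's fleet. The record fields here are illustrative:

```python
def detect_drift(decision_record: dict, todays_answer: str) -> bool:
    # The record holds the answer produced under pinned model versions
    # and seeds; drift is any disagreement with a fresh re-run.
    return decision_record["answer"] != todays_answer

record = {
    "decision_id": "9f3a-21d8",
    "answer": "household_cleaning",
    "models": ["gpt-x@2024-06", "claude-y@2024-05"],  # illustrative pins
    "seed": 7,
}
assert detect_drift(record, "kitchen") is True
assert detect_drift(record, "household_cleaning") is False
```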

04
Cost governor

Cheap models first. Expensive models only on disagreement. Budget caps per route.
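A cheap-first cascade might look like this sketch, where `cheap` and `expensive` are stand-ins for real model calls and the agreement bar is assumed:

```python
def cascade(prompt: str, cheap, expensive, n: int = 3, bar: float = 0.8):
    # Run the cheap model n times; escalate to the expensive model only
    # when the cheap runs disagree past the bar.
    runs = [cheap(prompt) for _ in range(n)]
    top = max(set(runs), key=runs.count)
    agreement = runs.count(top) / n
    if agreement >= bar:
        return top, agreement, "cheap-only"
    return expensive(prompt), agreement, "escalated-to-expensive"

# Unanimous cheap runs never touch the expensive model:
answer, agreement, path = cascade("route this ticket",
                                  cheap=lambda p: "billing",
                                  expensive=lambda p: "billing")
assert path == "cheap-only" and agreement == 1.0
```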

05
Built for eval

Pipe ground-truth labels back in; ][whitebox][ tracks per-model accuracy by category, over time.
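A per-model, per-category accuracy tracker is straightforward to sketch; the schema is illustrative, not the real eval pipeline:

```python
from collections import defaultdict

class EvalTracker:
    # Accumulates ground-truth outcomes keyed by (model, category).
    def __init__(self):
        self.hits = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, model: str, category: str, predicted: str, truth: str):
        key = (model, category)
        self.total[key] += 1
        self.hits[key] += int(predicted == truth)

    def accuracy(self, model: str, category: str) -> float:
        key = (model, category)
        return self.hits[key] / self.total[key] if self.total[key] else 0.0

t = EvalTracker()
t.record("gpt-x", "household", "kitchen", "household_cleaning")           # miss
t.record("gpt-x", "household", "household_cleaning", "household_cleaning")  # hit
assert t.accuracy("gpt-x", "household") == 0.5
```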

06
Honest by construction

Confidence is measured, not asked. Models cannot self-report their way past the threshold.

get a key

Open the box.

Free for the first 100k decisions. No credit card. The audit trail starts the moment you install.

npm i @whitebox/sdk
pip install whitebox
read docs