Every time your AI sorts, classifies, routes, or labels — ][whitebox][ runs the same decision through multiple models, measures their agreement, reads the log-probs, and tells your code how sure it actually is. Below threshold, it escalates to a human. Always, it leaves an audit trail.
A single LLM call gives you an answer. It does not tell you whether the model is guessing. ][whitebox][ runs your decision n times across m models, reads logprobs where available, and treats agreement-under-perturbation as the only honest measure of certainty.
When the wave is steady, ship. When it shakes, escalate. No more "the AI said so."
One client, same shape as your existing LLM SDK. Your prompt, your options, our verdict envelope.
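A sketch of what that one call could look like. Nothing below is a published API: the package name, `classify()`, and every field on the returned envelope are illustrative stand-ins for whatever the SDK actually exposes.

```python
# Hypothetical usage sketch: the import, Client(), classify(), and the
# envelope fields are assumed names, not a documented API.
from whitebox import Client

wb = Client(api_key="...")

verdict = wb.classify(
    prompt="Triage this support ticket: {ticket}",
    options=["billing", "bug", "feature_request", "abuse"],
    runs=5,                                     # parallel runs to agree on
    models=["gpt-4o-mini", "claude-3-haiku"],   # your providers, your keys
    threshold=0.85,                             # below this, escalate
)

print(verdict.answer)       # the consensus label
print(verdict.confidence)   # measured agreement, not self-reported
print(verdict.escalated)    # True if it landed in the human queue
```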
We dispatch the same query in parallel — temperature jitter, model rotation, prompt re-orderings. Cheap models on the hot path, expensive ones only when consensus is shaky.
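The mechanics of that fan-out can be shown in plain Python; the stubbed `call_model`, the jitter range, and the model names here are placeholders, not the real dispatcher.

```python
import asyncio
import random

# Stand-in for a real provider call; swap in your SDK of choice.
async def call_model(model: str, prompt: str, options: list[str], temperature: float) -> str:
    await asyncio.sleep(0.01)          # simulate network latency
    return random.choice(options)      # placeholder answer

async def dispatch(prompt: str, options: list[str], models: list[str], runs: int = 5) -> list[str]:
    """Fan one decision out across perturbed runs: jittered temperature,
    rotated models, and shuffled option order so position bias can't hide."""
    tasks = []
    for i in range(runs):
        model = models[i % len(models)]                     # model rotation
        temperature = round(random.uniform(0.2, 0.9), 2)    # temperature jitter
        shuffled = random.sample(options, k=len(options))   # prompt re-ordering
        tasks.append(call_model(model, prompt, shuffled, temperature))
    return await asyncio.gather(*tasks)

answers = asyncio.run(dispatch("Classify: 'refund not received'",
                               ["billing", "bug", "abuse"],
                               models=["cheap-model-a", "cheap-model-b"]))
print(answers)   # five labels to measure agreement over
```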
For OpenAI-compatible endpoints we extract token-level logprobs and weight each run by its own self-reported certainty.
Agreement × log-prob mass = a single confidence figure — and a full distribution your code can branch on.
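The scoring function isn't spelled out beyond that sentence; one plausible reading, with invented numbers, is to weight each run's vote by exp(logprob) and normalize.

```python
import math
from collections import defaultdict

# Each run reports the answer it chose and the total logprob of the tokens
# that expressed it. The numbers are invented for illustration.
runs = [
    {"answer": "billing", "logprob": -0.11},
    {"answer": "billing", "logprob": -0.35},
    {"answer": "billing", "logprob": -0.60},
    {"answer": "bug",     "logprob": -1.90},
    {"answer": "billing", "logprob": -0.25},
]

# Weight each vote by its own self-reported certainty: exp(logprob).
mass = defaultdict(float)
for r in runs:
    mass[r["answer"]] += math.exp(r["logprob"])

total = sum(mass.values())
distribution = {answer: m / total for answer, m in mass.items()}
answer, confidence = max(distribution.items(), key=lambda kv: kv[1])

print(distribution)         # roughly {'billing': 0.95, 'bug': 0.05}
print(answer, confidence)   # ship if confidence clears your bar, else escalate
```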
Above your bar: ship. Below: route to a human queue with the full trace, the disagreement, and the choices laid out.
Every decision is recorded forever: runs, prompts, model versions, latencies, costs, the human's pick. Replayable from any commit.
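As a sketch, a record could carry fields like these; the names mirror the items listed above but are assumptions, not a documented schema.

```python
from dataclasses import dataclass

# Field names are illustrative, mirroring the items listed above;
# this is not a published schema.
@dataclass
class DecisionRecord:
    decision_id: str
    prompt: str
    options: list[str]
    runs: list[dict]        # per run: model, model_version, answer, logprob, latency_ms, cost_usd
    distribution: dict      # answer -> confidence mass
    answer: str             # the answer of record (consensus or human pick)
    confidence: float
    escalated: bool
    human_pick: str | None  # set when a reviewer decided
    created_at: str         # ISO timestamp; pin it for replay
```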
Set a threshold. Anything that lands below it is routed — with the full disagreement, the candidate options, and the model traces — to a queue your team already lives in. Slack, Linear, email, or our review UI. The verdict closes the loop: the human's pick is the answer of record, and your model fleet learns from the divergence.
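Wiring a route could reduce to a threshold plus a destination. The keys and the queue URI below are invented for illustration, not a documented config format.

```python
# Hypothetical route configuration; every key and value here is illustrative.
route = {
    "name": "support-ticket-triage",
    "threshold": 0.85,                   # below this, a human decides
    "queue": "slack://#triage-review",   # or Linear, email, the review UI
    "include": ["distribution", "candidate_options", "model_traces"],
    "close_loop": True,                  # the human's pick becomes the answer of record
}
```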
][whitebox][ is a thin layer over the providers you already pay for. No proxy, no model lock-in. Works on any LLM call where the result space is finite — classifications, routing, extraction, moderation, judging.
We don't train anything. Your stack stays your stack — we orchestrate it.
Decisions live in your warehouse. We hold a hash and a pointer.
Pin model versions. Re-run any historical decision against today's fleet to detect drift.
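A minimal sketch of what that drift check amounts to, assuming a `rerun` callable that re-dispatches a pinned decision against today's fleet; both the helper and the 0.15 delta are invented.

```python
from typing import Callable

def detect_drift(record: dict,
                 rerun: Callable[[str, list[str]], dict],
                 confidence_delta: float = 0.15) -> bool:
    """Re-run a stored decision and flag it when the answer flips or the
    confidence moves by more than the allowed delta."""
    fresh = rerun(record["prompt"], record["options"])        # today's fleet
    changed_answer = fresh["answer"] != record["answer"]
    moved = abs(fresh["confidence"] - record["confidence"]) > confidence_delta
    return changed_answer or moved
```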
Cheap models first. Expensive models only on disagreement. Budget caps per route.
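One way such a policy could be written down; the model names, the bar, and the cap are placeholders.

```python
# Illustrative tiering policy: cheap models vote first, the expensive
# tiebreaker is consulted only when they disagree, and spend is capped.
policy = {
    "first_pass": ["gpt-4o-mini", "claude-3-haiku"],   # cheap, in parallel
    "tiebreaker": "gpt-4o",                            # only on disagreement
    "escalate_below": 0.80,                            # agreement bar for calling the tiebreaker
    "budget_per_route_usd": 50.00,                     # hard cap per route
}
```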
Pipe ground-truth labels back in; ][whitebox][ tracks per-model accuracy by category, over time.
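The bookkeeping behind that scorecard is simple enough to sketch in a few lines; the labels below are invented examples.

```python
from collections import defaultdict

# (model, category) -> [correct, total]; data is invented for illustration.
scorecard: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])

def record_label(model: str, category: str, predicted: str, truth: str) -> None:
    correct, total = scorecard[(model, category)]
    scorecard[(model, category)] = [correct + (predicted == truth), total + 1]

record_label("cheap-model-a", "billing", "billing", "billing")
record_label("cheap-model-a", "billing", "bug", "billing")

for (model, category), (correct, total) in scorecard.items():
    print(model, category, f"{correct}/{total} = {correct / total:.0%}")
```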
Confidence is measured, not asked. Models cannot self-report their way past the threshold.
Free for the first 100k decisions. No credit card. The audit trail starts the moment you install.