Uncertainty, Sensitivity & Calibration — Probability Lab

Monte Carlo propagation

The forecast as a distribution

Each tribunal judge rules with an 80% interval, not just a point estimate. The uncertainty band on the final number is built by taking those intervals seriously: in roughly 800–1,500 Monte Carlo draws, every scenario's probabilities are re-sampled from distributions fitted to the judges' stated intervals, and the entire construction (pooling, correction layers, prior blend) is recomputed from scratch each time.

The result is a genuine distribution over what the defended forecast could have been, given everything the judges admitted they didn't know. The headline band is the p10–p90 range of those draws; the result view shows the full histogram. If the coherence probe found a material framing gap in the outside view, the band is widened further.

Because every draw recomputes the corrections too, the distribution can come out skewed or fat-tailed wherever the construction itself produces that shape. The band is a measurement, not a convention.
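The re-draw loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Lab's actual engine: the fitted distribution is assumed to be a normal in logit space whose 10th and 90th percentiles match the judge's 80% interval, and `construct` is a hypothetical stand-in (a plain weighted mean) for the real pooling, correction and prior-blend steps. The names `sample_probability`, `construct` and `monte_carlo_band` are invented for this example.

```python
import math
import random
import statistics

# z-score of the 90th percentile of a standard normal (about 1.2816)
Z90 = statistics.NormalDist().inv_cdf(0.90)


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_probability(low, high, rng):
    """Draw one probability from a logit-normal whose p10/p90 are low/high.

    Assumption: the 80% interval is read as the 10th-90th percentile range.
    """
    mu = (logit(low) + logit(high)) / 2.0
    sigma = (logit(high) - logit(low)) / (2.0 * Z90)
    return expit(rng.gauss(mu, sigma))


def construct(probs, weights):
    """Hypothetical stand-in for the full construction: a weighted mean."""
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)


def monte_carlo_band(intervals, weights, draws=1000, seed=0):
    """Re-sample every scenario, recompute the forecast, return (p10, p90, all draws)."""
    rng = random.Random(seed)
    results = []
    for _ in range(draws):
        probs = [sample_probability(lo, hi, rng) for lo, hi in intervals]
        results.append(construct(probs, weights))
    deciles = statistics.quantiles(results, n=10)  # 9 cut points: p10 ... p90
    return deciles[0], deciles[8], results


# Three scenarios, each with a judge's 80% interval and an illustrative weight.
p10, p90, all_draws = monte_carlo_band(
    intervals=[(0.20, 0.40), (0.50, 0.80), (0.05, 0.15)],
    weights=[0.5, 0.3, 0.2],
)
print(f"band: {p10:.3f} to {p90:.3f} over {len(all_draws)} draws")
```

Because the whole forecast is rebuilt inside the loop, any asymmetry in the construction shows up directly in `all_draws`; nothing forces the band to be symmetric around the point forecast.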

Leave-one-out sensitivity

How hard does the forecast lean on each world?

A forecast carried by one scenario is a different object from one supported by seven, even at the same headline number. To measure this, the engine removes each scenario in turn and recomputes the entire construction without it. The signed difference between the full forecast and the forecast without that scenario is that world's leverage: how many points of the final forecast its presence contributes.

Large bars mean the forecast leans hard on one path — a fragility worth knowing before acting. The result view ranks the top movers; the same fact feeds the “concentrated vs. broad-based” verdict on the hero.
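A small sketch makes the leverage calculation concrete. It assumes each scenario carries a weight and a probability, and it uses a renormalised weighted mean as a hypothetical stand-in for the full construction; the real engine recomputes more than this. The names `construct` and `leverages` are invented for the example.

```python
def construct(scenarios):
    """Hypothetical stand-in for the full construction: a weighted mean.

    scenarios maps name -> (weight, probability). Weights are renormalised,
    so removing a scenario redistributes its weight across the rest.
    """
    total_weight = sum(w for w, _ in scenarios.values())
    return sum(w * p for w, p in scenarios.values()) / total_weight


def leverages(scenarios):
    """Signed leverage per scenario, in points (forecast x 100).

    Needs at least two scenarios, since one is removed on each pass.
    Result is ordered by absolute size, largest mover first.
    """
    full = construct(scenarios)
    result = {}
    for name in scenarios:
        rest = {k: v for k, v in scenarios.items() if k != name}
        result[name] = 100.0 * (full - construct(rest))
    return dict(sorted(result.items(), key=lambda kv: abs(kv[1]), reverse=True))


scenarios = {
    "A": (0.5, 0.90),
    "B": (0.3, 0.20),
    "C": (0.2, 0.40),
}
for name, points in leverages(scenarios).items():
    print(f"{name}: {points:+.1f} points")
```

With these illustrative numbers the full forecast is 0.59 and scenario A alone accounts for about +31 points, while B pulls the forecast down by roughly 17. That is the kind of concentration the ranking is meant to expose.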

Honesty mechanisms

Trust states & the calibration record

Two final disciplines close the loop. First, run integrity: a pipeline of model calls can degrade — a stage can fail to parse, a tribunal can fall back to a placeholder, synthesis can need a fallback. Every degradation is recorded as an integrity note and rolls up into a visible trust state. A degraded run never looks like a clean one.

The trust badge sits beside the headline number, and every integrity note names the stage that degraded and what was substituted.
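One way to picture the roll-up is as a list of notes reduced to a single state. The sketch below is purely illustrative: the text only says that each note names the degraded stage and what was substituted, so the `severe` flag, the reduction rule, and the reading of "low" as the worst state are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegrityNote:
    stage: str            # which pipeline stage degraded
    substituted: str      # what was used in its place
    severe: bool = False  # hypothetical flag, not described in the text


def trust_state(notes):
    """Roll integrity notes up into one state (assumed rule, see lead-in)."""
    if not notes:
        return "full"
    if any(note.severe for note in notes):
        return "low"
    return "degraded"


notes = [
    IntegrityNote(stage="tribunal", substituted="placeholder ruling"),
    IntegrityNote(stage="synthesis", substituted="fallback summary", severe=True),
]
print(trust_state(notes))
for note in notes:
    print(f"- {note.stage}: used {note.substituted}")
```

Whatever the actual rule is, the point carried by the structure is that an empty list is the only way to reach the clean state.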

Second, keeping score. Forecasts are only meaningful if they are eventually graded. Your run history lets you mark each question YES or NO as reality resolves it, and the Lab maintains a running Brier score: the mean squared distance between your forecasts and what actually happened, where 0 is perfect. A Brier of 0.25 is what answering 50% on every question earns; a score meaningfully below that indicates forecasting skill. The score is shown with its sample size so it can't flatter early.
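The arithmetic behind the score is short enough to show in full. This is a minimal sketch of the standard binary Brier score, with YES coded as 1 and NO as 0; the function name `brier_score` and the input shape are assumptions for the example, not the Lab's interface.

```python
def brier_score(resolved):
    """Mean squared error of resolved forecasts.

    resolved is a list of (forecast_probability, outcome) pairs, where
    outcome is True for YES and False for NO. Returns (score, sample_size)
    so the score is never read without its sample size.
    """
    if not resolved:
        raise ValueError("no resolved forecasts to score")
    squared_errors = [(p - (1.0 if yes else 0.0)) ** 2 for p, yes in resolved]
    return sum(squared_errors) / len(squared_errors), len(squared_errors)


score, n = brier_score([(0.8, True), (0.3, False), (0.6, False)])
print(f"Brier {score:.3f} over {n} resolved forecasts")  # Brier 0.163 over 3 resolved forecasts

# Answering 50% every time scores exactly 0.25, whatever the outcomes.
baseline, _ = brier_score([(0.5, True), (0.5, False), (0.5, False)])
print(baseline)  # 0.25
```

In the first example the squared errors are 0.04, 0.09 and 0.36, which average to about 0.163. Three resolved questions is far too few to call that skill, which is why the sample size travels with the score.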

History also supports run comparison: re-run a question weeks later and see the forecast's movement decomposed — did the prior shift, did new scenarios appear, did the corrections change? A forecast that moves for legible reasons is the final proof the method is working.

Monte Carlo band

The p10–p90 range of full-construction re-draws from the judges' stated intervals.

Leverage

The signed points a scenario contributes, measured by removing it and recomputing everything.

Trust state

Full / degraded / low — the run's honesty about its own execution quality.

Brier score

Mean squared error of resolved forecasts; 0.25 is what a constant 50% forecast earns, and lower is better.