The forecast as a distribution
Each tribunal judge rules with an 80% interval, not just a point estimate. The uncertainty band on the final number is built by taking those intervals seriously: in roughly 800–1,500 Monte Carlo draws, every scenario's probability is re-sampled from a distribution fitted to the judge's stated interval, and the entire construction (pooling, correction layers, prior blend) is recomputed from scratch each time.
The result is a genuine distribution over what the defended forecast could have been, given everything the judges admitted they didn't know. The headline band is the p10–p90 range of those draws; the result view shows the full histogram. If the coherence probe found a material framing gap in the outside view, the band is widened further.
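As a rough illustration of the re-draw idea, here is a minimal Python sketch. It is not the Lab's actual engine: the logit-normal fit, the function names, and the weighted average standing in for the full construction are all assumptions made for the example.

```python
import math
import random
from statistics import NormalDist, quantiles

# z-score of the 90th percentile; an 80% interval spans -Z90..+Z90 sigmas.
Z90 = NormalDist().inv_cdf(0.90)

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1 / (1 + math.exp(-x))

def sample_from_interval(lo, hi, rng):
    """Draw a probability from a logit-normal whose p10/p90 match (lo, hi).

    Assumption: the fitted distribution is logit-normal; lo and hi must lie
    strictly between 0 and 1.
    """
    mu = (logit(lo) + logit(hi)) / 2
    sigma = (logit(hi) - logit(lo)) / (2 * Z90)
    return expit(rng.gauss(mu, sigma))

def recompute_forecast(probs, weights):
    """Hypothetical stand-in for pooling, corrections, and prior blend."""
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)

def uncertainty_band(intervals, weights, n_draws=1000, seed=0):
    """Return (p10, p90, draws) over full recomputations of the forecast."""
    rng = random.Random(seed)
    draws = [
        recompute_forecast(
            [sample_from_interval(lo, hi, rng) for lo, hi in intervals],
            weights,
        )
        for _ in range(n_draws)
    ]
    deciles = quantiles(draws, n=10)  # nine cut points: p10 ... p90
    return deciles[0], deciles[-1], draws

if __name__ == "__main__":
    p10, p90, _ = uncertainty_band(
        [(0.2, 0.4), (0.5, 0.7), (0.1, 0.3)], weights=[1, 1, 1]
    )
    print(f"band: {p10:.3f} to {p90:.3f}")
```

The point of the sketch is that the whole forecast is rebuilt inside the loop, so the band reflects how judge-level uncertainty propagates through every later step rather than being attached to the final number afterwards.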
How hard does the forecast lean on each world?
A forecast carried by one scenario is a different object from one supported by seven, even at the same headline number. To measure this, the engine removes each scenario in turn and recomputes the entire construction without it. The signed difference between the full forecast and the forecast without that scenario is that world's leverage: how many points of the final forecast its presence contributes.
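The leave-one-out measurement can be sketched in a few lines of Python. The scenario fields (`name`, `p`, `weight`) and the weighted average used as the construction are hypothetical placeholders, not the engine's real data model.

```python
def pooled_forecast(scenarios):
    """Hypothetical stand-in for the full construction: a weighted average."""
    total = sum(s["weight"] for s in scenarios)
    return sum(s["p"] * s["weight"] for s in scenarios) / total

def leverage(scenarios):
    """Signed percentage points each scenario contributes to the forecast.

    Needs at least two scenarios, since each is removed in turn.
    """
    full = pooled_forecast(scenarios)
    result = {}
    for i, s in enumerate(scenarios):
        rest = scenarios[:i] + scenarios[i + 1:]
        # Positive: the scenario pulls the forecast up; negative: down.
        result[s["name"]] = 100 * (full - pooled_forecast(rest))
    return result

if __name__ == "__main__":
    worlds = [
        {"name": "A", "p": 0.8, "weight": 1},
        {"name": "B", "p": 0.4, "weight": 1},
        {"name": "C", "p": 0.3, "weight": 1},
    ]
    print({k: round(v, 1) for k, v in leverage(worlds).items()})
    # {'A': 15.0, 'B': -5.0, 'C': -10.0}
```

In this toy case the headline forecast is 50%, and scenario A alone accounts for 15 points of it: exactly the kind of dependence the leverage figure is meant to expose.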
Trust states & the calibration record
Two final disciplines close the loop. First, run integrity: a pipeline of model calls can degrade. A stage can fail to parse, a tribunal can fall back to a placeholder, or synthesis can require a fallback of its own. Every degradation is recorded as an integrity note and rolls up into a visible trust state, so a degraded run never looks like a clean one.
Second, keeping score. Forecasts are only meaningful if they are eventually graded. Your run history lets you mark each question YES or NO as reality resolves it, and the Lab maintains a running Brier score: the mean squared difference between each forecast probability and the outcome, scored as 1 for YES and 0 for NO. A Brier of 0.25 is what always answering 50% earns; a score meaningfully below that indicates forecasting skill. The score is shown with its sample size so it can't flatter early.
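For reference, the Brier score itself is simple to compute. The sketch below assumes forecasts are probabilities of YES and outcomes are coded 1 for YES and 0 for NO; the function name and the choice to return the sample size alongside the score are illustrative.

```python
def brier_score(forecasts, outcomes):
    """Return (mean squared error, sample size) for resolved forecasts.

    forecasts: probabilities of YES in [0, 1]
    outcomes:  1 for YES, 0 for NO, in the same order
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty forecasts and outcomes")
    n = len(forecasts)
    score = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    return score, n

if __name__ == "__main__":
    # Always answering 50% earns 0.25 regardless of what happens.
    print(brier_score([0.5, 0.5], [1, 0]))  # (0.25, 2)
    # Confident and mostly right forecasts score well below 0.25.
    score, n = brier_score([0.9, 0.2, 0.7], [1, 0, 1])
    print(f"Brier {score:.3f} over {n} resolved questions")
```

Reporting the sample size with the score matters because a low Brier over three questions says far less than the same score over three hundred.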
History also supports run comparison: re-run a question weeks later and see the forecast's movement decomposed — did the prior shift, did new scenarios appear, did the corrections change? A forecast that moves for legible reasons is the final proof the method is working.
Uncertainty band: The p10–p90 range of full-construction re-draws from the judges' stated intervals.
Leverage: The signed points a scenario contributes, measured by removing it and recomputing everything.
Trust state: Full / degraded / low. The run's honesty about its own execution quality.
Brier score: Mean squared error of resolved forecasts; 0.25 is chance, lower is skill.