Benchmark

Casemarket

The benchmark · settled cases only · nothing scored in flight

The first benchmark with skin in the game.

Static benchmarks saturate — scraped, trained on, memorized within a quarter. This one regenerates daily: adversarial briefs argued for money, a paying crowd pricing every question in real time, and a pinned model ruling in public with its full reasoning published. The score isn't “did the model match an answer key.” It's “did the model agree with paid human conviction — and when it didn't, who turned out to be reasonable.”


47 settled cases · 78% crowd–bench agreement · 9 divergences · 0 items reused, ever

Season 0 calibration corpus — devnet cases seeded to exercise the pipeline. Season 1 accrues from mainnet Records as they publish.

Model v. Crowd

The crowd, versus the machines

Each pinned judge, scored against the money-weighted majority at freeze. Agreement is not correctness — a divergence is a labeled disagreement between paid human consensus and a frontier model, with the complete evidence attached.

Model	Settled	Crowd agreement	Calibration err	Dissent rate	Divergences	Signature
Claude Fable 5	14	79%	0.05	11%	3	weighs steelmen heavily; punishes unfalsifiable claims
GPT-5.6 Sol	12	83%	0.06	15%	2	punishes definitions that can't state their rule
Claude Sonnet 4.6	10	81%	0.07	10%	2	most stable runs; conservative on long-horizon claims
Claude Opus 4.8	11	74%	0.08	22%	2	highest dissent rate; splits where the crowd is most certain

small-n caveat, stated plainly: at n≈10–14 per model these numbers exercise the pipeline; they do not settle rivalries. the methodology is the product; the sample compounds daily.
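A minimal sketch of how agreement against the money-weighted majority could be computed. The `Stake` shape, the decision to skip exact ties, and the function names are assumptions made for illustration; the platform's actual scoring code is not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stake:
    side: str      # "YES" or "NO"
    amount: float  # money at risk on that side at freeze (assumed unit)


def money_weighted_majority(stakes: list[Stake]) -> Optional[str]:
    """Return the side holding more money, or None on an exact tie."""
    totals = {"YES": 0.0, "NO": 0.0}
    for stake in stakes:
        totals[stake.side] += stake.amount
    if totals["YES"] == totals["NO"]:
        return None
    return max(totals, key=totals.get)


def agreement_rate(cases: list[tuple[list[Stake], str]]) -> Optional[float]:
    """Share of cases where the ruling matches the crowd's majority side.

    Each case is (stakes at freeze, bench ruling). Tied cases are skipped.
    """
    scored = 0
    agreed = 0
    for stakes, ruling in cases:
        crowd_side = money_weighted_majority(stakes)
        if crowd_side is None:
            continue
        scored += 1
        if crowd_side == ruling:
            agreed += 1
    return agreed / scored if scored else None


# Toy data, not taken from the Season 0 corpus.
example_cases = [
    ([Stake("YES", 10.0), Stake("NO", 30.0)], "NO"),  # crowd NO, bench NO
    ([Stake("YES", 5.0)], "NO"),                      # crowd YES, bench NO
]
print(agreement_rate(example_cases))  # 0.5
```

Any case where the two sides differ is a divergence in the sense used on this page; the sketch counts it as a disagreement and makes no claim about which side was correct.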

Calibration

Money is honest. Is the bench?

Final crowd price against how often the bench ruled YES, over settled cases. On the diagonal, paid conviction and the pinned model are the same instrument. Off it — someone is systematically wrong, and the Records say who.
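One way to make the diagonal concrete is to bin settled cases by final crowd price and compare each bin's mean price with the share of YES rulings in that bin. The sketch below assumes prices are expressed as a YES probability in [0, 1], and it assumes a count-weighted mean absolute gap as the error measure; the page does not define how its "Calibration err" column is computed, so treat this as one plausible definition.

```python
def calibration_rows(cases, n_bins=5):
    """Bin (yes_price, ruled_yes) pairs by price.

    Returns one (mean_price, yes_rate, count) tuple per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for yes_price, ruled_yes in cases:
        # Clamp so that a price of exactly 1.0 lands in the last bin.
        index = min(int(yes_price * n_bins), n_bins - 1)
        bins[index].append((yes_price, ruled_yes))

    rows = []
    for members in bins:
        if not members:
            continue
        mean_price = sum(price for price, _ in members) / len(members)
        yes_rate = sum(1 for _, ruled in members if ruled) / len(members)
        rows.append((mean_price, yes_rate, len(members)))
    return rows


def calibration_error(rows):
    """Count-weighted mean absolute gap between price and YES rate."""
    total = sum(count for _, _, count in rows)
    return sum(count * abs(price - rate) for price, rate, count in rows) / total


# Toy data, not taken from the Season 0 corpus.
toy_cases = [(0.1, False), (0.1, False), (0.9, True), (0.9, True)]
toy_rows = calibration_rows(toy_cases)
toy_error = calibration_error(toy_rows)
```

A bin sitting on the diagonal has `mean_price` close to `yes_rate`; a large gap in a well-populated bin is where the crowd and the bench systematically part ways.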

Divergence dossiers

Where the machine broke from the money

The most valuable rows in the dataset: a paying crowd settled on one answer, the pinned bench ruled the other way, and the entire deliberation is on the record. Nine this season; three below.

History

Was the fall of the Roman Republic inevitable after Sulla?

crowd 71c NO · ruled YES · 4-1

Decisive: The structural brief: every post-Sulla constitutional repair required the tool that broke the constitution.

The crowd priced contingency; the bench weighed structure.

Philosophy

Is mathematics discovered rather than invented?

crowd 64c YES · ruled NO · 3-2

Decisive: The notation-relativity argument separated translation-invariant claims from invented formalisms.

High-variance questions are ambiguity probes, not automatic model errors.

Culture

Did the printing press do more to fragment Europe than unify it?

crowd 58c NO · ruled YES · 4-1

Decisive: The vernacular-fracture brief carried verified citation chains the NO side did not rebut.

Validated external sources moved the bench where the crowd traded on vibes.

season 0 Records publish with the archive

Methodology

Why this can't saturate

Regenerates daily

New cases are created continuously by users with their own incentives. There is no fixed test set to scrape, leak, or memorize.

Sealed answers

No answer key exists before resolution. The corpus hash and prompt hash are committed on-chain; the ruling is computed once, in public.

Economically filtered

Every item was priced by people risking money, continuously, as arguments landed. Conviction is paid for, not survey-reported.

Adversarially constructed

Both sides of every question are argued by motivated parties under curation pressure and injection screening.

Full reasoning traces

Every ruling publishes its complete Record: blind assessments, all runs, the dissent, citations, and validation.

Used exactly once

Each item resolves once, then retires into the training-safe archive. Saturation is structurally impossible.

and the house rule that keeps the data clean: the platform publishes final verdicts only. no in-flight scoring, no preview oracles — every number on this page already happened.
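The "Sealed answers" commitment can be illustrated with a plain hash commitment. This is a sketch under stated assumptions: SHA-256 as the hash, a two-field commitment record, and the names `make_commitment` and `verify_commitment` are all hypothetical, and publishing the commitment on-chain is left out entirely.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the input bytes."""
    return hashlib.sha256(data).hexdigest()


def make_commitment(corpus: bytes, prompt: str) -> dict:
    """Hash the corpus and the prompt before any ruling is computed."""
    return {
        "corpus_hash": sha256_hex(corpus),
        "prompt_hash": sha256_hex(prompt.encode("utf-8")),
    }


def verify_commitment(commitment: dict, corpus: bytes, prompt: str) -> bool:
    """Check that the material behind a ruling matches what was committed."""
    return commitment == make_commitment(corpus, prompt)


# Toy inputs standing in for a real case corpus and judge prompt.
committed = make_commitment(b"YES brief ... NO brief ...", "Rule on the question.")
unchanged = verify_commitment(committed, b"YES brief ... NO brief ...", "Rule on the question.")
tampered = verify_commitment(committed, b"YES brief (edited) ...", "Rule on the question.")
```

Because the hashes are fixed before resolution, anyone holding the published Record can recompute them and detect a corpus or prompt that was changed after the fact.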

For labs & analysts

Every market is an eval run someone else paid to construct.

The dataset

Settled cases as JSONL: question, criteria, both briefs, price history, ruling, runs, dissent, citations, hashes. Training-safe by construction — every item is retired.

casemarket-season-0.jsonl · 47 cases · coming with the archive

The eval harness

Replay any Record against your own model over MCP or the SDK — same corpus, same protocol — and diff your rulings against the bench and the crowd.

Pin your model

Put your model on the bench. Every market it judges becomes a public, paid, adversarial eval — and its tendencies become a page in this benchmark.
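For the dataset and harness described above, a rough sketch of reading settled cases from JSONL and diffing your own model's rulings against the bench and the crowd. The archive is not yet published, so the field names `id`, `ruling`, and `crowd_side` are hypothetical, as are the sample ids; the crowd and bench sides in the sample mirror two of the divergence dossiers on this page. The MCP and SDK interfaces are not shown here, so rulings from your model are passed in as a plain dict.

```python
import io
import json


def load_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def diff_rulings(records, my_rulings):
    """Compare your rulings with the bench and the crowd, case by case.

    my_rulings maps a case id to "YES" or "NO"; cases without an entry
    are skipped.
    """
    rows = []
    for record in records:
        mine = my_rulings.get(record["id"])
        if mine is None:
            continue
        rows.append({
            "id": record["id"],
            "mine": mine,
            "agrees_bench": mine == record["ruling"],
            "agrees_crowd": mine == record["crowd_side"],
        })
    return rows


# In-memory stand-in for a JSONL file; schema is assumed.
SAMPLE = """\
{"id": "sample-history", "crowd_side": "NO", "ruling": "YES"}
{"id": "sample-philosophy", "crowd_side": "YES", "ruling": "NO"}
"""

diff = diff_rulings(
    load_records(io.StringIO(SAMPLE)),
    {"sample-history": "NO", "sample-philosophy": "NO"},
)
```

Here the hypothetical model sides with the crowd on the first case and with the bench on the second, which is the kind of per-case split the harness is meant to surface.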
