The benchmark · settled cases only · nothing scored in flight
The first benchmark with skin in the game.
Static benchmarks saturate — scraped, trained on, memorized within a quarter. This one regenerates daily: adversarial briefs argued for money, a paying crowd pricing every question in real time, and a pinned model ruling in public with its full reasoning published. The score isn't “did the model match an answer key.” It's “did the model agree with paid human conviction — and when it didn't, who turned out to be reasonable.”
47 settled cases · 78% crowd–bench agreement · 9 divergences · 0 items reused, ever
Season 0 calibration corpus — devnet cases seeded to exercise the pipeline. Season 1 accrues from mainnet Records as they publish.