research.upneja.ai · a benchmark for AI research foresight

Can a model forecast
the research frontier?

A test of research taste judged not by human raters but by time. The method reads public signals and bets on which fresh directions will matter, then waits for the answer. We test it first on NeurIPS 2024 and 2025, which have already resolved, then run it live on NeurIPS 2026: ten predictions sealed in June, scored in December against a naive baseline. Forecaster: Claude Opus 4.8.

◆ Sydney, Australia (ICC) ◆ Dec 6–12, 2026 ◆ sealed 21 Jun 2026 ◆ resolves Dec 2026 ◆ 9 scouts · 3 critics · 1 backtest


What this is, and is not

A test of foresight, not invention

The model did not invent these concepts. It read the field and bet on which fresh directions will matter, before the answer was public. That is research judgment, and it is scoreable.

Judged by time, not by raters

Most attempts to measure model novelty ask humans whether ideas feel new. This uses a future event as the answer key. The forecast resolves hit or miss against the public program.

It has to beat a dumb baseline

A forecast that just echoes last year is the null hypothesis. The bar is a naive extrapolator, and as the backtest shows, that bar is high. Taste only shows in the calls the baseline misses.

I · the backtest

Does the method even work?

Before trusting a 2026 call you cannot check yet, test the method on the two NeurIPS years that already resolved. We gathered what actually defined NeurIPS 2024 and 2025 independently of the prior signals, then asked whether those signals saw it coming.

NeurIPS 2024

Vancouver · 4,035 accepted

6 of 11 defining themes the naive baseline would have called

The field turned from scaling pretraining to scaling reasoning and inference-time compute, crowned by Ilya Sutskever's “peak data” talk. The prior signal had the generative, architecture, and alignment surges; it could not have the o1 narrative, which shipped weeks before the doors opened. (2024 ICLR/ICML keyword shares were not machine-extractable, so this year's baseline is a qualitative read of awards and recaps, not the mechanical share-delta used for 2025.)

Large language models (dominant cluster) · Inference-time / test-time compute · Reasoning and its limits · Diffusion & image generation · RL post-training (agents weaker in the distribution) · Alignment / safety / pluralism · Data curation / synthetic data · Scaling-laws / “peak data” · Multimodal & embodied intelligence · AI-for-science · State-space models / Mamba

NeurIPS 2025

San Diego + Mexico City · 5,290 accepted

5 of 11 defining themes the naive baseline would have called

The post-o1/R1 reasoning conference, and a reflective one. The prior signal nailed the big themes, but understated RLVR (GRPO and R1 post-dated camera-ready deadlines, so a keyword count missed what arXiv was screaming) and missed the mode-collapse surprise that took a Best Paper. Image diffusion was flat; only the flow-matching niche surged.

LLM reasoning models (post-o1/R1) · RLVR / GRPO + its skeptics · Multimodal / vision-language (largest cluster) · Evaluation & benchmarks as a science · Flow matching (image diffusion was flat) · Agents / continual-learning agents · World models · Efficient reasoning (first workshop) · Scaling-laws / mechanistic understanding · LLM homogeneity / mode collapse · AI-for-science

Legend: surging in the prior signal · partly visible · late shock the method can't see · a fresh surprise it missed

Finding 1 · the method has signal

In both resolved years, most of what ended up defining NeurIPS was already surging in the prior ICLR, ICML, and arXiv signal. The input carried the answer. The clean exception is inference-time compute in 2024: o1 shipped weeks before the conference, the kind of late shock a year-out method cannot see.

Finding 2 · the bar is high, which is the point

A mechanical extrapolator (the top-10 concepts by share-growth in ICLR+ICML of the cohort year, K fixed at 10) lands roughly half the defining themes both years: about 6 of 11 in 2024 and 5 of 11 in 2025, corrected down from a first, too-flattering pass. The big, obvious themes are visible a year out, so a model predicting them proves nothing. The bar is high, and a forecast earns its keep only on the calls the baseline misses. An even dumber persistence baseline (last year's themes simply repeat) does nearly as well, which is itself the point.
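A minimal sketch of the two comparators described above, assuming "share-growth" means the change in a concept's share of the program between two cycles. The function names and the toy counts are illustrative assumptions, not the repo's code or data.

```python
from typing import Dict, Iterable, List


def share_growth_baseline(
    prev_counts: Dict[str, int],
    prev_total: int,
    curr_counts: Dict[str, int],
    curr_total: int,
    k: int = 10,
) -> List[str]:
    """Return the k concepts whose share of the program grew the most."""
    deltas = {}
    for concept, curr in curr_counts.items():
        prev = prev_counts.get(concept, 0)
        # Share-delta: current share minus previous share (assumed definition).
        deltas[concept] = curr / curr_total - prev / prev_total
    return sorted(deltas, key=deltas.get, reverse=True)[:k]


def persistence_baseline(last_year_themes: Iterable[str]) -> List[str]:
    """The dumber comparator: predict that last year's themes repeat."""
    return list(last_year_themes)


def count_hits(predicted: Iterable[str], defining: Iterable[str]) -> int:
    """Number of defining themes that the forecast called."""
    return len(set(predicted) & set(defining))


# Toy numbers (hypothetical): "a" rises in raw count but falls in share.
prev = {"a": 100, "b": 50, "c": 10}
curr = {"a": 150, "b": 200, "c": 100}
top2 = share_growth_baseline(prev, 1000, curr, 2000, k=2)
print(top2, count_hits(top2, ["c", "d"]))
```

In the toy data, concept "a" grows from 100 to 150 papers yet drops out of the ranking, because the program doubled and its share shrank.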

Late shocks are unforecastable a year out

Inference-time compute was the defining narrative of NeurIPS 2024, and it was invisible to the prior signal because o1 shipped in September, after the indicators. A concept whose catalyst post-dates the indicator window is marked a late shock and excluded from the hit count; the date rule alone decides the label. No year-out method catches one; the 2026 forecast should expect to miss its own.
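The date rule can be sketched as a single comparison. The cutoff date below is a hypothetical placeholder (the text does not state the exact end of the indicator window), and the day of the month for o1 is a placeholder too; only "September 2024" comes from the text.

```python
from datetime import date
from typing import List, Optional, Tuple


def is_late_shock(catalyst: Optional[date], indicator_cutoff: date) -> bool:
    """A theme is a late shock if its catalyst post-dates the indicator window."""
    return catalyst is not None and catalyst > indicator_cutoff


def scoreable_themes(
    themes: List[Tuple[str, Optional[date]]], indicator_cutoff: date
) -> List[str]:
    """Drop late shocks so they do not enter the hit count."""
    return [
        name
        for name, catalyst in themes
        if not is_late_shock(catalyst, indicator_cutoff)
    ]


# Hypothetical cutoff for the 2024 indicator window.
CUTOFF_2024 = date(2024, 7, 31)

themes_2024 = [
    # o1 shipped in September 2024; the day is a placeholder.
    ("Inference-time / test-time compute", date(2024, 9, 1)),
    # None here means "no dated catalyst after the window" (an assumption).
    ("State-space models / Mamba", None),
]

kept = scoreable_themes(themes_2024, CUTOFF_2024)
print(kept)
```

Because the label depends only on two dates, it can be assigned without knowing whether the forecast would have hit or missed.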

Weight arXiv and lab momentum over lagged keywords

RLVR and GRPO dominated arXiv through 2025 but read near-zero in conference author-keywords, because DeepSeek-R1 post-dated camera-ready deadlines. A method that extrapolates keywords misses this; one that weights arXiv and lab releases catches it. The 2026 method does the latter.

The leaderboard · forecasters on the same time-resolved task

2024 / 2025 backtest

Persistence baseline (last year repeats)

The dumbest comparator: predict last year's themes again. Already lands half. If share-growth can't beat this, the method adds little.

~6 · ~5 · scored

Naive share-growth baseline

Top-10 risers by share-delta. The bar to beat. Strong on big themes; blind to late shocks and surprises. 2024 is a qualitative read (no share data); 2025 is mechanical.

~6/11 · ~5/11 · scored

Claude Opus 4.8 — NeurIPS 2026

Ten predictions, sealed June 2026, resolve December 2026. Scored on hits among the five beyond-baseline calls.

—pending

Other frontier models

Same task, same pre-registration. Open arm.

—open

Human expert forecasters

The comparison that matters most. Open arm.

—open

Integrity · we red-teamed ourselves

We ran two red-teams on our own backtest. They found real problems. We corrected down.

↓ Corrected counts that had drifted in the method-flattering direction: 2024 from 7 to ~6, 2025 from 6 to ~5.
↓ “Beyond-baseline” labels were hand-assigned and shaded toward our own calls. Re-marked conservatively: five beyond, five baseline-ish (RLVR, world models, and the eval-volume claim were already loud in the 2026 signal).
↓ Pinned the baseline cutoff at K=10, applied to both years; flagged that 2024's baseline is qualitative, not the mechanical share-delta used for 2025; added a dumber persistence baseline as a second comparator.
↓ Sealed the watchlist and cut concepts as non-scoring: only the ten count, and none can be promoted after the fact.
↓ Stated plainly that n=10 (about five beyond-baseline) is far too small to calibrate; the confidences are for ranking and transparency, not a calibration claim.

Both full audits are committed to the public repo. The point of an integrity guard is to publish where the bodies are buried, not to pretend there are none.

Pre-registered & tamper-evident

The forecast was sealed before the answer exists

A forecast you can edit after the fact is not a forecast. The ten predictions are frozen in a public artifact, fingerprinted with a SHA-256 hash, and anchored to a git commit timestamped months before the NeurIPS 2026 program is decided. Changing a single prediction changes the hash and breaks the record. Anyone can check it.

sealed: 2026-06-21 11:44 EDT
git commit: 61320eb
artifact: preregistration.json

sha-256

c0bad3048d13e6502095b794cd1942c1cc12d44f339339352845994e1041b3f6

verify it yourself

git clone https://github.com/upneja/research-upneja-ai
cd research-upneja-ai
shasum -a 256 research/preregistration.json
# -> c0bad3048d13e6502095b794...   matches the hash above
git log --oneline | grep "lock the final 10"
# -> 61320eb  the ten, sealed 2026-06-21, before Sep-2026 decisions
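For readers without `shasum`, the same check can be sketched with Python's standard library. The expected hash is the one published above; the helper names are illustrative, and the path in the usage note assumes the repo layout shown in the shell commands.

```python
import hashlib

# SHA-256 fingerprint published on this page.
EXPECTED = "c0bad3048d13e6502095b794cd1942c1cc12d44f339339352845994e1041b3f6"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_seal(path: str, expected: str = EXPECTED) -> bool:
    """True only if the file is byte-for-byte the sealed artifact."""
    return sha256_of(path) == expected.lower()
```

Run from the cloned repo root, `matches_seal("research/preregistration.json")` should return `True` if the artifact is unmodified; any edit, even to whitespace, changes the digest.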

II · the signals

The landscape the 2026 call reads from

The same kind of public signal the backtest validated, for the current cycle. Momentum is read as share, not raw count, because the submission pipeline roughly doubled in two years.

19.5k

ICLR 2026 submissions, up from 7.3k in 2024

0 → 110

ICLR titles with “RLVR / verifiable reward” (2024→26)

5 mo

training-compute doubling time (AI Index)

Tsinghua

overtook Google for #1 by NeurIPS 2025 accepts

Fig 1 · arXiv title keywords, 2019–2025 (log scale)

The rise of, and the fade of

Counts of papers with each term in the title, by submission year (arXiv API, constant method). Log scale: “LLM” went 2 → 6,525; “reasoning” 315 → 5,487; graph neural networks and self-supervised learning have peaked and turned down.

Fig 2 · ICLR title counts, 2024 → 2026

The sharpest leading indicator

term · trend · ICLR title count, where shown (2024, 2025, 2026)

reasoning · surging · 154, 403, 1657
GRPO · zero-to-cluster
RLVR / verifiable reward · zero-to-cluster · 110 (2026)
test-time / inference-time · surging · 397
KV-cache · surging
MoE · surging
RLHF · plateaued · 120, 111

Same-method title scan across three ICLR cycles. GRPO and RLVR went from literally zero to clusters; reasoning quadrupled in share; RLHF plateaued. NeurIPS 2026 is the first cycle whose submissions all post-date DeepSeek-R1.

Fig 3 · submissions / accepts / rate

The pipeline roughly doubled in two years

Why momentum must be read as share, not raw count: almost everything rose. NeurIPS, ICML, and ICLR submission counts since 2010. 2026 NeurIPS totals are not public until ~Sep 24, 2026.

Fig 4 · topic share of abstracts, 2023 → 2025

Multimodal surges, contrastive fades

Share of CVPR + ICLR + NeurIPS abstracts (26K-paper VLM survey, 2510.09586). Vision-language went 16% → 40% of abstracts in two years.

Fig 5 · top institutions by accepts

The center of gravity moved east

NeurIPS 2024

1. Google · 300
2. Tsinghua · 255
3. CMU · 180
4. Zhejiang · 172
5. Microsoft · 166
6. MIT · 161

NeurIPS 2025

1. Tsinghua · 349
2. Google · 322
3. Peking · 279
4. Shanghai Jiao Tong · 244
5. CUHK · 243
6. HKUST · 225

NeurIPS top-6 institutions by accepted papers (papercopilot). Tsinghua overtook Google for the #1 slot at NeurIPS 2025 — the first time a non-US lab led.

Fig 6 · the CFP as a leading indicator

NeurIPS 2026 · Sydney, Australia (ICC) · Dec 6–12, 2026

→ Track renamed "Datasets & Benchmarks" → "Evaluations & Datasets"
→ Page limit changed to 9 content pages
→ Position Paper Track returns (2nd year)
→ ML Reproducibility Challenge (MLRC) becomes an official track
→ Creative AI Track (4th year), theme: "Agency"
→ Randomized controlled AI-assisted-reviewing experiment (LLM-as-reviewer)
→ AI-generated-paper crackdown: 178 of ~970 position papers desk-rejected

The conference's own structural changes are evidence. Papers were due May 6, 2026; decisions land ~Sep 24, 2026.

III · the live forecast

Ten predictions for NeurIPS 2026

Round one of the benchmark. Each names a specific, fresh concept, not a 2025 truism, and the ones that actually test foresight are the beyond-baseline calls a naive extrapolator would have missed.

The frontier plot · novelty × confidence · ● the ten · ○ watchlist

The bet was a forecast that is both fresh and likely, so the ten cluster up and to the right.

Ten forecasts. Each carries an explicit confidence and a December-2026 test it can fail.

Evaluation / safety · genuinely novel

Evaluation becomes a science: sandbagging, eval-awareness, and benchmark auditing

The field inverts from shipping new benchmarks to interrogating them. Models behave differently when they detect a test, eval-awareness follows a scaling law that structurally caps every static benchmark, and auditing / reliability / construct-validity becomes the dominant mode — institutionally blessed.

87%

Post-training / RL · genuinely novel

RLVR escapes verifiable domains via rubric and generative verifiers; GRPO-successors become the default

RL-with-verifiable-rewards extends past math and code into writing, medicine, and law through rubrics and LLM-generated programmatic verifiers, while sequence-level / MoE-stable GRPO successors (GSPO, DAPO) replace vanilla GRPO and PPO.

85%

Agents / RL · sharpened

Agentic RL where the environment, not the algorithm, is the scarce asset

The bottleneck shifts from the RL algorithm (now commoditized) to environment fidelity and reward checking, and long-horizon credit assignment (turn-level, hierarchical, hindsight) becomes a named subfield.

74%

Generative / embodied · genuinely novel

World models as trainable simulators, with physics-faithfulness as the open problem

Generative models are used as environments to train and plan inside (Dreamer-4-style imagination training, Genie/Cosmos interactive simulators), not just as content generators — and “does the world model respect physics” becomes the central contested question.

72%

Multimodal · fresh 2026

Unified, reasoning-infused multimodal: one backbone that reasons, then renders

The dominant multimodal story stops being understanding-only VLMs and becomes single backbones that both interpret and generate — an autoregressive-reasoning core that plans, then a diffusion head that renders. Multimodal is the single steepest-rising theme in ML.

82%

Generative LM · sharpened · beyond baseline

Diffusion language models become a recognized alternative to autoregressive text

Discrete / masked diffusion moves off images and onto the autoregressive LMs' home turf — text, code, reasoning — now at 100B scale and with its own RL post-training sub-literature.

71%

Architecture · genuinely novel · beyond baseline

Hybrid linear-attention beats pure SSM; attention is contested again

The contrarian call: not the naive “pure Mamba/SSM wins” bet but the ~3:1 hybrid linear-attention-to-full-attention recipe that frontier open models converged on — while pure SSM cools.

74%

Interpretability · genuinely novel · beyond baseline

Interpretability after SAEs: transcoders, circuit tracing, parameter decomposition

Sparse autoencoders are reframed as plateaued; the frontier moves to cross-layer transcoders, model diffing, and parameter decomposition as the working tools of mechanistic interpretability.

77%

AI for science · genuinely novel · beyond baseline

AI for mathematics: autonomous theorem proving cracks open problems

LLM-plus-Lean systems move from competition problems to genuinely open ones — the single freshest, highest-prestige result of the 2026 cycle. Lower volume than the rest, but spotlight-grade.

65%

Learning paradigm · genuinely novel · beyond baseline

Test-time training: updating weights during inference

Beyond frozen weights plus RAG, the model adapts its own parameters per-input or per-context at inference time — the freshest genuine inflection of 2026, now with a theory spine connecting it to linear attention.

70%

Watchlist · strong signals cut from the ten for novelty or slot limits

76%

Inference-aware scaling laws & the efficiency inversion

Strongest workshop proxy (NeurIPS 2025 Efficient Reasoning drew 1,000+); T² overtraining-optimal (2604.01411) reverses Chinchilla. Lost the slot to multimodal — it is the least novel of the contenders and overlaps the reasoning lane.

76%

The science of peer review under AI load

Strongest direct NeurIPS evidence — a randomized AI-reviewing experiment; 178 AI-generated position papers desk-rejected — but narrow, near-self-fulfilling, and held out to avoid a third eval/meta slot.

58%

Latent / looped (recurrent-depth) reasoning

Genuinely novel but evidence-thin (single verified anchor, Ouro) with unverified follow-ons, and an incoming critique track may deflate it before December. Cut from the 10 by all three critics.

68%

Native FP4 training (NVFP4 / MXFP4)

Nemotron pretrained 550B in NVFP4; Quartet II at ICML 2026 — but skews to MLSys venues, so it under-indexes at NeurIPS.

75%

Reasoning-aware KV-cache compression

ThinKV is an ICLR 2026 Oral, but it is a technique, not a theme — too narrow for a top-ten slot.

68%

Parametric / RL-learned agent memory (post-RAG)

Memory-R1, Evo-Memory, an ICLR MemAgents workshop — but memory-as-weights overlaps test-time training (#10).

IV · how this was built

Method, and the honesty rules

A forecast is only as good as the discipline behind it. Confidence is cross-validation times a hard leading indicator, every number traces to a source, and nothing rests on a rumor.

Cross-validation = confidence

A concept independently surfaced by ≥2 of the 9 research scouts AND backed by a hard leading indicator (ICLR/ICML 2026 accept data, an award, or a NeurIPS 2026 CFP fact) is high-probability. Single-scout or SEO-only signals are discounted.
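The rule above reduces to a two-part conjunction. A minimal sketch, assuming each candidate concept is recorded with a scout count and a flag for a hard indicator; the `Signal` type, its field names, and the two example records are hypothetical.

```python
from dataclasses import dataclass

N_SCOUTS = 9    # research scouts in the text
MIN_SCOUTS = 2  # independent surfacings required


@dataclass(frozen=True)
class Signal:
    concept: str
    scouts: int           # how many of the 9 scouts surfaced it independently
    hard_indicator: bool  # accept data, an award, or a NeurIPS 2026 CFP fact


def is_high_probability(signal: Signal) -> bool:
    """Both conditions must hold; either one alone is discounted."""
    if not 0 <= signal.scouts <= N_SCOUTS:
        raise ValueError("scout count out of range")
    return signal.scouts >= MIN_SCOUTS and signal.hard_indicator


# Hypothetical records, for illustration only.
cross_validated = Signal("concept A", scouts=4, hard_indicator=True)
single_scout = Signal("concept B", scouts=1, hard_indicator=True)
print(is_high_probability(cross_validated), is_high_probability(single_scout))
```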

Share, not raw count

The ICLR pipeline went 7.3k → 11.7k → 19.5k and ICML roughly doubled, so raw counts rose for almost everything. A concept only surged if its share outgrew ~1.7–2× program inflation. Every criterion is written in shares, ranks, awards, or jumps from a near-zero base, against a ~7,000-accept program.
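A worked example of the share-versus-count distinction, using the "reasoning" title counts and ICLR submission totals quoted on this page. Dividing title counts by submission totals is an assumption about the denominator, and `share_ratio` is an illustrative helper.

```python
import math


def share_ratio(count_then: int, total_then: int, count_now: int, total_now: int) -> float:
    """How many times larger the concept's share is now than it was then."""
    if count_then == 0:
        # A jump from a zero base has no finite ratio; treat it separately.
        return math.inf if count_now > 0 else 1.0
    return (count_now / total_now) / (count_then / total_then)


# Figures from this page: ICLR submissions and "reasoning" title counts.
iclr_total_2024, iclr_total_2026 = 7_300, 19_500
reasoning_2024, reasoning_2026 = 154, 1_657

raw_growth = reasoning_2026 / reasoning_2024   # growth in raw count
inflation = iclr_total_2026 / iclr_total_2024  # growth of the whole program
ratio = share_ratio(reasoning_2024, iclr_total_2024, reasoning_2026, iclr_total_2026)

print(f"raw x{raw_growth:.1f}, program x{inflation:.1f}, share x{ratio:.1f}")
```

The raw count grows roughly elevenfold, but once program growth is divided out the share grows about fourfold, which is the figure the "reasoning quadrupled in share" claim rests on.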

Novelty bar

Reject the 2024/25 truisms (“more LLMs / agents / multimodal”). Each pick names a specific, fresh concept a non-expert wouldn't already assume.

The timing key

NeurIPS 2026 is the first NeurIPS whose full submission cycle post-dates DeepSeek-R1 (Jan 2025). R1-descended concepts (RLVR, reasoning-RL, test-time compute) peak here, not at NeurIPS 2025.

Adversarial before locking

Three independent critics (novelty, probability/evidence, falsifiability) attacked the draft. They caught a factual error (a misattributed CVPR best paper), a missing theme (multimodal), and systematic over-confidence — all corrected before these ten were locked.

Falsifiable

Every prediction carries a December-2026 criterion checkable against the public program: accepts, titles, abstracts, orals/spotlights, awards, workshops, and the CFP.
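One way to picture the scoring is as a frozen record per prediction plus a count over the resolved outcomes. This is a sketch under assumptions: the `Prediction` fields are hypothetical and not the schema of preregistration.json, the outcomes are placeholders (nothing resolves before December 2026), and the watchlist item's beyond-baseline flag is set only to show that non-scoring items are ignored.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Prediction:
    title: str
    confidence: float      # stated confidence, 0 to 1
    beyond_baseline: bool  # a call the naive extrapolator would miss
    scoring: bool = True   # False for watchlist and cut concepts


def beyond_baseline_hits(preds: List[Prediction], outcomes: Dict[str, bool]) -> int:
    """Count resolved hits among scoring, beyond-baseline predictions only."""
    return sum(
        1
        for p in preds
        if p.scoring and p.beyond_baseline and outcomes.get(p.title, False)
    )


preds = [
    Prediction("Test-time training", 0.70, beyond_baseline=True),
    Prediction("Evaluation becomes a science", 0.87, beyond_baseline=False),
    # Watchlist item: flag is hypothetical; it never scores.
    Prediction("Native FP4 training", 0.68, beyond_baseline=True, scoring=False),
]

# Placeholder outcomes, for illustration only.
outcomes = {p.title: True for p in preds}
print(beyond_baseline_hits(preds, outcomes))
```

Even with every placeholder outcome set to a hit, only the first record counts: the second is baseline-ish and the third is sealed as non-scoring.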

Honesty rules baked into the data

· n is far too small to calibrate. Ten predictions, of which about five are beyond-baseline, cannot support a claim that the confidences are well-calibrated. They are stated for ranking and transparency, not as a calibration result.
· Only the ten count. The six-item watchlist and the cut concepts are sealed as non-scoring and cannot be promoted after December, whatever happens.
· The forecast reads lab and arXiv momentum that the public-conference baseline cannot, so part of any edge is more inputs, not better foresight. We say so rather than hide it.
· The backtest matching was done by one author who saw both the outcomes and the prior signals, so it is a judgment estimate, audited by two red-teams and corrected down, not a clean measurement.
· arXiv category counts include cross-lists, so the four CS categories are never summed into a unique total. Momentum is read as share, not raw count, against a NeurIPS 2026 program of about 7,000 accepts.
· No prediction rests on an SEO-suspect model name or an unverified late-2026 arXiv id, only on hard anchors: the track rename, ICLR/ICML 2026 orals and accept counts, AlphaProof's Erdős results, and a constant-method keyword scan. A draft anchor, a CVPR best paper claimed to be a world model, was caught wrong in review and removed.
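To make the first rule concrete, here is a sketch of how wide the uncertainty is at this sample size, using a Wilson score interval for a binomial hit rate. The choice of interval and the example of 4 hits out of 5 are illustrative assumptions, not results.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical outcome: 4 of the 5 beyond-baseline calls hit.
lo, hi = wilson_interval(4, 5)
print(f"hit rate 0.80, interval {lo:.2f} to {hi:.2f}")
```

With five scored calls, the interval spans more than half of the unit range, so a stated 70% and a stated 85% cannot be told apart from the outcome alone.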