research.upneja.ai · a benchmark for AI research foresight
Can a model forecast the research frontier?
A test of research taste judged not by human raters but by time. The method reads public signals and bets on which fresh directions will matter, then waits for the answer. We test it first on NeurIPS 2024 and 2025, which have already resolved, then run it live on NeurIPS 2026: ten predictions sealed in June, scored in December against a naive baseline. Forecaster: Claude Opus 4.8.
◆ Sydney, Australia (ICC) ◆ Dec 6–12, 2026 ◆ sealed 21 Jun 2026 ◆ resolves Dec 2026 ◆ 9 scouts · 3 critics · 1 backtest
The model did not invent these concepts. It read the field and bet on which fresh directions will matter, before the answer was public. That is research judgment, and it is scoreable.
Judged by time, not by raters
Most attempts to measure model novelty ask humans whether ideas feel new. This uses a future event as the answer key. The forecast resolves hit or miss against the public program.
It has to beat a dumb baseline
A forecast that just echoes last year is the null hypothesis. The bar is a naive extrapolator, and as the backtest shows, that bar is high. Taste only shows in the calls the baseline misses.
I · the backtest
Does the method even work?
Before trusting a 2026 call you cannot check yet, test the method on the two NeurIPS years that already resolved. We gathered what actually defined NeurIPS 2024 and 2025 independently of the prior signals, then asked whether those signals saw it coming.
NeurIPS 2024
Vancouver · 4,035 accepted
6 of 11 defining themes the naive baseline would have called
The field turned from scaling pretraining to scaling reasoning and inference-time compute, crowned by Ilya Sutskever's “peak data” talk. The prior signal had the generative, architecture, and alignment surges; it could not have the o1 narrative, which shipped weeks before the doors opened. (2024 ICLR/ICML keyword shares were not machine-extractable, so this year's baseline is a qualitative read of awards and recaps, not the mechanical share-delta used for 2025.)
Large language models (dominant cluster) · Inference-time / test-time compute · Reasoning and its limits · Diffusion & image generation · RL post-training (agents weaker in the distribution) · Alignment / safety / pluralism · Data curation / synthetic data · Scaling-laws / “peak data” · Multimodal & embodied intelligence · AI-for-science · State-space models / Mamba
NeurIPS 2025
San Diego + Mexico City · 5,290 accepted
5 of 11 defining themes the naive baseline would have called
The post-o1/R1 reasoning conference, and a reflective one. The prior signal nailed the big themes, but understated RLVR (GRPO and R1 post-dated camera-ready deadlines, so a keyword count missed what arXiv was screaming) and missed the mode-collapse surprise that took a Best Paper. Image diffusion was flat; only the flow-matching niche surged.
LLM reasoning models (post-o1/R1) · RLVR / GRPO + its skeptics · Multimodal / vision-language (largest cluster) · Evaluation & benchmarks as a science · Flow matching (image diffusion was flat) · Agents / continual-learning agents · World models · Efficient reasoning (first workshop) · Scaling-laws / mechanistic understanding · LLM homogeneity / mode collapse · AI-for-science
Legend: surging in the prior signal · partly visible · late shock the method can't see · a fresh surprise it missed
Finding 1 · the method has signal
On both resolved years, most of what ended up defining NeurIPS was already surging in the prior ICLR, ICML, and arXiv signal. The input carried the answer. The clean exception is inference-time compute in 2024: o1 shipped weeks before the conference, the kind of late shock a year-out method cannot see.
Finding 2 · the bar is high, which is the point
A mechanical extrapolator (the top-10 concepts by share-growth in ICLR+ICML of the cohort year, K fixed at 10) lands roughly half the defining themes both years: about 6 of 11 in 2024 and 5 of 11 in 2025, corrected down from a first, too-flattering pass. The big, obvious themes are visible a year out, so a model predicting them proves nothing. The bar is high, and a forecast earns its keep only on the calls the baseline misses. An even dumber persistence baseline (last year's themes simply repeat) does nearly as well, which is itself the point.
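As a rough illustration of the mechanical extrapolator described above, the sketch below ranks terms by share-delta and keeps the top K. It is not the benchmark's own code: the function name `top_k_by_share_delta` is hypothetical, and the inputs are the ICLR title counts from Fig 2 with approximate totals of 11.7k and 19.5k, used only as stand-ins for the ICLR+ICML keyword shares the real baseline reads.

```python
from typing import Dict, List


def top_k_by_share_delta(
    prev_counts: Dict[str, int],
    curr_counts: Dict[str, int],
    prev_total: int,
    curr_total: int,
    k: int = 10,
) -> List[str]:
    """Rank terms by change in share (count / total) and return the top k."""
    terms = set(prev_counts) | set(curr_counts)
    delta = {
        t: curr_counts.get(t, 0) / curr_total - prev_counts.get(t, 0) / prev_total
        for t in terms
    }
    # Largest share gain first; ties broken alphabetically for determinism.
    return sorted(terms, key=lambda t: (-delta[t], t))[:k]


# Illustrative inputs (assumption): ICLR title counts from Fig 2,
# with approximate totals of 11.7k (2025) and 19.5k (2026).
ICLR_2025 = {
    "reasoning": 403, "GRPO": 0, "RLVR / verifiable reward": 0,
    "test-time / inference-time": 85, "KV-cache": 29, "MoE": 51, "RLHF": 120,
}
ICLR_2026 = {
    "reasoning": 1657, "GRPO": 73, "RLVR / verifiable reward": 110,
    "test-time / inference-time": 397, "KV-cache": 60, "MoE": 91, "RLHF": 111,
}

ranked = top_k_by_share_delta(ICLR_2025, ICLR_2026, 11_700, 19_500, k=10)
print(ranked[:3])
```

Ranking by share-delta rather than raw count means a term whose count rises more slowly than the program does can land at the bottom of the list; on these inputs RLHF ranks last.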
Late shocks are unforecastable a year out
Inference-time compute was the defining narrative of NeurIPS 2024, and it was invisible to the prior signal because o1 shipped in September, after the indicator window had closed. A concept whose catalyst post-dates the indicator window is marked a late shock and excluded from the hit count; the exclusion is applied mechanically by that date rule. No year-out method catches one; the 2026 forecast should expect to miss its own.
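The date rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's scoring code: the `Theme` type and `score` function are hypothetical, the indicator cutoff of 31 Jul 2024 is an assumed placeholder, and the predicted set is a toy input. Only the o1 release date (12 Sep 2024) is a known fact.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Theme:
    name: str
    catalyst: Optional[date] = None  # date of the triggering release, if any


def score(
    defining: List[Theme], predicted: Set[str], cutoff: date
) -> Tuple[List[str], List[str]]:
    """Return (hits, late_shocks); late shocks never count as hits or misses."""
    late = [t.name for t in defining if t.catalyst is not None and t.catalyst > cutoff]
    scorable = [t.name for t in defining if t.name not in late]
    hits = [name for name in scorable if name in predicted]
    return hits, late


# Assumed cutoff for the prior-signal window (placeholder, not from the source).
CUTOFF = date(2024, 7, 31)

defining_2024 = [
    Theme("Inference-time / test-time compute", catalyst=date(2024, 9, 12)),  # o1
    Theme("Diffusion & image generation"),
    Theme("State-space models / Mamba"),
]
# Toy baseline output (assumption), for illustration only.
baseline_calls = {"Diffusion & image generation", "Alignment / safety / pluralism"}

hits, late_shocks = score(defining_2024, baseline_calls, CUTOFF)
print(hits, late_shocks)
```

Because the exclusion depends only on comparing two dates, it cannot be bent after the fact to rescue a miss.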
Weight arXiv and lab momentum over lagged keywords
RLVR and GRPO dominated arXiv through 2025 but read near-zero in conference author-keywords, because DeepSeek-R1 post-dated camera-ready deadlines. A method that extrapolates keywords misses this; one that weights arXiv and lab releases catches it. The 2026 method does the latter.
The leaderboard · forecasters on the same time-resolved task
2024 / 2025 backtest
Persistence baseline (last year repeats)
The dumbest comparator: predict last year's themes again. Already lands half. If share-growth can't beat this, the method adds little.
~6 · ~5 · scored
Naive share-growth baseline
Top-10 risers by share-delta. The bar to beat. Strong on big themes; blind to late shocks and surprises. 2024 is a qualitative read (no share data); 2025 is mechanical.
~6/11 · ~5/11 · scored
Claude Opus 4.8 — NeurIPS 2026
Ten predictions, sealed June 2026, resolve December 2026. Scored on hits among the beyond-baseline calls, of which there are five.
—pending
Other frontier models
Same task, same pre-registration. Open arm.
—open
Human expert forecasters
The comparison that matters most. Open arm.
—open
Integrity · we red-teamed ourselves
We ran two red-teams on our own backtest. They found real problems. We corrected down.
↓ Counts corrected downward, against the method-flattering direction we had drifted in: 2024 from 7 to ~6, 2025 from 6 to ~5.
↓ “Beyond-baseline” labels were hand-assigned and shaded toward our own calls. Re-marked conservatively: five beyond, five baseline-ish (RLVR, world models, and the eval-volume claim were already loud in the 2026 signal).
↓ Pinned the baseline cutoff at K=10, applied to both years; flagged that 2024's baseline is qualitative, not the mechanical share-delta used for 2025; added a dumber persistence baseline as a second comparator.
↓ Sealed the watchlist and cut concepts as non-scoring: only the ten count, and none can be promoted after the fact.
↓ Stated plainly that n=10 (about five beyond-baseline) is far too small to calibrate; the confidences are for ranking and transparency, not a calibration claim.
Both full audits are committed to the public repo. The point of an integrity guard is to publish where the bodies are buried, not to pretend there are none.
Pre-registered & tamper-evident
The forecast was sealed before the answer exists
A forecast you can edit after the fact is not a forecast. The ten predictions are frozen in a public artifact, fingerprinted with a SHA-256 hash, and anchored to a git commit timestamped months before the NeurIPS 2026 program is decided. Changing a single prediction changes the hash and breaks the record. Anyone can check it.
git clone https://github.com/upneja/research-upneja-ai
cd research-upneja-ai
shasum -a 256 research/preregistration.json
# -> c0bad3048d13e6502095b794... matches the hash above
git log --oneline | grep "lock the final 10"
# -> 61320eb the ten, sealed 2026-06-21, before Sep-2026 decisions
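For readers without `shasum`, the same fingerprint check can be sketched in Python with the standard `hashlib` module. The helper names (`sha256_hex`, `file_sha256`, `matches_published`) are hypothetical, and the two payloads are toy stand-ins, not the real `preregistration.json`; the published hash is truncated in the text above, so the comparison is written as a prefix match.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of an in-memory byte string."""
    return hashlib.sha256(data).hexdigest()


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(digest: str, published_prefix: str) -> bool:
    """True if the digest starts with the (possibly truncated) published hash."""
    return digest.lower().startswith(published_prefix.lower())


# Toy payloads (assumption): editing one prediction changes the whole digest.
sealed = b'{"predictions": ["call-01", "call-02"]}'
tampered = b'{"predictions": ["call-01", "call-03"]}'
print(sha256_hex(sealed) == sha256_hex(tampered))  # False
```

After cloning the repo, `matches_published(file_sha256("research/preregistration.json"), "c0bad3048d13e6502095b794")` would perform the same check as the `shasum` line.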
II · the signals
The landscape the 2026 call reads from
The same kind of public signal the backtest validated, for the current cycle. Momentum is read as share, not raw count, because the submission pipeline roughly doubled in two years.
19.5k
ICLR 2026 submissions, up from 7.3k in 2024
0 → 110
ICLR titles with “RLVR / verifiable reward” (2024→26)
5 mo
training-compute doubling time (AI Index)
Tsinghua
overtook Google for #1 by NeurIPS 2025 accepts
Fig 1 · arXiv title keywords, 2019–2025 (log scale)
The rise of, and the fade of
Counts of papers with each term in the title, by submission year (arXiv API, constant method). Log scale: “LLM” went 2 → 6,525; “reasoning” 315 → 5,487; graph neural networks and self-supervised learning have peaked and turned down.
Fig 2 · ICLR title counts, 2024 → 2026
The sharpest leading indicator
Term (ICLR title count)      Trend             2024   2025   2026
reasoning                    surging            154    403   1657
GRPO                         zero-to-cluster      0      0     73
RLVR / verifiable reward     zero-to-cluster      0      0    110
test-time / inference-time   surging             47     85    397
KV-cache                     surging              1     29     60
MoE                          surging             13     51     91
RLHF                         plateaued           36    120    111
Same-method title scan across three ICLR cycles. GRPO and RLVR went from literally zero to clusters; reasoning quadrupled in share; RLHF plateaued. NeurIPS 2026 is the first cycle whose submissions all post-date DeepSeek-R1.
Fig 3 · submissions / accepts / rate
The pipeline roughly doubled in two years
Why momentum must be read as share, not raw count: almost everything rose. NeurIPS, ICML, and ICLR submission counts since 2010. 2026 NeurIPS totals are not public until ~Sep 24, 2026.
Fig 4 · topic share of abstracts, 2023 → 2025
Multimodal surges, contrastive fades
Share of CVPR + ICLR + NeurIPS abstracts (26K-paper VLM survey, 2510.09586). Vision-language went 16% → 40% of abstracts in two years.
Fig 5 · top institutions by accepts
The center of gravity moved east
NeurIPS 2024
1 · Google · 300
2 · Tsinghua · 255
3 · CMU · 180
4 · Zhejiang · 172
5 · Microsoft · 166
6 · MIT · 161
NeurIPS 2025
1 · Tsinghua · 349
2 · Google · 322
3 · Peking · 279
4 · Shanghai Jiao Tong · 244
5 · CUHK · 243
6 · HKUST · 225
NeurIPS top-6 institutions by accepted papers (papercopilot). Tsinghua overtook Google for the #1 slot at NeurIPS 2025 — the first time a non-US lab led.
Fig 6 · the CFP as a leading indicator
NeurIPS 2026 · Sydney, Australia (ICC) · Dec 6–12, 2026
→ AI-generated-paper crackdown: 178 of ~970 position papers desk-rejected
The conference's own structural changes are evidence. Papers were due May 6, 2026; decisions land ~Sep 24, 2026.
III · the live forecast
Ten predictions for NeurIPS 2026
Round one of the benchmark. Each names a specific, fresh concept, not a 2025 truism, and the ones that actually test foresight are the beyond-baseline calls a naive extrapolator would have missed.
The frontier plot · novelty × confidence · ● the ten ○ watchlist
The bet was a forecast that is both fresh and likely, so the ten cluster up and to the right.
Ten forecasts. Each carries an explicit confidence and a December-2026 test it can fail.
01
Evaluation / safety · genuinely novel
Evaluation becomes a science: sandbagging, eval-awareness, and benchmark auditing
The field inverts from shipping new benchmarks to interrogating them. Models behave differently when they detect a test, eval-awareness follows a scaling law that structurally caps every static benchmark, and auditing / reliability / construct-validity becomes the dominant mode — institutionally blessed.
87%
02
Post-training / RL · genuinely novel
RLVR escapes verifiable domains via rubric and generative verifiers; GRPO-successors become the default
RL-with-verifiable-rewards extends past math and code into writing, medicine, and law through rubrics and LLM-generated programmatic verifiers, while sequence-level / MoE-stable GRPO successors (GSPO, DAPO) replace vanilla GRPO and PPO.
85%
03
Agents / RL · sharpened
Agentic RL where the environment, not the algorithm, is the scarce asset
The bottleneck shifts from the RL algorithm (now commoditized) to environment fidelity and reward checking, and long-horizon credit assignment (turn-level, hierarchical, hindsight) becomes a named subfield.
74%
04
Generative / embodied · genuinely novel
World models as trainable simulators, with physics-faithfulness as the open problem
Generative models are used as environments to train and plan inside (Dreamer-4-style imagination training, Genie/Cosmos interactive simulators), not just as content generators — and “does the world model respect physics” becomes the central contested question.
72%
05
Multimodal · fresh 2026
Unified, reasoning-infused multimodal: one backbone that reasons, then renders
The dominant multimodal story stops being understanding-only VLMs and becomes single backbones that both interpret and generate — an autoregressive-reasoning core that plans, then a diffusion head that renders. Multimodal is the single steepest-rising theme in ML.
82%
06
Generative LM · sharpened · beyond baseline
Diffusion language models become a recognized alternative to autoregressive text
Discrete / masked diffusion moves off images and onto the autoregressive LMs' home turf — text, code, reasoning — now at 100B scale and with its own RL post-training sub-literature.
71%
07
Architecture · genuinely novel · beyond baseline
Hybrid linear-attention beats pure SSM; attention is contested again
The contrarian call: not the naive “pure Mamba/SSM wins” bet but the ~3:1 hybrid linear-attention-to-full-attention recipe that frontier open models converged on — while pure SSM cools.
74%
08
Interpretability · genuinely novel · beyond baseline
Interpretability after SAEs: transcoders, circuit tracing, parameter decomposition
Sparse autoencoders are reframed as plateaued; the frontier moves to cross-layer transcoders, model diffing, and parameter decomposition as the working tools of mechanistic interpretability.
77%
09
AI for science · genuinely novel · beyond baseline
AI for mathematics: autonomous theorem proving cracks open problems
LLM-plus-Lean systems move from competition problems to genuinely open ones — the single freshest, highest-prestige result of the 2026 cycle. Lower volume than the rest, but spotlight-grade.
65%
10
Learning paradigm · genuinely novel · beyond baseline
Test-time training: updating weights during inference
Beyond frozen weights plus RAG, the model adapts its own parameters per-input or per-context at inference time — the freshest genuine inflection of 2026, now with a theory spine connecting it to linear attention.
70%
Watchlist · strong signals cut from the ten for novelty or slot limits
76%
Inference-aware scaling laws & the efficiency inversion
Strongest workshop proxy (NeurIPS 2025 Efficient Reasoning drew 1,000+); T² overtraining-optimal (2604.01411) reverses Chinchilla. Lost the slot to multimodal — it is the least novel of the contenders and overlaps the reasoning lane.
76%
The science of peer review under AI load
Strongest direct NeurIPS evidence — a randomized AI-reviewing experiment; 178 AI-generated position papers desk-rejected — but narrow, near-self-fulfilling, and held out to avoid a third eval/meta slot.
58%
Latent / looped (recurrent-depth) reasoning
Genuinely novel but evidence-thin (single verified anchor, Ouro) with unverified follow-ons, and an incoming critique track may deflate it before December. Cut from the 10 by all three critics.
68%
Native FP4 training (NVFP4 / MXFP4)
Nemotron pretrained 550B in NVFP4; Quartet II at ICML 2026 — but skews to MLSys venues, so it under-indexes at NeurIPS.
75%
Reasoning-aware KV-cache compression
ThinKV is an ICLR 2026 Oral, but it is a technique, not a theme — too narrow for a top-ten slot.
68%
Parametric / RL-learned agent memory (post-RAG)
Memory-R1, Evo-Memory, an ICLR MemAgents workshop — but memory-as-weights overlaps test-time training (#10).
IV · how this was built
Method, and the honesty rules
A forecast is only as good as the discipline behind it. Confidence is cross-validation times a hard leading indicator, every number traces to a source, and nothing rests on a rumor.
Cross-validation = confidence
A concept independently surfaced by ≥2 of the 9 research scouts AND backed by a hard leading indicator (ICLR/ICML 2026 accept data, an award, or a NeurIPS 2026 CFP fact) is high-probability. Single-scout or SEO-only signals are discounted.
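The gate described above is a conjunction of two conditions, which a small sketch makes explicit. The function name `is_high_probability` and its parameters are hypothetical; the source does not say how a multi-scout concept with no hard indicator is rated beyond not being high-probability, so the sketch returns only a yes/no answer.

```python
from typing import Sequence


def is_high_probability(
    scout_votes: int,
    hard_indicators: Sequence[str],
    total_scouts: int = 9,
    min_scouts: int = 2,
) -> bool:
    """True only if >= min_scouts scouts surfaced the concept AND it has
    at least one hard leading indicator (accept data, award, or CFP fact)."""
    if not 0 <= scout_votes <= total_scouts:
        raise ValueError("scout_votes must be between 0 and total_scouts")
    cross_validated = scout_votes >= min_scouts
    has_hard_anchor = len(hard_indicators) > 0
    return cross_validated and has_hard_anchor


# Hypothetical inputs, for illustration only.
print(is_high_probability(3, ["ICLR 2026 oral"]))  # True
print(is_high_probability(1, ["ICLR 2026 oral"]))  # False: single scout
print(is_high_probability(4, []))                  # False: no hard indicator
```

Requiring both conditions means that neither many scouts agreeing on a rumor nor one scout holding a solid anchor is enough on its own.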
Share, not raw count
The ICLR pipeline went 7.3k → 11.7k → 19.5k and ICML roughly doubled, so raw counts rose for almost everything. A concept counts as surging only if its share outgrew the ~1.7–2× program inflation. Every criterion is written in shares, ranks, awards, or jumps from a near-zero base, against a ~7,000-accept program.
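To see why share and raw count diverge, the sketch below divides each term's raw growth by program inflation. The counts are the ICLR title counts from Fig 2 and the totals are the rounded pipeline sizes quoted above; treating those totals as the denominator for the title scan is an assumption, and the `growth` helper is hypothetical.

```python
from typing import Tuple

# Rounded ICLR pipeline sizes from the text (assumed denominators).
ICLR_TOTALS = {2024: 7_300, 2025: 11_700, 2026: 19_500}

# ICLR title counts from Fig 2.
TITLE_COUNTS = {
    "reasoning": {2024: 154, 2025: 403, 2026: 1657},
    "RLHF": {2024: 36, 2025: 120, 2026: 111},
}


def growth(term: str, start: int, end: int) -> Tuple[float, float, float]:
    """Return (raw count ratio, program inflation, share ratio)."""
    before = TITLE_COUNTS[term][start]
    after = TITLE_COUNTS[term][end]
    if before == 0:
        # Zero-base jumps are a separate criterion in the text; no ratio exists.
        raise ValueError("ratio undefined for a zero base")
    inflation = ICLR_TOTALS[end] / ICLR_TOTALS[start]
    raw = after / before
    return raw, inflation, raw / inflation


for term, start, end in [("reasoning", 2024, 2026), ("RLHF", 2025, 2026)]:
    raw, inflation, share = growth(term, start, end)
    print(f"{term} {start}->{end}: raw x{raw:.2f}, program x{inflation:.2f}, share x{share:.2f}")
```

On these inputs, the raw count for reasoning grows about 10.8× from 2024 to 2026 while the program grows about 2.7×, so its share roughly quadruples. RLHF's raw count is nearly flat from 2025 to 2026, which means its share falls.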
Novelty bar
Reject the 2024/25 truisms (“more LLMs / agents / multimodal”). Each pick names a specific, fresh concept a non-expert wouldn't already assume.
The timing key
NeurIPS 2026 is the first NeurIPS whose full submission cycle post-dates DeepSeek-R1 (Jan 2025). R1-descended concepts (RLVR, reasoning-RL, test-time compute) peak here, not at NeurIPS 2025.
Adversarial before locking
Three independent critics (novelty, probability/evidence, falsifiability) attacked the draft. They caught a factual error (a misattributed CVPR best paper), a missing theme (multimodal), and systematic over-confidence — all corrected before these ten were locked.
Falsifiable
Every prediction carries a December-2026 criterion checkable against the public program: accepts, titles, abstracts, orals/spotlights, awards, workshops, and the CFP.
Honesty rules baked into the data
· n is far too small to calibrate. Ten predictions, of which about five are beyond-baseline, cannot support a claim that the confidences are well-calibrated. They are stated for ranking and transparency, not as a calibration result.
· Only the ten count. The six-item watchlist and the cut concepts are sealed as non-scoring and cannot be promoted after December, whatever happens.
· The forecast reads lab and arXiv momentum that the public-conference baseline cannot, so part of any edge is more inputs, not better foresight. We say so rather than hide it.
· The backtest matching was done by one author who saw both the outcomes and the prior signals, so it is a judgment estimate, audited by two red-teams and corrected down, not a clean measurement.
· arXiv category counts include cross-lists, so the four CS categories are never summed into a unique total. Momentum is read as share, not raw count, against a NeurIPS 2026 program of about 7,000 accepts.
· No prediction rests on an SEO-suspect model name or an unverified late-2026 arXiv id, only on hard anchors: the track rename, ICLR/ICML 2026 orals and accept counts, AlphaProof's Erdős results, and a constant-method keyword scan. A draft anchor, a CVPR best paper claimed to be a world model, was caught wrong in review and removed.
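The first rule above, that n is too small to calibrate, can be made concrete with a binomial confidence interval. The sketch uses the standard Wilson score interval; the function name is hypothetical and the 4-of-5 outcome is a made-up example, not a result.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial hit rate (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical outcome: 4 of the 5 beyond-baseline calls hit.
low, high = wilson_interval(4, 5)
print(f"{low:.2f} to {high:.2f}")
```

With five scored calls, the interval on the true hit rate runs from roughly 0.38 to 0.96, too wide to distinguish a well-calibrated forecaster from a mediocre one.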