EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments Paper • 2607.02440 • Published 1 day ago • 36
MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling Paper • 2606.13473 • Published 23 days ago • 92
ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics Paper • 2606.10479 • Published 25 days ago • 19
SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents Paper • 2606.05761 • Published 30 days ago • 19
SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects Paper • 2605.19587 • Published May 19 • 10
Draft-OPD: On-Policy Distillation for Speculative Draft Models Paper • 2605.29343 • Published May 28 • 36
Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention Paper • 2605.22791 • Published May 21 • 33