CocoaBench: Evaluating Unified Digital Agents in the Wild Paper • 2604.11201 • Published 7 days ago • 33
Synthetic Sandbox for Training Machine Learning Engineering Agents Paper • 2604.04872 • Published 14 days ago • 14
Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents Paper • 2601.18217 • Published Jan 26 • 13
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following Paper • 2511.21662 • Published Nov 26, 2025 • 11
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following Paper • 2511.21662 • Published Nov 26, 2025 • 11
Scaling Agent Learning via Experience Synthesis Paper • 2511.03773 • Published Nov 5, 2025 • 83
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought Paper • 2511.02779 • Published Nov 4, 2025 • 60
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation Paper • 2511.01163 • Published Nov 3, 2025 • 32