Daily AI Papers — May 09, 2026

12 minute read

Published: May 09, 2026

1. AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

Authors: Daniel Zheng, Ingrid von Glehn, Yori Zwols, Iuliya Beloshapka, Lars Buesing, Daniel M. Roy, Martin Wattenberg, Bogdan Georgiev, Tatiana Schmidt, Andrew Cowie, Fernanda Viegas, Dimitri Kanevsky (Google DeepMind)

Summary: Google DeepMind introduces an interactive AI workbench where mathematicians can leverage AI agents to pursue open-ended research — covering ideation, literature synthesis, and proof exploration. The system achieves a new high score on the FrontierMath benchmark, signaling a step change in AI-assisted mathematical discovery.

arXiv: arxiv.org/abs/2605.06651

Sources: HuggingFace Daily Papers, officechai.com, Nature.com, YouTube (JX6nCEsuBDw)

Why trending: High-profile Google DeepMind release with benchmark-breaking results on FrontierMath; earned media coverage in Nature and general AI press, plus a dedicated YouTube explainer.


2. EMO: Pretraining Mixture of Experts for Emergent Modularity

Authors: Ryan Wang, Akshita Bhagia, Sewon Min (Allen Institute for AI)

Summary: EMO shows that pretraining with a Mixture-of-Experts objective that encourages emergent specialization yields a model whose expert subsets serve as standalone narrow specialists (code, math, domain knowledge) without losing general capability. This challenges the assumption that MoE routing only saves compute; it can also yield modular, deployable sub-models.
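
A minimal sketch of the deployable-sub-model idea (the class, names, and top-k router below are assumptions for illustration, not the paper's code): restricting routing to a chosen expert subset turns one MoE checkpoint into a standalone narrow specialist.

    import torch
    import torch.nn as nn

    class TopKMoE(nn.Module):
        def __init__(self, d_model=256, n_experts=8, k=2):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts))
            self.k = k

        def forward(self, x, keep=None):
            logits = self.router(x)                  # (batch, n_experts)
            if keep is not None:                     # restrict to a specialist subset
                mask = torch.full_like(logits, float("-inf"))
                mask[:, keep] = 0.0
                logits = logits + mask
            weights, idx = torch.topk(logits.softmax(-1), self.k, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e in idx[:, slot].unique():
                    rows = idx[:, slot] == e
                    out[rows] += weights[rows, slot, None] * self.experts[int(e)](x[rows])
            return out

    moe = TopKMoE()
    x = torch.randn(4, 256)
    full = moe(x)                     # general model: routes over all experts
    code_only = moe(x, keep=[0, 3])   # hypothetical "code" specialist subset

Routing with keep=[0, 3] is what "expert subsets as standalone specialists" cashes out to operationally: the masked softmax renormalizes over the kept experts, so the subset runs as a self-contained sub-model.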

arXiv: arxiv.org/abs/2605.06663

Sources: HuggingFace Daily Papers, HuggingFace Blog (AllenAI), Emergent Mind

Why trending: AllenAI published an accompanying blog post on HuggingFace, and the modular LLM angle resonates strongly with practitioners seeking efficient deployment.


3. Prescriptive Scaling Laws for Data Constrained Training

Authors: Justin Lovelace, Christian Belardi, Srivatsa Kundurthy, Shriya Sudhakar, Kilian Q. Weinberger

Summary: The classical Chinchilla scaling law assumes unlimited unique training tokens, but data is now the bottleneck. This paper derives new prescriptive scaling laws that tell practitioners exactly how to allocate compute when tokens are scarce — including when and how much to repeat data — going beyond descriptive observation to actionable guidance.
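
The paper's fitted laws are its own; as an illustrative stand-in, here is a diminishing-returns model of repeated data in the spirit of earlier data-constrained scaling work. The functional form and decay constant are assumptions, not this paper's result.

    import math

    def effective_tokens(unique_tokens: float, repeats: float,
                         r_star: float = 15.0) -> float:
        """Value of training tokens when each pass over the data is worth less.

        repeats = 0 means a single epoch; each extra pass adds exponentially
        less effective data, saturating at unique_tokens * (1 + r_star).
        """
        return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

    U = 1e12  # 1T unique tokens available
    for r in (0, 1, 4, 16, 64):
        print(f"{r:3d} repeats -> {effective_tokens(U, r) / 1e12:6.2f}T effective tokens")

A prescriptive law in this style would then allocate compute by maximizing predicted loss reduction over (model size, repeats) subject to the fixed unique-token budget.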

arXiv: arxiv.org/abs/2605.01640

Sources: HuggingFace Daily Papers, mbrenndoerfer.com blog

Why trending: Directly addresses the “data wall” problem every serious LLM training team faces; prescriptive (not just descriptive) framing makes it immediately actionable.


4. TIDE: Every Layer Knows the Token Beneath the Context

Authors: Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Mehrdad Farajtabar, Minsik Cho

Summary: TIDE challenges the universal design choice of injecting token identity only once, at the input embedding layer. By re-injecting token identity at every Transformer layer, it mitigates the “Rare Token Problem” (where Zipf-distributed rare tokens lose their identity with depth) and improves long-context understanding without an architectural overhaul.
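
A minimal sketch of the re-injection idea (the layer type, residual form, and alpha scale are my assumptions, not the paper's exact design): re-add each token's embedding at every layer so rare-token identity is not washed out with depth.

    import torch
    import torch.nn as nn

    class ReinjectedTransformer(nn.Module):
        def __init__(self, vocab=32000, d_model=512, n_layers=6, alpha=0.1):
            super().__init__()
            self.embed = nn.Embedding(vocab, d_model)
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
                for _ in range(n_layers))
            self.alpha = alpha  # strength of per-layer identity re-injection

        def forward(self, tokens):
            e = self.embed(tokens)   # token identity, normally used only here
            h = e
            for layer in self.layers:
                h = layer(h + self.alpha * e)   # re-inject identity at every layer
            return h

    model = ReinjectedTransformer()
    out = model(torch.randint(0, 32000, (2, 16)))  # (2, 16, 512)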

arXiv: arxiv.org/abs/2605.06216

Sources: HuggingFace Daily Papers, Emergent Mind

Why trending: Simple architectural tweak with broad applicability; directly addresses a known failure mode in all current LLMs.


5. Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

Authors: Zhiyuan Zeng, Jiameng Huang, Zhangyue Yin, Jiashuo Liu, Ziniu Li, Bingrui Li, Yuhao Wu, Yining Zheng, Ge Zhang, Wenhao Huang, Xipeng Qiu

Summary: GRPO-style RL training for LLM reasoning is widely adopted, but this paper identifies a previously overlooked aggregation bias in how group rewards are computed — biasing gradients and hurting training stability. The fix (Balanced Aggregation) is lightweight and significantly improves final reasoning performance.
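
For reference, the quantity the paper's fix targets is the standard group-relative advantage used in GRPO; the sketch below computes it per group. The aggregation of these per-rollout terms into a batch loss is where the paper locates the bias; its Balanced Aggregation fix is not reproduced here.

    import torch

    def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        """rewards: (groups, rollouts_per_group) scalar rewards per rollout."""
        mean = rewards.mean(dim=1, keepdim=True)
        std = rewards.std(dim=1, keepdim=True)
        return (rewards - mean) / (std + eps)   # advantage relative to the group

    rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],    # group 1: mixed outcomes
                            [1.0, 1.0, 1.0, 0.0]])   # group 2: mostly solved
    print(grpo_advantages(rewards))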

arXiv: arxiv.org/abs/2605.04077

Sources: HuggingFace Daily Papers

Why trending: GRPO is the de facto RL fine-tuning method for reasoning models; a bug-fix paper with strong results attracts immediate practitioner attention.


6. Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

Authors: Yuan Wang, Ouxiang Li, Yulong Xu, Borui Liao, Jiajun Liang, Jinghan Li, Meng Wang, Xintao Wang, Pengfei Wang, Kuien Liu, Xiang Wang

Summary: Current video reward models conflate reasoning and scoring, leading to noisy, inaccurate reward signals for generative video training. This paper decouples reasoning (understanding video quality dimensions) from scoring (emitting a reward signal), yielding sharper reward estimates that significantly improve post-training and test-time scaling for video generation.
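
A hypothetical sketch of the decoupling (the judge callable, prompts, and numeric-reply convention are placeholders, not the paper's model): stage 1 produces free-form reasoning about quality dimensions, stage 2 emits only a number conditioned on that reasoning.

    from typing import Callable

    def decoupled_video_reward(judge: Callable[[str], str], video_desc: str) -> float:
        # Stage 1: reason about quality dimensions, without scoring.
        reasoning = judge(
            "Analyze this video on motion smoothness, prompt fidelity, and "
            f"temporal consistency. Do not give a score.\n\n{video_desc}")
        # Stage 2: score conditioned on the reasoning, and nothing else.
        score_text = judge(
            f"Given this analysis:\n{reasoning}\n\n"
            "Output a single quality score between 0 and 1, nothing else.")
        return float(score_text.strip())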

arXiv: arxiv.org/abs/2605.05922

Sources: HuggingFace Daily Papers

Why trending: Video generation post-training is a hot frontier; cleaner reward models directly translate to better Sora/Wan-class models.


7. Verifier-Backed Hard Problem Generation for Mathematical Reasoning

Authors: Yuhang Lai, Jiazhan Feng, Yee Whye Teh, Ning Miao

Summary: LLMs are good at solving math problems but poor at generating new, genuinely hard, valid problems — a bottleneck for autonomous AI-driven curriculum generation. This paper uses formal verifiers as a backbone to guarantee problem validity while a generative model steers toward difficulty, enabling reliable hard problem synthesis at scale.
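
A toy analogue of the generate-then-verify loop, with SymPy standing in for the paper's formal verifier and a random generator standing in for the LLM (the difficulty proxy is invented for illustration):

    import random
    import sympy as sp

    x = sp.symbols("x")

    def propose() -> sp.Expr:
        # Stand-in generator: random monic quadratics as "candidate problems".
        b, c = random.randint(-9, 9), random.randint(-9, 9)
        return x**2 + b * x + c

    def verified_hard(eq: sp.Expr) -> bool:
        roots = sp.solve(sp.Eq(eq, 0), x)
        # "Valid": the verifier certifies a solution exists.
        # "Hard" (toy proxy only): no rational roots.
        return len(roots) > 0 and not all(r.is_rational for r in roots)

    bank = [p for p in (propose() for _ in range(200)) if verified_hard(p)]
    print(len(bank), "verified-hard problems, e.g.", bank[0])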

arXiv: arxiv.org/abs/2605.06660

Sources: arXiv cs.AI, Papers With Code

Why trending: Directly addresses the data flywheel problem for math reasoning models; combines symbolic verification with neural generation in a principled way.


8. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

Authors: Mingwei Xu, Hao Fang

Summary: Standard RLVR training for LLMs requires both positive and negative rollout examples, but negative samples are expensive and often uninformative. This paper shows that an implicit negative gradient signal can be derived purely from positive rollouts, enabling more sample-efficient policy optimization with comparable or better reasoning performance.
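
One way to see how an implicit negative gradient can arise from positive-only data (a generic property of the softmax cross-entropy loss, not necessarily the paper's exact mechanism): raising the log-probability of the chosen token pushes down every alternative through the normalizer.

    import torch
    import torch.nn.functional as F

    logits = torch.randn(1, 5, requires_grad=True)   # one step over a 5-token vocab
    positive_token = torch.tensor([2])               # token from a positive rollout
    loss = F.cross_entropy(logits, positive_token)   # maximize log-prob of the positive
    loss.backward()
    print(logits.grad)  # negative entry at index 2, positive (push-down) elsewhere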

arXiv: arxiv.org/abs/2605.06650

Sources: arXiv cs.LG

Why trending: Sample efficiency in RL fine-tuning is a key pain point; removing the need for negative rollouts simplifies training pipelines.


9. The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

Authors: Chonghan Qin, Xiachong Feng, Ziyun Song, Xiaocheng Feng, Jing Xiong, Lingpeng Kong

Summary: LLMs are routinely prompted to adopt social roles from individual personas to institutional voices, yet it is unclear whether their internal representations encode this “granularity” dimension. This paper identifies a linear micro-to-macro latent direction in LLM activations and shows it can be steered to improve role-consistent generation.
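
A minimal steering sketch (the difference-of-means direction, layer choice, and scale alpha are assumptions): the "micro-to-macro" direction is estimated as the mean activation gap between role prompts at the two extremes, then added at inference.

    import torch

    def granularity_direction(micro_acts: torch.Tensor,
                              macro_acts: torch.Tensor) -> torch.Tensor:
        """Each input: (n_prompts, d_model) hidden states at a chosen layer."""
        v = macro_acts.mean(0) - micro_acts.mean(0)
        return v / v.norm()

    def steer(h: torch.Tensor, v: torch.Tensor, alpha: float) -> torch.Tensor:
        # alpha > 0 pushes toward institutional voice, alpha < 0 toward persona.
        return h + alpha * v

    d = 512
    micro = torch.randn(32, d) + 0.5   # stand-in activations for persona prompts
    macro = torch.randn(32, d) - 0.5   # stand-in activations for institutional prompts
    v = granularity_direction(micro, macro)
    h_steered = steer(torch.randn(1, d), v, alpha=4.0)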

arXiv: arxiv.org/abs/2605.06196

Sources: HuggingFace Daily Papers

Why trending: Mechanistic interpretability of role-playing behavior is directly relevant to persona-based AI systems and alignment research.


10. Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

Authors: Hailey Onweller, Elias Lumer, Austin Huber, Pia Ramchandani, Vamse Kumar Subbiah, Corey Feld

Summary: Deep research agents like Gemini Deep Research and Perplexity cite hundreds of sources, but those citations are often inaccurate or unverifiable. This paper introduces a systematic framework to parse, trace, and evaluate source attribution fidelity in LLM research agents, revealing alarming rates of miscitation.
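
A toy attribution check, far simpler than the paper's framework (the quote-then-URL report convention is assumed): extract cited URLs from an agent's report and verify each cited quote actually appears in the fetched page.

    import re
    import requests

    def check_citations(report: str) -> list[tuple[str, bool]]:
        results = []
        # Assumed report convention: "quoted claim" (http://source).
        for quote, url in re.findall(r'"([^"]+)"\s*\((https?://\S+?)\)', report):
            try:
                page = requests.get(url, timeout=10).text
                results.append((url, quote.lower() in page.lower()))
            except requests.RequestException:
                results.append((url, False))   # unreachable counts as unverified
        return results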

arXiv: arxiv.org/abs/2605.06635

Sources: arXiv cs.CL

Why trending: Hallucinated citations in research agents are a serious trust and liability issue; this is the first systematic evaluation framework for the problem.


11. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

Authors: Yuxing Liu, Jianyu Wang, Tong Zhang

Summary: A surprising empirical finding: fine-tuning LLMs with the same optimizer used during pretraining (rather than defaulting to AdamW) substantially reduces catastrophic forgetting. The paper provides theoretical grounding for this “optimizer-model consistency” principle and shows it generalizes across architectures and tasks.

arXiv: arxiv.org/abs/2605.06654

Sources: arXiv cs.LG

Why trending: Immediately actionable insight for anyone running full fine-tuning pipelines; challenges the default of always using AdamW for finetuning.


12. The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

Authors: Siquan Li, Kaiqi Jiang, Jiacheng Sun, Tianyang Hu

Summary: The attention sink phenomenon — where initial tokens monopolize attention scores — is well-documented but mechanistically unexplained. This paper traces it to three structural causes: variance discrepancy across layers, “super neuron” activations, and dimension disparity in key/query projections, opening principled paths to mitigation.
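
A small diagnostic in the same spirit (not the paper's analysis code): measure how much causal attention mass later queries place on position 0 in a toy head. Inflating the first key's scale, a crude stand-in for the variance and super-neuron effects the paper describes, typically drives that mass up sharply.

    import torch

    def sink_mass(q: torch.Tensor, k: torch.Tensor) -> float:
        """q, k: (seq, d_head). Mean causal attention weight on position 0."""
        scores = (q @ k.T) / k.shape[-1] ** 0.5
        mask = torch.tril(torch.ones_like(scores)).bool()
        attn = scores.masked_fill(~mask, float("-inf")).softmax(-1)
        return attn[1:, 0].mean().item()   # average weight later queries put on token 0

    q, k = torch.randn(128, 64), torch.randn(128, 64)
    k_sink = k.clone()
    k_sink[0] *= 8                         # exaggerate the first key's scale
    print(sink_mass(q, k), "->", sink_mass(q, k_sink))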

arXiv: arxiv.org/abs/2605.06611

Sources: arXiv cs.LG

Why trending: Mechanistic understanding of attention sinks is prerequisite for fixing them; affects long-context performance in every transformer-based LLM.


13. MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems

Authors: Zhexuan Wang, Xuebo Liu, Li Wang, Zifei Shan, Yutong Wang, Zhenxi Song, Min Zhang

Summary: In LLM multi-agent systems, each agent has a role-specific prompt, but these prompts are typically tuned independently, ignoring cross-agent interactions. MASPO jointly optimizes all agent prompts by modeling the MAS as a cooperative game, significantly improving collaborative task performance over per-agent baselines.
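
A hedged sketch of the joint idea, with coordinate ascent standing in for the paper's cooperative-game formulation (the evaluate function and candidate pools are placeholders): each agent's prompt is chosen against the current prompts of all other agents, never in isolation.

    from typing import Callable

    def joint_prompt_search(prompts: dict[str, str],
                            candidates: dict[str, list[str]],
                            evaluate: Callable[[dict[str, str]], float],
                            rounds: int = 3) -> dict[str, str]:
        best = dict(prompts)
        for _ in range(rounds):
            for agent, pool in candidates.items():
                # Score each candidate prompt with teammates' prompts held fixed.
                scored = [(evaluate({**best, agent: p}), p) for p in pool]
                best[agent] = max(scored)[1]   # best response given the team
        return best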

arXiv: arxiv.org/abs/2605.06623

Sources: arXiv cs.AI

Why trending: Multi-agent systems are rapidly proliferating; joint prompt optimization is an underexplored but high-leverage lever for MAS performance.


14. Crafting Reversible SFT Behaviors in Large Language Models

Authors: Yuping Lin, Pengfei He, Yue Xing, Yingqian Cui, Jiayuan Ding, Subhabrata Mukherjee, Hui Liu, Zhen Xiang

Summary: Supervised fine-tuning imprints behaviors into LLMs without any built-in mechanism to reverse them — a problem for alignment and model updates. This paper proposes a training method that deliberately structures new SFT behaviors into a reversible subnetwork, enabling clean removal without degrading other capabilities.
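
An analogy, not the paper's method: if new behavior lives in an isolated additive delta (here a LoRA-style adapter), removal is exact by construction. The paper's contribution is obtaining a reversible subnetwork from SFT itself; the sketch only shows what clean reversal looks like.

    import torch
    import torch.nn as nn

    class ReversibleLinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)        # frozen pretrained weights
            self.down = nn.Linear(base.in_features, rank, bias=False)
            self.up = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.up.weight)     # delta starts at exactly zero
            self.enabled = True                # flip to False to revert behavior

        def forward(self, x):
            out = self.base(x)
            return out + self.up(self.down(x)) if self.enabled else out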

arXiv: arxiv.org/abs/2605.06632

Sources: arXiv cs.CL

Why trending: Model editing and alignment rollback are increasingly important as models are deployed and need behavioral updates post-deployment.


15. The Scaling Properties of Implicit Deductive Reasoning in Transformers

Authors: Enrico Vompa, Tanel Tammet

Summary: Using Horn clause theorem proving as a controlled testbed, this paper systematically studies how implicit deductive reasoning scales with Transformer depth, width, and training data. It finds that model depth is the primary driver of attainable deduction depth, providing clean scaling laws for deductive capability.
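
A toy Horn-clause setup (the encoding is assumed, not the paper's): forward chaining in rounds, where the number of rounds needed is the derivation depth that the paper relates to model depth.

    def forward_chain(facts: set[str],
                      rules: list[tuple[frozenset[str], str]]) -> int:
        """Returns the number of chaining rounds until no new facts derive."""
        known, depth = set(facts), 0
        while True:
            new = {head for body, head in rules
                   if body <= known and head not in known}
            if not new:
                return depth
            known |= new
            depth += 1

    rules = [(frozenset({"a"}), "b"),
             (frozenset({"b"}), "c"),
             (frozenset({"b", "c"}), "d")]
    print(forward_chain({"a"}, rules))  # 3 rounds: a->b, then b->c, then {b,c}->d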

arXiv: arxiv.org/abs/2605.04330

Sources: HuggingFace Daily Papers

Why trending: Clean empirical scaling laws for reasoning (not just loss) are rare; the controlled Horn clause setting makes results highly interpretable.


16. STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

Authors: Hanxiang Chao, Yihan Bai, Rui Sheng, Tianle Li, Yushi Sun

Summary: LLM agents with persistent memory stores face a critical failure mode: implicit conflicts where new observations contradict stored beliefs without explicit signaling. STALE introduces a benchmark and methodology for evaluating whether agents can detect and revise stale memories, revealing major gaps in current systems.
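
A placeholder sketch of the behavior STALE tests for (the contradicts judge is hypothetical; the paper contributes the benchmark, not this loop): on each new observation, screen stored memories for implicit conflicts and retire those that no longer hold.

    from typing import Callable

    def revise_memory(memories: list[str], observation: str,
                      contradicts: Callable[[str, str], bool]) -> list[str]:
        kept = [m for m in memories if not contradicts(m, observation)]
        return kept + [observation]   # stale beliefs out, new evidence in

    memories = ["Alice works at Acme", "Alice lives in Berlin"]
    new_obs = "Alice mentioned her first week at Initech"
    # e.g. revise_memory(memories, new_obs, contradicts=my_llm_judge)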

arXiv: arxiv.org/abs/2605.06527

Sources: arXiv cs.AI

Why trending: Long-term memory staleness is a critical unsolved problem for deployed personal AI agents; the benchmark fills a gap left by static fact-retrieval evaluations.


17. LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

Authors: Yijia Zheng, Marcel Worring

Summary: Multi-step agentic RAG is powerful but slow due to repeated LLM calls for planning and retrieval. LatentRAG moves reasoning into a latent space, enabling joint reasoning and retrieval in a single forward pass, drastically reducing latency while matching or exceeding multi-step accuracy.
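
A schematic of the single-pass idea (the architecture below is a guess at the shape of the approach, not the paper's): instead of looping plan, retrieve, read through repeated LLM calls, a learned latent hop transforms the query embedding so one nearest-neighbor lookup lands on multi-hop evidence.

    import torch
    import torch.nn as nn

    d = 384
    latent_hop = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def single_pass_retrieve(query_emb: torch.Tensor,
                             doc_embs: torch.Tensor, k: int = 4) -> torch.Tensor:
        z = latent_hop(query_emb)      # "reasoning" happens in latent space
        scores = doc_embs @ z          # (n_docs,) similarity in one pass
        return scores.topk(k).indices  # indices of retrieved documents

    docs = torch.randn(10_000, d)
    print(single_pass_retrieve(torch.randn(d), docs))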

arXiv: arxiv.org/abs/2605.06285

Sources: arXiv cs.IR

Why trending: Efficiency of agentic RAG is a major production concern; single-pass latent reasoning is an elegant departure from the dominant multi-step paradigm.


18. TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

Authors: Yuan Sui, Yulin Chen, Yibo Li, Xue Jiang, Yufei He, Yihong Dong, Xiaoxin He, Tianyu Gao, Bryan Hooi

Summary: Coding agents degrade over long trajectories through “agent drift” — either overthinking (redundant reasoning) or overacting (issuing unnecessary tool calls). TACT uses activation steering to suppress these failure modes at inference time, improving SWE-bench performance without any additional fine-tuning.
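
An inference-time sketch of activation steering via a PyTorch forward hook (the direction, layer choice, and alpha are assumptions; the paper derives its own steering vectors): the hook removes the component of the hidden state along a learned "overthinking" direction, with no retraining.

    import torch
    import torch.nn as nn

    def add_suppression_hook(layer: nn.Module, direction: torch.Tensor, alpha: float):
        v = direction / direction.norm()
        def hook(module, inputs, output):
            h = output[0] if isinstance(output, tuple) else output
            # Subtract (scaled) projection onto the suppressed direction.
            h = h - alpha * (h @ v).unsqueeze(-1) * v
            return (h, *output[1:]) if isinstance(output, tuple) else h
        return layer.register_forward_hook(hook)

    # Usage: handle = add_suppression_hook(model.layers[12], v_overthink, 1.0)
    # ... run the agent ...; handle.remove() restores the original model.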

arXiv: arxiv.org/abs/2605.05980

Sources: arXiv cs.SE

Why trending: Coding agents are among the highest-value deployments of LLMs; activation steering as an inference-time fix (no retraining) is immediately applicable.


19. MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval

Authors: Chunyu Li, Jingyi Kang, Ding Chen, Mengyuan Zhang, Jiajun Shen, Bo Tang, Xuanhe Zhou, Feiyu Xiong, Zhiyu Li

Summary: Standard rerankers in agent memory systems rely on semantic similarity, which fails when relevant memories require multi-hop reasoning to connect to the query. MemReranker introduces a reasoning-aware reranker that explicitly reasons over candidate memories before scoring, substantially improving long-term agent performance on complex tasks.
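
A hedged sketch of the pattern (the llm callable, prompt, and score format are placeholders): the reranker reasons about how each candidate memory connects to the query before scoring, rather than scoring on surface similarity alone.

    from typing import Callable

    def rerank(llm: Callable[[str], str], query: str,
               candidates: list[str], k: int = 5) -> list[str]:
        scored = []
        for mem in candidates:
            out = llm(
                f"Query: {query}\nMemory: {mem}\n"
                "Reason step by step about whether this memory, possibly "
                "combined with others, helps answer the query. "
                "End with 'SCORE: <0-10>'.")
            score = float(out.rsplit("SCORE:", 1)[-1].strip())
            scored.append((score, mem))
        return [m for _, m in sorted(scored, reverse=True)[:k]]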

arXiv: arxiv.org/abs/2605.06132

Sources: arXiv cs.AI

Why trending: Agent memory retrieval is a critical bottleneck for long-horizon tasks; reasoning-aware reranking is a natural but previously underexplored direction.


20. GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs

Authors: Pranav Mantini, Shishir K. Shah

Summary: Vision-Language Models suffer from catastrophic forgetting when accumulating expertise across domains. GeoStack introduces a modular geometric framework that treats independently trained domain experts as composable building blocks — using quasi-Abelian structure to guarantee interference-free combination without full retraining.
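
A loose analogy only, since the summary does not spell out the quasi-Abelian construction: storing each expert as a weight delta makes composition commutative by addition, and projecting out overlap between deltas is one crude, order-dependent approximation of interference-free combination (the paper claims a proper guarantee).

    import torch

    def orthogonal_compose(base: torch.Tensor,
                           deltas: list[torch.Tensor]) -> torch.Tensor:
        composed, basis = base.clone(), []
        for d in deltas:
            v = d.flatten().clone()
            for b in basis:               # project out overlap with earlier experts
                v -= (v @ b) * b
            if v.norm() > 1e-8:
                basis.append(v / v.norm())
            composed += v.view_as(d)      # add only the non-interfering part
        return composed

    base = torch.randn(64, 64)
    experts = [torch.randn(64, 64) * 0.01 for _ in range(3)]
    merged = orthogonal_compose(base, experts)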

arXiv: arxiv.org/abs/2605.06477

Sources: HuggingFace Daily Papers

Why trending: Continual learning and modular composition for VLMs is a persistent challenge; the geometric formulation provides a theoretically grounded approach to a practical problem.