Daily AI Papers — April 29, 2026
1. Recursive Multi-Agent Systems (RecursiveMAS)
Authors: Xiyuan Yang, Jiaru Zou, Rui Pan, Ruizhong Qiu, Pan Lu, Shizhe Diao, Jindong Jiang, Hanghang Tong, Tong Zhang, Markus J. Buehler, Jingrui He, James Zou
arxiv: arxiv.org/abs/2604.25917
Sources: HuggingFace Daily Papers (117 upvotes), arxiv cs.AI
Summary: RecursiveMAS extends the recursive/looped computation scaling principle from single models to multi-agent systems, connecting heterogeneous agents in a latent-space collaboration loop via a lightweight RecursiveLink module. An inner-outer loop learning algorithm enables whole-system co-optimization through shared gradient-based credit assignment, yielding an 8.3% average accuracy improvement, a 1.2–2.4× inference speedup, and a 34.6–75.6% token reduction across 9 benchmarks spanning math, science, medicine, search, and code generation.
Why trending: Top-voted HuggingFace paper of the day (117 upvotes) from Stanford. Addresses a core open question in MAS: can agent collaboration itself be a scaling axis? Strong empirical results across diverse benchmarks.
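Code sketch: A minimal PyTorch illustration of the latent-space collaboration loop described above, in which several agent networks exchange hidden states through a small shared link module for a fixed number of recursion steps. Module internals and shapes are guesses for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class RecursiveLink(nn.Module):
        # Hypothetical lightweight mixer of per-agent latents into a shared
        # context vector (name from the paper, internals invented).
        def __init__(self, dim):
            super().__init__()
            self.mix = nn.Linear(dim, dim)

        def forward(self, latents):          # latents: (n_agents, dim)
            return self.mix(latents.mean(dim=0))

    class RecursiveMAS(nn.Module):
        def __init__(self, n_agents=3, dim=64, steps=4):
            super().__init__()
            self.agents = nn.ModuleList(nn.GRUCell(dim, dim) for _ in range(n_agents))
            self.link = RecursiveLink(dim)
            self.steps = steps

        def forward(self, x):                # x: (dim,) task embedding
            h = torch.zeros(len(self.agents), x.shape[-1])
            for _ in range(self.steps):      # inner collaboration loop
                ctx = self.link(h)           # shared latent context
                h = torch.stack([agent((ctx + x).unsqueeze(0), hi.unsqueeze(0)).squeeze(0)
                                 for agent, hi in zip(self.agents, h)])
            return h.mean(dim=0)             # pooled system output

    model = RecursiveMAS()
    loss = model(torch.randn(64)).pow(2).sum()  # stand-in task loss
    loss.backward()  # outer loop: one shared gradient-based credit-assignment step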
2. Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs
Authors: Chenkai Pan, Xinglong Xu, Yuhang Xu, Yujun Wu, Siyuan Li, Jintao Chen, Conghui He, Jingxuan Wei, Cheng Tan
arxiv: arxiv.org/abs/2604.24819
Sources: HuggingFace Daily Papers (69 upvotes), arxiv cs.AI, OpenDataLab
Summary: This paper maps the LLM fine-tuning lifecycle onto the software development lifecycle: training data = source code, model training = compilation, benchmarking = unit testing, failure-driven data repair = debugging. The framework enables systematic diagnosis and targeted patching of model failures at the concept and reasoning-chain level, releasing a structured knowledge base, benchmark suite, and training corpus across 16 disciplines without degrading general capabilities.
Why trending: Novel framing that treats LLM training data as debuggable software. Comes with open resources across 16 domains. Addresses the feedback-loop problem in fine-tuning that has long frustrated practitioners.
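Code sketch: The software analogy suggests a concrete loop: run the benchmark "unit tests," group failures by concept, patch the data where tests fail, and retrain. A schematic Python sketch with hypothetical callables, not the paper's released tooling:

    from collections import defaultdict

    def data_repair_loop(model, train_fn, test_cases, patch_data_fn, max_rounds=3):
        # test_cases: list of (concept, input, expected) benchmark "unit tests"
        for _ in range(max_rounds):
            failures = defaultdict(list)
            for concept, x, expected in test_cases:
                if model(x) != expected:            # a "unit test" fails
                    failures[concept].append((x, expected))
            if not failures:
                break                               # all tests pass
            for concept, cases in failures.items():
                patch_data_fn(concept, cases)       # "debugging": repair the data
            model = train_fn()                      # "recompilation": retrain
        return model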
3. DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
Authors: Jinxiang Meng, Shaoping Huang, Fangyu Lei, Jingyu Guo, Haoxiang Liu, Jiahao Su, Conghui He, Jingxuan Wei, Cheng Tan
arxiv: arxiv.org/abs/2604.25914
Sources: HuggingFace Daily Papers (36 upvotes), arxiv cs.AI
Summary: DV-World is a benchmark of 260 tasks designed to evaluate data visualization agents in realistic environments, covering native environmental grounding, cross-platform evolution (not just code-sandbox creation), and proactive intent alignment when user intent is imperfect. It addresses three key gaps in existing DV benchmarks: code-sandbox confinement, single-language creation-only tasks, and assumed perfect intent.
Why trending: Fills a practical gap in agentic benchmarking. Most existing DV benchmarks test code generation in sandboxes; this tests agents in real environments with ambiguous intent.
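Code sketch: A hypothetical task record capturing the three axes the benchmark targets; the real schema is defined in the paper, and every field name below is a guess:

    from dataclasses import dataclass, field

    @dataclass
    class DVTask:
        environment: str                 # native grounding, e.g. a real notebook or BI tool
        source_platform: str | None      # set for cross-platform evolution tasks
        instruction: str                 # possibly imperfect user intent
        data_files: list[str] = field(default_factory=list)
        needs_clarification: bool = False  # proactive intent-alignment tasks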
4. AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
Authors: Lei Xiong, Kun Luo, Ziyi Xia
arxiv: arxiv.org/abs/2604.25256
Sources: HuggingFace Daily Papers (24 upvotes), arxiv cs.AI
Summary: AutoResearchBench evaluates AI agents’ capability in complex scientific literature discovery—a key step in autonomous research pipelines—testing whether agents can find the right papers for knowledge exploration, evidence gathering, and claim verification. The benchmark tests multi-hop reasoning over literature graphs and realistic research workflows.
Why trending: Autonomous scientific research is a high-interest area; this provides infrastructure to measure how well current agents actually navigate scientific literature.
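Code sketch: Multi-hop literature discovery can be pictured as guided traversal of a citation graph. A toy best-first search with an invented graph format and a stand-in relevance function; nothing here comes from the benchmark's actual harness:

    import heapq

    def multi_hop_search(graph, seed, relevance, hops=2, k=5):
        # graph: dict paper_id -> list of cited/related paper_ids
        # relevance: paper_id -> float, standing in for an agent's judgment
        frontier = [(-relevance(seed), seed, 0)]
        seen, found = {seed}, []
        while frontier:
            neg_score, paper, depth = heapq.heappop(frontier)
            found.append((paper, -neg_score))
            if depth == hops:
                continue                 # stop expanding past the hop budget
            for nxt in graph.get(paper, []):
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(frontier, (-relevance(nxt), nxt, depth + 1))
        return sorted(found, key=lambda t: -t[1])[:k]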
5. Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Authors: Shiyi Zhang, Yiji Cheng, Tiankai Hang
arxiv: arxiv.org/abs/2604.24625
Sources: HuggingFace Daily Papers (23 upvotes), arxiv cs.CV
Summary: Meta-CoT investigates what forms of Chain-of-Thought reasoning and training strategies can jointly enhance understanding granularity and generalization in unified multimodal models for image editing. The approach incorporates fine-grained structured CoT into the training process to improve both spatial understanding and out-of-distribution generalization for editing tasks.
Why trending: CoT has transformed text reasoning; this explores systematic extension to visual generation/editing—a timely question as unified multimodal models mature.
6. Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement
Authors: Jiayi Guo, Linqing Wang, Jiangshan Wang
arxiv: arxiv.org/abs/2604.25636
Sources: HuggingFace Daily Papers (22 upvotes), arxiv cs.CV
Summary: Unified multimodal models (UMMs) can refine their own T2I outputs, but current refinement-via-editing (RvE) paradigms restrict the modification space to small edits. This paper proposes refinement via regeneration (RvR), where the UMM regenerates the full image conditioned on its prior output, dramatically enlarging the modification space and consistently outperforming RvE baselines.
Why trending: Clever insight that regeneration > editing for iterative image refinement; directly applicable to production UMM pipelines.
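Code sketch: The RvE-vs-RvR distinction is easy to state in code. A schematic comparison, assuming hypothetical generate, edit, and score callables standing in for a unified multimodal model:

    def refine_via_editing(prompt, generate, edit, rounds=3):
        img = generate(prompt)
        for _ in range(rounds):
            img = edit(img, prompt)      # modification space limited to local edits
        return img

    def refine_via_regeneration(prompt, generate, score, rounds=3):
        best = generate(prompt)
        for _ in range(rounds):
            # regenerate the whole image conditioned on the previous attempt,
            # enlarging the modification space beyond local edits
            candidate = generate(prompt, condition=best)
            if score(candidate, prompt) > score(best, prompt):
                best = candidate
        return best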
7. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Generation
Authors: Yupeng Zhou, Lianghua Huang, Zhifan Wu
arxiv: arxiv.org/abs/2604.25819
Sources: HuggingFace Daily Papers (13 upvotes), arxiv cs.CV
Summary: Mutual Forcing introduces a framework for fast autoregressive audio-video character generation that achieves long-horizon audio-video synchronization. It uses a two-stage training strategy (uni-modal pretraining then coupled joint generation) with a dual-mode self-evolution mechanism where audio and video generators mutually constrain each other during inference.
Why trending: Joint audio-video generation with tight synchronization at scale is an unsolved problem; dual-mode self-evolution is a novel training paradigm.
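Code sketch: The inference-time mutual constraint can be sketched as interleaved autoregressive decoding where each stream conditions on the other's latest chunks. Both step functions below are hypothetical stand-ins for the paper's generators:

    def generate_av(audio_step, video_step, steps=100):
        # audio_step/video_step: take (own history, other stream's history)
        # and return the next chunk of their modality
        audio, video = [], []
        for _ in range(steps):
            a = audio_step(audio, video)          # audio constrained by video so far
            v = video_step(video, audio + [a])    # video constrained by the fresh audio
            audio.append(a)
            video.append(v)
        return audio, video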
8. Step-Audio-R1.5 Technical Report
Authors: Yuxin Zhang, Xiangyu Tony Zhang, Daijiao Liu
arxiv: arxiv.org/abs/2604.25719
Sources: HuggingFace Daily Papers (12 upvotes), arxiv cs.SD/cs.CL
Summary: Step-Audio-R1.5 extends Chain-of-Thought reasoning into the auditory domain for large audio language models, enabling complex acoustic and spoken language understanding via extended reasoning chains. The report details how the model moves beyond RLVR-only paradigms that dominate text-based reasoning to achieve robust spoken-task performance.
Why trending: Audio reasoning models are a rapidly growing niche; this is a technical report from a major lab showing CoT reasoning in audio is viable and practical.
9. Co-Director: Agentic Generative Video Storytelling
Authors: Yale Song, Yiwen Song, Nick Losier
arxiv: arxiv.org/abs/2604.24842
Sources: HuggingFace Daily Papers (8 upvotes), arxiv cs.CV/cs.AI
Summary: Co-Director formalizes video storytelling as a global optimization problem and addresses semantic drift and cascading failures that plague current chained-module agentic pipelines. A hierarchical multi-agent framework with structured prompting and inter-agent feedback enables coherent long-form video narrative generation from diffusion models.
Why trending: Long-form coherent video generation is an emerging frontier; hierarchical multi-agent control is a practical approach that addresses real production failures.
10. BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate
Authors: Arnon Mazza, Elad Levi
arxiv: arxiv.org/abs/2604.25203
Sources: HuggingFace Daily Papers (6 upvotes), arxiv cs.AI
Summary: BARRED (Boundary Alignment via Reinforced Reasoning and Debate) trains custom policy guardrails synthetically by using asymmetric debate between LLMs to generate labeled boundary cases, eliminating the need for expensive human annotation. The resulting guardrails are both more accurate and more efficient at inference than prompting-based approaches or generic safety models.
Why trending: Custom safety guardrails are a real deployment bottleneck; synthetic debate-based labeling is a practical solution that scales without human annotators.
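Code sketch: The synthetic-labeling loop can be pictured as a debate: one LLM argues a case violates the policy, another argues it does not, and a judge resolves the boundary label. The exact asymmetry between the debaters is the paper's contribution and is not reproduced here; the llm callable and prompts are placeholders:

    def debate_label(case, policy, llm):
        # llm(prompt) -> str, standing in for any chat-completion call
        pro = llm(f"Argue that this case violates the policy.\nPolicy: {policy}\nCase: {case}")
        con = llm(f"Argue that this case does NOT violate the policy.\nPolicy: {policy}\nCase: {case}")
        verdict = llm(
            "You are the judge. Given both arguments, answer VIOLATION or ALLOWED.\n"
            f"Policy: {policy}\nCase: {case}\nPro: {pro}\nCon: {con}"
        )
        return case, "VIOLATION" in verdict   # synthetic label, no human annotation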
11. Toward Scalable Terminal Task Synthesis via Skill Graphs
Authors: Zhiyuan Fan, Tinghao Yu, Yuanjun Cai
arxiv: arxiv.org/abs/2604.25727
Sources: HuggingFace Daily Papers (6 upvotes), arxiv cs.AI
Summary: This work addresses the scarcity of diverse, high-quality trajectories for training terminal agents by organizing command-line skills into structured graphs. Skill graphs enable synthesis of diverse, compositional task instances for trajectory sampling, going beyond simple task count scaling to improve structural diversity and difficulty distribution.
Why trending: Terminal agents are increasingly used in agentic systems; training data scarcity is a real bottleneck that this addresses with a principled structural approach.
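Code sketch: Compositional task synthesis over a skill graph can be pictured as sampling walks whose edges mark valid skill chaining; longer walks give harder, more structurally diverse tasks. A toy sketch with an invented graph:

    import random

    SKILL_GRAPH = {
        # invented edges: skill -> skills that can plausibly follow it
        "find": ["grep", "xargs"],
        "grep": ["sort", "wc"],
        "sort": ["uniq", "head"],
        "uniq": ["wc"],
    }

    def synthesize_task(start="find", max_len=4):
        walk = [start]
        while len(walk) < max_len and walk[-1] in SKILL_GRAPH:
            walk.append(random.choice(SKILL_GRAPH[walk[-1]]))
        return " | ".join(walk)              # a compositional command-line task

    print(synthesize_task())                 # e.g. "find | grep | sort | uniq"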
12. TCOD: Temporal Curriculum in On-Policy Distillation for Multi-Turn Autonomous Agents
Authors: Jiaqi Wang, Wenhao Zhang, Weijie Shi
arxiv: arxiv.org/abs/2604.24005
Sources: HuggingFace Daily Papers (6 upvotes), arxiv cs.AI/cs.LG
Summary: TCOD identifies a key failure mode of vanilla on-policy distillation (OPD) in multi-turn agent settings—error accumulation from temporal dependencies—and addresses it through a temporal curriculum that progressively increases task horizon during training. This enables reliable reasoning transfer from frontier models to smaller student agents for sequential tasks.
Why trending: Multi-turn agent distillation is underexplored compared to single-turn; temporal curriculum is an intuitive and effective fix for a real training failure mode.
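Code sketch: The temporal curriculum is simple to express: cap the task horizon early in training and grow it as the student stabilizes, so errors cannot compound across turns before the student can handle them. A schematic sketch with hypothetical rollout and distill_step callables:

    def tcod_train(student, teacher, tasks, rollout, distill_step,
                   start_horizon=2, max_horizon=16, epochs_per_stage=1):
        horizon = start_horizon
        while horizon <= max_horizon:        # temporal curriculum over task horizon
            for _ in range(epochs_per_stage):
                for task in tasks:
                    # on-policy: the student acts, the teacher supervises each turn
                    traj = rollout(student, task, max_turns=horizon)
                    distill_step(student, teacher, traj)
            horizon *= 2                     # progressively longer horizons
        return student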
13. MAIC-UI: Making Interactive Courseware with Generative UI
Authors: Shangqing Tu, Yanjia Li, Keyu Chen
arxiv: arxiv.org/abs/2604.25806
Sources: HuggingFace Daily Papers (4 upvotes), arxiv cs.AI/cs.HC
Summary: MAIC-UI enables educators without coding expertise to create interactive STEM simulations (rather than static presentations) using generative UI techniques. It counteracts LLMs' tendency to generate non-interactive HTML and introduces mechanisms to ensure pedagogical accuracy.
Why trending: Practical application of generative UI for education; addresses a real bottleneck (most AI code-gen tools produce static outputs) with a domain-specific solution.
14. V-GRPO: Online Reinforcement Learning for Denoising Generative Models
Authors: Bingda Tang, Yuhui Zhang, Xiaohan Wang
arxiv: arxiv.org/abs/2604.23380
Sources: HuggingFace Daily Papers (2 upvotes), arxiv cs.LG, ICLR 2026
Summary: V-GRPO provides a principled online RL framework for aligning denoising generative models (diffusion/flow) with human preferences or verifiable rewards, addressing the intractable likelihood problem that limits direct policy-gradient application. It demonstrates that online RL for diffusion models is more straightforward than previously assumed when using a variance-reduced group policy optimization approach.
Why trending: ICLR 2026 paper; aligning diffusion models with RL is an active research thrust with direct implications for safe and controllable image/video generation.
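Code sketch: The group-relative ingredient is easy to illustrate: sample a group of generations per prompt and use within-group reward standardization as the advantage, avoiding a learned value baseline. A minimal NumPy sketch of the advantage computation only; the denoising-specific likelihood handling is the paper's contribution and is not reproduced here:

    import numpy as np

    def group_relative_advantages(rewards, eps=1e-6):
        # rewards: (n_prompts, group_size) rewards for sampled generations
        rewards = np.asarray(rewards, dtype=float)
        mean = rewards.mean(axis=1, keepdims=True)
        std = rewards.std(axis=1, keepdims=True)
        return (rewards - mean) / (std + eps)   # per-group standardized advantage

    print(group_relative_advantages([[0.1, 0.5, 0.9], [1.0, 1.0, 0.0]]))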
15. IAM: Identity-Aware Human Motion and Shape Joint Generation
Authors: Wenqi Jia, Zekun Li, Abhay Mittal
arxiv: arxiv.org/abs/2604.25164
Sources: HuggingFace Daily Papers (1 upvote), arxiv cs.CV/cs.GR
Summary: IAM generates human motion sequences jointly with body shape, explicitly modeling how body morphology influences motion dynamics—a factor ignored by most existing text-to-motion models that assume identity-neutral canonical representations. This enables personalized and physically plausible human animation.
Why trending: Personalized avatar animation for games/film/VR is a growing application area; joint motion+shape modeling is the right step toward realistic character animation.
16. A Systematic Post-Train Framework for Video Generation
Authors: Zeyue Xue, Siming Fu, Jie Huang
arxiv: arxiv.org/abs/2604.25427
Sources: HuggingFace Daily Papers (1 upvote), arxiv cs.CV
Summary: This paper bridges the gap between video diffusion model pretraining performance and real-world deployment requirements by introducing a systematic post-training framework that addresses prompt sensitivity, temporal inconsistency, and prohibitive inference cost. The framework provides structured recipes for adapting large pretrained video models to deployment constraints.
Why trending: Video generation deployment is a pressing industry challenge; systematic post-training approaches are immediately actionable for practitioners.
17. GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction
Authors: Hongxin Li, Yuntao Chen, Zhaoxiang Zhang
arxiv: arxiv.org/abs/2604.23941
Sources: HuggingFace Daily Papers (1 upvote), arxiv cs.CV/cs.AI
Summary: GoClick is a lightweight model for GUI element grounding (locating UI elements from natural language instructions) optimized for resource-constrained deployment on mobile devices. It achieves competitive accuracy with significantly lower latency compared to cloud-based GUI agents, enabling on-device autonomous GUI interaction.
Why trending: On-device GUI automation is a key enabler for personal AI agents; GoClick’s efficiency-accuracy tradeoff addresses a practical deployment constraint.
18. AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
Authors: Hongxin Li, Xiping Wang, Jingran Su
arxiv: arxiv.org/abs/2604.24441
Sources: HuggingFace Daily Papers (1 upvote), arxiv cs.CV/cs.AI
Summary: AutoGUI-v2 goes beyond reactive element matching to evaluate agents’ predictive mental model of interface dynamics and ability to foresee digital world state changes resulting from GUI interactions. This tests true digital autonomy rather than screenshot-based element identification.
Why trending: Comprehensive GUI benchmark addressing the gap between reactive matching and genuine interface understanding—a prerequisite for reliable digital agents.
19. Preferences of a Voice-First Nation: Large-Scale Pairwise TTS Evaluation for Indian Languages
Authors: Srija Anand, Ashwin Sankar, Ishvinder Sethi
arxiv: arxiv.org/abs/2604.21481
Sources: HuggingFace Daily Papers (1 upvote), arxiv cs.SD/cs.CL
Summary: This paper presents a controlled multidimensional pairwise evaluation framework for multilingual TTS across Indian languages, addressing the high variance introduced by linguistic diversity and multidimensional speech perception in crowdsourced evaluation. The large-scale study provides preference data to guide TTS development for underrepresented language communities.
Why trending: Indian language NLP/speech is a significant underserved area; the evaluation framework addresses a methodological gap in multilingual TTS assessment.
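Code sketch: Pairwise preference data like this is commonly aggregated into per-system scores with a Bradley–Terry model; a minimal sketch of that standard aggregation, not the paper's specific protocol:

    import numpy as np

    def bradley_terry(wins, iters=200):
        # wins[i][j]: times system i beat system j in pairwise listening tests
        wins = np.asarray(wins, dtype=float)
        strengths = np.ones(wins.shape[0])
        for _ in range(iters):               # MM updates for BT strengths
            games = wins + wins.T            # total comparisons between i and j
            denom = (games / (strengths[:, None] + strengths[None, :])).sum(axis=1)
            strengths = wins.sum(axis=1) / np.maximum(denom, 1e-12)
            strengths /= strengths.sum()     # normalize for identifiability
        return strengths

    print(bradley_terry([[0, 8, 6], [2, 0, 5], [4, 5, 0]]))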
20. Seeing Isn’t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
Authors: Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand
arxiv: arxiv.org/abs/2604.21523
Sources: HuggingFace Daily Papers (1 upvote), arxiv cs.CV/cs.AI
Summary: This paper systematically evaluates the reliability of Vision-Language Models when used as automatic evaluators for image-to-text and text-to-image tasks, uncovering consistent blind spots where Evaluator VLMs fail to detect errors that humans catch. The findings challenge the growing practice of using VLMs as drop-in replacements for human evaluation.
Why trending: VLM-as-judge is increasingly used in production pipelines; revealing systematic blind spots has immediate implications for evaluation reliability and model safety.
