Daily AI Papers — April 29, 2026
1. Recursive Multi-Agent Systems (RecursiveMAS)
Authors: Xiyuan Yang, Jiaru Zou, Rui Pan, Ruizhong Qiu, Pan Lu, Shizhe Diao, Jindong Jiang, Hanghang Tong, Tong Zhang, Markus J. Buehler, Jingrui He, James Zou
arxiv: arxiv.org/abs/2604.25917
Sources: HuggingFace Daily Papers (117 upvotes), arxiv cs.AI
Summary: RecursiveMAS extends the recursive/looped computation scaling principle from single models to multi-agent systems, connecting heterogeneous agents in a latent-space collaboration loop via a lightweight RecursiveLink module. An inner-outer loop learning algorithm enables whole-system co-optimization through shared gradient-based credit assignment, yielding an 8.3% average accuracy improvement, a 1.2–2.4× inference speedup, and a 34.6–75.6% token reduction across 9 benchmarks spanning math, science, medicine, search, and code generation.
Why trending: Top-voted HuggingFace paper of the day (117 upvotes) from Stanford. Addresses a core open question in MAS: can agent collaboration itself be a scaling axis? Strong empirical results across diverse benchmarks.
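Code sketch: A minimal PyTorch illustration of the latent-space collaboration loop described above, in which several agent networks exchange hidden states through a small shared link module for a fixed number of recursion steps. Module internals and shapes are guesses for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class RecursiveLink(nn.Module):
        # Hypothetical lightweight mixer of per-agent latents into a shared
        # context vector (name from the paper, internals invented).
        def __init__(self, dim):
            super().__init__()
            self.mix = nn.Linear(dim, dim)

        def forward(self, latents):          # latents: (n_agents, dim)
            return self.mix(latents.mean(dim=0))

    class RecursiveMAS(nn.Module):
        def __init__(self, n_agents=3, dim=64, steps=4):
            super().__init__()
            self.agents = nn.ModuleList(nn.GRUCell(dim, dim) for _ in range(n_agents))
            self.link = RecursiveLink(dim)
            self.steps = steps

        def forward(self, x):                # x: (dim,) task embedding
            h = torch.zeros(len(self.agents), x.shape[-1])
            for _ in range(self.steps):      # inner collaboration loop
                ctx = self.link(h)           # shared latent context
                h = torch.stack([agent((ctx + x).unsqueeze(0), hi.unsqueeze(0)).squeeze(0)
                                 for agent, hi in zip(self.agents, h)])
            return h.mean(dim=0)             # pooled system output

    model = RecursiveMAS()
    loss = model(torch.randn(64)).pow(2).sum()  # stand-in task loss
    loss.backward()  # outer loop: one shared gradient-based credit-assignment step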
2. Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs
Authors: Chenkai Pan, Xinglong Xu, Yuhang Xu, Yujun Wu, Siyuan Li, Jintao Chen, Conghui He, Jingxuan Wei, Cheng Tan
arxiv: arxiv.org/abs/2604.24819
Sources: HuggingFace Daily Papers (69 upvotes), arxiv cs.AI, OpenDataLab
Summary: This paper maps the LLM fine-tuning lifecycle onto the software development lifecycle: training data = source code, model training = compilation, benchmarking = unit testing, failure-driven data repair = debugging. The framework enables systematic diagnosis and targeted patching of model failures at the concept and reasoning-chain level, releasing a structured knowledge base, benchmark suite, and training corpus across 16 disciplines without degrading general capabilities.
Why trending: Novel framing that treats LLM training data as debuggable software. Comes with open resources across 16 domains. Addresses the feedback-loop problem in fine-tuning that has long frustrated practitioners.
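Code sketch: The software analogy suggests a concrete loop: run the benchmark "unit tests," group failures by concept, patch the data where tests fail, and retrain. A schematic Python sketch with hypothetical callables, not the paper's released tooling:

    from collections import defaultdict

    def data_repair_loop(model, train_fn, test_cases, patch_data_fn, max_rounds=3):
        # test_cases: list of (concept, input, expected) benchmark "unit tests"
        for _ in range(max_rounds):
            failures = defaultdict(list)
            for concept, x, expected in test_cases:
                if model(x) != expected:            # a "unit test" fails
                    failures[concept].append((x, expected))
            if not failures:
                break                               # all tests pass
            for concept, cases in failures.items():
                patch_data_fn(concept, cases)       # "debugging": repair the data
            model = train_fn()                      # "recompilation": retrain
        return model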
3. DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
Authors: Jinxiang Meng, Shaoping Huang, Fangyu Lei, Jingyu Guo, Haoxiang Liu, Jiahao Su, Conghui He, Jingxuan Wei, Cheng Tan
arxiv: arxiv.org/abs/2604.25914
Sources: HuggingFace Daily Papers (36 upvotes), arxiv cs.AI
Summary: DV-World is a benchmark of 260 tasks designed to evaluate data visualization agents in realistic environments, covering native environmental grounding, cross-platform evolution (not just code-sandbox creation), and proactive intent alignment when user intent is imperfect. It addresses three key gaps in existing DV benchmarks: code-sandbox confinement, single-language creation-only tasks, and assumed perfect intent.
Why trending: Fills a practical gap in agentic benchmarking. Most existing DV benchmarks test code generation in sandboxes; this tests agents in real environments with ambiguous intent.
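Code sketch: A hypothetical task record capturing the three axes the benchmark targets; the real schema is defined in the paper, and every field name below is a guess:

    from dataclasses import dataclass, field

    @dataclass
    class DVTask:
        environment: str                 # native grounding, e.g. a real notebook or BI tool
        source_platform: str | None      # set for cross-platform evolution tasks
        instruction: str                 # possibly imperfect user intent
        data_files: list[str] = field(default_factory=list)
        needs_clarification: bool = False  # proactive intent-alignment tasks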
4. AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
Authors: Lei Xiong, Kun Luo, Ziyi Xia
arxiv: arxiv.org/abs/2604.25256
Sources: HuggingFace Daily Papers (24 upvotes), arxiv cs.AI
Summary: AutoResearchBench evaluates AI agents’ capability in complex scientific literature discovery—a key step in autonomous research pipelines—testing whether agents can find the right papers for knowledge exploration, evidence gathering, and claim verification. The benchmark tests multi-hop reasoning over literature graphs and realistic research workflows.
Why trending: Autonomous scientific research is a high-interest area; this provides infrastructure to measure how well current agents actually navigate scientific literature.
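Code sketch: Multi-hop literature discovery can be pictured as guided traversal of a citation graph. A toy best-first search with an invented graph format and a stand-in relevance function; nothing here comes from the benchmark's actual harness:

    import heapq

    def multi_hop_search(graph, seed, relevance, hops=2, k=5):
        # graph: dict paper_id -> list of cited/related paper_ids
        # relevance: paper_id -> float, standing in for an agent's judgment
        frontier = [(-relevance(seed), seed, 0)]
        seen, found = {seed}, []
        while frontier:
            neg_score, paper, depth = heapq.heappop(frontier)
            found.append((paper, -neg_score))
            if depth == hops:
                continue                 # stop expanding past the hop budget
            for nxt in graph.get(paper, []):
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(frontier, (-relevance(nxt), nxt, depth + 1))
        return sorted(found, key=lambda t: -t[1])[:k]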
5. Meta-CoT: Enhancing Granularity and Generalization in Image Editing
Authors: Shiyi Zhang, Yiji Cheng, Tiankai Hang
arxiv: arxiv.org/abs/2604.24625
Sources: HuggingFace Daily Papers (23 upvotes), arxiv cs.CV
Summary: Meta-CoT investigates what forms of Chain-of-Thought reasoning and training strategies can jointly enhance understanding granularity and generalization in unified multimodal models for image editing. The approach incorporates fine-grained structured CoT into the training process to improve both spatial understanding and out-of-distribution generalization for editing tasks.
Why trending: CoT has transformed text reasoning; this explores systematic extension to visual generation/editing—a timely question as unified multimodal models mature.
6. Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement
Authors: Jiayi Guo, Linqing Wang, Jiangshan Wang
arxiv: arxiv.org/abs/2604.25636
Sources: HuggingFace Daily Papers (22 upvotes), arxiv cs.CV
Summary: Unified multimodal models (UMMs) can refine their own T2I outputs, but current refinement-via-editing (RvE) paradigms restrict the modification space to small edits. This paper proposes refinement via regeneration (RvR), where the UMM regenerates the full image conditioned on its prior output, dramatically enlarging the modification space and consistently outperforming RvE baselines.
Why trending: Clever insight that regeneration > editing for iterative image refinement; directly applicable to production UMM pipelines.
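Code sketch: The RvE-vs-RvR distinction is easy to state in code. A schematic comparison, assuming hypothetical generate, edit, and score callables standing in for a unified multimodal model:

    def refine_via_editing(prompt, generate, edit, rounds=3):
        img = generate(prompt)
        for _ in range(rounds):
            img = edit(img, prompt)      # modification space limited to local edits
        return img

    def refine_via_regeneration(prompt, generate, score, rounds=3):
        best = generate(prompt)
        for _ in range(rounds):
            # regenerate the whole image conditioned on the previous attempt,
            # enlarging the modification space beyond local edits
            candidate = generate(prompt, condition=best)
            if score(candidate, prompt) > score(best, prompt):
                best = candidate
        return best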
7. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Generation
Authors: Yupeng Zhou, Lianghua Huang, Zhifan Wu
arxiv: arxiv.org/abs/2604.25819
Sources: HuggingFace Daily Papers (13 upvotes), arxiv cs.CV
Summary: Mutual Forcing introduces a framework for fast autoregressive audio-video character generation that achieves long-horizon audio-video synchronization. It uses a two-stage training strategy (uni-modal pretraining then coupled joint generation) with a dual-mode self-evolution mechanism where audio and video generators mutually constrain each other during inference.
Why trending: Joint audio-video generation with tight synchronization at scale is an unsolved problem; dual-mode self-evolution is a novel training paradigm.
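Code sketch: The inference-time mutual constraint can be sketched as interleaved autoregressive decoding where each stream conditions on the other's latest chunks. Both step functions below are hypothetical stand-ins for the paper's generators:

    def generate_av(audio_step, video_step, steps=100):
        # audio_step/video_step: take (own history, other stream's history)
        # and return the next chunk of their modality
        audio, video = [], []
        for _ in range(steps):
            a = audio_step(audio, video)          # audio constrained by video so far
            v = video_step(video, audio + [a])    # video constrained by the fresh audio
            audio.append(a)
            video.append(v)
        return audio, video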
8. Step-Audio-R1.5 Technical Report
Authors: Yuxin Zhang, Xiangyu Tony Zhang, Daijiao Liu
arxiv: arxiv.org/abs/2604.25719
Sources: HuggingFace Daily Papers (12 upvotes), arxiv cs.SD/cs.CL
Summary: Step-Audio-R1.5 extends Chain-of-Thought reasoning into the auditory domain for large audio language models, enabling complex acoustic and spoken language understanding via extended reasoning chains. The report details how the model moves beyond RLVR-only paradigms that dominate text-based reasoning to achieve robust spoken-task performance.
Why trending: Audio reasoning models are a rapidly growing niche; this is a technical report from a major lab showing CoT reasoning in audio is viable and practical.
9. Co-Director: Agentic Generative Video Storytelling
Authors: Yale Song, Yiwen Song, Nick Losier
arxiv: arxiv.org/abs/2604.24842
Sources: HuggingFace Daily Papers (8 upvotes), arxiv cs.CV/cs.AI
Summary: Co-Director formalizes video storytelling as a global optimization problem and addresses semantic drift and cascading failures that plague current chained-module agentic pipelines. A hierarchical multi-agent framework with structured prompting and inter-agent feedback enables coherent long-form video narrative generation from diffusion models.
Why trending: Long-form coherent video generation is an emerging frontier; hierarchical multi-agent control is a practical approach that addresses real production failures.
10. BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate
Authors: Arnon Mazza, Elad Levi
arxiv: arxiv.org/abs/2604.25203
Sources: HuggingFace Daily Papers (6 upvotes), arxiv cs.AI
Summary: BARRED (Boundary Alignment via Reinforced Reasoning and Debate) trains custom policy guardrails synthetically by using asymmetric debate between LLMs to generate labeled boundary cases, eliminating the need for expensive human annotation. The resulting guardrails are both more accurate and more efficient at inference than prompting-based approaches or generic safety models.
Why trending: Custom safety guardrails are a real deployment bottleneck; synthetic debate-based labeling is a practical solution that scales without human annotators.
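Code sketch: The synthetic-labeling loop can be pictured as a debate: one LLM argues a case violates the policy, another argues it does not, and a judge resolves the boundary label. The exact asymmetry between the debaters is the paper's contribution and is not reproduced here; the llm callable and prompts are placeholders:

    def debate_label(case, policy, llm):
        # llm(prompt) -> str, standing in for any chat-completion call
        pro = llm(f"Argue that this case violates the policy.\nPolicy: {policy}\nCase: {case}")
        con = llm(f"Argue that this case does NOT violate the policy.\nPolicy: {policy}\nCase: {case}")
        verdict = llm(
            "You are the judge. Given both arguments, answer VIOLATION or ALLOWED.\n"
            f"Policy: {policy}\nCase: {case}\nPro: {pro}\nCon: {con}"
        )
        return case, "VIOLATION" in verdict   # synthetic label, no human annotation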
11. Toward Scalable Terminal Task Synthesis via Skill Graphs
Authors: Zhiyuan Fan, Tinghao Yu, Yuanjun Cai
arxiv: arxiv.org/abs/2604.25727
Sources: HuggingFace Daily Papers (6 upvotes), arxiv cs.AI
Summary: This work addresses the scarcity of diverse, high-quality trajectories for training terminal agents by organizing command-line skills into structured graphs. Skill graphs enable synthesis of diverse, compositional task instances for trajectory sampling, going beyond simple task count scaling to improve structural diversity and difficulty distribution.
Why trending: Terminal agents are increasingly used in agentic systems; training data scarcity is a real bottleneck that this addresses with a principled structural approach.
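Code sketch: Compositional task synthesis over a skill graph can be pictured as sampling walks whose edges mark valid skill chaining; longer walks give harder, more structurally diverse tasks. A toy sketch with an invented graph:

    import random

    SKILL_GRAPH = {
        # invented edges: skill -> skills that can plausibly follow it
        "find": ["grep", "xargs"],
        "grep": ["sort", "wc"],
        "sort": ["uniq", "head"],
        "uniq": ["wc"],
    }

    def synthesize_task(start="find", max_len=4):
        walk = [start]
        while len(walk) < max_len and walk[-1] in SKILL_GRAPH:
            walk.append(random.choice(SKILL_GRAPH[walk[-1]]))
        return " | ".join(walk)              # a compositional command-line task

    print(synthesize_task())                 # e.g. "find | grep | sort | uniq"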
12. TCOD: Temporal Curriculum in On-Policy Distillation for Multi-Turn Autonomous Agents
Authors: Jiaqi Wang, Wenhao Zhang, Weijie Shi
arxiv: arxiv.org/abs/2604.24005
Sources: HuggingFace Daily Papers (6 upvotes), arxiv cs.AI/cs.LG
Summary: TCOD identifies a key failure mode of vanilla on-policy distillation (OPD) in multi-turn agent settings—error accumulation from temporal dependencies—and addresses it through a temporal curriculum that progressively increases task horizon during training. This enables reliable reasoning transfer from frontier models to smaller student agents for sequential tasks.
Why trending: Multi-turn agent distillation is underexplored compared to single-turn; temporal curriculum is an intuitive and effective fix for a real training failure mode.
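Code sketch: The temporal curriculum is simple to express: cap the task horizon early in training and grow it as the student stabilizes, so errors cannot compound across turns before the student can handle them. A schematic sketch with hypothetical rollout and distill_step callables:

    def tcod_train(student, teacher, tasks, rollout, distill_step,
                   start_horizon=2, max_horizon=16, epochs_per_stage=1):
        horizon = start_horizon
        while horizon <= max_horizon:        # temporal curriculum over task horizon
            for _ in range(epochs_per_stage):
                for task in tasks:
                    # on-policy: the student acts, the teacher supervises each turn
                    traj = rollout(student, task, max_turns=horizon)
                    distill_step(student, teacher, traj)
            horizon *= 2                     # progressively longer horizons
        return student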
13. MAIC-UI: Making Interactive Courseware with Generative UI
Authors: Shangqing Tu, Yanjia Li, Keyu Chen
arxiv: arxiv.org/abs/2604.25806
Sources: HuggingFace Daily Papers (4 upvotes), arxiv cs.AI/cs.HC
Summary: MAIC-UI enables educators without coding expertise to create interactive STEM simulations (rather than static presentations) using generative UI techniques. It counteracts LLMs' tendency to generate non-interactive HTML and introduces mechanisms to ensure pedagogical accuracy.
Why trending: Practical application of generative UI for education; addresses a real bottleneck (most AI code-gen tools produce static outputs) with a domain-specific solution.
14. V-GRPO: Online Reinforcement Learning for Denoising Generative Models
Authors: Bingda Tang, Yuhui Zhang, Xiaohan Wang
arxiv: arxiv.org/abs/2604.23380
Sources: HuggingFace Daily Papers (2 upvotes), arxiv cs.LG, ICLR 2026
Summary: V-GRPO provides a principled online RL framework for aligning denoising generative models (diffusion/flow) with human preferences or verifiable rewards, addressing the intractable likelihood problem that limits direct policy-gradient application. It demonstrates that online RL for diffusion models is more straightforward than previously assumed when using a variance-reduced group policy optimization approach.
Why trending: ICLR 2026 paper; aligning diffusion models with RL is an active research thrust with direct implications for safe and controllable image/video generation.
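Code sketch: The group-relative ingredient is easy to illustrate: sample a group of generations per prompt and use within-group reward standardization as the advantage, avoiding a learned value baseline. A minimal NumPy sketch of the advantage computation only; the denoising-specific likelihood handling is the paper's contribution and is not reproduced here:

    import numpy as np

    def group_relative_advantages(rewards, eps=1e-6):
        # rewards: (n_prompts, group_size) rewards for sampled generations
        rewards = np.asarray(rewards, dtype=float)
        mean = rewards.mean(axis=1, keepdims=True)
        std = rewards.std(axis=1, keepdims=True)
        return (rewards - mean) / (std + eps)   # per-group standardized advantage

    print(group_relative_advantages([[0.1, 0.5, 0.9], [1.0, 1.0, 0.0]]))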
15. IAM: Identity-Aware Human Motion and Shape Joint Generation
Authors: Wenqi Jia, Zekun Li, Abhay Mittal
arxiv: arxiv.org/abs/2604.25164
Sources: HuggingFace Daily Papers (1 upvote), arxiv cs.CV/cs.GR
Summary: IAM generates human motion sequences jointly with body shape, explicitly modeling how body morphology influences motion dynamics—a factor ignored by most existing text-to-motion models that assume identity-neutral canonical representations. This enables personalized and physically plausible human animation.
Why trending: Personalized avatar animation for games/film/VR is a growing application area; joint motion+shape modeling is the right step toward realistic character animation.
16. A Systematic Post-Train Framework for Video Generation
Authors: Zeyue Xue, Siming Fu, Jie Huang
arxiv: arxiv.org/abs/2604.25427
Sources: HuggingFace Daily Papers (1 upvote), arxiv cs.CV
Summary: This paper bridges the gap between video diffusion model pretraining performance and real-world deployment requirements by introducing a systematic post-training framework that addresses prompt sensitivity, temporal inconsistency, and prohibitive inference cost. The framework provides structured recipes for adapting large pretrained video models to deployment constraints.
Why trending: Video generation deployment is a pressing industry challenge; systematic post-training approaches are immediately actionable for practitioners.
17. GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction
Authors: Hongxin Li, Yuntao Chen, Zhaoxiang Zhang
arxiv: arxiv.org/abs/2604.23941
Sources: HuggingFace Daily Papers (1 upvote), arxiv cs.CV/cs.AI
Summary: GoClick is a lightweight model for GUI element grounding (locating UI elements from natural language instructions) optimized for resource-constrained deployment on mobile devices. It achieves competitive accuracy with significantly lower latency compared to cloud-based GUI agents, enabling on-device autonomous GUI interaction.
Why trending: On-device GUI automation is a key enabler for personal AI agents; GoClick’s efficiency-accuracy tradeoff addresses a practical deployment constraint.
18. AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
Authors: Hongxin Li, Xiping Wang, Jingran Su
arxiv: arxiv.org/abs/2604.24441
Sources: HuggingFace Daily Papers (1 upvote), arxiv cs.CV/cs.AI
Summary: AutoGUI-v2 goes beyond reactive element matching to evaluate agents’ predictive mental model of interface dynamics and ability to foresee digital world state changes resulting from GUI interactions. This tests true digital autonomy rather than screenshot-based element identification.
Why trending: Comprehensive GUI benchmark addressing the gap between reactive matching and genuine interface understanding—a prerequisite for reliable digital agents.
19. Preferences of a Voice-First Nation: Large-Scale Pairwise TTS Evaluation for Indian Languages
Authors: Srija Anand, Ashwin Sankar, Ishvinder Sethi
arxiv: arxiv.org/abs/2604.21481
Sources: HuggingFace Daily Papers (1 upvote), arxiv cs.SD/cs.CL
Summary: This paper presents a controlled multidimensional pairwise evaluation framework for multilingual TTS across Indian languages, addressing the high variance introduced by linguistic diversity and multidimensional speech perception in crowdsourced evaluation. The large-scale study provides preference data to guide TTS development for underrepresented language communities.
Why trending: Indian language NLP/speech is a significant underserved area; the evaluation framework addresses a methodological gap in multilingual TTS assessment.
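Code sketch: Pairwise preference data like this is commonly aggregated into per-system scores with a Bradley–Terry model; a minimal sketch of that standard aggregation, not the paper's specific protocol:

    import numpy as np

    def bradley_terry(wins, iters=200):
        # wins[i][j]: times system i beat system j in pairwise listening tests
        wins = np.asarray(wins, dtype=float)
        strengths = np.ones(wins.shape[0])
        for _ in range(iters):               # MM updates for BT strengths
            games = wins + wins.T            # total comparisons between i and j
            denom = (games / (strengths[:, None] + strengths[None, :])).sum(axis=1)
            strengths = wins.sum(axis=1) / np.maximum(denom, 1e-12)
            strengths /= strengths.sum()     # normalize for identifiability
        return strengths

    print(bradley_terry([[0, 8, 6], [2, 0, 5], [4, 5, 0]]))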
20. Seeing Isn’t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
Authors: Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand
arxiv: arxiv.org/abs/2604.21523
Sources: HuggingFace Daily Papers (1 upvote), arxiv cs.CV/cs.AI
Summary: This paper systematically evaluates the reliability of Vision-Language Models when used as automatic evaluators for image-to-text and text-to-image tasks, uncovering consistent blind spots where Evaluator VLMs fail to detect errors that humans catch. The findings challenge the growing practice of using VLMs as drop-in replacements for human evaluation.
Why trending: VLM-as-judge is increasingly used in production pipelines; revealing systematic blind spots has immediate implications for evaluation reliability and model safety.
