Daily AI Papers — April 21, 2026
1. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
- Authors: Chenxi Zhao, Chen Zhu, Xiaokun Feng, Aiming Hao, Jiashu Zhu et al. (9 authors)
- arXiv: arxiv.org/abs/2604.18168
- Summary: Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation.
- Sources: HuggingFace Daily Papers (85 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#1, 85 upvotes, 1 comment); strong community engagement; includes 1 media demo
2. OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
- Authors: Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li et al. (50 authors)
- arXiv: arxiv.org/abs/2604.18486
- Summary: Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts.
- Sources: HuggingFace Daily Papers (65 upvotes, 2 comments)
- Why trending: HuggingFace Daily Papers (#2, 65 upvotes, 2 comments); strong community engagement
3. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
- Authors: Guanting Dong, Junting Lu, Junjie Huang, Wanjun Zhong, Longxiang Liu et al. (20 authors)
- arXiv: arxiv.org/abs/2604.18292
- Summary: Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Context Protocol (MCP) and broader agent skills offer a unified interface for connecting agents with scalable real-world services, but training robust agents remains limited by the lack of realistic environments and principled mechanisms for life-long learning.
- Sources: HuggingFace Daily Papers (58 upvotes, 2 comments)
- Why trending: HuggingFace Daily Papers (#3, 58 upvotes, 2 comments); strong community engagement
4. OpenGame: Open Agentic Coding for Games
- Authors: Yilei Jiang, Jinyuan Hu, Qianyin Xiao, Yaozhi Zheng, Ruize Ma et al. (11 authors)
- arXiv: arxiv.org/abs/2604.18394
- Summary: Game development sits at the intersection of creative design and intricate software engineering, demanding the joint orchestration of game engines, real-time loops, and tightly coupled state across many files. While Large Language Models (LLMs) and code agents now solve isolated programming tasks with ease, they consistently stumble when asked to produce a fully playable game from a high-level design, collapsing under cross-file inconsistencies, broken scene wiring, and logical incoherence.
- Sources: HuggingFace Daily Papers (49 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#4, 49 upvotes, 1 comment); includes 1 media demo
5. MultiWorld: Scalable Multi-Agent Multi-View Video World Models
- Authors: Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu
- arXiv: arxiv.org/abs/2604.18564
- Summary: Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames.
- Sources: HuggingFace Daily Papers (35 upvotes, 2 comments)
- Why trending: HuggingFace Daily Papers (#5, 35 upvotes, 2 comments)
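The action-conditioned formulation described in the MultiWorld summary can be sketched as a minimal interface: a model that consumes a window of historical frames plus the current action and emits the next frame, rolled out autoregressively. All names and shapes below are hypothetical illustrations, not the paper's actual API; the "model" is a stub that just shifts the last frame so the sketch stays runnable.

```python
import numpy as np

class ToyWorldModel:
    """Stand-in for a learned action-conditioned video generator."""

    def __init__(self, context_len: int = 4):
        self.context_len = context_len  # number of historical frames kept

    def predict(self, frames: np.ndarray, action: np.ndarray) -> np.ndarray:
        # frames: (context_len, H, W, C); action: (action_dim,)
        # A real world model would run a learned generator here; this stub
        # simply offsets the last frame by the action's mean.
        return frames[-1] + action.mean()

def rollout(model, init_frames, actions):
    """Autoregressive rollout: each prediction is fed back as history."""
    frames = list(init_frames)
    for a in actions:
        context = np.stack(frames[-model.context_len :])
        frames.append(model.predict(context, a))
    return frames

model = ToyWorldModel()
history = [np.zeros((8, 8, 3)) for _ in range(4)]
trajectory = rollout(model, history, [np.ones(2), np.ones(2)])
```

The feedback loop in `rollout` is the key structural point: predicted frames re-enter the context window, which is what makes long-horizon (and, in the multi-agent multi-view setting, cross-view-consistent) prediction hard.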
6. EasyVideoR1: Easier RL for Video Understanding
- Authors: Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao et al. (9 authors)
- arXiv: arxiv.org/abs/2604.16893
- Summary: Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters.
- Sources: HuggingFace Daily Papers (32 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#6, 32 upvotes, 1 comment)
7. When Can LLMs Learn to Reason with Weak Supervision?
- Authors: Salman Rahman, Jingyan Shen, Anna Mordvina, Hamid Palangi, Saadia Gabriel et al. (6 authors)
- arXiv: arxiv.org/abs/2604.18574
- Summary: Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision.
- Sources: HuggingFace Daily Papers (18 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#7, 18 upvotes, 1 comment)
8. GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
- Authors: Wangjie Gan, Miao Pan, Linbo Xi, Wenqi Zhang, Jintao Chen et al. (7 authors)
- arXiv: arxiv.org/abs/2604.14258
- Summary: Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion.
- Sources: HuggingFace Daily Papers (18 upvotes, 2 comments)
- Why trending: HuggingFace Daily Papers (#8, 18 upvotes, 2 comments)
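The "sparse implicit reward with inverse-probability weighting" reading of SFT in the GFT summary corresponds to a standard identity, sketched here (notation mine, not the paper's): the SFT gradient on a demonstration y* equals a policy-gradient expectation whose reward is an indicator on y* scaled by 1/π, which is both extremely sparse and unbounded as π(y* | x) shrinks.

```latex
\nabla_\theta \log \pi_\theta(y^* \mid x)
  \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[
      \underbrace{\frac{\mathbf{1}\{y = y^*\}}{\pi_\theta(y \mid x)}}_{\text{implicit reward}}
      \,\nabla_\theta \log \pi_\theta(y \mid x)
    \right]
```

The expectation collapses because the indicator keeps only the y = y* term, whose probability cancels the 1/π weight; the 1/π factor is what blows up for low-probability demonstrations, matching the summary's gradient-explosion claim.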
9. WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
- Authors: Xinping Lei, Xinyu Che, Junqi Xiong, Chenchen Zhang, Yukai Huang et al. (19 authors)
- arXiv: arxiv.org/abs/2604.18224
- Summary: Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability.
- Sources: HuggingFace Daily Papers (17 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#9, 17 upvotes, 1 comment)
10. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
- Authors: Xirui Li, Ming Li, Derry Xu, Wei-Lin Chiang, Ion Stoica et al. (7 authors)
- arXiv: arxiv.org/abs/2604.18543
- Summary: Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand.
- Sources: HuggingFace Daily Papers (17 upvotes, 0 comments)
- Why trending: HuggingFace Daily Papers (#10, 17 upvotes, 0 comments)
11. SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
- Authors: Ziao Zhang, Kou Shi, Shiting Huang, Avery Nie, Yu Zeng et al. (16 authors)
- arXiv: arxiv.org/abs/2604.17308
- Summary: As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time.
- Sources: HuggingFace Daily Papers (15 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#11, 15 upvotes, 1 comment)
12. Crowded in B-Space: Calibrating Shared Directions for LoRA Merging
- Authors: Yixuan Tang, Yi Yang
- arXiv: arxiv.org/abs/2604.16826
- Summary: Merging separately trained LoRA adapters is a practical alternative to joint multi-task training, but it often hurts performance. Existing methods usually treat the LoRA update ΔW = BA as a single object and do not distinguish the two LoRA matrices.
- Sources: HuggingFace Daily Papers (14 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#12, 14 upvotes, 1 comment)
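Why treating ΔW = BA "as a single object" differs from handling B and A separately can be shown with a toy NumPy check (an illustration of the underlying algebra, not the paper's method, with hypothetical dimensions): averaging the full updates B₁A₁ and B₂A₂ is not the same as multiplying the averaged factors, because the factor product introduces cross terms B₁A₂ and B₂A₁.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 6, 5, 2  # hypothetical dims: output, input, LoRA rank

# Two separately trained LoRA adapters, each a low-rank update dW = B @ A.
B1, A1 = rng.normal(size=(d, r)), rng.normal(size=(r, k))
B2, A2 = rng.normal(size=(d, r)), rng.normal(size=(r, k))

# Merging the full updates treats each BA product as one object...
merged_dw = 0.5 * (B1 @ A1 + B2 @ A2)

# ...whereas averaging B and A separately yields a different matrix:
# (B1+B2)(A1+A2)/4 contains the cross terms B1@A2 and B2@A1.
merged_factors = (0.5 * (B1 + B2)) @ (0.5 * (A1 + A2))

gap = np.linalg.norm(merged_dw - merged_factors)  # nonzero in general
```

The nonzero gap is exactly the space in which the B-side directions of different adapters interact, which is where a B-space calibration method has room to operate.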
13. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
- Authors: Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei et al. (8 authors)
- arXiv: arxiv.org/abs/2604.18584
- Summary: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems.
- Sources: HuggingFace Daily Papers (6 upvotes, 0 comments)
- Why trending: HuggingFace Daily Papers (#13, 6 upvotes, 0 comments)
14. GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
- Authors: Jiaqing Liang, Jinyi Han, Weijia Li, Xinyi Wang, Zhoujia Zhang et al. (18 authors)
- arXiv: arxiv.org/abs/2604.17091
- Summary: Long-horizon large language model (LLM) agents are fundamentally limited by context. As interactions become longer, tool descriptions, retrieved memories, and raw environmental feedback accumulate and push out the information needed for decision-making.
- Sources: HuggingFace Daily Papers (6 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#14, 6 upvotes, 1 comment)
15. VoxMind: An End-to-End Agentic Spoken Dialogue System
- Authors: Tianle Liang, Yifu Chen, Shengpeng Ji, Yijun Chen, Zhiyang Jia et al. (10 authors)
- arXiv: arxiv.org/abs/2604.15710
- Summary: Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope with them.
- Sources: HuggingFace Daily Papers (6 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#15, 6 upvotes, 1 comment); includes 1 media demo
16. Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
- Authors: Eun Woo Im, Dhruv Madhwal, Vivek Gupta
- arXiv: arxiv.org/abs/2604.13313
- Summary: Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining.
- Sources: HuggingFace Daily Papers (6 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#16, 6 upvotes, 1 comment)
17. Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity
- Authors: Leon Engländer, Sophia Althammer, Ahmet Üstün, Matthias Gallé, Tom Sherborne
- arXiv: arxiv.org/abs/2604.17609
- Summary: LLM-based agents are assumed to integrate environmental observations into their reasoning: discovering highly relevant but unexpected information should naturally lead to a model exploiting its own discoveries. We show that this assumption is false for current LLM-based agents, which struggle to reflect or react to unexpected information.
- Sources: HuggingFace Daily Papers (5 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#17, 5 upvotes, 1 comment)
18. OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
- Authors: Junfu Pu, Yuxin Chen, Teng Wang, Ying Shan
- arXiv: arxiv.org/abs/2604.11102
- Summary: Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues.
- Sources: HuggingFace Daily Papers (5 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#18, 5 upvotes, 1 comment)
19. Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
- Authors: Qifan Zhang, Dongyang Ma, Tianqing Fang, Jia Li, Jing Tang et al. (8 authors)
- arXiv: arxiv.org/abs/2604.18131
- Summary: Most agents today "self-evolve" by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops.
- Sources: HuggingFace Daily Papers (4 upvotes, 0 comments)
- Why trending: HuggingFace Daily Papers (#19, 4 upvotes, 0 comments)
20. Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play
- Authors: Xiachong Feng, Deyi Yin, Xiaocheng Feng, Yi Jiang, Libo Qin et al. (12 authors)
- arXiv: arxiv.org/abs/2604.17696
- Summary: Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics.
- Sources: HuggingFace Daily Papers (4 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#20, 4 upvotes, 1 comment)
