Daily AI Papers — April 21, 2026
1. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
- Authors: Chenxi Zhao, Chen Zhu, Xiaokun Feng, Aiming Hao, Jiashu Zhu et al. (9 authors)
- arXiv: arxiv.org/abs/2604.18168
- Summary: Few-step generation has been a long-standing goal, with recent one-step generation methods exemplified by MeanFlow achieving remarkable results. Existing research on MeanFlow primarily focuses on class-to-image generation.
- Sources: HuggingFace Daily Papers (85 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#1, 85 upvotes, 1 comment); strong community engagement; includes 1 media demo
2. OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
- Authors: Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li et al. (50 authors)
- arXiv: arxiv.org/abs/2604.18486
- Summary: Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts.
- Sources: HuggingFace Daily Papers (65 upvotes, 2 comments)
- Why trending: HuggingFace Daily Papers (#2, 65 upvotes, 2 comments); strong community engagement
3. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
- Authors: Guanting Dong, Junting Lu, Junjie Huang, Wanjun Zhong, Longxiang Liu et al. (20 authors)
- arXiv: arxiv.org/abs/2604.18292
- Summary: Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Context Protocol (MCP) and broader agent skills offer a unified interface for connecting agents with scalable real-world services, but training robust agents remains limited by the lack of realistic environments and principled mechanisms for life-long learning.
- Sources: HuggingFace Daily Papers (58 upvotes, 2 comments)
- Why trending: HuggingFace Daily Papers (#3, 58 upvotes, 2 comments); strong community engagement
4. OpenGame: Open Agentic Coding for Games
- Authors: Yilei Jiang, Jinyuan Hu, Qianyin Xiao, Yaozhi Zheng, Ruize Ma et al. (11 authors)
- arXiv: arxiv.org/abs/2604.18394
- Summary: Game development sits at the intersection of creative design and intricate software engineering, demanding the joint orchestration of game engines, real-time loops, and tightly coupled state across many files. While Large Language Models (LLMs) and code agents now solve isolated programming tasks with ease, they consistently stumble when asked to produce a fully playable game from a high-level design, collapsing under cross-file inconsistencies, broken scene wiring, and logical incoherence.
- Sources: HuggingFace Daily Papers (49 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#4, 49 upvotes, 1 comment); includes 1 media demo
5. MultiWorld: Scalable Multi-Agent Multi-View Video World Models
- Authors: Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu
- arXiv: arxiv.org/abs/2604.18564
- Summary: Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames.
- Sources: HuggingFace Daily Papers (35 upvotes, 2 comments)
- Why trending: HuggingFace Daily Papers (#5, 35 upvotes, 2 comments)
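The action-conditioned formulation described in the MultiWorld summary can be sketched as a minimal interface: a model that consumes a window of historical frames plus the current action and emits the next frame, rolled out autoregressively. All names and shapes below are hypothetical illustrations, not the paper's actual API; the "model" is a stub that just shifts the last frame so the sketch stays runnable.

```python
import numpy as np

class ToyWorldModel:
    """Stand-in for a learned action-conditioned video generator."""

    def __init__(self, context_len: int = 4):
        self.context_len = context_len  # number of historical frames kept

    def predict(self, frames: np.ndarray, action: np.ndarray) -> np.ndarray:
        # frames: (context_len, H, W, C); action: (action_dim,)
        # A real world model would run a learned generator here; this stub
        # simply offsets the last frame by the action's mean.
        return frames[-1] + action.mean()

def rollout(model, init_frames, actions):
    """Autoregressive rollout: each prediction is fed back as history."""
    frames = list(init_frames)
    for a in actions:
        context = np.stack(frames[-model.context_len :])
        frames.append(model.predict(context, a))
    return frames

model = ToyWorldModel()
history = [np.zeros((8, 8, 3)) for _ in range(4)]
trajectory = rollout(model, history, [np.ones(2), np.ones(2)])
```

The feedback loop in `rollout` is the key structural point: predicted frames re-enter the context window, which is what makes long-horizon (and, in the multi-agent multi-view setting, cross-view-consistent) prediction hard.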
6. EasyVideoR1: Easier RL for Video Understanding
- Authors: Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao et al. (9 authors)
- arXiv: arxiv.org/abs/2604.16893
- Summary: Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters.
- Sources: HuggingFace Daily Papers (32 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#6, 32 upvotes, 1 comment)
7. When Can LLMs Learn to Reason with Weak Supervision?
- Authors: Salman Rahman, Jingyan Shen, Anna Mordvina, Hamid Palangi, Saadia Gabriel et al. (6 authors)
- arXiv: arxiv.org/abs/2604.18574
- Summary: Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision.
- Sources: HuggingFace Daily Papers (18 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#7, 18 upvotes, 1 comment)
8. GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
- Authors: Wangjie Gan, Miao Pan, Linbo Xi, Wenqi Zhang, Jintao Chen et al. (7 authors)
- arXiv: arxiv.org/abs/2604.14258
- Summary: Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion.
- Sources: HuggingFace Daily Papers (18 upvotes, 2 comments)
- Why trending: HuggingFace Daily Papers (#8, 18 upvotes, 2 comments)
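The "sparse implicit reward with inverse-probability weighting" reading of SFT in the GFT summary corresponds to a standard identity, sketched here (notation mine, not the paper's): the SFT gradient on a demonstration y* equals a policy-gradient expectation whose reward is an indicator on y* scaled by 1/π, which is both extremely sparse and unbounded as π(y* | x) shrinks.

```latex
\nabla_\theta \log \pi_\theta(y^* \mid x)
  \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[
      \underbrace{\frac{\mathbf{1}\{y = y^*\}}{\pi_\theta(y \mid x)}}_{\text{implicit reward}}
      \,\nabla_\theta \log \pi_\theta(y \mid x)
    \right]
```

The expectation collapses because the indicator keeps only the y = y* term, whose probability cancels the 1/π weight; the 1/π factor is what blows up for low-probability demonstrations, matching the summary's gradient-explosion claim.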
9. WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
- Authors: Xinping Lei, Xinyu Che, Junqi Xiong, Chenchen Zhang, Yukai Huang et al. (19 authors)
- arXiv: arxiv.org/abs/2604.18224
- Summary: Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability.
- Sources: HuggingFace Daily Papers (17 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#9, 17 upvotes, 1 comment)
10. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
- Authors: Xirui Li, Ming Li, Derry Xu, Wei-Lin Chiang, Ion Stoica et al. (7 authors)
- arXiv: arxiv.org/abs/2604.18543
- Summary: Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand.
- Sources: HuggingFace Daily Papers (17 upvotes, 0 comments)
- Why trending: HuggingFace Daily Papers (#10, 17 upvotes, 0 comments)
11. SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
- Authors: Ziao Zhang, Kou Shi, Shiting Huang, Avery Nie, Yu Zeng et al. (16 authors)
- arXiv: arxiv.org/abs/2604.17308
- Summary: As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time.
- Sources: HuggingFace Daily Papers (15 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#11, 15 upvotes, 1 comment)
12. Crowded in B-Space: Calibrating Shared Directions for LoRA Merging
- Authors: Yixuan Tang, Yi Yang
- arXiv: arxiv.org/abs/2604.16826
- Summary: Merging separately trained LoRA adapters is a practical alternative to joint multi-task training, but it often hurts performance. Existing methods usually treat the LoRA update ΔW = BA as a single object and do not distinguish the two LoRA matrices.
- Sources: HuggingFace Daily Papers (14 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#12, 14 upvotes, 1 comment)
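Why treating ΔW = BA "as a single object" differs from handling B and A separately can be shown with a toy NumPy check (an illustration of the underlying algebra, not the paper's method, with hypothetical dimensions): averaging the full updates B₁A₁ and B₂A₂ is not the same as multiplying the averaged factors, because the factor product introduces cross terms B₁A₂ and B₂A₁.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 6, 5, 2  # hypothetical dims: output, input, LoRA rank

# Two separately trained LoRA adapters, each a low-rank update dW = B @ A.
B1, A1 = rng.normal(size=(d, r)), rng.normal(size=(r, k))
B2, A2 = rng.normal(size=(d, r)), rng.normal(size=(r, k))

# Merging the full updates treats each BA product as one object...
merged_dw = 0.5 * (B1 @ A1 + B2 @ A2)

# ...whereas averaging B and A separately yields a different matrix:
# (B1+B2)(A1+A2)/4 contains the cross terms B1@A2 and B2@A1.
merged_factors = (0.5 * (B1 + B2)) @ (0.5 * (A1 + A2))

gap = np.linalg.norm(merged_dw - merged_factors)  # nonzero in general
```

The nonzero gap is exactly the space in which the B-side directions of different adapters interact, which is where a B-space calibration method has room to operate.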
13. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
- Authors: Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei et al. (8 authors)
- arXiv: arxiv.org/abs/2604.18584
- Summary: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems.
- Sources: HuggingFace Daily Papers (6 upvotes, 0 comments)
- Why trending: HuggingFace Daily Papers (#13, 6 upvotes, 0 comments)
14. GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
- Authors: Jiaqing Liang, Jinyi Han, Weijia Li, Xinyi Wang, Zhoujia Zhang et al. (18 authors)
- arXiv: arxiv.org/abs/2604.17091
- Summary: Long-horizon large language model (LLM) agents are fundamentally limited by context. As interactions become longer, tool descriptions, retrieved memories, and raw environmental feedback accumulate and push out the information needed for decision-making.
- Sources: HuggingFace Daily Papers (6 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#14, 6 upvotes, 1 comment)
15. VoxMind: An End-to-End Agentic Spoken Dialogue System
- Authors: Tianle Liang, Yifu Chen, Shengpeng Ji, Yijun Chen, Zhiyang Jia et al. (10 authors)
- arXiv: arxiv.org/abs/2604.15710
- Summary: Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope with them.
- Sources: HuggingFace Daily Papers (6 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#15, 6 upvotes, 1 comment); includes 1 media demo
16. Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
- Authors: Eun Woo Im, Dhruv Madhwal, Vivek Gupta
- arXiv: arxiv.org/abs/2604.13313
- Summary: Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining.
- Sources: HuggingFace Daily Papers (6 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#16, 6 upvotes, 1 comment)
17. Agents Explore but Agents Ignore: LLMs Lack Environmental Curiosity
- Authors: Leon Engländer, Sophia Althammer, Ahmet Üstün, Matthias Gallé, Tom Sherborne
- arXiv: arxiv.org/abs/2604.17609
- Summary: LLM-based agents are assumed to integrate environmental observations into their reasoning: discovering highly relevant but unexpected information should naturally lead to a model exploiting its own discoveries. We show that this assumption is false for current LLM-based agents, which struggle to reflect or react to unexpected information.
- Sources: HuggingFace Daily Papers (5 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#17, 5 upvotes, 1 comment)
18. OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
- Authors: Junfu Pu, Yuxin Chen, Teng Wang, Ying Shan
- arXiv: arxiv.org/abs/2604.11102
- Summary: Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues.
- Sources: HuggingFace Daily Papers (5 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#18, 5 upvotes, 1 comment)
19. Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
- Authors: Qifan Zhang, Dongyang Ma, Tianqing Fang, Jia Li, Jing Tang et al. (8 authors)
- arXiv: arxiv.org/abs/2604.18131
- Summary: Most agents today "self-evolve" by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops.
- Sources: HuggingFace Daily Papers (4 upvotes, 0 comments)
- Why trending: HuggingFace Daily Papers (#19, 4 upvotes, 0 comments)
20. Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play
- Authors: Xiachong Feng, Deyi Yin, Xiaocheng Feng, Yi Jiang, Libo Qin et al. (12 authors)
- arXiv: arxiv.org/abs/2604.17696
- Summary: Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics.
- Sources: HuggingFace Daily Papers (4 upvotes, 1 comment)
- Why trending: HuggingFace Daily Papers (#20, 4 upvotes, 1 comment)
