Daily AI Papers — April 16, 2026
Published:
1. ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
- Authors: Fei Tang, Zhiqiong Lu, Boxuan Zhang et al. (Zhejiang University)
- arXiv: 2604.11784
- Summary: ClawGUI is an open-source framework that addresses three critical gaps in GUI agent development: RL training infrastructure, standardized evaluation, and real-device deployment. ClawGUI-2B achieves 17.1% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0%.
- Why trending: First open-source GUI agent RL infrastructure with support for physical devices. 127 HF upvotes, 434 GitHub stars, strong community interest in autonomous GUI agents.
- Sources: HuggingFace (127 upvotes), arXiv, GitHub (434 stars)
2. GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
- Authors: Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin et al. (NUS)
- arXiv: 2604.07429
- Summary: A benchmark for evaluating MLLMs as game agents in browser environments with 34 games and 170 tasks. Results across 18 model-interface pairs show even the best agents fall far short of human capabilities on video games.
- Why trending: Addresses the need for standardized game agent evaluation. 105 HF upvotes, covered by neuronad.com, aimodels.fyi, and HuggingFace community.
- Sources: HuggingFace (105 upvotes), arXiv, neuronad.com, aimodels.fyi
3. Seedance 2.0: Advancing Video Generation for World Complexity
- Authors: Team Seedance (ByteDance), De Chen, Liyang Chen et al.
- arXiv: 2604.14148
- Summary: A native multi-modal audio-video generation model from ByteDance with a unified large-scale architecture for joint audio-video generation. Represents a major leap in video generation quality and world simulation complexity.
- Why trending: Major industry release from ByteDance. 104 HF upvotes, official ByteDance blog, multiple tech outlets covering it.
- Sources: HuggingFace (104 upvotes), ByteDance Official Blog, arXiv, seed.bytedance.com
4. RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time
- Authors: Haozhe Wang, Cong Wei, Weiming Ren et al.
- arXiv: 2604.11626
- Summary: Proposes teaching reward models to produce explicit, multi-dimensional critiques before scoring, transforming them from passive evaluators into active reasoning agents. Demonstrates scaling visual generation at both training and test time.
- Why trending: Novel approach bridging reasoning and visual generation reward modeling. 92 HF upvotes.
- Sources: HuggingFace (92 upvotes), arXiv
5. KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance
- Authors: Linhao Yu, Tianmeng Yang, Siyu Ding et al. (Tianjin University)
- arXiv: 2604.12627
- Summary: A knowledge-guided RL framework that decomposes guidance into atomic knowledge points and uses Constrained Subset Search for compact training subsets. KnowRL-Nemotron-1.5B achieves 74.16 average accuracy across 8 reasoning benchmarks, setting SOTA at 1.5B scale.
- Why trending: Addresses reward sparsity in RLVR with a novel minimal-sufficient guidance approach. 92 HF upvotes, public code/data release.
- Sources: HuggingFace (92 upvotes), arXiv, GitHub
6. Rethinking On-Policy Distillation of Large Language Models
- Authors: Yaxuan Li, Yuxin Zuo, Bingxiang He et al. (Tsinghua NLP)
- arXiv: 2604.13016
- Summary: Systematic investigation of on-policy distillation dynamics, identifying two conditions for success: compatible thinking patterns between student/teacher, and genuinely new teacher capabilities. Proposes off-policy cold start and teacher-aligned prompt selection to recover failing OPD.
- Why trending: Provides deep mechanistic understanding of a core post-training technique from a top NLP group. 71 HF upvotes.
- Sources: HuggingFace (71 upvotes), arXiv, GitHub
7. SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments
- Authors: Dinging Li, Yingxiu Zhao, Xinrui Cheng et al.
- arXiv: 2604.14144
- Summary: Addresses the bottleneck of geometric annotation costs for spatial reasoning by proposing a self-evolving paradigm with deterministic geometric environments, enabling continuous improvement without expensive manual labeling.
- Why trending: Tackles a key challenge in embodied AI with a scalable self-evolving approach. 60 HF upvotes.
- Sources: HuggingFace (60 upvotes), arXiv
8. OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models
- Authors: Xiaomeng Hu, Yinger Zhang, Fei Huang et al.
- arXiv: 2604.10866
- Summary: A benchmark enabling evaluation of AI agents across hundreds of occupational domains (from ER triage to nuclear safety monitoring), using language world models to simulate professional environments where public benchmarks don’t exist.
- Why trending: Addresses the gap between narrow benchmarks and real professional work. 45 HF upvotes.
- Sources: HuggingFace (45 upvotes), arXiv
9. Toward Autonomous Long-Horizon Engineering for ML Research
- Authors: Guoxin Chen, Jie Chen, Lei Chen et al.
- arXiv: 2604.13018
- Summary: Explores autonomous systems that can conduct long-horizon ML research engineering tasks end-to-end, from problem identification to implementation and evaluation. Highlights current limitations and proposes solutions for autonomous ML research.
- Why trending: Highly relevant to the AI-for-AI-research movement. 28 HF upvotes.
- Sources: HuggingFace (28 upvotes), arXiv
10. Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
- Authors: Jiachen Zhu, Lingyu Yang, Rong Shan et al.
- arXiv: 2604.09574
- Summary: Proposes a benchmark evaluating whether mobile GUI agents can behave indistinguishably from humans, introducing humanization metrics beyond task completion to measure natural interaction patterns.
- Why trending: Novel angle on GUI agent evaluation focused on human-likeness. 28 HF upvotes.
- Sources: HuggingFace (28 upvotes), arXiv
11. SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
- Authors: Tianyi Wang, Yixia Li, Long Li et al.
- arXiv: 2604.08865
- Summary: Proposes Sequence-Level PPO that optimizes complete reasoning trajectories rather than token-level decisions, achieving better credit assignment for long-horizon reasoning tasks where standard PPO struggles.
- Why trending: Directly addresses a critical limitation in RL-based LLM training. 27 HF upvotes, covered on paperium.net and aimodels.fyi.
- Sources: HuggingFace (27 upvotes), arXiv, paperium.net, aimodels.fyi
12. Lyra 2.0: Explorable Generative 3D Worlds
- Authors: Tianchang Shen, Sherwin Bahmani, Kai He et al.
- arXiv: 2604.13036
- Summary: Combines camera-controlled video generation with feed-forward 3D reconstruction to create explorable 3D worlds. Enables a new paradigm for 3D scene creation through generative reconstruction.
- Why trending: Pushes boundaries on interactive 3D generation from video. 26 HF upvotes, covered on emergentmind.com.
- Sources: HuggingFace (26 upvotes), arXiv, emergentmind.com
13. Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model
- Authors: NVIDIA, Aakshita Chandiramani, Aaron Blakeman et al.
- arXiv: 2604.12374
- Summary: A 120B (12B active) parameter hybrid Mamba-Attention MoE model from NVIDIA. First to be pre-trained in NVFP4 and leverage LatentMoE, a new MoE architecture. Optimized for agentic reasoning with strong efficiency characteristics.
- Why trending: Major industry release from NVIDIA combining Mamba + Transformer + MoE. 26 HF upvotes.
- Sources: HuggingFace (26 upvotes), arXiv, NVIDIA
14. BERT-as-a-Judge: A Robust Alternative to Lexical Methods for LLM Evaluation
- Authors: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Emmanuel Malherbe et al.
- arXiv: 2604.09497
- Summary: Proposes using BERT-scale models as judges for reference-based LLM evaluation, offering a more robust alternative to rigid lexical methods that can conflate model capability with formatting compliance.
- Why trending: Practical and efficient approach to a widespread pain point in LLM evaluation. 26 HF upvotes.
- Sources: HuggingFace (26 upvotes), arXiv
15. Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents
- Authors: Kangsan Kim, Minki Kang, Taeil Kim et al.
- arXiv: 2604.14004
- Summary: Studies how memory-based self-evolution in coding agents can transfer across task domains by leveraging shared infrastructural foundations like runtime environments and programming languages, enabling cross-domain memory utilization.
- Why trending: Directly relevant to the coding agent explosion; shows memories generalize across domains. 23 HF upvotes.
- Sources: HuggingFace (23 upvotes), arXiv
16. From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space
- Authors: Yuqiao Tan, Minzheng Wang, Bo Liu et al.
- arXiv: 2604.14142
Summary: Shows that RLVR’s potential is fundamentally bounded by the base model’s output distribution when optimizing P(y x). Proposes optimizing the marginal distribution P(y) in pre-train space to unlock capabilities beyond the base model’s existing distribution. - Why trending: Provides a fundamental theoretical insight into the limits of RLVR. 23 HF upvotes.
- Sources: HuggingFace (23 upvotes), arXiv
17. The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities
- Authors: Xuwei Ding, Skylar Zhai, Linxin Song et al.
- arXiv: 2604.10577
- Summary: Reveals that computer-use agents can be manipulated through seemingly benign user instructions to automate harmful actions, exposing a subtle but critical safety gap that existing evaluations miss by focusing only on explicit threats.
- Why trending: Highlights a critical blind spot in agent safety as CUAs become mainstream. 21 HF upvotes.
- Sources: HuggingFace (21 upvotes), arXiv
18. Target Policy Optimization
- Authors: Jean Kaddour
- arXiv: 2604.06159
- Summary: Proposes a new policy optimization framework that directly targets optimal policies rather than iteratively improving through proximal updates, potentially enabling more efficient RL training for LLMs.
- Why trending: Novel RL optimization method from a well-known researcher. 19 HF upvotes.
- Sources: HuggingFace (19 upvotes), arXiv
19. Exploration and Exploitation Errors Are Measurable for Language Model Agents
- Authors: Jaden Park, Jungtaek Kim, Jongwon Jeong et al.
- arXiv: 2604.13151
- Summary: Introduces methods to quantify exploration and exploitation errors in LLM agents, providing actionable diagnostics for understanding and improving agent decision-making in complex environments.
- Why trending: Fills a gap in understanding agent failure modes with measurable metrics. 18 HF upvotes.
- Sources: HuggingFace (18 upvotes), arXiv
20. Towards Long-horizon Agentic Multimodal Search
- Authors: Yifan Du, Zikang Liu, Jinbiao Peng et al.
- arXiv: 2604.12890
- Summary: Tackles the challenge of long-horizon multimodal search where agents must plan, retrieve, and reason across extended interaction sequences with diverse information sources including text, images, and video.
- Why trending: Addresses a frontier problem in agentic information retrieval. 16 HF upvotes.
- Sources: HuggingFace (16 upvotes), arXiv
Key Themes Today
- GUI & Game Agents Dominate — 5 of the top 20 papers focus on autonomous GUI/game agents (ClawGUI, GameWorld, Turing Test, OccuBench, Blind Spot of Agent Safety).
RL for LLM Reasoning — Multiple papers push RL-based training forward: KnowRL, SPPO, From P(y x) to P(y), Target Policy Optimization. - Distillation & Post-Training — On-policy distillation mechanics are deeply examined (Rethinking OPD, TIP).
- Industry Releases — ByteDance (Seedance 2.0) and NVIDIA (Nemotron 3 Super) both dropped major model papers.
- Agent Safety — Growing attention to safety vulnerabilities in autonomous computer-use agents.
Report generated 2026-04-16. Data sources: HuggingFace Daily Papers API (Apr 15-16), web search cross-referencing, arXiv abstracts.
