Daily AI Papers — April 10, 2026

Published: April 10, 2026

1. SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Authors: Ziyu Ma, Shidong Yang, Yuxiang Ji et al.
ArXiv: arxiv.org/abs/2604.08377
Summary: Introduces a framework for collective skill evolution in multi-user LLM agent ecosystems, treating cross-user interactions as the primary signal for improving reusable agent skills. SkillClaw enables skills to continuously improve post-deployment rather than remaining static.
Sources: HuggingFace (139 upvotes, #1), ArXiv, EmergentMind, blog coverage (blakecrosley.com)
Why trending: Addresses a key pain point in LLM agent systems — static skills. High community engagement and cross-platform visibility with blog discussion.

2. Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

Authors: Qihan Ren, Peng Wang, Ruikun Cai et al.
ArXiv: arxiv.org/abs/2604.06628
Summary: Challenges the prevailing narrative that SFT memorizes while RL generalizes for LLM reasoning. Reveals a “dip-and-recovery” pattern in which cross-domain performance degrades before recovering with extended training, suggesting that previously reported generalization failures may be artifacts of under-optimization.
Sources: HuggingFace (124 upvotes, #2), ArXiv, OpenReview
Why trending: Directly challenges dominant assumptions in the LLM post-training community with rigorous conditional analysis. Important implications for training pipelines.

3. HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

Authors: Tencent Robotics X, HY Vision Team, Xumin Yu et al.
ArXiv: arxiv.org/abs/2604.07430
Summary: A family of foundation models for real-world embodied agents from Tencent, featuring 2B (edge) and 32B (complex reasoning) variants. Uses Mixture-of-Transformers (MoT) architecture for fine-grained spatial/temporal perception and embodied reasoning.
Sources: HuggingFace (105 upvotes, #3), ArXiv, EmergentMind, HuggingFace model hub (tencent/HY-Embodied-0.5)
Why trending: Major industry release from Tencent with open model weights. Bridges the gap between VLMs and embodied AI with practical deployment focus.

4. When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models (NUMINA)

Authors: Zhengyang Sun, Yu Chen, Xin Zhou et al.
ArXiv: arxiv.org/abs/2604.08546
Summary: Introduces NUMINA, a training-free framework that fixes object counting errors in text-to-video diffusion models by identifying prompt-layout inconsistencies via self-/cross-attention heads. Improves counting accuracy by up to 7.4% on Wan2.1 models.
Sources: HuggingFace (104 upvotes, #4), ArXiv, EmergentMind, nexu.io blog coverage
Why trending: Solves a highly visible and frustrating problem in video generation. Training-free approach makes it immediately applicable.

5. MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping

Authors: Junyao Gao, Sibo Liu, Jiaxing Li et al.
ArXiv: arxiv.org/abs/2604.08364
Summary: Presents a novel pipeline for curating large-scale style datasets (MegaStyle-1.4M) by leveraging consistent text-to-image style mapping. Proposes style-supervised contrastive learning and trains a FLUX-based style-driven generation model.
Sources: HuggingFace (79 upvotes, #5), ArXiv
Why trending: Addresses the data bottleneck for style transfer with a scalable approach. 1.4M sample dataset is a significant resource for the creative AI community.

6. ClawBench: Can AI Agents Complete Everyday Online Tasks?

Authors: Yuxuan Zhang, Yubo Wang, Yipeng Zhu et al.
ArXiv: arxiv.org/abs/2604.08523
Summary: An evaluation framework of 153 real-world tasks across 144 live platforms (purchasing, booking, job applications). Tests demanding capabilities like document understanding, multi-step workflows, and form-filling that go beyond existing agent benchmarks.
Sources: HuggingFace (52 upvotes, #6), ArXiv, BenchLM.ai, ClawBench.com
Why trending: Fills a critical gap in agent evaluation with real-world, not simulated, tasks. Multiple benchmark platforms already tracking it.

7. LPM 1.0: Video-based Character Performance Model

Authors: Ailing Zeng, Casper Yang, Chauncey Ge et al.
ArXiv: arxiv.org/abs/2604.07823
Summary: A Large Performance Model for single-person full-duplex audio-visual conversational performance. Addresses the “performance trilemma” of expressiveness, real-time inference, and long-horizon identity stability in video character generation.
Sources: HuggingFace (35 upvotes, #7), ArXiv, AI Flash Report
Why trending: Tackles a fundamental challenge in digital characters/avatars. Full-duplex conversational performance is key for interactive applications.

8. KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Authors: Tongbo Chen, Zhengxi Lu, Zhan Xu et al.
ArXiv: arxiv.org/abs/2604.08455
Summary: An online benchmark for personalized mobile agents on Android emulation covering 192 tasks. Unlike prior work that treats preferences as static, KnowU-Bench requires agents to interactively elicit preferences and decide when to intervene or stay silent.
Sources: HuggingFace (33 upvotes, #8), ArXiv, AlphaXiv
Why trending: Benchmarks the next frontier of mobile agents — proactive personalization. Novel interactive evaluation paradigm.

9. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Authors: Chenyu Zhou, Huacan Chai, Wenteng Chen et al.
ArXiv: arxiv.org/abs/2604.08224
Summary: A comprehensive review arguing that modern LLM agent capabilities are increasingly built by reorganizing runtime infrastructure (memory, skills, protocols) rather than changing model weights. Frames this through the lens of cognitive artifact externalization.
Sources: HuggingFace (30 upvotes, #9), ArXiv
Why trending: Timely survey that captures the paradigm shift in agent engineering. Useful framing for researchers and practitioners building agent systems.

10. DMax: Aggressive Parallel Decoding for dLLMs

Authors: Zigeng Chen, Gongfan Fang, Xinyin Ma et al.
ArXiv: arxiv.org/abs/2604.08302
Summary: A new paradigm for efficient diffusion language models (dLLMs) that reformulates decoding as progressive self-refinement. Introduces On-Policy Uniform Training and Soft Parallel Decoding for aggressive parallelism while maintaining generation quality.
Sources: HuggingFace (28 upvotes, #10), ArXiv, EmergentMind
Why trending: Diffusion-based LLMs are gaining traction as an alternative to autoregressive models. DMax makes their parallel decoding practical.

11. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

Authors: Jianhui Liu, Haoze Sun, Wenbo Li et al.
ArXiv: arxiv.org/abs/2604.07296
Summary: An open-source data engine for generating high-quality spatial understanding data across five foundational tasks, including measurement, relationships, camera perception, and multi-view consistency. Uses 3D bounding boxes as fundamental primitives.
Sources: HuggingFace (26 upvotes, #11), ArXiv
Why trending: Spatial intelligence is a hot area, driven by growth in robotics and embodied AI. Meets the need for a principled open-source system for generating spatial data.

12. Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Authors: Shilin Yan, Jintao Tong, Hongwei Xue et al.
ArXiv: arxiv.org/abs/2604.08545
Summary: Addresses the “blind tool invocation” problem in multimodal agents where models reflexively call tools even when queries are answerable from visual context. Proposes a decoupled reward framework to balance internal reasoning vs. external tool use.
Sources: HuggingFace (25 upvotes, #12), ArXiv
Why trending: Targets a practical inefficiency in deployed agent systems. Meta-cognitive arbitration between internal knowledge and tool use is an emerging research direction.

13. OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

Authors: Yiduo Jia, Muzhi Zhu, Hao Zhong et al.
ArXiv: arxiv.org/abs/2604.08209
Summary: A self-supervised RL post-training framework for omni-modal models using temporal reordering of shuffled audio-visual clips. Proposes three cross-modal integration strategies and a two-stage data filtering pipeline.
Sources: HuggingFace (16 upvotes, #13), ArXiv
Why trending: Novel approach to extending RL post-training beyond text to audio-visual reasoning. Creative proxy task design.
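
The reordering proxy task in this entry is self-supervised because the correct clip order is known for free once the clips are shuffled. A minimal sketch of that idea follows. The function names, the string stand-ins for clips, and the per-slot accuracy reward are all assumptions made for illustration; the paper's actual reward and its cross-modal strategies are not reproduced here.

```python
import random

def make_jigsaw_sample(clips, rng):
    """Shuffle clips and return (shuffled_clips, target).

    target[i] is the original position of the clip now shown in slot i,
    so the label comes for free from the shuffle itself.
    """
    order = list(range(len(clips)))
    rng.shuffle(order)
    shuffled = [clips[i] for i in order]
    return shuffled, order

def reorder_reward(predicted, target):
    """Hypothetical reward: fraction of slots whose original position
    was predicted correctly; invalid permutations score 0."""
    if not target:
        return 0.0
    if sorted(predicted) != list(range(len(target))):
        return 0.0  # not a valid permutation of the slots
    hits = sum(p == t for p, t in zip(predicted, target))
    return hits / len(target)

# Usage: the strings stand in for audio-visual clips.
rng = random.Random(0)
shuffled, target = make_jigsaw_sample(["clip0", "clip1", "clip2", "clip3"], rng)
perfect = reorder_reward(target, target)
```

A graded reward like this one gives partial credit for nearly correct orderings; an exact-match reward would be a stricter alternative.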

14. Graph of Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

Authors: Dawei Li, Zongxia Li, Hongyang Du et al.
ArXiv: arxiv.org/abs/2604.05333
Summary: An inference-time structural retrieval layer for large agent skill libraries. Constructs executable skill graphs offline and retrieves bounded, dependency-aware skill bundles to avoid context window saturation and reduce hallucination.
Sources: HuggingFace (16 upvotes, #14), ArXiv
Why trending: Practical solution to the scaling problem of agent skill libraries. Complements the SkillClaw paper (#1) in the agent skills ecosystem.
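
To make "bounded, dependency-aware skill bundles" concrete, here is a small sketch of one way such retrieval could work: start from a few seed skills and pull in their dependencies breadth-first until a size budget is reached. The graph contents, the skill names, and the `retrieve_bundle` function are hypothetical, and the seed skills are assumed to come from some earlier matching step; the paper's own graph construction and retrieval procedure may differ.

```python
from collections import deque

def retrieve_bundle(graph, seeds, max_skills):
    """Return the seeds plus their transitive dependencies in
    breadth-first order, capped at max_skills entries."""
    bundle, seen = [], set()
    queue = deque(seeds)
    while queue and len(bundle) < max_skills:
        skill = queue.popleft()
        if skill in seen:
            continue  # already included via another path
        seen.add(skill)
        bundle.append(skill)
        queue.extend(graph.get(skill, []))
    return bundle

# Hypothetical skill graph built offline: skill -> skills it depends on.
graph = {
    "book_flight": ["search_flights", "fill_payment_form"],
    "search_flights": ["open_browser"],
    "fill_payment_form": ["open_browser"],
    "open_browser": [],
}

full = retrieve_bundle(graph, ["book_flight"], max_skills=10)
capped = retrieve_bundle(graph, ["book_flight"], max_skills=3)
```

The cap keeps the prompt small, but it can also cut off a needed dependency (here `open_browser` is dropped when the budget is 3), so a real system would need a policy for that case.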

15. Structured Distillation of Web Agent Capabilities Enables Generalization (Agent-as-Annotators)

Authors: Xing Han Lù, Siva Reddy
ArXiv: arxiv.org/abs/2604.07776
Summary: A framework that distills frontier LLM web agent capabilities into small open-weight models using structured synthetic trajectory generation. A 9B student model achieves 41.5% on WebArena, surpassing Claude 3.5 Sonnet (36.0%) and GPT-4o (31.5%).
Sources: HuggingFace (13 upvotes, #15), ArXiv, papers.fzhiy.net daily
Why trending: Remarkable result: a 9B model beating frontier closed-source models on web navigation. Key advancement for open-weight agent models.

16. OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Authors: Wenbo Hu, Xin Chen, Yan Gao-Tian et al.
ArXiv: arxiv.org/abs/2604.08539
Summary: Introduces Gaussian GRPO (G²RPO), a novel RL training objective using non-linear distributional matching to force advantage distributions to converge to standard normal. Enables training a generalist multimodal reasoning model across diverse visual tasks.
Sources: HuggingFace (13 upvotes, #16), ArXiv
Why trending: Advances the state of open-source multimodal reasoning models. Novel RL objective design addresses reward variance across heterogeneous tasks.
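
One simple way to push a group of advantages toward a standard normal shape is a rank-based quantile mapping, sketched below. This illustrates non-linear distributional matching in general, not the paper's G²RPO objective, whose exact form is not given in this summary; the `gaussianize` function and its tie handling are assumptions.

```python
from statistics import NormalDist

def gaussianize(rewards):
    """Map a group of rewards to standard-normal quantiles by rank.

    Ties share the average rank, so equal rewards get equal advantages.
    """
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of rewards tied with position i.
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    std_normal = NormalDist()
    # (rank - 0.5) / n lies strictly inside (0, 1), so inv_cdf is defined.
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]

# Rewards on very different scales map to the same symmetric advantages.
adv = gaussianize([0.0, 10.0, 1000.0])
```

Because only ranks matter, an outlier reward such as 1000.0 cannot dominate the update. A linear mean-and-standard-deviation normalization would keep the outlier's relative magnitude.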

17. FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On

Authors: Johanna Karras, Yuanhao Wang, Yingwei Li et al.
ArXiv: arxiv.org/abs/2604.08526
Summary: First dataset providing precise garment and body size information for virtual try-on, including “ill-fit” cases. Addresses the overlooked dimension of garment fit accuracy — how an XL shirt looks on an XS person.
Sources: HuggingFace (12 upvotes, #17), ArXiv, Paperium
Why trending: Solves a key gap in virtual try-on research. Practical e-commerce applications make this commercially relevant.

18. Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

Authors: Quantong Qiu, Zhiyi Hong, Yi Yang et al.
ArXiv: arxiv.org/abs/2604.07394
Summary: A context-aware framework that dynamically optimizes attention computation at the layer level, combining Full Attention and Sparse Attention with a lightweight Layer Router. Addresses the scalability bottleneck of standard attention for long-context LLMs.
Sources: HuggingFace (9 upvotes, #18), ArXiv
Why trending: Long-context efficiency is a critical infrastructure problem. Dynamic layer-level sparsity is a pragmatic middle ground between full and sparse attention.

19. Automating Database-Native Function Code Synthesis with LLMs (DBCooker)

Authors: Wei Zhou, Xuanhe Zhou, Qikang He et al.
ArXiv: arxiv.org/abs/2604.06231
Summary: An LLM-based system for automatically synthesizing database native functions, handling the complexity of multi-unit registration, internal reference linking, and logic implementation that generic code generation tools struggle with.
Sources: HuggingFace (9 upvotes, #19), ArXiv
Why trending: Niche but impactful — LLM-driven automation for database kernel development. Addresses a real pain point in database systems engineering.

20. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

Authors: Jindi Lv, Hao Li, Jie Li et al.
ArXiv: arxiv.org/abs/2604.08168
Summary: Repurposes a pretrained video generator for value estimation in robot manipulation RL. Jointly predicts future proprioception and state values, leveraging spatiotemporal priors to handle partial observability and delayed feedback.
Sources: HuggingFace (8 upvotes, #20), ArXiv
Why trending: Creative intersection of video generation and robot RL. Uses video priors to solve the long-horizon value estimation problem in manipulation.

Key Themes Today

Agent Skills & Infrastructure (#1, #6, #9, #12, #14): Agent systems are maturing beyond model capability into skill management, retrieval, and evolution.
Reasoning & Generalization (#2, #16): The SFT vs RL debate for reasoning continues with nuanced findings.
Embodied AI (#3, #11, #20): Robotics foundation models gaining momentum with practical deployment focus.
Video Generation (#4, #7): Text-to-video alignment and character performance remain active areas.
Efficient Inference (#10, #18): Parallel decoding and attention optimization for production LLMs.

Report generated 2026-04-10 10:00 PDT

Alireza Shamsoshoara
