Daily AI Papers — April 2, 2026

12 minute read

Published:

1. Terminal Agents Suffice for Enterprise Automation

Authors: Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton et al. (ServiceNow) Summary: Challenges whether complex agentic systems (MCP tool-augmented agents, web agents with GUIs) are necessary for enterprise automation. Shows that simple terminal-based agents – just a model with a shell – can match or beat more complex approaches. Questions the current rush toward elaborate agent architectures. Link: arxiv.org/abs/2604.00073 Source: HuggingFace daily (Apr 2), alphaxiv.org discussion, YouTube explainer video, CACM blog on multi-agent enterprise automation Why trending: Provocative claim from ServiceNow that simplicity wins. Directly challenges the MCP and web-agent hype cycle with empirical evidence.

2. Brevity Constraints Reverse Performance Hierarchies in Language Models

Authors: MD Azizul Hakim Summary: On 7.7% of benchmark problems, larger LLMs underperform smaller ones by 28.4 percentage points despite 10-100x more parameters. The mechanism: scale-dependent verbosity – larger models spontaneously produce longer outputs that fail under brevity constraints. Tested across 31 models (0.5B-405B) on 1,485 problems. Link: arxiv.org/abs/2604.00025 Source: HuggingFace daily (Apr 2), wand.ai blog post, richlyai blog post Why trending: Counterintuitive and practically important: bigger models can be worse when asked to be concise. Multiple blog posts amplifying the finding.

3. Universal YOCO for Efficient Depth Scaling

Authors: Yutao Sun, Li Dong, Tianzhu Ye, Shaohan Huang, Jianyong Wang, Furu Wei (Microsoft Research) Summary: Test-time scaling via deeper reasoning is booming, but standard Transformers can’t scale depth efficiently – looping strategies suffer from high compute overhead and KV cache that inflates with model depth. Universal YOCO (YOCO-U) combines the YOCO architecture with efficient depth scaling, enabling test-time compute scaling without the usual overhead. Link: arxiv.org/abs/2604.01220 Source: HuggingFace daily (Apr 2), extends YOCO from NeurIPS 2024 (neurips.cc/virtual/2024/poster/96833), fugumt translation Why trending: Furu Wei (MSR) extending their influential YOCO architecture for the test-time scaling era. Solves a real infrastructure bottleneck.

4. Reasoning Shift: How Context Silently Shortens LLM Reasoning

Authors: Gleb Rodionov Summary: Reveals that reasoning models’ extended thinking (self-verification, multi-step traces) is fragile: context can silently shorten reasoning chains without the model acknowledging the change. Systematically evaluates multiple reasoning models across three scenarios to expose this “reasoning shift” vulnerability. Link: arxiv.org/abs/2604.01161 Source: HuggingFace daily (Apr 2), arxiv cs.CL Why trending: Reasoning models are being deployed widely. This paper shows their reasoning depth is more context-dependent than assumed – a silent failure mode.

5. Embarrassingly Simple Self-Distillation Improves Code Generation

Authors: Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang Summary: Can an LLM improve at coding using only its own outputs, without a verifier, teacher, or RL? Yes. Simple Self-Distillation (SSD): sample solutions at certain temperatures, then fine-tune on those samples with standard SFT. No reward model, no external signal. Works. Link: arxiv.org/abs/2604.01193 Source: HuggingFace daily (Apr 2), alphaxiv.org discussion Why trending: Elegantly simple method from Meta AI researchers (Ronan Collobert). The “embarrassingly simple” framing always gets attention when backed by results.

6. HippoCamp: Benchmarking Contextual Agents on Personal Computers

Authors: Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu et al. (NTU, Ziwei Liu lab) Summary: New benchmark for agents operating on personal computers with multimodal file management. Unlike existing benchmarks (web interaction, tool use), HippoCamp evaluates agents in user-centric environments with individual user profiles and massive personal file search requiring context-aware reasoning. Link: arxiv.org/abs/2604.01221 Source: HuggingFace daily (Apr 2), alphaxiv.org, researchtrend.ai, HuggingFace paper page Why trending: Personal computer agents are an emerging product category. This benchmark evaluates the user-centric aspects that matter for real deployment.

7. MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Authors: Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li et al. Summary: Deep research benchmarks only evaluate final reports with fixed rubrics, missing the research process itself. MiroEval evaluates both process and outcome for multimodal deep research agents, with refreshable tasks that reflect real-world query complexity. Dedicated project website. Link: arxiv.org/abs/2603.28407 Source: HuggingFace daily (Apr 2), EmergentMind, project website (miroeval.github.io), HuggingFace paper page Why trending: Deep research agents are the hottest capability right now. This is the most comprehensive evaluation framework so far with multi-platform discussion.

8. S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

Authors: Jack Young Summary: Tunes a single initial state matrix per recurrent layer, freezing all model weights. With just ~48 training examples, outperforms LoRA by +10.8pp (p<0.001) on HumanEval. Zero inference overhead – nothing changes at inference time. Works on hybrid recurrent-attention models (GatedDeltaNet). Link: arxiv.org/abs/2604.01168 Source: HuggingFace daily (Apr 2), arxiv cs.LG Why trending: Outperforming LoRA with zero overhead and ~48 examples is a remarkable efficiency claim. Highlights hybrid recurrent models as an emerging architecture.

9. ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents

Authors: Songyang Liu, Chaozhuo Li, Chenxu Wang, Jinyu Hou et al. Summary: OpenClaw-style autonomous agents have powerful capabilities (tool integration, file access, shell execution) but these privileges create critical security vulnerabilities. ClawKeeper provides comprehensive safety through skills, plugins, and watchers to prevent data leakage, privilege escalation, and other system-level threats. Link: arxiv.org/abs/2603.24414 Source: HuggingFace daily (Apr 2), YouTube explainer video, arxivlens, HuggingFace paper page Why trending: Agent safety is urgent as autonomous coding agents go mainstream. Directly addresses security vulnerabilities in the tools we use daily.

10. Proactive Agent Research Environment (PARE)

Authors: Deepak Nathani, Cheng Zhang, Chang Huan, Jiaming Shan et al. Summary: Proactive agents that anticipate user needs lack realistic evaluation frameworks. Existing approaches model apps as flat tool-calling APIs, missing the stateful, sequential nature of real user interaction. PARE simulates active users to properly evaluate proactive assistants. Link: arxiv.org/abs/2604.00842 Source: HuggingFace daily (Apr 2), arxiv cs.AI Why trending: Proactive assistants are the next frontier beyond reactive chatbots. This benchmark provides the missing evaluation infrastructure.

11. Paper Reconstruction Evaluation: Evaluating AI-Written Papers

Authors: Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe et al. Summary: First systematic evaluation framework for quantifying quality and risks of papers written by modern coding agents. Measures presentation quality and hallucination rates in AI-generated scientific manuscripts. Timely as AI paper-writing tools proliferate. Link: arxiv.org/abs/2604.01128 Source: HuggingFace daily (Apr 2), arxiv cs.CL Why trending: AI paper writing is a growing concern in academia. This paper provides the first rigorous framework to measure the problem.

12. Do Phone-Use Agents Respect Your Privacy?

Authors: Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye et al. Summary: Studies whether phone-use agents leak private data while completing benign mobile tasks. Introduces MyPhoneBench with verifiable evaluation of what data agents actually type into form fields during execution. Operationalizes privacy-compliant behavior for phone agents. Link: arxiv.org/abs/2604.00986 Source: HuggingFace daily (Apr 2), arxiv cs.CR Why trending: Phone agents are being deployed rapidly. This paper shows they leak private data in measurable ways – a concrete privacy risk.

13. A Survey of On-Policy Distillation for Large Language Models

Authors: Mingyang Song, Mao Zheng Summary: Comprehensive survey of on-policy knowledge distillation: training student models on their own outputs rather than static teacher data. Addresses the train-test mismatch problem in off-policy distillation. Covers the entire landscape of techniques for transferring reasoning from frontier LLMs to smaller models. Link: arxiv.org/abs/2604.00626 Source: HuggingFace daily (Apr 2), arxiv cs.CL Why trending: On-policy distillation is how most open-weight reasoning models are being trained. This survey organizes a fast-moving field.

14. Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines

Authors: Jingjie Ning, Xueqi Li, Chengyu Yu Summary: Multi-LLM revision pipelines (model A drafts, model B revises) are assumed to work via genuine error correction. This paper decomposes second-pass gains into three additive components: re-solving, scaffold, and content. Finds that much of the improvement comes from re-solving, not revision. Link: arxiv.org/abs/2604.01029 Source: HuggingFace daily (Apr 2), arxiv cs.CL Why trending: Challenges a widespread assumption in LLM pipelines. If “revision” is mostly “re-solving,” current multi-model architectures may be overcomplicated.

15. Dynin-Omni: Omnimodal Unified Large Diffusion Language Model

Authors: Jaeik Kim, Woojin Kim, Jihwan Hong, Yejoon Lee et al. Summary: First masked-diffusion-based omnimodal foundation model unifying text, image, speech understanding and generation with video understanding in a single architecture. Unlike autoregressive models that serialize modalities or compositional models needing external decoders. Link: arxiv.org/abs/2604.00007 Source: HuggingFace daily (Apr 1 carryover), arxiv cs.CV Why trending: “First” masked-diffusion omnimodal model. Unifying all modalities under diffusion rather than autoregression is a distinct architectural bet.

16. MonitorBench: CoT Monitorability in LLMs

Authors: Han Wang, Yifan Sun, Brian Ko et al. Summary: First comprehensive benchmark for studying when LLM chain-of-thought is not causally responsible for outputs. Open-source, measures the “reduced monitorability” problem. Link: arxiv.org/abs/2603.28590 Source: HuggingFace daily (Apr 1), OpenAI blog post (openai.com/index/evaluating-chain-of-thought-monitorability/) Why trending: OpenAI companion blog post. CoT faithfulness is a top safety concern for reasoning models. Still accumulating attention.

17. All Roads Lead to Rome: Incentivizing Divergent Thinking in VLMs

Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He et al. Summary: Studies why GRPO works for VLM reasoning and identifies its limitations. Proposes incentivizing divergent thinking – generating multiple distinct reasoning paths – to improve VLM capabilities beyond what standard RL achieves. Link: arxiv.org/abs/2604.00479 Source: HuggingFace daily (Apr 1 carryover), arxiv cs.CV Why trending: GRPO mechanism analysis is in high demand. Extends the reasoning revolution from text to vision-language models.

18. MMaDA-VLA: Large Diffusion Vision-Language-Action Model

Authors: Yang Liu, Pengxiang Ding, Tengyue Jiang, Xudong Wang et al. Summary: Vision-Language-Action model for robot control using diffusion instead of autoregression. Avoids temporal inconsistency and long-horizon error accumulation that plague autoregressive VLA models. Unified multi-modal instruction and generation. Link: arxiv.org/abs/2603.25406 Source: HuggingFace daily (Apr 2), arxiv cs.RO Why trending: VLA models for robotics are a growing research area. Diffusion-based approach addresses known failure modes.

19. UniMixer: Scaling Laws in Recommendation Systems

Authors: Mingming Ha, Guanchen Wang, Linxun Chen et al. Summary: Unifies three mainstream architectures for recommendation scaling (attention-based, TokenMixer-based, factorization-machine-based) into a single architecture. Studies scaling laws governing the relationship between performance and parameters/FLOPs for recommender systems. Link: arxiv.org/abs/2604.00590 Source: HuggingFace daily (Apr 2), arxiv cs.IR Why trending: Scaling laws for recommendation is highly relevant to industry. Unifying architectures under one framework enables systematic comparison.

20. Understand and Accelerate Memory Processing for Disaggregated LLM Inference

Authors: Zifan He, Rui Ma, Yizhou Sun, Jason Cong (UCLA) Summary: Unifies sparse attention, RAG, and compressed contextual memory into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Attend. Shows these optimizations share common structure and can be jointly accelerated for disaggregated inference. Link: arxiv.org/abs/2603.29002 Source: HuggingFace daily (Apr 2), arxiv cs.AR Why trending: Infrastructure paper from Jason Cong’s lab (UCLA). Unifying the memory optimization landscape is immediately useful for serving teams.


Honorable Mentions

  • LongCat-Next (2603.27538) – Meituan’s discrete-token multimodal model (still trending, GitHub + website)
  • Falcon Perception (2603.27365) – TII early-fusion perception (HF blog post)
  • FIPO (2603.19835) – RL for deep reasoning via future-KL (v3)
  • ViGoR-Bench (2603.25823) – Benchmarking visual generative models as zero-shot reasoners
  • Vision2Web (2603.26648) – Hierarchical benchmark for visual website development
  • GaussianGPT (2603.26661) – Autoregressive 3D Gaussian scene generation
  • Think, Act, Build (2604.00528) – Agentic framework for zero-shot 3D visual grounding

Methodology

SourceURLSignalWhat Was Found
HuggingFace Daily (Apr 2)huggingface.co/api/daily_papers?date=2026-04-02New submissions today26 papers
HuggingFace Daily (Apr 1)huggingface.co/api/daily_papers?date=2026-04-01Yesterday’s papers, carryover43 papers
alphaxiv.orgalphaxiv.orgCross-platform discussionTerminal Agents, Self-Distillation, HippoCamp
YouTubeyoutube.comVideo explainersTerminal Agents (StarShell), ClawKeeper
CACM Blogcacm.acm.orgEnterprise automation discussionMulti-agent systems article referencing terminal agents
wand.ai Blogwand.ai/blogAI analysisBrevity constraints / reasoning length paradox
richlyai Blogrichlyai.com/blogAI analysisBrevity boosts LLM performance
EmergentMindemergentmind.comCross-platform discussionMiroEval
MiroEval Websitemiroeval.github.ioProject pageDeep research agent benchmark
OpenAI Blogopenai.com/index/evaluating-chain-of-thought-monitorability/Company postMonitorBench companion (still trending from Apr 1)
NeurIPS 2024neurips.cc/virtual/2024/poster/96833Prior workOriginal YOCO paper (YOCO-U extends it)
Web search (Reddit, X, HN)VariousSocial media buzzGeneric results, no specific paper-level hits
DeepMind, OpenAI, Meta blogsVariousCompany announcementsNo major new paper releases today
arxiv listingsarxiv.org/list/cs.*Raw submissionsUsed for abstract retrieval

Ranking criteria: Cross-platform presence > company blog/announcement > HF daily position > topic relevance.

*Generated by JarvisNext report: April 3, 2026 at 10:00 AM PT*