Daily AI Papers — June 30, 2026

1. Agentic Abstention: Do Agents Know When to Stop Instead of Act?

Authors: Han Luo, Bingbing Wen, Lucy Lu Wang

Summary: This paper defines “Agentic Abstention” — the problem of when an LLM agent should stop taking actions rather than continuing under uncertainty. Unlike single-turn abstention, this is a sequential decision problem evaluated across web shopping, terminal environments, and QA tasks, benchmarking 13 LLM-as-agent systems on over 28,000 tasks.

Why Trending: #1 on HuggingFace with 112 upvotes — the most-upvoted paper today. Hits a critical practical gap: agents that never stop (wasting compute) vs. agents that give up too early. Directly relevant to production agentic systems.

Sources: HuggingFace (#1, 112 upvotes)

arXiv: arxiv.org/abs/2606.28733


2. LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing

Authors: Xinyu Wang, Chongbo Zhao, Fangneng Zhan, Yue Ma

Summary: LiveEdit introduces a causal, frame-by-frame streaming video editing framework that maintains stable backgrounds and non-edited regions over time while achieving real-time latency. The system uses a three-stage distillation pipeline converting a bidirectional foundation model into a unidirectional streaming editor, with an AR-oriented mask cache for long-horizon content preservation.

Why Trending: 65 HF upvotes. Real-time diffusion-based video editing is a hot space; this directly addresses the two bottlenecks (stability + latency) that limit practical AR/interactive deployment.

Sources: HuggingFace (#2, 65 upvotes)

arXiv: arxiv.org/abs/2606.26740


3. Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent

Authors: Lei Bai, Zongsheng Cao, Yang Chen, Zhiyao Cui, Shangheng Du, Yue Fan, et al.

Summary: Introduces Agents-A1, a 35B Mixture-of-Experts agentic model that matches trillion-parameter-level performance by scaling agent horizon (trajectory length and heterogeneous abilities) rather than model size. Training has three stages: supervised fine-tuning, training domain-level teacher models, and multi-teacher domain-routed on-policy distillation.

Why Trending: 63 HF upvotes. A provocative claim: a 35B model matching much larger models by “scaling horizon” rather than parameters. Directly challenges the parameter-scaling orthodoxy.

Sources: HuggingFace (#3, 63 upvotes)

arXiv: arxiv.org/abs/2606.30616


4. TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

Authors: Shoufa Chen, Luyuan Wang, Xuan Yang, Zhiheng Liu, Yuren Cong, Yuanfeng Ji

Summary: TUA-Bench presents a comprehensive evaluation suite for agents operating in terminal environments across diverse digital activities and specialized software workflows — extending beyond coding to general computer use. It uses an execution-based scoring protocol and reveals significant performance gaps among frontier agents on shell-based tasks.

Why Trending: 42 HF upvotes. Terminal-use agents are increasingly important as coding agents expand to full computer use; this fills a benchmark gap between GUI-only and code-only evaluations.

Sources: HuggingFace (#4, 42 upvotes)

arXiv: arxiv.org/abs/2606.28480


5. ReFreeKV: Towards Threshold-Free KV Cache Compression

Authors: Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu

Summary: ReFreeKV eliminates the domain-specific threshold required by existing KV cache pruning methods, which break down when input distribution shifts. The approach uses adaptive budget allocation to maintain full-cache performance across diverse datasets and model sizes without manual threshold tuning.

Why Trending: 42 HF upvotes (tied #4). KV cache compression is critical for long-context LLM inference; removing the fragile threshold dependency is a practical engineering win.

Sources: HuggingFace (#5, 42 upvotes)

arXiv: arxiv.org/abs/2502.16886


6. Beyond IID: How General Are Tabular Foundation Models, Really?

Authors: Lennart Purucker, Andrej Tschalzev, Nick Erickson, Gioia Blayer, David Holzmüller, Alan Arazi

Summary: This paper stress-tests tabular foundation models (TabFMs) under non-IID conditions — large-scale, high-dimensional, out-of-distribution datasets — finding that traditional tree-based methods still outperform newer TabFMs on complex, real-world data. Introduces the Data Foundry benchmark to standardize cross-discipline evaluation.

Why Trending: 35 HF upvotes. Tabular FMs have been heavily hyped; this paper shows their generality is narrower than claimed, especially outside curated academic benchmarks.

Sources: HuggingFace (#6, 35 upvotes)

arXiv: arxiv.org/abs/2606.30410


7. Trimming the Long-Tail of Visual World Modeling Evaluation

Authors: Bingxuan Li, Yining Hong, Cheng Qian, Hyeonjeong Ha, Jiateng Liu, Zhenhailong Wang

Summary: Visual world models (image/video generators) look impressive on common physical interactions but struggle on rare, irregular, and impossible scenarios. This paper develops a structured evaluation framework covering “regular,” “unconventional,” and “impossible” scenario modes and shows current models lack genuine physical principle generalization.

Why Trending: 31 HF upvotes. Exposes a systematic blind spot in world model evaluation — standard benchmarks only test the head of the distribution, missing the long tail where physical reasoning matters most.

Sources: HuggingFace (#7, 31 upvotes)

arXiv: arxiv.org/abs/2606.24256


8. Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning

Authors: Hohin Kwan, Hongyu Li, Ray Zhang, Manyuan Zhang, Xianghao Kong, Anyi Rao

Summary: Video-MME-Logical introduces a controlled benchmark for evaluating whether MLLMs can reason over dynamic visual evidence through temporal-logical operations: state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition — going beyond frame-level object recognition.

Why Trending: 23 HF upvotes. As video understanding matures, evaluation needs to move from perception to reasoning; this benchmark makes that shift concrete and measurable.

Sources: HuggingFace (#8, 23 upvotes)

arXiv: arxiv.org/abs/2606.27828


9. AsyncOPD: How Stale Can On-Policy Distillation Be?

Authors: Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjun Kang, Sanghyun Park, Donghoon Kim

Summary: Asynchronous On-Policy Distillation (AsyncOPD) decouples rollout generation from learner updates to remove a systems bottleneck in LLM post-training, so the learner keeps training while rollouts are generated asynchronously. The paper analyzes how much rollout staleness the forward and reverse KL-divergence objectives can each tolerate and proposes Monte Carlo estimation corrections.

Why Trending: 23 HF upvotes. On-policy distillation is a common building block in modern RLHF-style post-training; making it asynchronous without degrading quality is a direct throughput win for large-scale post-training.

Sources: HuggingFace (#9, 23 upvotes)

arXiv: arxiv.org/abs/2606.24143
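
For readers unfamiliar with the forward vs. reverse KL distinction mentioned in the summary above, here is a minimal sketch on a toy three-token vocabulary. The two distributions are made-up numbers, and the code does not reproduce the paper's estimator or its staleness corrections; it only shows that the two directions are different quantities, each an expectation under a different distribution.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Toy next-token distributions (hypothetical values, not from the paper).
teacher = [0.70, 0.25, 0.05]
student = [0.40, 0.30, 0.30]

# Forward KL averages the log-ratio under the teacher's distribution.
forward_kl = kl_divergence(teacher, student)
# Reverse KL averages the log-ratio under the student's own distribution.
reverse_kl = kl_divergence(student, teacher)

print(f"forward KL(teacher || student) = {forward_kl:.4f}")
print(f"reverse KL(student || teacher) = {reverse_kl:.4f}")
```

Because reverse KL is an expectation over the student's own samples, it is the direction whose sampling distribution moves as the student is updated, which is one plausible reason staleness could affect the two objectives differently.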


10. Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

Authors: Sunqi Fan, Qingle Liu, Runqi Yin, Meng-Hao Guo, Shuojin Yang

Summary: Proposes a generalized keyframe extraction method that bridges static Video QA benchmarks and dynamic video-guided agentic tasks (e.g., following GUI tutorials). The approach selects task-relevant frames by jointly considering task relevance and scene dynamics, improving MLLM performance on both evaluation paradigms.

Why Trending: 21 HF upvotes. Connecting QA-style video understanding to agentic task execution is a key step toward video-instructed agents that can learn from tutorials.

Sources: HuggingFace (#10, 21 upvotes)

arXiv: arxiv.org/abs/2606.29445


11. One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

Authors: Philip Zmushko, Egor Petrov, Nursultan Abdullaev, Mikhail Khrushchev, Samuel Horváth

Summary: Demonstrates that asynchronous pipeline parallelism (PipeDream-2BW) can match synchronous training quality through careful optimizer selection (Muon over AdamW) and error feedback correction, even with one-step gradient staleness. This eliminates pipeline bubble idle time without the convergence penalties traditionally assumed.

Why Trending: 17 HF upvotes. GPU utilization during LLM pretraining is a multi-million dollar problem; removing the convergence barrier to async pipeline parallelism is a significant training infrastructure advance.

Sources: HuggingFace (#11, 17 upvotes)

arXiv: arxiv.org/abs/2606.30634
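
To make "one-step gradient staleness" concrete, the sketch below runs plain gradient descent on a toy quadratic, once with fresh gradients and once with gradients computed on the previous step's weights. This is an illustrative assumption-laden toy: it uses neither Muon nor error feedback and is not the paper's setup. It only shows the traditional concern that staleness narrows the range of stable learning rates.

```python
def delayed_gradient_descent(lr, steps, delay):
    """Minimize f(w) = 0.5 * w**2 with gradients computed on weights `delay` steps old.

    Returns the full list of iterates, starting from w = 1.0.
    """
    history = [1.0]
    for _ in range(steps):
        # The gradient of 0.5 * w**2 is w; evaluate it at the stale weights.
        stale_w = history[max(0, len(history) - 1 - delay)]
        history.append(history[-1] - lr * stale_w)
    return history

# Small learning rate: both variants shrink toward the minimum at 0.
sync_small = delayed_gradient_descent(lr=0.1, steps=100, delay=0)
async_small = delayed_gradient_descent(lr=0.1, steps=100, delay=1)

# Larger learning rate: the synchronous run still converges on this toy problem,
# while the one-step-stale run oscillates with growing amplitude.
sync_large = delayed_gradient_descent(lr=1.5, steps=50, delay=0)
async_large = delayed_gradient_descent(lr=1.5, steps=50, delay=1)

print(abs(sync_small[-1]), abs(async_small[-1]))
print(abs(sync_large[-1]), max(abs(w) for w in async_large[-3:]))
```

On this toy problem the stale variant is only stable for learning rates below 1, versus below 2 for the synchronous one.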


12. Monte Carlo Energy Aggregation for Mobile 3D Gaussian Splatting (Flux-GS)

Authors: Xiaobiao Du, YuAn Wang, Hao Li, Bosheng Wang, Xun Sun, Xin Yu

Summary: Flux-GS enables real-time, high-fidelity 3D Gaussian Splatting on mobile platforms by replacing high-order Spherical Harmonics with a Monte Carlo Specular Energy Aggregator in latent space, combined with Attribute-Conditioned SH Enhancement and multi-view gradient-based densification strategies for mobile-grade efficiency.

Why Trending: 16 HF upvotes. 3DGS on mobile devices is a key frontier for AR applications; this paper directly addresses the SH overhead bottleneck that has prevented practical mobile deployment.

Sources: HuggingFace (#12, 16 upvotes)

arXiv: arxiv.org/abs/2606.30017


13. One Model, Many Latencies: Universal Speech Enhancement for Diverse Real-Time Applications

Authors: Szu-Wei Fu, Rong Chao, Xuesong Yang, Sung-Feng Huang, Ante Jukić, Yu Tsao

Summary: A single universal speech enhancement model supports configurable algorithmic and computational latency constraints through parallel convolutional layers and an early-exit mechanism, replacing the need for separate models per latency budget. A two-stage training approach (first a shared encoder, then branching decoders) eases the transition from a shared decoder to multiple decoders.

Why Trending: 15 HF upvotes. Speech enhancement with controllable latency is practically important for telephony, hearing aids, and AR — one model replacing an entire fleet is a deployment win.

Sources: HuggingFace (#13, 15 upvotes)

arXiv: arxiv.org/abs/2606.25621


14. TACO: Tool-Augmented Credit Optimization for Agentic Tool Use

Authors: Mingkuan Feng, Jinyang Wu, Hao Gu, Fangrui Lv, Ruihan Jin, Chuyuan Zhang

Summary: TACO addresses a core RL training problem for code-tool agents: outcome-only rewards cannot distinguish useful, redundant, or misleading tool calls. The method introduces Differential Answer-Probe Reward (per-tool contribution credit) and Outcome-Gated Advantage Routing (final outcome distribution), enabling more precise credit assignment in multimodal agentic RL.

Why Trending: 14 HF upvotes. Credit assignment in multi-step agentic tool use is a fundamental research problem; TACO offers a principled solution for tool-heavy GRPO-style training.

Sources: HuggingFace (#14, 14 upvotes)

arXiv: arxiv.org/abs/2606.30251


15. OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Authors: Mengqi Yuan, Zilong Zhou, Xinzhuang Xiong, Weiming Wu, Jiayang Sun, Jiamin Song

Summary: OSWorld 2.0 introduces 108 long-horizon computer-use workflows across everyday and professional tasks, exposing limitations of frontier agents in cross-source reasoning, implicit-state inference, and visual-spatial precision. It also includes safety reports and a token efficiency metric alongside binary-completion scoring.

Why Trending: 13 HF upvotes. The successor to OSWorld, one of the most influential computer use benchmarks; the “long-horizon” focus directly challenges the current generation of Claude/GPT-4o computer-use agents.

Sources: HuggingFace (#15, 13 upvotes)

arXiv: arxiv.org/abs/2606.29537


16. GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots

Authors: Sunqi Fan, Lingshan Chen, Runqi Yin, Qingle Liu, Yongming Rao, Meng-Hao Guo

Summary: GUICrafter trains GUI agents using massive unannotated screenshots through a two-stage curriculum learning framework: first learning visual grounding from web screenshots without labels, then refining with RL calibration. Eliminates the expensive annotation bottleneck that limits GUI agent data collection.

Why Trending: 10 HF upvotes. Annotation cost is the primary bottleneck for GUI agents; weakly-supervised approaches that leverage the web’s billions of existing screenshots could dramatically scale training data.

Sources: HuggingFace (#16, 10 upvotes)

arXiv: arxiv.org/abs/2606.29705


17. Interleaved Speech Language Models Latently Work In Text

Authors: Talia Sternberg, Gallil Maimon, Yossi Adi

Summary: Using logit lens analysis on interleaved speech-text LMs, this paper discovers that these models implicitly transcribe speech in intermediate layers (text tokens become decodable) before predicting in text space, then transform back to speech tokens — revealing a text-mediated internal computation pathway even in speech-native models.

Why Trending: 9 HF upvotes. Mechanistic interpretability applied to multimodal speech models; the finding that speech models “think in text” internally has significant implications for speech model design and evaluation.

Sources: HuggingFace (#17, 9 upvotes)

arXiv: arxiv.org/abs/2606.22473
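
The logit lens mentioned in the summary above can be sketched as follows: take a hidden state from an intermediate layer, project it through the model's output (unembedding) matrix, and read off the most likely token. Everything here is a hypothetical toy. The vocabulary, the matrix, and the per-layer hidden states are hand-picked so the example runs; they are not measurements from any model, and a real logit lens would usually apply the model's final normalization layer first.

```python
def logit_lens(hidden_state, unembedding, vocab):
    """Project one hidden state through the unembedding matrix and return the top token."""
    logits = [sum(h * w for h, w in zip(hidden_state, row)) for row in unembedding]
    best = max(range(len(vocab)), key=lambda i: logits[i])
    return vocab[best]

# Hypothetical mixed vocabulary: two speech units and two text tokens.
vocab = ["<speech_a>", "<speech_b>", "cat", "dog"]

# Hypothetical unembedding matrix, one row of weights per vocabulary entry.
unembedding = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

# Made-up hidden states for one position at three depths of the network.
hidden_by_layer = [
    [0.9, 0.1, 0.0],  # early layer
    [0.1, 0.2, 0.8],  # middle layer
    [0.2, 0.9, 0.1],  # final layer
]

decoded = [logit_lens(h, unembedding, vocab) for h in hidden_by_layer]
print(decoded)
```

Applied to a real interleaved speech-text model, the question is which layers, if any, decode to text tokens when the input and output are speech.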


18. DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model

Authors: Daniyel Ayupov, Artur Markov-Tsoy

Summary: DreamForge-World adapts a Wan2.1-based autoregressive video stack with a residual action pathway (inspired by Matrix-Game) to enable real-time interactive world simulation on consumer GPUs. Features mid-stream reprompting and dual-view operation for interactive rollouts without dedicated server hardware.

Why Trending: 8 HF upvotes. A consumer-GPU-capable controllable world model is a step toward democratized game/simulation generation; notable for running on accessible hardware.

Sources: HuggingFace (#18, 8 upvotes)

arXiv: arxiv.org/abs/2606.30292


19. Apple Neural Engine: Architecture, Programming, and Performance

Authors: Spencer H. Bryngelson

Summary: A detailed reverse-engineered technical guide to Apple’s Neural Engine (ANE) — the fixed-function matrix accelerator shipping in all Apple silicon since A11/M1. Documents the full datapath, roofline model, dispatch route below Core ML, on-disk program format, weight compression, and kernel driver/firmware internals through direct measurement and static analysis.

Why Trending: Front page of Hacker News. The ANE has been a black box since 2017; this is the most comprehensive public documentation of its internals, enabling direct low-level neural network programming outside the Core ML abstraction layer.

Sources: Hacker News (front page)

arXiv: arxiv.org/abs/2606.22283
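
As background for the roofline model mentioned in the summary above, here is a minimal sketch of the standard roofline bound: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The peak and bandwidth figures are hypothetical placeholders, not the guide's measured ANE numbers.

```python
def attainable_ops_per_second(peak_ops_per_s, bandwidth_bytes_per_s, ops_per_byte):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)

# Hypothetical accelerator, NOT measured Apple Neural Engine figures.
PEAK = 10e12        # 10 tera-ops per second
BANDWIDTH = 100e9   # 100 GB/s

# Arithmetic intensity (ops per byte) at which the two limits meet.
ridge_point = PEAK / BANDWIDTH

# A kernel doing 10 ops per byte moved is limited by memory bandwidth.
memory_bound = attainable_ops_per_second(PEAK, BANDWIDTH, ops_per_byte=10.0)
# A kernel doing 500 ops per byte moved is limited by peak compute.
compute_bound = attainable_ops_per_second(PEAK, BANDWIDTH, ops_per_byte=500.0)

print(ridge_point, memory_bound, compute_bound)
```

Kernels to the left of the ridge point benefit from reducing memory traffic (for example via weight compression), while kernels to the right are limited by the compute units themselves.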


20. How Good Can Linear Models Be for Time-Series Forecasting?

Authors: Lang Huang, Jinglue Xu, Luke Darlow

Summary: Challenges the trend toward large transformer and foundation models for time-series forecasting, showing that simple ridge regression models with carefully tuned preprocessing and training choices (context length tuning, local normalization, regularization, and augmentation) close most of the accuracy gap at dramatically lower cost.

Why Trending: 6 HF upvotes. A contrarian result with direct practical implications: most of the accuracy gains attributed to large forecasting models may be obtainable from tuned linear baselines, questioning the ROI of scaling in this domain.

Sources: HuggingFace (#21, 6 upvotes)

arXiv: arxiv.org/abs/2606.27282
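
To give a feel for the ingredients named in the summary above, here is a minimal, dependency-free sketch of ridge regression on locally normalized context windows. It is an illustration under assumptions, not the paper's recipe: the helper names, the toy linear-trend series, the context length of 4, and the regularization strength are all hypothetical choices.

```python
import math

def normalize_window(window):
    """Local (per-window) normalization: zero mean, unit variance."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window)) or 1.0
    return [(v - mean) / std for v in window], mean, std

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ridge(series, context, lam):
    """Fit ridge weights mapping a normalized context window to the next normalized value."""
    rows, targets = [], []
    for start in range(len(series) - context):
        x, mean, std = normalize_window(series[start:start + context])
        rows.append(x)
        # The target is scaled with the statistics of its own context window.
        targets.append((series[start + context] - mean) / std)
    # Normal equations: (X^T X + lam * I) w = X^T y
    gram = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
             for j in range(context)] for i in range(context)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(context)]
    return solve_linear_system(gram, rhs)

def forecast_next(series, weights):
    """Predict the value that follows `series` from its last len(weights) points."""
    x, mean, std = normalize_window(series[-len(weights):])
    return mean + std * sum(w * v for w, v in zip(weights, x))

series = [2.0 * t + 5.0 for t in range(40)]  # toy linear trend, not real data
weights = fit_ridge(series, context=4, lam=1e-3)
prediction = forecast_next(series, weights)
print(round(prediction, 3))
```

Local normalization is what lets the same weights handle windows at very different levels of the series: every window is mapped to the same scale before the linear model sees it, and the forecast is mapped back afterwards.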