Daily AI Papers — June 30, 2026
Published:
1. Agentic Abstention: Do Agents Know When to Stop Instead of Act?
Authors: Han Luo, Bingbing Wen, Lucy Lu Wang
Summary: This paper defines “Agentic Abstention” — the problem of when an LLM agent should stop taking actions rather than continuing under uncertainty. Unlike single-turn abstention, this is a sequential decision problem evaluated across web shopping, terminal environments, and QA tasks, benchmarking 13 LLM-as-agent systems on over 28,000 tasks.
Why Trending: #1 on HuggingFace with 112 upvotes — the most-upvoted paper today. Hits a critical practical gap: agents that never stop (wasting compute) vs. agents that give up too early. Directly relevant to production agentic systems.
Sources: HuggingFace (#1, 112 upvotes)
arXiv: arxiv.org/abs/2606.28733
2. LiveEdit: Towards Real-Time Diffusion-Based Streaming Video Editing
Authors: Xinyu Wang, Chongbo Zhao, Fangneng Zhan, Yue Ma
Summary: LiveEdit introduces a causal, frame-by-frame streaming video editing framework that maintains stable backgrounds and non-edited regions over time while achieving real-time latency. The system uses a three-stage distillation pipeline converting a bidirectional foundation model into a unidirectional streaming editor, with an AR-oriented mask cache for long-horizon content preservation.
Why Trending: 65 HF upvotes. Real-time diffusion-based video editing is a hot space; this directly addresses the two bottlenecks (stability + latency) that limit practical AR/interactive deployment.
Sources: HuggingFace (#2, 65 upvotes)
arXiv: arxiv.org/abs/2606.26740
3. Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent
Authors: Lei Bai, Zongsheng Cao, Yang Chen, Zhiyao Cui, Shangheng Du, Yue Fan, et al.
Summary: Introduces Agents-A1, a 35B Mixture-of-Experts Agentic Model that matches trillion-parameter-level performance by scaling agent horizon (trajectory length and heterogeneous abilities) rather than model size. Uses a three-stage training pipeline: supervised fine-tuning, domain-level teacher models, and multi-teacher domain-routed on-policy distillation.
Why Trending: 63 HF upvotes. Provocative claim — a 35B model beating much larger models by “scaling horizon” rather than parameters. Directly challenges the parameter-scaling orthodoxy.
Sources: HuggingFace (#3, 63 upvotes)
arXiv: arxiv.org/abs/2606.30616
4. TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents
Authors: Shoufa Chen, Luyuan Wang, Xuan Yang, Zhiheng Liu, Yuren Cong, Yuanfeng Ji
Summary: TUA-Bench presents a comprehensive evaluation suite for agents operating in terminal environments across diverse digital activities and specialized software workflows — extending beyond coding to general computer use. It uses an execution-based scoring protocol and reveals significant performance gaps among frontier agents on shell-based tasks.
Why Trending: 42 HF upvotes. Terminal-use agents are increasingly important as coding agents expand to full computer use; this fills a benchmark gap between GUI-only and code-only evaluations.
Sources: HuggingFace (#4, 42 upvotes)
arXiv: arxiv.org/abs/2606.28480
5. ReFreeKV: Towards Threshold-Free KV Cache Compression
Authors: Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu
Summary: ReFreeKV eliminates the domain-specific threshold required by existing KV cache pruning methods, which break down when input distribution shifts. The approach uses adaptive budget allocation to maintain full-cache performance across diverse datasets and model sizes without manual threshold tuning.
Why Trending: 42 HF upvotes (tied #4). KV cache compression is critical for long-context LLM inference; removing the fragile threshold dependency is a practical engineering win.
Sources: HuggingFace (#5, 42 upvotes)
arXiv: arxiv.org/abs/2502.16886
6. Beyond IID: How General Are Tabular Foundation Models, Really?
Authors: Lennart Purucker, Andrej Tschalzev, Nick Erickson, Gioia Blayer, David Holzmüller, Alan Arazi
Summary: This paper stress-tests tabular foundation models (TabFMs) under non-IID conditions — large-scale, high-dimensional, out-of-distribution datasets — finding that traditional tree-based methods still outperform newer TabFMs on complex, real-world data. Introduces the Data Foundry benchmark to standardize cross-discipline evaluation.
Why Trending: 35 HF upvotes. Tabular FMs have had big hype; this paper shows the emperor’s clothes are thinner than claimed, especially outside curated academic benchmarks.
Sources: HuggingFace (#6, 35 upvotes)
arXiv: arxiv.org/abs/2606.30410
7. Trimming the Long-Tail of Visual World Modeling Evaluation
Authors: Bingxuan Li, Yining Hong, Cheng Qian, Hyeonjeong Ha, Jiateng Liu, Zhenhailong Wang
Summary: Visual world models (image/video generators) look impressive on common physical interactions but struggle on rare, irregular, and impossible scenarios. This paper develops a structured evaluation framework covering “regular,” “unconventional,” and “impossible” scenario modes and shows current models lack genuine physical principle generalization.
Why Trending: 31 HF upvotes. Exposes a systematic blind spot in world model evaluation — standard benchmarks only test the head of the distribution, missing the long tail where physical reasoning matters most.
Sources: HuggingFace (#7, 31 upvotes)
arXiv: arxiv.org/abs/2606.24256
8. Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning
Authors: Hohin Kwan, Hongyu Li, Ray Zhang, Manyuan Zhang, Xianghao Kong, Anyi Rao
Summary: Video-MME-Logical introduces a controlled benchmark for evaluating whether MLLMs can reason over dynamic visual evidence through temporal-logical operations: state tracking, sequential counting, temporal ordering, dynamic spatiality, and structural composition — going beyond frame-level object recognition.
Why Trending: 23 HF upvotes. As video understanding matures, evaluation needs to move from perception to reasoning; this benchmark makes that shift concrete and measurable.
Sources: HuggingFace (#8, 23 upvotes)
arXiv: arxiv.org/abs/2606.27828
9. AsyncOPD: How Stale Can On-Policy Distillation Be?
Authors: Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjun Kang, Sanghyun Park, Donghoon Kim
Summary: Asynchronous On-Policy Distillation (AsyncOPD) decouples rollout generation from learner updates to eliminate the LLM post-training systems bottleneck, enabling continuous training while rollouts are generated asynchronously. The paper analyzes how much staleness KL-divergence (forward vs. reverse) can tolerate and proposes Monte Carlo estimation corrections.
Why Trending: 23 HF upvotes. On-policy distillation is the backbone of modern RLHF-style training; making it asynchronous without degrading quality is a direct throughput win for large-scale post-training.
Sources: HuggingFace (#9, 23 upvotes)
arXiv: arxiv.org/abs/2606.24143
10. Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction
Authors: Sunqi Fan, Qingle Liu, Runqi Yin, Meng-Hao Guo, Shuojin Yang
Summary: Proposes a generalized keyframe extraction method that bridges static Video QA benchmarks and dynamic video-guided agentic tasks (e.g., following GUI tutorials). The approach selects task-relevant frames by jointly considering task relevance and scene dynamics, improving MLLM performance on both evaluation paradigms.
Why Trending: 21 HF upvotes. Connecting QA-style video understanding to agentic task execution is a key step toward video-instructed agents that can learn from tutorials.
Sources: HuggingFace (#10, 21 upvotes)
arXiv: arxiv.org/abs/2606.29445
11. One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining
Authors: Philip Zmushko, Egor Petrov, Nursultan Abdullaev, Mikhail Khrushchev, Samuel Horváth
Summary: Demonstrates that asynchronous pipeline parallelism (PipeDream-2BW) can match synchronous training quality through careful optimizer selection (Muon over AdamW) and error feedback correction, even with one-step gradient staleness. This eliminates pipeline bubble idle time without the convergence penalties traditionally assumed.
Why Trending: 17 HF upvotes. GPU utilization during LLM pretraining is a multi-million dollar problem; removing the convergence barrier to async pipeline parallelism is a significant training infrastructure advance.
Sources: HuggingFace (#11, 17 upvotes)
arXiv: arxiv.org/abs/2606.30634
12. Monte Carlo Energy Aggregation for Mobile 3D Gaussian Splatting (Flux-GS)
Authors: Xiaobiao Du, YuAn Wang, Hao Li, Bosheng Wang, Xun Sun, Xin Yu
Summary: Flux-GS enables real-time, high-fidelity 3D Gaussian Splatting on mobile platforms by replacing high-order Spherical Harmonics with a Monte Carlo Specular Energy Aggregator in latent space, combined with Attribute-Conditioned SH Enhancement and multi-view gradient-based densification strategies for mobile-grade efficiency.
Why Trending: 16 HF upvotes. 3DGS on mobile devices is a key frontier for AR applications; this paper directly addresses the SH overhead bottleneck that has prevented practical mobile deployment.
Sources: HuggingFace (#12, 16 upvotes)
arXiv: arxiv.org/abs/2606.30017
13. One Model, Many Latencies: Universal Speech Enhancement for Diverse Real-Time Applications
Authors: Szu-Wei Fu, Rong Chao, Xuesong Yang, Sung-Feng Huang, Ante Jukić, Yu Tsao
Summary: A single universal speech enhancement model supports configurable algorithmic and computational latency constraints through parallel convolutional layers and an early-exit mechanism, replacing the need for separate models per latency budget. A two-stage training approach (shared encoder → decoder branching) enables smooth shared-to-multiple decoder transitions.
Why Trending: 15 HF upvotes. Speech enhancement with controllable latency is practically important for telephony, hearing aids, and AR — one model replacing an entire fleet is a deployment win.
Sources: HuggingFace (#13, 15 upvotes)
arXiv: arxiv.org/abs/2606.25621
14. TACO: Tool-Augmented Credit Optimization for Agentic Tool Use
Authors: Mingkuan Feng, Jinyang Wu, Hao Gu, Fangrui Lv, Ruihan Jin, Chuyuan Zhang
Summary: TACO addresses a core RL training problem for code-tool agents: outcome-only rewards cannot distinguish useful, redundant, or misleading tool calls. The method introduces Differential Answer-Probe Reward (per-tool contribution credit) and Outcome-Gated Advantage Routing (final outcome distribution), enabling more precise credit assignment in multimodal agentic RL.
Why Trending: 14 HF upvotes. Credit assignment in multi-step agentic tool use is a fundamental research problem; TACO offers a principled solution for tool-heavy GRPO-style training.
Sources: HuggingFace (#14, 14 upvotes)
arXiv: arxiv.org/abs/2606.30251
15. OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks
Authors: Mengqi Yuan, Zilong Zhou, Xinzhuang Xiong, Weiming Wu, Jiayang Sun, Jiamin Song
Summary: OSWorld 2.0 introduces 108 long-horizon computer-use workflows across everyday and professional tasks, exposing limitations of frontier agents in cross-source reasoning, implicit-state inference, and visual-spatial precision. It also includes safety reports and a token efficiency metric alongside binary-completion scoring.
Why Trending: 13 HF upvotes. The successor to OSWorld, one of the most influential computer use benchmarks; the “long-horizon” focus directly challenges the current generation of Claude/GPT-4o computer-use agents.
Sources: HuggingFace (#15, 13 upvotes)
arXiv: arxiv.org/abs/2606.29537
16. GUICrafter: Weakly-Supervised GUI Agent Leveraging Massive Unannotated Screenshots
Authors: Sunqi Fan, Lingshan Chen, Runqi Yin, Qingle Liu, Yongming Rao, Meng-Hao Guo
Summary: GUICrafter trains GUI agents using massive unannotated screenshots through a two-stage curriculum learning framework: first learning visual grounding from web screenshots without labels, then refining with RL calibration. Eliminates the expensive annotation bottleneck that limits GUI agent data collection.
Why Trending: 10 HF upvotes. Annotation cost is the primary bottleneck for GUI agents; weakly-supervised approaches that leverage the web’s billions of existing screenshots could dramatically scale training data.
Sources: HuggingFace (#16, 10 upvotes)
arXiv: arxiv.org/abs/2606.29705
17. Interleaved Speech Language Models Latently Work In Text
Authors: Talia Sternberg, Gallil Maimon, Yossi Adi
Summary: Using logit lens analysis on interleaved speech-text LMs, this paper discovers that these models implicitly transcribe speech in intermediate layers (text tokens become decodable) before predicting in text space, then transform back to speech tokens — revealing a text-mediated internal computation pathway even in speech-native models.
Why Trending: 9 HF upvotes. Mechanistic interpretability applied to multimodal speech models; the finding that speech models “think in text” internally has significant implications for speech model design and evaluation.
Sources: HuggingFace (#17, 9 upvotes)
arXiv: arxiv.org/abs/2606.22473
18. DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model
Authors: Daniyel Ayupov, Artur Markov-Tsoy
Summary: DreamForge-World adapts a Wan2.1-based autoregressive video stack with a residual action pathway (inspired by Matrix-Game) to enable real-time interactive world simulation on consumer GPUs. Features mid-stream reprompting and dual-view operation for interactive rollouts without dedicated server hardware.
Why Trending: 8 HF upvotes. A consumer-GPU-capable controllable world model is a step toward democratized game/simulation generation; notable for running on accessible hardware.
Sources: HuggingFace (#18, 8 upvotes)
arXiv: arxiv.org/abs/2606.30292
19. Apple Neural Engine: Architecture, Programming, and Performance
Authors: Spencer H. Bryngelson
Summary: A detailed reverse-engineered technical guide to Apple’s Neural Engine (ANE) — the fixed-function matrix accelerator shipping in all Apple silicon since A11/M1. Documents the full datapath, roofline model, dispatch route below Core ML, on-disk program format, weight compression, and kernel driver/firmware internals through direct measurement and static analysis.
Why Trending: Front page of Hacker News. The ANE has been a black box since 2017; this is the most comprehensive public documentation of its internals, enabling direct low-level neural network programming outside the Core ML abstraction layer.
Sources: Hacker News (front page)
arXiv: arxiv.org/abs/2606.22283
20. How Good Can Linear Models Be for Time-Series Forecasting?
Authors: Lang Huang, Jinglue Xu, Luke Darlow
Summary: Challenges the trend toward large transformer and foundation models for time-series forecasting, showing that aggressive preprocessing optimizations — context length tuning, local normalization, regularization, and augmentation — close most of the accuracy gap at dramatically lower cost using simple Ridge regression models.
Why Trending: 6 HF upvotes. A contrarian result with direct practical implications: most of the accuracy gains attributed to large forecasting models may be obtainable from tuned linear baselines, questioning the ROI of scaling in this domain.
Sources: HuggingFace (#21, 6 upvotes)
arXiv: arxiv.org/abs/2606.27282
