Daily AI Papers — June 26, 2026
Published:
1. DanceOPD: On-Policy Generative Field Distillation
Authors: Wei Zhou, Xiongwei Zhu, Zelin Xu, Bo Dong, Lixue Gong, Yongyuan Liang, Meng Chu, Leigang Qu, Lingdong Kong, Wei Liu, Tat-Seng Chua (ByteDance Seed) arXiv: arxiv.org/abs/2606.27377
Summary: Modern image generation demands a single model that unifies text-to-image (T2I), local editing, and global editing — capabilities that naturally conflict during training. DanceOPD introduces an on-policy generative field distillation framework for flow-matching models that routes each sample to a specific capability field, queries a low-noise student-induced state, and trains with a simple velocity MSE objective. Each capability source is defined as a velocity field over a shared flow state space, enabling the student to compose expert capabilities from fields queried on its own rollout states — while also absorbing operator-defined fields like classifier-free guidance.
Sources: HuggingFace Daily Papers (#1, 52 upvotes) Why Trending: Top HF paper of the day from ByteDance Seed; addresses the long-standing challenge of unifying T2I, local editing, and global editing in one model without capability conflicts.
2. ViQ: Text-Aligned Visual Quantized Representations at Any Resolution
Authors: Xumin Yu, Zuyan Liu, Zhenyu Yang, Yuhao Dong, Shengsheng Qian, Jiwen Lu, Han Hu, Yongming Rao (Tsinghua University / Microsoft) arXiv: arxiv.org/abs/2606.27313
Summary: ViQ proposes a visual quantization framework that balances semantic richness and fine-grained detail in discrete image representations while supporting native-resolution inputs. The approach uses two stages: text-aligned pre-training (rich language supervision from a pretrained LM) and feature discretization via proximal representation learning and position-aware head-wise quantization. ViQ serves as a unified discrete visual representation for multimodal LLMs, matching continuous-feature encoders on multimodal tasks while achieving 20–70% training speedup.
Sources: HuggingFace Daily Papers (#2, 34 upvotes) Why Trending: Tackles the fundamental tension between discrete image tokens and rich visual semantics; shows 20–70% training acceleration across different LLM backbones.
3. OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning
Authors: Shuo Yang, Jinyang Wu, Zhengxi Lu, Yuhao Shen, Fan Zhang, Lang Feng arXiv: arxiv.org/abs/2606.26790
Summary: Outcome-based RL provides stable optimization for language agents but suffers from sparse trajectory-level rewards. OPID (On-Policy skill dIstillation for agentic reinforcement learning) extracts dense hindsight supervision from completed agent trajectories, distilling these into token-level skill signals without requiring external skill memories or privileged context. This avoids distribution mismatch between retrieved context and on-policy states, improving both training efficiency and final performance on agentic benchmarks.
Sources: HuggingFace Daily Papers (#3, 31 upvotes) Why Trending: Directly addresses the sparse-reward bottleneck in outcome-based RL for language agents — a key pain point for anyone training agents with RLVR or similar methods.
4. Qwen-Image-Agent: Bridging the Context Gap in Real-World Image Generation
Authors: Zekai Zhang, Jiahao Li, Jie Zhang, Kaiyuan Gao, Kun Yan, Lihan Jiang (Qwen Team) arXiv: arxiv.org/abs/2606.26907
Summary: Real-world image generation requests are often underspecified, implicit, or require up-to-date knowledge. Qwen-Image-Agent identifies this as the “Context Gap” and proposes an agentic pipeline that integrates planning, reasoning, web search, memory, and feedback to progressively construct the complete generation context before passing it to a T2I model. The framework is context-centric: it treats bridging missing context as a first-class objective rather than an afterthought.
Sources: HuggingFace Daily Papers (#4, 31 upvotes) Why Trending: From the Qwen team; addresses a practical failure mode of T2I models — users rarely provide the detailed, specific prompts these models need.
5. The Verification Horizon: No Silver Bullet for Coding Agent Rewards
Authors: Binghai Wang, Chenlong Zhang, Dayiheng Liu, Jiajun Zhang, Jiawei Chen, Mouxiang Chen (Qwen Team / Alibaba) arXiv: arxiv.org/abs/2606.26300
Summary: As foundation models generate increasingly complex code, reliably verifying candidate solutions has emerged as the harder problem. Every verifier is only a proxy for human intent, never the intent itself, making verification subject to Goodhart’s Law. This paper systematically analyzes the “Verification Horizon” — the fundamental limits of proxy-based reward signals for coding agents — and argues for adaptive verification systems that co-evolve with generative capabilities rather than relying on fixed test suites or execution environments.
Sources: HuggingFace Daily Papers (#5, 28 upvotes) Why Trending: Challenges a core assumption of modern coding-agent training: that verification is easy. As models improve, the verification problem becomes the bottleneck.
6. JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
Authors: Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian arXiv: arxiv.org/abs/2606.18394
Summary: Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying in parallel, but standard methods hit a scaling ceiling: increasing draft budget helps only when acceptance is high and drafting overhead is low. JetFlow breaks this ceiling by replacing causal (path-conditioned) drafting with parallel tree drafting, avoiding the causality-efficiency dilemma of autoregressive heads. The method combines efficient forward drafting with causal conditioning to improve acceptance rates and throughput across benchmarks.
Sources: HuggingFace Daily Papers (#6, 19 upvotes) Why Trending: Speculative decoding is a hot area for LLM inference optimization; JetFlow claims to break the scaling ceiling that has limited prior head-based methods.
7. Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It
Authors: Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao arXiv: arxiv.org/abs/2606.26027
Summary: Multi-step tool-use RL training frequently exhibits catastrophic collapse: performance abruptly drops and tool-invocation structure breaks down. This paper traces the failure to unexpected probability spikes in specific control tokens, disrupting the model’s ability to format tool calls. The fix is interleaved supervised fine-tuning + RL training, which stabilizes the probability distribution over control tokens while preserving the benefits of policy optimization for complex multi-step tool use.
Sources: HuggingFace Daily Papers (#7, 10 upvotes) Why Trending: Practical training-stability finding for anyone fine-tuning LLMs with RL on tool-use tasks — catastrophic collapse is a real and underdiagnosed problem.
8. Fast LeWorldModel
Authors: Yuntian Gao, Xiangyu Xu arXiv: arxiv.org/abs/2606.26217
Summary: Joint-Embedding Predictive Architectures (JEPAs) have emerged as a strong reconstruction-free foundation for visual world models, but planning with LeWorldModel (LeWM) requires repeated autoregressive rollout of a one-step latent transition model — expensive and error-prone over long horizons. Fast LeWorldModel replaces autoregressive rollout with parallel action-prefix prediction, substantially reducing computational cost and eliminating error accumulation during long-horizon visual planning.
Sources: HuggingFace Daily Papers (#8, 10 upvotes) Why Trending: Visual world models and JEPA-style architectures are gaining traction; Fast-LeWM addresses their primary inference bottleneck for planning.
9. GauntletBench: Re-evaluating the Capabilities of Agents Beyond Familiar Environments
Authors: Mykola Vysotskyi, Runqi Lin, Grzegorz Biziel, Michal Zakrzewski, Sebastian Montagna, Damian Rynczak arXiv: arxiv.org/abs/2606.14397
Summary: Current agent benchmarks are saturated by modern systems and fail to expose real capability limits. GauntletBench is a web-based benchmark that evaluates agent generalization in challenging out-of-distribution scenarios, revealing significant gaps between state-of-the-art agentic systems and human performance in temporal perception, graphical understanding, and 3D spatial reasoning — dimensions largely absent from existing evaluations.
Sources: HuggingFace Daily Papers (#9, 9 upvotes) Why Trending: Benchmark saturation is a real problem; GauntletBench targets capabilities current evals miss — temporal perception, graphical understanding, 3D reasoning.
10. LISA: Likelihood Score Alignment for Visual-Condition Controllable Generation
Authors: Yanghao Wang, Hongxu Chen, Jiazhen Liu, Zhenqi He, Rui Liu, Zhen Wang arXiv: arxiv.org/abs/2606.27192
Summary: The dual-branch paradigm (a side network encoding visual conditions, fused into a frozen main network) underlies ControlNet and many successors. LISA revisits this through score-based generative modeling, revealing that side branches contribute likelihood scores rather than arbitrary features. This insight leads to a likelihood score alignment regularization that improves training efficiency without architectural changes to the controllable generation pipeline.
Sources: HuggingFace Daily Papers (#10, 8 upvotes) Why Trending: Provides a principled theoretical justification for ControlNet-style architectures and derives a practical training efficiency improvement from it.
11. GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents
Authors: Xiao Zhou, Siyue Zhang, Yilun Zhao, Jinbiao Wei, Tingyu Song, Arman Cohan arXiv: arxiv.org/abs/2606.24551
Summary: Computer-use agents can interact via graphical GUI or programmatic CLI, but existing benchmarks conflate interface modality with task difficulty, initial state, and verifier differences. This paper introduces a matched benchmark of 440 desktop tasks across 18 applications and 12 workflow categories where GUI and CLI agents face identical goals, states, and verifiers — isolating the execution layer as the sole variable. The results reveal clear modality-specific bottlenecks invisible in prior evaluations.
Sources: HuggingFace Daily Papers (#11, 7 upvotes) Why Trending: Cleanly decouples interface modality from task difficulty in computer-use agent evaluation — a methodological contribution the field has needed.
12. Confidence-Aware Tool Orchestration for Robust Video Understanding
Authors: Yangfan He, Yujin Choi, Jaehong Yoon arXiv: arxiv.org/abs/2606.26904
Summary: Video reasoning models implicitly trust every frame equally — a “Blind Trust Problem” that causes 15–30% accuracy drops under realistic perturbations like motion blur or occlusion. Robust-TO integrates per-frame trustworthiness estimation into an agentic video understanding pipeline, using calibrated evidence weighting and reliability-aware reasoning to maintain accuracy when visual input quality varies.
Sources: HuggingFace Daily Papers (#12, 6 upvotes) Why Trending: Addresses a concrete safety/robustness gap: frontier video models suffer 15–30% accuracy drops under realistic visual perturbations while remaining oblivious to degraded input.
13. In-Context World Modeling for Robotic Control (ICWM)
Authors: Siyin Wang, Junhao Shi, Senyu Fei, Zhaoyang Fu, Li Ji, Jingjing Gong arXiv: arxiv.org/abs/2606.26025
Summary: Vision-Language-Action (VLA) models typically condition only on current observations and language, implicitly assuming a fixed execution context. ICWM treats the underlying system configuration (camera viewpoint, robot morphology) as a latent variable to be inferred from self-generated interactions, enabling adaptation to novel setups without any parameter updates. System identification is reframed as an in-context adaptation problem solved at inference time.
Sources: HuggingFace Daily Papers (#13, 5 upvotes) Why Trending: Zero-shot adaptation to new robot configurations without fine-tuning — a key practical barrier for VLA deployment.
14. CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies
Authors: Issa Sugiura, Daichi Hattori, Kazuo Araragi, Keita Ogawa, Shota Onose, Taro Makino arXiv: arxiv.org/abs/2606.16613
Summary: Unlike benchmarks where a single agent interacts with a passive environment, economic systems require autonomous agents to communicate, negotiate, and transact over extended time horizons. CoffeeBench places LLM agents as firms in a 90-day heterogeneous multi-agent economy, evaluating their ability to maximize profits through strategic interaction. The benchmark exposes model-specific differences in communication patterns and long-horizon planning invisible in shorter evaluations.
Sources: HuggingFace Daily Papers (#14, 4 upvotes) Why Trending: Multi-agent economic simulation is an underexplored evaluation setting; 90-day firm simulation reveals sharp model differences not visible on standard benchmarks.
15. PhysiFormer: Learning to Simulate Mechanics in World Space
Authors: Yiming Chen, Yushi Lan, Andrea Vedaldi arXiv: arxiv.org/abs/2606.27364
Summary: PhysiFormer is a diffusion transformer for physically-plausible 3D object motion simulation that represents objects as 3D meshes in world coordinates rather than view-dependent pixel space. Given initial vertex positions, velocities, and material type (rigid or elastic), the model samples physically consistent future trajectories without explicit inductive biases or ad-hoc latent spaces, generalizing to complex materials and geometries beyond training distribution.
Sources: HuggingFace Daily Papers (#15, 2 upvotes) Why Trending: Novel approach to neural physics simulation that avoids view-dependent pixel space and ad-hoc latent spaces, generalizing to complex materials.
16. Discretizing Reward Models
Authors: Vijay Viswanathan, Shiqi Wang, Devamanyu Hazarika, Chirag Nagpal, Tongshuang Wu, Graham Neubig arXiv: arxiv.org/abs/2606.21795
Summary: Continuous reward models suffer from oversensitivity: they assign meaningfully different scores to responses that are equally good, leading to noisy policy gradients and poor RL training. This paper demonstrates that reward models often score equivalent responses inconsistently, then shows that discretizing reward model outputs mitigates this oversensitivity while preserving discriminative ability between genuinely different-quality responses.
Sources: HuggingFace Daily Papers (#16, 2 upvotes) Why Trending: Reward model oversensitivity is an underappreciated failure mode in RLHF; discretization as a simple fix has broad practical implications.
17. Information-Aware KV Cache Compression for Long Reasoning (InfoKV)
Authors: Jushi Kai, Zhuiri Xiao, Alexandra Birch, Zhouhan Lin arXiv: arxiv.org/abs/2606.26875
Summary: As LLM reasoning chains grow longer, KV cache size in both prefilling and decoding becomes a bottleneck. Existing compression methods estimate token importance via attention weights, which captures contextual relevance but ignores predictive uncertainty and token informativeness. InfoKV incorporates entropy-aware, information-theoretic signals alongside attention weights for more accurate importance estimation, improving long-context reasoning accuracy under aggressive cache budgets.
Sources: HuggingFace Daily Papers (#17, 1 upvote) Why Trending: KV cache compression is critical for long-context reasoning; InfoKV goes beyond attention weights with information-theoretic signals.
18. EO-WM: A Physically Informed World Model for Probabilistic Earth Observation Forecasting
Authors: Junwei Luo, Shuai Yuan, Zhenya Yang, Yansheng Li, Zhe Liu, Hengshuang Zhao arXiv: arxiv.org/abs/2606.27277
Summary: Earth Observation forecasting predicts future land-surface dynamics from satellite imagery under changing weather conditions — a partially observed, weather-driven world modeling problem. EO-WM applies a video diffusion transformer with physically informed conditioning frameworks that treat weather as a signal while explicitly modeling forecasting uncertainty from sparse observations and unobserved land states. It outperforms deterministic baselines in capturing weather-driven uncertainty in land-surface dynamics.
Sources: HuggingFace Daily Papers (#18, 1 upvote) Why Trending: Novel application of video diffusion transformers to satellite image forecasting with physically grounded uncertainty modeling.
19. Hallucination in World Models is Predictable and Preventable
Authors: Nicklas Hansen, Xiaolong Wang arXiv: arxiv.org/abs/2606.27326
Summary: Generative world models produce visually fluent rollouts that drift from ground-truth dynamics — hallucinations concentrated in low-coverage state-action regions. This paper introduces MMBench2, a 427-hour, 210-task dataset for visual world modeling with ground-truth actions, and shows that hallucination is predictable via lightweight data-centric signals (coverage-aware sampling) and preventable through targeted mitigation during training or inference.
Sources: HuggingFace Daily Papers (#19, 1 upvote) Why Trending: Identifies a concrete, actionable failure mode in generative world models and provides a large benchmark (MMBench2) to study it.
20. When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models
Authors: Josef Chen arXiv: arxiv.org/abs/2606.27288
Summary: Multi-model LLM systems (routing, voting, cascades, mixture-of-agents) are used to beat single-model accuracy, but their gain is capped by the co-failure rate: the fraction of queries where every model in the ensemble is simultaneously wrong. This paper proves that accuracy cannot exceed 1 − β (where β is the co-failure rate) for any policy whose output is a member model answer, across 67 frontier models. Standard diagnostics like pairwise error correlation cannot identify this ceiling, making it a commonly missed constraint in system design.
Sources: HuggingFace Daily Papers (#20, 1 upvote) Why Trending: Elegant theoretical result that bounds the maximum accuracy gain from any multi-model ensemble strategy — a useful tool for anyone building routing or MoA systems.
