Daily AI Papers — June 12, 2026

11 minute read

Published:

1. MiniMax Sparse Attention

Authors: Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen et al. (MiniMax)
arXiv: arxiv.org/abs/2606.13392

Ultra-long context (100K–1M tokens) is increasingly required for agentic workflows and repository-scale reasoning, but softmax attention’s quadratic cost makes this intractable at deployment scale. MiniMax Sparse Attention (MSA) introduces a blockwise sparse attention mechanism that achieves near-full-attention quality while reducing computation to nearly linear, enabling MiniMax’s frontier models to serve million-token contexts efficiently.

Sources: HuggingFace Daily Papers, arXiv
Why trending: From a frontier lab (MiniMax), directly addresses the long-context scaling wall — practical breakthrough with immediate production relevance.


2. MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Authors: Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang et al. (MiniMax)
arXiv: arxiv.org/abs/2606.13473

MaxProof trains three proof-oriented capabilities — generation, verification, and critique-conditioned repair — using a defense-in-depth generative verifier engineered for low false-positive rate, then merges them into a single model. At test time, population-level scaling runs diverse proof candidates in parallel and selects via the verifier, achieving competition-level mathematical proof in the MiniMax-M3 series.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Part of MiniMax’s M3 release; combines RL, verification, and novel test-time scaling for formal math — directly competes with DeepSeek-Prover and Lean4 workflows.


3. EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Authors: Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan
arXiv: arxiv.org/abs/2606.13681

Nearly all LLM agent benchmarks assume static environments, but real deployments require agents to continuously update their knowledge, skills, and behavior as conditions change. EvoArena is a new benchmark that tracks memory evolution across dynamically shifting tasks, stress-testing whether agents can detect change, update correctly, and avoid stale-knowledge errors.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Fills a critical gap in agent evaluation; dynamic-environment robustness is a known blind spot of current SOTA agents.


4. SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Authors: Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su
arXiv: arxiv.org/abs/2606.13673

Tool-augmented VLMs for spatial reasoning are bottlenecked not by the perception modules themselves but by the action interface through which tools are invoked. SpatialClaw redesigns this interface to enable richer, more expressive tool invocations, yielding substantial gains on 3D object localization, relation understanding, and motion prediction tasks.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Novel angle on VLM+tool-use — shifts focus from model capacity to interface design, with strong empirical results.


5. EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

Authors: Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao
arXiv: arxiv.org/abs/2606.13662

As LLM capabilities improve, the bottleneck for autonomous scientific discovery is shifting from model intelligence to environment quality — how well the execution environment supports hypothesis testing, validation, and iteration. EurekAgent formalizes “agent environment engineering” and demonstrates that well-engineered environments unlock discovery performance that surpasses human-designed approaches.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Provocative thesis that environment design > model capability for AI science; directly relevant to lab automation and AI-for-science initiatives.


6. InterleaveThinker: Reinforcing Agentic Interleaved Generation

Authors: Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng
arXiv: arxiv.org/abs/2606.13679

Current unified multimodal models (UMMs) are constrained to single-image generation/editing and cannot produce interleaved text-image sequences — a crucial capability for visual narratives, step-by-step guidance, and embodied manipulation. InterleaveThinker applies reinforcement learning to explicitly train interleaved generation, achieving strong results even on open-source UMMs that previously lacked this capability.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Unlocks a missing capability in open UMMs; interleaved generation is the next frontier for multimodal models beyond GPT-4o.


7. FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

Authors: Jia Deng, Yimeng Chen, Xiaoqing Xiang, Ziyang Zeng
arXiv: arxiv.org/abs/2606.12087

Existing deep search training datasets suffer from “shortcut collapse” — questions that appear hard based on graph structure can be answered via cheaper identifying routes, so agents never learn to actually search. FORT-Searcher formalizes shortcut awareness and synthesizes tasks specifically designed to resist these shortcuts, forcing genuine multi-step retrieval and reasoning.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Addresses a fundamental data quality problem in training search agents; directly improves the robustness of agentic web browsing systems.


8. LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Authors: Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu
arXiv: arxiv.org/abs/2606.13578

While AI can read literature, generate hypotheses, and plan protocols, the physical execution of lab procedures still requires human operators. LabVLA adapts Vision-Language-Action models specifically to scientific laboratory settings, bridging the gap between AI reasoning and physical bench-work through domain-specific grounding and action primitives.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Lab automation is a billion-dollar bet by pharma/biotech; VLA models entering the wetlab is a meaningful milestone.


9. HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Authors: Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song
arXiv: arxiv.org/abs/2606.13289

HYDRA-X is the first unified multimodal model to handle image and video tokenization within a single Vision Transformer, eliminating the need for modality-specific encoders. This “holistic” tokenizer provides a shared representation space that improves both understanding and generation tasks across images and video simultaneously.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Architectural contribution to UMMs — a single ViT for all visual modalities is cleaner and more scalable than the current patchwork of encoders.


10. WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Authors: Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu
arXiv: arxiv.org/abs/2606.09426

Computer-use agents increasingly operate across GUI, CLI, code editors, browsers, and external tools, yet existing benchmarks treat these interfaces as separable. WeaveBench provides 114 long-horizon tasks across 8 real-world work domains requiring sustained cross-interface orchestration, exposing gaps in agents that perform well on single-interface benchmarks.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Timely as Anthropic’s Claude, OpenAI’s Operator, and similar systems push into computer use; this benchmark directly measures their weakest link.


11. Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Authors: Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu
arXiv: arxiv.org/abs/2606.13106

Latent chain-of-thought (replacing visible reasoning tokens with continuous hidden-state recurrence) is theoretically compelling but hard to optimize with standard on-policy RL and difficult to interpret. This paper shows that a single pair of explicit boundary tokens (entry/exit anchors) makes latent blocks compatible with on-policy RL and causally interpretable, unlocking switchable reasoning that can be toggled between latent and explicit modes.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Latent reasoning is a hot research direction (efficiency without CoT verbosity); making it compatible with GRPO/PPO is a practical unlock.


12. N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

Authors: Xukun Zhu, Hang Yu, Peng Di, Linchao Zhu
arXiv: arxiv.org/abs/2606.10768

GRPO-style policy optimization for LLM math reasoning suffers from redundant rollouts — token-level sampling generates paraphrases, not genuinely diverse solutions. N-GRPO mixes trajectory embeddings at the embedding level with semantically neighboring examples, producing richer, more diverse rollouts without disrupting semantic consistency, improving math benchmark performance.

Sources: HuggingFace Daily Papers, arXiv
Why trending: GRPO variants for math reasoning are the dominant approach post-DeepSeek-R1; any improvement to rollout diversity has broad applicability.


Authors: Zhuofan Shi, Mingzhe Ma, Lu Wang, Fangkai Yang
arXiv: arxiv.org/abs/2606.11662

Deep search agents face a dilemma: greedy best-first search extends weak paths, while undirected exploration wastes budget. TreeSeeker organizes the search process as a tree with structured backtracking — agents can commit to a path, detect dead-ends, and return to promising branches, mimicking human systematic search behavior.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Principled search structure for LLM agents; complements FORT-Searcher (training) by improving inference-time search strategy.


14. VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

Authors: Yuchen Xian, Yang He, Yunqiu Xu, Yi Yang
arXiv: arxiv.org/abs/2606.12243

Standard speculative decoding uses binary accept/reject decisions — any rejected draft token triggers full recomputation. VIA-SD observes that many rejected tokens are only marginally wrong and routes them through a lightweight intra-model verifier path rather than full recomputation, substantially reducing latency while maintaining output quality.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Speculative decoding is now standard in production LLM inference; a drop-in latency improvement with no quality cost is immediately deployable.


15. MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

Authors: Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu
arXiv: arxiv.org/abs/2606.13376

MoVerse creates a fully interactively navigable 3D world from a single narrow-FOV image in real time by scaffolding a video world model on Panoramic Gaussians, allowing free-roam exploration of environments that extend far beyond the initial observation. The system runs fast enough for interactive use, unlike prior diffusion-based world models.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Real-time interactive world models are a key capability for robotics simulation and gaming; the single-image input makes it practically accessible.


16. VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

Authors: Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany
arXiv: arxiv.org/abs/2606.13364

Obtaining 3D motion capture data is expensive and scarce, limiting motion diffusion models. VideoMDM uses a pretrained 2D-to-3D lifter to generate noisy 3D pseudo-labels from monocular video poses, then trains a diffusion model to denoise them — effectively learning 3D motion priors without any 3D ground truth.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Weak supervision for 3D motion from abundant 2D video is a major data bottleneck solution; has direct applications in animation, robotics, and gaming.


17. Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Authors: Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei
arXiv: arxiv.org/abs/2606.08063

Real-world visual corruptions (blur, noise, compression artifacts) significantly degrade MLLM performance, and existing fixes either lack interpretability or can’t restore pixel-level detail. Robust-U1 asks whether MLLMs can self-recover corrupted inputs before understanding them, using internal generation capabilities to reconstruct clean visual content as an intermediate step.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Self-healing perception is a novel framing; unified models (that both generate and understand) make this newly feasible.


18. HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness

Authors: Xiaoxuan Wang, Haixin Wang, Alexander Taylor, Jason Cong
arXiv: arxiv.org/abs/2606.12882

Agent performance is shaped not just by model capability and environment design, but also by the harness mediating agent-environment interaction — yet harnesses are almost universally hand-engineered. HarnessBridge learns a bidirectional controller that adapts the harness to both the agent’s generation style and the environment’s feedback, improving long-horizon task success rates.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Infrastructure-level contribution to agentic AI; learnable harnesses could compound improvements from better models and environments simultaneously.


19. EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Authors: Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng
arXiv: arxiv.org/abs/2606.13120

Static benchmarks like BrowseComp are vulnerable to test-set contamination — models can score well through parametric memorization rather than genuine retrieval. EvoBrowseComp continuously updates its questions to reflect evolving world knowledge, ensuring that high scores require real-time search and reasoning rather than recall.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Anti-contamination benchmark design is increasingly critical as web crawl training data and benchmarks overlap; directly relevant for evaluating production search agents.


20. Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

Authors: Malikeh Ehghaghi, Boglárka Ecsedi, Marsha Chechik, Colin Raffel
arXiv: arxiv.org/abs/2606.11409

Standard adversarial robustness evaluations report attack success rate (ASR) at a fixed query budget, ignoring that different attack strategies vary in compute cost by orders of magnitude. This paper introduces compute-aware robustness metrics that normalize ASR by attack cost, revealing that current rankings of model robustness can be significantly misleading.

Sources: HuggingFace Daily Papers, arXiv
Why trending: Safety-critical reframing of jailbreak evaluation methodology; Colin Raffel co-authorship adds visibility; directly challenges how robustness claims are made.