Daily AI Papers — April 22, 2026

Published: April 22, 2026

1. Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

Authors: Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao, Taihang Hu et al.
arXiv: arxiv.org/abs/2604.19748
Summary: A commercial-scale virtual try-on system from Taobao/Tmall that holds up under extreme poses, lighting, and motion blur while preserving fine garment texture and material. Supports up to 6 reference images across 8 fashion categories with near real-time inference, deployed at industrial scale to millions of users.
Sources: HuggingFace Daily Papers (226 upvotes — runaway #1), HF dataset (TaobaoTmall-AlgorithmProducts/Tstars-VTON)
Why trending: A 226-upvote spike, far ahead of #2 (80). A fully deployed production virtual try-on system with a public benchmark is rare, and it appeals to both the research and e-commerce communities.

2. CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Authors: Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo, Meiguang Jin et al.
arXiv: arxiv.org/abs/2604.19636
Summary: End-to-end DiT-based framework for human-object interaction video synthesis conditioned on person, product, text, and speech. Introduces a Human-Aware Mixture-of-Experts that routes tokens to region-specialized experts (hands, faces) plus a structural co-generation module to avoid hand-object interpenetration.
Sources: HuggingFace Daily Papers (80 upvotes)
Why trending: HOI video synthesis with physical plausibility is a long-standing pain point for ad/e-commerce video gen; the Human-Aware MoE design is broadly applicable.

3. AgentSPEX: An Agent SPecification and EXecution Language

Authors: Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu et al.
arXiv: arxiv.org/abs/2604.13346
Summary: A domain-specific language and runtime for LLM agent workflows with explicit control flow, typed steps, branching, loops, parallelism, and reusable modules — decoupling agent logic from Python and addressing the maintainability issues of LangGraph/DSPy/CrewAI.
Sources: HuggingFace Daily Papers (53 upvotes)
Why trending: Agent orchestration fatigue is real — a clean, language-level abstraction over Python-glued frameworks resonates with practitioners shipping agents to production.

4. SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

Authors: Ying Zeng, Miaosen Luo, Guangyuan Li, Yang Yang, Ruiyang Fan et al.
arXiv: arxiv.org/abs/2604.19587
Summary: Reformulates photo editing as a tightly coupled reasoning-to-generation process: an Image Critic module identifies quality deficiencies, then a Photographic Artist module performs targeted enhancements — eliminating the need for users to articulate aesthetic intent.
Sources: HuggingFace Daily Papers (41 upvotes)
Why trending: “Auto-editing” is the natural next step after instruction-based image editing, and removing the prompt-engineering burden is highly user-relevant.

5. AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

Authors: Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai et al.
arXiv: arxiv.org/abs/2604.19747
Summary: Scalable sparse-view 3D reconstruction that conditions a video diffusion model on arbitrarily many unordered input views via a persistent global scene memory (capture view cache) plus geometry-aware conditioning, removing temporal compression to maintain frame-level correspondence.
Sources: HuggingFace Daily Papers (35 upvotes)
Why trending: Video-diffusion-as-3D-reconstructor is the dominant new paradigm; supporting arbitrary view counts (vs. fixed 1-2) is a meaningful generalization.

6. TEMPO: Scaling Test-time Training for Large Reasoning Models

Authors: Qingyang Zhang, Xinke Kong, Haitao Wu, Qinghua Hu, Minghao Wu et al.
arXiv: arxiv.org/abs/2604.19295
Summary: A test-time training (TTT) framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on labeled data, formalized via Expectation-Maximization. Prevents the diversity collapse and reward drift that plague existing TTT methods, so gains continue as test-time compute increases.
Sources: HuggingFace Daily Papers (28 upvotes)
Why trending: TTT for reasoning LLMs is an active post-o1 research direction; an EM-grounded fix for the plateau in TTT gains targets one of its main open problems.

7. PlayCoder: Making LLM-Generated GUI Code Playable

Authors: Zhiyuan Peng, Wei Tao, Xin Yin, Chenhao Ying, Yuan Luo et al.
arXiv: arxiv.org/abs/2604.19742
Summary: Introduces PlayEval, a repository-aware benchmark of 43 multilingual GUI applications (Python/TS/JS) covering interactive games and event-driven UIs, plus PlayCoder — a multi-agent framework that improves functional correctness via iterative repair on actual interaction flows rather than pass/fail tests.
Sources: HuggingFace Daily Papers (24 upvotes)
Why trending: Code-gen evals have long under-tested interactive systems; benchmarking actual playability is a step-change for evaluating coding agents.

8. ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning

Authors: Xianming Li, Zongxi Li, Tsz-fung Andrew Lee, Jing Li, Haoran Xie et al.
arXiv: arxiv.org/abs/2604.19254
Summary: A centralized PEFT framework that performs layer-level refinement through a depth-shared shadow module, evolving a parallel shadow state at each transformer layer — replacing LoRA’s distributed weight perturbations with a shared multi-layer adaptation pathway.
Sources: HuggingFace Daily Papers (23 upvotes)
Why trending: First serious architectural alternative to LoRA’s local low-rank perturbation in a while; depth-shared shadow paths are an interesting structural prior for PEFT.

9. Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Authors: Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao et al.
arXiv: arxiv.org/abs/2604.19667
Summary: A benchmark for generating executable visual workflows directly from natural language, drawn from real-world industrial deployments, plus a robust agentic framework to mitigate recurrent execution errors. Reveals significant gaps between current LLMs and industrial-grade workflow automation.
Sources: HuggingFace Daily Papers (16 upvotes)
Why trending: Visual workflow builders (n8n, Dify, ComfyUI-style) are everywhere in industry — automating their construction is high-leverage.

10. AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

Authors: Wentao Shi, Yu Wang, Yuyang Zhao, Yuxin Chen, Fuli Feng et al.
arXiv: arxiv.org/abs/2604.18240
Summary: A 155-task / 516-trajectory benchmark across search, data systems, and GUI domains for evaluating Agent-as-a-Judge — judges that interact with environments and tools to gather verifiable evidence rather than relying on rule-based or LLM-as-a-Judge scoring.
Sources: HuggingFace Daily Papers (14 upvotes)
Why trending: As RL agent training scales, evaluation reliability becomes the bottleneck — Agent-as-a-Judge is the natural next step beyond LLM-as-a-Judge.

11. SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Authors: Shanshan Zhong, Yi Lu, Jingjie Ning, Yibing Wan, Lihan Feng et al.
arXiv: arxiv.org/abs/2604.20087
Summary: First benchmark for continual skill learning in LLM agents, comprising 20 verified skill-dependent tasks across 15 sub-domains, evaluated at three levels (skill quality, execution trajectory, task outcome). Finds that continual gains depend more on task structure and feedback than on model scale.
Sources: HuggingFace Daily Papers (11 upvotes)
Why trending: Skills (Anthropic-style or otherwise) are the de facto agent customization layer in 2026 — measuring how to learn them automatically is overdue.

12. CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

Authors: Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng et al.
arXiv: arxiv.org/abs/2604.19741
Summary: A video generative model for navigable city-scale environments, grounded in geo-registered data corpora to maintain physical consistency under varying weather and dynamic objects. Trained on temporally unaligned data to teach the model to generate plausible motion across real geography.
Sources: HuggingFace Daily Papers (11 upvotes)
Why trending: City-scale navigable video gen is the convergence point of world models, AV simulation, and 3D scene reconstruction — a flagship use case for 2026.

13. Understanding and Enforcing Weight Disentanglement in Task Arithmetic

Authors: Shangge Liu, Yuehan Yin, Lei Wang, Qi Fan, Yinghuan Shi et al.
arXiv: arxiv.org/abs/2604.17078
Summary: Introduces Task-Feature Specialization (TFS) — a model’s ability to allocate distinct internal features to different tasks — as the fundamental cause of weight disentanglement in task arithmetic. Proves TFS is sufficient and proposes OrthoReg to enforce it via orthogonal weight updates during fine-tuning.
Sources: HuggingFace Daily Papers (11 upvotes)
Why trending: Task arithmetic / model merging is a heavily-used but theoretically shaky tool — a principled explanation plus a regularizer is impactful.
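For readers new to the setting, here is a minimal sketch of plain task arithmetic on toy weight vectors. It is not the paper's OrthoReg method: the helper names (`task_vector`, `merge`, `cosine`) and the toy numbers are assumptions made for illustration, and the cosine check is only a rough proxy for how much two task vectors overlap.

```python
import math

def task_vector(finetuned, base):
    """Task vector: fine-tuned weights minus the shared pretrained weights."""
    return [f - b for f, b in zip(finetuned, base)]

def merge(base, task_vectors, alphas):
    """Task arithmetic: add scaled task vectors back onto the base weights."""
    merged = list(base)
    for tau, alpha in zip(task_vectors, alphas):
        merged = [m + alpha * t for m, t in zip(merged, tau)]
    return merged

def cosine(u, v):
    """Cosine similarity between two task vectors (0.0 means orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy weights (hypothetical): each fine-tuned model changes different coordinates.
base = [1.0, 1.0, 1.0, 1.0]
theta_a = [2.0, 3.0, 1.0, 1.0]
theta_b = [1.0, 1.0, 4.0, 0.0]

tau_a = task_vector(theta_a, base)
tau_b = task_vector(theta_b, base)
merged = merge(base, [tau_a, tau_b], [1.0, 1.0])
overlap = cosine(tau_a, tau_b)
```

In this toy case the two task vectors touch disjoint coordinates, so their cosine similarity is zero and the merged weights keep both edits intact. Real models rarely separate this cleanly, which is the gap a disentanglement regularizer aims to close.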

14. Dual-View Training for Instruction-Following Information Retrieval

Authors: Qingcheng Zeng, Puxuan Yu, Aman Mehta, Fuheng Zhao, Rajhans Samdani
arXiv: arxiv.org/abs/2604.18845
Summary: A dual-view data synthesis strategy based on polarity reversal: given a query, an instruction-relevant document, and a hard negative, an LLM generates a complementary instruction under which the two documents swap relevance. Trains retrievers to distinguish topical relevance from instruction compliance.
Sources: HuggingFace Daily Papers (10 upvotes)
Why trending: Most retrievers ignore explicit instructions; polarity-reversal data synthesis is a clean, generalizable trick for instruction-following IR.
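The polarity-reversal idea can be sketched as a small data-construction step. This is an illustrative sketch only: the `Example` record, the `dual_view` helper, and the sample strings are hypothetical, and the complementary instruction is passed in as a plain string where the paper would obtain it from an LLM.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Example:
    """One contrastive training example for an instruction-aware retriever."""
    query: str
    instruction: str
    positive: str
    negative: str

def dual_view(query, instruction, relevant_doc, hard_negative, reversed_instruction):
    """Return the original view plus the view where the documents swap roles."""
    original = Example(query, instruction, relevant_doc, hard_negative)
    # Under the complementary instruction the former hard negative becomes
    # the positive, and the former positive becomes the negative.
    flipped = Example(query, reversed_instruction, hard_negative, relevant_doc)
    return [original, flipped]

# Hypothetical sample; in practice reversed_instruction would be LLM-generated.
views = dual_view(
    query="python sorting",
    instruction="Prefer official documentation.",
    relevant_doc="docs: sorted() reference",
    hard_negative="forum: how do I sort a list?",
    reversed_instruction="Prefer community discussion threads.",
)
```

Because both views share the same query and the same two documents, topical relevance cannot separate them; only the instruction determines which document counts as relevant.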

15. Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

Authors: Qingcheng Zeng, Yuheng Lu, Zeqi Zhou, Heli Qi, Puxuan Yu et al.
arXiv: arxiv.org/abs/2604.17632
Summary: Introduces CSR-L, a human-annotated benchmark of mixed-language queries, and shows that code-switching is a fundamental performance bottleneck across statistical, dense, and late-interaction retrievers — even strong multilingual models suffer due to embedding-space divergence between pure and code-switched text.
Sources: HuggingFace Daily Papers (10 upvotes)
Why trending: Code-switching is ubiquitous in global user queries but systematically under-evaluated; a clean benchmark + diagnosis is overdue.

16. UniMesh: Unifying 3D Mesh Understanding and Generation

Authors: Peng Huang, Yifeng Chen, Zeyu Zhang, Hao Tang
arXiv: arxiv.org/abs/2604.17472
Summary: A unified framework for 3D mesh tasks featuring a Mesh Head that bridges diffusion-based image generation with implicit shape decoders, a Chain of Mesh module for iterative semantic editing, and a self-reflection mechanism for error correction.
Sources: HuggingFace Daily Papers (9 upvotes)
Why trending: Mesh-native unified models are still rare compared to point-cloud or image-based 3D — Chain of Mesh is a nice take on iterative refinement.

17. Speculative Decoding for Autoregressive Video Generation

Authors: Yuezhou Hu, Jintao Zhang
arXiv: arxiv.org/abs/2604.17397
Summary: SDVG adapts speculative decoding to block-based autoregressive video diffusion by replacing token verification with an image-quality router: a 1.3B drafter proposes blocks via four denoising steps, scored by ImageReward with worst-frame aggregation, achieving significant speedup with maintained visual quality.
Sources: HuggingFace Daily Papers (8 upvotes)
Why trending: Streaming video generation is becoming compute-critical; carrying over the LLM-world’s speculative-decoding wins to video diffusion is high-impact.
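A rough sketch of a quality router with worst-frame aggregation follows. The accept-or-regenerate rule, the `threshold` value, and the function names are assumptions for illustration; the frame scorer is a stand-in callable, not the ImageReward API, and the fallback to a larger model mirrors standard speculative decoding rather than details confirmed by the summary.

```python
def worst_frame_score(frame_scores):
    """Aggregate per-frame quality by taking the minimum (worst frame)."""
    return min(frame_scores)

def route_block(draft_frames, score_frame, regenerate, threshold):
    """Accept the drafted block if its worst frame clears the threshold.

    Returns (frames, accepted). When the draft is rejected, `regenerate`
    is called to produce the block with the slower model instead.
    """
    scores = [score_frame(frame) for frame in draft_frames]
    if worst_frame_score(scores) >= threshold:
        return draft_frames, True
    return regenerate(), False

# Toy usage: each "frame" is just its own quality score (hypothetical values).
identity_scorer = lambda frame: frame
slow_model = lambda: ["target-frame-0", "target-frame-1", "target-frame-2"]

good_block, accepted_good = route_block([0.9, 0.8, 0.95], identity_scorer, slow_model, 0.5)
bad_block, accepted_bad = route_block([0.9, 0.1, 0.95], identity_scorer, slow_model, 0.5)
```

Taking the minimum rather than the mean means a single poor frame is enough to reject a block, which matches the intent of not letting one bad frame hide behind good neighbors.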

18. Target-Oriented Pretraining Data Selection via Neuron-Activated Graph

Authors: Zijun Wang, Haoqin Tu, Weidong Zhou, Yiyang Zhou, Xiaohuan Zhou et al.
arXiv: arxiv.org/abs/2604.15706
Summary: NAG-based Ranking is a training-free, interpretable framework for target-oriented pretraining data selection. Each input is characterized by a sparse set of high-impact neurons in any off-the-shelf LLM, ranked by Neuron-Activated Graph similarity to target examples — outperforming representation-based selection across 6 benchmarks.
Sources: HuggingFace Daily Papers (8 upvotes)
Why trending: Pretraining data selection is one of the highest-leverage research areas; neuron-graph ranking is interpretable and training-free, which is rare.

19. UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Authors: Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Chengyuan Wang et al.
arXiv: arxiv.org/abs/2604.18518
Summary: First framework integrating Uniform Discrete Diffusion with RL via GRPO. Treats the final clean sample as the action for stable optimization signals, reconstructs trajectories via the diffusion forward process, and adds Reduced-Step + CFG-Free strategies — achieving SOTA on text-to-image and OCR.
Sources: HuggingFace Daily Papers (5 upvotes)
Why trending: GRPO is dominating LLM RL; porting it to discrete diffusion (a hot generative paradigm) opens a new line of work.
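As background, the group-relative advantage at the core of GRPO can be sketched in a few lines. This shows only the standard normalization (reward minus group mean, divided by group standard deviation); the function name and the `eps` constant are assumptions, and none of the UDM-specific pieces (clean-sample actions, trajectory reconstruction, Reduced-Step or CFG-Free strategies) are modeled here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps guards against division by zero when all rewards are identical.
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group of four samples for one prompt (hypothetical rewards).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat = group_relative_advantages([0.5, 0.5])
```

Samples that beat the group average get positive advantages and the rest get negative ones, so no separate value network is needed. A group with identical rewards yields all-zero advantages and contributes no gradient signal.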

20. RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models

Authors: Yusuf Çelebi, Yağız Asker, Özay Ezerceli, Mahmoud ElHussieni, Selva Taş et al.
arXiv: arxiv.org/abs/2604.19321
Summary: Models hidden-state evolution as a high-dimensional geometric trajectory and uses the Ramer-Douglas-Peucker algorithm — a parameter-free polygon simplification — to identify critical breakpoint layers for LoRA adaptation, outperforming full and random layer-selection baselines.
Sources: HuggingFace Daily Papers (4 upvotes)
Why trending: Layer-selection for PEFT is usually heuristic; a geometric, training-free criterion grounded in trajectory simplification is a clean idea.
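The Ramer-Douglas-Peucker step can be sketched on a toy trajectory of per-layer hidden states. This is the classic recursive formulation, which takes a tolerance `epsilon`; how the paper sets or avoids that tolerance is not stated in the summary, so the value below, the function names, and the toy points are all assumptions. Distances are measured to the chord segment between the two endpoints.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b in any dimension."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(x * x for x in ab)
    if denom == 0.0:
        return math.sqrt(sum(x * x for x in ap))
    # Project p onto the segment and clamp the projection to its endpoints.
    t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * x for ai, x in zip(a, ab)]
    return math.dist(p, closest)

def rdp_indices(points, epsilon):
    """Indices of the points kept by Ramer-Douglas-Peucker simplification."""
    def recurse(lo, hi):
        best_d, best_i = 0.0, None
        for i in range(lo + 1, hi):
            d = point_segment_distance(points[i], points[lo], points[hi])
            if d > best_d:
                best_d, best_i = d, i
        if best_i is None or best_d <= epsilon:
            return [lo]  # everything between lo and hi is close to the chord
        return recurse(lo, best_i) + recurse(best_i, hi)

    if len(points) < 2:
        return list(range(len(points)))
    return recurse(0, len(points) - 1) + [len(points) - 1]

# Toy 2-D "hidden-state trajectory" with one sharp turn at layer index 2.
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0), (2.0, 3.0)]
kept = rdp_indices(trajectory, epsilon=0.5)
breakpoint_layers = kept[1:-1]  # interior indices only
```

On this toy path the only interior index kept is the turn at layer 2; under the paper's framing, layers like this would be the candidates for LoRA adaptation.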

Alireza Shamsoshoara
