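What problem is the Daily AI Paper Picks page meant to solve?

It organizes related AI resources into pages that can be browsed, cited, and read by AI agents. The page currently lists 50 core entries.

How can AI123Box pages be cited?

For factual content, cite the original source first, and cite the AI123Box page as context for classification, summaries, and multilingual navigation.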

Daily AI Paper Picks

Selected AI papers from Hugging Face Daily Papers and arXiv, with Chinese summaries, PDF links, code links, and community signals.

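Qwen-AgentWorld: A Language World Model for General-Purpose Agents
Builds a language world model that simulates agent environments across domains, improving general-purpose agents on downstream tasks.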

2606.24597 · ▲ 139 · Code
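Are We Ready for Agent-Native Memory Systems?
Systematically evaluates LLM agent memory systems across multiple modules and workloads, revealing performance characteristics and trade-offs.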

2606.24775 · ▲ 108 · Code
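Wan-Streamer v0.1: An End-to-End Real-Time Interactive Foundation Model
Processes vision, audio, and text in a unified way through causal attention, enabling real-time audio-visual interaction.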

2606.25041 · ▲ 102
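PlanBench-XL: Evaluating Long-Horizon Planning in Large-Scale Tool Ecosystems
Builds PlanBench-XL to evaluate the long-horizon planning and adaptation abilities of LLM tool agents under limited visibility and dynamic perturbations.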

2606.22388 · ▲ 95 · Code
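An Agent Benchmark for Real Office Sessions
Builds an enterprise agent benchmark of 852 reproducible tasks, using multidimensional metrics instead of a single performance score.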

2606.23654 · ▲ 79 · Code
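OpenRath: Session-Centric Runtime State for Agent Systems
Uses Session as the core abstraction, supporting explicit fork, merge, and replay across multiple agents while recording complete execution state.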

2606.19409 · ▲ 76 · Code
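Grouped Query Experts: MoE in GQA Self-Attention
GQE selects which query heads to activate based on token content, improving Transformer efficiency while retaining the key-value cache advantages of GQA.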

2606.20945 · ▲ 75
DataClaw0: Multimodal Data Customization from Raw Streams
Trains DataClaw_0-9B with SFT and GRPO, achieving robust multimodal alignment on a new benchmark.

2606.21337 · ▲ 73 · Code
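DanceOPD: On-Policy Generative Field Distillation
DanceOPD unifies text-to-image generation, local editing, and global editing in a flow-matching model through capability routing and velocity training.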

2606.27377 · ▲ 73
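DomainShuttle: Free Open-Domain Subject-Driven Text-to-Video Generation
Uses domain-aware modeling and dual RoPE to enable open-domain subject-driven text-to-video generation that is high-fidelity and flexible both in-domain and cross-domain.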

2606.26058 · ▲ 65 · Code
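NatureBench: Can Coding Agents Reach Nature-Family SOTA?
Builds a benchmark of 90 scientific tasks drawn from Nature papers and finds that current coding agents mostly rely on translating existing methods and lack genuine innovation.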

2606.24530 · ▲ 61 · Code
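In-Context World Modeling for Robot Control
ICWM performs in-context system identification through self-generated interactions, adapting to new configurations without parameter updates.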

2606.26025 · ▲ 57
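World Action Models: A Survey
Surveys world action models, analyzing how they predict future states to support decision-making and the trade-offs in representation and computation.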

2606.20781 · ▲ 54 · Code
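The EDV Paradigm for Escaping the Self-Confirmation Trap
EDV builds reliable experience through an execute-distill-verify process across multiple heterogeneous agents, reducing self-confirmation errors in LLM agents.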

2606.24428 · ▲ 51
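OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning
OPID extracts dense hindsight supervision from complete trajectories, improving the training efficiency and performance of language agents.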

2606.26790 · ▲ 49 · Code
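KaLM-Reranker-V1: Fast Reranking with Compressed Documents
Decouples query and passage computation, combining Matryoshka pooling with cross-attention to model relevance efficiently.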

2606.22807 · ▲ 48
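Qwen-Image-Agent: Bridging the Context Gap in Image Generation
Builds up generation context step by step through planning, reasoning, search, and memory, bridging the context gap in text-to-image generation.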

2606.26907 · ▲ 46
OpenThoughts-Agent: Data Recipes for Agentic Models
An open-source data curation pipeline for training agentic language models is presented, demonstrating superior performance through systematic experimentation and scalable training data.

2606.24855 · ▲ 46
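ShutterMuse: MLLM-Based Photography Guidance at Capture Time
Builds a photography-assistance benchmark and dataset, and uses a unified multimodal model to provide composition guidance and pose recommendations at capture time.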

2606.25763 · ▲ 45 · Code
MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent
Actively manages context through ConAct, retaining key information in long-sequence mobile GUI tasks.

2606.19926 · ▲ 42 · Code
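MobileForge: Annotation-Free Adaptation of Mobile GUI Agents
Combines real app interaction with policy optimization guided by hierarchical feedback to adapt mobile GUI agents efficiently without annotations.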

2606.19930 · ▲ 42 · Code
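Unlimited OCR Works
Proposes reference sliding-window attention, transcribing multi-page OCR efficiently in a single forward pass and eliminating GPU memory growth.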

2606.23050 · ▲ 41 · Code
PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation
PhysisForcing enhances embodied video generation by enforcing physical consistency through pixel-level trajectory alignment and semantic-level relational alignment losses in a DiT-based framework.

2606.28128 · ▲ 39
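Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence
Surveys code generation and reasoning from visual inputs, organizing methods into 4 scenario categories and proposing a verification-centric direction.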

2606.15932 · ▲ 38 · Code
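Text-Aligned Visual Quantized Representations at Any Resolution
ViQ uses visual quantization to balance semantics and detail, supporting efficient multimodal training at native resolution.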

2606.27313 · ▲ 38 · Code
Formalizing Latent Thoughts: Four Axioms of Thought Representation in LLMs
An axiomatic evaluation framework reveals systematic failures in latent thought representations of LLMs across multiple reasoning tasks, demonstrating that current representations fail to satisfy fundamental functional axioms consistently across different model architectures.

2606.27378 · ▲ 36 · Code
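MVTrack4Gen: 4D Video Generation Supervised by Multi-View Point Tracking
Strengthens a motion-aware diffusion model with multi-view point-tracking supervision, improving geometric consistency and motion fidelity in 4D video generation.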

2606.26087 · ▲ 35 · Code
Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots
Human manipulation skills are transferred to robots more effectively by using a bridging action representation based on relative wrist translation in the initial head-camera frame, combined with a vision-language-action model that handles embodiment differences through interleaved action tokens and attention masking.

2606.28133 · ▲ 33
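AOHP: An Open-Source OS-Level Agent Framework
AOHP treats AI agents as first-class entities in Android, using dedicated mechanisms to raise task completion rates and lower execution cost.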

2606.23449 · ▲ 32 · Code
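JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
JetSpec combines efficient forward drafting with causal conditioning, improving LLM inference speed and acceptance rate on multiple benchmarks.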

2606.18394 · ▲ 32 · Code
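EvoEmbedding: Evolvable Representations for Long-Context Retrieval and Agent Memory
EvoEmbedding generates adaptive representations by continually updating a latent memory, improving long-context retrieval performance.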

2606.21649 · ▲ 32 · Code
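V-Zero: On-Policy Visual Reasoning Distillation Without Answer Labels
Performs label-free on-policy distillation with contrastive evidence gating, improving fine-grained visual reasoning and speeding up training.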

2606.25319 · ▲ 27 · Code
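Fast LeWorldModel
Replaces autoregressive rollout with parallel action-prefix prediction, speeding up visual planning and reducing the cost and latency of long-horizon prediction.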

2606.26217 · ▲ 25 · Code
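UnityShots: Memory-Driven Multi-Shot Audio-Video Generation
Generates multi-shot audio-video with fixed-length long- and short-term memory slots, boundary gating, and a switching prior, keeping subject appearance and audio consistent.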

2606.21661 · ▲ 24 · Code
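BioMatrix: A Biological Foundation Model Across Sequence, Structure, and Language
BioMatrix fuses sequence, structure, and natural language in a unified decoder-only architecture, supporting many kinds of biological tasks.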

2606.22138 · ▲ 24 · Code
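Deeper Is Not Always Better: Confident-Layer Decoding to Mitigate the Alignment Tax
Dynamically selects a reliable intermediate layer to decode from via entropy-guided search, improving reasoning performance at very low overhead.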

2606.21906 · ▲ 24 · Code
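EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies
Builds the simulation benchmark EBench for multi-task, multi-dimensional evaluation of generalist mobile manipulation policies, revealing differences in capability and generalization among SOTA models.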

2606.18239 · ▲ 15 · Code
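Tmax: A Simple Recipe for Terminal Agents
Proposes a simplified RL training recipe and an expanded dataset, improving terminal agent performance with fewer parameters.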

2606.23321 · ▲ 14 · Code
SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning
SingGuard is a policy-adaptive multimodal guardrail system that evaluates safety in real-time conversations by dynamically applying natural-language rules through fast-to-slow reasoning modes.

2606.22873 · ▲ 11 · Code
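Causal Discovery in the Agent Era
Builds a platform that combines data analysis with expert knowledge and shows that language models are best used to provide contextual support and explanation.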

2606.23608 · ▲ 7 · Code
Agentic Hardware Design as Repository-Level Code Evolution
We present HORIZON, a self-evolving agent framework that treats hardware design as repository-level code evolution. A Markdown harness is compiled into a project pack containing domain knowledge, an executable evaluator, an acceptance predicate, and a git/runtime policy; a hands-free agent loop then evolves an isolated git worktree, using repository operations for state management, tracing, and replay. This extends prior work on repository-scale self-evolution from EDA software systems to hardware-design artifacts themselves. We evaluate our approach on ChipBench, RTLLM, Verilog-Eval, and nine CVDP categories, achieving 100% benchmark completion across all suites with a fully hands-free agentic loop. However, we do not claim that agentic AI for hardware design is solved: these benchmarks are controlled proxies for a much broader engineering problem in chip design. The discussion section examines the limitations of the current study and highlights open research challenges.

2606.28279v1
Towards Automating Scientific Review with Google's Paper Assistant Tool
Artificial intelligence is driving a revolution in scientific discovery, accelerating everything from hypothesis generation to mathematical theorem proving. However, this rapid acceleration is creating a systemic challenge: traditional human peer review cannot scale to match the influx of AI-assisted science. Ultimately, to resolve this tension, we must also deploy AI to accelerate the verification and review process itself. To frame the discussion around this transition, we propose a taxonomy consisting of four progressive levels of AI-human collaboration in scientific evaluation, and discuss various trade-offs involved with each. As a step toward this future, we introduce the Paper Assistant Tool (PAT), an agentic AI framework built for deep scientific review and verification. PAT ingests full scientific manuscripts and produces a comprehensive evaluation, checking theoretical results, validating experiments, suggesting improvements, and identifying potential flaws. By utilizing inference scaling techniques, PAT is able to identify deeper issues than a single model call alone, achieving a 34% improvement over zero-shot recall on mathematical errors in the SPOT benchmark. Pilot deployments of PAT as a pre-submission tool for authors at two major Computer Science conferences -- STOC and ICML -- demonstrate its ability to identify critical errors and suggest substantive improvements to research papers. By catching errors early, PAT eases the cognitive burden placed on referees, while preserving their control over the outcomes of the review process.

2606.28277v1
Parameter Efficient Hybrid Transformer (PEHT) for Network Traffic Prediction via Dynamic Urban Congestion Integration
Accurate network traffic prediction is a critical element for efficient resource allocation in dynamic urban cellular networks. However, prediction remains challenging because network demand is influenced by complex mobility patterns, congestion dynamics, and heterogeneous user behavior. This paper introduces the Parameter-Efficient Hybrid Transformer (PEHT), a network traffic prediction framework that integrates urban mobility and congestion information into a Transformer-based architecture. PEHT separates primary network communication features from secondary urban mobility features and incorporates Low-Rank Adaptation (LoRA) into the Transformer encoder to reduce the number of trainable parameters while maintaining high predictive accuracy. A multimodal fusion strategy then injects external mobility and congestion features into the decoder to improve traffic forecasting. Experiments on the Telecom Italia Milan dataset and multiple synthetic congestion scenarios show that PEHT outperforms state-of-the-art baselines in terms of RMSE, MAE, and $R^2$. The implementation is available in the GitHub repository.

2606.28274v1
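
The parameter saving that the PEHT abstract attributes to LoRA can be illustrated with a quick count. The sketch below is a generic LoRA calculation, not PEHT's actual configuration: the layer size (512 x 512) and the rank (8) are assumed values chosen only for illustration.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix W is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters under LoRA: W stays frozen and only the
    low-rank factors B (d_out x rank) and A (rank x d_in) are trained."""
    return rank * (d_in + d_out)


# Assumed, illustrative sizes (not taken from the paper).
d_in, d_out, rank = 512, 512, 8

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
ratio = lora / full

print(f"full fine-tuning: {full} parameters")
print(f"LoRA (rank {rank}): {lora} parameters ({ratio:.2%} of full)")
```

With these assumed sizes, the low-rank factors hold about 3% of the parameters of the full matrix; the saving grows as the layer gets wider relative to the rank.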
Agent-Native Immune System: Architecture, Taxonomy, and Engineering
The transition from static chat bots to autonomous agents--equipped with persistent memory, tool-use protocols, and multi-agent collaboration--has fundamentally expanded the AI threat landscape. Current defense mechanisms, such as perimeter security and training-time alignment, remain external to the agent's active reasoning loop. Consequently, they fall short: a fully aligned agent remains highly vulnerable to runtime hijacking via memory poisoning, tool-chain manipulation, or multi-agent protocol attacks. To address this critical gap, we introduce the Agent-Native Immune System (ANIS), the first biologically inspired, endogenous defense architecture embedded directly within the agent's cognitive loop. Our framework presents four primary contributions. First, we design a six-layer Immune Tower (L0-L5), distinctly incorporating Barrier Immunity (L1) as a non-cognitive, physical-and-logical isolation layer. Second, we establish a unified taxonomy of Agent Viruses and Agent Vaccines, formalizing the critical distinction between superficial non-parametric defenses and robust parametric vaccines. Third, we conceptualize the Harness Triad--Meta, Self, and Auto--a self-monitoring, meta-cognitive automation backbone that drives Continual Immune Learning (CIL), enabling vaccines to dynamically adapt to novel threats. Finally, we establish a rigorous theoretical demarcation between model alignment and agent immunity: while alignment provides a static "constitutional" value foundation during training, ANIS serves as the dynamic "law enforcement" mechanism during runtime. We conclude by framing open challenges for the field, including immune protocol standardization, novel evaluation metrics such as the Autoimmunity Rate (false-positive intervention rate), and the co-evolutionary dynamics between pathogens and vaccines within collective intelligence ecosystems.

2606.28270v1
Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models
Vision-language models must reconcile visual evidence with memorized world knowledge when the two conflict. How they resolve this conflict shapes the reliability of multimodal systems, yet prior work characterizes it behaviorally without a component-level causal account. We combine activation patching across three granularities (residual stream, attention heads, and MLP sublayers) with model-component ablation studies and mechanistic analysis. Across three VLM families, we find that visual grounding emerges by default, whereas prior grounding depends on a small set of causally necessary attention heads (2.5-4.8%) concentrated in the second half of the network. These heads enable answers from stored world knowledge (e.g., "red" for a strawberry) despite conflicting visual input. Ablating them flips predictions from knowledge-grounded to visually grounded answers in 68-96% of cases under prior-knowledge prompts, but changes only 0.8-7.5% of visually grounded predictions, establishing an asymmetric causal structure. The identified heads decompose into routing heads, which modulate information flow, and writing heads, which directly project answer tokens into the residual stream. This structure is consistent across model families and scales, revealing a sparse causal circuit underlying perception-knowledge conflict in VLMs.

2606.28273v1
HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech
Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting emotional expressiveness. While preference-driven optimization offers a promising alternative, existing approaches suffer from two structural mismatches: information conflict, where content and emotion in a shared latent space produce conflicting gradients, leading to reward hacking and semantic degradation; and scale gap, where sparse sentence-level rewards struggle to guide dense frame-level generation. To overcome these challenges, we propose HPRO, a hierarchical progressive reward optimization framework. Within HPRO, we introduce the HD-Emo codec as a novel differentiable reward model to resolve the information conflict. It extracts speech into distinct content and style preference tokens, structurally isolating emotional optimization from semantic content. Building upon this structured preference space, HPRO bridges the scale gap by progressively aligning frame-, word- and sentence-level objectives. Experiments demonstrate that HPRO significantly enhances emotional expressiveness, while effectively preserving linguistic intelligibility. The code and audio samples are publicly available at https://xxh333.github.io/hpro-demo/.

2606.28249v1
From Tokens to States: LLMs as a Special Case of World Models and the Continuous Path Beyond
The AI community has framed the relationship between large language models (LLMs) and world models as a dichotomy: LLMs predict tokens; world models simulate reality. Yann LeCun argues in 2022 that reaching general intelligence requires abandoning autoregressive token prediction in favour of latent-space architectures. This framing is unnecessarily binary. Two claims will be defended. First, LLMs are a degenerate special case of world models: the state space is the set of all token sequences, the only action is appending one token, and world models are therefore a strict generalisation of LLMs, not a replacement. Second, there is a natural continuous spectrum from NTP to JEPA, with multi-token prediction, future-summary prediction, and next-latent prediction as intermediate stations already populated by current research. Moving along this spectrum relaxes the LLM constraints one by one. It also progressively surrenders the two practical advantages that make LLMs trainable at scale: internet-scale self-supervised data, and a transformer architecture co-designed for discrete token prediction. Both are examined as open research questions: the data question (the cliff from self-supervised text to instrumented action-labelled environments) and the architecture question (whether the transformer generalises to continuous-state prediction, or whether a new primitive is needed).

2606.28127v1
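
The first claim in the abstract above (the state is a token sequence and the only action is appending one token) can be written down as a tiny transition function. This is a minimal sketch of that framing, not code from the paper; `toy_policy` and its three-token vocabulary are hypothetical stand-ins for a real next-token predictor.

```python
from typing import Callable, Tuple

State = Tuple[str, ...]  # a state is the full token sequence so far


def llm_transition(state: State, token: str) -> State:
    """Transition for the LLM special case: the only action is to append one token."""
    return state + (token,)


def rollout(state: State, policy: Callable[[State], str], steps: int) -> State:
    """Apply the policy repeatedly, feeding each new state back in."""
    for _ in range(steps):
        state = llm_transition(state, policy(state))
    return state


def toy_policy(state: State) -> str:
    # Hypothetical stand-in for a next-token predictor: cycles through a tiny vocabulary.
    vocab = ("a", "b", "c")
    return vocab[len(state) % len(vocab)]


final = rollout(("<s>",), toy_policy, steps=3)
print(final)
```

A more general world model would replace `State` with a richer (possibly continuous) state type and allow actions other than appending a token.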
Mechanism-Driven Monitors for Preemptive Detection of LLM Training Instability
Frontier large language model training consumes massive accelerator fleets and long wall-clock computation, making stability failures costly when they occur. After a numerical or a hyperparameter fault has already destabilized the training dynamics, it may continue for thousands of steps while loss and gradient norms still appear normal. We study mechanism-driven detection of training instability by deriving internal monitors from the functional role of each critical module and from the earliest computational sites where failures are expected to produce measurable signatures. For low-precision flash attention, we monitor the spectral entropy of a QK bilinear decomposition, whose first-order term becomes abnormal before the loss fully collapses. For MoE routers, we derive indicators from their role in expert selection. Our fault-injection experiments on low-precision attention, large learning-rate, and combined faults show that these signals provide distinct signatures for different failures, triggering thousands of steps before loss divergence.

2606.28116v1
PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception
We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.

2606.28322v1
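
The contrast the PerceptionRubrics abstract draws between a linear average and gated scoring can be sketched as follows. The exact scoring rule is not given in the abstract, so the gate used here (any failed Must-Right rubric zeroes the score, otherwise the Easy-Wrong pass rate is returned) is an assumption for illustration, as are the function names.

```python
from typing import List


def linear_score(must_right: List[int], easy_wrong: List[int]) -> float:
    """Plain average over all rubrics (1 = pass, 0 = fail)."""
    results = must_right + easy_wrong
    return sum(results) / len(results)


def gated_score(must_right: List[int], easy_wrong: List[int]) -> float:
    """Assumed gate: failing any mandatory rubric zeroes the score;
    otherwise the score is the pass rate on the fine-grained rubrics."""
    if not all(must_right):
        return 0.0
    if not easy_wrong:
        return 1.0
    return sum(easy_wrong) / len(easy_wrong)


# One mandatory fact is wrong: the average still looks decent, the gate does not.
failed_linear = linear_score([1, 0, 1], [1, 1, 1, 0])
failed_gated = gated_score([1, 0, 1], [1, 1, 1, 0])

# All mandatory facts are right: the gate passes through the detail score.
passed_gated = gated_score([1, 1, 1], [1, 1, 1, 0])

print(failed_linear, failed_gated, passed_gated)
```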
StructSplat: Generalizable 3D Gaussian Splatting from Uncalibrated Sparse Views
We present StructSplat, a feed-forward and generalizable 3D Gaussian reconstruction framework that operates directly on uncalibrated images without requiring camera parameters. Existing methods either rely on per-scene optimization or assume known camera poses, and often entangle geometry and appearance within a unified backbone, limiting reconstruction fidelity and generalization. Our key idea is to adopt a structured representation that organizes geometry, semantic, and texture cues with explicit roles in the reconstruction process. Specifically, we introduce a pixel-aligned feature injection mechanism to enable accurate texture modeling from 2D observations, incorporate semantic-aware priors to improve global consistency, and design a camera alignment strategy to prevent information leakage and improve generalization. Experiments show that our method significantly outperforms prior approaches on challenging benchmarks. On DL3DV, our method achieves 28.045 PSNR, surpassing AnySplat (22.377) by +5.67 dB. In cross-dataset evaluation, our method achieves +1.94 dB over AnySplat on ACID and +1.72 dB on RealEstate10K. Project page: https://structsplat.github.io Code: https://github.com/J-C-Zhao/StructSplat

2606.28321v1
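
To put the DL3DV numbers quoted in the StructSplat abstract in perspective, a PSNR gain can be converted into a mean-squared-error ratio. The sketch below uses the standard definition PSNR = 10 * log10(MAX^2 / MSE) and assumes both methods are scored with the same peak value MAX; the function name is illustrative.

```python
def psnr_gain_to_mse_ratio(gain_db: float) -> float:
    """With a shared peak value, a PSNR gain of g dB means the
    mean squared error shrinks by a factor of 10 ** (g / 10)."""
    return 10 ** (gain_db / 10)


# PSNR values quoted in the abstract for DL3DV.
gain = 28.045 - 22.377
ratio = psnr_gain_to_mse_ratio(gain)

print(f"gain: {gain:.2f} dB, MSE lower by a factor of about {ratio:.2f}")
```

Under that assumption, the reported gain of about 5.67 dB corresponds to roughly 3.7 times lower mean squared error.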