Curated picks from Hugging Face Daily Papers and arXiv AI papers, with Chinese summaries, PDFs, code, and community signals.
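Builds a language world model that simulates agent environments across domains, improving general agents' performance on downstream tasks.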
2606.24597 · ▲ 139 · Code
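Systematically evaluates LLM agent memory systems across multiple modules and workloads, revealing their performance characteristics and trade-offs.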
2606.24775 · ▲ 108 · Code
Processes vision, audio, and text in a unified way through causal attention, enabling real-time audio-visual interaction.
2606.25041 · ▲ 102
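Builds PlanBench-XL to evaluate the long-horizon planning and adaptation abilities of LLM tool agents under limited visibility and dynamic perturbations.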
2606.22388 · ▲ 95 · Code
Builds an enterprise agent benchmark of 852 reproducible tasks, replacing a single performance score with multi-dimensional metrics.
2606.23654 · ▲ 79 · Code
Uses Session as the core abstraction, supporting explicit fork, merge, and replay for multiple agents while recording full execution state.
2606.19409 · ▲ 76 · Code
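GQE selects which query heads to activate based on token content, improving Transformer efficiency while retaining the key-value cache advantages of GQA.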
2606.20945 · ▲ 75
Trains DataClaw_0-9B with SFT and GRPO, achieving robust multimodal alignment on a new benchmark.
2606.21337 · ▲ 73 · Code
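DanceOPD unifies text-to-image generation, local editing, and global editing in flow-matching models through capability routing and velocity training.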
2606.27377 · ▲ 73
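Enables open-domain subject-driven text-to-video generation through domain-aware modeling and dual RoPE, with high fidelity and flexibility both in-domain and cross-domain.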
2606.26058 · ▲ 65 · Code
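Builds a benchmark of 90 scientific tasks drawn from Nature papers, finding that current coding agents mostly rely on translating existing methods and lack genuine innovation.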
2606.24530 · ▲ 61 · Code
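ICWM performs in-context system identification through self-generated interactions, adapting to new configurations without parameter updates.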
2606.26025 · ▲ 57
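Surveys world action models, analyzing how they predict future states to support decision-making and the trade-offs in representation and computation.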
2606.20781 · ▲ 54 · Code
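EDV builds reliable experience through an execute-distill-verify process across multiple heterogeneous agents, reducing self-confirmation errors in LLM agents.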
2606.24428 · ▲ 51
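OPID extracts dense hindsight supervision from complete trajectories, improving the training efficiency and performance of language agents.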
2606.26790 · ▲ 49 · Code
Decouples query and passage computation, combining Matryoshka pooling with cross-attention to model relevance efficiently.
2606.22807 · ▲ 48
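Progressively builds the generation context through planning, reasoning, search, and memory, bridging the context gap in text-to-image generation.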
2606.26907 · ▲ 46
An open-source data curation pipeline for training agentic language models is presented, demonstrating superior performance through systematic experimentation and scalable training data.
2606.24855 · ▲ 46
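Builds a photography-assistance benchmark and dataset, unifying multimodal models to provide composition guidance and pose recommendations at capture time.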
2606.25763 · ▲ 45 · Code
Actively manages context with ConAct, preserving key information in long-sequence mobile GUI tasks.
2606.19926 · ▲ 42 · Code
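Combines real app interaction with hierarchical feedback-guided policy optimization, enabling efficient annotation-free adaptation of mobile GUI agents.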
2606.19930 · ▲ 42 · Code
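Proposes reference sliding-window attention, efficiently transcribing multi-page OCR in a single forward pass and eliminating GPU memory growth.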
2606.23050 · ▲ 41 · Code
PhysisForcing enhances embodied video generation by enforcing physical consistency through pixel-level trajectory alignment and semantic-level relational alignment losses in a DiT-based framework.
2606.28128 · ▲ 39
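Surveys code generation and reasoning from visual input, organizing methods into 4 scenario categories and proposing a verification-centric direction.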
2606.15932 · ▲ 38 · Code
ViQ balances semantics and detail through visual quantization, supporting efficient multimodal training at native resolution.
2606.27313 · ▲ 38 · Code
An axiomatic evaluation framework reveals systematic failures in latent thought representations of LLMs across multiple reasoning tasks, demonstrating that current representations fail to satisfy fundamental functional axioms consistently across different model architectures.
2606.27378 · ▲ 36 · Code
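Enhances a motion-aware diffusion model with multi-view point-tracking supervision, improving geometric consistency and motion fidelity in 4D video generation.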
2606.26087 · ▲ 35 · Code
Human manipulation skills are transferred to robots more effectively by using a bridging action representation based on relative wrist translation in the initial head-camera frame, combined with a vision-language-action model that handles embodiment differences through interleaved action tokens and attention masking.
2606.28133 · ▲ 33
AOHP treats AI agents as first-class entities in Android, using dedicated mechanisms to raise task completion rates and lower execution costs.
2606.23449 · ▲ 32 · Code
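JetSpec combines efficient forward drafting with causal conditioning, improving LLM inference speed and acceptance rate across multiple benchmarks.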
2606.18394 · ▲ 32 · Code
EvoEmbedding generates adaptive representations by continually updating latent memory, improving long-context retrieval performance.
2606.21649 · ▲ 32 · Code
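Uses contrastive evidence gating for label-free on-policy distillation, improving fine-grained visual reasoning and accelerating training.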
2606.25319 · ▲ 27 · Code
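Replaces autoregressive rollout with parallel action-prefix prediction, accelerating visual planning and reducing the cost and latency of long-horizon prediction.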
2606.26217 · ▲ 25 · Code
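Generates multi-shot audio-video with fixed-length long- and short-term memory slots, boundary gating, and a switching prior, keeping subject appearance and audio consistent.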
2606.21661 · ▲ 24 · Code
BioMatrix fuses sequence, structure, and natural language in a unified decoder-only architecture, supporting multiple types of biological tasks.
2606.22138 · ▲ 24 · Code
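Dynamically selects reliable intermediate layers for decoding through entropy-guided search, improving reasoning performance with very low overhead.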
2606.21906 · ▲ 24 · Code
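Builds the simulation benchmark EBench for multi-task, multi-dimensional evaluation of general mobile manipulation policies, revealing differences in capability and generalization among SOTA models.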
2606.18239 · ▲ 15 · Code
Proposes a simplified RL training recipe and an expanded dataset, improving terminal agent performance with fewer parameters.
2606.23321 · ▲ 14 · Code
SingGuard is a policy-adaptive multimodal guardrail system that evaluates safety in real-time conversations by dynamically applying natural-language rules through fast-to-slow reasoning modes.
2606.22873 · ▲ 11 · Code
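Builds a platform that combines data analysis with expert knowledge, showing that language models are best used to provide contextual support and explanation.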
2606.23608 · ▲ 7 · Code
We present HORIZON, a self-evolving agent framework that treats hardware design as repository-level code evolution. A Markdown harness is compiled into a project pack containing domain knowledge, an executable evaluator, an acceptance predicate, and a git/runtime policy; a hands-free agent loop then evolves an isolated git worktree, using repository operations for state management, tracing, and replay. This extends prior work on repository-scale self-evolution from EDA software systems to hardware-design artifacts themselves. We evaluate our approach on ChipBench, RTLLM, Verilog-Eval, and nine CVDP categories, achieving 100% benchmark completion across all suites with a fully hands-free agentic loop. However, we do not claim that agentic AI for hardware design is solved: these benchmarks are controlled proxies for a much broader engineering problem in chip design. The discussion section examines the limitations of the current study and highlights open research challenges.
2606.28279v1
Artificial intelligence is driving a revolution in scientific discovery, accelerating everything from hypothesis generation to mathematical theorem proving. However, this rapid acceleration is creating a systemic challenge: traditional human peer review cannot scale to match the influx of AI-assisted science. Ultimately, to resolve this tension, we must also deploy AI to accelerate the verification and review process itself. To frame the discussion around this transition, we propose a taxonomy consisting of four progressive levels of AI-human collaboration in scientific evaluation, and discuss various trade-offs involved with each. As a step toward this future, we introduce the Paper Assistant Tool (PAT), an agentic AI framework built for deep scientific review and verification. PAT ingests full scientific manuscripts and produces a comprehensive evaluation, checking theoretical results, validating experiments, suggesting improvements, and identifying potential flaws. By utilizing inference scaling techniques, PAT is able to identify deeper issues than a single model call alone, achieving a 34% improvement over zero-shot recall on mathematical errors in the SPOT benchmark. Pilot deployments of PAT as a pre-submission tool for authors at two major Computer Science conferences -- STOC and ICML -- demonstrate its ability to identify critical errors and suggest substantive improvements to research papers. By catching errors early, PAT eases the cognitive burden placed on referees, while preserving their control over the outcomes of the review process.
2606.28277v1
Accurate network traffic prediction is a critical element for efficient resource allocation in dynamic urban cellular networks. However, prediction remains challenging because network demand is influenced by complex mobility patterns, congestion dynamics, and heterogeneous user behavior. This paper introduces the Parameter-Efficient Hybrid Transformer (PEHT), a network traffic prediction framework that integrates urban mobility and congestion information into a Transformer-based architecture. PEHT separates primary network communication features from secondary urban mobility features and incorporates Low-Rank Adaptation (LoRA) into the Transformer encoder to reduce the number of trainable parameters while maintaining high predictive accuracy. A multimodal fusion strategy then injects external mobility and congestion features into the decoder to improve traffic forecasting. Experiments on the Telecom Italia Milan dataset and multiple synthetic congestion scenarios show that PEHT outperforms state-of-the-art baselines in terms of RMSE, MAE, and $R^2$. The implementation is available in the GitHub repository.
2606.28274v1
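As a rough illustration of why adding Low-Rank Adaptation (LoRA) to an encoder reduces the number of trainable parameters, the sketch below compares a full weight update with a rank-r update for a single weight matrix. The dimensions and rank are hypothetical and are not taken from PEHT; the function name is also illustrative.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates every entry of the d_out x d_in matrix.
    LoRA freezes that matrix and trains two low-rank factors instead:
    B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Hypothetical sizes, chosen only for illustration (not PEHT's configuration).
full, lora = lora_param_counts(d_out=512, d_in=512, rank=8)
print(f"full={full} lora={lora} ratio={lora / full:.4f}")
```

With these assumed sizes the low-rank factors hold about 3% of the parameters of the full matrix; the actual savings depend on the model's dimensions and the chosen rank.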
The transition from static chatbots to autonomous agents, equipped with persistent memory, tool-use protocols, and multi-agent collaboration, has fundamentally expanded the AI threat landscape. Current defense mechanisms, such as perimeter security and training-time alignment, remain external to the agent's active reasoning loop. Consequently, they fall short: a fully aligned agent remains highly vulnerable to runtime hijacking via memory poisoning, tool-chain manipulation, or multi-agent protocol attacks. To address this critical gap, we introduce the Agent-Native Immune System (ANIS), the first biologically inspired, endogenous defense architecture embedded directly within the agent's cognitive loop. Our framework presents four primary contributions. First, we design a six-layer Immune Tower (L0-L5), distinctly incorporating Barrier Immunity (L1) as a non-cognitive, physical-and-logical isolation layer. Second, we establish a unified taxonomy of Agent Viruses and Agent Vaccines, formalizing the critical distinction between superficial non-parametric defenses and robust parametric vaccines. Third, we conceptualize the Harness Triad (Meta, Self, and Auto), a self-monitoring, meta-cognitive automation backbone that drives Continual Immune Learning (CIL), enabling vaccines to dynamically adapt to novel threats. Finally, we establish a rigorous theoretical demarcation between model alignment and agent immunity: while alignment provides a static "constitutional" value foundation during training, ANIS serves as the dynamic "law enforcement" mechanism during runtime. We conclude by framing open challenges for the field, including immune protocol standardization, novel evaluation metrics such as the Autoimmunity Rate (false-positive intervention rate), and the co-evolutionary dynamics between pathogens and vaccines within collective intelligence ecosystems.
2606.28270v1
Vision-language models must reconcile visual evidence with memorized world knowledge when the two conflict. How they resolve this conflict shapes the reliability of multimodal systems, yet prior work characterizes it behaviorally without a component-level causal account. We combine activation patching across three granularities (residual stream, attention heads, and MLP sublayers) with model-component ablation studies and mechanistic analysis. Across three VLM families, we find that visual grounding emerges by default, whereas prior grounding depends on a small set of causally necessary attention heads (2.5-4.8%) concentrated in the second half of the network. These heads enable answers from stored world knowledge (e.g., "red" for a strawberry) despite conflicting visual input. Ablating them flips predictions from knowledge-grounded to visually grounded answers in 68-96% of cases under prior-knowledge prompts, but changes only 0.8-7.5% of visually grounded predictions, establishing an asymmetric causal structure. The identified heads decompose into routing heads, which modulate information flow, and writing heads, which directly project answer tokens into the residual stream. This structure is consistent across model families and scales, revealing a sparse causal circuit underlying perception-knowledge conflict in VLMs.
2606.28273v1
Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting emotional expressiveness. While preference-driven optimization offers a promising alternative, existing approaches suffer from two structural mismatches: information conflict, where content and emotion in a shared latent space produce conflicting gradients, leading to reward hacking and semantic degradation; and scale gap, where sparse sentence-level rewards struggle to guide dense frame-level generation. To overcome these challenges, we propose HPRO, a hierarchical progressive reward optimization framework. Within HPRO, we introduce the HD-Emo codec as a novel differentiable reward model to resolve the information conflict. It extracts speech into distinct content and style preference tokens, structurally isolating emotional optimization from semantic content. Building upon this structured preference space, HPRO bridges the scale gap by progressively aligning frame-, word- and sentence-level objectives. Experiments demonstrate that HPRO significantly enhances emotional expressiveness, while effectively preserving linguistic intelligibility. The code and audio samples are publicly available at https://xxh333.github.io/hpro-demo/.
2606.28249v1
The AI community has framed the relationship between large language models (LLMs) and world models as a dichotomy: LLMs predict tokens; world models simulate reality. Yann LeCun argued in 2022 that reaching general intelligence requires abandoning autoregressive token prediction in favour of latent-space architectures. This framing is unnecessarily binary. Two claims will be defended. First, LLMs are a degenerate special case of world models: the state space is the set of all token sequences, the only action is appending one token, and world models are therefore a strict generalisation of LLMs, not a replacement. Second, there is a natural continuous spectrum from next-token prediction (NTP) to JEPA, with multi-token prediction, future-summary prediction, and next-latent prediction as intermediate stations already populated by current research. Moving along this spectrum relaxes the LLM constraints one by one. It also progressively surrenders the two practical advantages that make LLMs trainable at scale: internet-scale self-supervised data, and a transformer architecture co-designed for discrete token prediction. Both are examined as open research questions: the data question (the cliff from self-supervised text to instrumented action-labelled environments) and the architecture question (whether the transformer generalises to continuous-state prediction, or whether a new primitive is needed).
2606.28127v1
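The first claim in the abstract above (an LLM as a degenerate world model whose state is a token sequence and whose only action is appending one token) can be sketched in a few lines. This is a toy illustration under that reading, not code from the paper; `llm_transition`, `rollout`, and `toy_policy` are hypothetical names, and the policy stands in for a real next-token sampler.

```python
from typing import Callable

# A state is the full token sequence so far; an action is one token to append.
State = tuple[str, ...]


def llm_transition(state: State, action: str) -> State:
    """Deterministic transition: the next state is the old sequence plus the token."""
    return state + (action,)


def rollout(policy: Callable[[State], str], state: State, steps: int) -> State:
    """Roll the 'world model' forward by repeatedly choosing and appending a token."""
    for _ in range(steps):
        state = llm_transition(state, policy(state))
    return state


VOCAB = ("a", "b", "c")


def toy_policy(state: State) -> str:
    """Stand-in for next-token prediction: picks a token from the sequence length."""
    return VOCAB[len(state) % len(VOCAB)]


print(rollout(toy_policy, ("the",), steps=2))
```

In this view a more general world model would replace the append-only transition with an arbitrary state-update function and allow a richer action set.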
Frontier large language model training consumes massive accelerator fleets and long wall-clock computation, making stability failures costly when they occur. After a numerical or hyperparameter fault has already destabilized the training dynamics, training may continue for thousands of steps while loss and gradient norms still appear normal. We study mechanism-driven detection of training instability by deriving internal monitors from the functional role of each critical module and from the earliest computational sites where failures are expected to produce measurable signatures. For low-precision flash attention, we monitor the spectral entropy of a QK bilinear decomposition, whose first-order term becomes abnormal before the loss fully collapses. For MoE routers, we derive indicators from their role in expert selection. Our fault-injection experiments on low-precision attention, large learning rates, and combined faults show that these signals provide distinct signatures for different failures, triggering thousands of steps before loss divergence.
2606.28116v1
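The abstract above does not define its spectral-entropy monitor, so the following is only a sketch of one common definition: the Shannon entropy of a singular-value spectrum normalized to sum to one. The function name and the assumption that singular values are supplied directly (for example from an SVD of a query-key product) are illustrative, not the paper's method.

```python
import math
from typing import Sequence


def spectral_entropy(singular_values: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a normalized singular-value spectrum.

    Assumed definition: p_i = s_i / sum(s), H = -sum(p_i * log(p_i)).
    A flat spectrum gives the maximum log(n); a rank-one spectrum gives 0.
    """
    if any(s < 0 for s in singular_values):
        raise ValueError("singular values must be non-negative")
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("spectrum must contain a positive value")
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * math.log(p) for p in probs)


# A flat spectrum versus one dominated by a single direction.
print(spectral_entropy([1.0, 1.0, 1.0, 1.0]))
print(spectral_entropy([10.0, 0.1, 0.1, 0.1]))
```

Under this definition a sudden drop in entropy would indicate that the spectrum is collapsing onto a few directions, which is the kind of internal signal a monitor could track alongside the loss.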
We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.
2606.28322v1
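The contrast between gated scoring and a linear average described for PerceptionRubrics can be sketched as follows. The abstract only states that failing mandatory visual facts triggers a sharp binary penalty; the exact formula is not given, so the rule below (score zero if any Must-Right rubric fails, otherwise the Easy-Wrong pass rate) is an assumption, and the function names are hypothetical.

```python
def gated_score(must_right: list[bool], easy_wrong: list[bool]) -> float:
    """Assumed gating rule: any failed Must-Right rubric zeroes the score."""
    if not all(must_right):
        return 0.0
    if not easy_wrong:
        return 1.0
    return sum(easy_wrong) / len(easy_wrong)


def linear_score(must_right: list[bool], easy_wrong: list[bool]) -> float:
    """Baseline for comparison: plain average over every rubric."""
    checks = must_right + easy_wrong
    return sum(checks) / len(checks)


# One mandatory fact is wrong, but every fine-grained detail is right.
must = [True, False]
easy = [True, True]
print(linear_score(must, easy), gated_score(must, easy))
```

In this example the linear average still awards 0.75, while the gated score drops to 0.0, which illustrates how gating penalizes captions that get details right but miss an essential fact.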
We present StructSplat, a feed-forward and generalizable 3D Gaussian reconstruction framework that operates directly on uncalibrated images without requiring camera parameters. Existing methods either rely on per-scene optimization or assume known camera poses, and often entangle geometry and appearance within a unified backbone, limiting reconstruction fidelity and generalization. Our key idea is to adopt a structured representation that organizes geometry, semantic, and texture cues with explicit roles in the reconstruction process. Specifically, we introduce a pixel-aligned feature injection mechanism to enable accurate texture modeling from 2D observations, incorporate semantic-aware priors to improve global consistency, and design a camera alignment strategy to prevent information leakage and improve generalization. Experiments show that our method significantly outperforms prior approaches on challenging benchmarks. On DL3DV, our method achieves 28.045 PSNR, surpassing AnySplat (22.377) by +5.67 dB. In cross-dataset evaluation, our method achieves +1.94 dB over AnySplat on ACID and +1.72 dB on RealEstate10K. Project page: https://structsplat.github.io Code: https://github.com/J-C-Zhao/StructSplat
2606.28321v1
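To put the StructSplat PSNR figures in context, the worked arithmetic below uses the standard PSNR definition. The conversion to an error ratio assumes both methods are scored against the same peak value; because benchmark PSNR is usually averaged over images, the ratio should be read as an average (geometric-mean) effect rather than a per-image guarantee.

```latex
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)
\quad\Longrightarrow\quad
\Delta\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MSE}_{\text{AnySplat}}}{\mathrm{MSE}_{\text{ours}}}\right)

\Delta\mathrm{PSNR} = 28.045 - 22.377 = 5.668\ \mathrm{dB}
\quad\Longrightarrow\quad
\frac{\mathrm{MSE}_{\text{AnySplat}}}{\mathrm{MSE}_{\text{ours}}} = 10^{0.5668} \approx 3.69
```

The reported +5.67 dB on DL3DV is this 5.668 dB difference rounded to two decimals, corresponding to roughly 3.7 times lower mean squared error under the stated assumptions.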