Hugging Face Daily Papers and arXiv AI research picks with summaries, PDFs, code links, and community signals.
A generative multi-agent world model is presented that uses simplex rotary agent encoding and sparse hub attention to enable scalable, permutation-symmetric interaction between multiple agents in interactive video generation.
2605.28816 · ▲ 407
Dynamic Variance-adaptive Advantage Optimization (DVAO) addresses training instability in multi-reward reinforcement learning by adaptively weighting objectives based on empirical reward variance, maintaining bounded advantage magnitudes and improving multi-objective performance.
2605.25604 · ▲ 132
Parallel Box Decoding enables efficient and accurate unified visual grounding and detection by decoding geometric elements as atomic units, improving both throughput and localization quality.
2605.27365 · ▲ 128
A lightweight and scalable agent safety alignment framework is proposed to address emerging threats from advanced AI models, featuring taxonomy-guided training with minimal samples and efficient deployment in real-world scenarios.
2605.29801 · ▲ 127
A unified vision-language-action model is presented that integrates diverse embodied decision-making tasks through a shared architecture and training approach, demonstrating strong performance across manipulation, navigation, and trajectory prediction with generalization across different robot platforms and environments.
2605.30280 · ▲ 107
WBench presents a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns with diverse scenarios and interaction types.
2605.25874 · ▲ 100 · Code
Proactive recommender systems using reinforcement learning face challenges with gradient estimation bias and variance, which are addressed through stepwise reward centering and position-specific advantage estimation mechanisms.
2605.28293 · ▲ 81 · Code
Agents using vision-language models with extended reasoning face challenges in tool utilization, which are addressed through AXPO, a method that improves performance by optimizing thinking prefixes and tool call resampling.
2605.28774 · ▲ 79
Generative UI models enable personal agents to synthesize dynamic interfaces with lightweight executable actions for enhanced interaction beyond text-only formats.
2605.24830 · ▲ 79
Autonomous agents are moving from tools into a layer of social infrastructure: they browse, purchase, deploy software, manage systems, and increasingly interact with one another. As these systems scale, the bottleneck shifts away from raw model capability toward coordination. Agents need to form reliable relationships, organize multi-agent work, exchange value, support an AI economy, and stay safe and accountable under real-world oversight. This paper introduces the Foundation Protocol (FP), a graph-first coordination layer for an emerging human-AI society. FP unifies heterogeneous entities, including agents, tools, resources, humans, institutions, and organizations, and supports native multi-party organization and event-based collaboration. It also provides economic primitives for metering, receipts, and settlement, and treats policy, provenance, and audit as first-class concerns. FP is designed to wrap and bridge existing protocols rather than replace them, enabling incremental adoption while reducing integration and governance overhead. The aim is to keep autonomous agency composable while keeping accountability non-negotiable, so that coordination itself can become shared infrastructure for a human-AI society that is open, pluralistic, and governable.
2605.23218 · ▲ 77 · Code
EvalVerse presents a comprehensive evaluation framework for generative video models that bridges the gap between human aesthetic judgment and machine scoring through expert-calibrated vision-language models and multi-stage cinematic assessment.
2605.23271 · ▲ 77
SpatialBench presents a comprehensive benchmark for evaluating spatial foundation models across diverse domains and tasks, revealing limitations in current models and introducing DA-Next-5M and DA-Next to advance spatial representation learning.
2605.27367 · ▲ 68 · Code
NEO-ov is a native vision-language model that learns cross-frame and pixel-word correspondences end to end without modular components, enabling unified spatiotemporal modeling and competitive performance in visual perception tasks.
2605.28820 · ▲ 68
OmniRetrieval is a framework that handles diverse knowledge sources by identifying appropriate repositories and dispatching native queries to their respective execution engines, outperforming single-source approaches across multiple dataset types.
2605.29250 · ▲ 66 · Code
MobileGym presents a browser-based mobile environment enabling deterministic evaluation and scalable reinforcement learning through JSON-based state management and parallel execution.
2605.26114 · ▲ 58 · Code
Bidirectional Evolutionary Search combines forward candidate evolution with backward goal decomposition to improve language model generation by overcoming limitations of traditional search methods.
2605.28814 · ▲ 55 · Code
CollectionLoRA enables efficient deployment of multiple customized image editing effects by distilling numerous LoRAs into a single model through multi-teacher distillation and specialized mechanisms for concept isolation and generation.
2605.25378 · ▲ 53 · Code
TriSplat is a feed-forward 3D reconstruction network that uses oriented triangle primitives to directly generate simulation-ready meshes from single images, bypassing expensive post-processing steps.
2605.26115 · ▲ 50 · Code
A comprehensive framework is presented for converting bidirectional video diffusion models into real-time interactive world models with controllable, causal, and low-latency capabilities through fine-tuning and distillation techniques.
2605.30263 · ▲ 49 · Code
DenoiseRL is a reinforcement learning framework that enhances reasoning in large language models by learning from incorrect traces through failure-oriented optimization, improving scalability and reducing dependence on external supervision.
2605.28421 · ▲ 44 · Code
Native multimodal modeling advances beyond traditional fusion approaches by integrating modalities inherently within a unified transformer framework, enabling seamless understanding and generation across diverse input-output configurations.
2605.25343 · ▲ 42 · Code
Video diffusion models exhibit arrow-of-time perception without true causal understanding, as demonstrated by a novel benchmark measuring causal cognition through reverse surprise and visual language model analysis.
2605.30346 · ▲ 41 · Code
ThriftAttention reduces long-context attention computation by selectively applying higher precision to critical query-key interactions, achieving near-full-precision quality at a reduced bitwidth.
2605.23081 · ▲ 41 · Code
QUEST is an open family of deep research agents trained with synthesized data and reinforcement learning to perform well across diverse long-horizon search tasks.
2605.24218 · ▲ 40 · Code
A novel diffusion-based framework for multi-view 3D reconstruction restores both scene geometry and high-quality imagery from degraded inputs by operating in the feature space of a 3D reconstructor.
2605.26230 · ▲ 39 · Code
Large language model-based memory systems can benefit from personalized policies that adapt to individual user contexts, though accurate implementation remains challenging.
2605.25535 · ▲ 39 · Code
GEM is a vision-language model that integrates depth map generation during pre-training to improve embodied intelligence and physical operation capabilities in robotics.
2605.28548 · ▲ 38 · Code
Vision-language models exhibit entangled spatial representations that correlate vertical image position with distance, impacting reasoning robustness and performance across benchmarks.
2605.30161 · ▲ 38 · Code
Memory systems in large language models suffer from reliability issues that can be addressed through a novel tracing framework and automated fault attribution for improved performance.
2605.28732 · ▲ 37 · Code
LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences.
2605.26244 · ▲ 36 · Code
LearnWeak is an annotation-free framework that enhances small computer-use agents by identifying weaknesses through a stronger reference agent and generating targeted training data for improved domain specialization.
2605.28775 · ▲ 35 · Code
ParaVT enables parallel video tool calling through multi-agent reinforcement learning, addressing limitations of sequential approaches and improving long-video understanding performance.
2605.20342 · ▲ 34 · Code
Autonomous research agents exhibit verifiability issues like fabricated citations and unreproducible results, which are addressed through a framework ensuring evidence traceability and an end-to-end system maintaining integrity throughout research processes.
2605.26340 · ▲ 32 · Code
GenClaw presents a code-driven agentic image generation framework that enables precise visual construction through conceptualization, sketching, and coloring stages, integrating programmatic logic with generative models.
2605.30248 · ▲ 31 · Code
A multi-agent framework called Soap2Soap is presented for long-horizon video-to-video generation that maintains narrative structure and character identity across extended sequences through a consistent semantic backbone and visual reference anchors.
2605.17423 · ▲ 31 · Code
Latent diffusion models using clean-data prediction outperform velocity prediction in compressed representations, demonstrating that prediction targets are geometrically dependent rather than algebraically interchangeable.
2605.27102 · ▲ 30 · Code
An RLVR framework for computer-use agents addresses data scarcity through a scalable generation pipeline and synthetic environments, achieving superior performance on verification and transfer benchmarks.
2605.25624 · ▲ 29 · Code
Long-lived AI agents require lifespan evaluation and mechanism-level diagnosis beyond initial performance testing to ensure reliability over time.
2605.26302 · ▲ 28 · Code
EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining model accuracy.
2605.30010 · ▲ 27 · Code
LLaVA-OneVision-2 achieves superior multimodal performance through codec-stream tokenization, windowed attention, and large-scale open supervision across video understanding, temporal grounding, and tracking tasks.
2605.25979 · ▲ 25 · Code
NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising.
2605.30073 · ▲ 24 · Code
The Claw-Anything benchmark evaluates large language model agents on comprehensive user activity contexts spanning extended timeframes, multiple services, and diverse device interactions to assess true always-on personal assistance capabilities.
2605.26086 · ▲ 23 · Code
OSP-Next is an efficient text-to-video generation model that combines sparse attention, parallelism, quantization, and reinforcement learning to achieve high-quality video synthesis with reduced computational costs.
2605.28691 · ▲ 20 · Code
LLM-based agents perform poorly on the VibeSearch benchmark, which evaluates multi-turn dialogue search scenarios reflecting real user-agent collaboration rather than traditional single-turn query tasks.
2605.27882 · ▲ 11 · Code
Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more were resolved through the physicist's domain knowledge. The three it could not resolve -- all evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here and not obviously addressed by scaling alone. [Abridged.]
2605.30353v1
Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
2605.30351v1
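The memory saving described in the VideoMLA abstract can be illustrated with simple arithmetic. The sketch below compares the number of cached values per token for a per-head key/value layout against a shared latent plus a shared positional key. The function names and all dimensions are hypothetical and are not taken from the paper, so the printed figure is illustrative and is not the reported 92.7%.

```python
def kv_values_per_token_per_head(n_heads: int, head_dim: int) -> int:
    """Per-head layout: every head caches one key and one value vector."""
    return 2 * n_heads * head_dim


def kv_values_per_token_mla(latent_dim: int, rope_dim: int) -> int:
    """MLA-style layout: one shared content latent plus one shared positional key."""
    return latent_dim + rope_dim


def kv_reduction(n_heads: int, head_dim: int, latent_dim: int, rope_dim: int) -> float:
    """Fraction of per-token cache entries saved by the shared layout."""
    baseline = kv_values_per_token_per_head(n_heads, head_dim)
    shared = kv_values_per_token_mla(latent_dim, rope_dim)
    return 1.0 - shared / baseline


# Hypothetical dimensions, chosen only for illustration.
saving = kv_reduction(n_heads=12, head_dim=128, latent_dim=192, rope_dim=64)
print(f"{saving:.1%}")  # -> 91.7%
```

Because the saving applies to each cached token at each cached layer, the same ratio carries over to the whole sliding-window cache under these assumptions.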
The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\textbf{LLMSurgeon}$, a strong framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated $\textit{soft}$ confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. To evaluate, we introduce $\textbf{LLMScan}$, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.
2605.30348v1
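The inverse problem described in the LLMSurgeon abstract can be sketched with a generic label-shift estimator. Under label shift, the average classifier output q on generated text satisfies q ≈ C π, where C is the confusion matrix and π is the unknown mixture, so π can be recovered by constrained least squares on the probability simplex. The solver (projected gradient descent), the function names, and the toy numbers below are assumptions for illustration; the abstract does not specify the paper's actual calibration or optimization procedure.

```python
import numpy as np


def project_to_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def estimate_mixture(confusion, observed, steps: int = 2000) -> np.ndarray:
    """Minimize 0.5 * ||C pi - q||^2 over the simplex by projected gradient descent.

    confusion[i, j]: probability the classifier outputs domain i for text from domain j.
    observed[i]: average (soft) classifier output for domain i on generated text.
    """
    C = np.asarray(confusion, dtype=float)
    q = np.asarray(observed, dtype=float)
    pi = np.full(C.shape[1], 1.0 / C.shape[1])  # start from the uniform mixture
    step = 1.0 / np.linalg.norm(C, 2) ** 2      # 1 / Lipschitz constant of the gradient
    for _ in range(steps):
        grad = C.T @ (C @ pi - q)
        pi = project_to_simplex(pi - step * grad)
    return pi


# Toy example with three hypothetical domains.
C = np.array([[0.80, 0.10, 0.10],
              [0.15, 0.70, 0.20],
              [0.05, 0.20, 0.70]])
true_pi = np.array([0.5, 0.3, 0.2])
q = C @ true_pi                      # what naive aggregation of classifier outputs reports
estimated_pi = estimate_mixture(C, q)
print(np.round(q, 3), np.round(estimated_pi, 3))
```

In this toy case the naive aggregate q differs from the true mixture because of systematic confusion between domains, while the constrained inverse recovers it.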
Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While generative AI has advanced digital and analog IC design, PCB schematic generation from natural-language intent is largely unexplored. This paper presents SchGen, the first large language model that generates editable PCB schematics from natural-language requests. The key challenge lies in the lack of an LLM-suited representation and a large-scale dataset. Current schematic formats are dominated by verbose, tool-specific syntax and geometry-heavy descriptions, making them difficult to generate reliably. We introduce a semantically grounded code representation that encodes schematic editing primitives with relative placement and pin-name-based wiring, transforming a geometry-driven generation problem into a semantics-driven matching task amenable to LLMs. We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation. Experiments show that SchGen significantly outperforms alternative representations and even larger general-purpose LLMs on wire connectivity accuracy and functional correctness. Our results highlight the critical role of representation design in enabling generative models for complex hardware design tasks.
2605.30345v1
A plausible future mathematical claim must satisfy two constraints: it should follow the direction of prior work and respect the formal dependencies that constrain what can validly follow. Existing approaches typically model only one of these sources, producing claims that are either weakly grounded or insufficiently motivated. We introduce grounded future mathematical generation, where the goal is to generate a plausible future theorem-like claim for an anchor paper using two complementary sources of context: its scientific citation graph and an aligned formal theorem dependency graph. To address this setting, we propose COMPOSE, a dual-graph framework that conditions a language model on both scientific citation context and formal theorem structure. To support it, we construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, together with a benchmark of 47K future papers from 2024--2025. Experiments show that COMPOSE outperforms strong baselines on retrieval to real future papers and achieves the best overall performance under LLM-judge evaluation, producing more grounded and mathematically richer outputs. These results show that future mathematical generation benefits from combining scientific context with formal structure. The project page is available at https://david-busbib.github.io/COMPOSE-page/.
2605.30333v1
Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6/9 under real subject-level clustering and stays at 5-6 out of 9 in 99.9% of category-bootstrap resamples. We frame paired LLM evaluation as a hypothesis-testing problem, invert level-alpha, power-(1-beta) tests, and report a per-pair resolution ratio q = N/N* as the primary diagnostic. A sharp small-effect expansion with an explicit second-order constant shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two in the close-comparison regime, a deficit that three of five off-the-shelf calculators (Cohen 1988, G*Power, R pwr) silently inherit when the user post-multiplies their per-arm output by (1-rho). The unresolved-pair pattern remains under multiplicity correction and anytime-valid sequential testing.
2605.30315v1
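The resolution ratio q = N/N* from the abstract above can be illustrated with the textbook first-order normal approximation for a two-sided paired test on an accuracy difference. This is not the paper's sharp second-order expansion. The function names, the example accuracies, the correlation, and the benchmark size are all hypothetical, and reading q < 1 as "unresolved" is an interpretation of the abstract.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_n_paired(p_a: float, p_b: float, rho: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """First-order N* for detecting p_a - p_b when both models answer the same items.

    rho is the per-item correlation between the two models' correctness indicators.
    """
    delta = p_a - p_b
    if delta == 0:
        raise ValueError("accuracies are equal; no finite sample size resolves the pair")
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    var_a = p_a * (1 - p_a)
    var_b = p_b * (1 - p_b)
    # Variance of the per-item difference in correctness.
    var_diff = var_a + var_b - 2 * rho * sqrt(var_a * var_b)
    return ceil(z ** 2 * var_diff / delta ** 2)


def resolution_ratio(n_items: int, p_a: float, p_b: float, rho: float,
                     alpha: float = 0.05, power: float = 0.8) -> float:
    """q = N / N*: available items divided by items required at (alpha, power)."""
    return n_items / required_n_paired(p_a, p_b, rho, alpha, power)


# Hypothetical pair: 72% vs 70% accuracy, correlation 0.6, 12,000 shared items.
print(required_n_paired(0.72, 0.70, rho=0.6))
print(round(resolution_ratio(12_000, 0.72, 0.70, rho=0.6), 2))
```

Higher correlation between the two models shrinks the variance of the per-item difference, so the same accuracy gap needs fewer items to resolve.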