Jie Tang

Tony

Data Standards for Humanoid Robotics: The Missing Infrastructure for Physical AI

Jun 18, 2026

Shaoshan Liu, Xiugong Qin, Xuan Wu, Xuan Xia, Ning Ding, Jialu Liu, Jie Tang

Abstract:The scalability of humanoid robots will depend not only on models and hardware, but also on whether physical experience can accumulate across robots, tasks, organizations, and time. Drawing on the authors' work in developing ISO/WD 26264-1, Humanoid robot datasets -- Part 1: General requirements, within ISO/TC 299/WG 16, this article argues that data standards are becoming foundational infrastructure for Physical AI. We develop three insights. First, humanoid robot data is embodied interaction data, not a collection of isolated digital samples; a useful dataset must preserve the relationship among robot body, action, task, scene, execution trace, and outcome. Second, its value depends on physical coherence: multimodal streams are reusable only when timing, coordinate frames, calibration, kinematics, units, and synchronization assumptions remain inspectable. Third, the main bottleneck is not only data scarcity, but non-cumulative data caused by high collection costs, data silos, and inconsistent evaluation. We argue that humanoid robot data standards address these bottlenecks by making embodied experience interpretable, shareable, traceable, and reusable. A general standard should provide horizontal infrastructure for lifecycle management, metadata, provenance, quality, versioning, and traceability, while capability-specific parts should define domain grammar for manipulation, locomotion, human-robot interaction, cognition, and future humanoid capabilities. As AI moves from screens into bodies, data standards must evolve from organizing digital information to structuring physical interaction.

LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings

Jun 16, 2026

Yi Zhao, Zhen Yang, Mengpan Chen, Mingde Xu, Shanghui Gong, Xijun Liu, Jibing Gong, Jie Tang

Abstract:Recent vision-language models (VLMs) have shown promising progress in generating webpages from visual inputs, yet existing evaluations mainly focus on short, single-screen, and largely static webpages. We introduce LongWebBench, a benchmark for evaluating long-horizon webpage generation from both structural and functional perspectives. LongWebBench contains 490 real-world long webpages for structural fidelity evaluation and 507 goal-oriented interaction tasks over 129 webpages for functional evaluation. It employs two complementary protocols: a multi-dimensional VLM-based metric for assessing long-range structural coherence, and a DOM-augmented agent-based pipeline for end-to-end functional verification. We further examine the automatic evaluation protocols through human agreement analysis. Experiments with state-of-the-art open-source and proprietary VLMs under single-image and multi-image settings reveal that structural fidelity degrades as webpage length increases, while visually plausible generations often fail to support executable multi-step interactions. These results highlight the need to evaluate long webpage generation beyond visual similarity, with executable interaction as a core criterion. Our code and data are available at https://github.com/zheny2751-dotcom/LongWebBench.

* 49 pages, 38 figures

SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

Jun 09, 2026

Wenhao Yan, Fengjia Guo, Zhuoyi Yang, Jie Tang

Abstract:Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior works rely heavily on intermediate representations, such as pose skeletons to represent motion or masked backgrounds to represent the environment, which inevitably leads to information loss. To address this, we present SCAIL-2, a framework that bypasses those intermediates and achieves end-to-end character animation. By directly concatenating driving videos to the sequence, the model can obtain all the required visual information from the input video. To address the lack of end-to-end data, we unify sub-tasks of character animation with decoupled conditions and then curate a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset containing heterogeneous character animation tasks. To achieve this unification, we utilize in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthetic discrepancy in detailed regions, we propose Bias-Aware DPO to construct preference items to mitigate the errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches in various character animation tasks. A large subset of synthetic data as well as model weights will be released at our project page: https://teal024.github.io/SCAIL-2/.

From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory

Jun 07, 2026

Yishuo Cai, Xingyu Guo, Xuancheng Huang, Jinhua Du, Can Huang, Wenxuan Huang, Wenhan Ma, Yuyang Hu, Aohan Zeng, Jie Tang (+1 more)

Abstract:Large language model (LLM) agents are increasingly deployed in long-running settings where improving through experience at test time becomes important. A common approach is to update an explicit memory after each interaction to guide future decisions. However, most existing methods rely on hand-designed prompting rules, making it difficult to consistently align memory updates with downstream objectives over multi-step horizons. We propose MemoPilot, a plug-in memory copilot that explicitly trains the memory update process to improve a frozen LLM's performance across sequential interactions. We formulate memory updating as a multi-turn decision problem and optimize it end-to-end with multi-turn GRPO. Our training recipe introduces (i) a turn-wise reward signal and (ii) a context-independent, turn-level advantage estimation across rollouts, enabling finer-grained credit assignment and more stable training in multi-turn settings. We evaluate MemoPilot on two testbeds: multi-round Rock-Paper-Scissors (RPS) and Limit Texas Hold'em (LHE). Across both environments, MemoPilot substantially improves test-time learning of a frozen player over strong baselines, ranking first in Elo ratings on both games (1762 on LHE and 1590 on RPS) and outperforming all baseline memory methods and proprietary models, including DeepSeek-V3.2.

* Accepted by ICML 2026
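
As a rough illustration of turn-level advantage estimation across rollouts, the sketch below normalizes each turn's reward against the rewards that the other rollouts received at the same turn index. This is an assumption about one plausible group-relative form, not the MemoPilot training recipe itself; the function name `turn_level_advantages`, the `eps` constant, and the requirement that all rollouts have equal length are hypothetical.

```python
from statistics import mean, pstdev


def turn_level_advantages(rewards, eps=1e-8):
    """Normalize rewards per turn index across a group of rollouts.

    rewards[i][t] is the reward of rollout i at turn t (hypothetical layout).
    Returns a list with the same shape holding (r - mean_t) / (std_t + eps).
    """
    if not rewards:
        return []
    num_turns = len(rewards[0])
    if any(len(r) != num_turns for r in rewards):
        raise ValueError("all rollouts must have the same number of turns")

    advantages = [[0.0] * num_turns for _ in rewards]
    for t in range(num_turns):
        # Collect the rewards every rollout received at turn t.
        column = [r[t] for r in rewards]
        mu = mean(column)
        sigma = pstdev(column)  # population std over the group
        for i, value in enumerate(column):
            advantages[i][t] = (value - mu) / (sigma + eps)
    return advantages


# Three rollouts, two turns each (made-up rewards).
adv = turn_level_advantages([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
```

Because each turn is normalized separately, a rollout can receive a positive advantage at one turn and a negative one at another, which is the kind of finer-grained credit assignment the abstract describes.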

Cross-Source Reasoning-based Correction for Author Name Disambiguation

Jun 07, 2026

Fanjin Zhang, Yunhe Pang, Bo Chen, Zhiyu Shen, Yanghui Rao, Evgeny Kharlamov, Jie Tang

Abstract:Author name disambiguation is a critical challenge in academic search systems, often addressed through from-scratch and real-time disambiguation approaches. However, current algorithms remain vulnerable to cumulative errors in paper-author assignments and overlook inconsistent assignments across different sources. Resorting to expert annotation is resource-intensive. To this end, this paper explores a new perspective for author name disambiguation: cross-source correction by leveraging inconsistent assignments across sources. We propose CrossND, a full-stack framework that integrates data refinement, cross-source reasoning, and test-time scaling. First, a chain-of-refinement pipeline denoises author profiles and produces more accurate paper-author matching probabilities. Second, a supervised fine-tuning process incorporates these refined signals and a probabilistic soft logic-based cross-correction module to infer which sources' assignments are incorrect. Third, test-time scaling further enhances the accuracy and robustness of the predictions. Experiments on real-world datasets indicate that CrossND consistently outperforms 17 baselines by leveraging cross-source reasoning without human intervention.

* Accepted at KDD 2026 ADS track

RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism

May 25, 2026

Mengyang Sun, Maochuan Dou, Tao Feng, Dan Zhang, Yihao Wang, Junpeng Liu, Yifan Zhu, Jie Tang

Abstract:While Large Language Models (LLMs) are commonly fine-tuned to handle domain-specific tasks before being applied to vertical applications, adapting them to complex scenarios with diverse specialized knowledge remains challenging. Meanwhile, the Mixture-of-Experts (MoE) architecture has emerged as a crucial paradigm for training LLMs, and some recent works have incorporated MoE into Parameter-Efficient Fine-Tuning (PEFT) to propose the Mixture of Low-rank Experts (MoE-LoRA), which enhances the power of low-rank adapters for learning complicated knowledge. However, conventional gating mechanisms in MoE typically apply only a scalar reweighting to selected experts, thereby limiting their underlying capacity for representation and generalization. Motivated and enabled by the low-rank structures in MoE-LoRA, we propose RotMoLE, a specialized MoE framework for low-rank experts featuring an additional rotation gate. Beyond simple scaling, RotMoLE implements a rotation mechanism for each selected expert, enabling superior expert exploitation and specialization for learning diverse data, especially when expert candidates are limited. Empirical results on complex multi-task and multilingual training scenarios validate the effectiveness of our approach.

Sparse Fluid Antenna Arrays: Continuous Position Design Beyond Classical DOF Limits

May 19, 2026

Tuo Wu, Jie Tang, Ye Tian, Cheng Zeng, Matthew C. Valenti, Hing Cheung So

Abstract:A fluid antenna system (FAS), which continuously repositions a single physical element across a deployment region $[0, D]$, breaks the classical degrees-of-freedom (DOF) limit of grid-constrained arrays by freeing antenna positions from the discrete grid entirely. This paper establishes the theoretical foundations of sparse FAS design for direction-of-arrival (DOA) estimation and shows that continuous position freedom unlocks three compounding advantages over the classical designs. First, we derive a universal dual DOF bound and prove that FAS-optimized positions can approach it, growing the DOF linearly with $D/λ$, where $λ$ is the signal wavelength, rather than saturating at $O(N^2)$. Second, the Cramér-Rao bound (CRB) scales as $O(1/D^{2L})$ for $L$ sources, a $(D/(N^2 d_0))^{2L}$ improvement over the best grid design, with $d_0 = λ/2$ and D-optimal positions admitting a closed-form solution for a single source and an efficient Frank-Wolfe algorithm for multiple sources. Third, we propose a two-stage FAS-MUSIC approach that combines coarray MUSIC disambiguation with full-aperture local maximum likelihood (ML) refinement to track the CRB, overcoming the grating-lobe ambiguity inherent in large-aperture non-uniform arrays. Robustness to minimum spacing constraints, mutual coupling, and finite position accuracy is also analyzed. Extensive simulations show that FAS-MUSIC achieves $17.5\times$ lower root mean squared error (RMSE) than uniform linear array (ULA) MUSIC and that FAS with $4$ antennas outperforms a minimum-redundancy array (MRA) with $8$ antennas, gains that are unattainable by any grid-constrained design.
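
To get a feel for the stated scaling, the sketch below evaluates the ratio $(D/(N^2 d_0))^{2L}$ with $d_0 = λ/2$ for example values. The function name and the example numbers are illustrative assumptions, and the ratio ignores any constant factors hidden in the $O(\cdot)$ notation.

```python
def crb_improvement_factor(aperture, num_antennas, wavelength, num_sources):
    """Evaluate (D / (N^2 * d0))^(2L) with d0 = wavelength / 2.

    aperture and wavelength must share the same length unit.
    This is only the scaling ratio quoted in the abstract, not a CRB computation.
    """
    d0 = wavelength / 2.0                    # half-wavelength grid spacing
    grid_extent = num_antennas ** 2 * d0     # N^2 * d0 term from the abstract
    return (aperture / grid_extent) ** (2 * num_sources)


# Example (made-up values): D = 100 wavelengths, N = 4 antennas, L = 1 source.
factor = crb_improvement_factor(aperture=100.0, num_antennas=4,
                                wavelength=1.0, num_sources=1)
print(factor)  # 156.25
```

With these numbers the grid term is $N^2 d_0 = 8λ$, so the ratio is $12.5^2 = 156.25$; the exponent $2L$ makes the factor grow quickly as more sources are considered.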

How Many Independent Modes Does a Fluid Antenna Have? A Closed-Form Outage Analysis via Equivalent Degrees of Freedom

May 19, 2026

Tuo Wu, Junteng Yao, Kai-Kit Wong, Jie Tang, Maged Elkashlan, Baiyang Liu, Kin-Fai Tong, Hyundong Shin

Abstract:In a fluid antenna system (FAS), a single reconfigurable antenna is able to activate one of $N$ correlated ports to exploit spatial diversity. However, outage analysis is challenging because exact evaluation requires an $N$-dimensional multivariate integral, while existing closed-form approximations based on block-correlation models tend to underestimate the true outage probability. This paper shows that the spatial correlation matrix of a FAS with a normalized linear aperture length $W$ has at most $K^{*}=2\lceil W\rceil+1$ significant eigenmodes, regardless of the number of deployed ports. This is a spatial counterpart of the Slepian-Landau-Pollak spectral concentration theorem and reveals that the spatial degrees of freedom are determined by aperture size rather than port count. Motivated by this result, we derive an equivalent degree of freedom (EDoF) approximation, under which the outage probability can be expressed in closed form as that of selection combining over $K^{*}$ independent branches. We further propose a refined weighted independent modes (WIM) approximation that incorporates eigenvalue-dependent branch weights $\{β_k\}$ and yields a product-form closed-form expression with improved accuracy at moderate signal-to-noise ratio (SNR). Both approximations achieve the exact diversity order, become asymptotically exact at high SNR, and provably never underestimate the true outage probability by Anderson's inequality. The proposed framework is further extended to obtain closed-form expressions for ergodic capacity and to characterize multi-user fluid antenna multiple access (FAMA) with explicit interference-limited outage floors. We also analyze two-dimensional planar FAS, for which the diversity order scales multiplicatively with the aperture dimensions.
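
The mode count and the equal-branch outage form can be sketched numerically. The code below computes $K^{*}=2\lceil W\rceil+1$ and the textbook outage probability of selection combining over independent, identically distributed Rayleigh-fading branches, $(1-e^{-γ_{th}/\bar{γ}})^{K}$. The Rayleigh assumption, the function names, and the example values are assumptions made for illustration; the paper's exact expressions may differ.

```python
import math


def significant_modes(normalized_aperture):
    """K* = 2 * ceil(W) + 1, the mode count quoted in the abstract."""
    return 2 * math.ceil(normalized_aperture) + 1


def sc_outage_rayleigh(num_branches, snr_threshold, mean_snr):
    """Outage of selection combining over i.i.d. Rayleigh branches.

    Each branch SNR is exponential with the given mean (linear scale), so the
    best branch falls below the threshold only if every branch does.
    """
    single_branch = 1.0 - math.exp(-snr_threshold / mean_snr)
    return single_branch ** num_branches


# Example (made-up values): W = 2.5, threshold 1.0, mean SNR 10.0 (linear).
k_star = significant_modes(2.5)
p_out = sc_outage_rayleigh(k_star, snr_threshold=1.0, mean_snr=10.0)
```

Note that adding ports beyond this point does not change `k_star`; only a larger aperture `W` does, which mirrors the abstract's claim that aperture size, not port count, sets the spatial degrees of freedom.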

NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

May 07, 2026

Dijia Zhan, Jinyi Li, Chenxi Zheng, Shaoyu Huang, Yong Li, Jie Tang, Xuemiao Xu

Abstract:Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.

* 10 pages, 7 figures

HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

Apr 21, 2026

Andrew Zhuoer Feng, Cunxiang Wang, Yu Luo, Lin Fan, Yilin Zhou, Zikang Wang, Xiaotao Gu, Jie Tang, Hongning Wang, Minlie Huang

Abstract:Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM performance on thousand-word, open-ended writing is inadequately assessed by traditional reference-based metrics or modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency often found when LLM-as-a-judge aggregates all sub-features in text evaluation. ToW incorporates a tree-structured workflow by explicitly modeling the aggregation weights of sub-features. We also present HoWToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW successfully mitigates the biases, achieving a 0.93 Pearson correlation with human judgments. Furthermore, we find that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, showing that these scores cannot be improved simply by piling on input-side information.

* 49 pages, 6 figures, 19 tables, ACL 2026 main
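
A tree-structured aggregation with explicit weights can be pictured as a recursive weighted average. The sketch below is a hypothetical illustration only: the `Node` class, the feature names, the weights, the scores, and the normalization by total child weight are all assumptions, not the ToW implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One sub-feature in a hypothetical evaluation tree."""
    name: str
    weight: float = 1.0                 # weight relative to sibling nodes
    score: Optional[float] = None       # set on leaves only
    children: List["Node"] = field(default_factory=list)


def aggregate(node):
    """Return the leaf score, or the weighted average of the children's scores."""
    if not node.children:
        if node.score is None:
            raise ValueError(f"leaf {node.name!r} has no score")
        return node.score
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"children of {node.name!r} need positive total weight")
    weighted = sum(child.weight * aggregate(child) for child in node.children)
    return weighted / total_weight


# Made-up tree: "content" counts three times as much as "language".
tree = Node("overall", children=[
    Node("content", weight=3.0, children=[
        Node("relevance", score=8.0),
        Node("depth", score=6.0),
    ]),
    Node("language", weight=1.0, score=9.0),
])
print(aggregate(tree))  # 7.5
```

Making the weights explicit at each node means the final score can be traced back to its sub-features, in contrast to a single judge prompt that combines everything implicitly.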
