Xiang Shi

Network Technology Lab., Huawei Technologies Co., Ltd

ROI-Driven Foveated Attention for Unified Egocentric Representations in Vision-Language-Action Systems

Mar 21, 2026

Xinhai Sun, Xiang Shi, Menglin Zou, Wenlong Huang

Abstract:The development of embodied AI systems is increasingly constrained by the availability and structure of physical interaction data. Despite recent advances in vision-language-action (VLA) models, current pipelines suffer from high data collection cost, limited cross-embodiment alignment, and poor transfer from internet-scale visual data to robot control. We propose a region-of-interest (ROI) driven engineering workflow that introduces an egocentric, geometry-grounded data representation. By projecting end-effector poses via forward kinematics (FK) into a single external camera, we derive movement-aligned hand-centric ROIs without requiring wrist-mounted cameras or multi-view systems. Unlike directly downsampling the full frame, ROI is cropped from the original image before resizing, preserving high local information density for contact-critical regions while retaining global context. We present a reproducible pipeline covering calibration, synchronization, ROI generation, deterministic boundary handling, and metadata governance. The resulting representation is embodiment-aligned and viewpoint-normalized, enabling data reuse across heterogeneous robots. We argue that egocentric ROI serves as a practical data abstraction for scalable collection and cross-embodiment learning, bridging internet-scale perception and robot-specific control.
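The ROI step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes the end-effector position has already been obtained from forward kinematics and transformed into the camera frame with the calibrated extrinsics, and it assumes a plain pinhole model. The function names, the square ROI, and the "shift the box back inside the image" boundary rule are all hypothetical choices made for the example.

```python
def project_point(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera frame, metres) to pixel coordinates."""
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def roi_box(center, roi_size, img_w, img_h):
    """Square crop of side roi_size centred on `center`.

    Boundary handling (an assumed rule): the box is shifted, not shrunk,
    so every ROI has the same size and the result is deterministic.
    """
    u, v = center
    half = roi_size // 2
    left = int(round(u)) - half
    top = int(round(v)) - half
    left = max(0, min(left, img_w - roi_size))
    top = max(0, min(top, img_h - roi_size))
    return (left, top, left + roi_size, top + roi_size)


# Hypothetical numbers: end-effector 0.5 m in front of a 640x480 camera.
center = project_point((0.1, -0.05, 0.5), fx=600, fy=600, cx=320, cy=240)
box = roi_box(center, roi_size=128, img_w=640, img_h=480)
```

The crop would be taken from the original frame at `box` and only then resized, which is what keeps the local pixel density higher than downsampling the full frame.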

SaiVLA-0: Cerebrum--Pons--Cerebellum Tripartite Architecture for Compute-Aware Vision-Language-Action

Mar 09, 2026

Xiang Shi, Wenlong Huang, Menglin Zou, Xinhai Sun

Abstract:We revisit Vision-Language-Action through a neuroscience-inspired triad. Biologically, the Cerebrum provides stable high-level multimodal priors and remains frozen; the Pons Adapter integrates these cortical features with real-time proprioceptive inputs and compiles intent into execution-ready tokens; and the Cerebellum (ParaCAT) performs fast, parallel categorical decoding for online control, with hysteresis/EMA/temperature/entropy for stability. A fixed-ratio schedule and two-stage feature caching make the system compute-aware and reproducible. Inspired by active, foveated vision, our wrist ROIs are geometrically tied to the end-effector via calibrated projection, providing a movement-stabilized, high-resolution view that is sensitive to fine-grained pose changes and complements the global context of the main view. The design is modular: upgrading the Cerebrum only retrains the Pons; changing robots only trains the Cerebellum; cerebellum-only RL can further refine control without touching high-level semantics. As a concept-and-protocol paper with preliminary evidence, we outline a timing protocol under matched conditions (GPU, resolution, batch) to verify anticipated efficiency gains. We also report preliminary LIBERO evidence showing that split feature caching reduces training time (7.5h to 4.5h) and improves average success (86.5% to 92.5%) under official N1.5 head-only training, and that SaiVLA0 reaches 99.0% mean success.

* 14 pages, 3 figures

Interplay Between Belief Propagation and Transformer: Differential-Attention Message Passing Transformer

Sep 19, 2025

Chin Wa Lau, Xiang Shi, Ziyan Zheng, Haiwen Cao, Nian Guo

Abstract:Transformer-based neural decoders have emerged as a promising approach to error correction coding, combining data-driven adaptability with efficient modeling of long-range dependencies. This paper presents a novel decoder architecture that integrates classical belief propagation principles with transformer designs. We introduce a differentiable syndrome loss function leveraging global codebook structure and a differential-attention mechanism optimizing bit and syndrome embedding interactions. Experimental results demonstrate consistent performance improvements over existing transformer-based decoders, with our approach surpassing traditional belief propagation decoders for short-to-medium length LDPC codes.
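To make the idea of a differentiable syndrome loss concrete, here is one common soft-syndrome formulation. It is an assumption for illustration and is not claimed to be the loss used in the paper: for independent bits with log-likelihood ratios L_i = log P(0)/P(1), the probability that a parity check is violated is (1 - prod tanh(L_i / 2)) / 2, which is smooth in the LLRs. The parity-check matrix below is a small example chosen for readability, not an LDPC code from the paper.

```python
import math


def hard_syndrome(H, bits):
    """Classical syndrome: H * bits mod 2, one entry per parity check."""
    return [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]


def soft_syndrome_loss(H, llrs):
    """Mean probability that a check is violated, assuming independent bits.

    llrs[i] = log P(bit i = 0) / P(bit i = 1); tanh(llr / 2) is the
    expected value of the bit in +1/-1 form.
    """
    total = 0.0
    for row in H:
        prod = 1.0
        for h, llr in zip(row, llrs):
            if h:
                prod *= math.tanh(llr / 2.0)
        total += (1.0 - prod) / 2.0
    return total / len(H)


# Illustrative 3x7 parity-check matrix (hypothetical example).
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
good = soft_syndrome_loss(H, [10.0] * 7)           # confident all-zero codeword
bad = soft_syndrome_loss(H, [-10.0] + [10.0] * 6)  # bit 0 confidently flipped
```

The loss is close to 0 when the soft decisions satisfy every check and rises as checks are violated, so it can be minimized by gradient descent without knowing the transmitted codeword.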

* 6 pages, 4 figures, to be published in ISIT 2025

Understanding Stragglers in Large Model Training Using What-if Analysis

May 09, 2025

Jinkun Lin, Ziheng Jiang, Zuquan Song, Sida Zhao, Menghan Yu, Zhanghan Wang, Chenyuan Wang, Zuocheng Shi, Xiang Shi, Wei Jia (+6 more)

Abstract:Large language model (LLM) training is one of the most demanding distributed computations today, often requiring thousands of GPUs with frequent synchronization across machines. Such a workload pattern makes it susceptible to stragglers, where training can be stalled by a few slow workers. At ByteDance, we find that stragglers are not always simply caused by hardware failures, but can arise from multiple complex factors. This work presents a comprehensive study of straggler issues in LLM training, using a five-month trace collected from our ByteDance LLM training cluster. The core methodology is what-if analysis, which simulates the scenario without any stragglers and contrasts it with the actual case. We use this method to study the following questions: (1) how often do stragglers affect training jobs, and what effect do they have on job performance; (2) do stragglers exhibit temporal or spatial patterns; and (3) what are the potential root causes of stragglers?
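A toy version of the what-if idea can be sketched as below. It is a deliberate simplification and not the paper's simulator: it assumes fully synchronous steps (each step ends when the slowest worker finishes) and approximates the "no stragglers" scenario by replacing every worker's time with the per-step median. All function names are hypothetical.

```python
from statistics import median


def job_time(step_times):
    """step_times[s][w] = time worker w spent on step s (synchronous steps)."""
    return sum(max(step) for step in step_times)


def without_stragglers(step_times):
    """What-if trace: every worker takes the median time of its step."""
    return [[median(step)] * len(step) for step in step_times]


def fix_worker(step_times, w):
    """What-if trace: only worker w is brought back to the median."""
    return [
        [median(step) if i == w else t for i, t in enumerate(step)]
        for step in step_times
    ]


# Hypothetical trace: 3 steps, 4 workers, worker 3 is slow on step 1.
trace = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]
slowdown = job_time(trace) / job_time(without_stragglers(trace))
```

Comparing `job_time(fix_worker(trace, w))` across workers gives a rough indication of which worker accounts for the slowdown.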

Scalable Hierarchical Reinforcement Learning for Hyper Scale Multi-Robot Task Planning

Dec 27, 2024

Xuan Zhou, Xiang Shi, Lele Zhang, Chen Chen, Hongbo Li, Lin Ma, Fang Deng, Jie Chen

Abstract:To improve the efficiency of warehousing systems and meet large volumes of customer orders, we aim to solve the challenges of the curse of dimensionality and dynamic properties in hyper scale multi-robot task planning (MRTP) for robotic mobile fulfillment systems (RMFS). Existing research indicates that hierarchical reinforcement learning (HRL) is an effective method to mitigate these challenges. Building on this, we construct an efficient multi-stage HRL-based multi-robot task planner for hyper scale MRTP in RMFS, in which the planning process is represented with a special temporal graph topology. To ensure optimality, the planner is designed with a centralized architecture, but this also brings challenges of scaling up and generalization, which require policies to maintain performance on various unlearned scales and maps. To tackle these difficulties, we first construct a hierarchical temporal attention network (HTAN) to ensure the basic ability to handle inputs of variable length, and then design multi-stage curricula for hierarchical policy learning to further improve scaling up and generalization while avoiding catastrophic forgetting. Additionally, we notice that policies with a hierarchical structure suffer from unfair credit assignment similar to that in multi-agent reinforcement learning. Inspired by this, we propose a hierarchical reinforcement learning algorithm with a counterfactual rollout baseline to improve learning performance. Experimental results demonstrate that our planner outperforms other state-of-the-art methods on various MRTP instances in both simulated and real-world RMFS. Our planner can also successfully scale up to hyper scale MRTP instances in RMFS with up to 200 robots and 1000 retrieval racks on unlearned maps while keeping superior performance over other methods.

Minder: Faulty Machine Detection for Large-scale Distributed Model Training

Nov 04, 2024

Yangtao Deng, Xiang Shi, Zhuo Jiang, Xingjian Zhang, Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, Gaohong Liu (+5 more)

Abstract:Large-scale distributed model training requires simultaneous training on up to thousands of machines. Faulty machine detection is critical when an unexpected fault occurs in a machine. From our experience, a training task can encounter two faults per day on average, possibly leading to a halt for hours. To address the drawbacks of time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect the distinctive monitoring metric patterns of a faulty machine, which can last for a period before the entire training task comes to a halt. Minder has been deployed in our production environment for over one year, monitoring daily distributed training tasks, each involving up to thousands of machines. In our real-world fault detection scenarios, Minder can accurately and efficiently react to faults within 3.6 seconds on average, with a precision of 0.904 and an F1-score of 0.893.

Every Part Matters: Integrity Verification of Scientific Figures Based on Multimodal Large Language Models

Jul 26, 2024

Xiang Shi, Jiawei Liu, Yinpeng Liu, Qikai Cheng, Wei Lu

Abstract:This paper tackles a key issue in the interpretation of scientific figures: the fine-grained alignment of text and figures. It advances beyond prior research that primarily dealt with straightforward, data-driven visualizations such as bar and pie charts and only offered a basic understanding of diagrams through captioning and classification. We introduce a novel task, Figure Integrity Verification, designed to evaluate the precision of technologies in aligning textual knowledge with visual elements in scientific figures. To support this, we develop a semi-automated method for constructing a large-scale dataset, Figure-seg, specifically designed for this task. Additionally, we propose an innovative framework, Every Part Matters (EPM), which leverages Multimodal Large Language Models (MLLMs) to not only incrementally improve the alignment and verification of text-figure integrity but also enhance integrity through analogical reasoning. Our comprehensive experiments show that these innovations substantially improve upon existing methods, allowing for more precise and thorough analysis of complex scientific figures. This progress not only enhances our understanding of multimodal technologies but also stimulates further research and practical applications across fields requiring the accurate interpretation of complex visual data.

* 28 pages, 11 figures, under review

Let's Learn Step by Step: Enhancing In-Context Learning Ability with Curriculum Learning

Feb 16, 2024

Yinpeng Liu, Jiawei Liu, Xiang Shi, Qikai Cheng, Wei Lu

Abstract:Demonstration ordering, which is an important strategy for in-context learning (ICL), can significantly affect the performance of large language models (LLMs). However, most current ordering approaches require additional knowledge and similarity calculation. We advocate few-shot in-context curriculum learning (ICCL), a simple but effective demonstration ordering method for ICL, which gradually increases the complexity of prompt demonstrations during the inference process. We then design three experiments to discuss the effectiveness of ICCL, the formation mechanism of LLMs' ICCL capability, and the impact of ordering subjects. Experimental results demonstrate that ICCL, developed during the instruction-tuning stage, is effective for open-source LLMs. Moreover, LLMs exhibit a weaker capacity than humans in discerning the difficulty levels of demonstrations. We release our code at https://github.com/61peng/curri_learning.
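The ordering idea can be illustrated with a short sketch. It is not taken from the released repository: the prompt template, the function name, and the difficulty scorer are assumptions, and question length is used below only as a stand-in for whatever difficulty ranking (human or model-assigned) is actually available.

```python
def build_iccl_prompt(demos, query, difficulty):
    """Order (question, answer) demonstrations from easy to hard, then append the query.

    `difficulty` is a caller-supplied scoring function (hypothetical);
    lower scores are placed earlier in the prompt.
    """
    ordered = sorted(demos, key=difficulty)
    parts = [f"Q: {q}\nA: {a}" for q, a in ordered]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


# Hypothetical demonstrations; difficulty approximated by question length.
demos = [("12*13+7?", "163"), ("2+2?", "4"), ("3*4?", "12")]
prompt = build_iccl_prompt(demos, "15*4?", difficulty=lambda d: len(d[0]))
```

Because Python's `sorted` is stable, demonstrations with equal difficulty keep their original relative order, which keeps the prompt reproducible.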

Know Where to Go: Make LLM a Relevant, Responsible, and Trustworthy Searcher

Oct 19, 2023

Xiang Shi, Jiawei Liu, Yinpeng Liu, Qikai Cheng, Wei Lu

Abstract:The advent of Large Language Models (LLMs) has shown the potential to improve relevance and provide direct answers in web searches. However, challenges arise in validating the reliability of generated results and the credibility of contributing sources, due to the limitations of traditional information retrieval algorithms and the LLM hallucination problem. Aiming to create a "PageRank" for the LLM era, we strive to transform the LLM into a relevant, responsible, and trustworthy searcher. We propose a novel generative retrieval framework leveraging the knowledge of LLMs to foster a direct link between queries and online sources. This framework consists of three core modules: Generator, Validator, and Optimizer, which focus on generating trustworthy online sources, verifying source reliability, and refining unreliable sources, respectively. Extensive experiments and evaluations highlight our method's superior relevance, responsibility, and trustworthiness against various SOTA methods.

* 14 pages, 4 figures, under peer review

GraphCC: A Practical Graph Learning-based Approach to Congestion Control in Datacenters

Aug 09, 2023

Guillermo Bernárdez, José Suárez-Varela, Xiang Shi, Shihan Xiao, Xiangle Cheng, Pere Barlet-Ros, Albert Cabellos-Aparicio

Abstract:Congestion Control (CC) plays a fundamental role in optimizing traffic in Data Center Networks (DCN). Currently, DCNs mainly implement two CC protocols: DCTCP and DCQCN. Both protocols (and their main variants) are based on Explicit Congestion Notification (ECN), where intermediate switches mark packets when they detect congestion. The ECN configuration is thus crucial to the performance of CC protocols. Nowadays, network experts set static ECN parameters carefully selected to optimize the average network performance. However, today's high-speed DCNs experience quick and abrupt changes that severely alter the network state (e.g., dynamic traffic workloads, incast events, failures). This leads to under-utilization and sub-optimal performance. This paper presents GraphCC, a novel Machine Learning-based framework for in-network CC optimization. Our distributed solution relies on a novel combination of Multi-agent Reinforcement Learning (MARL) and Graph Neural Networks (GNN), and it is compatible with widely deployed ECN-based CC protocols. GraphCC deploys distributed agents on switches that communicate with their neighbors to cooperate and optimize the global ECN configuration. In our evaluation, we test the performance of GraphCC under a wide variety of scenarios, focusing on the capability of this solution to adapt to new scenarios unseen during training (e.g., new traffic workloads, failures, upgrades). We compare GraphCC with ACC, a state-of-the-art MARL-based solution for ECN tuning, and observe that our proposed solution outperforms this baseline in all of the evaluation scenarios, showing improvements of up to 20% in Flow Completion Time as well as significant reductions in buffer occupancy (38.0-85.7%).
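For readers unfamiliar with what an "ECN configuration" looks like on a switch, the sketch below shows the RED-style marking curve commonly used with ECN-based protocols. The abstract does not list the parameters GraphCC tunes, so the three knobs here (`k_min`, `k_max`, `p_max`) and the example values are assumptions for illustration.

```python
def ecn_mark_probability(queue_len, k_min, k_max, p_max):
    """RED-style ECN marking probability for a given queue length.

    Below k_min nothing is marked, above k_max every packet is marked,
    and in between the probability rises linearly from 0 to p_max.
    """
    if queue_len <= k_min:
        return 0.0
    if queue_len > k_max:
        return 1.0
    return p_max * (queue_len - k_min) / (k_max - k_min)


# Hypothetical static setting: thresholds in KB, 20% maximum probability.
static_cfg = {"k_min": 100, "k_max": 400, "p_max": 0.2}
p_mid = ecn_mark_probability(250, **static_cfg)
```

A static choice of these thresholds trades latency against throughput for an average workload; a per-switch agent that adjusts them at run time is one way to picture the dynamic tuning the abstract describes.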

* 11 pages, 7 figures, 2 tables
