Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wei Zhang

Alibaba Group

The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

Apr 07, 2025

Hao Fang, Runmin Cong, Xiankai Lu, Zhiyang Chen, Wei Zhang

Figure 1 for The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

Figure 2 for The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

Figure 3 for The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

Figure 4 for The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

Abstract:Motion expression video segmentation is designed to segment objects in accordance with the input motion expressions. In contrast to the conventional Referring Video Object Segmentation (RVOS), it places emphasis on motion as well as multi-object expressions, making it more arduous. Recently, Large Multimodal Models (LMMs) have begun to shine in RVOS due to their powerful vision-language perception capabilities. In this work, we propose a simple and effective inference optimization method to fully unleash the potential of LMMs in referring video segmentation. Firstly, we use Sa2VA as our baseline, which is a unified LMM for dense grounded understanding of both images and videos. Secondly, we uniformly sample the video frames during the inference process to enhance the model's understanding of the entire video. Finally, we integrate the results of multiple expert models to mitigate the erroneous predictions of a single model. Our solution achieved 61.98% J&F on the MeViS test set and ranked 1st place in the 4th PVUW Challenge MeViS Track at CVPR 2025.

Via

Access Paper or Ask Questions

Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation

Apr 07, 2025

Jiaming Chen, Wentao Zhao, Ziyu Meng, Donghui Mao, Ran Song, Wei Pan, Wei Zhang

Figure 1 for Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation

Figure 2 for Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation

Figure 3 for Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation

Figure 4 for Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation

Abstract:Model Predictive Control (MPC) is a widely adopted control paradigm that leverages predictive models to estimate future system states and optimize control inputs accordingly. However, while MPC excels in planning and control, it lacks the capability for environmental perception, leading to failures in complex and unstructured scenarios. To address this limitation, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation planning framework that integrates the perception power of vision-language models (VLMs) with MPC. VLMPC utilizes a conditional action sampling module that takes a goal image or language instruction as input and leverages VLM to generate candidate action sequences. These candidates are fed into a video prediction model that simulates future frames based on the actions. In addition, we propose an enhanced variant, Traj-VLMPC, which replaces video prediction with motion trajectory generation to reduce computational complexity while maintaining accuracy. Traj-VLMPC estimates motion dynamics conditioned on the candidate actions, offering a more efficient alternative for long-horizon tasks and real-time applications. Both VLMPC and Traj-VLMPC select the optimal action sequence using a VLM-based hierarchical cost function that captures both pixel-level and knowledge-level consistency between the current observation and the task input. We demonstrate that both approaches outperform existing state-of-the-art methods on public benchmarks and achieve excellent performance in various real-world robotic manipulation tasks. Code is available at https://github.com/PPjmchen/VLMPC.

Via

Access Paper or Ask Questions

GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

Apr 05, 2025

Jieming Cui, Tengyu Liu, Ziyu Meng, Jiale Yu, Ran Song, Wei Zhang, Yixin Zhu, Siyuan Huang

Figure 1 for GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

Figure 2 for GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

Figure 3 for GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

Figure 4 for GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

Abstract:Learning open-vocabulary physical skills for simulated agents presents a significant challenge in artificial intelligence. Current reinforcement learning approaches face critical limitations: manually designed rewards lack scalability across diverse tasks, while demonstration-based methods struggle to generalize beyond their training distribution. We introduce GROVE, a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations. Our key insight is that Large Language Models(LLMs) and Vision Language Models(VLMs) provide complementary guidance -- LLMs generate precise physical constraints capturing task requirements, while VLMs evaluate motion semantics and naturalness. Through an iterative design process, VLM-based feedback continuously refines LLM-generated constraints, creating a self-improving reward system. To bridge the domain gap between simulation and natural images, we develop Pose2CLIP, a lightweight mapper that efficiently projects agent poses directly into semantic feature space without computationally expensive rendering. Extensive experiments across diverse embodiments and learning paradigms demonstrate GROVE's effectiveness, achieving 22.2% higher motion naturalness and 25.7% better task completion scores while training 8.4x faster than previous methods. These results establish a new foundation for scalable physical skill acquisition in simulated environments.

Via

Access Paper or Ask Questions

Gaussian Process Tilted Nonparametric Density Estimation using Fisher Divergence Score Matching

Apr 04, 2025

John Paisley, Wei Zhang, Brian Barr

Abstract:We present three Fisher divergence (FD) minimization algorithms for learning Gaussian process (GP) based score models for lower dimensional density estimation problems. The density is formed by multiplying a base multivariate normal distribution with an exponentiated GP refinement, and so we refer to it as a GP-tilted nonparametric density. By representing the GP part of the score as a linear function using the random Fourier feature (RFF) approximation, we show that all learning problems can be solved in closed form. This includes the basic and noise conditional versions of the Fisher divergence, as well as a novel alternative to noise conditional FD models based on variational inference (VI). Here, we propose using an ELBO-like optimization of the approximate posterior with which we derive a Fisher variational predictive distribution. The RFF representation of the GP, which is functionally equivalent to a single layer neural network score model with cosine activation, provides a unique linear form for which all expectations are in closed form. The Gaussian base also helps with tractability of the VI approximation. We demonstrate our three learning algorithms, as well as a MAP baseline algorithm, on several low dimensional density estimation problems. The closed-form nature of the learning problem removes the reliance on iterative algorithms, making this technique particularly well-suited to large data sets.

Via

Access Paper or Ask Questions

CSF: Fixed-outline Floorplanning Based on the Conjugate Subgradient Algorithm Assisted by Q-Learning

Apr 04, 2025

Huabin Cheng, Rujie Chen, Yu Chen, Wei Zhang, Ning Xu

Abstract:To perform the fixed-outline floorplanning problem efficiently, we propose to solve the original nonsmooth analytic optimization model via the conjugate subgradient algorithm (CSA), which is further accelerated by adaptively regulating the step size with the assistance of Q-learning. The objective for global floorplanning is a weighted sum of the half-perimeter wirelength, the overlapping area and the out-of-bound width, and the legalization is implemented by optimizing the weighted sum of the overlapping area and the out-of-bound width. Meanwhile, we also propose two improved variants for the legalizaiton algorithm based on constraint graphs (CGs). Experimental results demonstrate that the CSA assisted by Q-learning (CSAQ) can address both global floorplanning and legalization efficiently, and the two stages jointly contribute to competitive results on the optimization of wirelength. Meanwhile, the improved CG-based legalization methods also outperforms the original one in terms of runtime and success rate.

Via

Access Paper or Ask Questions

Dexterous Manipulation through Imitation Learning: A Survey

Apr 04, 2025

Shan An, Ziyu Meng, Chao Tang, Yuning Zhou, Tengyu Liu, Fangqiang Ding, Shufang Zhang, Yao Mu, Ran Song, Wei Zhang(+2 more)

Abstract:Dexterous manipulation, which refers to the ability of a robotic hand or multi-fingered end-effector to skillfully control, reorient, and manipulate objects through precise, coordinated finger movements and adaptive force modulation, enables complex interactions similar to human hand dexterity. With recent advances in robotics and machine learning, there is a growing demand for these systems to operate in complex and unstructured environments. Traditional model-based approaches struggle to generalize across tasks and object variations due to the high-dimensionality and complex contact dynamics of dexterous manipulation. Although model-free methods such as reinforcement learning (RL) show promise, they require extensive training, large-scale interaction data, and carefully designed rewards for stability and effectiveness. Imitation learning (IL) offers an alternative by allowing robots to acquire dexterous manipulation skills directly from expert demonstrations, capturing fine-grained coordination and contact dynamics while bypassing the need for explicit modeling and large-scale trial-and-error. This survey provides an overview of dexterous manipulation methods based on imitation learning (IL), details recent advances, and addresses key challenges in the field. Additionally, it explores potential research directions to enhance IL-driven dexterous manipulation. Our goal is to offer researchers and practitioners a comprehensive introduction to this rapidly evolving domain.

* 22pages, 5 figures

Via

Access Paper or Ask Questions

ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

Apr 03, 2025

Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao(+1 more)

Figure 1 for ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

Figure 2 for ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

Figure 3 for ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

Figure 4 for ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

Abstract:We present ILLUME+ that leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models have struggled to simultaneously handle the three fundamental capabilities in a unified model: understanding, generation, and editing. Models like Chameleon and EMU3 utilize VQGAN for image discretization, due to the lack of deep semantic interaction, they lag behind specialist models like LLaVA in visual understanding tasks. To mitigate this, LaViT and ILLUME employ semantic encoders for tokenization, but they struggle with image editing due to poor texture preservation. Meanwhile, Janus series decouples the input and output image representation, limiting their abilities to seamlessly handle interleaved image-text understanding and generation. In contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. Additionally, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified MLLM and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. This design allows for flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs and specialized models across multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Project Page: https://illume-unified-mllm.github.io/.

Via

Access Paper or Ask Questions

Reconfigurable Codebook-Based Beamforming for RDARS-Aided mmWave MU-MIMO Systems

Apr 02, 2025

Chengwang Ji, Qing Xue, Haiquan Lu, Jintao Wang, Qiaoyan Peng, Shaodan Ma, Wei Zhang

Figure 1 for Reconfigurable Codebook-Based Beamforming for RDARS-Aided mmWave MU-MIMO Systems

Figure 2 for Reconfigurable Codebook-Based Beamforming for RDARS-Aided mmWave MU-MIMO Systems

Figure 3 for Reconfigurable Codebook-Based Beamforming for RDARS-Aided mmWave MU-MIMO Systems

Figure 4 for Reconfigurable Codebook-Based Beamforming for RDARS-Aided mmWave MU-MIMO Systems

Abstract:Reconfigurable distributed antenna and reflecting surface (RDARS) is a new architecture for the sixth-generation (6G) millimeter wave (mmWave) communications. In RDARS-aided mmWave systems, the active and passive beamforming design and working mode configuration for reconfigurable elements are crucial for system performance. In this paper, we aim to maximize the weighted sum rate (WSR) in the RDARS-aided mmWave system. To take advantage of RDARS, we first design a reconfigurable codebook (RCB) in which the number and dimension of the codeword can be flexibly adjusted. Then, a low overhead beam training scheme based on hierarchical search is proposed. Accordingly, the active and passive beamforming for data transmission is designed to achieve the maximum WSR for both space-division multiple access (SDMA) and time-division multiple access (TDMA) schemes. For the TDMA scheme, the optimal number of RDARS transmit elements and the allocated power budget for WSR maximization are derived in closed form. Besides, the superiority of the RDARS is verified and the conditions under which RDARS outperforms RIS and DAS are given. For the SDMA scheme, we characterize the relationship between the number of RDARS connected elements and the user distribution, followed by the derivation of the optimal placement positions of the RDARS transmit elements. High-quality beamforming design solutions are derived to minimize the inter-user interference (IUI) at the base station and RDARS side respectively, which nearly leads to the maximal WSR. Finally, simulation results confirm our theoretical findings and the superiority of the proposed schemes.

Via

Access Paper or Ask Questions

Hierarchical Attention Networks for Lossless Point Cloud Attribute Compression

Apr 01, 2025

Yueru Chen, Wei Zhang, Dingquan Li, Jing Wang, Ge Li

Abstract:In this paper, we propose a deep hierarchical attention context model for lossless attribute compression of point clouds, leveraging a multi-resolution spatial structure and residual learning. A simple and effective Level of Detail (LoD) structure is introduced to yield a coarse-to-fine representation. To enhance efficiency, points within the same refinement level are encoded in parallel, sharing a common context point group. By hierarchically aggregating information from neighboring points, our attention model learns contextual dependencies across varying scales and densities, enabling comprehensive feature extraction. We also adopt normalization for position coordinates and attributes to achieve scale-invariant compression. Additionally, we segment the point cloud into multiple slices to facilitate parallel processing, further optimizing time complexity. Experimental results demonstrate that the proposed method offers better coding performance than the latest G-PCC for color and reflectance attributes while maintaining more efficient encoding and decoding runtimes.

* Accepted by DCC 2025

Via

Access Paper or Ask Questions

MAER-Nav: Bidirectional Motion Learning Through Mirror-Augmented Experience Replay for Robot Navigation

Mar 31, 2025

Shanze Wang, Mingao Tan, Zhibo Yang, Biao Huang, Xiaoyu Shen, Hailong Huang, Wei Zhang

Figure 1 for MAER-Nav: Bidirectional Motion Learning Through Mirror-Augmented Experience Replay for Robot Navigation

Figure 2 for MAER-Nav: Bidirectional Motion Learning Through Mirror-Augmented Experience Replay for Robot Navigation

Figure 3 for MAER-Nav: Bidirectional Motion Learning Through Mirror-Augmented Experience Replay for Robot Navigation

Figure 4 for MAER-Nav: Bidirectional Motion Learning Through Mirror-Augmented Experience Replay for Robot Navigation

Abstract:Deep Reinforcement Learning (DRL) based navigation methods have demonstrated promising results for mobile robots, but suffer from limited action flexibility in confined spaces. Conventional DRL approaches predominantly learn forward-motion policies, causing robots to become trapped in complex environments where backward maneuvers are necessary for recovery. This paper presents MAER-Nav (Mirror-Augmented Experience Replay for Robot Navigation), a novel framework that enables bidirectional motion learning without requiring explicit failure-driven hindsight experience replay or reward function modifications. Our approach integrates a mirror-augmented experience replay mechanism with curriculum learning to generate synthetic backward navigation experiences from successful trajectories. Experimental results in both simulation and real-world environments demonstrate that MAER-Nav significantly outperforms state-of-the-art methods while maintaining strong forward navigation capabilities. The framework effectively bridges the gap between the comprehensive action space utilization of traditional planning methods and the environmental adaptability of learning-based approaches, enabling robust navigation in scenarios where conventional DRL methods consistently fail.

* 8 pages, 8 figures

Via

Access Paper or Ask Questions