Alibaba Group
Abstract:We present three Fisher divergence (FD) minimization algorithms for learning Gaussian process (GP) based score models for lower dimensional density estimation problems. The density is formed by multiplying a base multivariate normal distribution with an exponentiated GP refinement, and so we refer to it as a GP-tilted nonparametric density. By representing the GP part of the score as a linear function using the random Fourier feature (RFF) approximation, we show that all learning problems can be solved in closed form. This includes the basic and noise conditional versions of the Fisher divergence, as well as a novel alternative to noise conditional FD models based on variational inference (VI). Here, we propose using an ELBO-like optimization of the approximate posterior with which we derive a Fisher variational predictive distribution. The RFF representation of the GP, which is functionally equivalent to a single layer neural network score model with cosine activation, provides a unique linear form for which all expectations are in closed form. The Gaussian base also helps with tractability of the VI approximation. We demonstrate our three learning algorithms, as well as a MAP baseline algorithm, on several low dimensional density estimation problems. The closed-form nature of the learning problem removes the reliance on iterative algorithms, making this technique particularly well-suited to large data sets.
Abstract:To perform the fixed-outline floorplanning problem efficiently, we propose to solve the original nonsmooth analytic optimization model via the conjugate subgradient algorithm (CSA), which is further accelerated by adaptively regulating the step size with the assistance of Q-learning. The objective for global floorplanning is a weighted sum of the half-perimeter wirelength, the overlapping area and the out-of-bound width, and the legalization is implemented by optimizing the weighted sum of the overlapping area and the out-of-bound width. Meanwhile, we also propose two improved variants for the legalizaiton algorithm based on constraint graphs (CGs). Experimental results demonstrate that the CSA assisted by Q-learning (CSAQ) can address both global floorplanning and legalization efficiently, and the two stages jointly contribute to competitive results on the optimization of wirelength. Meanwhile, the improved CG-based legalization methods also outperforms the original one in terms of runtime and success rate.
Abstract:Dexterous manipulation, which refers to the ability of a robotic hand or multi-fingered end-effector to skillfully control, reorient, and manipulate objects through precise, coordinated finger movements and adaptive force modulation, enables complex interactions similar to human hand dexterity. With recent advances in robotics and machine learning, there is a growing demand for these systems to operate in complex and unstructured environments. Traditional model-based approaches struggle to generalize across tasks and object variations due to the high-dimensionality and complex contact dynamics of dexterous manipulation. Although model-free methods such as reinforcement learning (RL) show promise, they require extensive training, large-scale interaction data, and carefully designed rewards for stability and effectiveness. Imitation learning (IL) offers an alternative by allowing robots to acquire dexterous manipulation skills directly from expert demonstrations, capturing fine-grained coordination and contact dynamics while bypassing the need for explicit modeling and large-scale trial-and-error. This survey provides an overview of dexterous manipulation methods based on imitation learning (IL), details recent advances, and addresses key challenges in the field. Additionally, it explores potential research directions to enhance IL-driven dexterous manipulation. Our goal is to offer researchers and practitioners a comprehensive introduction to this rapidly evolving domain.




Abstract:We present ILLUME+ that leverages dual visual tokenization and a diffusion decoder to improve both deep semantic understanding and high-fidelity image generation. Existing unified models have struggled to simultaneously handle the three fundamental capabilities in a unified model: understanding, generation, and editing. Models like Chameleon and EMU3 utilize VQGAN for image discretization, due to the lack of deep semantic interaction, they lag behind specialist models like LLaVA in visual understanding tasks. To mitigate this, LaViT and ILLUME employ semantic encoders for tokenization, but they struggle with image editing due to poor texture preservation. Meanwhile, Janus series decouples the input and output image representation, limiting their abilities to seamlessly handle interleaved image-text understanding and generation. In contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which preserves both fine-grained textures and text-aligned semantics while enabling a coarse-to-fine image representation strategy for multimodal understanding and generation. Additionally, we employ a diffusion model as the image detokenizer for enhanced generation quality and efficient super-resolution. ILLUME+ follows a continuous-input, discrete-output scheme within the unified MLLM and adopts a progressive training procedure that supports dynamic resolution across the vision tokenizer, MLLM, and diffusion decoder. This design allows for flexible and efficient context-aware image editing and generation across diverse tasks. ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs and specialized models across multimodal understanding, generation, and editing benchmarks. With its strong performance, ILLUME+ provides a scalable and versatile foundation for future multimodal applications. Project Page: https://illume-unified-mllm.github.io/.




Abstract:Reconfigurable distributed antenna and reflecting surface (RDARS) is a new architecture for the sixth-generation (6G) millimeter wave (mmWave) communications. In RDARS-aided mmWave systems, the active and passive beamforming design and working mode configuration for reconfigurable elements are crucial for system performance. In this paper, we aim to maximize the weighted sum rate (WSR) in the RDARS-aided mmWave system. To take advantage of RDARS, we first design a reconfigurable codebook (RCB) in which the number and dimension of the codeword can be flexibly adjusted. Then, a low overhead beam training scheme based on hierarchical search is proposed. Accordingly, the active and passive beamforming for data transmission is designed to achieve the maximum WSR for both space-division multiple access (SDMA) and time-division multiple access (TDMA) schemes. For the TDMA scheme, the optimal number of RDARS transmit elements and the allocated power budget for WSR maximization are derived in closed form. Besides, the superiority of the RDARS is verified and the conditions under which RDARS outperforms RIS and DAS are given. For the SDMA scheme, we characterize the relationship between the number of RDARS connected elements and the user distribution, followed by the derivation of the optimal placement positions of the RDARS transmit elements. High-quality beamforming design solutions are derived to minimize the inter-user interference (IUI) at the base station and RDARS side respectively, which nearly leads to the maximal WSR. Finally, simulation results confirm our theoretical findings and the superiority of the proposed schemes.
Abstract:In this paper, we propose a deep hierarchical attention context model for lossless attribute compression of point clouds, leveraging a multi-resolution spatial structure and residual learning. A simple and effective Level of Detail (LoD) structure is introduced to yield a coarse-to-fine representation. To enhance efficiency, points within the same refinement level are encoded in parallel, sharing a common context point group. By hierarchically aggregating information from neighboring points, our attention model learns contextual dependencies across varying scales and densities, enabling comprehensive feature extraction. We also adopt normalization for position coordinates and attributes to achieve scale-invariant compression. Additionally, we segment the point cloud into multiple slices to facilitate parallel processing, further optimizing time complexity. Experimental results demonstrate that the proposed method offers better coding performance than the latest G-PCC for color and reflectance attributes while maintaining more efficient encoding and decoding runtimes.




Abstract:This study systematically analyzes geopolitical bias across 11 prominent Large Language Models (LLMs) by examining their responses to seven critical topics in U.S.-China relations. Utilizing a bilingual (English and Chinese) and dual-framing (affirmative and reverse) methodology, we generated 19,712 prompts designed to detect ideological leanings in model outputs. Responses were quantitatively assessed on a normalized scale from -2 (strongly Pro-China) to +2 (strongly Pro-U.S.) and categorized according to stance, neutrality, and refusal rates. The findings demonstrate significant and consistent ideological alignments correlated with the LLMs' geographic origins; U.S.-based models predominantly favored Pro-U.S. stances, while Chinese-origin models exhibited pronounced Pro-China biases. Notably, language and prompt framing substantially influenced model responses, with several LLMs exhibiting stance reversals based on prompt polarity or linguistic context. Additionally, we introduced comprehensive metrics to evaluate response consistency across languages and framing conditions, identifying variability and vulnerabilities in model behaviors. These results offer practical insights that can guide organizations and individuals in selecting LLMs best aligned with their operational priorities and geopolitical considerations, underscoring the importance of careful model evaluation in politically sensitive applications. Furthermore, the research highlights specific prompt structures and linguistic variations that can strategically trigger distinct responses from models, revealing methods for effectively navigating and influencing LLM outputs.




Abstract:Deep Reinforcement Learning (DRL) based navigation methods have demonstrated promising results for mobile robots, but suffer from limited action flexibility in confined spaces. Conventional DRL approaches predominantly learn forward-motion policies, causing robots to become trapped in complex environments where backward maneuvers are necessary for recovery. This paper presents MAER-Nav (Mirror-Augmented Experience Replay for Robot Navigation), a novel framework that enables bidirectional motion learning without requiring explicit failure-driven hindsight experience replay or reward function modifications. Our approach integrates a mirror-augmented experience replay mechanism with curriculum learning to generate synthetic backward navigation experiences from successful trajectories. Experimental results in both simulation and real-world environments demonstrate that MAER-Nav significantly outperforms state-of-the-art methods while maintaining strong forward navigation capabilities. The framework effectively bridges the gap between the comprehensive action space utilization of traditional planning methods and the environmental adaptability of learning-based approaches, enabling robust navigation in scenarios where conventional DRL methods consistently fail.
Abstract:The primary requirement for cross-modal data fusion is the precise alignment of data from different sensors. However, the calibration between LiDAR point clouds and camera images is typically time-consuming and needs external calibration board or specific environmental features. Cross-modal registration effectively solves this problem by aligning the data directly without requiring external calibration. However, due to the domain gap between the point cloud and the image, existing methods rarely achieve satisfactory registration accuracy while maintaining real-time performance. To address this issue, we propose a framework that projects point clouds into several 2D representations for matching with camera images, which not only leverages the geometric characteristic of LiDAR point clouds more effectively but also bridge the domain gap between the point cloud and image. Moreover, to tackle the challenges of cross modal differences and the limited overlap between LiDAR point clouds and images in the image matching task, we introduce a multi-scale feature extraction network to effectively extract features from both camera images and the projection maps of LiDAR point cloud. Additionally, we propose a patch-to-pixel matching network to provide more effective supervision and achieve higher accuracy. We validate the performance of our model through experiments on the KITTI and nuScenes datasets. Our network achieves real-time performance and extremely high registration accuracy. On the KITTI dataset, our model achieves a registration accuracy rate of over 99\%.




Abstract:Humanoid robots are anticipated to acquire a wide range of locomotion capabilities while ensuring natural movement across varying speeds and terrains. Existing methods encounter a fundamental dilemma in learning humanoid locomotion: reinforcement learning with handcrafted rewards can achieve agile locomotion but produces unnatural gaits, while Generative Adversarial Imitation Learning (GAIL) with motion capture data yields natural movements but suffers from unstable training processes and restricted agility. Integrating these approaches proves challenging due to the inherent heterogeneity between expert policies and human motion datasets. To address this, we introduce StyleLoco, a novel two-stage framework that bridges this gap through a Generative Adversarial Distillation (GAD) process. Our framework begins by training a teacher policy using reinforcement learning to achieve agile and dynamic locomotion. It then employs a multi-discriminator architecture, where distinct discriminators concurrently extract skills from both the teacher policy and motion capture data. This approach effectively combines the agility of reinforcement learning with the natural fluidity of human-like movements while mitigating the instability issues commonly associated with adversarial training. Through extensive simulation and real-world experiments, we demonstrate that StyleLoco enables humanoid robots to perform diverse locomotion tasks with the precision of expertly trained policies and the natural aesthetics of human motion, successfully transferring styles across different movement types while maintaining stable locomotion across a broad spectrum of command inputs.