Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wengang Zhou

Uni-Sign: Toward Unified Sign Language Understanding at Scale

Jan 25, 2025

Zecheng Li, Wengang Zhou, Weichao Zhao, Kepeng Wu, Hezhen Hu, Houqiang Li

Figure 1 for Uni-Sign: Toward Unified Sign Language Understanding at Scale

Figure 2 for Uni-Sign: Toward Unified Sign Language Understanding at Scale

Figure 3 for Uni-Sign: Toward Unified Sign Language Understanding at Scale

Figure 4 for Uni-Sign: Toward Unified Sign Language Understanding at Scale

Abstract:Sign language pre-training has gained increasing attention for its ability to enhance performance across various sign language understanding (SLU) tasks. However, existing methods often suffer from a gap between pre-training and fine-tuning, leading to suboptimal results. To address this, we propose \modelname, a unified pre-training framework that eliminates the gap between pre-training and downstream SLU tasks through a large-scale generative pre-training strategy and a novel fine-tuning paradigm. First, we introduce CSL-News, a large-scale Chinese Sign Language (CSL) dataset containing 1,985 hours of video paired with textual annotations, which enables effective large-scale pre-training. Second, \modelname unifies SLU tasks by treating downstream tasks as a single sign language translation (SLT) task during fine-tuning, ensuring seamless knowledge transfer between pre-training and fine-tuning. Furthermore, we incorporate a prior-guided fusion (PGF) module and a score-aware sampling strategy to efficiently fuse pose and RGB information, addressing keypoint inaccuracies and improving computational efficiency. Extensive experiments across multiple SLU benchmarks demonstrate that \modelname achieves state-of-the-art performance across multiple downstream SLU tasks. Dataset and code are available at \url{github.com/ZechengLi19/Uni-Sign}.

* Accepted by ICLR 2025

Via

Access Paper or Ask Questions

SmartEraser: Remove Anything from Images using Masked-Region Guidance

Jan 14, 2025

Longtao Jiang, Zhendong Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Lei Shi, Dong Chen, Houqiang Li

Figure 1 for SmartEraser: Remove Anything from Images using Masked-Region Guidance

Figure 2 for SmartEraser: Remove Anything from Images using Masked-Region Guidance

Figure 3 for SmartEraser: Remove Anything from Images using Masked-Region Guidance

Figure 4 for SmartEraser: Remove Anything from Images using Masked-Region Guidance

Abstract:Object removal has so far been dominated by the mask-and-inpaint paradigm, where the masked region is excluded from the input, leaving models relying on unmasked areas to inpaint the missing region. However, this approach lacks contextual information for the masked area, often resulting in unstable performance. In this work, we introduce SmartEraser, built with a new removing paradigm called Masked-Region Guidance. This paradigm retains the masked region in the input, using it as guidance for the removal process. It offers several distinct advantages: (a) it guides the model to accurately identify the object to be removed, preventing its regeneration in the output; (b) since the user mask often extends beyond the object itself, it aids in preserving the surrounding context in the final result. Leveraging this new paradigm, we present Syn4Removal, a large-scale object removal dataset, where instance segmentation data is used to copy and paste objects onto images as removal targets, with the original images serving as ground truths. Experimental results demonstrate that SmartEraser significantly outperforms existing methods, achieving superior performance in object removal, especially in complex scenes with intricate compositions.

* Project at: https://longtaojiang.github.io/smarteraser.github.io/

Via

Access Paper or Ask Questions

Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Dec 19, 2024

Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li

Figure 1 for Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Figure 2 for Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Figure 3 for Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Figure 4 for Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Abstract:Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce the Multi-Level Optimal Transport (MultiLevelOT), a novel approach that advances the optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently capture complex distribution structures of logits via the Sinkhorn distance, which approximates the Wasserstein distance for divergence measures. Extensive experiments on tasks such as extractive QA, generative QA, and summarization demonstrate that the MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under various settings. Our approach is robust to different student and teacher models across model families, architectures, and parameter sizes.

* Accepted by AAAI 2025

Via

Access Paper or Ask Questions

RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement

Dec 17, 2024

Junjie Lin, Jian Zhao, Lin Liu, Yue Deng, Youpeng Zhao, Lanxiao Huang, Xia Lin, Wengang Zhou, Houqiang Li

Figure 1 for RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement

Figure 2 for RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement

Figure 3 for RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement

Figure 4 for RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement

Abstract:Traditionally, AI development for two-player zero-sum games has relied on two primary techniques: decision trees and reinforcement learning (RL). A common approach involves using a fixed decision tree as one player's strategy while training an RL agent as the opponent to identify vulnerabilities in the decision tree, thereby improving its strategic strength iteratively. However, this process often requires significant human intervention to refine the decision tree after identifying its weaknesses, resulting in inefficiencies and hindering full automation of the strategy enhancement process. Fortunately, the advent of Large Language Models (LLMs) offers a transformative opportunity to automate the process. We propose RL-LLM-DT, an automatic decision tree generation method based on RL Evaluation and LLM Enhancement. Given an initial decision tree, the method involves two important iterative steps. Response Policy Search: RL is used to discover counter-strategies targeting the decision tree. Policy Improvement: LLMs analyze failure scenarios and generate improved decision tree code. In our method, RL focuses on finding the decision tree's flaws while LLM is prompted to generate an improved version of the decision tree. The iterative refinement process terminates when RL can't find any flaw of the tree or LLM fails to improve the tree. To evaluate the effectiveness of this integrated approach, we conducted experiments in a curling game. After iterative refinements, our curling AI based on the decision tree ranks first on the Jidi platform among 34 curling AIs in total, which demonstrates that LLMs can significantly enhance the robustness and adaptability of decision trees, representing a substantial advancement in the field of Game AI. Our code is available at https://github.com/Linjunjie99/RL-LLM-DT.

* Length:10 pages. Figures:10 figures. Additional Notes:In this paper, we have introduced a novel hybrid approach which leverages the strengths of both RL and LLMs to itera- tively refine decision tree tactics, enhancing their performance and adaptability

Via

Access Paper or Ask Questions

Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Nov 27, 2024

Zhiyang Guo, Jinxu Xiang, Kai Ma, Wengang Zhou, Houqiang Li, Ran Zhang

Figure 1 for Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Figure 2 for Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Figure 3 for Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Figure 4 for Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Abstract:3D characters are essential to modern creative industries, but making them animatable often demands extensive manual work in tasks like rigging and skinning. Existing automatic rigging tools face several limitations, including the necessity for manual annotations, rigid skeleton topologies, and limited generalization across diverse shapes and poses. An alternative approach is to generate animatable avatars pre-bound to a rigged template mesh. However, this method often lacks flexibility and is typically limited to realistic human shapes. To address these issues, we present Make-It-Animatable, a novel data-driven method to make any 3D humanoid model ready for character animation in less than one second, regardless of its shapes and poses. Our unified framework generates high-quality blend weights, bones, and pose transformations. By incorporating a particle-based shape autoencoder, our approach supports various 3D representations, including meshes and 3D Gaussian splats. Additionally, we employ a coarse-to-fine representation and a structure-aware modeling strategy to ensure both accuracy and robustness, even for characters with non-standard skeleton structures. We conducted extensive experiments to validate our framework's effectiveness. Compared to existing methods, our approach demonstrates significant improvements in both quality and speed.

* Project Page: https://jasongzy.github.io/Make-It-Animatable/

Via

Access Paper or Ask Questions

ROOT: VLM based System for Indoor Scene Understanding and Beyond

Nov 24, 2024

Yonghui Wang, Shi-Yong Chen, Zhenxing Zhou, Siyi Li, Haoran Li, Wengang Zhou, Houqiang Li

Figure 1 for ROOT: VLM based System for Indoor Scene Understanding and Beyond

Figure 2 for ROOT: VLM based System for Indoor Scene Understanding and Beyond

Figure 3 for ROOT: VLM based System for Indoor Scene Understanding and Beyond

Figure 4 for ROOT: VLM based System for Indoor Scene Understanding and Beyond

Abstract:Recently, Vision Language Models (VLMs) have experienced significant advancements, yet these models still face challenges in spatial hierarchical reasoning within indoor scenes. In this study, we introduce ROOT, a VLM-based system designed to enhance the analysis of indoor scenes. Specifically, we first develop an iterative object perception algorithm using GPT-4V to detect object entities within indoor scenes. This is followed by employing vision foundation models to acquire additional meta-information about the scene, such as bounding boxes. Building on this foundational data, we propose a specialized VLM, SceneVLM, which is capable of generating spatial hierarchical scene graphs and providing distance information for objects within indoor environments. This information enhances our understanding of the spatial arrangement of indoor scenes. To train our SceneVLM, we collect over 610,000 images from various public indoor datasets and implement a scene data generation pipeline with a semi-automated technique to establish relationships and estimate distances among indoor objects. By utilizing this enriched data, we conduct various training recipes and finish SceneVLM. Our experiments demonstrate that \rootname facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI. The code will be released at \url{https://github.com/harrytea/ROOT}.

Via

Access Paper or Ask Questions

BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language?

Nov 19, 2024

Zongmeng Zhang, Jinhua Zhu, Wengang Zhou, Xiang Qi, Peng Zhang, Houqiang Li

Figure 1 for BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language?

Figure 2 for BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language?

Figure 3 for BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language?

Figure 4 for BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language?

Abstract:Dense retrieval, which aims to encode the semantic information of arbitrary text into dense vector representations or embeddings, has emerged as an effective and efficient paradigm for text retrieval, consequently becoming an essential component in various natural language processing systems. These systems typically focus on optimizing the embedding space by attending to the relevance of text pairs, while overlooking the Boolean logic inherent in language, which may not be captured by current training objectives. In this work, we first investigate whether current retrieval systems can comprehend the Boolean logic implied in language. To answer this question, we formulate the task of Boolean Dense Retrieval and collect a benchmark dataset, BoolQuestions, which covers complex queries containing basic Boolean logic and corresponding annotated passages. Through extensive experimental results on the proposed task and benchmark dataset, we draw the conclusion that current dense retrieval systems do not fully understand Boolean logic in language, and there is a long way to go to improve our dense retrieval systems. Furthermore, to promote further research on enhancing the understanding of Boolean logic for language models, we explore Boolean operation on decomposed query and propose a contrastive continual training method that serves as a strong baseline for the research community.

* In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2767-2779
* Findings of the Association for Computational Linguistics: EMNLP 2024

Via

Access Paper or Ask Questions

Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

Oct 22, 2024

Zongmeng Zhang, Yufeng Shi, Jinhua Zhu, Wengang Zhou, Xiang Qi, Peng Zhang, Houqiang Li

Figure 1 for Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

Figure 2 for Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

Figure 3 for Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

Figure 4 for Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

Abstract:Trustworthiness is an essential prerequisite for the real-world application of large language models. In this paper, we focus on the trustworthiness of language models with respect to retrieval augmentation. Despite being supported with external evidence, retrieval-augmented generation still suffers from hallucinations, one primary cause of which is the conflict between contextual and parametric knowledge. We deem that retrieval-augmented language models have the inherent capabilities of supplying response according to both contextual and parametric knowledge. Inspired by aligning language models with human preference, we take the first step towards aligning retrieval-augmented language models to a status where it responds relying merely on the external evidence and disregards the interference of parametric knowledge. Specifically, we propose a reinforcement learning based algorithm Trustworthy-Alignment, theoretically and experimentally demonstrating large language models' capability of reaching a trustworthy status without explicit supervision on how to respond. Our work highlights the potential of large language models on exploring its intrinsic abilities by its own and expands the application scenarios of alignment from fulfilling human preference to creating trustworthy agents.

* Proceedings of the 41st International Conference on Machine Learning, PMLR 235:59827-59850, 2024
* ICML 2024

Via

Access Paper or Ask Questions

MotionRL: Align Text-to-Motion Generation to Human Preferences with Multi-Reward Reinforcement Learning

Oct 09, 2024

Xiaoyang Liu, Yunyao Mao, Wengang Zhou, Houqiang Li

Figure 1 for MotionRL: Align Text-to-Motion Generation to Human Preferences with Multi-Reward Reinforcement Learning

Figure 2 for MotionRL: Align Text-to-Motion Generation to Human Preferences with Multi-Reward Reinforcement Learning

Figure 3 for MotionRL: Align Text-to-Motion Generation to Human Preferences with Multi-Reward Reinforcement Learning

Figure 4 for MotionRL: Align Text-to-Motion Generation to Human Preferences with Multi-Reward Reinforcement Learning

Abstract:We introduce MotionRL, the first approach to utilize Multi-Reward Reinforcement Learning (RL) for optimizing text-to-motion generation tasks and aligning them with human preferences. Previous works focused on improving numerical performance metrics on the given datasets, often neglecting the variability and subjectivity of human feedback. In contrast, our novel approach uses reinforcement learning to fine-tune the motion generator based on human preferences prior knowledge of the human perception model, allowing it to generate motions that better align human preferences. In addition, MotionRL introduces a novel multi-objective optimization strategy to approximate Pareto optimality between text adherence, motion quality, and human preferences. Extensive experiments and user studies demonstrate that MotionRL not only allows control over the generated results across different objectives but also significantly enhances performance across these metrics compared to other algorithms.

Via

Access Paper or Ask Questions

StreetSurfGS: Scalable Urban Street Surface Reconstruction with Planar-based Gaussian Splatting

Oct 06, 2024

Xiao Cui, Weicai Ye, Yifan Wang, Guofeng Zhang, Wengang Zhou, Tong He, Houqiang Li

Figure 1 for StreetSurfGS: Scalable Urban Street Surface Reconstruction with Planar-based Gaussian Splatting

Figure 2 for StreetSurfGS: Scalable Urban Street Surface Reconstruction with Planar-based Gaussian Splatting

Figure 3 for StreetSurfGS: Scalable Urban Street Surface Reconstruction with Planar-based Gaussian Splatting

Figure 4 for StreetSurfGS: Scalable Urban Street Surface Reconstruction with Planar-based Gaussian Splatting

Abstract:Reconstructing urban street scenes is crucial due to its vital role in applications such as autonomous driving and urban planning. These scenes are characterized by long and narrow camera trajectories, occlusion, complex object relationships, and data sparsity across multiple scales. Despite recent advancements, existing surface reconstruction methods, which are primarily designed for object-centric scenarios, struggle to adapt effectively to the unique characteristics of street scenes. To address this challenge, we introduce StreetSurfGS, the first method to employ Gaussian Splatting specifically tailored for scalable urban street scene surface reconstruction. StreetSurfGS utilizes a planar-based octree representation and segmented training to reduce memory costs, accommodate unique camera characteristics, and ensure scalability. Additionally, to mitigate depth inaccuracies caused by object overlap, we propose a guided smoothing strategy within regularization to eliminate inaccurate boundary points and outliers. Furthermore, to address sparse views and multi-scale challenges, we use a dual-step matching strategy that leverages adjacent and long-term information. Extensive experiments validate the efficacy of StreetSurfGS in both novel view synthesis and surface reconstruction.

Via

Access Paper or Ask Questions