Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Houqiang Li

Uni-Sign: Toward Unified Sign Language Understanding at Scale

Jan 25, 2025

Zecheng Li, Wengang Zhou, Weichao Zhao, Kepeng Wu, Hezhen Hu, Houqiang Li

Figure 1 for Uni-Sign: Toward Unified Sign Language Understanding at Scale

Figure 2 for Uni-Sign: Toward Unified Sign Language Understanding at Scale

Figure 3 for Uni-Sign: Toward Unified Sign Language Understanding at Scale

Figure 4 for Uni-Sign: Toward Unified Sign Language Understanding at Scale

Abstract:Sign language pre-training has gained increasing attention for its ability to enhance performance across various sign language understanding (SLU) tasks. However, existing methods often suffer from a gap between pre-training and fine-tuning, leading to suboptimal results. To address this, we propose \modelname, a unified pre-training framework that eliminates the gap between pre-training and downstream SLU tasks through a large-scale generative pre-training strategy and a novel fine-tuning paradigm. First, we introduce CSL-News, a large-scale Chinese Sign Language (CSL) dataset containing 1,985 hours of video paired with textual annotations, which enables effective large-scale pre-training. Second, \modelname unifies SLU tasks by treating downstream tasks as a single sign language translation (SLT) task during fine-tuning, ensuring seamless knowledge transfer between pre-training and fine-tuning. Furthermore, we incorporate a prior-guided fusion (PGF) module and a score-aware sampling strategy to efficiently fuse pose and RGB information, addressing keypoint inaccuracies and improving computational efficiency. Extensive experiments across multiple SLU benchmarks demonstrate that \modelname achieves state-of-the-art performance across multiple downstream SLU tasks. Dataset and code are available at \url{github.com/ZechengLi19/Uni-Sign}.

* Accepted by ICLR 2025

Via

Access Paper or Ask Questions

SmartEraser: Remove Anything from Images using Masked-Region Guidance

Jan 14, 2025

Longtao Jiang, Zhendong Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Lei Shi, Dong Chen, Houqiang Li

Figure 1 for SmartEraser: Remove Anything from Images using Masked-Region Guidance

Figure 2 for SmartEraser: Remove Anything from Images using Masked-Region Guidance

Figure 3 for SmartEraser: Remove Anything from Images using Masked-Region Guidance

Figure 4 for SmartEraser: Remove Anything from Images using Masked-Region Guidance

Abstract:Object removal has so far been dominated by the mask-and-inpaint paradigm, where the masked region is excluded from the input, leaving models relying on unmasked areas to inpaint the missing region. However, this approach lacks contextual information for the masked area, often resulting in unstable performance. In this work, we introduce SmartEraser, built with a new removing paradigm called Masked-Region Guidance. This paradigm retains the masked region in the input, using it as guidance for the removal process. It offers several distinct advantages: (a) it guides the model to accurately identify the object to be removed, preventing its regeneration in the output; (b) since the user mask often extends beyond the object itself, it aids in preserving the surrounding context in the final result. Leveraging this new paradigm, we present Syn4Removal, a large-scale object removal dataset, where instance segmentation data is used to copy and paste objects onto images as removal targets, with the original images serving as ground truths. Experimental results demonstrate that SmartEraser significantly outperforms existing methods, achieving superior performance in object removal, especially in complex scenes with intricate compositions.

* Project at: https://longtaojiang.github.io/smarteraser.github.io/

Via

Access Paper or Ask Questions

SceneBooth: Diffusion-based Framework for Subject-preserved Text-to-Image Generation

Jan 07, 2025

Shang Chai, Zihang Lin, Min Zhou, Xubin Li, Liansheng Zhuang, Houqiang Li

Figure 1 for SceneBooth: Diffusion-based Framework for Subject-preserved Text-to-Image Generation

Figure 2 for SceneBooth: Diffusion-based Framework for Subject-preserved Text-to-Image Generation

Figure 3 for SceneBooth: Diffusion-based Framework for Subject-preserved Text-to-Image Generation

Figure 4 for SceneBooth: Diffusion-based Framework for Subject-preserved Text-to-Image Generation

Abstract:Due to the demand for personalizing image generation, subject-driven text-to-image generation method, which creates novel renditions of an input subject based on text prompts, has received growing research interest. Existing methods often learn subject representation and incorporate it into the prompt embedding to guide image generation, but they struggle with preserving subject fidelity. To solve this issue, this paper approaches a novel framework named SceneBooth for subject-preserved text-to-image generation, which consumes inputs of a subject image, object phrases and text prompts. Instead of learning the subject representation and generating a subject, our SceneBooth fixes the given subject image and generates its background image guided by the text prompts. To this end, our SceneBooth introduces two key components, i.e., a multimodal layout generation module and a background painting module. The former determines the position and scale of the subject by generating appropriate scene layouts that align with text captions, object phrases, and subject visual information. The latter integrates two adapters (ControlNet and Gated Self-Attention) into the latent diffusion model to generate a background that harmonizes with the subject guided by scene layouts and text descriptions. In this manner, our SceneBooth ensures accurate preservation of the subject's appearance in the output. Quantitative and qualitative experimental results demonstrate that SceneBooth significantly outperforms baseline methods in terms of subject preservation, image harmonization and overall quality.

Via

Access Paper or Ask Questions

Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Dec 19, 2024

Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li

Figure 1 for Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Figure 2 for Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Figure 3 for Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Figure 4 for Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Abstract:Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce the Multi-Level Optimal Transport (MultiLevelOT), a novel approach that advances the optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently capture complex distribution structures of logits via the Sinkhorn distance, which approximates the Wasserstein distance for divergence measures. Extensive experiments on tasks such as extractive QA, generative QA, and summarization demonstrate that the MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under various settings. Our approach is robust to different student and teacher models across model families, architectures, and parameter sizes.

* Accepted by AAAI 2025

Via

Access Paper or Ask Questions

RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement

Dec 17, 2024

Junjie Lin, Jian Zhao, Lin Liu, Yue Deng, Youpeng Zhao, Lanxiao Huang, Xia Lin, Wengang Zhou, Houqiang Li

Figure 1 for RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement

Figure 2 for RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement

Figure 3 for RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement

Figure 4 for RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement

Abstract:Traditionally, AI development for two-player zero-sum games has relied on two primary techniques: decision trees and reinforcement learning (RL). A common approach involves using a fixed decision tree as one player's strategy while training an RL agent as the opponent to identify vulnerabilities in the decision tree, thereby improving its strategic strength iteratively. However, this process often requires significant human intervention to refine the decision tree after identifying its weaknesses, resulting in inefficiencies and hindering full automation of the strategy enhancement process. Fortunately, the advent of Large Language Models (LLMs) offers a transformative opportunity to automate the process. We propose RL-LLM-DT, an automatic decision tree generation method based on RL Evaluation and LLM Enhancement. Given an initial decision tree, the method involves two important iterative steps. Response Policy Search: RL is used to discover counter-strategies targeting the decision tree. Policy Improvement: LLMs analyze failure scenarios and generate improved decision tree code. In our method, RL focuses on finding the decision tree's flaws while LLM is prompted to generate an improved version of the decision tree. The iterative refinement process terminates when RL can't find any flaw of the tree or LLM fails to improve the tree. To evaluate the effectiveness of this integrated approach, we conducted experiments in a curling game. After iterative refinements, our curling AI based on the decision tree ranks first on the Jidi platform among 34 curling AIs in total, which demonstrates that LLMs can significantly enhance the robustness and adaptability of decision trees, representing a substantial advancement in the field of Game AI. Our code is available at https://github.com/Linjunjie99/RL-LLM-DT.

* Length:10 pages. Figures:10 figures. Additional Notes:In this paper, we have introduced a novel hybrid approach which leverages the strengths of both RL and LLMs to itera- tively refine decision tree tactics, enhancing their performance and adaptability

Via

Access Paper or Ask Questions

RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion

Dec 17, 2024

Xiaomeng Chu, Jiajun Deng, Guoliang You, Yifan Duan, Houqiang Li, Yanyong Zhang

Figure 1 for RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion

Figure 2 for RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion

Figure 3 for RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion

Figure 4 for RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion

Abstract:We propose Radar-Camera fusion transformer (RaCFormer) to boost the accuracy of 3D object detection by the following insight. The Radar-Camera fusion in outdoor 3D scene perception is capped by the image-to-BEV transformation--if the depth of pixels is not accurately estimated, the naive combination of BEV features actually integrates unaligned visual content. To avoid this problem, we propose a query-based framework that enables adaptively sample instance-relevant features from both the BEV and the original image view. Furthermore, we enhance system performance by two key designs: optimizing query initialization and strengthening the representational capacity of BEV. For the former, we introduce an adaptive circular distribution in polar coordinates to refine the initialization of object queries, allowing for a distance-based adjustment of query density. For the latter, we initially incorporate a radar-guided depth head to refine the transformation from image view to BEV. Subsequently, we focus on leveraging the Doppler effect of radar and introduce an implicit dynamic catcher to capture the temporal elements within the BEV. Extensive experiments on nuScenes and View-of-Delft (VoD) datasets validate the merits of our design. Remarkably, our method achieves superior results of 64.9% mAP and 70.2% NDS on nuScenes, even outperforming several LiDAR-based detectors. RaCFormer also secures the 1st ranking on the VoD dataset. The code will be released.

Via

Access Paper or Ask Questions

Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Nov 27, 2024

Zhiyang Guo, Jinxu Xiang, Kai Ma, Wengang Zhou, Houqiang Li, Ran Zhang

Figure 1 for Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Figure 2 for Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Figure 3 for Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Figure 4 for Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Abstract:3D characters are essential to modern creative industries, but making them animatable often demands extensive manual work in tasks like rigging and skinning. Existing automatic rigging tools face several limitations, including the necessity for manual annotations, rigid skeleton topologies, and limited generalization across diverse shapes and poses. An alternative approach is to generate animatable avatars pre-bound to a rigged template mesh. However, this method often lacks flexibility and is typically limited to realistic human shapes. To address these issues, we present Make-It-Animatable, a novel data-driven method to make any 3D humanoid model ready for character animation in less than one second, regardless of its shapes and poses. Our unified framework generates high-quality blend weights, bones, and pose transformations. By incorporating a particle-based shape autoencoder, our approach supports various 3D representations, including meshes and 3D Gaussian splats. Additionally, we employ a coarse-to-fine representation and a structure-aware modeling strategy to ensure both accuracy and robustness, even for characters with non-standard skeleton structures. We conducted extensive experiments to validate our framework's effectiveness. Compared to existing methods, our approach demonstrates significant improvements in both quality and speed.

* Project Page: https://jasongzy.github.io/Make-It-Animatable/

Via

Access Paper or Ask Questions

ROOT: VLM based System for Indoor Scene Understanding and Beyond

Nov 24, 2024

Yonghui Wang, Shi-Yong Chen, Zhenxing Zhou, Siyi Li, Haoran Li, Wengang Zhou, Houqiang Li

Figure 1 for ROOT: VLM based System for Indoor Scene Understanding and Beyond

Figure 2 for ROOT: VLM based System for Indoor Scene Understanding and Beyond

Figure 3 for ROOT: VLM based System for Indoor Scene Understanding and Beyond

Figure 4 for ROOT: VLM based System for Indoor Scene Understanding and Beyond

Abstract:Recently, Vision Language Models (VLMs) have experienced significant advancements, yet these models still face challenges in spatial hierarchical reasoning within indoor scenes. In this study, we introduce ROOT, a VLM-based system designed to enhance the analysis of indoor scenes. Specifically, we first develop an iterative object perception algorithm using GPT-4V to detect object entities within indoor scenes. This is followed by employing vision foundation models to acquire additional meta-information about the scene, such as bounding boxes. Building on this foundational data, we propose a specialized VLM, SceneVLM, which is capable of generating spatial hierarchical scene graphs and providing distance information for objects within indoor environments. This information enhances our understanding of the spatial arrangement of indoor scenes. To train our SceneVLM, we collect over 610,000 images from various public indoor datasets and implement a scene data generation pipeline with a semi-automated technique to establish relationships and estimate distances among indoor objects. By utilizing this enriched data, we conduct various training recipes and finish SceneVLM. Our experiments demonstrate that \rootname facilitates indoor scene understanding and proves effective in diverse downstream applications, such as 3D scene generation and embodied AI. The code will be released at \url{https://github.com/harrytea/ROOT}.

Via

Access Paper or Ask Questions

BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language?

Nov 19, 2024

Zongmeng Zhang, Jinhua Zhu, Wengang Zhou, Xiang Qi, Peng Zhang, Houqiang Li

Figure 1 for BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language?

Figure 2 for BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language?

Figure 3 for BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language?

Figure 4 for BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language?

Abstract:Dense retrieval, which aims to encode the semantic information of arbitrary text into dense vector representations or embeddings, has emerged as an effective and efficient paradigm for text retrieval, consequently becoming an essential component in various natural language processing systems. These systems typically focus on optimizing the embedding space by attending to the relevance of text pairs, while overlooking the Boolean logic inherent in language, which may not be captured by current training objectives. In this work, we first investigate whether current retrieval systems can comprehend the Boolean logic implied in language. To answer this question, we formulate the task of Boolean Dense Retrieval and collect a benchmark dataset, BoolQuestions, which covers complex queries containing basic Boolean logic and corresponding annotated passages. Through extensive experimental results on the proposed task and benchmark dataset, we draw the conclusion that current dense retrieval systems do not fully understand Boolean logic in language, and there is a long way to go to improve our dense retrieval systems. Furthermore, to promote further research on enhancing the understanding of Boolean logic for language models, we explore Boolean operation on decomposed query and propose a contrastive continual training method that serves as a strong baseline for the research community.

* In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2767-2779
* Findings of the Association for Computational Linguistics: EMNLP 2024

Via

Access Paper or Ask Questions

Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

Oct 22, 2024

Zongmeng Zhang, Yufeng Shi, Jinhua Zhu, Wengang Zhou, Xiang Qi, Peng Zhang, Houqiang Li

Figure 1 for Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

Figure 2 for Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

Figure 3 for Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

Figure 4 for Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

Abstract:Trustworthiness is an essential prerequisite for the real-world application of large language models. In this paper, we focus on the trustworthiness of language models with respect to retrieval augmentation. Despite being supported with external evidence, retrieval-augmented generation still suffers from hallucinations, one primary cause of which is the conflict between contextual and parametric knowledge. We deem that retrieval-augmented language models have the inherent capabilities of supplying response according to both contextual and parametric knowledge. Inspired by aligning language models with human preference, we take the first step towards aligning retrieval-augmented language models to a status where it responds relying merely on the external evidence and disregards the interference of parametric knowledge. Specifically, we propose a reinforcement learning based algorithm Trustworthy-Alignment, theoretically and experimentally demonstrating large language models' capability of reaching a trustworthy status without explicit supervision on how to respond. Our work highlights the potential of large language models on exploring its intrinsic abilities by its own and expands the application scenarios of alignment from fulfilling human preference to creating trustworthy agents.

* Proceedings of the 41st International Conference on Machine Learning, PMLR 235:59827-59850, 2024
* ICML 2024

Via

Access Paper or Ask Questions