Zhen Li

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Nov 26, 2023
Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li

3D Visual Grounding (3DVG) aims at localizing 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors to open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG.
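
As a rough illustration of how such a modular visual program might execute over detected 3D objects, here is a toy Python sketch; the module names (loc, closest, left_of), the scene, and the query are hypothetical and not taken from the paper or its code.

```python
from dataclasses import dataclass

@dataclass
class Obj3D:
    label: str
    center: tuple  # (x, y, z) in world coordinates

# Hypothetical modules in the spirit of the paper's view-independent (closest),
# view-dependent (left_of), and functional (loc) building blocks.
def loc(scene, label):
    """Return all objects matching an open-vocabulary label."""
    return [o for o in scene if o.label == label]

def closest(candidates, anchors):
    """View-independent: the candidate closest to any anchor object."""
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a.center, b.center))
    return min(candidates, key=lambda c: min(d2(c, a) for a in anchors))

def left_of(candidates, anchor, view_dir=(0.0, 1.0)):
    """View-dependent: candidates on the left of the anchor for a given viewing direction (xy plane)."""
    vx, vy = view_dir
    return [c for c in candidates
            if vx * (c.center[1] - anchor.center[1]) - vy * (c.center[0] - anchor.center[0]) > 0]

# Toy scene and a program an LLM might emit for
# "the chair to the left of the table that is closest to the door".
scene = [Obj3D("chair", (0.0, 1.0, 0.0)), Obj3D("chair", (3.0, 1.0, 0.0)),
         Obj3D("table", (1.5, 1.0, 0.0)), Obj3D("door", (-1.0, 3.0, 0.0))]
target = closest(left_of(loc(scene, "chair"), loc(scene, "table")[0]), loc(scene, "door"))
print(target)  # Obj3D(label='chair', center=(0.0, 1.0, 0.0))
```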

* Under review, project website: https://curryyuan.github.io/ZSVG3D/ 

ScribblePolyp: Scribble-Supervised Polyp Segmentation through Dual Consistency Alignment

Nov 09, 2023
Zixun Zhang, Yuncheng Jiang, Jun Wei, Hannah Cui, Zhen Li

Automatic polyp segmentation models play a pivotal role in the clinical diagnosis of gastrointestinal diseases. In previous studies, most methods relied on fully supervised approaches, necessitating pixel-level annotations for model training. However, creating pixel-level annotations is both expensive and time-consuming, which hinders model generalization. In response to this challenge, we introduce ScribblePolyp, a novel scribble-supervised polyp segmentation framework. Unlike fully supervised models, ScribblePolyp only requires the annotation of two lines (scribble labels) for each image, significantly reducing the labeling cost. Despite the coarse nature of scribble labels, which leave a substantial portion of pixels unlabeled, we propose a two-branch consistency alignment approach to provide supervision for these unlabeled pixels. The first branch employs transformation consistency alignment to narrow the gap between predictions under different transformations of the same input image. The second branch leverages affinity propagation to refine predictions into a soft version, extending additional supervision to unlabeled pixels. In summary, ScribblePolyp is an efficient model that does not rely on teacher models or moving-average pseudo labels during training. Extensive experiments on the SUN-SEG dataset underscore the effectiveness of ScribblePolyp, achieving a Dice score of 0.8155, with a further 1.8% Dice improvement attainable through a straightforward self-training strategy.
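
A minimal sketch of the scribble-supervised setup with the transformation-consistency branch (the affinity-propagation branch is omitted); the 255 ignore label, the horizontal flip, and the stand-in model below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scribble_losses(model, image, scribble):
    """image: (B, 3, H, W); scribble: (B, H, W) with 0 = background, 1 = polyp, 255 = unlabeled."""
    logits = model(image)  # (B, 2, H, W)
    # Base supervision: partial cross-entropy on the scribbled pixels only.
    loss_pce = F.cross_entropy(logits, scribble.long(), ignore_index=255)
    # First branch: transformation consistency, i.e. predictions for a flipped image
    # should match the flipped predictions of the original image (the second,
    # affinity-propagation branch is not shown here).
    logits_flip = model(torch.flip(image, dims=[-1]))
    loss_tc = F.mse_loss(torch.softmax(logits_flip, dim=1),
                         torch.flip(torch.softmax(logits, dim=1), dims=[-1]))
    return loss_pce, loss_tc

# Usage with a 1x1-conv stand-in for a real segmentation network, just to show the shapes.
model = nn.Conv2d(3, 2, kernel_size=1)
image = torch.randn(2, 3, 64, 64)
scribble = torch.full((2, 64, 64), 255)
scribble[:, 30:34, 30:34] = 1   # foreground scribble
scribble[:, :2, :] = 0          # background scribble
loss_pce, loss_tc = scribble_losses(model, image, scribble)
(loss_pce + loss_tc).backward()
```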

* Accepted by BIBM 2023 

SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection

Sep 13, 2023
Yiran Qin, Chaoqun Wang, Zijian Kang, Ningning Ma, Zhen Li, Ruimao Zhang

In this paper, we propose a novel training strategy called SupFusion, which provides auxiliary feature-level supervision for effective LiDAR-Camera fusion and significantly boosts detection performance. Our strategy involves a data enhancement method named Polar Sampling, which densifies sparse objects and trains an assistant model to generate high-quality features as the supervision. These features are then used to train the LiDAR-Camera fusion model, where the fusion feature is optimized to simulate the generated high-quality features. Furthermore, we propose a simple yet effective deep fusion module, which consistently outperforms previous fusion methods when trained with the SupFusion strategy. In such a manner, our proposal shares the following advantages. Firstly, SupFusion introduces auxiliary feature-level supervision that can boost LiDAR-Camera detection performance without introducing extra inference costs. Secondly, the proposed deep fusion can continuously improve the detector's abilities. Our proposed SupFusion and deep fusion module are plug-and-play, and we conduct extensive experiments to demonstrate their effectiveness. Specifically, we gain around 2% 3D mAP improvements on the KITTI benchmark based on multiple LiDAR-Camera 3D detectors.
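
A minimal sketch of what such feature-level auxiliary supervision could look like; the MSE mimicking loss, the weighting factor, and the BEV feature shapes are illustrative assumptions, not the released SupFusion code.

```python
import torch
import torch.nn.functional as F

def supfusion_aux_loss(fusion_feat, assistant_feat, det_loss, lam=1.0):
    """Push the LiDAR-Camera fusion feature toward the high-quality feature
    produced by the assistant model (trained on densified data); the assistant
    feature is detached so only the fusion branch receives gradients."""
    aux = F.mse_loss(fusion_feat, assistant_feat.detach())
    return det_loss + lam * aux

# Illustrative shapes only: BEV-style feature maps of (B, C, H, W).
fusion_feat = torch.randn(2, 64, 128, 128, requires_grad=True)
assistant_feat = torch.randn(2, 64, 128, 128)
total = supfusion_aux_loss(fusion_feat, assistant_feat, det_loss=torch.tensor(0.5))
total.backward()
```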

* Accepted to ICCV2023 

ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models

Sep 03, 2023
Yuhao Du, Yuncheng Jiang, Shuangyi Tan, Xusheng Wu, Qi Dou, Zhen Li, Guanbin Li, Xiang Wan

Colonoscopy analysis, particularly automatic polyp segmentation and detection, is essential for assisting clinical diagnosis and treatment. However, as medical image annotation is labour- and resource-intensive, the scarcity of annotated data limits the effectiveness and generalization of existing methods. Although recent research has focused on data generation and augmentation to address this issue, the quality of the generated data remains a challenge, which limits its contribution to the performance of subsequent tasks. Inspired by the superiority of diffusion models in fitting data distributions and generating high-quality data, in this paper, we propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit downstream tasks. Specifically, ArSDM utilizes the ground-truth segmentation mask as a prior condition during training and adjusts the diffusion loss for each input according to the polyp/background size ratio. Furthermore, ArSDM incorporates a pre-trained segmentation model to refine the training process by reducing the difference between the ground-truth mask and the prediction mask. Extensive experiments on segmentation and detection tasks demonstrate that the data generated by ArSDM can significantly boost the performance of baseline methods.
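
A minimal sketch of a size-adaptive diffusion loss in the spirit of the abstract; the inverse-area weighting and its normalization are illustrative assumptions rather than the paper's exact formulation, and the segmentation-based refinement term is omitted.

```python
import torch

def adaptive_weighted_diffusion_loss(noise_pred, noise, mask):
    """noise_pred, noise: (B, C, H, W) predicted / true noise of a DDPM step;
    mask: (B, 1, H, W) binary polyp mask used as the prior condition.
    Small polyps get larger weights so the model does not ignore them."""
    per_px = (noise_pred - noise) ** 2                   # plain DDPM MSE term
    ratio = mask.flatten(1).mean(dim=1).clamp(min=1e-4)  # polyp area / image area per sample
    weight = 1.0 / ratio                                 # hypothetical inverse-area weight
    weight = weight / weight.mean()                      # keep the overall loss scale stable
    return (weight.view(-1, 1, 1, 1) * per_px).mean()

noise_pred = torch.randn(2, 3, 64, 64, requires_grad=True)
noise = torch.randn(2, 3, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.95).float()
adaptive_weighted_diffusion_loss(noise_pred, noise, mask).backward()
```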

* Accepted by MICCAI-2023 

LATR: 3D Lane Detection from Monocular Images with Transformer

Aug 20, 2023
Yueru Luo, Chaoda Zheng, Xu Yan, Tang Kun, Chao Zheng, Shuguang Cui, Zhen Li

3D lane detection from monocular images is a fundamental yet challenging task in autonomous driving. Recent advances primarily rely on structural 3D surrogates (e.g., bird's eye view) built from front-view image features and camera parameters. However, the depth ambiguity in monocular images inevitably causes misalignment between the constructed surrogate feature map and the original image, posing a great challenge for accurate lane detection. To address the above issue, we present LATR, a novel end-to-end 3D lane detector that uses 3D-aware front-view features without a transformed view representation. Specifically, LATR detects 3D lanes via cross-attention based on query and key-value pairs, constructed using our lane-aware query generator and dynamic 3D ground positional embedding. On the one hand, each query is generated based on 2D lane-aware features and adopts a hybrid embedding to enhance lane information. On the other hand, 3D space information is injected as positional embedding from an iteratively updated 3D ground plane. LATR outperforms previous state-of-the-art methods on the synthetic Apollo dataset as well as the realistic OpenLane and ONCE-3DLanes datasets by large margins (e.g., an 11.4 gain in F1 score on OpenLane). Code will be released at https://github.com/JMoonr/LATR .
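
A minimal sketch of the described query/key-value cross-attention with a 3D ground positional embedding; the dimensions, the learned queries, and the positional MLP are illustrative stand-ins, not LATR's actual lane-aware query generator or iterative ground-plane update.

```python
import torch
import torch.nn as nn

class LaneCrossAttention(nn.Module):
    def __init__(self, dim=256, num_queries=40, num_heads=8):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)  # stand-in for lane-aware queries
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_feat, ground_xyz):
        # img_feat:   (B, N, dim) flattened front-view features
        # ground_xyz: (B, N, 3) 3D positions sampled on the ground plane
        kv = img_feat + self.pos_mlp(ground_xyz)  # inject 3D ground positional embedding
        q = self.queries.weight.unsqueeze(0).expand(img_feat.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)             # queries attend to 3D-aware front-view features
        return out                                # (B, num_queries, dim)

feat, xyz = torch.randn(2, 1024, 256), torch.randn(2, 1024, 3)
print(LaneCrossAttention()(feat, xyz).shape)      # torch.Size([2, 40, 256])
```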

* Accepted by ICCV2023 (Oral) 

WeakPolyp: You Only Look Bounding Box for Polyp Segmentation

Jul 20, 2023
Jun Wei, Yiwen Hu, Shuguang Cui, S. Kevin Zhou, Zhen Li

Limited by expensive pixel-level labels, polyp segmentation models are plagued by data shortage and suffer from impaired generalization. In contrast, polyp bounding box annotations are much cheaper and more accessible. Thus, to reduce the labeling cost, we propose to learn a weakly supervised polyp segmentation model (i.e., WeakPolyp) based entirely on bounding box annotations. However, coarse bounding boxes contain too much noise. To avoid interference, we introduce the mask-to-box (M2B) transformation. By supervising the outer box mask of the prediction instead of the prediction itself, M2B greatly mitigates the mismatch between the coarse label and the precise prediction. However, M2B only provides sparse supervision, leading to non-unique predictions. Therefore, we further propose a scale consistency (SC) loss for dense supervision. By explicitly aligning predictions of the same image at different scales, the SC loss largely reduces the variation of predictions. Note that our WeakPolyp is a plug-and-play model that can be easily ported to other appealing backbones. Besides, the proposed modules are only used during training, bringing no computation cost to inference. Extensive experiments demonstrate the effectiveness of our proposed WeakPolyp, which surprisingly achieves performance comparable to a fully supervised model while requiring no mask annotations at all.
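
A minimal sketch of box-only supervision in this spirit; the row/column projection standing in for M2B, the two-scale consistency loss, and the stand-in model are illustrative assumptions, not the released WeakPolyp code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_to_box(prob):
    """Project the predicted mask onto rows and columns so that only its bounding
    extent is kept, i.e. the part a box label can actually supervise."""
    rows = prob.max(dim=-1, keepdim=True).values   # (B, 1, H, 1)
    cols = prob.max(dim=-2, keepdim=True).values   # (B, 1, 1, W)
    return torch.minimum(rows, cols)               # outer-box-shaped mask in [0, 1]

def weakpolyp_losses(model, image, box_mask):
    """box_mask: (B, 1, H, W) binary mask filled from the bounding-box annotation."""
    p1 = torch.sigmoid(model(image))                                    # full-scale prediction
    small = F.interpolate(image, scale_factor=0.5, mode="bilinear", align_corners=False)
    p2 = F.interpolate(torch.sigmoid(model(small)), size=p1.shape[-2:],
                       mode="bilinear", align_corners=False)
    loss_m2b = F.binary_cross_entropy(mask_to_box(p1), box_mask)        # box-level supervision
    loss_sc = F.mse_loss(p1, p2)                                        # scale-consistency loss
    return loss_m2b, loss_sc

model = nn.Conv2d(3, 1, kernel_size=1)  # stand-in for a real polyp segmentation network
image = torch.randn(2, 3, 64, 64)
box_mask = torch.zeros(2, 1, 64, 64)
box_mask[:, :, 20:40, 24:48] = 1.0
loss_m2b, loss_sc = weakpolyp_losses(model, image, box_mask)
(loss_m2b + loss_sc).backward()
```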

* Accepted by MICCAI 2023, code is available at https://github.com/weijun88/WeakPolyp 

Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for Recommendation and Text Generation

Jul 19, 2023
Wei Jin, Haitao Mao, Zheng Li, Haoming Jiang, Chen Luo, Hongzhi Wen, Haoyu Han, Hanqing Lu, Zhengyang Wang, Ruirui Li, Zhen Li, Monica Xiao Cheng, Rahul Goutam, Haiyang Zhang, Karthik Subbian, Suhang Wang, Yizhou Sun, Jiliang Tang, Bing Yin, Xianfeng Tang

Modeling customer shopping intentions is a crucial task for e-commerce, as it directly impacts user experience and engagement. Thus, accurately understanding customer preferences is essential for providing personalized recommendations. Session-based recommendation, which utilizes customer session data to predict their next interaction, has become increasingly popular. However, existing session datasets have limitations in terms of item attributes, user diversity, and dataset scale. As a result, they cannot comprehensively capture the spectrum of user behaviors and preferences. To bridge this gap, we present the Amazon Multilingual Multi-locale Shopping Session Dataset, namely Amazon-M2. It is the first multilingual dataset consisting of millions of user sessions from six different locales, where the major languages of products are English, German, Japanese, French, Italian, and Spanish. Remarkably, the dataset can help us enhance personalization and understanding of user preferences, which can benefit various existing tasks as well as enable new tasks. To test the potential of the dataset, we introduce three tasks in this work: (1) next-product recommendation, (2) next-product recommendation with domain shifts, and (3) next-product title generation. With the above tasks, we benchmark a range of algorithms on our proposed dataset, drawing new insights for further research and practice. In addition, based on the proposed dataset and tasks, we hosted a competition at KDD Cup 2023, which attracted thousands of users and submissions. The winning solutions and the associated workshop can be accessed at our website https://kddcup23.github.io/.
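
To make the first task concrete, here is a toy sketch of session records and a trivial popularity baseline for next-product recommendation; the field names and records are hypothetical placeholders, not the dataset's actual schema (see the KDD Cup 2023 website for that).

```python
from collections import Counter

# Hypothetical session records in the spirit of the dataset description.
sessions = [
    {"locale": "DE", "prev_items": ["B01", "B02", "B03"], "next_item": "B04"},
    {"locale": "JP", "prev_items": ["B07", "B02"],        "next_item": "B02"},
    {"locale": "DE", "prev_items": ["B02", "B04"],        "next_item": "B01"},
]

# Task (1), next-product recommendation, with a trivial popularity baseline:
# recommend the globally most frequent items that are not already in the session.
popularity = Counter(item for s in sessions for item in s["prev_items"])

def recommend(session, k=2):
    return [item for item, _ in popularity.most_common()
            if item not in session["prev_items"]][:k]

hits = sum(s["next_item"] in recommend(s) for s in sessions)
print(f"recall@2 = {hits / len(sessions):.2f}")
```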

* Dataset for KDD Cup 2023, https://kddcup23.github.io/ 

Characterization of partial wetting by CMAS droplets using multiphase many-body dissipative particle dynamics and data-driven discovery based on PINNs

Jul 18, 2023
Elham Kiyani, Mahdi Kooshkbaghi, Khemraj Shukla, Rahul Babu Koneru, Zhen Li, Luis Bravo, Anindya Ghoshal, George Em Karniadakis, Mikko Karttunen

The molten sand, a mixture of calcia, magnesia, alumina, and silicate, known as CMAS, is characterized by its high viscosity, density, and surface tension. The unique properties of CMAS make it a challenging material to deal with in high-temperature applications, requiring innovative solutions and materials to prevent its buildup and damage to critical equipment. Here, we use multiphase many-body dissipative particle dynamics (mDPD) simulations to study the wetting dynamics of highly viscous molten CMAS droplets. The simulations are performed in three dimensions, with varying initial droplet sizes and equilibrium contact angles. We propose a coarse parametric ordinary differential equation (ODE) that captures the spreading-radius behavior of the CMAS droplets. The ODE parameters are then identified using the Physics-Informed Neural Network (PINN) framework. Subsequently, the closed-form dependency of the PINN-identified parameter values on the initial radii and contact angles is obtained using symbolic regression. Finally, we employ Bayesian PINNs (B-PINNs) to assess and quantify the uncertainty associated with the discovered parameters. In brief, this study provides insight into the spreading dynamics of CMAS droplets by fusing simple parametric ODE modeling and state-of-the-art machine learning techniques.
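
A minimal PINN sketch of this type of parameter identification on a coarse ODE; the relaxation form dr/dt = k (r_inf - r), the synthetic data, and the network size are hypothetical stand-ins for the paper's parametric ODE and mDPD simulation data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
k_true, r_inf_true = 2.0, 1.0
t_data = torch.linspace(0, 3, 50).unsqueeze(1)
r_data = r_inf_true * (1 - torch.exp(-k_true * t_data))   # exact solution with r(0) = 0

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
log_k = nn.Parameter(torch.zeros(()))            # learnable ODE parameters (log for positivity)
r_inf = nn.Parameter(torch.tensor(0.5))
opt = torch.optim.Adam(list(net.parameters()) + [log_k, r_inf], lr=1e-2)

for step in range(2000):
    opt.zero_grad()
    t = t_data.clone().requires_grad_(True)
    r = net(t)
    drdt = torch.autograd.grad(r, t, grad_outputs=torch.ones_like(r), create_graph=True)[0]
    loss_phys = ((drdt - torch.exp(log_k) * (r_inf - r)) ** 2).mean()   # ODE residual
    loss_data = ((net(t_data) - r_data) ** 2).mean()                    # fit to observations
    (loss_phys + loss_data).backward()
    opt.step()

print(f"identified k ~ {torch.exp(log_k).item():.2f}, r_inf ~ {r_inf.item():.2f}")
```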

SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training

Jul 17, 2023
Hong Yan, Yang Liu, Yushen Wei, Zhen Li, Guanbin Li, Liang Lin

Skeleton sequence representation learning has shown great advantages for action recognition due to its promising ability to model human joints and topology. However, current methods usually require sufficient labeled data for training computationally expensive models, which is labor-intensive and time-consuming. Moreover, these methods ignore how to utilize the fine-grained dependencies among different skeleton joints to pre-train an efficient skeleton sequence learning model that can generalize well across different datasets. In this paper, we propose an efficient skeleton sequence learning framework, named Skeleton Sequence Learning (SSL). To comprehensively capture the human pose and obtain discriminative skeleton sequence representations, we build an asymmetric graph-based encoder-decoder pre-training architecture named SkeletonMAE, which embeds the skeleton joint sequence into a Graph Convolutional Network (GCN) and reconstructs the masked skeleton joints and edges based on prior human topology knowledge. Then, the pre-trained SkeletonMAE encoder is integrated with the Spatial-Temporal Representation Learning (STRL) module to build the SSL framework. Extensive experimental results show that our SSL generalizes well across different datasets and outperforms the state-of-the-art self-supervised skeleton-based action recognition methods on the FineGym, Diving48, NTU 60 and NTU 120 datasets. Additionally, we obtain comparable performance to some fully supervised methods. The code is available at https://github.com/HongYan1123/SkeletonMAE.
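
A minimal sketch of masked-joint reconstruction pre-training in this spirit; the tiny GCN layer, the zeroed mask tokens, the mask ratio, and the identity adjacency are illustrative placeholders, not the SkeletonMAE architecture.

```python
import torch
import torch.nn as nn

class TinyGCNLayer(nn.Module):
    """One graph-convolution layer, X' = act(A_hat X W), over the joint graph."""
    def __init__(self, in_dim, out_dim, act=True):
        super().__init__()
        self.lin, self.act = nn.Linear(in_dim, out_dim), act
    def forward(self, x, a_hat):            # x: (B, J, C), a_hat: (J, J) normalized adjacency
        h = self.lin(a_hat @ x)
        return torch.relu(h) if self.act else h

def masked_reconstruction_step(x, a_hat, encoder, decoder, mask_ratio=0.4):
    """Mask random joints, encode with a GCN, and reconstruct only the masked joints."""
    mask = torch.rand(x.shape[0], x.shape[1], 1) < mask_ratio   # True = masked joint
    x_in = torch.where(mask, torch.zeros_like(x), x)            # zero token for masked joints
    rec = decoder(encoder(x_in, a_hat), a_hat)                  # (B, J, C) reconstruction
    return ((rec - x) ** 2 * mask).sum() / mask.sum().clamp(min=1)

J = 25                                            # e.g. an NTU-style 25-joint skeleton
a_hat = torch.eye(J)                              # placeholder normalized adjacency
enc, dec = TinyGCNLayer(3, 64), TinyGCNLayer(64, 3, act=False)
loss = masked_reconstruction_step(torch.randn(8, J, 3), a_hat, enc, dec)
loss.backward()
```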

* Accepted by ICCV 2023 