Liang Lin

Exploration and Exploitation of Unlabeled Data for Open-Set Semi-Supervised Learning

Jun 30, 2023
Ganlong Zhao, Guanbin Li, Yipeng Qin, Jinjin Zhang, Zhenhua Chai, Xiaolin Wei, Liang Lin, Yizhou Yu

In this paper, we address a complex but practical scenario in semi-supervised learning (SSL) named open-set SSL, where unlabeled data contain both in-distribution (ID) and out-of-distribution (OOD) samples. Unlike previous methods that only consider ID samples to be useful and aim to filter out OOD ones completely during training, we argue that the exploration and exploitation of both ID and OOD samples can benefit SSL. To support our claim, i) we propose a prototype-based clustering and identification algorithm that explores the inherent similarity and difference among samples at the feature level and effectively clusters them around several predefined ID and OOD prototypes, thereby enhancing feature learning and facilitating ID/OOD identification; ii) we propose an importance-based sampling method that exploits the difference in importance of each ID and OOD sample to SSL, thereby reducing the sampling bias and improving the training. Our proposed method achieves state-of-the-art performance on several challenging benchmarks and improves upon existing SSL methods even when ID samples are totally absent from the unlabeled data.
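
As a rough illustration of the prototype-based clustering idea described above, the sketch below assigns each unlabeled feature to its nearest ID or OOD prototype and derives an importance score from the assignment confidence. All names (assign_prototypes, the temperature value, the prototype split) are hypothetical and do not come from the paper's implementation.

```python
# Hypothetical sketch of prototype-based ID/OOD assignment for open-set SSL.
# Names and hyperparameters are illustrative, not the authors' code.
import torch
import torch.nn.functional as F

def assign_prototypes(features, prototypes, num_id_prototypes, temperature=0.1):
    """Assign each unlabeled feature to its nearest prototype.

    features:   (N, D) feature vectors from the backbone.
    prototypes: (K, D) ID + OOD prototypes; the first `num_id_prototypes`
                rows are treated as ID prototypes.
    Returns soft assignments, a hard ID/OOD mask, and a confidence score
    that could serve as an importance weight when sampling for SSL losses.
    """
    f = F.normalize(features, dim=1)
    p = F.normalize(prototypes, dim=1)
    logits = f @ p.t() / temperature          # cosine similarity, sharpened
    assign = logits.softmax(dim=1)            # soft cluster assignment
    nearest = assign.argmax(dim=1)
    is_id = nearest < num_id_prototypes       # ID if nearest prototype is an ID one
    importance = assign.max(dim=1).values     # peaked assignment -> higher weight
    return assign, is_id, importance

# Toy usage
feats = torch.randn(8, 128)
protos = torch.randn(6, 128)                  # e.g. 4 ID + 2 OOD prototypes
_, id_mask, weights = assign_prototypes(feats, protos, num_id_prototypes=4)
```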

CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning

Jun 30, 2023
Yang Liu, Weixing Chen, Guanbin Li, Liang Lin

We present CausalVLR (Causal Visual-Linguistic Reasoning), an open-source toolbox containing a rich set of state-of-the-art causal relation discovery and causal inference methods for various visual-linguistic reasoning tasks, such as VQA, image/video captioning, medical report generation, model generalization and robustness, etc. The methods are included in the toolbox as PyTorch implementations for NVIDIA computing systems. The toolbox not only includes training and inference code but also provides model weights. We believe this toolbox is by far the most complete visual-linguistic causal reasoning toolbox. We hope that the toolbox and benchmark can serve the growing research community by providing a flexible toolkit to re-implement existing methods and develop new causal reasoning methods. Code and models are available at https://github.com/HCPLab-SYSU/Causal-VLReasoning. The project is under active development by HCP-Lab's contributors and we will keep this document updated.

* CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning. https://github.com/HCPLab-SYSU/CausalVLR 

DreamEditor: Text-Driven 3D Scene Editing with Neural Fields

Jun 29, 2023
Jingyu Zhuang, Chen Wang, Lingjie Liu, Liang Lin, Guanbin Li

Neural fields have achieved impressive advancements in view synthesis and scene reconstruction. However, editing these neural fields remains challenging due to the implicit encoding of geometry and texture information. In this paper, we propose DreamEditor, a novel framework that enables users to perform controlled editing of neural fields using text prompts. By representing scenes as mesh-based neural fields, DreamEditor allows localized editing within specific regions. DreamEditor utilizes the text encoder of a pretrained text-to-image diffusion model to automatically identify the regions to be edited based on the semantics of the text prompts. Subsequently, DreamEditor optimizes the editing region and aligns its geometry and texture with the text prompts through score distillation sampling [29]. Extensive experiments demonstrate that DreamEditor can accurately edit neural fields of real-world scenes according to the given text prompts while ensuring consistency in irrelevant areas. DreamEditor generates highly realistic textures and geometry, significantly surpassing previous works in both quantitative and qualitative evaluations.
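
For reference, score distillation sampling (the optimization signal DreamEditor builds on, introduced in the DreamFusion formulation) backpropagates the diffusion model's denoising error into the scene parameters; a standard form of its gradient is:

```latex
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}\big(\phi,\; \mathbf{x}=g(\theta)\big)
  = \mathbb{E}_{t,\epsilon}\!\Big[\, w(t)\,\big(\hat{\epsilon}_{\phi}(\mathbf{x}_t;\, y,\, t) - \epsilon\big)\,
    \tfrac{\partial \mathbf{x}}{\partial \theta} \,\Big]
```

Here g(θ) renders an image x from the scene parameters θ, x_t is its noised version at timestep t, y is the text prompt embedding, ε̂_φ is the diffusion model's noise prediction, and w(t) is a timestep weighting; how DreamEditor restricts this signal to the editing region is described in the paper.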

DenseLight: Efficient Control for Large-scale Traffic Signals with Dense Feedback

Jun 13, 2023
Junfan Lin, Yuying Zhu, Lingbo Liu, Yang Liu, Guanbin Li, Liang Lin

Traffic Signal Control (TSC) aims to reduce the average travel time of vehicles in a road network, which in turn enhances fuel efficiency, air quality, and road safety, benefiting society as a whole. Due to the complexity of long-horizon control and coordination, most prior TSC methods leverage deep reinforcement learning (RL) to search for a control policy and have witnessed great success. However, TSC still faces two significant challenges. 1) The travel time of a vehicle is delayed feedback on the effectiveness of the TSC policy at each intersection, since it is only obtained after the vehicle has left the road network. Although several heuristic reward functions have been proposed as substitutes for travel time, they are usually biased and do not lead the policy to improve in the correct direction. 2) The traffic condition of each intersection is influenced by non-local intersections, since vehicles traverse multiple intersections over time. Therefore, the TSC agent must leverage both local observations and non-local traffic conditions to comprehensively predict the long-horizon traffic conditions of each intersection. To address these challenges, we propose DenseLight, a novel RL-based TSC method that employs an unbiased reward function to provide dense feedback on policy effectiveness and a non-local enhanced TSC agent to better predict future traffic conditions for more precise traffic control. Extensive experiments and ablation studies demonstrate that DenseLight consistently outperforms advanced baselines on various road networks with diverse traffic flows. The code is available at https://github.com/junfanlin/DenseLight.
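
As a schematic of how an RL-based TSC agent consumes dense per-step feedback rather than the delayed travel-time signal, the skeleton below shows a generic training loop; the environment, the reward returned each step, and the agent are placeholders, not DenseLight's actual reward or architecture.

```python
# Schematic RL loop for traffic signal control with dense per-step feedback.
# `TrafficEnv`, `Agent`, and the reward are placeholders for illustration.
import random

class TrafficEnv:
    """Toy stand-in for a road-network simulator."""
    def reset(self):
        return [0.0] * 8                      # local observation per intersection

    def step(self, action):
        obs = [random.random() for _ in range(8)]
        dense_reward = -sum(obs)              # placeholder dense feedback signal
        done = random.random() < 0.01
        return obs, dense_reward, done

class Agent:
    def act(self, obs, neighbor_obs):
        # A real agent would fuse local and non-local observations here.
        return 0                              # fixed phase as a placeholder

    def update(self, transition):
        pass                                  # policy/value update goes here

env, agent = TrafficEnv(), Agent()
obs, done = env.reset(), False
while not done:
    action = agent.act(obs, neighbor_obs=None)
    next_obs, reward, done = env.step(action)     # reward arrives every step
    agent.update((obs, action, reward, next_obs))
    obs = next_obs
```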

* This work is accepted by IJCAI 2023 

Long-term Wind Power Forecasting with Hierarchical Spatial-Temporal Transformer

May 30, 2023
Yang Zhang, Lingbo Liu, Xinyu Xiong, Guanbin Li, Guoli Wang, Liang Lin

Wind power is attracting increasing attention around the world because it is renewable and pollution-free, among other advantages. However, safely and stably integrating this intermittent energy source into electric power systems at high penetration levels remains challenging. Accurate wind power forecasting (WPF) can effectively reduce power fluctuations in power system operations. Existing methods are mainly designed for short-term predictions and lack effective spatial-temporal feature augmentation. In this work, we propose a novel end-to-end wind power forecasting model named Hierarchical Spatial-Temporal Transformer Network (HSTTN) to address the long-term WPF problem. Specifically, we construct an hourglass-shaped encoder-decoder framework with skip connections to jointly model representations aggregated at hierarchical temporal scales, which benefits long-term forecasting. Based on this framework, we capture inter-scale long-range temporal dependencies and global spatial correlations with two parallel Transformer skeletons and strengthen intra-scale connections with downsampling and upsampling operations. Moreover, complementary information from the spatial and temporal features is fused and propagated between the two branches via Contextual Fusion Blocks (CFBs) to further improve the prediction. Extensive experimental results on two large-scale real-world datasets demonstrate the superior performance of our HSTTN over existing solutions.
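
A minimal sketch of the hourglass-style idea described above: downsample the temporal sequence, model long-range dependencies at the coarser scale with a Transformer, then upsample and add a skip connection across scales. Module names, kernel sizes, and dimensions are illustrative assumptions, not the HSTTN architecture.

```python
# Illustrative hourglass-style temporal block: downsample -> Transformer ->
# upsample, with a skip connection across scales. Hyperparameters are
# arbitrary; this is not the HSTTN implementation.
import torch
import torch.nn as nn

class HourglassTemporalBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.down = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.up = nn.ConvTranspose1d(dim, dim, kernel_size=4, stride=2, padding=1)

    def forward(self, x):                      # x: (batch, time, dim)
        skip = x
        z = self.down(x.transpose(1, 2))       # move to a coarser temporal scale
        z = self.encoder(z.transpose(1, 2))    # long-range dependencies
        z = self.up(z.transpose(1, 2)).transpose(1, 2)
        return z[:, :skip.size(1)] + skip      # skip connection across scales

out = HourglassTemporalBlock()(torch.randn(2, 48, 64))   # -> (2, 48, 64)
```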

* Accepted to IJCAI 2023 

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

May 23, 2023
Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin

This paper presents a controllable text-to-video (T2V) diffusion model, named Video-ControlNet, that generates videos conditioned on a sequence of control signals, such as edge or depth maps. Video-ControlNet is built on a pre-trained conditional text-to-image (T2I) diffusion model by incorporating a spatial-temporal self-attention mechanism and trainable temporal layers for efficient cross-frame modeling. A first-frame conditioning strategy is proposed to enable the model to generate videos transferred from the image domain as well as arbitrary-length videos in an auto-regressive manner. Moreover, Video-ControlNet employs a novel residual-based noise initialization strategy to introduce a motion prior from an input video, producing more coherent videos. With the proposed architecture and strategies, Video-ControlNet achieves resource-efficient convergence and generates videos of superior quality and consistency with fine-grained control. Extensive experiments demonstrate its success in various video generative tasks such as video editing and video style transfer, outperforming previous methods in terms of consistency and quality. Project Page: https://controlavideo.github.io/
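
The auto-regressive, first-frame-conditioned generation described above can be pictured as the loop below, where each chunk is generated conditioned on the last frame of the previous chunk. `generate_chunk` and its arguments are hypothetical stand-ins, not the Video-ControlNet API.

```python
def generate_long_video(prompt, control_frames, chunk_len, generate_chunk):
    """Auto-regressive generation sketch: each chunk conditions on the
    previous chunk's last frame. `generate_chunk(prompt, controls, first_frame)`
    is a hypothetical stand-in for the conditional video diffusion sampler."""
    video, first_frame = [], None
    for start in range(0, len(control_frames), chunk_len):
        controls = control_frames[start:start + chunk_len]
        chunk = generate_chunk(prompt, controls, first_frame=first_frame)
        video.extend(chunk)
        first_frame = chunk[-1]           # carried into the next chunk for coherence
    return video
```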

Identity-Preserving Talking Face Generation with Landmark and Appearance Priors

May 15, 2023
Weizhi Zhong, Chaowei Fang, Yinqi Cai, Pengxu Wei, Gangming Zhao, Liang Lin, Guanbin Li

Generating talking face videos from audio has attracted considerable research interest. A few person-specific methods can generate vivid videos but require the target speaker's videos for training or fine-tuning. Existing person-generic methods struggle to generate realistic and lip-synced videos while preserving identity information. To tackle this problem, we propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures. First, we devise a novel Transformer-based landmark generator to infer lip and jaw landmarks from the audio. Prior landmark characteristics of the speaker's face are employed to make the generated landmarks coincide with the facial outline of the speaker. Then, a video rendering model is built to translate the generated landmarks into face images. During this stage, prior appearance information is extracted from the lower-half-occluded target face and static reference images, which helps generate realistic and identity-preserving visual content. To effectively exploit the prior information in the static reference images, we align them with the target face's pose and expression based on motion fields. Moreover, auditory features are reused to guarantee that the generated face images are well synchronized with the audio. Extensive experiments demonstrate that our method produces more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
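
The two-stage pipeline above (audio-to-landmark, then landmark-to-video) can be summarized as the sketch below; `landmark_generator` and `video_renderer` are hypothetical placeholders rather than the released IP_LAP modules.

```python
# Hypothetical two-stage inference sketch: audio -> lip/jaw landmarks ->
# rendered face frames. Names are placeholders; see the authors' repository
# for the real implementation.
def synthesize_talking_face(audio_features, prior_landmarks,
                            occluded_target_frames, reference_images,
                            landmark_generator, video_renderer):
    frames = []
    for t, audio_t in enumerate(audio_features):
        # Stage 1: predict lip and jaw landmarks from audio, constrained by
        # the speaker's prior landmark characteristics (facial outline).
        landmarks_t = landmark_generator(audio_t, prior_landmarks)
        # Stage 2: render a frame from landmarks plus appearance priors taken
        # from the lower-half-occluded target frame and aligned references.
        frame_t = video_renderer(landmarks_t,
                                 occluded_target_frames[t],
                                 reference_images,
                                 audio_t)          # audio reused for lip sync
        frames.append(frame_t)
    return frames
```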

* CVPR 2023, Code: https://github.com/Weizhi-Zhong/IP_LAP 

SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

May 12, 2023
Shanshan Zhong, Zhongzhan Huang, Wushao Wen, Jinghui Qin, Liang Lin

Diffusion models, which have emerged as popular text-to-image generation models, can produce high-quality and content-rich images guided by textual prompts. However, existing models have limited semantic understanding and commonsense reasoning when the input prompts are concise narratives, resulting in low-quality image generation. To improve the handling of narrative prompts, we propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models. To reach this goal, we first collect and annotate a new dataset, SURD, which consists of more than 57,000 semantically corrected multi-modal samples. Each sample contains a simple narrative prompt, a complex keyword-based prompt, and a high-quality image. Then, we align the semantic representation of narrative prompts to that of the complex prompts and transfer knowledge from large language models (LLMs) to our SUR-adapter via knowledge distillation, so that it acquires powerful semantic understanding and reasoning capabilities to build high-quality textual semantic representations for text-to-image generation. We conduct experiments by integrating multiple LLMs and popular pre-trained diffusion models to show the effectiveness of our approach in enabling diffusion models to understand and reason over concise natural language without degrading image quality. Our approach makes text-to-image diffusion models easier to use, with a better user experience, and has the potential to further advance the development of user-friendly text-to-image generation models by bridging the semantic gap between simple narrative prompts and complex keyword-based prompts. The code is released at https://github.com/Qrange-group/SUR-adapter.
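
A minimal sketch of the alignment-plus-distillation objective described above: an adapter maps the narrative-prompt embedding toward the complex-prompt embedding while also matching an LLM-derived representation. The loss choices, weights, and all tensor names are assumptions for illustration, not the paper's exact objective.

```python
# Illustrative alignment + knowledge-distillation losses for an adapter that
# refines narrative-prompt embeddings. Not the SUR-adapter's exact objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptAdapter(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, narrative_emb):
        return narrative_emb + self.net(narrative_emb)   # residual refinement

def adapter_loss(adapter, narrative_emb, complex_emb, llm_emb,
                 w_align=1.0, w_kd=0.5):
    refined = adapter(narrative_emb)
    align = F.mse_loss(refined, complex_emb)             # match keyword-based prompt
    kd = 1 - F.cosine_similarity(refined, llm_emb, dim=-1).mean()  # distill LLM semantics
    return w_align * align + w_kd * kd

adapter = PromptAdapter()
loss = adapter_loss(adapter, torch.randn(4, 768), torch.randn(4, 768),
                    torch.randn(4, 768))
loss.backward()
```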

* work in progress 

Visual Causal Scene Refinement for Video Question Answering

May 07, 2023
Yushen Wei, Yang Liu, Hong Yan, Guanbin Li, Liang Lin

Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover the critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). In particular, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both the segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering.
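
For context, the causal front-door interventions mentioned above rely on the standard front-door adjustment, which identifies the effect of an input X on the answer Y through a mediator Z:

```latex
P\big(Y \mid do(X = x)\big)
  = \sum_{z} P(z \mid x) \sum_{x'} P\big(Y \mid x', z\big)\, P(x')
```

Here the mediator Z plays the role of the selected visual causal scene; how VCSR parameterizes and estimates these terms is detailed in the paper itself.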

* 12 pages, 7 figures. A pioneering work on discovering visual causal scenes for video question answering 