Lei Bai

Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation

Sep 07, 2023
Zhuqiang Lu, Kun Hu, Chaoyue Wang, Lei Bai, Zhiyong Wang

A 360-degree (omni-directional) image provides an all-encompassing spherical view of a scene. Recently, there has been increasing interest in synthesizing 360-degree images from conventional narrow field of view (NFoV) images captured by digital cameras and smartphones, to provide immersive experiences in scenarios such as virtual reality. Yet, existing methods typically fall short in synthesizing intricate visual details or in ensuring that the generated images align consistently with user-provided prompts. In this study, an autoregressive omni-aware generative network (AOG-Net) is proposed for 360-degree image generation by progressively outpainting an incomplete 360-degree image with NFoV and text guidance, jointly or individually. This autoregressive scheme not only allows finer-grained and text-consistent patterns to be derived by dynamically generating and adjusting conditions over the course of the process, but also offers users greater flexibility to edit their conditions throughout generation. A global-local conditioning mechanism is devised to comprehensively formulate the outpainting guidance at each autoregressive step: text guidance, omni-visual cues, NFoV inputs and omni-geometry are encoded and further formulated with cross-attention based transformers into a global stream and a local stream that condition the generative backbone model. As AOG-Net is compatible with large-scale models for both the conditional encoder and the generative prior, it can exploit extensive open-vocabulary text guidance. Comprehensive experiments on two commonly used 360-degree image datasets, covering indoor and outdoor settings, demonstrate the state-of-the-art performance of our proposed method. Our code will be made publicly available.
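A minimal sketch of the autoregressive outpainting loop the abstract describes, assuming an equirectangular panorama seeded with the NFoV input and completed window by window; `outpaint_step` is a hypothetical placeholder for the conditioned generative backbone, not the paper's implementation.

```python
import numpy as np

def autoregressive_outpaint(nfov_image, text_prompt, outpaint_step,
                            pano_width=2048, pano_height=1024,
                            window=256, num_steps=8):
    """Progressively complete an equirectangular panorama from an NFoV seed.

    `outpaint_step(panorama, target_mask, text_prompt)` stands in for the
    conditioned generative model: it should fill the masked region using the
    visible omni-visual context and the text guidance.
    """
    # Initialise the panorama with the NFoV input placed at the centre.
    panorama = np.zeros((pano_height, pano_width, 3), dtype=np.float32)
    known = np.zeros((pano_height, pano_width), dtype=bool)  # True = already generated
    h, w, _ = nfov_image.shape
    y0, x0 = (pano_height - h) // 2, (pano_width - w) // 2
    panorama[y0:y0 + h, x0:x0 + w] = nfov_image
    known[y0:y0 + h, x0:x0 + w] = True

    # Autoregressively extend the panorama one window at a time, wrapping
    # around horizontally so the 360-degree boundary stays consistent.
    for step in range(num_steps):
        x_start = (x0 + w + step * window) % pano_width
        cols = [(x_start + i) % pano_width for i in range(window)]
        target = np.zeros_like(known)
        target[:, cols] = ~known[:, cols]
        # Conditions (text, visible content, geometry) can be edited between steps.
        panorama = outpaint_step(panorama, target, text_prompt)
        known[:, cols] = True
    return panorama
```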

* 10 pages 

Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training

Aug 31, 2023
Lei Bai, Dongang Wang, Michael Barnett, Mariano Cabezas, Weidong Cai, Fernando Calamante, Kain Kyle, Dongnan Liu, Linda Ly, Aria Nguyen, Chun-Chien Shieh, Ryan Sullivan, Hengrui Wang, Geng Zhan, Wanli Ouyang, Chenyu Wang

Accurately measuring the evolution of Multiple Sclerosis (MS) with magnetic resonance imaging (MRI) critically informs understanding of disease progression and helps to direct therapeutic strategy. Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area. Obtaining sufficient data from a single clinical site is challenging and does not provide the data heterogeneity needed for model robustness. Conversely, collecting data from multiple sites introduces data privacy concerns and potential label noise due to varying annotation standards. To address this dilemma, we explore federated learning while explicitly accounting for label noise. Our approach enables collaboration among multiple clinical sites without compromising data privacy, under a federated learning paradigm that incorporates a noise-robust training strategy based on label correction. Specifically, we introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions, enabling the correction of false annotations based on prediction confidence. We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites, enhancing the reliability of the correction process. Extensive experiments conducted on two multi-site datasets demonstrate the effectiveness and robustness of our proposed methods, indicating their potential for clinical applications in multi-site collaborations.
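A rough sketch of a decoupled, confidence-based hard label correction step in the spirit of the DHLC idea above; the thresholds and flipping rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def decoupled_hard_label_correction(probs, labels, fg_thresh=0.9, bg_thresh=0.9):
    """Correct suspect binary lesion annotations using model confidence.

    probs  : (N, H, W) predicted foreground probabilities
    labels : (N, H, W) noisy binary annotations (1 = lesion, 0 = background)

    Foreground and background use separate ("decoupled") thresholds, reflecting
    the heavy class imbalance and fuzzy boundaries of MS lesions.
    """
    corrected = labels.clone()
    # Annotated background that the model predicts as lesion with high
    # confidence is flipped to foreground (likely a missed lesion).
    missed = (labels == 0) & (probs > fg_thresh)
    corrected[missed] = 1
    # Annotated lesion that the model confidently predicts as background is
    # flipped to background (likely a false annotation).
    spurious = (labels == 1) & (probs < 1.0 - bg_thresh)
    corrected[spurious] = 0
    return corrected
```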

* 11 pages, 4 figures, journal submission 

STEERER: Resolving Scale Variations for Counting and Localization via Selective Inheritance Learning

Aug 21, 2023
Tao Han, Lei Bai, Lingbo Liu, Wanli Ouyang

Scale variation is a deep-rooted problem in object counting, which has not been effectively addressed by existing scale-aware algorithms. An important factor is that they typically involve cooperative learning across multiple resolutions, which can be suboptimal for learning the most discriminative features at each scale. In this paper, we propose a novel method termed STEERER (SelecTivE inhERitance lEaRning) that addresses the issue of scale variations in object counting. STEERER selects the most suitable scale for patch objects to boost feature extraction and only inherits discriminative features from lower to higher resolution progressively. The main insights of STEERER are a dedicated Feature Selection and Inheritance Adaptor (FSIA), which selectively forwards scale-customized features at each scale, and a Masked Selection and Inheritance Loss (MSIL) that helps to achieve high-quality density maps across all scales. Our experimental results on nine datasets with counting and localization tasks demonstrate the unprecedented scale generalization ability of STEERER. Code is available at https://github.com/taohan10200/STEERER.
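A toy sketch of the selective forwarding idea behind FSIA: a learned per-location gate decides which features from the lower resolution are inherited by the higher one. The module name, shapes, and gating form are assumptions for illustration; see the linked repository for the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveInheritance(nn.Module):
    """Forward only scale-customized features from low to high resolution."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 conv producing a per-pixel gate in [0, 1].
        self.gate = nn.Sequential(nn.Conv2d(channels * 2, channels, 1), nn.Sigmoid())

    def forward(self, low_res_feat, high_res_feat):
        # Upsample the coarse features to the finer resolution.
        up = F.interpolate(low_res_feat, size=high_res_feat.shape[-2:],
                           mode="bilinear", align_corners=False)
        # The gate decides where the inherited coarse features are useful.
        g = self.gate(torch.cat([up, high_res_feat], dim=1))
        return high_res_feat + g * up

# Example: inherit from a 1/16-resolution map into a 1/8-resolution map.
fsia = SelectiveInheritance(channels=64)
fused = fsia(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 64, 64))
```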

* Accepted by ICCV2023, 9 pages 

Relation-Aware Distribution Representation Network for Person Clustering with Multiple Modalities

Aug 01, 2023
Kaijian Liu, Shixiang Tang, Ziyue Li, Zhishuai Li, Lei Bai, Feng Zhu, Rui Zhao

Person clustering with multi-modal clues, including faces, bodies, and voices, is critical for tasks such as movie parsing and identity-based movie editing. Related methods such as multi-view clustering mainly project multi-modal features into a joint feature space. However, multi-modal clue features are usually only weakly correlated because of the semantic gap arising from modality-specific uniqueness. As a result, these methods are not well suited to person clustering. In this paper, we propose a Relation-Aware Distribution representation Network (RAD-Net) to generate a distribution representation for multi-modal clues. The distribution representation of a clue is a vector of the relations between this clue and all other clues from all modalities, making it modality-agnostic and well suited to person clustering. Accordingly, we introduce a graph-based method to construct distribution representations and employ a cyclic update policy to refine them progressively. Our method achieves substantial improvements of +6% and +8.2% in F-score on the Video Person-Clustering Dataset (VPCD) and the VoxCeleb2 multi-view clustering dataset, respectively. Code will be released publicly upon acceptance.
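A small illustrative sketch of the distribution-representation idea: each clue is described by its relations (here, cosine similarities) to all clues across modalities, and an iterative update smooths these vectors. The actual RAD-Net construction is graph-based and its cyclic update rule differs; this is only an approximation under those stated assumptions.

```python
import numpy as np

def distribution_representations(clue_feats, num_iters=3, momentum=0.5):
    """clue_feats: (N, D) features of all clues pooled across modalities."""
    # Row-normalise so that dot products are cosine similarities.
    f = clue_feats / (np.linalg.norm(clue_feats, axis=1, keepdims=True) + 1e-8)
    # Initial distribution representation: relation of each clue to every clue.
    dist = f @ f.T                                  # (N, N)
    for _ in range(num_iters):
        # Refinement: re-express each clue's relations through the relations
        # of the other clues, then blend with the previous estimate.
        refined = dist @ dist.T
        refined /= (np.linalg.norm(refined, axis=1, keepdims=True) + 1e-8)
        dist = momentum * dist + (1.0 - momentum) * refined
    return dist   # rows are modality-agnostic vectors ready for clustering
```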

* Accepted in IEEE Transactions on Multimedia 

Retrieve Anyone: A General-purpose Person Re-identification Task with Instructions

Jul 04, 2023
Weizhen He, Shixiang Tang, Yiheng Deng, Qihao Chen, Qingsong Xie, Yizhou Wang, Lei Bai, Feng Zhu, Rui Zhao, Wanli Ouyang, Donglian Qi, Yunfeng Yan

Human intelligence can retrieve any person according to both visual and language descriptions. However, the current computer vision community studies specific person re-identification (ReID) tasks in different scenarios separately, which limits real-world applications. This paper strives to resolve this problem by proposing a new instruct-ReID task that requires the model to retrieve images according to the given image or language instructions. Our instruct-ReID is a more general ReID setting, in which existing ReID tasks can be viewed as special cases by designing different instructions. We propose a large-scale OmniReID benchmark and an adaptive triplet loss as a baseline method to facilitate research in this new setting. Experimental results show that the baseline model trained on our OmniReID benchmark improves mAP by +0.6%, +1.4%, and 0.2% on Market1501, CUHK03, and MSMT17 for traditional ReID; by +0.8%, +2.0%, and +13.4% on PRCC, VC-Clothes, and LTCC for clothes-changing ReID; by +11.7% on COCAS+ real2 for clothes-template based clothes-changing ReID when using only RGB images; and by +25.4% on COCAS+ real2 for our newly defined language-instructed ReID. The dataset, model, and code will be available at https://github.com/hwz-zju/Instruct-ReID.
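The abstract names an adaptive triplet loss as the baseline but does not spell it out; below is a conventional margin-based triplet loss on instruction-conditioned embeddings, as a hedged sketch of what such a baseline could look like rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard margin-based triplet loss on L2-normalised embeddings.

    In instruct-ReID the embeddings would be produced from an image together
    with an image or language instruction; here they are plain tensors.
    """
    anchor, positive, negative = (F.normalize(x, dim=-1)
                                  for x in (anchor, positive, negative))
    d_ap = (anchor - positive).pow(2).sum(-1)   # anchor-positive distance
    d_an = (anchor - negative).pow(2).sum(-1)   # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()

loss = triplet_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
```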


Enhancing Mapless Trajectory Prediction through Knowledge Distillation

Jun 25, 2023
Yuning Wang, Pu Zhang, Lei Bai, Jianru Xue

Scene information plays a crucial role in trajectory forecasting systems for autonomous driving by providing semantic clues and constraints on the potential future paths of traffic agents. Prevalent trajectory prediction techniques often take high-definition maps (HD maps) as part of the inputs to provide scene knowledge. Although HD maps offer accurate road information, they may suffer from high annotation cost or legal restrictions that limit their widespread use. Therefore, such methods are still expected to generate reliable predictions in mapless scenarios. In this paper, we tackle the problem of improving the consistency between multi-modal predicted trajectories and the real road topology when map information is unavailable during the test phase. Specifically, we achieve this by training a map-based teacher prediction network on the annotated samples and transferring its knowledge to a mapless student prediction network using a two-fold knowledge distillation framework. Our solution is generalizable to common trajectory prediction networks and does not add extra computational burden. Experimental results show that our method stably improves prediction performance in mapless mode on many widely used state-of-the-art trajectory prediction baselines, compensating for the gap caused by the absence of HD maps. Qualitative visualization results demonstrate that our approach helps infer unseen map information.
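A compact sketch of the teacher-student setup described above: a map-based teacher supervises a mapless student at both the feature level and the trajectory level, alongside the usual ground-truth regression. The specific loss terms and weights are assumptions, not necessarily the paper's "two-fold" formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat,
                      student_traj, teacher_traj, gt_traj,
                      w_feat=1.0, w_traj=1.0, w_task=1.0):
    """Combine feature-level and trajectory-level distillation with the task loss.

    student_* come from the mapless network, teacher_* from the frozen
    map-based network run on the same (map-annotated) training sample.
    """
    feat_kd = F.mse_loss(student_feat, teacher_feat.detach())        # imitate features
    traj_kd = F.smooth_l1_loss(student_traj, teacher_traj.detach())  # imitate outputs
    task = F.smooth_l1_loss(student_traj, gt_traj)                   # ground-truth regression
    return w_task * task + w_feat * feat_kd + w_traj * traj_kd
```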

* submitted to NeurIPS 2023 

Correlated Time Series Self-Supervised Representation Learning via Spatiotemporal Bootstrapping

Jun 20, 2023
Luxuan Wang, Lei Bai, Ziyue Li, Rui Zhao, Fugee Tsung

Correlated time series analysis plays an important role in many real-world industries. Learning efficient representations of such large-scale data for downstream tasks is necessary but challenging. In this paper, we propose a time-step-level representation learning framework for individual instances via bootstrapped spatiotemporal representation prediction. We evaluate the effectiveness and flexibility of our framework on correlated time series forecasting and on cold-start transfer of the forecasting model to new instances with limited data. A linear regression model trained on top of the learned representations performs best in most cases. In particular, compared with other representation learning models, we reduce RMSE, MAE, and MAPE by 37%, 49%, and 48% on the PeMS-BAY dataset, respectively. Furthermore, on real-world metro passenger flow data, our framework demonstrates the ability to transfer to new cold-start instances and infer their future information, with gains of 15%, 19%, and 18%. The source code will be released on GitHub at https://github.com/bonaldli/Spatiotemporal-TS-Representation-Learning
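A minimal sketch of bootstrapped (BYOL-style) representation prediction for time series, assuming an online encoder that predicts the output of a momentum target encoder for a differently augmented view; the encoder architecture and augmentations are placeholders, not the paper's exact design.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class BootstrapTS(nn.Module):
    def __init__(self, in_dim, hid_dim=64, tau=0.99):
        super().__init__()
        self.online = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                    nn.Linear(hid_dim, hid_dim))
        self.predictor = nn.Linear(hid_dim, hid_dim)
        self.target = copy.deepcopy(self.online)     # momentum copy, no gradients
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.tau = tau

    def loss(self, view_a, view_b):
        # view_a / view_b: (batch, in_dim) two augmented views of the same step.
        p = F.normalize(self.predictor(self.online(view_a)), dim=-1)
        with torch.no_grad():
            z = F.normalize(self.target(view_b), dim=-1)
        return (2 - 2 * (p * z).sum(-1)).mean()      # negative cosine similarity

    @torch.no_grad()
    def momentum_update(self):
        # Slowly track the online encoder with the target encoder.
        for po, pt in zip(self.online.parameters(), self.target.parameters()):
            pt.mul_(self.tau).add_((1 - self.tau) * po)
```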

* Accepted to IEEE CASE 2023 

MotionGPT: Finetuned LLMs are General-Purpose Motion Generators

Jun 19, 2023
Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, Wanli Ouyang

Generating realistic human motion from given action descriptions has seen significant advances, driven by the emerging demand for digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, to generate consecutive human motions by treating multimodal signals as special input tokens in large language models (LLMs). Specifically, we first quantize the multimodal control signals into discrete codes and then formulate them in a unified prompt instruction asking the LLM to generate the corresponding motion. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion from multimodal control signals, which we hope can shed light on this new direction. Code will be released upon acceptance.
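A hedged sketch of how multimodal control signals could be flattened into a single LLM prompt as the abstract describes: quantized pose codes and the text condition become tokens inside one instruction. The token format and wording below are invented for illustration, and the 0.4% parameter tuning would typically be realised with adapters such as LoRA (not shown).

```python
def build_motion_prompt(text_condition, pose_codes):
    """Format multimodal control signals as a single instruction for the LLM.

    pose_codes: discrete indices produced by a pretrained motion codebook
                (e.g. a VQ-VAE) for the given key-frame poses, e.g. [512, 87, 341].
    """
    pose_tokens = " ".join(f"<motion_{c}>" for c in pose_codes)
    prompt = (
        "Below is an instruction describing a human motion, together with "
        "key-frame pose tokens. Generate the motion token sequence.\n"
        f"Instruction: {text_condition}\n"
        f"Key-frame poses: {pose_tokens}\n"
        "Motion:"
    )
    return prompt

print(build_motion_prompt("a person walks forward and waves", [512, 87, 341]))
```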

* 18 pages, 8 figures 

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

Jun 18, 2023
Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang

Large language models have become a potential pathway toward artificial general intelligence. Recent works on multi-modal large language models (MLLMs) have demonstrated their effectiveness in handling visual modalities. In this work, we extend MLLM research to point clouds and present the LAMM-Dataset and LAMM-Benchmark for 2D image and 3D point cloud understanding. We also establish an extensible framework to facilitate the extension of MLLMs to additional modalities. Our main contribution is three-fold: 1) We present the LAMM-Dataset and LAMM-Benchmark, which cover almost all high-level vision tasks for 2D and 3D vision, and validate their effectiveness with extensive experiments. 2) We detail the methods for constructing instruction-tuning datasets and benchmarks for MLLMs, which will enable future research on MLLMs to scale up and extend to other domains, tasks, and modalities more quickly. 3) We provide a preliminary but promising MLLM training framework optimized for extension to new modalities, together with baseline models, comprehensive experimental observations, and analysis to accelerate future research. Codes and datasets are now available at https://github.com/OpenLAMM/LAMM.
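As an illustration of what an instruction-tuning record in such a dataset might look like, a hypothetical sample is sketched below; the field names and values are assumptions, not the actual LAMM schema (see the linked repository for the real format).

```python
import json

# Hypothetical structure of one 2D instruction-tuning sample.
sample = {
    "id": "lamm_2d_000001",
    "image": "images/000001.jpg",            # a point-cloud path for 3D samples
    "task": "detection",
    "conversations": [
        {"from": "human",
         "value": "List every object in the image with a bounding box."},
        {"from": "assistant",
         "value": "person [0.12, 0.30, 0.45, 0.88]; dog [0.50, 0.55, 0.72, 0.90]"},
    ],
}
print(json.dumps(sample, indent=2))
```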

* 37 pages, 33 figures. Code available at https://github.com/OpenLAMM/LAMM ; Project page: https://openlamm.github.io/ 

LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting

Jun 14, 2023
Xu Liu, Yutong Xia, Yuxuan Liang, Junfeng Hu, Yiwei Wang, Lei Bai, Chao Huang, Zhenguang Liu, Bryan Hooi, Roger Zimmermann

Traffic forecasting plays a critical role in smart city initiatives and has seen significant advances thanks to the power of deep learning in capturing non-linear patterns in traffic data. However, the promising results achieved on current public datasets may not transfer to practical scenarios due to limitations within these datasets. First, their limited sizes may not reflect the real-world scale of traffic networks. Second, the temporal coverage of these datasets is typically short, posing hurdles for studying long-term patterns and for acquiring sufficient samples to train deep models. Third, these datasets often lack adequate metadata for sensors, which compromises the reliability and interpretability of the data. To mitigate these limitations, we introduce the LargeST benchmark dataset. It encompasses a total of 8,600 sensors with 5 years of temporal coverage and includes comprehensive metadata. Using LargeST, we perform in-depth data analysis to extract data insights, benchmark well-known baselines in terms of performance and efficiency, and identify challenges as well as opportunities for future research. We release the datasets and baseline implementations at: https://github.com/liuxu77/LargeST.
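Baselines on traffic-forecasting benchmarks of this kind are commonly compared with MAE, RMSE, and MAPE; a small self-contained sketch of those metrics is given below (it is not tied to the LargeST loading code, whose API is in the linked repository).

```python
import numpy as np

def forecasting_metrics(pred, true, eps=1e-5):
    """Standard point-forecast metrics used to compare traffic models.

    pred, true: arrays of shape (num_samples, horizon, num_sensors).
    Near-zero readings are masked out of MAPE to avoid division by zero.
    """
    err = pred - true
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mask = np.abs(true) > eps
    mape = np.abs(err[mask] / true[mask]).mean() * 100.0
    return {"MAE": mae, "RMSE": rmse, "MAPE(%)": mape}
```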
