Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tong Lu

CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

Jan 03, 2024

Yi Rong, Haoran Zhou, Lixin Yuan, Cheng Mei, Jiahao Wang, Tong Lu

Figure 1 for CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

Figure 2 for CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

Figure 3 for CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

Figure 4 for CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

Abstract:Point cloud completion is an indispensable task for recovering complete point clouds due to incompleteness caused by occlusion, limited sensor resolution, etc. The family of coarse-to-fine generation architectures has recently exhibited great success in point cloud completion and gradually became mainstream. In this work, we unveil one of the key ingredients behind these methods: meticulously devised feature extraction operations with explicit cross-resolution aggregation. We present Cross-Resolution Transformer that efficiently performs cross-resolution aggregation with local attention mechanisms. With the help of our recursive designs, the proposed operation can capture more scales of features than common aggregation operations, which is beneficial for capturing fine geometric characteristics. While prior methodologies have ventured into various manifestations of inter-level cross-resolution aggregation, the effectiveness of intra-level one and their combination has not been analyzed. With unified designs, Cross-Resolution Transformer can perform intra- or inter-level cross-resolution aggregation by switching inputs. We integrate two forms of Cross-Resolution Transformers into one up-sampling block for point generation, and following the coarse-to-fine manner, we construct CRA-PCN to incrementally predict complete shapes with stacked up-sampling blocks. Extensive experiments demonstrate that our method outperforms state-of-the-art methods by a large margin on several widely used benchmarks. Codes are available at https://github.com/EasyRy/CRA-PCN.

* Accepted to AAAI 2024

Via

Access Paper or Ask Questions

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Dec 05, 2023

Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, Jose M. Alvarez

Figure 1 for Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Figure 2 for Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Figure 3 for Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Figure 4 for Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

Abstract:End-to-end autonomous driving recently emerged as a promising research direction to target autonomy from a full-stack perspective. Along this line, many of the latest works follow an open-loop evaluation setting on nuScenes to study the planning behavior. In this paper, we delve deeper into the problem by conducting thorough analyses and demystifying more devils in the details. We initially observed that the nuScenes dataset, characterized by relatively simple driving scenarios, leads to an under-utilization of perception information in end-to-end models incorporating ego status, such as the ego vehicle's velocity. These models tend to rely predominantly on the ego vehicle's status for future path planning. Beyond the limitations of the dataset, we also note that current metrics do not comprehensively assess the planning quality, leading to potentially biased conclusions drawn from existing benchmarks. To address this issue, we introduce a new metric to evaluate whether the predicted trajectories adhere to the road. We further propose a simple baseline able to achieve competitive results without relying on perception annotations. Given the current limitations on the benchmark and metrics, we suggest the community reassess relevant prevailing research and be cautious whether the continued pursuit of state-of-the-art would yield convincing and universal conclusions. Code and models are available at \url{https://github.com/NVlabs/BEV-Planner}

Via

Access Paper or Ask Questions

Deep Video Restoration for Under-Display Camera

Sep 09, 2023

Xuanxi Chen, Tao Wang, Ziqian Shao, Kaihao Zhang, Wenhan Luo, Tong Lu, Zikun Liu, Tae-Kyun Kim, Hongdong Li

Figure 1 for Deep Video Restoration for Under-Display Camera

Figure 2 for Deep Video Restoration for Under-Display Camera

Figure 3 for Deep Video Restoration for Under-Display Camera

Figure 4 for Deep Video Restoration for Under-Display Camera

Abstract:Images or videos captured by the Under-Display Camera (UDC) suffer from severe degradation, such as saturation degeneration and color shift. While restoration for UDC has been a critical task, existing works of UDC restoration focus only on images. UDC video restoration (UDC-VR) has not been explored in the community. In this work, we first propose a GAN-based generation pipeline to simulate the realistic UDC degradation process. With the pipeline, we build the first large-scale UDC video restoration dataset called PexelsUDC, which includes two subsets named PexelsUDC-T and PexelsUDC-P corresponding to different displays for UDC. Using the proposed dataset, we conduct extensive benchmark studies on existing video restoration methods and observe their limitations on the UDC-VR task. To this end, we propose a novel transformer-based baseline method that adaptively enhances degraded videos. The key components of the method are a spatial branch with local-aware transformers, a temporal branch embedded temporal transformers, and a spatial-temporal fusion module. These components drive the model to fully exploit spatial and temporal information for UDC-VR. Extensive experiments show that our method achieves state-of-the-art performance on PexelsUDC. The benchmark and the baseline method are expected to promote the progress of UDC-VR in the community, which will be made public.

Via

Access Paper or Ask Questions

FB-BEV: BEV Representation from Forward-Backward View Transformations

Aug 17, 2023

Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, Jose M. Alvarez

Abstract:View Transformation Module (VTM), where transformations happen between multi-view image features and Bird-Eye-View (BEV) representation, is a crucial step in camera-based BEV perception systems. Currently, the two most prominent VTM paradigms are forward projection and backward projection. Forward projection, represented by Lift-Splat-Shoot, leads to sparsely projected BEV features without post-processing. Backward projection, with BEVFormer being an example, tends to generate false-positive BEV features from incorrect projections due to the lack of utilization on depth. To address the above limitations, we propose a novel forward-backward view transformation module. Our approach compensates for the deficiencies in both existing methods, allowing them to enhance each other to obtain higher quality BEV representations mutually. We instantiate the proposed module with FB-BEV, which achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set. Code and models are available at https://github.com/NVlabs/FB-BEV.

* Accept to ICCV 2023, camera-ready version

Via

Access Paper or Ask Questions

Memory-and-Anticipation Transformer for Online Action Understanding

Aug 15, 2023

Jiahao Wang, Guo Chen, Yifei Huang, Limin Wang, Tong Lu

Figure 1 for Memory-and-Anticipation Transformer for Online Action Understanding

Figure 2 for Memory-and-Anticipation Transformer for Online Action Understanding

Figure 3 for Memory-and-Anticipation Transformer for Online Action Understanding

Figure 4 for Memory-and-Anticipation Transformer for Online Action Understanding

Abstract:Most existing forecasting systems are memory-based methods, which attempt to mimic human forecasting ability by employing various memory mechanisms and have progressed in temporal modeling for memory dependency. Nevertheless, an obvious weakness of this paradigm is that it can only model limited historical dependence and can not transcend the past. In this paper, we rethink the temporal dependence of event evolution and propose a novel memory-anticipation-based paradigm to model an entire temporal structure, including the past, present, and future. Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks. In addition, owing to the inherent superiority of MAT, it can process online action detection and anticipation tasks in a unified manner. The proposed MAT model is tested on four challenging benchmarks TVSeries, THUMOS'14, HDD, and EPIC-Kitchens-100, for online action detection and anticipation tasks, and it significantly outperforms all existing methods. Code is available at https://github.com/Echo0125/Memory-and-Anticipation-Transformer.

* ICCV 2023 Camera Ready

Via

Access Paper or Ask Questions

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Aug 03, 2023

Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao(+4 more)

Abstract:We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world. Using a scalable data engine that incorporates human feedback and efficient models in the loop, we create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions. It covers a wide range of 3.5 million common and rare concepts in the real world, and has 132.2 billion tokens that describe the concepts and their attributes. Leveraging this new dataset, we develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding. The model is trained with open-ended language prompts and locations, which allows it to generalize to various vision and language tasks with remarkable zero-shot performance, including region-text retrieval, region recognition, captioning, and question-answering. We hope that this project can serve as a foundation for vision-language artificial general intelligence research. Models and the dataset shall be released at https://github.com/OpenGVLab/All-Seeing, and demo can be seen at https://huggingface.co/spaces/OpenGVLab/all-seeing.

* Technical Report

Via

Access Paper or Ask Questions

AVSegFormer: Audio-Visual Segmentation with Transformer

Jul 05, 2023

Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu

Figure 1 for AVSegFormer: Audio-Visual Segmentation with Transformer

Figure 2 for AVSegFormer: Audio-Visual Segmentation with Transformer

Figure 3 for AVSegFormer: Audio-Visual Segmentation with Transformer

Figure 4 for AVSegFormer: Audio-Visual Segmentation with Transformer

Abstract:The combination of audio and vision has long been a topic of interest in the multi-modal community. Recently, a new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video. This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges. In this paper, we propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture. Specifically, we introduce audio queries and learnable queries into the transformer decoder, enabling the network to selectively attend to interested visual features. Besides, we present an audio-visual mixer, which can dynamically adjust visual features by amplifying relevant and suppressing irrelevant spatial channels. Additionally, we devise an intermediate mask loss to enhance the supervision of the decoder, encouraging the network to produce more accurate intermediate predictions. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.

* 9 pages, 7 figures

Via

Access Paper or Ask Questions

GridFormer: Residual Dense Transformer with Grid Structure for Image Restoration in Adverse Weather Conditions

May 29, 2023

Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tong Lu, Tae-Kyun Kim, Wei Liu, Hongdong Li

Figure 1 for GridFormer: Residual Dense Transformer with Grid Structure for Image Restoration in Adverse Weather Conditions

Figure 2 for GridFormer: Residual Dense Transformer with Grid Structure for Image Restoration in Adverse Weather Conditions

Figure 3 for GridFormer: Residual Dense Transformer with Grid Structure for Image Restoration in Adverse Weather Conditions

Figure 4 for GridFormer: Residual Dense Transformer with Grid Structure for Image Restoration in Adverse Weather Conditions

Abstract:Image restoration in adverse weather conditions is a difficult task in computer vision. In this paper, we propose a novel transformer-based framework called GridFormer which serves as a backbone for image restoration under adverse weather conditions. GridFormer is designed in a grid structure using a residual dense transformer block, and it introduces two core designs. First, it uses an enhanced attention mechanism in the transformer layer. The mechanism includes stages of the sampler and compact self-attention to improve efficiency, and a local enhancement stage to strengthen local information. Second, we introduce a residual dense transformer block (RDTB) as the final GridFormer layer. This design further improves the network's ability to learn effective features from both preceding and current local features. The GridFormer framework achieves state-of-the-art results on five diverse image restoration tasks in adverse weather conditions, including image deraining, dehazing, deraining & dehazing, desnowing, and multi-weather restoration. The source code and pre-trained models will be released.

* 17 pages, 12 figures

Via

Access Paper or Ask Questions

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

May 25, 2023

Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao(+1 more)

Figure 1 for VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Figure 2 for VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Figure 3 for VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Figure 4 for VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Abstract:Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60\% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo shall be released based on https://github.com/OpenGVLab/InternGPT. The code shall be released at https://github.com/OpenGVLab/VisionLLM.

* Technical Report

Via

Access Paper or Ask Questions

VideoLLM: Modeling Video Sequence with Large Language Models

May 23, 2023

Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu(+1 more)

Figure 1 for VideoLLM: Modeling Video Sequence with Large Language Models

Figure 2 for VideoLLM: Modeling Video Sequence with Large Language Models

Figure 3 for VideoLLM: Modeling Video Sequence with Large Language Models

Figure 4 for VideoLLM: Modeling Video Sequence with Large Language Models

Abstract:With the exponential growth of video data, there is an urgent need for automated technology to analyze and comprehend video content. However, existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks. The success of large language models (LLMs) like GPT has demonstrated their impressive abilities in sequence causal reasoning. Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding. VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence. This token sequence is then fed into a decoder-only LLM. Subsequently, with the aid of a simple task head, our VideoLLM yields an effective unified framework for different kinds of video understanding tasks. To evaluate the efficacy of VideoLLM, we conduct extensive experiments using multiple LLMs and fine-tuning methods. We evaluate our VideoLLM on eight tasks sourced from four different datasets. The experimental results demonstrate that the understanding and reasoning capabilities of LLMs can be effectively transferred to video understanding tasks. We release the code at https://github.com/cg1177/VideoLLM.

* Technical Report

Via

Access Paper or Ask Questions