Wen Yang

2D-3D Pose Tracking with Multi-View Constraints

Sep 20, 2023
Huai Yu, Kuangyi Chen, Wen Yang, Sebastian Scherer, Gui-Song Xia

Camera localization in 3D LiDAR maps has gained increasing attention due to its promising ability to handle complex scenarios, surpassing the limitations of visual-only localization methods. However, existing methods mostly focus on addressing the cross-modal gaps, estimating camera poses frame by frame without considering the relationship between adjacent frames, which makes pose tracking unstable. To alleviate this, we propose to couple the 2D-3D correspondences of adjacent frames through 2D-2D feature matching, establishing multi-view geometrical constraints for simultaneously estimating multiple camera poses. Specifically, we propose a new 2D-3D pose tracking framework consisting of a front-end hybrid flow estimation network for consecutive frames and a back-end pose optimization module. We further design a cross-modal consistency-based loss to incorporate the multi-view constraints during training and inference. We evaluate the proposed framework on the KITTI and Argoverse datasets. Experimental results demonstrate its superior performance over existing frame-by-frame 2D-3D pose tracking methods and state-of-the-art vision-only pose tracking algorithms. Online pose tracking videos are available at https://youtu.be/yfBRdg7gw5M.
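
A minimal sketch of the back-end idea, assuming a simple pinhole model and SciPy's least_squares (the paper's front-end network and loss are not reproduced): 2D-3D correspondences shared across adjacent frames are refined jointly, so poses are no longer estimated frame by frame.

    import numpy as np
    from scipy.optimize import least_squares
    from scipy.spatial.transform import Rotation as R

    def project(points_3d, rvec, tvec, K):
        """Project 3D map points into a frame with pose (rvec, tvec) and intrinsics K."""
        cam = points_3d @ R.from_rotvec(rvec).as_matrix().T + tvec
        uv = cam @ K.T
        return uv[:, :2] / uv[:, 2:3]

    def residuals(params, frames, K):
        """Stack reprojection errors of all frames; params holds [rvec, tvec] per frame."""
        res = []
        for i, (pts3d, pts2d) in enumerate(frames):
            rvec, tvec = params[6 * i:6 * i + 3], params[6 * i + 3:6 * i + 6]
            res.append((project(pts3d, rvec, tvec, K) - pts2d).ravel())
        return np.concatenate(res)

    # Map points observed in both frames stand in for 2D-3D correspondences coupled by
    # 2D-2D matching; the points and measurements below are random, for illustration only.
    K = np.array([[718.0, 0.0, 607.0], [0.0, 718.0, 185.0], [0.0, 0.0, 1.0]])
    pts3d = np.random.rand(50, 3) * 10.0 + np.array([0.0, 0.0, 20.0])
    frames = [(pts3d, project(pts3d, np.zeros(3), np.zeros(3), K) + np.random.randn(50, 2))
              for _ in range(2)]
    x0 = np.zeros(6 * len(frames))                     # start from identity poses
    sol = least_squares(residuals, x0, args=(frames, K))
    print(sol.x.reshape(-1, 6))                        # refined [rvec | tvec] per frame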

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible 

Generalizing Event-Based Motion Deblurring in Real-World Scenarios

Aug 11, 2023
Xiang Zhang, Lei Yu, Wen Yang, Jianzhuang Liu, Gui-Song Xia

Event-based motion deblurring has shown promising results by exploiting low-latency events. However, current approaches are limited in practical use, as they assume the same spatial resolution for the inputs and specific blurriness distributions. This work addresses these limitations and aims to generalize the performance of event-based deblurring to real-world scenarios. We propose a scale-aware network that allows flexible input spatial scales and enables learning from different temporal scales of motion blur. A two-stage self-supervised learning scheme is then developed to fit the real-world data distribution. By utilizing the relativity of blurriness, our approach efficiently preserves the brightness and structure of the restored latent images and further generalizes deblurring performance to varying spatial and temporal scales of motion blur in a self-distillation manner. Our method is extensively evaluated, demonstrating remarkable performance, and we also introduce a real-world dataset consisting of multi-scale blurry frames and events to facilitate research on event-based deblurring.
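
A hedged sketch of one self-supervision signal such a scheme can use (illustrative only, not the paper's exact losses): a blurry frame is modeled as the temporal average of the latent sharp frames, so re-blurring the restored sequence should reproduce the input, and applying the same relation to sub-windows exploits the relativity of blurriness across temporal scales.

    import torch
    import torch.nn.functional as F

    def reblur(latent_frames: torch.Tensor) -> torch.Tensor:
        """(B, T, C, H, W) restored sharp frames -> (B, C, H, W) synthetic blurry frame."""
        return latent_frames.mean(dim=1)

    def blur_consistency_loss(latent_frames, blurry_input):
        # Re-blurring the restored sequence should reproduce the blurry input; the same
        # relation on sub-windows lets a longer exposure supervise a shorter one.
        return F.l1_loss(reblur(latent_frames), blurry_input)

    latent = torch.rand(1, 8, 3, 64, 64)     # 8 restored latent frames
    blurry = torch.rand(1, 3, 64, 64)
    print(blur_consistency_loss(latent, blurry).item())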

* Accepted by ICCV 2023 

BigTrans: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages

May 29, 2023
Wen Yang, Chong Li, Jiajun Zhang, Chengqing Zong

Large language models (LLMs) demonstrate promising translation performance across various natural languages. However, many LLMs, especially open-sourced ones such as BLOOM and LLaMA, are English-dominant and support only dozens of natural languages, leaving the potential of LLMs for language translation largely unexplored. In this work, we present BigTrans, which adapts LLaMA (covering only 20 languages) and enhances it with multilingual translation capability for more than 100 languages. BigTrans is built upon LLaMA-13B and is optimized in three steps. First, we continue training LLaMA with massive Chinese monolingual data. Second, we continue training the model with a large-scale parallel dataset covering 102 natural languages. Third, we instruction-tune the foundation model with multilingual translation instructions, leading to our BigTrans model. Preliminary experiments on multilingual translation show that BigTrans performs comparably with ChatGPT and Google Translate in many languages and even outperforms ChatGPT in 8 language pairs. We release the BigTrans model and hope it can advance research progress.

* 12 pages, 4 figures. Our model is available at https://github.com/ZNLP/BigTrans 

Self-Supervised Scene Dynamic Recovery from Rolling Shutter Images and Events

Apr 19, 2023
Yangguang Wang, Xiang Zhang, Mingyuan Lin, Lei Yu, Boxin Shi, Wen Yang, Gui-Song Xia

Scene Dynamic Recovery (SDR) by inverting distorted Rolling Shutter (RS) images into an undistorted high frame-rate Global Shutter (GS) video is a severely ill-posed problem due to the missing temporal dynamic information in both RS intra-frame scanlines and inter-frame exposures, particularly when prior knowledge about camera/object motions is unavailable. Commonly used artificial assumptions on scenes/motions and data-specific characteristics are prone to producing sub-optimal solutions in real-world scenarios. To address this challenge, we propose an event-based SDR network within a self-supervised learning paradigm, i.e., SelfUnroll. We leverage the extremely high temporal resolution of event cameras to provide accurate inter/intra-frame dynamic information. Specifically, an Event-based Inter/intra-frame Compensator (E-IC) is proposed to predict the per-pixel dynamics between arbitrary time intervals, including the temporal transition and spatial translation. Exploring connections in terms of RS-RS, RS-GS, and GS-RS, we explicitly formulate mutual constraints with the proposed E-IC, resulting in supervision without ground-truth GS images. Extensive evaluations on synthetic and real datasets demonstrate that the proposed method achieves state-of-the-art results and shows remarkable performance for event-based RS2GS inversion in real-world scenarios. The dataset and code are available at https://w3un.github.io/selfunroll/.
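
A minimal sketch of the mutual-constraint idea (the toy compensator and names below are assumptions, not the paper's E-IC): warping one rolling-shutter frame toward the other with events and requiring consistency yields supervision without GS ground truth.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyCompensator(nn.Module):
        """Stand-in for an event-based inter/intra-frame compensator."""
        def __init__(self, img_ch=3, ev_ch=5):
            super().__init__()
            self.net = nn.Conv2d(img_ch + ev_ch, img_ch, kernel_size=3, padding=1)

        def forward(self, frame, events):
            return self.net(torch.cat([frame, events], dim=1))

    def rs_rs_constraint(eic, rs_a, rs_b, ev_a2b, ev_b2a):
        """Warp each RS frame toward the other's scanline times and penalize the mismatch;
        the RS-GS and GS-RS constraints follow the same pattern with different targets."""
        return F.l1_loss(eic(rs_a, ev_a2b), rs_b) + F.l1_loss(eic(rs_b, ev_b2a), rs_a)

    eic = ToyCompensator()
    rs_a, rs_b = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    ev_a2b, ev_b2a = torch.rand(1, 5, 64, 64), torch.rand(1, 5, 64, 64)
    print(rs_rs_constraint(eic, rs_a, rs_b, ev_a2b, ev_b2a).item())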

Dynamic Coarse-to-Fine Learning for Oriented Tiny Object Detection

Apr 18, 2023
Chang Xu, Jian Ding, Jinwang Wang, Wen Yang, Huai Yu, Lei Yu, Gui-Song Xia

Detecting arbitrarily oriented tiny objects poses intense challenges to existing detectors, especially for label assignment. Despite the exploration of adaptive label assignment in recent oriented object detectors, the extreme geometry shapes and limited features of oriented tiny objects still induce severe mismatch and imbalance issues. Specifically, the position prior, positive sample features, and instances are mismatched, and the learning of extreme-shaped objects is biased and unbalanced due to the lack of proper feature supervision. To tackle these issues, we propose a dynamic prior along with a coarse-to-fine assigner, dubbed DCFL. On the one hand, we model the prior, label assignment, and object representation dynamically to alleviate the mismatch issue. On the other hand, we leverage coarse prior matching and a finer posterior constraint to dynamically assign labels, providing appropriate and relatively balanced supervision for diverse instances. Extensive experiments on six datasets show substantial improvements over the baseline. Notably, we obtain state-of-the-art performance for one-stage detectors on the DOTA-v1.5, DOTA-v2.0, and DIOR-R datasets under single-scale training and testing. Code is available at https://github.com/Chasel-Tsui/mmrotate-dcfl.
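
An illustrative sketch of a coarse-to-fine assignment step under assumed inputs (not the DCFL implementation): a coarse spatial prior shortlists candidates per instance, and a posterior quality score then keeps only the best few as positives.

    import torch

    def coarse_to_fine_assign(prior_xy, gt_xy, quality, k_coarse=16, k_fine=4):
        """prior_xy: (N, 2) candidate locations; gt_xy: (M, 2) instance centers;
        quality: (N, M) predicted posterior score (e.g. cls * IoU) per pair."""
        dist = torch.cdist(gt_xy, prior_xy)                      # (M, N)
        coarse = dist.topk(k_coarse, largest=False).indices      # nearest candidates
        labels = torch.full((prior_xy.size(0),), -1, dtype=torch.long)  # -1 = background
        for m in range(gt_xy.size(0)):
            cand = coarse[m]
            fine = quality[cand, m].topk(min(k_fine, cand.numel())).indices
            labels[cand[fine]] = m                               # positives for instance m
        return labels

    prior_xy = torch.rand(1000, 2) * 512
    gt_xy = torch.rand(5, 2) * 512
    quality = torch.rand(1000, 5)
    print((coarse_to_fine_assign(prior_xy, gt_xy, quality) >= 0).sum().item(), "positives")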

* Accepted by CVPR2023 

Recovering Continuous Scene Dynamics from A Single Blurry Image with Events

Apr 05, 2023
Zhangyi Cheng, Xiang Zhang, Lei Yu, Jianzhuang Liu, Wen Yang, Gui-Song Xia

This paper aims at demystifying a single motion-blurred image with events and revealing the temporally continuous scene dynamics hidden behind motion blur. To this end, an Implicit Video Function (IVF) is learned to represent a single motion-blurred image with concurrent events, enabling latent sharp image restoration at arbitrary timestamps within the exposure. Specifically, a dual attention transformer is proposed to efficiently leverage the merits of both modalities, i.e., the high temporal resolution of event features and the smoothness of image features, alleviating temporal ambiguities while suppressing event noise. The proposed network is trained only with the supervision of ground-truth images at a limited number of referenced timestamps. Motion- and texture-guided supervisions are employed simultaneously to enhance restorations at the non-referenced timestamps and improve the overall sharpness. Experiments on synthetic, semi-synthetic, and real-world datasets demonstrate that the proposed method outperforms state-of-the-art methods by a large margin in terms of both objective PSNR and SSIM measurements and subjective evaluations.
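
A toy stand-in for the implicit-video-function interface (the dual attention transformer and training losses are not reproduced; all shapes and layers below are assumptions): the network is queried with a normalized timestamp inside the exposure and returns the latent sharp image at that time, which is what permits restoration at arbitrary, non-referenced timestamps.

    import torch
    import torch.nn as nn

    class ToyIVF(nn.Module):
        def __init__(self, img_ch=3, ev_ch=5):
            super().__init__()
            self.net = nn.Conv2d(img_ch + ev_ch + 1, img_ch, kernel_size=3, padding=1)

        def forward(self, blurry, events, t):
            # Broadcast the query timestamp as an extra channel; t in [0, 1].
            t_map = torch.full_like(blurry[:, :1], float(t))
            return self.net(torch.cat([blurry, events, t_map], dim=1))

    ivf = ToyIVF()
    blurry, events = torch.rand(1, 3, 64, 64), torch.rand(1, 5, 64, 64)
    frames = [ivf(blurry, events, t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]  # continuous queries
    print(frames[0].shape)   # training only supervises a few referenced timestamps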

Learning to Super-Resolve Blurry Images with Events

Feb 27, 2023
Lei Yu, Bishan Wang, Xiang Zhang, Haijian Zhang, Wen Yang, Jianzhuang Liu, Gui-Song Xia

Super-Resolution from a single motion-Blurred image (SRB) is a severely ill-posed problem due to the joint degradation of motion blur and low spatial resolution. In this paper, we employ events to alleviate the burden of SRB and propose an Event-enhanced SRB (E-SRB) algorithm, which can generate a sequence of sharp and clear High-Resolution (HR) images from a single blurry Low-Resolution (LR) image. To this end, we formulate an event-enhanced degeneration model that considers low spatial resolution, motion blur, and event noise simultaneously. We then build an event-enhanced Sparse Learning Network (eSL-Net++) upon a dual sparse learning scheme where both events and intensity frames are modeled with sparse representations. Furthermore, we propose an event shuffle-and-merge scheme to extend single-frame SRB to sequence-frame SRB without any additional training. Experimental results on synthetic and real-world datasets show that the proposed eSL-Net++ outperforms state-of-the-art methods by a large margin. Datasets, codes, and more results are available at https://github.com/ShinyWang33/eSL-Net-Plusplus.
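
A hedged sketch of the flavor of event-enhanced degeneration model described above (an EDI-style forward model with downsampling; the contrast threshold, shapes, and the absence of explicit noise terms are assumptions): a low-resolution blurry frame is the downsampled temporal average of event-driven latent frames.

    import torch
    import torch.nn.functional as F

    def degrade(latent_log, events, c=0.2, scale=2):
        """latent_log: (H, W) log intensity at exposure start; events: (T, H, W) polarity
        sums per time step. Returns an LR blurry frame built from the event-driven frames."""
        logs = latent_log + c * torch.cumsum(events, dim=0)        # (T, H, W) log frames
        frames = torch.exp(torch.cat([latent_log[None], logs], 0)) # include the start frame
        blurry = frames.mean(dim=0, keepdim=True)[None]            # temporal averaging -> blur
        return F.avg_pool2d(blurry, scale)                         # spatial downsampling -> LR

    latent_log = torch.log(torch.rand(64, 64) + 0.1)
    events = torch.randn(10, 64, 64).round().clamp(-2, 2)
    print(degrade(latent_log, events).shape)    # torch.Size([1, 1, 32, 32])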

* Accepted by IEEE TPAMI 

High-Resolution Cloud Removal with Multi-Modal and Multi-Resolution Data Fusion: A New Baseline and Benchmark

Jan 09, 2023
Fang Xu, Yilei Shi, Patrick Ebel, Wen Yang, Xiao Xiang Zhu

In this paper, we introduce Planet-CR, a benchmark dataset for high-resolution cloud removal with multi-modal and multi-resolution data fusion. Planet-CR is the first public cloud removal dataset to feature globally sampled high-resolution optical observations, paired radar measurements, and pixel-level land cover annotations. It provides a solid basis for exhaustive evaluation in terms of generating visually pleasing textures and semantically meaningful structures. With this dataset, we consider the problem of cloud removal in high-resolution optical remote sensing imagery by integrating multi-modal and multi-resolution information. Existing multi-modal data fusion based methods, which assume the image pairs are aligned pixel-to-pixel, are not appropriate for this problem. To this end, we design a new baseline named Align-CR that performs low-resolution SAR image guided high-resolution optical image cloud removal. It implicitly aligns the multi-modal and multi-resolution data during the reconstruction process to improve cloud removal performance. Experimental results demonstrate that the proposed Align-CR method achieves the best performance in both visual and semantic recovery quality. The project is available at https://github.com/zhu-xlab/Planet-CR, and we hope it will inspire future research.
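
A hedged sketch of implicitly aligning low-resolution SAR features to high-resolution optical features during fusion (a simple flow-and-warp stand-in, not the Align-CR architecture; all module names and channel sizes are assumptions):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyAlignFuse(nn.Module):
        def __init__(self, ch=32):
            super().__init__()
            self.flow = nn.Conv2d(2 * ch, 2, kernel_size=3, padding=1)   # predict offsets
            self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1)

        def forward(self, opt_feat, sar_feat_lr):
            sar_up = F.interpolate(sar_feat_lr, size=opt_feat.shape[-2:],
                                   mode='bilinear', align_corners=False)
            flow = self.flow(torch.cat([opt_feat, sar_up], dim=1))       # per-pixel shift
            b, _, h, w = opt_feat.shape
            ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                                    indexing='ij')
            grid = torch.stack([xs, ys], dim=-1)[None].expand(b, -1, -1, -1)
            grid = grid + flow.permute(0, 2, 3, 1)                       # warp SAR features
            sar_aligned = F.grid_sample(sar_up, grid, align_corners=False)
            return self.fuse(torch.cat([opt_feat, sar_aligned], dim=1))

    m = ToyAlignFuse()
    out = m(torch.rand(1, 32, 128, 128), torch.rand(1, 32, 32, 32))
    print(out.shape)    # torch.Size([1, 32, 128, 128])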

Attention-Enhanced Cross-modal Localization Between 360 Images and Point Clouds

Dec 06, 2022
Zhipeng Zhao, Huai Yu, Chenwei Lyv, Wen Yang, Sebastian Scherer

Visual localization plays an important role in intelligent robots and autonomous driving, especially when GNSS accuracy is unreliable. Recently, camera localization in LiDAR maps has attracted increasing attention for its low cost and potential robustness to illumination and weather changes. However, the commonly used pinhole camera has a narrow field of view, thus providing limited information compared with omni-directional LiDAR data. To overcome this limitation, we focus on correlating the information of 360 equirectangular images with point clouds, proposing an end-to-end learnable network that conducts cross-modal visual localization by establishing similarity in a high-dimensional feature space. Inspired by the attention mechanism, we optimize the network to capture salient features for comparing images and point clouds. We construct several sequences containing 360 equirectangular images and corresponding point clouds based on the KITTI-360 dataset and conduct extensive experiments. The results demonstrate the effectiveness of our approach.
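
A minimal sketch of retrieval-style cross-modal localization under toy encoders (the paper's attention-enhanced networks and training are not reproduced): the 360 image and candidate point-cloud submaps are embedded into a shared space and matched by cosine similarity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    img_encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))
    pc_encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64))

    equirect = torch.rand(1, 3, 256, 512)                    # 360 equirectangular image
    submaps = [torch.rand(2048, 3) for _ in range(10)]       # candidate LiDAR submaps

    img_desc = F.normalize(img_encoder(equirect), dim=-1)                  # (1, 64)
    pc_desc = F.normalize(torch.stack([pc_encoder(p).max(dim=0).values     # pool over points
                                       for p in submaps]), dim=-1)         # (10, 64)
    scores = img_desc @ pc_desc.T                            # cosine similarity per submap
    print("best submap:", scores.argmax().item())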

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible 

Learning to See Through with Events

Dec 05, 2022
Lei Yu, Xiang Zhang, Wei Liao, Wen Yang, Gui-Song Xia

Although synthetic aperture imaging (SAI) can achieve the seeing-through effect by blurring out off-focus foreground occlusions while recovering in-focus occluded scenes from multi-view images, its performance is often degraded by dense occlusions and extreme lighting conditions. To address this problem, this paper presents an Event-based SAI (E-SAI) method relying on asynchronous events, with extremely low latency and high dynamic range, acquired by an event camera. Specifically, the collected events are first refocused by a Refocus-Net module to align in-focus events while scattering out off-focus ones. Following that, a hybrid network composed of spiking neural networks (SNNs) and convolutional neural networks (CNNs) is proposed to encode the spatio-temporal information from the refocused events and reconstruct a visual image of the occluded targets. Extensive experiments demonstrate that the proposed E-SAI method achieves remarkable performance in dealing with very dense occlusions and extreme lighting conditions and produces high-quality images from pure events. Codes and datasets are available at https://dvs-whu.cn/projects/esai/.
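
A hedged sketch of the classical geometric refocusing that the learned Refocus-Net replaces (a horizontally translating camera is assumed; the focal length, depth, and event layout below are illustrative): shifting each event by the disparity of the focal plane aligns the occluded target's events across views while scattering the occluder's.

    import numpy as np

    def refocus(events, baselines, fx=320.0, depth=2.0):
        """events: structured array with fields x, y, t, p, view; baselines[view] is the
        camera's horizontal offset (m) from the reference view."""
        shifted = events.copy()
        disparity = fx * baselines[events['view']] / depth      # pixels, per event
        shifted['x'] = events['x'] - disparity                  # align in-focus events
        return shifted

    dtype = [('x', 'f4'), ('y', 'f4'), ('t', 'f4'), ('p', 'i1'), ('view', 'i4')]
    events = np.zeros(5, dtype=dtype)
    events['x'], events['view'] = [100, 101, 99, 102, 98], [0, 1, 2, 3, 4]
    baselines = np.linspace(-0.1, 0.1, 5)                       # camera positions
    print(refocus(events, baselines)['x'])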

* Accepted by IEEE TPAMI. arXiv admin note: text overlap with arXiv:2103.02376 