Xu Jia

EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images

May 29, 2024

Wangbo Yu, Chaoran Feng, Jiye Tang, Xu Jia, Li Yuan, Yonghong Tian

Abstract:3D Gaussian Splatting (3D-GS) has demonstrated exceptional capabilities in 3D scene reconstruction and novel view synthesis. However, its training heavily depends on high-quality, sharp images and accurate camera poses. Fulfilling these requirements can be challenging in non-ideal real-world scenarios, where motion-blurred images are commonly encountered in high-speed moving cameras or low-light environments that require long exposure times. To address these challenges, we introduce Event Stream Assisted Gaussian Splatting (EvaGaussians), a novel approach that integrates event streams captured by an event camera to assist in reconstructing high-quality 3D-GS from blurry images. Capitalizing on the high temporal resolution and dynamic range offered by the event camera, we leverage the event streams to explicitly model the formation process of motion-blurred images and guide the deblurring reconstruction of 3D-GS. By jointly optimizing the 3D-GS parameters and recovering camera motion trajectories during the exposure time, our method can robustly facilitate the acquisition of high-fidelity novel views with intricate texture details. We comprehensively evaluated our method and compared it with previous state-of-the-art deblurring rendering methods. Both qualitative and quantitative comparisons demonstrate that our method surpasses existing techniques in restoring fine details from blurry images and producing high-fidelity novel views.

* Project Page: https://drexubery.github.io/EvaGaussians/
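The abstract states that the blur formation process is modeled explicitly. A minimal sketch of the usual discrete approximation follows, assuming a blurry frame is the average of sharp frames rendered at poses sampled along the exposure-time trajectory. This is an illustration, not the authors' implementation: the function names `synthesize_blur` and `blur_loss` are hypothetical, and the event-stream supervision is omitted.

```python
import numpy as np

def synthesize_blur(sharp_frames):
    """Approximate a motion-blurred image as the mean of N sharp frames.

    sharp_frames: array of shape (N, H, W) or (N, H, W, C), one frame per
    pose sampled during the exposure (assumed discretization).
    """
    frames = np.asarray(sharp_frames, dtype=np.float64)
    return frames.mean(axis=0)

def blur_loss(rendered_frames, observed_blurry):
    """Mean squared error between the synthesized and the observed blurry image."""
    synth = synthesize_blur(rendered_frames)
    return float(np.mean((synth - np.asarray(observed_blurry)) ** 2))

# Toy example: a bright pixel sweeping across a 1x5 strip during exposure.
frames = np.zeros((5, 1, 5))
for t in range(5):
    frames[t, 0, t] = 1.0

blurry = synthesize_blur(frames)       # the point is smeared over the strip
loss = blur_loss(frames, blurry)       # zero when the renders explain the blur
```

In a joint optimization of the kind the abstract describes, a loss like this would be minimized over both the scene parameters and the per-exposure camera poses that produce `rendered_frames`.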

CharacterFactory: Sampling Consistent Characters with GANs for Diffusion Models

Apr 27, 2024

Qinghe Wang, Baolu Li, Xiaomin Li, Bing Cao, Liqian Ma, Huchuan Lu, Xu Jia

Abstract:Recent advances in text-to-image models have opened new frontiers in human-centric generation. However, these models cannot be directly employed to generate images with consistent newly coined identities. In this work, we propose CharacterFactory, a framework that allows sampling new characters with consistent identities in the latent space of GANs for diffusion models. More specifically, we consider the word embeddings of celeb names as ground truths for the identity-consistent generation task and train a GAN model to learn the mapping from a latent space to the celeb embedding space. In addition, we design a context-consistent loss to ensure that the generated identity embeddings can produce identity-consistent images in various contexts. Remarkably, the whole model only takes 10 minutes for training, and can sample infinite characters end-to-end during inference. Extensive experiments demonstrate excellent performance of the proposed CharacterFactory on character creation in terms of identity consistency and editability. Furthermore, the generated characters can be seamlessly combined with the off-the-shelf image/video/3D diffusion models. We believe that the proposed CharacterFactory is an important step for identity-consistent character generation. Project page is available at: https://qinghew.github.io/CharacterFactory/.

* Code will be released very soon: https://github.com/qinghew/CharacterFactory

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

Apr 16, 2024

Yanze Li, Wenhua Zhang, Kai Chen, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li (+3 more)

Abstract:Large Vision-Language Models (LVLMs), due to the remarkable visual reasoning ability to understand images and videos, have received widespread attention in the autonomous driving domain, which significantly advances the development of interpretable end-to-end autonomous driving. However, current evaluations of LVLMs primarily focus on the multi-faceted capabilities in common scenarios, lacking quantifiable and automated assessment in autonomous driving contexts, let alone severe road corner cases that even the state-of-the-art autonomous driving perception systems struggle to handle. In this paper, we propose CODA-LM, a novel vision-language benchmark for self-driving, which provides the first automatic and quantitative evaluation of LVLMs for interpretable autonomous driving including general perception, regional perception, and driving suggestions. CODA-LM utilizes the texts to describe the road images, exploiting powerful text-only large language models (LLMs) without image inputs to assess the capabilities of LVLMs in autonomous driving scenarios, which reveals stronger alignment with human preferences than LVLM judges. Experiments demonstrate that even the closed-sourced commercial LVLMs like GPT-4V cannot deal with road corner cases well, suggesting that we are still far from a strong LVLM-powered intelligent driving agent, and we hope our CODA-LM can become the catalyst to promote future development.

* Project Page: https://coda-dataset.github.io/coda-lm/

StableIdentity: Inserting Anybody into Anywhere at First Sight

Jan 29, 2024

Qinghe Wang, Xu Jia, Xiaomin Li, Taiqing Li, Liqian Ma, Yunzhi Zhuge, Huchuan Lu

Abstract: Recent advances in large pretrained text-to-image models have shown unprecedented capabilities for high-quality human-centric generation; however, customizing face identity is still an intractable problem. Existing methods cannot ensure stable identity preservation and flexible editability, even with several images for each subject during training. In this work, we propose StableIdentity, which allows identity-consistent recontextualization with just one face image. More specifically, we employ a face encoder with an identity prior to encode the input face, and then land the face representation into a space with an editable prior, which is constructed from celeb names. By incorporating the identity prior and the editability prior, the learned identity can be injected anywhere with various contexts. In addition, we design a masked two-phase diffusion loss to boost the pixel-level perception of the input face and maintain the diversity of generation. Extensive experiments demonstrate that our method outperforms previous customization methods. In addition, the learned identity can be flexibly combined with off-the-shelf modules such as ControlNet. Notably, to the best of our knowledge, we are the first to directly inject the identity learned from a single image into video/3D generation without finetuning. We believe that the proposed StableIdentity is an important step to unify image, video, and 3D customized generation models.

TrackDiffusion: Multi-object Tracking Data Generation via Diffusion Models

Dec 01, 2023

Pengxiang Li, Zhili Liu, Kai Chen, Lanqing Hong, Yunzhi Zhuge, Dit-Yan Yeung, Huchuan Lu, Xu Jia

Abstract:Diffusion models have gained prominence in generating data for perception tasks such as image classification and object detection. However, the potential in generating high-quality tracking sequences, a crucial aspect in the field of video perception, has not been fully investigated. To address this gap, we propose TrackDiffusion, a novel architecture designed to generate continuous video sequences from the tracklets. TrackDiffusion represents a significant departure from the traditional layout-to-image (L2I) generation and copy-paste synthesis focusing on static image elements like bounding boxes by empowering image diffusion models to encompass dynamic and continuous tracking trajectories, thereby capturing complex motion nuances and ensuring instance consistency among video frames. For the first time, we demonstrate that the generated video sequences can be utilized for training multi-object tracking (MOT) systems, leading to significant improvement in tracker performance. Experimental results show that our model significantly enhances instance consistency in generated video sequences, leading to improved perceptual metrics. Our approach achieves an improvement of 8.7 in TrackAP and 11.8 in TrackAP$_{50}$ on the YTVIS dataset, underscoring its potential to redefine the standards of video data generation for MOT tasks and beyond.

GenTKG: Generative Forecasting on Temporal Knowledge Graph

Oct 11, 2023

Ruotong Liao, Xu Jia, Yunpu Ma, Volker Tresp

Abstract:The rapid advancements in large language models (LLMs) have ignited interest in the temporal knowledge graph (tKG) domain, where conventional carefully designed embedding-based and rule-based models dominate. The question remains open of whether pre-trained LLMs can understand structured temporal relational data and replace them as the foundation model for temporal relational forecasting. Therefore, we bring temporal knowledge forecasting into the generative setting. However, challenges occur in the huge chasms between complex temporal graph data structure and sequential natural expressions LLMs can handle, and between the enormous data sizes of tKGs and heavy computation costs of finetuning LLMs. To address these challenges, we propose a novel retrieval augmented generation framework that performs generative forecasting on tKGs named GenTKG, which combines a temporal logical rule-based retrieval strategy and lightweight parameter-efficient instruction tuning. Extensive experiments have shown that GenTKG outperforms conventional methods of temporal relational forecasting under low computation resources. GenTKG also highlights remarkable transferability with exceeding performance on unseen datasets without re-training. Our work reveals the huge potential of LLMs in the tKG domain and opens a new frontier for generative forecasting on tKGs.

* 8 pages

UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory

Aug 28, 2023

Haiwen Diao, Bo Wan, Ying Zhang, Xu Jia, Huchuan Lu, Long Chen

Abstract: Fine-tuning pre-trained models has emerged as a powerful technique in numerous domains, owing to its ability to leverage enormous pre-existing knowledge and achieve remarkable performance on downstream tasks. However, updating the parameters of entire networks is computationally intensive. Although state-of-the-art parameter-efficient transfer learning (PETL) methods significantly reduce the trainable parameters and storage demand, almost all of them still need to back-propagate the gradients through large pre-trained networks. This memory-intensive characteristic severely limits the applicability of PETL methods in real-world scenarios. To this end, we propose a new memory-efficient PETL strategy, dubbed Universal Parallel Tuning (UniPT). Specifically, we facilitate the transfer process via a lightweight learnable parallel network, which consists of two modules: 1) a parallel interaction module that decouples the inherently sequential connections and processes the intermediate activations of the pre-trained network in a detached manner; 2) a confidence aggregation module that adaptively learns optimal strategies for integrating cross-layer features. We evaluate UniPT with different backbones (e.g., VSE$\infty$, CLIP4Clip, Clip-ViL, and MDETR) on five challenging vision-and-language tasks (i.e., image-text retrieval, video-text retrieval, visual question answering, compositional question answering, and visual grounding). Extensive ablations on ten datasets have validated that UniPT can not only dramatically reduce memory consumption and outperform the best memory-efficient competitor, but also achieve higher performance than existing PETL methods in a low-memory scenario on different architectures. Our code is publicly available at: https://github.com/Paranioar/UniPT.

* 13 pages, 5 figures
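The abstract mentions a confidence aggregation module that integrates cross-layer features. One simple way to picture such a module is a softmax-weighted sum over per-layer features, sketched below. The weighting scheme, the function names, and the shapes are all assumptions made for illustration; the actual design is in the linked repository and may differ.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def aggregate_layers(layer_feats, scores):
    """Combine per-layer features with confidence weights.

    layer_feats: (L, D) array, one D-dimensional feature per layer.
    scores:      (L,) array of confidence logits (learnable in practice;
                 fixed here for illustration).
    Returns a (D,) feature: the softmax-weighted sum over layers.
    """
    weights = softmax(np.asarray(scores, dtype=np.float64))
    return weights @ np.asarray(layer_feats, dtype=np.float64)

# With equal confidence scores the result is the plain mean over layers.
feats = np.array([[1.0, 2.0], [3.0, 4.0]])
fused = aggregate_layers(feats, np.array([0.0, 0.0]))
```

Because the features in `layer_feats` would come from detached activations, gradients in such a setup only flow through the small aggregation weights, which is the memory saving the abstract emphasizes.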

Neural Image Re-Exposure

May 23, 2023

Xinyu Zhang, Hefei Huang, Xu Jia, Dong Wang, Huchuan Lu

Abstract:The shutter strategy applied to the photo-shooting process has a significant influence on the quality of the captured photograph. An improper shutter may lead to a blurry image, video discontinuity, or rolling shutter artifact. Existing works try to provide an independent solution for each issue. In this work, we aim to re-expose the captured photo in post-processing to provide a more flexible way of addressing those issues within a unified framework. Specifically, we propose a neural network-based image re-exposure framework. It consists of an encoder for visual latent space construction, a re-exposure module for aggregating information to neural film with a desired shutter strategy, and a decoder for 'developing' neural film into a desired image. To compensate for information confusion and missing frames, event streams, which can capture almost continuous brightness changes, are leveraged in computing visual latent content. Both self-attention layers and cross-attention layers are employed in the re-exposure module to promote interaction between neural film and visual latent content and information aggregation to neural film. The proposed unified image re-exposure framework is evaluated on several shutter-related image recovery tasks and performs favorably against independent state-of-the-art methods.

Pre-trained Language Model with Prompts for Temporal Knowledge Graph Completion

May 13, 2023

Wenjie Xu, Ben Liu, Miao Peng, Xu Jia, Min Peng

Abstract: Temporal knowledge graph completion (TKGC) is a crucial task that involves reasoning at known timestamps to complete the missing part of facts and has attracted more and more attention in recent years. Most existing methods focus on learning representations based on graph neural networks while inaccurately extracting information from timestamps and insufficiently utilizing the implied information in relations. To address these problems, we propose a novel TKGC model, namely Pre-trained Language Model with Prompts for TKGC (PPT). We convert a series of sampled quadruples into pre-trained language model inputs and convert intervals between timestamps into different prompts to make coherent sentences with implicit semantic information. We train our model with a masking strategy to convert the TKGC task into a masked token prediction task, which can leverage the semantic information in pre-trained language models. Experiments on three benchmark datasets and extensive analysis demonstrate that our model is highly competitive with other models on four metrics. Our model can effectively incorporate information from temporal knowledge graphs into the language models.

* Accepted to Findings of ACL 2023
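The conversion described in the abstract (quadruples to sentences, timestamp intervals to prompts, final entity masked) can be sketched roughly as follows. The interval buckets, the prompt wording, and the function names are hypothetical choices for illustration and are not taken from the paper.

```python
def interval_prompt(delta_days):
    """Map a gap between timestamps (in days) to a short textual prompt.

    The thresholds and phrases are illustrative assumptions.
    """
    if delta_days == 0:
        return "on the same day,"
    if delta_days <= 7:
        return "a few days later,"
    if delta_days <= 365:
        return "some months later,"
    return "years later,"

def build_masked_input(quadruples, mask_token="[MASK]"):
    """Turn time-ordered (subject, relation, object, day) quadruples into one
    sentence sequence, masking the object of the last quadruple so that the
    completion task becomes masked token prediction.
    """
    parts = []
    prev_day = None
    for i, (subj, rel, obj, day) in enumerate(quadruples):
        if prev_day is not None:
            parts.append(interval_prompt(day - prev_day))
        target = mask_token if i == len(quadruples) - 1 else obj
        parts.append(f"{subj} {rel} {target}.")
        prev_day = day
    return " ".join(parts)

quads = [("A", "visits", "B", 0), ("B", "hosts", "C", 3)]
text = build_masked_input(quads)
# text == "A visits B. a few days later, B hosts [MASK]."
```

A masked language model would then be asked to fill the `[MASK]` position, with the interval prompts carrying the temporal information in natural-language form.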

GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from Multi-view Images

Mar 24, 2023

Jianchuan Chen, Wentao Yi, Liqian Ma, Xu Jia, Huchuan Lu

Abstract:In this work, we focus on synthesizing high-fidelity novel view images for arbitrary human performers, given a set of sparse multi-view images. It is a challenging task due to the large variation among articulated body poses and heavy self-occlusions. To alleviate this, we introduce an effective generalizable framework Generalizable Model-based Neural Radiance Fields (GM-NeRF) to synthesize free-viewpoint images. Specifically, we propose a geometry-guided attention mechanism to register the appearance code from multi-view 2D images to a geometry proxy which can alleviate the misalignment between inaccurate geometry prior and pixel space. On top of that, we further conduct neural rendering and partial gradient backpropagation for efficient perceptual supervision and improvement of the perceptual quality of synthesis. To evaluate our method, we conduct experiments on synthesized datasets THuman2.0 and Multi-garment, and real-world datasets Genebody and ZJUMocap. The results demonstrate that our approach outperforms state-of-the-art methods in terms of novel view synthesis and geometric reconstruction.

* Accepted at CVPR 2023
