Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Haoqian Wang

TASR: Timestep-Aware Diffusion Model for Image Super-Resolution

Dec 04, 2024

Qinwei Lin, Xiaopeng Sun, Yu Gao, Yujie Zhong, Dengjie Li, Zheng Zhao, Haoqian Wang

Abstract:Diffusion models have recently achieved outstanding results in the field of image super-resolution. These methods typically inject low-resolution (LR) images via ControlNet.In this paper, we first explore the temporal dynamics of information infusion through ControlNet, revealing that the input from LR images predominantly influences the initial stages of the denoising process. Leveraging this insight, we introduce a novel timestep-aware diffusion model that adaptively integrates features from both ControlNet and the pre-trained Stable Diffusion (SD). Our method enhances the transmission of LR information in the early stages of diffusion to guarantee image fidelity and stimulates the generation ability of the SD model itself more in the later stages to enhance the detail of generated images. To train this method, we propose a timestep-aware training strategy that adopts distinct losses at varying timesteps and acts on disparate modules. Experiments on benchmark datasets demonstrate the effectiveness of our method. Code: https://github.com/SleepyLin/TASR

Via

Access Paper or Ask Questions

TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers

Aug 25, 2024

Chuanrui Zhang, Yingshuang Zou, Zhuoling Li, Minmin Yi, Haoqian Wang

Figure 1 for TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers

Figure 2 for TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers

Figure 3 for TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers

Figure 4 for TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers

Abstract:Compared with previous 3D reconstruction methods like Nerf, recent Generalizable 3D Gaussian Splatting (G-3DGS) methods demonstrate impressive efficiency even in the sparse-view setting. However, the promising reconstruction performance of existing G-3DGS methods relies heavily on accurate multi-view feature matching, which is quite challenging. Especially for the scenes that have many non-overlapping areas between various views and contain numerous similar regions, the matching performance of existing methods is poor and the reconstruction precision is limited. To address this problem, we develop a strategy that utilizes a predicted depth confidence map to guide accurate local feature matching. In addition, we propose to utilize the knowledge of existing monocular depth estimation models as prior to boost the depth estimation precision in non-overlapping areas between views. Combining the proposed strategies, we present a novel G-3DGS method named TranSplat, which obtains the best performance on both the RealEstate10K and ACID benchmarks while maintaining competitive speed and presenting strong cross-dataset generalization ability. Our code, and demos will be available at: https://xingyoujun.github.io/transplat.

Via

Access Paper or Ask Questions

Spatiotemporal Graph Guided Multi-modal Network for Livestreaming Product Retrieval

Jul 24, 2024

Xiaowan Hu, Yiyi Chen, Yan Li, Minquan Wang, Haoqian Wang, Quan Chen, Han Li, Peng Jiang

Abstract:With the rapid expansion of e-commerce, more consumers have become accustomed to making purchases via livestreaming. Accurately identifying the products being sold by salespeople, i.e., livestreaming product retrieval (LPR), poses a fundamental and daunting challenge. The LPR task encompasses three primary dilemmas in real-world scenarios: 1) the recognition of intended products from distractor products present in the background; 2) the video-image heterogeneity that the appearance of products showcased in live streams often deviates substantially from standardized product images in stores; 3) there are numerous confusing products with subtle visual nuances in the shop. To tackle these challenges, we propose the Spatiotemporal Graphing Multi-modal Network (SGMN). First, we employ a text-guided attention mechanism that leverages the spoken content of salespeople to guide the model to focus toward intended products, emphasizing their salience over cluttered background products. Second, a long-range spatiotemporal graph network is further designed to achieve both instance-level interaction and frame-level matching, solving the misalignment caused by video-image heterogeneity. Third, we propose a multi-modal hard example mining, assisting the model in distinguishing highly similar products with fine-grained features across the video-image-text domain. Through extensive quantitative and qualitative experiments, we demonstrate the superior performance of our proposed SGMN model, surpassing the state-of-the-art methods by a substantial margin. The code is available at https://github.com/Huxiaowan/SGMN.

* 9 pages, 12 figures

Via

Access Paper or Ask Questions

Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images

Jul 09, 2024

Chuanrui Zhang, Yonggen Ling, Minglei Lu, Minghan Qin, Haoqian Wang

Abstract:We study the 3D object understanding task for manipulating everyday objects with different material properties (diffuse, specular, transparent and mixed). Existing monocular and RGB-D methods suffer from scale ambiguity due to missing or imprecise depth measurements. We present CODERS, a one-stage approach for Category-level Object Detection, pose Estimation and Reconstruction from Stereo images. The base of our pipeline is an implicit stereo matching module that combines stereo image features with 3D position information. Concatenating this presented module and the following transform-decoder architecture leads to end-to-end learning of multiple tasks required by robot manipulation. Our approach significantly outperforms all competing methods in the public TOD dataset. Furthermore, trained on simulated data, CODERS generalize well to unseen category-level object instances in real-world robot manipulation experiments. Our dataset, code, and demos will be available on our project page.

Via

Access Paper or Ask Questions

Monocular Gaussian SLAM with Language Extended Loop Closure

May 22, 2024

Tian Lan, Qinwei Lin, Haoqian Wang

Figure 1 for Monocular Gaussian SLAM with Language Extended Loop Closure

Figure 2 for Monocular Gaussian SLAM with Language Extended Loop Closure

Figure 3 for Monocular Gaussian SLAM with Language Extended Loop Closure

Figure 4 for Monocular Gaussian SLAM with Language Extended Loop Closure

Abstract:Recently,3DGaussianSplattinghasshowngreatpotentialin visual Simultaneous Localization And Mapping (SLAM). Existing methods have achieved encouraging results on RGB-D SLAM, but studies of the monocular case are still scarce. Moreover, they also fail to correct drift errors due to the lack of loop closure and global optimization. In this paper, we present MG-SLAM, a monocular Gaussian SLAM with a language-extended loop closure module capable of performing drift-corrected tracking and high-fidelity reconstruction while achieving a high-level understanding of the environment. Our key idea is to represent the global map as 3D Gaussian and use it to guide the estimation of the scene geometry, thus mitigating the efforts of missing depth information. Further, an additional language-extended loop closure module which is based on CLIP feature is designed to continually perform global optimization to correct drift errors accumulated as the system runs. Our system shows promising results on multiple challenging datasets in both tracking and mapping and even surpasses some existing RGB-D methods.

Via

Access Paper or Ask Questions

M${^2}$Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation

May 03, 2024

Yingshuang Zou, Yikang Ding, Xi Qiu, Haoqian Wang, Haotian Zhang

$Figure 1 for M${^2}$Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation$

$Figure 2 for M${^2}$Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation$

$Figure 3 for M${^2}$Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation$

$Figure 4 for M${^2}$Depth: Self-supervised Two-Frame Multi-camera Metric Depth Estimation$

Abstract:This paper presents a novel self-supervised two-frame multi-camera metric depth estimation network, termed M${^2}$Depth, which is designed to predict reliable scale-aware surrounding depth in autonomous driving. Unlike the previous works that use multi-view images from a single time-step or multiple time-step images from a single camera, M${^2}$Depth takes temporally adjacent two-frame images from multiple cameras as inputs and produces high-quality surrounding depth. We first construct cost volumes in spatial and temporal domains individually and propose a spatial-temporal fusion module that integrates the spatial-temporal information to yield a strong volume presentation. We additionally combine the neural prior from SAM features with internal features to reduce the ambiguity between foreground and background and strengthen the depth edges. Extensive experimental results on nuScenes and DDAD benchmarks show M${^2}$Depth achieves state-of-the-art performance. More results can be found in https://heiheishuang.xyz/M2Depth .

Via

Access Paper or Ask Questions

TexVocab: Texture Vocabulary-conditioned Human Avatars

Mar 31, 2024

Yuxiao Liu, Zhe Li, Yebin Liu, Haoqian Wang

Figure 1 for TexVocab: Texture Vocabulary-conditioned Human Avatars

Figure 2 for TexVocab: Texture Vocabulary-conditioned Human Avatars

Figure 3 for TexVocab: Texture Vocabulary-conditioned Human Avatars

Figure 4 for TexVocab: Texture Vocabulary-conditioned Human Avatars

Abstract:To adequately utilize the available image evidence in multi-view video-based avatar modeling, we propose TexVocab, a novel avatar representation that constructs a texture vocabulary and associates body poses with texture maps for animation. Given multi-view RGB videos, our method initially back-projects all the available images in the training videos to the posed SMPL surface, producing texture maps in the SMPL UV domain. Then we construct pairs of human poses and texture maps to establish a texture vocabulary for encoding dynamic human appearances under various poses. Unlike the commonly used joint-wise manner, we further design a body-part-wise encoding strategy to learn the structural effects of the kinematic chain. Given a driving pose, we query the pose feature hierarchically by decomposing the pose vector into several body parts and interpolating the texture features for synthesizing fine-grained human dynamics. Overall, our method is able to create animatable human avatars with detailed and dynamic appearances from RGB videos, and the experiments show that our method outperforms state-of-the-art approaches. The project page can be found at https://texvocab.github.io/.

Via

Access Paper or Ask Questions

Gaussian in the Wild: 3D Gaussian Splatting for Unconstrained Image Collections

Mar 23, 2024

Dongbin Zhang, Chuming Wang, Weitao Wang, Peihao Li, Minghan Qin, Haoqian Wang

Abstract:Novel view synthesis from unconstrained in-the-wild images remains a meaningful but challenging task. The photometric variation and transient occluders in those unconstrained images make it difficult to reconstruct the original scene accurately. Previous approaches tackle the problem by introducing a global appearance feature in Neural Radiance Fields (NeRF). However, in the real world, the unique appearance of each tiny point in a scene is determined by its independent intrinsic material attributes and the varying environmental impacts it receives. Inspired by this fact, we propose Gaussian in the wild (GS-W), a method that uses 3D Gaussian points to reconstruct the scene and introduces separated intrinsic and dynamic appearance feature for each point, capturing the unchanged scene appearance along with dynamic variation like illumination and weather. Additionally, an adaptive sampling strategy is presented to allow each Gaussian point to focus on the local and detailed information more effectively. We also reduce the impact of transient occluders using a 2D visibility map. More experiments have demonstrated better reconstruction quality and details of GS-W compared to previous methods, with a $1000\times$ increase in rendering speed.

* 14 pages, 5 figures

Via

Access Paper or Ask Questions

FFCA-Net: Stereo Image Compression via Fast Cascade Alignment of Side Information

Dec 29, 2023

Yichong Xia, Yujun Huang, Bin Chen, Haoqian Wang, Yaowei Wang

Figure 1 for FFCA-Net: Stereo Image Compression via Fast Cascade Alignment of Side Information

Figure 2 for FFCA-Net: Stereo Image Compression via Fast Cascade Alignment of Side Information

Figure 3 for FFCA-Net: Stereo Image Compression via Fast Cascade Alignment of Side Information

Figure 4 for FFCA-Net: Stereo Image Compression via Fast Cascade Alignment of Side Information

Abstract:Multi-view compression technology, especially Stereo Image Compression (SIC), plays a crucial role in car-mounted cameras and 3D-related applications. Interestingly, the Distributed Source Coding (DSC) theory suggests that efficient data compression of correlated sources can be achieved through independent encoding and joint decoding. This motivates the rapidly developed deep-distributed SIC methods in recent years. However, these approaches neglect the unique characteristics of stereo-imaging tasks and incur high decoding latency. To address this limitation, we propose a Feature-based Fast Cascade Alignment network (FFCA-Net) to fully leverage the side information on the decoder. FFCA adopts a coarse-to-fine cascaded alignment approach. In the initial stage, FFCA utilizes a feature domain patch-matching module based on stereo priors. This module reduces redundancy in the search space of trivial matching methods and further mitigates the introduction of noise. In the subsequent stage, we utilize an hourglass-based sparse stereo refinement network to further align inter-image features with a reduced computational cost. Furthermore, we have devised a lightweight yet high-performance feature fusion network, called a Fast Feature Fusion network (FFF), to decode the aligned features. Experimental results on InStereo2K, KITTI, and Cityscapes datasets demonstrate the significant superiority of our approach over traditional and learning-based SIC methods. In particular, our approach achieves significant gains in terms of 3 to 10-fold faster decoding speed than other methods.

Via

Access Paper or Ask Questions

LangSplat: 3D Language Gaussian Splatting

Dec 26, 2023

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, Hanspeter Pfister

Abstract:Human lives in a 3D world and commonly uses natural language to interact with a 3D scene. Modeling a 3D language field to support open-ended language queries in 3D has gained increasing attention recently. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat advances the field by utilizing a collection of 3D Gaussians, each encoding language features distilled from CLIP, to represent the language field. By employing a tile-based splatting technique for rendering language features, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features on the scene-specific latent space, thereby alleviating substantial memory demands imposed by explicit modeling. Existing methods struggle with imprecise and vague 3D language fields, which fail to discern clear boundaries between objects. We delve into this issue and propose to learn hierarchical semantics using SAM, thereby eliminating the need for extensively querying the language field across various scales and the regularization of DINO features. Extensive experiments on open-vocabulary 3D object localization and semantic segmentation demonstrate that LangSplat significantly outperforms the previous state-of-the-art method LERF by a large margin. Notably, LangSplat is extremely efficient, achieving a {\speed} $\times$ speedup compared to LERF at the resolution of 1440 $\times$ 1080. We strongly recommend readers to check out our video results at https://langsplat.github.io

* Project Page: https://langsplat.github.io

Via

Access Paper or Ask Questions