Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yang Tang

DidSee: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation

Jun 26, 2025

Wenzhou Lyu, Jialing Lin, Wenqi Ren, Ruihao Xia, Feng Qian, Yang Tang

Abstract:Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. Traditional depth completion methods struggle to generalize due to the limited diversity and scale of training data. Recent advances exploit visual priors from pre-trained text-to-image diffusion models to enhance generalization in dense prediction tasks. However, we find that biases arising from training-inference mismatches in the vanilla diffusion framework significantly impair depth completion performance. Additionally, the lack of distinct visual features in non-Lambertian regions further hinders precise prediction. To address these issues, we propose \textbf{DidSee}, a diffusion-based framework for depth completion on non-Lambertian objects. First, we integrate a rescaled noise scheduler enforcing a zero terminal signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a noise-agnostic single-step training formulation to alleviate error accumulation caused by exposure bias and optimize the model with a task-specific loss. Finally, we incorporate a semantic enhancer that enables joint depth completion and semantic segmentation, distinguishing objects from backgrounds and yielding precise, fine-grained depth maps. DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and effectively improves downstream tasks such as category-level pose estimation and robotic grasping.Project page: https://wenzhoulyu.github.io/DidSee/

Via

Access Paper or Ask Questions

On the Practice of Deep Hierarchical Ensemble Network for Ad Conversion Rate Prediction

Apr 10, 2025

Jinfeng Zhuang, Yinrui Li, Runze Su, Ke Xu, Zhixuan Shao, Kungang Li, Ling Leng, Han Sun, Meng Qi, Yixiong Meng(+4 more)

Abstract:The predictions of click through rate (CTR) and conversion rate (CVR) play a crucial role in the success of ad-recommendation systems. A Deep Hierarchical Ensemble Network (DHEN) has been proposed to integrate multiple feature crossing modules and has achieved great success in CTR prediction. However, its performance for CVR prediction is unclear in the conversion ads setting, where an ad bids for the probability of a user's off-site actions on a third party website or app, including purchase, add to cart, sign up, etc. A few challenges in DHEN: 1) What feature-crossing modules (MLP, DCN, Transformer, to name a few) should be included in DHEN? 2) How deep and wide should DHEN be to achieve the best trade-off between efficiency and efficacy? 3) What hyper-parameters to choose in each feature-crossing module? Orthogonal to the model architecture, the input personalization features also significantly impact model performance with a high degree of freedom. In this paper, we attack this problem and present our contributions biased to the applied data science side, including: First, we propose a multitask learning framework with DHEN as the single backbone model architecture to predict all CVR tasks, with a detailed study on how to make DHEN work effectively in practice; Second, we build both on-site real-time user behavior sequences and off-site conversion event sequences for CVR prediction purposes, and conduct ablation study on its importance; Last but not least, we propose a self-supervised auxiliary loss to predict future actions in the input sequence, to help resolve the label sparseness issue in CVR prediction. Our method achieves state-of-the-art performance compared to previous single feature crossing modules with pre-trained user personalization features.

* Accepted by WWW 2025

Via

Access Paper or Ask Questions

DRAN: A Distribution and Relation Adaptive Network for Spatio-temporal Forecasting

Apr 02, 2025

Xiaobei Zou, Luolin Xiong, Kexuan Zhang, Cesare Alippi, Yang Tang

Figure 1 for DRAN: A Distribution and Relation Adaptive Network for Spatio-temporal Forecasting

Figure 2 for DRAN: A Distribution and Relation Adaptive Network for Spatio-temporal Forecasting

Figure 3 for DRAN: A Distribution and Relation Adaptive Network for Spatio-temporal Forecasting

Figure 4 for DRAN: A Distribution and Relation Adaptive Network for Spatio-temporal Forecasting

Abstract:Accurate predictions of spatio-temporal systems' states are crucial for tasks such as system management, control, and crisis prevention. However, the inherent time variance of spatio-temporal systems poses challenges to achieving accurate predictions whenever stationarity is not granted. To address non-stationarity frameworks, we propose a Distribution and Relation Adaptive Network (DRAN) capable of dynamically adapting to relation and distribution changes over time. While temporal normalization and de-normalization are frequently used techniques to adapt to distribution shifts, this operation is not suitable for the spatio-temporal context as temporal normalization scales the time series of nodes and possibly disrupts the spatial relations among nodes. In order to address this problem, we develop a Spatial Factor Learner (SFL) module that enables the normalization and de-normalization process in spatio-temporal systems. To adapt to dynamic changes in spatial relationships among sensors, we propose a Dynamic-Static Fusion Learner (DSFL) module that effectively integrates features learned from both dynamic and static relations through an adaptive fusion ratio mechanism. Furthermore, we introduce a Stochastic Learner to capture the noisy components of spatio-temporal representations. Our approach outperforms state of the art methods in weather prediction and traffic flows forecasting tasks. Experimental results show that our SFL efficiently preserves spatial relationships across various temporal normalization operations. Visualizations of the learned dynamic and static relations demonstrate that DSFL can capture both local and distant relationships between nodes. Moreover, ablation studies confirm the effectiveness of each component.

* 15 pages, 9 figures

Via

Access Paper or Ask Questions

Causal Learning for Heterogeneous Subgroups Based on Nonlinear Causal Kernel Clustering

Jan 20, 2025

Lu Liu, Yang Tang, Kexuan Zhang, Qiyu Sun

Abstract:Due to the challenge posed by multi-source and heterogeneous data collected from diverse environments, causal relationships among features can exhibit variations influenced by different time spans, regions, or strategies. This diversity makes a single causal model inadequate for accurately representing complex causal relationships in all observational data, a crucial consideration in causal learning. To address this challenge, we introduce the nonlinear Causal Kernel Clustering method designed for heterogeneous subgroup causal learning, illuminating variations in causal relationships across diverse subgroups. It comprises two primary components. First, the construction of a sample mapping function forms the basis of the subsequent nonlinear causal kernel. This function assesses the differences in potential nonlinear causal relationships in various samples, supported by our causal identifiability theory. Second, a nonlinear causal kernel is proposed for clustering heterogeneous subgroups. Experimental results showcase the exceptional performance of our method in accurately identifying heterogeneous subgroups and effectively enhancing causal learning, leading to a great reduction in prediction error.

Via

Access Paper or Ask Questions

Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation

Oct 29, 2024

Ruihao Xia, Yu Liang, Peng-Tao Jiang, Hao Zhang, Bo Li, Yang Tang, Pan Zhou

Figure 1 for Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation

Figure 2 for Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation

Figure 3 for Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation

Figure 4 for Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation

Abstract:Despite their success, unsupervised domain adaptation methods for semantic segmentation primarily focus on adaptation between image domains and do not utilize other abundant visual modalities like depth, infrared and event. This limitation hinders their performance and restricts their application in real-world multimodal scenarios. To address this issue, we propose Modality Adaptation with text-to-image Diffusion Models (MADM) for semantic segmentation task which utilizes text-to-image diffusion models pre-trained on extensive image-text pairs to enhance the model's cross-modality capabilities. Specifically, MADM comprises two key complementary components to tackle major challenges. First, due to the large modality gap, using one modal data to generate pseudo labels for another modality suffers from a significant drop in accuracy. To address this, MADM designs diffusion-based pseudo-label generation which adds latent noise to stabilize pseudo-labels and enhance label accuracy. Second, to overcome the limitations of latent low-resolution features in diffusion models, MADM introduces the label palette and latent regression which converts one-hot encoded labels into the RGB form by palette and regresses them in the latent space, thus ensuring the pre-trained decoder for up-sampling to obtain fine-grained features. Extensive experimental results demonstrate that MADM achieves state-of-the-art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities. We open-source our code and models at https://github.com/XiaRho/MADM.

* NeurIPS 2024

Via

Access Paper or Ask Questions

Towards Natural Image Matting in the Wild via Real-Scenario Prior

Oct 09, 2024

Ruihao Xia, Yu Liang, Peng-Tao Jiang, Hao Zhang, Qianru Sun, Yang Tang, Bo Li, Pan Zhou

Figure 1 for Towards Natural Image Matting in the Wild via Real-Scenario Prior

Figure 2 for Towards Natural Image Matting in the Wild via Real-Scenario Prior

Figure 3 for Towards Natural Image Matting in the Wild via Real-Scenario Prior

Figure 4 for Towards Natural Image Matting in the Wild via Real-Scenario Prior

Abstract:Recent approaches attempt to adapt powerful interactive segmentation models, such as SAM, to interactive matting and fine-tune the models based on synthetic matting datasets. However, models trained on synthetic data fail to generalize to complex and occlusion scenes. We address this challenge by proposing a new matting dataset based on the COCO dataset, namely COCO-Matting. Specifically, the construction of our COCO-Matting includes accessory fusion and mask-to-matte, which selects real-world complex images from COCO and converts semantic segmentation masks to matting labels. The built COCO-Matting comprises an extensive collection of 38,251 human instance-level alpha mattes in complex natural scenarios. Furthermore, existing SAM-based matting methods extract intermediate features and masks from a frozen SAM and only train a lightweight matting decoder by end-to-end matting losses, which do not fully exploit the potential of the pre-trained SAM. Thus, we propose SEMat which revamps the network architecture and training objectives. For network architecture, the proposed feature-aligned transformer learns to extract fine-grained edge and transparency features. The proposed matte-aligned decoder aims to segment matting-specific objects and convert coarse masks into high-precision mattes. For training objectives, the proposed regularization and trimap loss aim to retain the prior from the pre-trained model and push the matting logits extracted from the mask decoder to contain trimap-based semantic information. Extensive experiments across seven diverse datasets demonstrate the superior performance of our method, proving its efficacy in interactive natural image matting. We open-source our code, models, and dataset at https://github.com/XiaRho/SEMat.

Via

Access Paper or Ask Questions

Conan-embedding: General Text Embedding with More and Better Negative Samples

Aug 29, 2024

Shiyu Li, Yang Tang, Shizhe Chen, Xi Chen

Figure 1 for Conan-embedding: General Text Embedding with More and Better Negative Samples

Figure 2 for Conan-embedding: General Text Embedding with More and Better Negative Samples

Figure 3 for Conan-embedding: General Text Embedding with More and Better Negative Samples

Figure 4 for Conan-embedding: General Text Embedding with More and Better Negative Samples

Abstract:With the growing popularity of RAG, the capabilities of embedding models are gaining increasing attention. Embedding models are primarily trained through contrastive loss learning, with negative examples being a key component. Previous work has proposed various hard negative mining strategies, but these strategies are typically employed as preprocessing steps. In this paper, we propose the conan-embedding model, which maximizes the utilization of more and higher-quality negative examples. Specifically, since the model's ability to handle preprocessed negative examples evolves during training, we propose dynamic hard negative mining method to expose the model to more challenging negative examples throughout the training process. Secondly, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints. Therefore, we use a Cross-GPU balancing Loss to provide more negative examples for embedding training and balance the batch size across multiple tasks. Moreover, we also discovered that the prompt-response pairs from LLMs can be used for embedding training. Our approach effectively enhances the capabilities of embedding models, currently ranking first on the Chinese leaderboard of Massive text embedding benchmark

Via

Access Paper or Ask Questions

AIGC for Industrial Time Series: From Deep Generative Models to Large Generative Models

Jul 16, 2024

Lei Ren, Haiteng Wang, Yang Tang, Chunhua Yang

Figure 1 for AIGC for Industrial Time Series: From Deep Generative Models to Large Generative Models

Figure 2 for AIGC for Industrial Time Series: From Deep Generative Models to Large Generative Models

Figure 3 for AIGC for Industrial Time Series: From Deep Generative Models to Large Generative Models

Figure 4 for AIGC for Industrial Time Series: From Deep Generative Models to Large Generative Models

Abstract:With the remarkable success of generative models like ChatGPT, Artificial Intelligence Generated Content (AIGC) is undergoing explosive development. Not limited to text and images, generative models can generate industrial time series data, addressing challenges such as the difficulty of data collection and data annotation. Due to their outstanding generation ability, they have been widely used in Internet of Things, metaverse, and cyber-physical-social systems to enhance the efficiency of industrial production. In this paper, we present a comprehensive overview of generative models for industrial time series from deep generative models (DGMs) to large generative models (LGMs). First, a DGM-based AIGC framework is proposed for industrial time series generation. Within this framework, we survey advanced industrial DGMs and present a multi-perspective categorization. Furthermore, we systematically analyze the critical technologies required to construct industrial LGMs from four aspects: large-scale industrial dataset, LGMs architecture for complex industrial characteristics, self-supervised training for industrial time series, and fine-tuning of industrial downstream tasks. Finally, we conclude the challenges and future directions to enable the development of generative models in industry.

* 17 pages, 4 figures.This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions

Computer vision tasks for intelligent aerospace missions: An overview

Jul 09, 2024

Huilin Chen, Qiyu Sun, Fangfei Li, Yang Tang

Figure 1 for Computer vision tasks for intelligent aerospace missions: An overview

Figure 2 for Computer vision tasks for intelligent aerospace missions: An overview

Figure 3 for Computer vision tasks for intelligent aerospace missions: An overview

Figure 4 for Computer vision tasks for intelligent aerospace missions: An overview

Abstract:Computer vision tasks are crucial for aerospace missions as they help spacecraft to understand and interpret the space environment, such as estimating position and orientation, reconstructing 3D models, and recognizing objects, which have been extensively studied to successfully carry out the missions. However, traditional methods like Kalman Filtering, Structure from Motion, and Multi-View Stereo are not robust enough to handle harsh conditions, leading to unreliable results. In recent years, deep learning (DL)-based perception technologies have shown great potential and outperformed traditional methods, especially in terms of their robustness to changing environments. To further advance DL-based aerospace perception, various frameworks, datasets, and strategies have been proposed, indicating significant potential for future applications. In this survey, we aim to explore the promising techniques used in perception tasks and emphasize the importance of DL-based aerospace perception. We begin by providing an overview of aerospace perception, including classical space programs developed in recent years, commonly used sensors, and traditional perception methods. Subsequently, we delve into three fundamental perception tasks in aerospace missions: pose estimation, 3D reconstruction, and recognition, as they are basic and crucial for subsequent decision-making and control. Finally, we discuss the limitations and possibilities in current research and provide an outlook on future developments, including the challenges of working with limited datasets, the need for improved algorithms, and the potential benefits of multi-source information fusion.

* 23 pages, 7 figures, journal

Via

Access Paper or Ask Questions

Self-Supervised Monocular Depth Estimation in the Dark: Towards Data Distribution Compensation

Apr 22, 2024

Haolin Yang, Chaoqiang Zhao, Lu Sheng, Yang Tang

Figure 1 for Self-Supervised Monocular Depth Estimation in the Dark: Towards Data Distribution Compensation

Figure 2 for Self-Supervised Monocular Depth Estimation in the Dark: Towards Data Distribution Compensation

Figure 3 for Self-Supervised Monocular Depth Estimation in the Dark: Towards Data Distribution Compensation

Figure 4 for Self-Supervised Monocular Depth Estimation in the Dark: Towards Data Distribution Compensation

Abstract:Nighttime self-supervised monocular depth estimation has received increasing attention in recent years. However, using night images for self-supervision is unreliable because the photometric consistency assumption is usually violated in the videos taken under complex lighting conditions. Even with domain adaptation or photometric loss repair, performance is still limited by the poor supervision of night images on trainable networks. In this paper, we propose a self-supervised nighttime monocular depth estimation method that does not use any night images during training. Our framework utilizes day images as a stable source for self-supervision and applies physical priors (e.g., wave optics, reflection model and read-shot noise model) to compensate for some key day-night differences. With day-to-night data distribution compensation, our framework can be trained in an efficient one-stage self-supervised manner. Though no nighttime images are considered during training, qualitative and quantitative results demonstrate that our method achieves SoTA depth estimating results on the challenging nuScenes-Night and RobotCar-Night compared with existing methods.

* Accepted by IJCAI2024

Via

Access Paper or Ask Questions