Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yi Li

Victor

Continual Test-Time Adaptation for Single Image Defocus Deblurring via Causal Siamese Networks

Jan 15, 2025

Shuang Cui, Yi Li, Jiangmeng Li, Xiongxin Tang, Bing Su, Fanjiang Xu, Hui Xiong

Figure 1 for Continual Test-Time Adaptation for Single Image Defocus Deblurring via Causal Siamese Networks

Figure 2 for Continual Test-Time Adaptation for Single Image Defocus Deblurring via Causal Siamese Networks

Figure 3 for Continual Test-Time Adaptation for Single Image Defocus Deblurring via Causal Siamese Networks

Figure 4 for Continual Test-Time Adaptation for Single Image Defocus Deblurring via Causal Siamese Networks

Abstract:Single image defocus deblurring (SIDD) aims to restore an all-in-focus image from a defocused one. Distribution shifts in defocused images generally lead to performance degradation of existing methods during out-of-distribution inferences. In this work, we gauge the intrinsic reason behind the performance degradation, which is identified as the heterogeneity of lens-specific point spread functions. Empirical evidence supports this finding, motivating us to employ a continual test-time adaptation (CTTA) paradigm for SIDD. However, traditional CTTA methods, which primarily rely on entropy minimization, cannot sufficiently explore task-dependent information for pixel-level regression tasks like SIDD. To address this issue, we propose a novel Siamese networks-based continual test-time adaptation framework, which adapts source models to continuously changing target domains only requiring unlabeled target data in an online manner. To further mitigate semantically erroneous textures introduced by source SIDD models under severe degradation, we revisit the learning paradigm through a structural causal model and propose Causal Siamese networks (CauSiam). Our method leverages large-scale pre-trained vision-language models to derive discriminative universal semantic priors and integrates these priors into Siamese networks, ensuring causal identifiability between blurry inputs and restored images. Extensive experiments demonstrate that CauSiam effectively improves the generalization performance of existing SIDD methods in continuously changing domains.

Via

Access Paper or Ask Questions

Complex-Cycle-Consistent Diffusion Model for Monaural Speech Enhancement

Dec 12, 2024

Yi Li, Yang Sun, Plamen Angelov

Figure 1 for Complex-Cycle-Consistent Diffusion Model for Monaural Speech Enhancement

Figure 2 for Complex-Cycle-Consistent Diffusion Model for Monaural Speech Enhancement

Figure 3 for Complex-Cycle-Consistent Diffusion Model for Monaural Speech Enhancement

Figure 4 for Complex-Cycle-Consistent Diffusion Model for Monaural Speech Enhancement

Abstract:In this paper, we present a novel diffusion model-based monaural speech enhancement method. Our approach incorporates the separate estimation of speech spectra's magnitude and phase in two diffusion networks. Throughout the diffusion process, noise clips from real-world noise interferences are added gradually to the clean speech spectra and a noise-aware reverse process is proposed to learn how to generate both clean speech spectra and noise spectra. Furthermore, to fully leverage the intrinsic relationship between magnitude and phase, we introduce a complex-cycle-consistent (CCC) mechanism that uses the estimated magnitude to map the phase, and vice versa. We implement this algorithm within a phase-aware speech enhancement diffusion model (SEDM). We conduct extensive experiments on public datasets to demonstrate the effectiveness of our method, highlighting the significant benefits of exploiting the intrinsic relationship between phase and magnitude information to enhance speech. The comparison to conventional diffusion models demonstrates the superiority of SEDM.

* AAAI 2025

Via

Access Paper or Ask Questions

Toward Fair Graph Neural Networks Via Dual-Teacher Knowledge Distillation

Nov 30, 2024

Chengyu Li, Debo Cheng, Guixian Zhang, Yi Li, Shichao Zhang

Abstract:Graph Neural Networks (GNNs) have demonstrated strong performance in graph representation learning across various real-world applications. However, they often produce biased predictions caused by sensitive attributes, such as religion or gender, an issue that has been largely overlooked in existing methods. Recently, numerous studies have focused on reducing biases in GNNs. However, these approaches often rely on training with partial data (e.g., using either node features or graph structure alone), which can enhance fairness but frequently compromises model utility due to the limited utilization of available graph information. To address this tradeoff, we propose an effective strategy to balance fairness and utility in knowledge distillation. Specifically, we introduce FairDTD, a novel Fair representation learning framework built on Dual-Teacher Distillation, leveraging a causal graph model to guide and optimize the design of the distillation process. Specifically, FairDTD employs two fairness-oriented teacher models: a feature teacher and a structure teacher, to facilitate dual distillation, with the student model learning fairness knowledge from the teachers while also leveraging full data to mitigate utility loss. To enhance information transfer, we incorporate graph-level distillation to provide an indirect supplement of graph information during training, as well as a node-specific temperature module to improve the comprehensive transfer of fair knowledge. Experiments on diverse benchmark datasets demonstrate that FairDTD achieves optimal fairness while preserving high model utility, showcasing its effectiveness in fair representation learning for GNNs.

Via

Access Paper or Ask Questions

GRAPE: Generalizing Robot Policy via Preference Alignment

Nov 28, 2024

Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Chaoqi Wang, Mingyu Ding, Dieter Fox, Huaxiu Yao

Figure 1 for GRAPE: Generalizing Robot Policy via Preference Alignment

Figure 2 for GRAPE: Generalizing Robot Policy via Preference Alignment

Figure 3 for GRAPE: Generalizing Robot Policy via Preference Alignment

Figure 4 for GRAPE: Generalizing Robot Policy via Preference Alignment

Abstract:Despite the recent advancements of vision-language-action (VLA) models on a variety of robotics tasks, they suffer from critical issues such as poor generalizability to unseen tasks, due to their reliance on behavior cloning exclusively from successful rollouts. Furthermore, they are typically fine-tuned to replicate demonstrations collected by experts under different settings, thus introducing distribution bias and limiting their adaptability to diverse manipulation objectives, such as efficiency, safety, and task completion. To bridge this gap, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs on a trajectory level and implicitly models reward from both successful and failure trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks down complex manipulation tasks to independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. Notably, these constraints are flexible and can be customized to align the model with varying objectives, such as safety, efficiency, or task success. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%, respectively. Additionally, GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 44.31% and rollout step-length by 11.15%, respectively. All code, models, and data are available at https://grape-vla.github.io/

* Website: https://grape-vla.github.io/

Via

Access Paper or Ask Questions

Dynamic Attention and Bi-directional Fusion for Safety Helmet Wearing Detection

Nov 28, 2024

Junwei Feng, Xueyan Fan, Yuyang Chen, Yi Li

Abstract:Ensuring construction site safety requires accurate and real-time detection of workers' safety helmet use, despite challenges posed by cluttered environments, densely populated work areas, and hard-to-detect small or overlapping objects caused by building obstructions. This paper proposes a novel algorithm for safety helmet wearing detection, incorporating a dynamic attention within the detection head to enhance multi-scale perception. The mechanism combines feature-level attention for scale adaptation, spatial attention for spatial localization, and channel attention for task-specific insights, improving small object detection without additional computational overhead. Furthermore, a two-way fusion strategy enables bidirectional information flow, refining feature fusion through adaptive multi-scale weighting, and enhancing recognition of occluded targets. Experimental results demonstrate a 1.7% improvement in mAP@[.5:.95] compared to the best baseline while reducing GFLOPs by 11.9% on larger sizes. The proposed method surpasses existing models, providing an efficient and practical solution for real-world construction safety monitoring.

Via

Access Paper or Ask Questions

Neural-Network-Enhanced Metalens Camera for High-Definition, Dynamic Imaging in the Long-Wave Infrared Spectrum

Nov 26, 2024

Jing-Yang Wei, Hao Huang, Xin Zhang, De-Mao Ye, Yi Li, Le Wang, Yao-Guang Ma, Yang-Hui Li

Figure 1 for Neural-Network-Enhanced Metalens Camera for High-Definition, Dynamic Imaging in the Long-Wave Infrared Spectrum

Figure 2 for Neural-Network-Enhanced Metalens Camera for High-Definition, Dynamic Imaging in the Long-Wave Infrared Spectrum

Figure 3 for Neural-Network-Enhanced Metalens Camera for High-Definition, Dynamic Imaging in the Long-Wave Infrared Spectrum

Figure 4 for Neural-Network-Enhanced Metalens Camera for High-Definition, Dynamic Imaging in the Long-Wave Infrared Spectrum

Abstract:To provide a lightweight and cost-effective solution for the long-wave infrared imaging using a singlet, we develop a camera by integrating a High-Frequency-Enhancing Cycle-GAN neural network into a metalens imaging system. The High-Frequency-Enhancing Cycle-GAN improves the quality of the original metalens images by addressing inherent frequency loss introduced by the metalens. In addition to the bidirectional cyclic generative adversarial network, it incorporates a high-frequency adversarial learning module. This module utilizes wavelet transform to extract high-frequency components, and then establishes a high-frequency feedback loop. It enables the generator to enhance the camera outputs by integrating adversarial feedback from the high-frequency discriminator. This ensures that the generator adheres to the constraints imposed by the high-frequency adversarial loss, thereby effectively recovering the camera's frequency loss. This recovery guarantees high-fidelity image output from the camera, facilitating smooth video production. Our camera is capable of achieving dynamic imaging at 125 frames per second with an End Point Error value of 12.58. We also achieve 0.42 for Fr\'echet Inception Distance, 30.62 for Peak Signal to Noise Ratio, and 0.69 for Structural Similarity in the recorded videos.

Via

Access Paper or Ask Questions

LEADRE: Multi-Faceted Knowledge Enhanced LLM Empowered Display Advertisement Recommender System

Nov 21, 2024

Fengxin Li, Yi Li, Yue Liu, Chao Zhou, Yuan Wang, Xiaoxiang Deng, Wei Xue, Dapeng Liu, Lei Xiao, Haijie Gu(+4 more)

Figure 1 for LEADRE: Multi-Faceted Knowledge Enhanced LLM Empowered Display Advertisement Recommender System

Figure 2 for LEADRE: Multi-Faceted Knowledge Enhanced LLM Empowered Display Advertisement Recommender System

Figure 3 for LEADRE: Multi-Faceted Knowledge Enhanced LLM Empowered Display Advertisement Recommender System

Figure 4 for LEADRE: Multi-Faceted Knowledge Enhanced LLM Empowered Display Advertisement Recommender System

Abstract:Display advertising provides significant value to advertisers, publishers, and users. Traditional display advertising systems utilize a multi-stage architecture consisting of retrieval, coarse ranking, and final ranking. However, conventional retrieval methods rely on ID-based learning to rank mechanisms and fail to adequately utilize the content information of ads, which hampers their ability to provide diverse recommendation lists. To address this limitation, we propose leveraging the extensive world knowledge of LLMs. However, three key challenges arise when attempting to maximize the effectiveness of LLMs: "How to capture user interests", "How to bridge the knowledge gap between LLMs and advertising system", and "How to efficiently deploy LLMs". To overcome these challenges, we introduce a novel LLM-based framework called LLM Empowered Display ADvertisement REcommender system (LEADRE). LEADRE consists of three core modules: (1) The Intent-Aware Prompt Engineering introduces multi-faceted knowledge and designs intent-aware <Prompt, Response> pairs that fine-tune LLMs to generate ads tailored to users' personal interests. (2) The Advertising-Specific Knowledge Alignment incorporates auxiliary fine-tuning tasks and Direct Preference Optimization (DPO) to align LLMs with ad semantic and business value. (3) The Efficient System Deployment deploys LEADRE in an online environment by integrating both latency-tolerant and latency-sensitive service. Extensive offline experiments demonstrate the effectiveness of LEADRE and validate the contributions of individual modules. Online A/B test shows that LEADRE leads to a 1.57% and 1.17% GMV lift for serviced users on WeChat Channels and Moments separately. LEADRE has been deployed on both platforms, serving tens of billions of requests each day.

Via

Access Paper or Ask Questions

YOSO: You-Only-Sample-Once via Compressed Sensing for Graph Neural Network Training

Nov 08, 2024

Yi Li, Zhichun Guo, Guanpeng Li, Bingzhe Li

Figure 1 for YOSO: You-Only-Sample-Once via Compressed Sensing for Graph Neural Network Training

Figure 2 for YOSO: You-Only-Sample-Once via Compressed Sensing for Graph Neural Network Training

Figure 3 for YOSO: You-Only-Sample-Once via Compressed Sensing for Graph Neural Network Training

Figure 4 for YOSO: You-Only-Sample-Once via Compressed Sensing for Graph Neural Network Training

Abstract:Graph neural networks (GNNs) have become essential tools for analyzing non-Euclidean data across various domains. During training stage, sampling plays an important role in reducing latency by limiting the number of nodes processed, particularly in large-scale applications. However, as the demand for better prediction performance grows, existing sampling algorithms become increasingly complex, leading to significant overhead. To mitigate this, we propose YOSO (You-Only-Sample-Once), an algorithm designed to achieve efficient training while preserving prediction accuracy. YOSO introduces a compressed sensing (CS)-based sampling and reconstruction framework, where nodes are sampled once at input layer, followed by a lossless reconstruction at the output layer per epoch. By integrating the reconstruction process with the loss function of specific learning tasks, YOSO not only avoids costly computations in traditional compressed sensing (CS) methods, such as orthonormal basis calculations, but also ensures high-probability accuracy retention which equivalent to full node participation. Experimental results on node classification and link prediction demonstrate the effectiveness and efficiency of YOSO, reducing GNN training by an average of 75\% compared to state-of-the-art methods, while maintaining accuracy on par with top-performing baselines.

Via

Access Paper or Ask Questions

OptiGrasp: Optimized Grasp Pose Detection Using RGB Images for Warehouse Picking Robots

Sep 29, 2024

Soofiyan Atar, Yi Li, Markus Grotz, Michael Wolf, Dieter Fox, Joshua Smith

Figure 1 for OptiGrasp: Optimized Grasp Pose Detection Using RGB Images for Warehouse Picking Robots

Figure 2 for OptiGrasp: Optimized Grasp Pose Detection Using RGB Images for Warehouse Picking Robots

Figure 3 for OptiGrasp: Optimized Grasp Pose Detection Using RGB Images for Warehouse Picking Robots

Figure 4 for OptiGrasp: Optimized Grasp Pose Detection Using RGB Images for Warehouse Picking Robots

Abstract:In warehouse environments, robots require robust picking capabilities to manage a wide variety of objects. Effective deployment demands minimal hardware, strong generalization to new products, and resilience in diverse settings. Current methods often rely on depth sensors for structural information, which suffer from high costs, complex setups, and technical limitations. Inspired by recent advancements in computer vision, we propose an innovative approach that leverages foundation models to enhance suction grasping using only RGB images. Trained solely on a synthetic dataset, our method generalizes its grasp prediction capabilities to real-world robots and a diverse range of novel objects not included in the training set. Our network achieves an 82.3\% success rate in real-world applications. The project website with code and data will be available at http://optigrasp.github.io.

* 8 pages, 6 figures

Via

Access Paper or Ask Questions

Contact Compliance Visuo-Proprioceptive Policy for Contact-Rich Manipulation with Cost-Efficient Haptic Hand-Arm Teleoperation System

Sep 22, 2024

Bo Zhou, Ruixuan Jiao, Yi Li, Fang Fang, Fu Chen

Figure 1 for Contact Compliance Visuo-Proprioceptive Policy for Contact-Rich Manipulation with Cost-Efficient Haptic Hand-Arm Teleoperation System

Figure 2 for Contact Compliance Visuo-Proprioceptive Policy for Contact-Rich Manipulation with Cost-Efficient Haptic Hand-Arm Teleoperation System

Figure 3 for Contact Compliance Visuo-Proprioceptive Policy for Contact-Rich Manipulation with Cost-Efficient Haptic Hand-Arm Teleoperation System

Figure 4 for Contact Compliance Visuo-Proprioceptive Policy for Contact-Rich Manipulation with Cost-Efficient Haptic Hand-Arm Teleoperation System

Abstract:Learning robot manipulation skills in real-world environments is extremely challenging. Robots learning manipulation skills in real-world environments is extremely challenging. Recent research on imitation learning and visuomotor policies has significantly enhanced the ability of robots to perform manipulation tasks. In this paper, we propose Admit Policy, a visuo-proprioceptive imitation learning framework with force compliance, designed to reduce contact force fluctuations during robot execution of contact-rich manipulation tasks. This framework also includes a hand-arm teleoperation system with vibrotactile feedback for efficient data collection. Our framework utilizes RGB images, robot joint positions, and contact forces as observations and leverages a consistency-constrained teacher-student probabilistic diffusion model to generate future trajectories for end-effector positions and contact forces. An admittance model is then employed to track these trajectories, enabling effective force-position control across various tasks.We validated our framework on five challenging contact-rich manipulation tasks. Among these tasks, while improving success rates, our approach most significantly reduced the mean contact force required to complete the tasks by up to 53.92% and decreased the standard deviation of contact force fluctuations by 76.51% compared to imitation learning algorithms without dynamic contact force prediction and tracking.

* 8 pages, 6 figures. This is the first version of the letter, and it is subject to further revisions. The current submission does not necessarily reflect the final quality or content of the letter

Via

Access Paper or Ask Questions