Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lin Zhang

Devision of Biostatistics, School of Public Health, University of Minnesota

CLIP-driven Dual Feature Enhancing Network for Gaze Estimation

Feb 27, 2025

Lin Zhang, Yi Tian, Wanru Xu, Yi Jin, Yaping Huang

Figure 1 for CLIP-driven Dual Feature Enhancing Network for Gaze Estimation

Figure 2 for CLIP-driven Dual Feature Enhancing Network for Gaze Estimation

Figure 3 for CLIP-driven Dual Feature Enhancing Network for Gaze Estimation

Figure 4 for CLIP-driven Dual Feature Enhancing Network for Gaze Estimation

Abstract:The complex application scenarios have raised critical requirements for precise and generalizable gaze estimation methods. Recently, the pre-trained CLIP has achieved remarkable performance on various vision tasks, but its potentials have not been fully exploited in gaze estimation. In this paper, we propose a novel CLIP-driven Dual Feature Enhancing Network (CLIP-DFENet), which boosts gaze estimation performance with the help of CLIP under a novel `main-side' collaborative enhancing strategy. Accordingly, a Language-driven Differential Module (LDM) is designed on the basis of the CLIP's text encoder to reveal the semantic difference of gaze. This module could empower our Core Feature Extractor with the capability of characterizing the gaze-related semantic information. Moreover, a Vision-driven Fusion Module (VFM) is introduced to strengthen the generalized and valuable components of visual embeddings obtained via CLIP's image encoder, and utilizes them to further improve the generalization of the features captured by Core Feature Extractor. Finally, a robust Double-head Gaze Regressor is adopted to map the enhanced features to gaze directions. Extensive experimental results on four challenging datasets over within-domain and cross-domain tasks demonstrate the discriminability and generalizability of our CLIP-DFENet.

Via

Access Paper or Ask Questions

CRTrack: Low-Light Semi-Supervised Multi-object Tracking Based on Consistency Regularization

Feb 24, 2025

Zijing Zhao, Jianlong Yu, Lin Zhang, Shunli Zhang

Figure 1 for CRTrack: Low-Light Semi-Supervised Multi-object Tracking Based on Consistency Regularization

Figure 2 for CRTrack: Low-Light Semi-Supervised Multi-object Tracking Based on Consistency Regularization

Figure 3 for CRTrack: Low-Light Semi-Supervised Multi-object Tracking Based on Consistency Regularization

Figure 4 for CRTrack: Low-Light Semi-Supervised Multi-object Tracking Based on Consistency Regularization

Abstract:Multi-object tracking under low-light environments is prevalent in real life. Recent years have seen rapid development in the field of multi-object tracking. However, due to the lack of datasets and the high cost of annotations, multi-object tracking under low-light environments remains a persistent challenge. In this paper, we focus on multi-object tracking under low-light conditions. To address the issues of limited data and the lack of dataset, we first constructed a low-light multi-object tracking dataset (LLMOT). This dataset comprises data from MOT17 that has been enhanced for nighttime conditions as well as multiple unannotated low-light videos. Subsequently, to tackle the high annotation costs and address the issue of image quality degradation, we propose a semi-supervised multi-object tracking method based on consistency regularization named CRTrack. First, we calibrate a consistent adaptive sampling assignment to replace the static IoU-based strategy, enabling the semi-supervised tracking method to resist noisy pseudo-bounding boxes. Then, we design a adaptive semi-supervised network update method, which effectively leverages unannotated data to enhance model performance. Dataset and Code: https://github.com/ZJZhao123/CRTrack.

Via

Access Paper or Ask Questions

BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference

Feb 24, 2025

Zewen Jin, Shengnan Wang, Jiaan Zhu, Hongrui Zhan, Youhui Bai, Lin Zhang, Zhenyu Ming, Cheng Li

Abstract:The Mixture-of-Experts (MoE) structure scales the Transformer-based large language models (LLMs) and improves their performance with only the sub-linear increase in computation resources. Recently, a fine-grained DeepSeekMoE structure is proposed, which can further improve the computing efficiency of MoE without performance degradation. However, the All-to-All communication introduced by MoE has become a bottleneck, especially for the fine-grained structure, which typically involves and activates more experts, hence contributing to heavier communication overhead. In this paper, we propose a novel MoE structure named BigMac, which is also fine-grained but with high communication efficiency. The innovation of BigMac is mainly due to that we abandon the \textbf{c}ommunicate-\textbf{d}escend-\textbf{a}scend-\textbf{c}ommunicate (CDAC) manner used by fine-grained MoE, which leads to the All-to-All communication always taking place at the highest dimension. Instead, BigMac designs an efficient \textbf{d}escend-\textbf{c}ommunicate-\textbf{c}ommunicate-\textbf{a}scend (DCCA) manner. Specifically, we add a descending and ascending projection at the entrance and exit of the expert, respectively, which enables the communication to perform at a very low dimension. Furthermore, to adapt to DCCA, we re-design the structure of small experts, ensuring that the expert in BigMac has enough complexity to address tokens. Experimental results show that BigMac achieves comparable or even better model quality than fine-grained MoEs with the same number of experts and a similar number of total parameters. Equally importantly, BigMac reduces the end-to-end latency by up to 3.09$\times$ for training and increases the throughput by up to 3.11$\times$ for inference on state-of-the-art AI computing frameworks including Megatron, Tutel, and DeepSpeed-Inference.

Via

Access Paper or Ask Questions

High-resolution Rainy Image Synthesis: Learning from Rendering

Feb 23, 2025

Kaibin Zhou, Shengjie Zhao, Hao Deng, Lin Zhang

Figure 1 for High-resolution Rainy Image Synthesis: Learning from Rendering

Figure 2 for High-resolution Rainy Image Synthesis: Learning from Rendering

Figure 3 for High-resolution Rainy Image Synthesis: Learning from Rendering

Figure 4 for High-resolution Rainy Image Synthesis: Learning from Rendering

Abstract:Currently, there are few effective methods for synthesizing a mass of high-resolution rainy images in complex illumination conditions. However, these methods are essential for synthesizing large-scale high-quality paired rainy-clean image datasets, which can train deep learning-based single image rain removal models capable of generalizing to various illumination conditions. Therefore, we propose a practical two-stage learning-from-rendering pipeline for high-resolution rainy image synthesis. The pipeline combines the benefits of the realism of rendering-based methods and the high-efficiency of learning-based methods, providing the possibility of creating large-scale high-quality paired rainy-clean image datasets. In the rendering stage, we use a rendering-based method to create a High-resolution Rainy Image (HRI) dataset, which contains realistic high-resolution paired rainy-clean images of multiple scenes and various illumination conditions. In the learning stage, to learn illumination information from background images for high-resolution rainy image generation, we propose a High-resolution Rainy Image Generation Network (HRIGNet). HRIGNet is designed to introduce a guiding diffusion model in the Latent Diffusion Model, which provides additional guidance information for high-resolution image synthesis. In our experiments, HRIGNet is able to synthesize high-resolution rainy images up to 2048x1024 resolution. Rain removal experiments on real dataset validate that our method can help improve the robustness of deep derainers to real rainy images. To make our work reproducible, source codes and the dataset have been released at https://kb824999404.github.io/HRIG/.

Via

Access Paper or Ask Questions

Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning

Feb 13, 2025

Lin Zhang, Lijie Hu, Di Wang

Abstract:Transformer-based language models have achieved notable success, yet their internal reasoning mechanisms remain largely opaque due to complex non-linear interactions and high-dimensional operations. While previous research suggests that these models implicitly encode reasoning structures, it is still unclear which specific multi-step thought processes they employ to solve complex tasks. To address this gap, we propose a novel mechanistic interpretability framework, SICAF, designed to trace and analyze the reasoning strategies that language models use in multi-step inference tasks. By employing circuit analysis and self-influence functions, we quantify the evolving importance of each token throughout the reasoning process, thereby mapping the pathways the model uses for inference. Applying SICAF to the GPT-2 model on the Indirect Object Identification (IOI) prediction task, we demonstrate how underlying circuits can reveal a reasoning process that aligns with human interpretability, offering new insights into the model's internal logic.

* Accepted by NAACL2025

Via

Access Paper or Ask Questions

EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification

Feb 07, 2025

Lin Zhang, Wenshuo Dong, Zhuoran Zhang, Shu Yang, Lijie Hu, Ninghao Liu, Pan Zhou, Di Wang

Figure 1 for EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification

Figure 2 for EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification

Figure 3 for EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification

Figure 4 for EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification

Abstract:Understanding the internal mechanisms of transformer-based language models remains challenging. Mechanistic interpretability based on circuit discovery aims to reverse engineer neural networks by analyzing their internal processes at the level of computational subgraphs. In this paper, we revisit existing gradient-based circuit identification methods and find that their performance is either affected by the zero-gradient problem or saturation effects, where edge attribution scores become insensitive to input changes, resulting in noisy and unreliable attribution evaluations for circuit components. To address the saturation effect, we propose Edge Attribution Patching with GradPath (EAP-GP), EAP-GP introduces an integration path, starting from the input and adaptively following the direction of the difference between the gradients of corrupted and clean inputs to avoid the saturated region. This approach enhances attribution reliability and improves the faithfulness of circuit identification. We evaluate EAP-GP on 6 datasets using GPT-2 Small, GPT-2 Medium, and GPT-2 XL. Experimental results demonstrate that EAP-GP outperforms existing methods in circuit faithfulness, achieving improvements up to 17.7%. Comparisons with manually annotated ground-truth circuits demonstrate that EAP-GP achieves precision and recall comparable to or better than previous approaches, highlighting its effectiveness in identifying accurate circuits.

Via

Access Paper or Ask Questions

FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models

Jan 18, 2025

Xinglin Pan, Wenxiang Lin, Lin Zhang, Shaohuai Shi, Zhenheng Tang, Rui Wang, Bo Li, Xiaowen Chu

Figure 1 for FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models

Figure 2 for FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models

Figure 3 for FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models

Figure 4 for FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models

Abstract:Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules, including token routing, token communication, expert computation, and expert parallelism, that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques: 1) Unified abstraction and online profiling of MoE modules for task scheduling across various MoE implementations. 2) Co-scheduling intra-node and inter-node communications with computations to minimize communication overheads. 3) To support near-optimal task scheduling, we design an adaptive gradient partitioning method for gradient aggregation and a schedule to adaptively pipeline communications and computations. We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters. Experimental results show that 1) our FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations (with up to a 1.42$\times$ speedup), and 2) FSMoE outperforms the state-of-the-art MoE training systems (DeepSpeed-MoE and Tutel) by 1.18$\times$-1.22$\times$ on 1458 MoE layers and 1.19$\times$-3.01$\times$ on real-world MoE models based on GPT-2 and Mixtral using a popular routing function.

Via

Access Paper or Ask Questions

CodecFake-Omni: A Large-Scale Codec-based Deepfake Speech Dataset

Jan 14, 2025

Jiawei Du, Xuanjun Chen, Haibin Wu, Lin Zhang, I-Ming Lin, I-Hsiang Chiu, Wenze Ren, Yuan Tseng, Yu Tsao, Jyh-Shing Roger Jang(+1 more)

Figure 1 for CodecFake-Omni: A Large-Scale Codec-based Deepfake Speech Dataset

Figure 2 for CodecFake-Omni: A Large-Scale Codec-based Deepfake Speech Dataset

Figure 3 for CodecFake-Omni: A Large-Scale Codec-based Deepfake Speech Dataset

Figure 4 for CodecFake-Omni: A Large-Scale Codec-based Deepfake Speech Dataset

Abstract:With the rapid advancement of codec-based speech generation (CoSG) systems, creating fake speech that mimics an individual's identity and spreads misinformation has become remarkably easy. Addressing the risks posed by such deepfake speech has attracted significant attention. However, most existing studies focus on detecting fake data generated by traditional speech generation models. Research on detecting fake speech generated by CoSG systems remains limited and largely unexplored. In this paper, we introduce CodecFake-Omni, a large-scale dataset specifically designed to advance the study of neural codec-based deepfake speech (CodecFake) detection and promote progress within the anti-spoofing community. To the best of our knowledge, CodecFake-Omni is the largest dataset of its kind till writing this paper, encompassing the most diverse range of codec architectures. The training set is generated through re-synthesis using nearly all publicly available open-source 31 neural audio codec models across 21 different codec families (one codec family with different configurations will result in multiple different codec models). The evaluation set includes web-sourced data collected from websites generated by 17 advanced CoSG models with eight codec families. Using this large-scale dataset, we reaffirm our previous findings that anti-spoofing models trained on traditional spoofing datasets generated by vocoders struggle to detect synthesized speech from current CoSG systems. Additionally, we propose a comprehensive neural audio codec taxonomy, categorizing neural audio codecs by their root components: vector quantizer, auxiliary objectives, and decoder types, with detailed explanations and representative examples for each. Using this comprehensive taxonomy, we conduct stratified analysis to provide valuable insights for future CodecFake detection research.

* Work in Progress: The first two authors contributed equally to this work. Their names are listed alphabetically by first name

Via

Access Paper or Ask Questions

A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation Based on Large Language Models Enhanced by Domain Knowledge Retrieval

Jan 08, 2025

Shuo Tong, Han Liu, Runyuan Guo, Wenqing Wang, Xueqiong Tian, Lingyun Wei, Lin Zhang, Huayong Wu, Ding Liu, Youmin Zhang

Figure 1 for A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation Based on Large Language Models Enhanced by Domain Knowledge Retrieval

Figure 2 for A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation Based on Large Language Models Enhanced by Domain Knowledge Retrieval

Figure 3 for A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation Based on Large Language Models Enhanced by Domain Knowledge Retrieval

Figure 4 for A Soft Sensor Method with Uncertainty-Awareness and Self-Explanation Based on Large Language Models Enhanced by Domain Knowledge Retrieval

Abstract:Data-driven soft sensors are crucial in predicting key performance indicators in industrial systems. However, current methods predominantly rely on the supervised learning paradigms of parameter updating, which inherently faces challenges such as high development costs, poor robustness, training instability, and lack of interpretability. Recently, large language models (LLMs) have demonstrated significant potential across various domains, notably through In-Context Learning (ICL), which enables high-performance task execution with minimal input-label demonstrations and no prior training. This paper aims to replace supervised learning with the emerging ICL paradigm for soft sensor modeling to address existing challenges and explore new avenues for advancement. To achieve this, we propose a novel framework called the Few-shot Uncertainty-aware and self-Explaining Soft Sensor (LLM-FUESS), which includes the Zero-shot Auxiliary Variable Selector (LLM-ZAVS) and the Uncertainty-aware Few-shot Soft Sensor (LLM-UFSS). The LLM-ZAVS retrieves from the Industrial Knowledge Vector Storage to enhance LLMs' domain-specific knowledge, enabling zero-shot auxiliary variable selection. In the LLM-UFSS, we utilize text-based context demonstrations of structured data to prompt LLMs to execute ICL for predicting and propose a context sample retrieval augmentation strategy to improve performance. Additionally, we explored LLMs' AIGC and probabilistic characteristics to propose self-explanation and uncertainty quantification methods for constructing a trustworthy soft sensor. Extensive experiments demonstrate that our method achieved state-of-the-art predictive performance, strong robustness, and flexibility, effectively mitigates training instability found in traditional methods. To the best of our knowledge, this is the first work to establish soft sensor utilizing LLMs.

Via

Access Paper or Ask Questions

GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Dec 15, 2024

Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha(+10 more)

Figure 1 for GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Figure 2 for GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Figure 3 for GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Figure 4 for GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Abstract:We present GEM, a Generalizable Ego-vision Multimodal world model that predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. Hence, our model has precise control over object dynamics, ego-agent motion and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long-horizon generations. Our dataset is comprised of 4000+ hours of multimodal data across domains like autonomous driving, egocentric human activities, and drone flights. Pseudo-labels are used to get depth maps, ego-trajectories, and human poses. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show GEM excels at generating diverse, controllable scenarios and temporal consistency over long generations. Code, models, and datasets are fully open-sourced.

Via

Access Paper or Ask Questions