Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jian Yang

additional authors not shown

Towards Better Spherical Sliced-Wasserstein Distance Learning with Data-Adaptive Discriminative Projection Direction

Dec 26, 2024

Hongliang Zhang, Shuo Chen, Lei Luo, Jian Yang

Figure 1 for Towards Better Spherical Sliced-Wasserstein Distance Learning with Data-Adaptive Discriminative Projection Direction

Figure 2 for Towards Better Spherical Sliced-Wasserstein Distance Learning with Data-Adaptive Discriminative Projection Direction

Figure 3 for Towards Better Spherical Sliced-Wasserstein Distance Learning with Data-Adaptive Discriminative Projection Direction

Figure 4 for Towards Better Spherical Sliced-Wasserstein Distance Learning with Data-Adaptive Discriminative Projection Direction

Abstract:Spherical Sliced-Wasserstein (SSW) has recently been proposed to measure the discrepancy between spherical data distributions in various fields, such as geology, medical domains, computer vision, and deep representation learning. However, in the original SSW, all projection directions are treated equally, which is too idealistic and cannot accurately reflect the importance of different projection directions for various data distributions. To address this issue, we propose a novel data-adaptive Discriminative Spherical Sliced-Wasserstein (DSSW) distance, which utilizes a projected energy function to determine the discriminative projection direction for SSW. In our new DSSW, we introduce two types of projected energy functions to generate the weights for projection directions with complete theoretical guarantees. The first type employs a non-parametric deterministic function that transforms the projected Wasserstein distance into its corresponding weight in each projection direction. This improves the performance of the original SSW distance with negligible additional computational overhead. The second type utilizes a neural network-induced function that learns the projection direction weight through a parameterized neural network based on data projections. This further enhances the performance of the original SSW distance with less extra computational overhead. Finally, we evaluate the performance of our proposed DSSW by comparing it with several state-of-the-art methods across a variety of machine learning tasks, including gradient flows, density estimation on real earth data, and self-supervised learning.

* Accepted by AAAI 2025

Via

Access Paper or Ask Questions

ERGNN: Spectral Graph Neural Network with Explicitly-optimized Rational Graph Filters

Dec 26, 2024

Guoming Li, Jian Yang, Shangsong Liang

Figure 1 for ERGNN: Spectral Graph Neural Network with Explicitly-optimized Rational Graph Filters

Figure 2 for ERGNN: Spectral Graph Neural Network with Explicitly-optimized Rational Graph Filters

Figure 3 for ERGNN: Spectral Graph Neural Network with Explicitly-optimized Rational Graph Filters

Figure 4 for ERGNN: Spectral Graph Neural Network with Explicitly-optimized Rational Graph Filters

Abstract:Approximation-based spectral graph neural networks, which construct graph filters with function approximation, have shown substantial performance in graph learning tasks. Despite their great success, existing works primarily employ polynomial approximation to construct the filters, whereas another superior option, namely ration approximation, remains underexplored. Although a handful of prior works have attempted to deploy the rational approximation, their implementations often involve intensive computational demands or still resort to polynomial approximations, hindering full potential of the rational graph filters. To address the issues, this paper introduces ERGNN, a novel spectral GNN with explicitly-optimized rational filter. ERGNN adopts a unique two-step framework that sequentially applies the numerator filter and the denominator filter to the input signals, thus streamlining the model paradigm while enabling explicit optimization of both numerator and denominator of the rational filter. Extensive experiments validate the superiority of ERGNN over state-of-the-art methods, establishing it as a practical solution for deploying rational-based GNNs.

* Accepted in 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

Via

Access Paper or Ask Questions

Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

Dec 20, 2024

Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, Jian Yang

Figure 1 for Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

Figure 2 for Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

Figure 3 for Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

Figure 4 for Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

Abstract:Multimodal tracking has garnered widespread attention as a result of its ability to effectively address the inherent limitations of traditional RGB tracking. However, existing multimodal trackers mainly focus on the fusion and enhancement of spatial features or merely leverage the sparse temporal relationships between video frames. These approaches do not fully exploit the temporal correlations in multimodal videos, making it difficult to capture the dynamic changes and motion information of targets in complex scenarios. To alleviate this problem, we propose a unified multimodal spatial-temporal tracking approach named STTrack. In contrast to previous paradigms that solely relied on updating reference information, we introduced a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target. Furthermore, at the spatial level, we introduced the mamba fusion and background suppression interactive (BSI) modules. These modules establish a dual-stage mechanism for coordinating information interaction and fusion between modalities. Extensive comparisons on five benchmark datasets illustrate that STTrack achieves state-of-the-art performance across various multimodal tracking scenarios. Code is available at: https://github.com/NJU-PCALab/STTrack.

Via

Access Paper or Ask Questions

Qwen2.5 Technical Report

Dec 19, 2024

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu(+33 more)

Abstract:In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance human preference, and notably improve long text generation, structural data analysis, and instruction following. To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.

Via

Access Paper or Ask Questions

Pre-training a Density-Aware Pose Transformer for Robust LiDAR-based 3D Human Pose Estimation

Dec 18, 2024

Xiaoqi An, Lin Zhao, Chen Gong, Jun Li, Jian Yang

Figure 1 for Pre-training a Density-Aware Pose Transformer for Robust LiDAR-based 3D Human Pose Estimation

Figure 2 for Pre-training a Density-Aware Pose Transformer for Robust LiDAR-based 3D Human Pose Estimation

Figure 3 for Pre-training a Density-Aware Pose Transformer for Robust LiDAR-based 3D Human Pose Estimation

Figure 4 for Pre-training a Density-Aware Pose Transformer for Robust LiDAR-based 3D Human Pose Estimation

Abstract:With the rapid development of autonomous driving, LiDAR-based 3D Human Pose Estimation (3D HPE) is becoming a research focus. However, due to the noise and sparsity of LiDAR-captured point clouds, robust human pose estimation remains challenging. Most of the existing methods use temporal information, multi-modal fusion, or SMPL optimization to correct biased results. In this work, we try to obtain sufficient information for 3D HPE only by modeling the intrinsic properties of low-quality point clouds. Hence, a simple yet powerful method is proposed, which provides insights both on modeling and augmentation of point clouds. Specifically, we first propose a concise and effective density-aware pose transformer (DAPT) to get stable keypoint representations. By using a set of joint anchors and a carefully designed exchange module, valid information is extracted from point clouds with different densities. Then 1D heatmaps are utilized to represent the precise locations of the keypoints. Secondly, a comprehensive LiDAR human synthesis and augmentation method is proposed to pre-train the model, enabling it to acquire a better human body prior. We increase the diversity of point clouds by randomly sampling human positions and orientations and by simulating occlusions through the addition of laser-level masks. Extensive experiments have been conducted on multiple datasets, including IMU-annotated LidarHuman26M, SLOPER4D, and manually annotated Waymo Open Dataset v2.0 (Waymo), HumanM3. Our method demonstrates SOTA performance in all scenarios. In particular, compared with LPFormer on Waymo, we reduce the average MPJPE by $10.0mm$. Compared with PRN on SLOPER4D, we notably reduce the average MPJPE by $20.7mm$.

* Accepted to AAAI 2025

Via

Access Paper or Ask Questions

ExecRepoBench: Multi-level Executable Code Completion Evaluation

Dec 16, 2024

Jian Yang, Jiajun Zhang, Jiaxi Yang, Ke Jin, Lei Zhang, Qiyao Peng, Ken Deng, Yibo Miao, Tianyu Liu, Zeyu Cui(+2 more)

Figure 1 for ExecRepoBench: Multi-level Executable Code Completion Evaluation

Figure 2 for ExecRepoBench: Multi-level Executable Code Completion Evaluation

Figure 3 for ExecRepoBench: Multi-level Executable Code Completion Evaluation

Figure 4 for ExecRepoBench: Multi-level Executable Code Completion Evaluation

Abstract:Code completion has become an essential tool for daily software development. Existing evaluation benchmarks often employ static methods that do not fully capture the dynamic nature of real-world coding environments and face significant challenges, including limited context length, reliance on superficial evaluation metrics, and potential overfitting to training datasets. In this work, we introduce a novel framework for enhancing code completion in software development through the creation of a repository-level benchmark ExecRepoBench and the instruction corpora Repo-Instruct, aim at improving the functionality of open-source large language models (LLMs) in real-world coding scenarios that involve complex interdependencies across multiple files. ExecRepoBench includes 1.2K samples from active Python repositories. Plus, we present a multi-level grammar-based completion methodology conditioned on the abstract syntax tree to mask code fragments at various logical units (e.g. statements, expressions, and functions). Then, we fine-tune the open-source LLM with 7B parameters on Repo-Instruct to produce a strong code completion baseline model Qwen2.5-Coder-Instruct-C based on the open-source model. Qwen2.5-Coder-Instruct-C is rigorously evaluated against existing benchmarks, including MultiPL-E and ExecRepoBench, which consistently outperforms prior baselines across all programming languages. The deployment of \ourmethod{} can be used as a high-performance, local service for programming development\footnote{\url{https://execrepobench.github.io/}}.

Via

Access Paper or Ask Questions

StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors

Dec 16, 2024

Xiaokun Sun, Zeyu Cai, Zhenyu Zhang, Ying Tai, Jian Yang

Figure 1 for StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors

Figure 2 for StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors

Figure 3 for StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors

Figure 4 for StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors

Abstract:While haircut indicates distinct personality, existing avatar generation methods fail to model practical hair due to the general or entangled representation. We propose StrandHead, a novel text to 3D head avatar generation method capable of generating disentangled 3D hair with strand representation. Without using 3D data for supervision, we demonstrate that realistic hair strands can be generated from prompts by distilling 2D generative diffusion models. To this end, we propose a series of reliable priors on shape initialization, geometric primitives, and statistical haircut features, leading to a stable optimization and text-aligned performance. Extensive experiments show that StrandHead achieves the state-of-the-art reality and diversity of generated 3D head and hair. The generated 3D hair can also be easily implemented in the Unreal Engine for physical simulation and other applications. The code will be available at https://xiaokunsun.github.io/StrandHead.github.io.

* Project page: https://xiaokunsun.github.io/StrandHead.github.io

Via

Access Paper or Ask Questions

Depth-Centric Dehazing and Depth-Estimation from Real-World Hazy Driving Video

Dec 16, 2024

Junkai Fan, Kun Wang, Zhiqiang Yan, Xiang Chen, Shangbing Gao, Jun Li, Jian Yang

Figure 1 for Depth-Centric Dehazing and Depth-Estimation from Real-World Hazy Driving Video

Figure 2 for Depth-Centric Dehazing and Depth-Estimation from Real-World Hazy Driving Video

Figure 3 for Depth-Centric Dehazing and Depth-Estimation from Real-World Hazy Driving Video

Figure 4 for Depth-Centric Dehazing and Depth-Estimation from Real-World Hazy Driving Video

Abstract:In this paper, we study the challenging problem of simultaneously removing haze and estimating depth from real monocular hazy videos. These tasks are inherently complementary: enhanced depth estimation improves dehazing via the atmospheric scattering model (ASM), while superior dehazing contributes to more accurate depth estimation through the brightness consistency constraint (BCC). To tackle these intertwined tasks, we propose a novel depth-centric learning framework that integrates the ASM model with the BCC constraint. Our key idea is that both ASM and BCC rely on a shared depth estimation network. This network simultaneously exploits adjacent dehazed frames to enhance depth estimation via BCC and uses the refined depth cues to more effectively remove haze through ASM. Additionally, we leverage a non-aligned clear video and its estimated depth to independently regularize the dehazing and depth estimation networks. This is achieved by designing two discriminator networks: $D_{MFIR}$ enhances high-frequency details in dehazed videos, and $D_{MDR}$ reduces the occurrence of black holes in low-texture regions. Extensive experiments demonstrate that the proposed method outperforms current state-of-the-art techniques in both video dehazing and depth estimation tasks, especially in real-world hazy scenes. Project page: https://fanjunkai1.github.io/projectpage/DCL/index.html.

* Accepted by AAAI 20205, Project page: https://fanjunkai1.github.io/projectpage/DCL/index.html

Via

Access Paper or Ask Questions

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

Dec 12, 2024

Gaoxiang Cong, Jiadong Pan, Liang Li, Yuankai Qi, Yuxin Peng, Anton van den Hengel, Jian Yang, Qingming Huang

Figure 1 for EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

Figure 2 for EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

Figure 3 for EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

Figure 4 for EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

Abstract:Given a piece of text, a video clip, and a reference audio, the movie dubbing task aims to generate speech that aligns with the video while cloning the desired voice. The existing methods have two primary deficiencies: (1) They struggle to simultaneously hold audio-visual sync and achieve clear pronunciation; (2) They lack the capacity to express user-defined emotions. To address these problems, we propose EmoDubber, an emotion-controllable dubbing architecture that allows users to specify emotion type and emotional intensity while satisfying high-quality lip sync and pronunciation. Specifically, we first design Lip-related Prosody Aligning (LPA), which focuses on learning the inherent consistency between lip motion and prosody variation by duration level contrastive learning to incorporate reasonable alignment. Then, we design Pronunciation Enhancing (PE) strategy to fuse the video-level phoneme sequences by efficient conformer to improve speech intelligibility. Next, the speaker identity adapting module aims to decode acoustics prior and inject the speaker style embedding. After that, the proposed Flow-based User Emotion Controlling (FUEC) is used to synthesize waveform by flow matching prediction network conditioned on acoustics prior. In this process, the FUEC determines the gradient direction and guidance scale based on the user's emotion instructions by the positive and negative guidance mechanism, which focuses on amplifying the desired emotion while suppressing others. Extensive experimental results on three benchmark datasets demonstrate favorable performance compared to several state-of-the-art methods.

* Under review

Via

Access Paper or Ask Questions

ATPrompt: Textual Prompt Learning with Embedded Attributes

Dec 12, 2024

Zheng Li, Yibing Song, Penghai Zhao, Ming-Ming Cheng, Xiang Li, Jian Yang

Figure 1 for ATPrompt: Textual Prompt Learning with Embedded Attributes

Figure 2 for ATPrompt: Textual Prompt Learning with Embedded Attributes

Figure 3 for ATPrompt: Textual Prompt Learning with Embedded Attributes

Figure 4 for ATPrompt: Textual Prompt Learning with Embedded Attributes

Abstract:Textual-based prompt learning methods primarily employ multiple learnable soft prompts and hard class tokens in a cascading manner as text prompt inputs, aiming to align image and text (category) spaces for downstream tasks. However, current training is restricted to aligning images with predefined known categories and cannot be associated with unknown categories. In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories. Specifically, we introduce an Attribute-embedded Textual Prompt learning method for vision-language models, named ATPrompt. This approach expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple universal attribute tokens into the learnable soft prompts. Through this modification, we transform the text prompt from a category-centric form to an attribute-category hybrid form. To finalize the attributes for downstream tasks, we propose a differentiable attribute search method that learns to identify representative and suitable attributes from a candidate pool summarized by a large language model. As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing prompt format of textual-based methods, offering general improvements at a negligible computational cost. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.

* Technical Report. Project Page: https://zhengli97.github.io/ATPrompt/

Via

Access Paper or Ask Questions