Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Min Xu

Carnegie Mellon University

Mitigating Knowledge Discrepancies among Multiple Datasets for Task-agnostic Unified Face Alignment

Mar 28, 2025

Jiahao Xia, Min Xu, Wenjian Huang, Jianguo Zhang, Haimin Zhang, Chunxia Xiao

Figure 1 for Mitigating Knowledge Discrepancies among Multiple Datasets for Task-agnostic Unified Face Alignment

Figure 2 for Mitigating Knowledge Discrepancies among Multiple Datasets for Task-agnostic Unified Face Alignment

Figure 3 for Mitigating Knowledge Discrepancies among Multiple Datasets for Task-agnostic Unified Face Alignment

Figure 4 for Mitigating Knowledge Discrepancies among Multiple Datasets for Task-agnostic Unified Face Alignment

Abstract:Despite the similar structures of human faces, existing face alignment methods cannot learn unified knowledge from multiple datasets with different landmark annotations. The limited training samples in a single dataset commonly result in fragile robustness in this field. To mitigate knowledge discrepancies among different datasets and train a task-agnostic unified face alignment (TUFA) framework, this paper presents a strategy to unify knowledge from multiple datasets. Specifically, we calculate a mean face shape for each dataset. To explicitly align these mean shapes on an interpretable plane based on their semantics, each shape is then incorporated with a group of semantic alignment embeddings. The 2D coordinates of these aligned shapes can be viewed as the anchors of the plane. By encoding them into structure prompts and further regressing the corresponding facial landmarks using image features, a mapping from the plane to the target faces is finally established, which unifies the learning target of different datasets. Consequently, multiple datasets can be utilized to boost the generalization ability of the model. The successful mitigation of discrepancies also enhances the efficiency of knowledge transferring to a novel dataset, significantly boosts the performance of few-shot face alignment. Additionally, the interpretable plane endows TUFA with a task-agnostic characteristic, enabling it to locate landmarks unseen during training in a zero-shot manner. Extensive experiments are carried on seven benchmarks and the results demonstrate an impressive improvement in face alignment brought by knowledge discrepancies mitigation.

* 24 Pages, 9 Figures

Via

Access Paper or Ask Questions

OLMD: Orientation-aware Long-term Motion Decoupling for Continuous Sign Language Recognition

Mar 11, 2025

Yiheng Yu, Sheng Liu, Yuan Feng, Min Xu, Zhelun Jin, Xuhua Yang

Figure 1 for OLMD: Orientation-aware Long-term Motion Decoupling for Continuous Sign Language Recognition

Figure 2 for OLMD: Orientation-aware Long-term Motion Decoupling for Continuous Sign Language Recognition

Figure 3 for OLMD: Orientation-aware Long-term Motion Decoupling for Continuous Sign Language Recognition

Figure 4 for OLMD: Orientation-aware Long-term Motion Decoupling for Continuous Sign Language Recognition

Abstract:The primary challenge in continuous sign language recognition (CSLR) mainly stems from the presence of multi-orientational and long-term motions. However, current research overlooks these crucial aspects, significantly impacting accuracy. To tackle these issues, we propose a novel CSLR framework: Orientation-aware Long-term Motion Decoupling (OLMD), which efficiently aggregates long-term motions and decouples multi-orientational signals into easily interpretable components. Specifically, our innovative Long-term Motion Aggregation (LMA) module filters out static redundancy while adaptively capturing abundant features of long-term motions. We further enhance orientation awareness by decoupling complex movements into horizontal and vertical components, allowing for motion purification in both orientations. Additionally, two coupling mechanisms are proposed: stage and cross-stage coupling, which together enrich multi-scale features and improve the generalization capabilities of the model. Experimentally, OLMD shows SOTA performance on three large-scale datasets: PHOENIX14, PHOENIX14-T, and CSL-Daily. Notably, we improved the word error rate (WER) on PHOENIX14 by an absolute 1.6% compared to the previous SOTA

Via

Access Paper or Ask Questions

Two-Dimensional Deep ReLU CNN Approximation for Korobov Functions: A Constructive Approach

Mar 11, 2025

Qin Fang, Lei Shi, Min Xu, Ding-Xuan Zhou

Abstract:This paper investigates approximation capabilities of two-dimensional (2D) deep convolutional neural networks (CNNs), with Korobov functions serving as a benchmark. We focus on 2D CNNs, comprising multi-channel convolutional layers with zero-padding and ReLU activations, followed by a fully connected layer. We propose a fully constructive approach for building 2D CNNs to approximate Korobov functions and provide rigorous analysis of the complexity of the constructed networks. Our results demonstrate that 2D CNNs achieve near-optimal approximation rates under the continuous weight selection model, significantly alleviating the curse of dimensionality. This work provides a solid theoretical foundation for 2D CNNs and illustrates their potential for broader applications in function approximation.

Via

Access Paper or Ask Questions

StructGS: Adaptive Spherical Harmonics and Rendering Enhancements for Superior 3D Gaussian Splatting

Mar 09, 2025

Zexu Huang, Min Xu, Stuart Perry

Abstract:Recent advancements in 3D reconstruction coupled with neural rendering techniques have greatly improved the creation of photo-realistic 3D scenes, influencing both academic research and industry applications. The technique of 3D Gaussian Splatting and its variants incorporate the strengths of both primitive-based and volumetric representations, achieving superior rendering quality. While 3D Geometric Scattering (3DGS) and its variants have advanced the field of 3D representation, they fall short in capturing the stochastic properties of non-local structural information during the training process. Additionally, the initialisation of spherical functions in 3DGS-based methods often fails to engage higher-order terms in early training rounds, leading to unnecessary computational overhead as training progresses. Furthermore, current 3DGS-based approaches require training on higher resolution images to render higher resolution outputs, significantly increasing memory demands and prolonging training durations. We introduce StructGS, a framework that enhances 3D Gaussian Splatting (3DGS) for improved novel-view synthesis in 3D reconstruction. StructGS innovatively incorporates a patch-based SSIM loss, dynamic spherical harmonics initialisation and a Multi-scale Residual Network (MSRN) to address the above-mentioned limitations, respectively. Our framework significantly reduces computational redundancy, enhances detail capture and supports high-resolution rendering from low-resolution inputs. Experimentally, StructGS demonstrates superior performance over state-of-the-art (SOTA) models, achieving higher quality and more detailed renderings with fewer artifacts.

Via

Access Paper or Ask Questions

PersonaX: A Recommendation Agent Oriented User Modeling Framework for Long Behavior Sequence

Mar 04, 2025

Yunxiao Shi, Wujiang Xu, Zeqi Zhang, Xing Zi, Qiang Wu, Min Xu

Abstract:Recommendation agents leverage large language models for user modeling LLM UM to construct textual personas guiding alignment with real users. However existing LLM UM methods struggle with long user generated content UGC due to context limitations and performance degradation. To address this sampling strategies prioritize relevance or recency are often applied yet they inevitably neglect the diverse user interests embedded within the discarded behaviors resulting in incomplete modeling and degraded profiling quality. Furthermore relevance based sampling requires real time retrieval forcing the user modeling process to operate online which introduces significant latency overhead. In this paper we propose PersonaX an agent agnostic LLM UM framework that tackles these challenges through sub behavior sequence SBS selection and offline multi persona construction. PersonaX extracts compact SBS segments offline to capture diverse user interests generating fine grained textual personas that are cached for efficient online retrieval. This approach ensures that the user persona used for prompting remains highly relevant to the current context while eliminating the need for online user modeling. For SBS selection we ensure both efficiency length less than five and high representational quality by balancing prototypicality and diversity within the sampled data. Extensive experiments validate the effectiveness and versatility of PersonaX in high quality user profiling. Utilizing only 30 to 50 percent of the behavioral data with a sequence length of 480 integrating PersonaX with AgentCF yields an absolute performance improvement of 3 to 11 percent while integration with Agent4Rec results in a gain of 10 to 50 percent. PersonaX as an agent agnostic framework sets a new benchmark for scalable user modeling paving the way for more accurate and efficient LLM driven recommendation agents.

* draft paper

Via

Access Paper or Ask Questions

InstructAgent: Building User Controllable Recommender via LLM Agent

Feb 20, 2025

Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang

Abstract:Traditional recommender systems usually take the user-platform paradigm, where users are directly exposed under the control of the platform's recommendation algorithms. However, the defect of recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are often designed with commercial objectives in mind, focusing on the platform's benefits, which may hinder their ability to protect and capture users' true interests. Second, these models are typically optimized using data from all users, which may overlook individual user's preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user-platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Recently, some researchers have introduced LLM agents to simulate user behaviors, these approaches primarily aim to optimize platform-side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user-agent-platform paradigm, where agent serves as the protective shield between user and recommender system that enables indirect exposure. To this end, we first construct four recommendation datasets, denoted as $\dataset$, along with user instructions for each record.

* WWW2025@HCRS

Via

Access Paper or Ask Questions

TD-RD: A Top-Down Benchmark with Real-Time Framework for Road Damage Detection

Jan 24, 2025

Xi Xiao, Zhengji Li, Wentao Wang, Jiacheng Xie, Houjie Lin, Swalpa Kumar Roy, Tianyang Wang, Min Xu

Figure 1 for TD-RD: A Top-Down Benchmark with Real-Time Framework for Road Damage Detection

Figure 2 for TD-RD: A Top-Down Benchmark with Real-Time Framework for Road Damage Detection

Figure 3 for TD-RD: A Top-Down Benchmark with Real-Time Framework for Road Damage Detection

Figure 4 for TD-RD: A Top-Down Benchmark with Real-Time Framework for Road Damage Detection

Abstract:Object detection has witnessed remarkable advancements over the past decade, largely driven by breakthroughs in deep learning and the proliferation of large scale datasets. However, the domain of road damage detection remains relatively under explored, despite its critical significance for applications such as infrastructure maintenance and road safety. This paper addresses this gap by introducing a novel top down benchmark that offers a complementary perspective to existing datasets, specifically tailored for road damage detection. Our proposed Top Down Road Damage Detection Dataset (TDRD) includes three primary categories of road damage cracks, potholes, and patches captured from a top down viewpoint. The dataset consists of 7,088 high resolution images, encompassing 12,882 annotated instances of road damage. Additionally, we present a novel real time object detection framework, TDYOLOV10, designed to handle the unique challenges posed by the TDRD dataset. Comparative studies with state of the art models demonstrate competitive baseline results. By releasing TDRD, we aim to accelerate research in this crucial area. A sample of the dataset will be made publicly available upon the paper's acceptance.

Via

Access Paper or Ask Questions

SGIA: Enhancing Fine-Grained Visual Classification with Sequence Generative Image Augmentation

Dec 09, 2024

Qiyu Liao, Xin Yuan, Min Xu, Dadong Wang

Figure 1 for SGIA: Enhancing Fine-Grained Visual Classification with Sequence Generative Image Augmentation

Figure 2 for SGIA: Enhancing Fine-Grained Visual Classification with Sequence Generative Image Augmentation

Figure 3 for SGIA: Enhancing Fine-Grained Visual Classification with Sequence Generative Image Augmentation

Figure 4 for SGIA: Enhancing Fine-Grained Visual Classification with Sequence Generative Image Augmentation

Abstract:In Fine-Grained Visual Classification (FGVC), distinguishing highly similar subcategories remains a formidable challenge, often necessitating datasets with extensive variability. The acquisition and annotation of such FGVC datasets are notably difficult and costly, demanding specialized knowledge to identify subtle distinctions among closely related categories. Our study introduces a novel approach employing the Sequence Latent Diffusion Model (SLDM) for augmenting FGVC datasets, called Sequence Generative Image Augmentation (SGIA). Our method features a unique Bridging Transfer Learning (BTL) process, designed to minimize the domain gap between real and synthetically augmented data. This approach notably surpasses existing methods in generating more realistic image samples, providing a diverse range of pose transformations that extend beyond the traditional rigid transformations and style changes in generative augmentation. We demonstrate the effectiveness of our augmented dataset with substantial improvements in FGVC tasks on various datasets, models, and training strategies, especially in few-shot learning scenarios. Our method outperforms conventional image augmentation techniques in benchmark tests on three FGVC datasets, showcasing superior realism, variability, and representational quality. Our work sets a new benchmark and outperforms the previous state-of-the-art models in classification accuracy by 0.5% for the CUB-200-2011 dataset and advances the application of generative models in FGVC data augmentation.

* 13 pages, 5 figures

Via

Access Paper or Ask Questions

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Nov 30, 2024

Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu

Figure 1 for LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Figure 2 for LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Figure 3 for LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Figure 4 for LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Abstract:It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Specifically, our method employs large language models (LLMs) to generate enriched language prompts with diverse visual attributes for each category, including color, shape/size, and texture/material. Additionally, for enhanced visual feature extraction, the SAM model is adopted as a supplement to the CLIP visual encoder through a proposed learnable weighted fusion strategy. Built upon these techniques, our method, termed LMSeg, achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks. The code will be made available soon.

Via

Access Paper or Ask Questions

J-Invariant Volume Shuffle for Self-Supervised Cryo-Electron Tomogram Denoising on Single Noisy Volume

Nov 22, 2024

Xiwei Liu, Mohamad Kassab, Min Xu, Qirong Ho

Figure 1 for J-Invariant Volume Shuffle for Self-Supervised Cryo-Electron Tomogram Denoising on Single Noisy Volume

Figure 2 for J-Invariant Volume Shuffle for Self-Supervised Cryo-Electron Tomogram Denoising on Single Noisy Volume

Figure 3 for J-Invariant Volume Shuffle for Self-Supervised Cryo-Electron Tomogram Denoising on Single Noisy Volume

Figure 4 for J-Invariant Volume Shuffle for Self-Supervised Cryo-Electron Tomogram Denoising on Single Noisy Volume

Abstract:Cryo-Electron Tomography (Cryo-ET) enables detailed 3D visualization of cellular structures in near-native states but suffers from low signal-to-noise ratio due to imaging constraints. Traditional denoising methods and supervised learning approaches often struggle with complex noise patterns and the lack of paired datasets. Self-supervised methods, which utilize noisy input itself as a target, have been studied; however, existing Cryo-ET self-supervised denoising methods face significant challenges due to losing information during training and the learned incomplete noise patterns. In this paper, we propose a novel self-supervised learning model that denoises Cryo-ET volumetric images using a single noisy volume. Our method features a U-shape J-invariant blind spot network with sparse centrally masked convolutions, dilated channel attention blocks, and volume unshuffle/shuffle technique. The volume-unshuffle/shuffle technique expands receptive fields and utilizes multi-scale representations, significantly improving noise reduction and structural preservation. Experimental results demonstrate that our approach achieves superior performance compared to existing methods, advancing Cryo-ET data processing for structural biology research

* 10 pages, 7 figures, 7 tables

Via

Access Paper or Ask Questions