Yutong Zheng

Joint Training Scattering Matrix Learning and Channel Estimation for Beyond-Diagonal Reconfigurable Intelligent Surfaces

Mar 26, 2026

Yiyang Peng, Binggui Zhou, Yutong Zheng, Danilo Mandic, Bruno Clerckx

Abstract:Beyond-diagonal reconfigurable intelligent surface (BD-RIS) generalizes the conventional diagonal RIS (D-RIS) by introducing tunable inter-element connections, offering enhanced wave manipulation capabilities. However, realizing the advantages of BD-RIS requires accurate channel state information (CSI), whose acquisition becomes significantly more challenging due to the increased number of channel coefficients, leading to prohibitively large pilot training overhead in BD-RIS-aided multi-user multiple-input multiple-output (MU-MIMO) systems. Existing studies reduce pilot overhead by exploiting the channel correlations induced by the Kronecker-product or multi-linear structure of BD-RIS-aided channels, which neglect the spatial correlation among antennas and the statistical correlation across RIS-user channels. In this paper, we propose a learning-based channel estimation framework, namely the joint training scattering matrix learning and channel estimation framework (JTSMLCEF), which jointly optimizes the BD-RIS training scattering matrix and estimates the cascaded channels in an end-to-end manner to achieve accurate channel estimation and reduce the pilot overhead. The proposed JTSMLCEF follows a two-phase channel estimation protocol to enable adaptive training scattering matrix optimization with a training scattering matrix optimizer (TSMO) and cascaded channel estimation with a dual-attention channel estimator (DACE). Specifically, the DACE is designed with intra-user and inter-user attention modules to capture the multi-dimensional correlations in multi-user cascaded channels. Simulation results demonstrate the superiority of JTSMLCEF. Compared with the current state-of-the-art method, it reduces the pilot overhead by $80\%$ while further reducing the normalized mean squared error (NMSE) by $82.6\%$ and $92.5\%$ in indoor and urban micro-cell (UMi) scenarios, respectively.
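
The NMSE figures quoted above can be made concrete with a small sketch. The abstract does not define NMSE, so the code below assumes the common definition, the squared error energy divided by the energy of the true channel; whether the paper also averages over channel realizations is not stated. The function names `nmse` and `relative_reduction` and the toy matrices are hypothetical, chosen only for illustration.

```python
def nmse(h_true, h_est):
    """NMSE between two equally shaped complex matrices (lists of rows).

    Assumed definition: sum |h - h_hat|^2 / sum |h|^2.
    """
    err_energy = 0.0
    ref_energy = 0.0
    for row_true, row_est in zip(h_true, h_est):
        for a, b in zip(row_true, row_est):
            err_energy += abs(a - b) ** 2
            ref_energy += abs(a) ** 2
    if ref_energy == 0.0:
        raise ValueError("reference channel has zero energy")
    return err_energy / ref_energy


def relative_reduction(baseline, proposed):
    """Fractional reduction of a metric relative to a baseline value."""
    return (baseline - proposed) / baseline


# Toy 2x2 channel and an imperfect estimate (illustrative values only).
h = [[1 + 0j, 0j], [0j, 1 + 0j]]
h_hat = [[1 + 0j, 0j], [0j, 0.5 + 0j]]
print(nmse(h, h_hat))
```

Under this reading, an "82.6% reduction in NMSE" means that `relative_reduction(baseline_nmse, proposed_nmse)` equals 0.826, i.e. the proposed estimator's NMSE is 17.4% of the baseline's.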

Voices of Civilizations: A Multilingual QA Benchmark for Global Music Understanding

Feb 28, 2026

Shangda Wu, Ziya Zhou, Yongyi Zang, Yutong Zheng, Dafang Liang, Ruibin Yuan, Qiuqiang Kong

Abstract:We introduce Voices of Civilizations, the first multilingual QA benchmark for evaluating audio LLMs' cultural comprehension on full-length music recordings. Covering 380 tracks across 38 languages, our automated pipeline yields 1,190 multiple-choice questions through four stages - each followed by manual verification: 1) compiling a representative music list; 2) generating cultural-background documents for each sample in the music list via LLMs; 3) extracting key attributes from those documents; and 4) constructing multiple-choice questions probing language, region associations, mood, and thematic content. We evaluate models under four conditions and report per-language accuracy. Our findings demonstrate that even state-of-the-art audio LLMs struggle to capture subtle cultural nuances without rich textual context and exhibit systematic biases in interpreting music from different cultural traditions. The dataset is publicly available on Hugging Face to foster culturally inclusive music understanding research.

* 2 pages, 2 figures, 1 table, accepted by ISMIR 2025 LBD

ScenePilot-Bench: A Large-Scale Dataset and Benchmark for Evaluation of Vision-Language Models in Autonomous Driving

Jan 27, 2026

Yujin Wang, Yutong Zheng, Wenxian Fan, Tianyi Wang, Hongqing Chu, Daxin Tian, Bingzhao Gao, Jianqiang Wang, Hong Chen

Abstract:In this paper, we introduce ScenePilot-Bench, a large-scale first-person driving benchmark designed to evaluate vision-language models (VLMs) in autonomous driving scenarios. ScenePilot-Bench is built upon ScenePilot-4K, a diverse dataset comprising 3,847 hours of driving videos, annotated with multi-granularity information including scene descriptions, risk assessments, key participant identification, ego trajectories, and camera parameters. The benchmark features a four-axis evaluation suite that assesses VLM capabilities in scene understanding, spatial perception, motion planning, and GPT-Score, with safety-aware metrics and cross-region generalization settings. We benchmark representative VLMs on ScenePilot-Bench, providing empirical analyses that clarify current performance boundaries and identify gaps for driving-oriented reasoning. ScenePilot-Bench offers a comprehensive framework for evaluating and advancing VLMs in safety-critical autonomous driving contexts.

A Reference-Based 3D Semantic-Aware Framework for Accurate Local Facial Attribute Editing

Jul 29, 2024

Yu-Kai Huang, Yutong Zheng, Yen-Shuo Su, Anudeepsekhar Bolimera, Han Zhang, Fangyi Chen, Marios Savvides

Abstract:Facial attribute editing plays a crucial role in synthesizing realistic faces with specific characteristics while maintaining realistic appearances. Despite advancements, challenges persist in achieving precise, 3D-aware attribute modifications, which are crucial for consistent and accurate representations of faces from different angles. Current methods struggle with semantic entanglement and lack effective guidance for incorporating attributes while maintaining image integrity. To address these issues, we introduce a novel framework that merges the strengths of latent-based and reference-based editing methods. Our approach employs a 3D GAN inversion technique to embed attributes from the reference image into a tri-plane space, ensuring 3D consistency and realistic viewing from multiple perspectives. We utilize blending techniques and predicted semantic masks to locate precise edit regions, merging them with the contextual guidance from the reference image. A coarse-to-fine inpainting strategy is then applied to preserve the integrity of untargeted areas, significantly enhancing realism. Our evaluations demonstrate superior performance across diverse editing tasks, validating our framework's effectiveness in realistic and applicable facial attribute editing.

Detecting subtle macroscopic changes in a finite temperature classical scalar field with machine learning

Nov 21, 2023

Jiming Yang, Yutong Zheng, Jiahong Zhou, Huiyu Li, Jun Yin

Abstract:The ability to detect macroscopic changes is important for probing the behaviors of experimental many-body systems from the classical to the quantum realm. Although abrupt changes near phase boundaries can easily be detected, subtle macroscopic changes are much more difficult to detect as the changes can be obscured by noise. In this study, as a toy model for detecting subtle macroscopic changes in many-body systems, we try to differentiate scalar field samples at varying temperatures. We compare different methods for making such differentiations, ranging from a physics method and a statistical method to an AI method. Our findings suggest that the AI method outperforms both the statistical method and the physics method in sensitivity. Our results provide a proof of concept that AI can potentially detect macroscopic changes in many-body systems that elude physical measures.

* 10 pages, 3 figures

Powering Finetuning in Few-shot Learning: Domain-Agnostic Feature Adaptation with Rectified Class Prototypes

Apr 07, 2022

Ran Tao, Han Zhang, Yutong Zheng, Marios Savvides

Abstract:In recent works, a deep network trained on the meta-training set serves as a strong baseline in few-shot learning. In this paper, we go further and refine novel-class features by finetuning a trained deep network. Finetuning is designed to focus on reducing biases in novel-class feature distributions, which we divide into two aspects: class-agnostic and class-specific biases. Class-agnostic bias is defined as the distribution shift introduced by domain difference, which we propose the Distribution Calibration Module (DCM) to reduce. DCM has the desirable properties of eliminating domain difference and enabling fast feature adaptation during optimization. Class-specific bias is defined as the biased estimation that results from using only a few samples in novel classes, which we propose Selected Sampling (SS) to reduce. Without inferring the actual class distribution, SS runs sampling using proposal distributions around support-set samples. By powering finetuning with DCM and SS, we achieve state-of-the-art results on Meta-Dataset with consistent performance boosts over ten datasets from different domains. We believe our simple yet effective method demonstrates its potential to be applied to practical few-shot applications.

* published in AAAI-22

Unsupervised Disentanglement of Linear-Encoded Facial Semantics

Mar 30, 2021

Yutong Zheng, Yu-Kai Huang, Ran Tao, Zhiqiang Shen, Marios Savvides

Abstract:We propose a method to disentangle linear-encoded facial semantics from StyleGAN without external supervision. The method derives from linear regression and sparse representation learning concepts to make the disentangled latent representations easily interpreted as well. We start by coupling StyleGAN with a stabilized 3D deformable facial reconstruction method to decompose single-view GAN generations into multiple semantics. Latent representations are then extracted to capture interpretable facial semantics. In this work, we make it possible to get rid of labels for disentangling meaningful facial semantics. Also, we demonstrate that the guided extrapolation along the disentangled representations can help with data augmentation, which sheds light on handling unbalanced data. Finally, we provide an analysis of our learned localized facial representations and illustrate that the semantic information is encoded, which surprisingly complies with human intuition. The overall unsupervised design brings more flexibility to representation learning in the wild.

* Accepted in IEEE Conference on Computer Vision and Pattern Recognition 2021 (CVPR2021)

Adversarial-Based Knowledge Distillation for Multi-Model Ensemble and Noisy Data Refinement

Aug 22, 2019

Zhiqiang Shen, Zhankui He, Wanyun Cui, Jiahui Yu, Yutong Zheng, Chenchen Zhu, Marios Savvides

Abstract:Generic image recognition is a fundamental and important problem in computer vision. One major challenge of this task is that a single image usually contains multiple objects while the labels are still one-hot; another is that labels annotated by humans are noisy and sometimes missing. In this paper, we tackle these challenges, which accompany two different image recognition problems, multi-model ensemble and noisy data recognition, with a unified framework. As is well known, the best-performing deep neural models are usually ensembles of multiple base-level networks, as ensembling can mitigate the variation or noise contained in the dataset. Unfortunately, the space required to store these many networks, and the time required to execute them at runtime, prohibit their use in applications where test sets are large (e.g., ImageNet). In this paper, we present a method for compressing large, complex trained ensembles into a single network, where the knowledge from a variety of trained deep neural networks (DNNs) is distilled and transferred to a single DNN. In order to distill diverse knowledge from different trained (teacher) models, we propose to use an adversarial-based learning strategy in which we define a block-wise training loss to guide and optimize the predefined student network to recover the knowledge in the teacher models, and simultaneously train the discriminator network to distinguish teacher from student features. Extensive experiments on CIFAR-10/100, SVHN, ImageNet and the iMaterialist Challenge Dataset demonstrate the effectiveness of our MEAL method. On ImageNet, our ResNet-50 based MEAL achieves top-1/5 21.79%/5.99% val error, which outperforms the original model by 2.06%/1.14%. On the iMaterialist Challenge Dataset, our MEAL obtains a remarkable improvement of top-3 1.15% (official evaluation metric) on a strong baseline model of ResNet-101.

* This is an extended version of our previous conference paper arXiv:1812.02425

Ring loss: Convex Feature Normalization for Face Recognition

Feb 28, 2018

Yutong Zheng, Dipan K. Pal, Marios Savvides

Abstract:We motivate and present Ring loss, a simple and elegant feature normalization approach for deep networks designed to augment standard loss functions such as Softmax. We argue that deep feature normalization is an important aspect of supervised classification problems where we require the model to represent each class in a multi-class problem equally well. The direct approach to feature normalization through the hard normalization operation results in a non-convex formulation. Instead, Ring loss applies soft normalization, where it gradually learns to constrain the norm to the scaled unit circle while preserving convexity, leading to more robust features. We apply Ring loss to large-scale face recognition problems and present results on LFW, the challenging protocols of IJB-A Janus, Janus CS3 (a superset of IJB-A Janus), Celebrity Frontal-Profile (CFP) and MegaFace with 1 million distractors. Ring loss outperforms strong baselines, matches state-of-the-art performance on IJB-A Janus and outperforms all other results on the challenging Janus CS3, thereby achieving state-of-the-art. We also outperform strong baselines in handling extremely low resolution face matching.

* Accepted at CVPR 2018
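
The soft normalization described in the abstract above can be sketched as a penalty on how far each feature's L2 norm lies from a target radius. The abstract gives no formula, so the form used here, `weight / (2 * m)` times the sum of squared norm deviations over `m` features, is an assumption about the loss, and the radius is treated as a fixed input rather than a learned parameter purely for illustration. The names `ring_loss`, `radius`, and `weight` are hypothetical.

```python
import math


def ring_loss(features, radius, weight=1.0):
    """Soft feature-norm penalty (assumed form, not taken from the abstract).

    features: list of feature vectors (lists of floats)
    radius:   target L2 norm for every feature vector
    weight:   scalar multiplying the penalty when added to a primary loss
    """
    m = len(features)
    if m == 0:
        raise ValueError("need at least one feature vector")
    total = 0.0
    for f in features:
        norm = math.sqrt(sum(x * x for x in f))
        total += (norm - radius) ** 2
    return weight * total / (2.0 * m)


# Two features whose norms are both 5.
batch = [[3.0, 4.0], [0.0, 5.0]]
print(ring_loss(batch, radius=5.0))  # no penalty when norms match the radius
print(ring_loss(batch, radius=4.0))  # positive penalty otherwise
```

In use, such a term would be added to a primary classification loss such as Softmax, so that norms are pulled toward the target radius gradually instead of being hard-normalized.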

Towards a Deep Learning Framework for Unconstrained Face Detection

Jan 02, 2017

Yutong Zheng, Chenchen Zhu, Khoa Luu, Chandrasekhar Bhagavatula, T. Hoang Ngan Le, Marios Savvides

Abstract:Robust face detection is one of the most important pre-processing steps to support facial expression analysis, facial landmarking, face recognition, pose estimation, building of 3D facial models, etc. Although this topic has been intensely studied for decades, it is still challenging due to numerous variants of face images in real-world scenarios. In this paper, we present a novel approach named Multiple Scale Faster Region-based Convolutional Neural Network (MS-FRCNN) to robustly detect human facial regions from images collected under various challenging conditions, e.g. large occlusions, extremely low resolutions, facial expressions, strong illumination variations, etc. The proposed approach is benchmarked on two challenging face detection databases, i.e. the Wider Face database and the Face Detection Dataset and Benchmark (FDDB), and compared against other recent face detection methods, e.g. Two-stage CNN, Multi-scale Cascade CNN, Faceness, Aggregate Channel Features, HeadHunter, Multi-view Face Detection, Cascade CNN, etc. The experimental results show that our proposed approach consistently achieves results that are highly competitive with state-of-the-art performance among recent face detection methods.

* Accepted by BTAS 2016. arXiv admin note: substantial text overlap with arXiv:1606.05413
