Xiao Sun

DEFormer: DCT-driven Enhancement Transformer for Low-light Image and Dark Vision

Sep 13, 2023
Xiangchen Yin, Zhenda Yu, Xin Gao, Ran Ju, Xiao Sun, Xinyu Zhang

The goal of low-light image enhancement is to restore the color and details of an image, which is of great significance for high-level visual tasks such as those in autonomous driving. However, it is difficult to restore the details lost in dark areas by relying on the RGB domain alone. In this paper we introduce frequency as a new clue for the network and propose a novel DCT-driven enhancement transformer (DEFormer). First, we propose a learnable frequency branch (LFB) for frequency enhancement that contains DCT processing and curvature-based frequency enhancement (CFE). CFE calculates the curvature of each channel to represent the detail richness of different frequency bands; we then divide the frequency features so that the network focuses on the bands with richer textures. In addition, we propose a cross-domain fusion (CDF) module to reduce the differences between the RGB domain and the frequency domain. We also adopt DEFormer as a preprocessing step for dark detection: DEFormer effectively improves the performance of the detector, bringing mAP improvements of 2.1% and 3.4% on the ExDark and DARK FACE datasets, respectively.
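
A minimal sketch of the curvature-based band-selection idea described above, assuming an 8x8 block DCT and a simple second-difference curvature proxy for detail richness; the helper names (block_dct, curvature_energy) and the top-k selection are illustrative, not the authors' exact CFE implementation.

```python
import numpy as np
from scipy.fft import dctn

def block_dct(img: np.ndarray, block: int = 8) -> np.ndarray:
    """Split an HxW image into 8x8 tiles and return 64 per-band coefficient maps."""
    h, w = img.shape
    h, w = h - h % block, w - w % block
    tiles = img[:h, :w].reshape(h // block, block, w // block, block).swapaxes(1, 2)
    coeffs = dctn(tiles, axes=(-2, -1), norm="ortho")           # (H/8, W/8, 8, 8)
    return coeffs.reshape(h // block, w // block, block * block).transpose(2, 0, 1)

def curvature_energy(bands: np.ndarray) -> np.ndarray:
    """Mean absolute second difference per band as a proxy for texture richness."""
    d2y = np.abs(np.diff(bands, n=2, axis=1)).mean(axis=(1, 2))
    d2x = np.abs(np.diff(bands, n=2, axis=2)).mean(axis=(1, 2))
    return d2y + d2x                                            # one score per band (64,)

img = np.random.rand(256, 256).astype(np.float32)
bands = block_dct(img)                                          # 64 frequency bands
rich = np.argsort(curvature_energy(bands))[-16:]                # focus on the richest bands
print("bands selected for enhancement:", sorted(rich.tolist()))
```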

* Submitted to ICRA 2024 

PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine

Aug 23, 2023
Chenrui Zhang, Lin Liu, Jinpeng Wang, Chuyuan Wang, Xiao Sun, Hongyu Wang, Mingchen Cai

As an effective tool for eliciting the power of Large Language Models (LLMs), prompting has recently demonstrated unprecedented abilities across a variety of complex tasks. To further improve performance, prompt ensembling has attracted substantial interest as a way to tackle the hallucination and instability of LLMs. However, existing methods usually adopt a two-stage paradigm, which requires a pre-prepared set of prompts built with substantial manual effort and cannot perform directed optimization for different weak learners. In this paper, we propose a simple, universal, and automatic method named PREFER (Prompt Ensemble learning via Feedback-Reflect-Refine) to address these limitations. Specifically, given that weak learners are supposed to focus on hard examples during boosting, PREFER builds a feedback mechanism for reflecting on the inadequacies of existing weak learners. Based on this, the LLM is required to automatically synthesize new prompts for iterative refinement. Moreover, to enhance the stability of prompt-effect evaluation, we propose a novel prompt bagging method involving forward and backward thinking, which is superior to majority voting and is beneficial for both feedback and weight calculation in boosting. Extensive experiments demonstrate that PREFER achieves state-of-the-art performance on multiple types of tasks by a significant margin. We have made our code publicly available.
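
A hedged sketch of the feedback-reflect-refine boosting loop described above. The llm argument is a hypothetical black-box completion function, and the prompt templates and example weighting follow a generic AdaBoost-style scheme rather than PREFER's exact recipe.

```python
import math
from typing import Callable, List, Tuple

def boost_prompts(llm: Callable[[str], str],
                  data: List[Tuple[str, str]],
                  seed_prompt: str,
                  rounds: int = 3):
    weights = [1.0 / len(data)] * len(data)
    prompt, learners = seed_prompt, []
    for _ in range(rounds):
        # Evaluate the current weak learner (prompt) on the weighted data.
        wrong = [i for i, (x, y) in enumerate(data) if llm(f"{prompt}\n{x}").strip() != y]
        err = max(min(sum(weights[i] for i in wrong), 1 - 1e-6), 1e-6)
        alpha = 0.5 * math.log((1 - err) / err)
        learners.append((prompt, alpha))
        # Feedback: collect hard examples; reflect and refine: ask the LLM for a new prompt.
        hard = "\n".join(f"input: {data[i][0]} expected: {data[i][1]}" for i in wrong[:5])
        prompt = llm(
            f"The prompt '{prompt}' failed on these examples:\n{hard}\n"
            "Reflect on its weaknesses and write an improved prompt."
        )
        # Re-weight the data so later learners focus on the mistakes.
        weights = [w * (math.exp(alpha) if i in set(wrong) else math.exp(-alpha))
                   for i, w in enumerate(weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return learners  # (prompt, weight) pairs for weighted-vote inference
```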

* 8 pages, 4 figures 

Revisiting Neural Retrieval on Accelerators

Jun 06, 2023
Jiaqi Zhai, Zhaojie Gong, Yueming Wang, Xiao Sun, Zheng Yan, Fu Li, Xing Liu

Retrieval identifies a small number of relevant candidates from a large corpus for information retrieval and recommendation applications. A key component of retrieval is modeling (user, item) similarity, which is commonly represented as the dot product of two learned embeddings. This formulation permits efficient inference, commonly known as Maximum Inner Product Search (MIPS). Despite its popularity, the dot product cannot capture complex user-item interactions, which are multifaceted and likely high rank. We hence examine non-dot-product retrieval settings on accelerators and propose mixture of logits (MoL), which models (user, item) similarity as an adaptive composition of elementary similarity functions. This new formulation is expressive, capable of modeling high-rank (user, item) interactions, and further generalizes to the long tail. When combined with a hierarchical retrieval strategy, h-indexer, we are able to scale MoL up to a 100M-item corpus on a single GPU with latency comparable to MIPS baselines. On public datasets, our approach leads to uplifts of up to 77.3% in hit rate (HR). Experiments on a large recommendation surface at Meta showed strong metric gains and reduced popularity bias, validating the proposed approach's performance and improved generalization.
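
A hedged PyTorch sketch of a mixture-of-logits similarity under common assumptions: K elementary dot-product similarities computed over split embeddings, combined by adaptive gates conditioned on the (user, item) pair. The dimensions and the gating MLP are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MixtureOfLogits(nn.Module):
    def __init__(self, dim: int = 128, num_logits: int = 8):
        super().__init__()
        assert dim % num_logits == 0
        self.k, self.d = num_logits, dim // num_logits
        self.gate = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(),
                                  nn.Linear(64, num_logits))

    def forward(self, user: torch.Tensor, item: torch.Tensor) -> torch.Tensor:
        # user: (B, dim), item: (B, dim) -> similarity: (B,)
        u = user.view(-1, self.k, self.d)
        v = item.view(-1, self.k, self.d)
        logits = (u * v).sum(-1)                        # K elementary dot products
        gates = torch.softmax(self.gate(torch.cat([user, item], dim=-1)), dim=-1)
        return (gates * logits).sum(-1)                 # adaptive composition

sim = MixtureOfLogits()(torch.randn(4, 128), torch.randn(4, 128))
print(sim.shape)  # torch.Size([4])
```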

* To appear in the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2023) 

Bi-ViT: Pushing the Limit of Vision Transformer Quantization

May 21, 2023
Yanjing Li, Sheng Xu, Mingbao Lin, Xianbin Cao, Chuanjian Liu, Xiao Sun, Baochang Zhang

Vision transformer (ViT) quantization offers a promising prospect for deploying large pre-trained networks on resource-limited devices. Fully binarized ViTs (Bi-ViT), which push ViT quantization to its limit, remain largely unexplored and are a very challenging task due to their unacceptable performance drop. Through extensive empirical analyses, we identify that the severe drop under ViT binarization is caused by attention distortion in self-attention, which technically stems from gradient vanishing and ranking disorder. To address these issues, we first introduce a learnable scaling factor to reactivate the vanished gradients and illustrate its effectiveness through theoretical and experimental analyses. We then propose a ranking-aware distillation method to rectify the disordered ranking in a teacher-student framework. Bi-ViT achieves significant improvements over popular DeiT and Swin backbones in terms of Top-1 accuracy and FLOPs. For example, with DeiT-Tiny and Swin-Tiny, our method significantly outperforms baselines by 22.1% and 21.4% respectively, while delivering 61.5x and 56.1x theoretical acceleration in terms of FLOPs compared with real-valued counterparts on ImageNet.
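
A hedged sketch of the learnable-scaling idea: attention scores are binarized with a straight-through estimator and rescaled by a learnable per-head factor so that gradient magnitudes do not vanish. This illustrates the concept only and is not the authors' exact Bi-ViT formulation.

```python
import torch
import torch.nn as nn

class BinarySTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return (x >= 0).float()                 # binarize attention scores to {0, 1}

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                         # straight-through gradient

class BinarizedAttention(nn.Module):
    def __init__(self, num_heads: int = 3):
        super().__init__()
        # Learnable per-head scaling factor that "reactivates" gradient magnitude.
        self.alpha = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (B, heads, N, N) raw attention logits
        return self.alpha * BinarySTE.apply(scores)

attn = BinarizedAttention()(torch.randn(2, 3, 16, 16))
print(attn.shape)  # torch.Size([2, 3, 16, 16])
```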

Long-Term Rhythmic Video Soundtracker

May 02, 2023
Jiashuo Yu, Yaohui Wang, Xinyuan Chen, Xiao Sun, Yu Qiao

We consider the problem of generating musical soundtracks in sync with rhythmic visual cues. Most existing works rely on pre-defined music representations, which limits generative flexibility and complexity. Other methods that directly generate video-conditioned waveforms suffer from limited scenarios, short lengths, and unstable generation quality. To this end, we present Long-Term Rhythmic Video Soundtracker (LORIS), a novel framework to synthesize long-term conditional waveforms. Specifically, our framework consists of a latent conditional diffusion probabilistic model that performs waveform synthesis. Furthermore, a series of context-aware conditioning encoders is proposed to take temporal information into consideration for long-term generation. Notably, we extend our model's applicability from dances to multiple sports scenarios such as floor exercise and figure skating. To perform comprehensive evaluations, we establish a benchmark for rhythmic video soundtracking, including a pre-processed dataset, improved evaluation metrics, and robust generative baselines. Extensive experiments show that our model generates long-term soundtracks with state-of-the-art musical quality and rhythmic correspondence. Code is available at https://github.com/OpenGVLab/LORIS.
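
A generic, heavily hedged sketch of latent conditional diffusion sampling: a denoiser predicts noise given the current latent and video-derived conditioning features, and the latent is iteratively refined. Here denoiser and cond are placeholders, not LORIS components, and the noise schedule is a standard DDPM choice.

```python
import torch

@torch.no_grad()
def sample_latent(denoiser, cond: torch.Tensor, steps: int = 50,
                  shape=(1, 64, 256)) -> torch.Tensor:
    betas = torch.linspace(1e-4, 2e-2, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), cond)           # noise prediction, conditioned on video
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                                 # decoded to a waveform downstream
```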

* ICML2023 

Enhancing Personalized Ranking With Differentiable Group AUC Optimization

Apr 17, 2023
Xiao Sun, Bo Zhang, Chenrui Zhang, Han Ren, Mingchen Cai

AUC is a common metric for evaluating the performance of a classifier. However, most classifiers are trained with cross entropy, which does not optimize the AUC metric directly and thus leaves a gap between the training and evaluation stages. In this paper, we propose the PDAOM loss, a Personalized and Differentiable AUC Optimization method with Maximum violation, which can be applied directly when training a binary classifier and optimized with gradient-based methods. Specifically, we construct a pairwise exponential loss over difficult pairs of positive and negative samples within sub-batches grouped by user ID, aiming to guide the classifier to pay attention to the relation between hard-to-distinguish pairs of opposite samples from the perspective of independent users. Compared to the original form of the pairwise exponential loss, the proposed PDAOM loss not only improves the AUC and GAUC metrics in offline evaluation but also reduces the computational complexity of the training objective. Furthermore, online evaluation of the PDAOM loss on the 'Guess What You Like' feed recommendation application in Meituan shows a 1.40% increase in click count and a 0.65% increase in order count compared to the baseline model, which is a significant improvement for this well-developed online life-service recommendation system.
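
A hedged PyTorch sketch of a pairwise exponential loss with "maximum violation": within each user's sub-batch, only the hardest positive (lowest score) and hardest negative (highest score) form the pair. This illustrates the idea described above; the exact margin, reduction, and batching choices are assumptions, not the paper's implementation.

```python
import torch

def pdaom_style_loss(scores: torch.Tensor, labels: torch.Tensor,
                     user_ids: torch.Tensor) -> torch.Tensor:
    losses = []
    for uid in user_ids.unique():
        m = user_ids == uid
        pos, neg = scores[m & (labels == 1)], scores[m & (labels == 0)]
        if pos.numel() == 0 or neg.numel() == 0:
            continue
        # Maximum violation: hardest positive vs. hardest negative for this user.
        losses.append(torch.exp(-(pos.min() - neg.max())))
    return torch.stack(losses).mean() if losses else scores.sum() * 0.0

scores = torch.randn(8, requires_grad=True)
labels = torch.tensor([1, 0, 1, 0, 1, 1, 0, 0])
users = torch.tensor([1, 1, 1, 1, 2, 2, 2, 2])
pdaom_style_loss(scores, labels, users).backward()
```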

* NeuRec@ICDM 2022 

Facial Affect Recognition based on Transformer Encoder and Audiovisual Fusion for the ABAW5 Challenge

Mar 20, 2023
Ziyang Zhang, Liuwei An, Zishun Cui, Ao xu, Tengteng Dong, Yueqi Jiang, Jingyi Shi, Xin Liu, Xiao Sun, Meng Wang

In this paper, we present our solutions for the 5th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW), which includes four sub-challenges: Valence-Arousal (VA) Estimation, Expression (Expr) Classification, Action Unit (AU) Detection, and Emotional Reaction Intensity (ERI) Estimation. The 5th ABAW competition focuses on facial affect recognition utilizing different modalities and datasets. In our work, we extract powerful audio and visual features using a large number of state-of-the-art models. These features are fused by a Transformer Encoder and TEMMA. Besides, to avoid the possible impact of large dimensional differences between various features, we design an Affine Module to align different features to the same dimension. Extensive experiments demonstrate the superiority of the proposed method. For the VA Estimation sub-challenge, our method obtains a mean Concordance Correlation Coefficient (CCC) of 0.6066. For the Expression Classification sub-challenge, the average F1 score is 0.4055. For the AU Detection sub-challenge, the average F1 score is 0.5296. For the Emotional Reaction Intensity Estimation sub-challenge, the average Pearson correlation coefficient on the validation set is 0.3968. All results on the four sub-challenges outperform the baseline by a large margin.
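
A minimal sketch of the feature-alignment step, under the assumption that the Affine Module is a per-modality learned affine (linear) projection into a shared dimension before Transformer-based fusion; the class and layer names are illustrative.

```python
import torch
import torch.nn as nn

class AffineAlign(nn.Module):
    def __init__(self, in_dims: dict, out_dim: int = 256):
        super().__init__()
        self.proj = nn.ModuleDict({name: nn.Linear(d, out_dim) for name, d in in_dims.items()})

    def forward(self, feats: dict) -> torch.Tensor:
        # Project each modality to out_dim, then stack as a token sequence for the encoder.
        tokens = [self.proj[name](x) for name, x in feats.items()]
        return torch.stack(tokens, dim=1)       # (B, num_modalities, out_dim)

align = AffineAlign({"audio": 1024, "visual": 768})
fused_in = align({"audio": torch.randn(4, 1024), "visual": torch.randn(4, 768)})
print(fused_in.shape)  # torch.Size([4, 2, 256])
```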

Spatial-temporal Transformer for Affective Behavior Analysis

Mar 19, 2023
Peng Zou, Rui Wang, Kehua Wen, Yasi Peng, Xiao Sun

In-the-wild affective behavior analysis has been an important area of study. In this paper, we present our solutions for the 5th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW), which includes the V-A Estimation, Facial Expression Classification, and AU Detection sub-challenges. We propose a Transformer Encoder with a Multi-Head Attention framework to learn the distribution of both spatial and temporal features. Besides, various effective data augmentation strategies are employed to alleviate the problem of sample imbalance during model training. The results fully demonstrate the effectiveness of our proposed model on the Aff-Wild2 dataset.
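
A hedged sketch of the general setup, assuming spatial features are extracted per frame and a standard nn.TransformerEncoder with multi-head attention then models the temporal dimension; the feature dimensions and layer counts are illustrative.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
temporal_encoder = nn.TransformerEncoder(layer, num_layers=4)

frame_feats = torch.randn(2, 32, 512)        # (batch, frames, per-frame spatial feature)
video_repr = temporal_encoder(frame_feats)   # multi-head attention across time
print(video_repr.shape)                      # torch.Size([2, 32, 512])
```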

Mutilmodal Feature Extraction and Attention-based Fusion for Emotion Estimation in Videos

Mar 18, 2023
Tao Shu, Xinke Wang, Ruotong Wang, Chuang Chen, Yixin Zhang, Xiao Sun

The continuous improvement of human-computer interaction technology makes it possible to compute emotions. In this paper, we introduce our submission to the CVPR 2023 Competition on Affective Behavior Analysis in-the-wild (ABAW). Sentiment analysis in human-computer interaction should, as far as possible, start from multiple dimensions to compensate for any single imperfect emotion channel, and finally determine the emotional tendency by fusing multiple results. Therefore, we exploit multimodal features extracted from videos of different lengths in the competition dataset, including audio, pose, and images. These well-informed emotion representations drive us to propose an attention-based multimodal framework for emotion estimation. Our system achieves a performance of 0.361 on the validation dataset. The code is available at https://github.com/xkwangcn/ABAW-5th-RT-IAI.
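
A minimal sketch of attention-based fusion across modalities, assuming standard cross-attention with nn.MultiheadAttention; the choice of audio tokens attending to pose/image tokens and the mean pooling are purely illustrative.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
audio = torch.randn(4, 10, 256)                    # (batch, audio tokens, dim)
visual = torch.randn(4, 20, 256)                   # (batch, pose + image tokens, dim)
fused, _ = attn(query=audio, key=visual, value=visual)
emotion = fused.mean(dim=1)                        # pooled representation for the regressor
print(emotion.shape)                               # torch.Size([4, 256])
```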

* 5 pages, 1 figures 

Aleth-NeRF: Low-light Condition View Synthesis with Concealing Fields

Mar 10, 2023
Ziteng Cui, Lin Gu, Xiao Sun, Yu Qiao, Tatsuya Harada

Low-light scenes captured under common conditions are challenging for most computer vision techniques, including Neural Radiance Fields (NeRF). Vanilla NeRF is viewer-centred: it simplifies the rendering process to light emission from 3D locations along the viewing direction, and thus fails to model the darkness induced by low illumination. Inspired by the emission theory of ancient Greece, which holds that visual perception is accomplished by rays cast from the eyes, we make slight modifications to vanilla NeRF so that, by training on multiple views of a low-light scene, we can render the well-lit scene in an unsupervised manner. We introduce a surrogate concept, Concealing Fields, which reduces the transport of light during the volume rendering stage. Specifically, our proposed method, Aleth-NeRF, learns directly from dark images to understand the volumetric object representation and the concealing field under priors. By simply eliminating the Concealing Fields, we can render single- or multi-view well-lit images and achieve superior performance over other 2D low-light enhancement methods. Additionally, we collect the first paired LOw-light and normal-light Multi-view (LOM) dataset for future research.
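
A hedged sketch of volume rendering with a per-sample concealing value in [0, 1] that attenuates transmittance when training on dark images; setting it to 1 at inference renders the enhanced scene. This is a conceptual illustration of the idea, not the exact Aleth-NeRF formulation.

```python
import torch

def render_with_concealing(sigma, rgb, conceal, deltas):
    # sigma, conceal, deltas: (num_rays, num_samples); rgb: (num_rays, num_samples, 3)
    alpha = 1.0 - torch.exp(-sigma * deltas)                         # standard NeRF opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]),
                   (1.0 - alpha) * conceal], dim=1)[:, :-1], dim=1)  # concealed transmittance
    weights = trans * alpha
    return (weights.unsqueeze(-1) * rgb).sum(dim=1)                  # rendered color (num_rays, 3)

rays, samples = 8, 64
sigma, rgb = torch.rand(rays, samples), torch.rand(rays, samples, 3)
deltas = torch.full((rays, samples), 0.01)
dark = render_with_concealing(sigma, rgb, conceal=torch.rand(rays, samples), deltas=deltas)
lit = render_with_concealing(sigma, rgb, conceal=torch.ones(rays, samples), deltas=deltas)  # concealing removed
```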

* website page: https://cuiziteng.github.io/Aleth_NeRF_web/ 