Short-form UGC video platforms, such as Kwai and TikTok, have become an emerging and irreplaceable mainstream media form, thriving on user-friendly engagement and kaleidoscopic creation. However, advancing content-generation modes (e.g., special effects) and sophisticated processing workflows (e.g., de-artifacts) have introduced significant challenges to UGC video quality assessment: (i) ambiguous content hinders the identification of quality-determined regions; (ii) diverse and complicated hybrid distortions are hard to distinguish. To tackle these challenges and assist the development of short-form videos, we establish the first large-scale Kaleidoscope short Video database for Quality assessment, termed KVQ, which comprises 600 user-uploaded short videos and 3600 videos processed through diverse practical workflows, including pre-processing, transcoding, and enhancement. For these videos, an absolute quality score for each video and partial ranking scores among indistinguishable samples are provided by a team of professional researchers specializing in image processing. Based on this database, we propose the first short-form video quality evaluator, KSVQE, which identifies quality-determined semantics using the content understanding of large vision-language models (i.e., CLIP) and distinguishes distortions with a distortion understanding module. Experimental results show the effectiveness of KSVQE on our KVQ database and on popular VQA databases.
This paper presents the speech restoration and enhancement system created by the 1024K team for the ICASSP 2024 Speech Signal Improvement (SSI) Challenge. Our system consists of a complex-domain generative adversarial network (GAN) for speech restoration and a fine-grained multi-band fusion module for speech enhancement. On the blind test set of SSI, the proposed system achieves an overall mean opinion score (MOS) of 3.49 based on ITU-T P.804 and a Word Accuracy Rate (WAcc) of 0.78 for the real-time track, as well as an overall P.804 MOS of 3.43 and a WAcc of 0.78 for the non-real-time track, ranking 1st in both tracks.
* Accepted to ICASSP 2024; Rank 1st in ICASSP 2024 Speech Signal
Improvement (SSI) Challenge
Interpersonal relationship quality is pivotal in social and occupational contexts. Existing analyses of interpersonal relationships mostly rely on subjective self-reports, whereas objective quantification remains challenging. In this paper, we propose a novel social relationship analysis framework using spatio-temporal patterns derived from dyadic EEG signals, which can be applied to quantitatively measure team cooperation in corporate team building and to evaluate interpersonal dynamics between therapists and patients in psychiatric therapy. First, we construct a dyadic-EEG dataset from 72 pairs of participants with one of two relationships (stranger or friend), recorded while they watch emotional videos simultaneously. Then we propose a deep neural network on dyadic-subject EEG signals, in which we combine a dynamic graph convolutional neural network for characterizing the relationships among the EEG channels with one-dimensional convolution for extracting information from the time sequence. To obtain feature vectors from two EEG recordings that represent the relationship between the two subjects well, we integrate deep canonical correlation analysis and triplet loss for training the network. Experimental results show that the social relationship type (stranger or friend) between two individuals can be effectively identified through their EEG data.
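The triplet loss mentioned in the abstract above can be illustrated with a minimal sketch. This is only the standard hinge-style triplet loss on plain feature vectors; how the paper combines it with deep canonical correlation analysis is not stated in the abstract, and the 2-D embeddings, the `margin` value, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(0.0, gap)


# Hypothetical 2-D embeddings (illustrative values, not data from the paper).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

# Positive is already much closer than the negative: no penalty.
loss_easy = triplet_loss(anchor, positive, negative)
# Roles swapped: the "positive" is farther than the "negative", so the loss is large.
loss_hard = triplet_loss(anchor, negative, positive)
```

Minimizing this quantity pulls embeddings of same-relationship pairs together and pushes different-relationship pairs at least `margin` apart.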
This paper presents a Gaussian Process (GP) framework, a non-parametric technique widely acknowledged for regression and classification tasks, to address inverse problems in mean field games (MFGs). By leveraging GPs, we aim to recover agents' strategic actions and the environment's configurations from partial and noisy observations of the population of agents and the setup of the environment. Our method is a probabilistic tool for inferring the behaviors of agents in MFGs from data in scenarios where a comprehensive dataset is either inaccessible or contaminated by noise.
Speech bandwidth extension (BWE) has demonstrated promising performance in enhancing perceptual speech quality in real communication systems. Most existing BWE research focuses primarily on fixed upsampling ratios, disregarding the fact that the effective bandwidth of captured audio may fluctuate frequently due to varying capturing devices and transmission conditions. In this paper, we propose a novel streaming adaptive bandwidth extension solution dubbed BAE-Net, which can handle low-resolution speech with unknown and varying effective bandwidth. To address the challenge of blindly recovering both the high-frequency magnitude and phase content of speech, we devise a dual-stream architecture that incorporates magnitude inpainting and phase refinement. For potential applications on edge devices, this paper also introduces BAE-NET-lite, a lightweight, streaming, and efficient framework. Quantitative results demonstrate the superiority of BAE-Net in terms of both performance and computational efficiency when compared with existing state-of-the-art BWE methods.
Semi-supervised learning (SSL) has been proven to be a powerful method for leveraging unlabelled data to alleviate models' dependence on large labelled datasets. The common framework among recent approaches is to train the model on a large amount of unlabelled data with consistency regularization, which constrains the model predictions to be invariant to input perturbation. However, the consistency regularization used in existing SSL frameworks still has room for improvement. Instead of regularizing category predictions in the label space as in existing frameworks, this paper proposes a feature space renormalization (FSR) mechanism for SSL. First, the FSR mechanism substitutes for the commonly used consistency regularization mechanism in order to learn better discriminative features. To apply this mechanism, we start by building a basic model and an empirical model and then renormalize the feature learning of the basic model with the guidance of the empirical model. Second, we combine the proposed mechanism with pseudo-labelling to obtain a novel and effective SSL model named FreMatch. The experimental results show that our method achieves better performance on a variety of standard SSL benchmark datasets, and that the proposed feature space renormalization mechanism can also enhance the performance of other SSL approaches.
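For context, the label-space consistency regularization that the abstract above contrasts with can be sketched as follows. This is a generic illustration of penalizing disagreement between predictions for two perturbed views of the same unlabelled input; it is not the FSR mechanism, whose details the abstract does not give. The logits and function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(logits_view_a, logits_view_b):
    """Mean squared difference between class predictions for two perturbed views."""
    p = softmax(logits_view_a)
    q = softmax(logits_view_b)
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


# Hypothetical logits a model might produce for two augmentations of one input.
same = consistency_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])      # identical predictions
shifted = consistency_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])   # predictions disagree
```

The loss is zero when the two views yield the same class distribution and grows as the predictions diverge, which is the invariance the regularizer encourages.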
Collaborative filtering (CF) is a widely employed technique that predicts user preferences based on past interactions. Negative sampling plays a vital role in training CF-based models with implicit feedback. In this paper, we propose a novel perspective based on the sampling area to revisit existing sampling methods. We point out that current sampling methods mainly focus on Point-wise or Line-wise sampling, which lacks flexibility and leaves a significant portion of the hard sampling area unexplored. To address this limitation, we propose Dimension Independent Mixup for Hard Negative Sampling (DINS), which is the first Area-wise sampling method for training CF-based models. DINS comprises three modules: Hard Boundary Definition, Dimension Independent Mixup, and Multi-hop Pooling. Experiments with real-world datasets on both matrix factorization and graph-based models demonstrate that DINS outperforms other negative sampling methods. Our work contributes a new perspective, introduces Area-wise sampling, and presents DINS as a novel approach that achieves state-of-the-art performance for negative sampling. Our implementations are available in PyTorch.
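The distinction between Line-wise and Area-wise sampling in the abstract above can be illustrated with a toy sketch. It assumes, purely for illustration, that a synthetic negative is formed by mixing a positive embedding with a negative embedding: a single shared coefficient keeps the result on the line segment between them, while an independent coefficient per dimension lets it fall anywhere in the box they span. The abstract does not say how DINS chooses its coefficients, so the uniform random weights and function names here are hypothetical.

```python
import random


def linewise_mixup(pos, neg, rng):
    """One shared coefficient: the result lies on the segment between pos and neg."""
    lam = rng.random()
    return [lam * p + (1.0 - lam) * n for p, n in zip(pos, neg)]


def dimension_independent_mixup(pos, neg, rng):
    """One coefficient per dimension: the result can land anywhere in the
    axis-aligned box spanned by pos and neg (an area rather than a line)."""
    mixed = []
    for p, n in zip(pos, neg):
        lam = rng.random()  # drawn independently for every dimension
        mixed.append(lam * p + (1.0 - lam) * n)
    return mixed


rng = random.Random(0)           # fixed seed so the sketch is reproducible
pos = [1.0, 0.0, 2.0]            # hypothetical positive item embedding
neg = [0.0, 1.0, -2.0]           # hypothetical negative item embedding

line_sample = linewise_mixup(pos, neg, rng)
area_sample = dimension_independent_mixup(pos, neg, rng)
```

Both samples stay within the per-dimension range of the two embeddings, but only the line-wise sample is constrained to a one-dimensional set.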
In this paper, we propose a simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS) empowered transmission scheme for symbiotic radio (SR) systems to provide more flexibility for network deployment and to enhance system performance. The STAR-RIS is utilized not only to beam the primary signals from the base station (BS) towards multiple primary users on the same side of the STAR-RIS, but also to achieve the secondary transmission to the secondary users on the other side. We consider both the broadcasting signal model and the unicasting signal model at the BS. For each model, we aim to minimize the transmit power of the BS by designing the active beamforming and the simultaneous reflection and transmission coefficients under the practical phase correlation constraint. To solve the formulated problem, we propose a block coordinate descent based algorithm with the semidefinite relaxation, penalty dual decomposition, and successive convex approximation methods, which decomposes the original problem into one sub-problem on active beamforming and another on the simultaneous reflection and transmission coefficients, and iteratively solves them until convergence is achieved. Numerical results indicate that the proposed scheme can reduce transmit power by up to 150.6% compared to the backscattering device enabled scheme.
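The alternating structure of block coordinate descent described in the abstract above can be sketched on a toy problem. The two scalar blocks `x` and `y` stand in for the two sub-problems (active beamforming and the reflection/transmission coefficients); the quadratic objective, the closed-form block updates, and the stopping tolerance are all hypothetical and bear no relation to the paper's actual formulation, which relies on semidefinite relaxation, penalty dual decomposition, and successive convex approximation.

```python
def objective(x, y):
    """Toy convex objective with coupled blocks: (x - 1)^2 + (y - 2)^2 + x*y."""
    return (x - 1.0) ** 2 + (y - 2.0) ** 2 + x * y


def block_coordinate_descent(x=5.0, y=-3.0, tol=1e-10, max_iters=100):
    """Alternately minimize over one block while the other is held fixed."""
    prev = objective(x, y)
    for iteration in range(1, max_iters + 1):
        x = 1.0 - y / 2.0   # exact minimizer over x with y fixed (d/dx = 0)
        y = 2.0 - x / 2.0   # exact minimizer over y with x fixed (d/dy = 0)
        cur = objective(x, y)
        if prev - cur < tol:  # stop once the objective no longer decreases
            break
        prev = cur
    return x, y, iteration


x_opt, y_opt, n_iters = block_coordinate_descent()
```

Because each block update never increases the objective, the sequence of objective values is monotone, which is why a decrease-based stopping rule is a natural convergence check for this kind of scheme.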
In this paper, we propose a robust secure transmission scheme for an active reconfigurable intelligent surface (RIS) enabled symbiotic radio (SR) system in the presence of multiple eavesdroppers (Eves). In the considered system, the active RIS is adopted to enable the secure transmission of primary signals from the primary transmitter to multiple primary users in a multicasting manner, and simultaneously to deliver its own information to the secondary user by riding over the primary signals. Taking into account the imperfect channel state information (CSI) related to the Eves, we formulate the system power consumption minimization problem by optimizing the transmit beamforming and reflection beamforming for the bounded and statistical CSI error models, taking into consideration the worst-case SNR constraints and the SNR outage probability constraints at the Eves, respectively. Specifically, the S-Procedure and the Bernstein-Type Inequality are applied to approximately transform the worst-case SNR constraints and the SNR outage probability constraints into tractable forms, respectively. The formulated problems are then solved by the proposed alternating optimization (AO) algorithm with the semi-definite relaxation and sequential rank-one constraint relaxation techniques. Numerical results show that the proposed active RIS scheme can reduce system power consumption by up to 27.0% compared to the passive RIS.
* 32 Pages, 12 figures, accepted to IEEE Transactions on Wireless
A key challenge for LiDAR-based 3D object detection is to capture sufficient features from large-scale 3D scenes, especially for distant and/or occluded objects. Despite recent efforts with Transformers, which offer long-sequence modeling capability, existing methods fail to properly balance accuracy and efficiency, suffering from inadequate receptive fields or coarse-grained holistic correlations. In this paper, we propose an Octree-based Transformer, named OcTr, to address this issue. It first constructs a dynamic octree on the hierarchical feature pyramid by conducting self-attention on the top level and then recursively propagates to the level below, restricted by the octants, which captures rich global context in a coarse-to-fine manner while keeping the computational complexity under control. Furthermore, for enhanced foreground perception, we propose a hybrid positional embedding, composed of a semantic-aware positional embedding and an attention mask, to fully exploit semantic and geometric clues. Extensive experiments are conducted on the Waymo Open Dataset and the KITTI Dataset, and OcTr reaches new state-of-the-art results.