Jin Xu

Multi-Block UAMP Detection for AFDM under Fractional Delay-Doppler Channel

Oct 15, 2024

Jin Xu, Zijian Liang, Kai Niu

Abstract: Affine Frequency Division Multiplexing (AFDM) is considered a promising solution for next-generation wireless systems due to its satisfactory performance in high-mobility scenarios. By adjusting AFDM parameters to match the multi-path delay and Doppler shift, AFDM can achieve two-dimensional time-frequency diversity gain. However, under fractional delay-Doppler channels, AFDM encounters energy dispersion in the affine domain, which poses significant challenges for signal detection. This paper first investigates the AFDM system model under fractional delay-Doppler channels. To address the energy dispersion in the affine domain, a unitary transformation based approximate message passing (UAMP) algorithm is proposed. The algorithm performs unitary transformations and message passing in the time domain to avoid the energy dispersion issue. Additionally, we implement block-wise processing to reduce computational complexity. Finally, the empirical extrinsic information transfer (E-EXIT) chart is used to evaluate iterative detection performance. Simulation results show that UAMP significantly outperforms GAMP under fractional delay-Doppler conditions.

* 6 pages, 6 figures, submitted to IEEE Wireless Communications and Networking Conference (WCNC) 2025

Perception Compressor: A training-free prompt compression method in long context scenarios

Sep 28, 2024

Jiwei Tang, Jin Xu, Tingwei Lu, Hai Lin, Yiming Zhao, Hai-Tao Zheng

Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities in various scenarios. However, they suffer from much redundant information and tend to be lost in the middle in long context scenarios, leading to inferior performance. To address these challenges, we present Perception Compressor, a training-free prompt compression method. It includes a dual-slope ratio allocator to dynamically assign compression ratios and open-book ratios, a perception retriever that leverages guiding questions and instruction to retrieve the most relevant demonstrations, and a semi-guided iterative compression that retains key information at the token level while removing tokens that distract the LLM. We conduct extensive experiments on long context benchmarks, i.e., NaturalQuestions, LongBench, and MuSiQue. Experimental results show that Perception Compressor outperforms existing methods by a large margin, achieving state-of-the-art performance.

* 9 pages, 2 figures

Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models

Sep 28, 2024

Wenrui Liu, Zhifang Guo, Jin Xu, Yuanjun Lv, Yunfei Chu, Zhou Zhao, Junyang Lin

Abstract: Building upon advancements in Large Language Models (LLMs), the field of audio processing has seen increased interest in training audio generation tasks with discrete audio token sequences. However, directly discretizing audio by neural audio codecs often results in sequences that fundamentally differ from text sequences. Unlike text, where text token sequences are deterministic, discrete audio tokens can exhibit significant variability based on contextual factors, while still producing perceptually identical audio segments. We refer to this phenomenon as Discrete Representation Inconsistency (DRI). This inconsistency can lead to a single audio segment being represented by multiple divergent sequences, which creates confusion in neural codec language models and results in omissions and repetitions during speech generation. In this paper, we quantitatively analyze the DRI phenomenon within popular audio tokenizers such as EnCodec. Our approach effectively mitigates the DRI phenomenon of the neural audio codec. Furthermore, extensive experiments on the neural codec language model over LibriTTS and large-scale MLS datasets (44,000 hours) demonstrate the effectiveness and generality of our method. The demo of audio samples is available online at https://consistencyinneuralcodec.github.io.

* e.g.: 15 pages, 4 figures

Leveraging Annotator Disagreement for Text Classification

Sep 26, 2024

Jin Xu, Mariët Theune, Daniel Braun

Abstract: It is common practice in text classification to only use one majority label for model training even if a dataset has been annotated by multiple annotators. Doing so can remove valuable nuances and diverse perspectives inherent in the annotators' assessments. This paper proposes and compares three different strategies to leverage annotator disagreement for text classification: a probability-based multi-label method, an ensemble system, and instruction tuning. All three approaches are evaluated on the tasks of hate speech and abusive conversation detection, which inherently entail a high degree of subjectivity. Moreover, to evaluate the effectiveness of embracing annotation disagreements for model training, we conduct an online survey that compares the performance of the multi-label model against a baseline model, which is trained with the majority label. The results show that in hate speech detection, the multi-label method outperforms the other two approaches, while in abusive conversation detection, instruction tuning achieves the best performance. The results of the survey also show that the outputs from the multi-label models are considered a better representation of the texts than the single-label model.

SongTrans: An unified song transcription and alignment method for lyrics and notes

Sep 22, 2024

Siwei Wu, Jinzheng He, Ruibin Yuan, Haojie Wei, Xipin Wei, Chenghua Lin, Jin Xu, Junyang Lin

Abstract: The quantity of processed data is crucial for advancing the field of singing voice synthesis. While there are tools available for lyric or note transcription tasks, they all need pre-processed data, which is relatively time-consuming to produce (e.g., vocal and accompaniment separation). Besides, most of these tools are designed to address a single task and struggle with aligning lyrics and notes (i.e., identifying the corresponding notes of each word in lyrics). To address those challenges, we first design a pipeline by optimizing existing tools and annotating numerous lyric-note pairs of songs. Then, based on the annotated data, we train a unified SongTrans model that can directly transcribe lyrics and notes while aligning them simultaneously, without requiring songs to be pre-processed. Our SongTrans model consists of two modules: (1) the Autoregressive module, which predicts the lyrics, along with the duration and note number corresponding to each word in a lyric; and (2) the Non-autoregressive module, which predicts the pitch and duration of the notes. Our experiments demonstrate that SongTrans achieves state-of-the-art (SOTA) results in both lyric and note transcription tasks. Furthermore, it is the first model capable of aligning lyrics with notes. Experimental results demonstrate that the SongTrans model can effectively adapt to different types of songs (e.g., songs with accompaniment), showcasing its versatility for real-world applications.

Qwen2 Technical Report

Jul 16, 2024

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang(+49 more)

Abstract: This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.

* 25 pages, 1 figure

Qwen2-Audio Technical Report

Jul 15, 2024

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin(+2 more)

Abstract: We introduce the latest progress of Qwen-Audio, a large-scale audio-language model called Qwen2-Audio, which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions. In contrast to complex hierarchical tags, we have simplified the pre-training process by utilizing natural language prompts for different data and tasks, and have further expanded the data volume. We have boosted the instruction-following capability of Qwen2-Audio and implemented two distinct audio interaction modes for voice chat and audio analysis. In the voice chat mode, users can freely engage in voice interactions with Qwen2-Audio without text input. In the audio analysis mode, users can provide audio and text instructions for analysis during the interaction. Note that we do not use any system prompts to switch between voice chat and audio analysis modes. Qwen2-Audio is capable of intelligently comprehending the content within audio and following voice commands to respond appropriately. For instance, in an audio segment that simultaneously contains sounds, multi-speaker conversations, and a voice command, Qwen2-Audio can directly understand the command and provide an interpretation and response to the audio. Additionally, DPO has optimized the model's performance in terms of factuality and adherence to desired behavior. According to the evaluation results from AIR-Bench, Qwen2-Audio outperformed previous SOTAs, such as Gemini-1.5-pro, in tests focused on audio-centric instruction-following capabilities. Qwen2-Audio is open-sourced with the aim of fostering the advancement of the multi-modal language community.

* https://github.com/QwenLM/Qwen2-Audio. Checkpoints, code, and scripts will be open-sourced soon

VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool

Jul 07, 2024

Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, Xiangmin Xu

Abstract: Multimodal large language models (MLLMs) are flourishing, but mainly focus on images, with less attention paid to videos, especially in sub-fields such as prompt engineering, video chain-of-thought (CoT), and instruction tuning on videos. Therefore, we try to explore the collection of CoT datasets in videos to lead to video OpenQA and improve the reasoning ability of MLLMs. Unfortunately, making such video CoT datasets is not an easy task. Given that human annotation is too cumbersome and expensive, while machine-generated annotation is not reliable due to the hallucination issue, we develop an automatic annotation tool that combines machine and human experts under the active learning paradigm. Active learning is an interactive strategy between the model and human experts; in this way, the workload of human labeling can be reduced and the quality of the dataset can be guaranteed. With the help of the automatic annotation tool, we strive to contribute three datasets, namely VideoCoT, TopicQA, and TopicCoT. Furthermore, we propose a simple but effective benchmark based on the collected datasets, which exploits CoT to maximize the complex reasoning capabilities of MLLMs. Extensive experiments demonstrate the effectiveness of our solution.

* ACL 2024 Workshop

cPAPERS: A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers

Jun 12, 2024

Anirudh Sundar, Jin Xu, William Gay, Christopher Richardson, Larry Heck

Abstract: An emerging area of research in situated and multimodal interactive conversations (SIMMC) includes interactions in scientific papers. Since scientific papers are primarily composed of text, equations, figures, and tables, SIMMC methods must be developed specifically for each component to support the depth of inquiry and interactions required by research scientists. This work introduces Conversational Papers (cPAPERS), a dataset of conversational question-answer pairs from reviews of academic papers grounded in these paper components and their associated references from scientific documents available on arXiv. We present a data collection strategy to collect these question-answer pairs from OpenReview and associate them with contextual information from LaTeX source files. Additionally, we present a series of baseline approaches utilizing Large Language Models (LLMs) in both zero-shot and fine-tuned configurations to address the cPAPERS dataset.

* 14 pages, 1 figure

Secrecy Performance Analysis of Multi-Functional RIS-Assisted NOMA Networks

May 17, 2024

Yingjie Pei, Wanli Ni, Jin Xu, Xinwei Yue, Xiaofeng Tao, Dusit Niyato

Abstract: Although a reconfigurable intelligent surface (RIS) can improve the secrecy communication performance of wireless users, it still faces challenges such as limited coverage and the double-fading effect. To address these issues, in this paper, we utilize a novel multi-functional RIS (MF-RIS) to enhance the secrecy performance of wireless users, and investigate the physical layer secrecy problem in non-orthogonal multiple access (NOMA) networks. Specifically, we derive closed-form expressions for the secrecy outage probability (SOP) and secrecy throughput of users in the MF-RIS-assisted NOMA networks with external and internal eavesdroppers. The asymptotic expressions for SOP and secrecy diversity order are also analyzed under high signal-to-noise ratio (SNR) conditions. Additionally, we examine the impact of receiver hardware limitations and error transmission-induced imperfect successive interference cancellation (SIC) on the secrecy performance. Numerical results indicate that: i) under the same power budget, the secrecy performance achieved by MF-RIS significantly outperforms active RIS and simultaneously transmitting and reflecting RIS; ii) with increasing power budget, residual interference caused by imperfect SIC surpasses thermal noise as the primary factor affecting secrecy capacity; and iii) deploying additional elements at the MF-RIS brings significant secrecy enhancements for the external eavesdropping scenario, in contrast to the internal eavesdropping case.

* 14 pages, 9 figures, submitted to IEEE Transactions on Wireless Communications
