Peng Liu

Alzheimer's Disease Neuroimaging Initiative, the Australian Imaging Biomarkers and Lifestyle flagship study of ageing

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Feb 18, 2025

Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen(+135 more)

Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution for intelligent speech interaction. Key contributions include: 1) a 130B-parameter unified speech-text multimodal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks such as LLaMA Question, Step-Audio shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multimodal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.

Relational Norms for Human-AI Cooperation

Feb 17, 2025

Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh(+52 more)

Abstract: How we should design and interact with social artificial intelligence depends on the socio-relational role the AI is meant to emulate or occupy. In human society, relationships such as teacher-student, parent-child, neighbors, siblings, or employer-employee are governed by specific norms that prescribe or proscribe cooperative functions including hierarchy, care, transaction, and mating. These norms shape our judgments of what is appropriate for each partner. For example, workplace norms may allow a boss to give orders to an employee, but not vice versa, reflecting hierarchical and transactional expectations. As AI agents and chatbots powered by large language models are increasingly designed to serve roles analogous to human positions - such as assistant, mental health provider, tutor, or romantic partner - it is imperative to examine whether and how human relational norms should extend to human-AI interactions. Our analysis explores how differences between AI systems and humans, such as the absence of conscious experience and immunity to fatigue, may affect an AI's capacity to fulfill relationship-specific functions and adhere to corresponding norms. This analysis, which is a collaborative effort by philosophers, psychologists, relationship scientists, ethicists, legal experts, and AI researchers, carries important implications for AI systems design, user behavior, and regulation. While we accept that AI systems can offer significant benefits such as increased availability and consistency in certain socio-relational roles, they also risk fostering unhealthy dependencies or unrealistic expectations that could spill over into human-human relationships. We propose that understanding and thoughtfully shaping (or implementing) suitable human-AI relational norms will be crucial for ensuring that human-AI interactions are ethical, trustworthy, and favorable to human well-being.

* 76 pages, 2 figures

Mamba-MOC: A Multicategory Remote Object Counting via State Space Model

Jan 12, 2025

Peng Liu, Sen Lei, Heng-Chao Li

Abstract: Multicategory remote object counting is a fundamental task in computer vision, aimed at accurately estimating the number of objects of various categories in remote sensing images. Existing methods rely on CNNs and Transformers, but CNNs struggle to capture global dependencies, and Transformers are computationally expensive, which limits their effectiveness in remote sensing applications. Recently, Mamba has emerged as a promising solution in computer vision, offering linear complexity for modeling global dependencies. To this end, we propose Mamba-MOC, a Mamba-based network designed for multicategory remote sensing object counting, which represents the first application of Mamba to remote sensing object counting. Specifically, we propose a cross-scale interaction module to facilitate the deep integration of hierarchical features. We then design a context state space model to capture both global and local contextual information and to provide local neighborhood information during the scan process. Experimental results in large-scale realistic scenarios demonstrate that our proposed method achieves state-of-the-art performance compared with mainstream counting algorithms.
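The linear-complexity claim comes from the recurrent form of a discretized state space model, which processes a flattened sequence of image tokens in a single pass. The sketch below is a generic, non-selective scan with fixed diagonal parameters, assuming the zero-order-hold discretization common to S4/Mamba-style models; it is not Mamba-MOC's context state space model, and all function names are illustrative.

```python
import math
from typing import List, Tuple


def discretize(a: List[float], b: List[float], dt: float) -> Tuple[List[float], List[float]]:
    """Zero-order-hold discretization of a diagonal continuous-time SSM.

    a: diagonal of the continuous state matrix A (negative entries for stability)
    b: input vector B (one entry per state dimension)
    Returns (a_bar, b_bar) with a_bar = exp(dt*a) and b_bar = (exp(dt*a) - 1) / a * b.
    """
    a_bar = [math.exp(dt * ai) for ai in a]
    b_bar = [(ab - 1.0) / ai * bi for ab, ai, bi in zip(a_bar, a, b)]
    return a_bar, b_bar


def ssm_scan(x: List[float], a_bar: List[float], b_bar: List[float], c: List[float]) -> List[float]:
    """Run h_t = a_bar * h_{t-1} + b_bar * x_t and y_t = c . h_t in O(L * N) time."""
    h = [0.0] * len(a_bar)
    ys = []
    for xt in x:  # one pass over the sequence: cost grows linearly with its length L
        h = [ab * hi + bb * xt for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# Example: a flattened 2x2 patch of pixel features scanned by a 2-state SSM.
a_bar, b_bar = discretize([-1.0, -0.5], [1.0, 1.0], dt=0.1)
ys = ssm_scan([0.2, 0.4, 0.1, 0.9], a_bar, b_bar, c=[1.0, 1.0])
```

Unlike self-attention, which compares every pair of tokens, the scan carries a fixed-size state forward, so doubling the number of tokens doubles the work rather than quadrupling it.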

Explainable Neural Networks with Guarantees: A Sparse Estimation Approach

Jan 02, 2025

Antoine Ledent, Peng Liu

Abstract: Balancing predictive power and interpretability has long been a challenging research area, particularly for powerful yet complex models such as neural networks, where nonlinearity obstructs direct interpretation. This paper introduces a novel approach to constructing an explainable neural network that harmonizes predictiveness and explainability. Our model, termed SparXnet, is designed as a linear combination of a sparse set of jointly learned features, each derived from a different trainable function applied to a single one-dimensional input feature. Leveraging the ability to learn arbitrarily complex relationships, our architecture automatically selects a sparse set of important features, with the final prediction being a linear combination of rescaled versions of these features. Through extensive experiments on synthetic and real-world datasets, we demonstrate that the model selects significant features while maintaining comparable predictive performance and direct interpretability. We also provide a theoretical analysis of the generalization bounds of our framework, which are favorably linear in the number of selected features and only logarithmic in the number of input features. We further lift any dependence of the sample complexity on the number of parameters or the architectural details under very mild conditions. Our research paves the way for further work on sparse and explainable neural networks with guarantees.
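The model structure described above, a sparse linear combination of per-feature learned transforms, can be sketched as a forward pass plus a sparsity step. This is a minimal sketch assuming each per-feature function is a tiny one-hidden-layer tanh network and that sparsity comes from an L1 proximal (soft-thresholding) update on the output weights; the abstract does not say how SparXnet enforces sparsity, so that mechanism and every name below (`FeatureNet`, `SparseAdditiveNet`, `prox_step`) are hypothetical.

```python
import math
import random
from typing import List


class FeatureNet:
    """Hypothetical trainable 1-D -> 1-D function: sum_k a_k * tanh(w_k * x + c_k)."""

    def __init__(self, hidden: int, rng: random.Random):
        self.w = [rng.gauss(0, 1) for _ in range(hidden)]
        self.c = [rng.gauss(0, 1) for _ in range(hidden)]
        self.a = [rng.gauss(0, 1) / hidden for _ in range(hidden)]

    def __call__(self, x: float) -> float:
        return sum(ak * math.tanh(wk * x + ck) for ak, wk, ck in zip(self.a, self.w, self.c))


class SparseAdditiveNet:
    """Prediction = bias + sum_j beta_j * f_j(x_j); features with beta_j == 0 are dropped."""

    def __init__(self, n_features: int, hidden: int = 8, seed: int = 0):
        rng = random.Random(seed)
        self.f = [FeatureNet(hidden, rng) for _ in range(n_features)]
        self.beta = [1.0] * n_features  # rescaling weights on the learned features
        self.bias = 0.0

    def predict(self, x: List[float]) -> float:
        return self.bias + sum(b * fj(xj) for b, fj, xj in zip(self.beta, self.f, x) if b != 0.0)

    def selected(self) -> List[int]:
        """Indices of input features that still contribute to the prediction."""
        return [j for j, b in enumerate(self.beta) if b != 0.0]


def soft_threshold(v: float, t: float) -> float:
    """Proximal operator of t*|v|: shrinks v toward 0 and zeroes it when |v| <= t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def prox_step(model: SparseAdditiveNet, grad_beta: List[float], lr: float, lam: float) -> None:
    """One proximal-gradient update on beta for an L1-penalized loss (assumed mechanism)."""
    model.beta = [soft_threshold(b - lr * g, lr * lam) for b, g in zip(model.beta, grad_beta)]


m = SparseAdditiveNet(n_features=3)
prox_step(m, grad_beta=[-20.0, 0.0, 0.0], lr=0.1, lam=10.0)
print(m.selected())  # [0]: only feature 0 survives the soft-threshold
```

Because each surviving term depends on a single input, the fitted f_j can be plotted against x_j to read off that feature's contribution directly.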

Model-driven deep neural network for enhanced direction finding with commodity 5G gNodeB

Dec 14, 2024

Shengheng Liu, Zihuan Mao, Xingkang Li, Mengguan Pan, Peng Liu, Yongming Huang, Xiaohu You

Abstract: Pervasive and high-accuracy positioning has become increasingly important as a fundamental enabler for intelligent connected devices in mobile networks. Nevertheless, current wireless networks rely heavily on purely model-driven techniques to achieve positioning functionality and often suffer performance deterioration due to hardware impairments in practical scenarios. Here we reformulate the direction finding or angle-of-arrival (AoA) estimation problem as an image recovery task on the spatial spectrum and propose a new model-driven deep neural network (MoD-DNN) framework. The proposed MoD-DNN scheme comprises three modules: a multi-task autoencoder-based beamformer, a coarray spectrum generation module, and a model-driven deep learning-based spatial spectrum reconstruction module. Our technique enables automatic calibration of the angle-dependent phase error, thereby enhancing the resilience of direction-finding precision against realistic system non-idealities. We validate the proposed scheme using both numerical simulations and field tests. The results show that the proposed MoD-DNN framework enables effective spectrum calibration and accurate AoA estimation. To the best of our knowledge, this study marks the first successful demonstration of hybrid data-and-model-driven direction finding using readily available commodity 5G gNodeB.

* To appear in ACM TOSN. A preliminary version of this article was presented at the AAAI 2024 Main Technical Track
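For context on the spatial spectrum that MoD-DNN reconstructs, the sketch below computes the classical model-driven spectrum: a conventional (Bartlett) beamformer scan over a uniform linear array with half-wavelength spacing. It assumes an ideal, impairment-free array and a single noiseless snapshot, which is exactly the idealization the abstract says breaks down on real hardware. It is a baseline for comparison, not the MoD-DNN method, and the function names are illustrative.

```python
import cmath
import math
from typing import List


def steering_vector(theta_deg: float, n_ant: int, d_over_lambda: float = 0.5) -> List[complex]:
    """Ideal ULA response to a plane wave arriving from theta (measured from broadside)."""
    phase = 2 * math.pi * d_over_lambda * math.sin(math.radians(theta_deg))
    return [cmath.exp(-1j * phase * n) for n in range(n_ant)]


def bartlett_spectrum(snapshot: List[complex], grid_deg: List[float],
                      d_over_lambda: float = 0.5) -> List[float]:
    """P(theta) = |a(theta)^H x|^2 / N^2 evaluated over an angle grid."""
    n = len(snapshot)
    spectrum = []
    for theta in grid_deg:
        a = steering_vector(theta, n, d_over_lambda)
        # a^H x: conjugate the steering vector and correlate it with the snapshot
        proj = sum(ai.conjugate() * xi for ai, xi in zip(a, snapshot))
        spectrum.append(abs(proj) ** 2 / n ** 2)
    return spectrum


def estimate_aoa(snapshot: List[complex], grid_deg: List[float], d_over_lambda: float = 0.5) -> float:
    """Return the grid angle at which the spatial spectrum peaks."""
    p = bartlett_spectrum(snapshot, grid_deg, d_over_lambda)
    return grid_deg[max(range(len(p)), key=p.__getitem__)]


# Example: 8-element array, source at 20 degrees, 0.5-degree search grid.
grid = [-90 + 0.5 * k for k in range(361)]
x = steering_vector(20.0, 8)
print(estimate_aoa(x, grid))  # 20.0 in this noiseless, ideal-array case
```

An angle-dependent phase error on each antenna would distort `snapshot` away from any ideal steering vector and shift or smear this peak, which is the calibration problem the paper addresses with a learned component.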

The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective

Dec 12, 2024

Javier de la Rosa, Vladislav Mikhailov, Lemei Zhang, Freddy Wetjen, David Samuel, Peng Liu, Rolv-Arild Braaten, Petter Mæhlum, Magnus Breder Birkenes, Andrey Kutuzov(+8 more)

Abstract: The use of copyrighted materials in training generative language models raises critical legal and ethical questions. This paper presents a framework for, and the results of, empirically assessing the impact of copyrighted materials on the performance of large language models (LLMs) for Norwegian. We found that both books and newspapers contribute positively when the models are evaluated on a diverse set of Norwegian benchmarks, while works of fiction possibly lead to decreased performance. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.

* pre-print, under review

Radiology Report Generation via Multi-objective Preference Optimization

Dec 12, 2024

Ting Xiao, Lei Shi, Peng Liu, Zhe Wang, Chenjia Bai

Abstract: Automatic Radiology Report Generation (RRG) is an important topic for alleviating the substantial workload of radiologists. Existing RRG approaches rely on supervised regression based on different architectures or additional knowledge injection, while the generated reports may not align optimally with radiologists' preferences. This is especially problematic because radiologists' preferences are inherently heterogeneous and multidimensional; e.g., some may prioritize report fluency, while others emphasize clinical accuracy. To address this problem, we propose a new RRG method via Multi-objective Preference Optimization (MPO) to align the pre-trained RRG model with multiple human preferences, which can be formulated by multi-dimensional reward functions and optimized by multi-objective reinforcement learning (RL). Specifically, we use a preference vector to represent the weights of the preferences and use it as a condition for the RRG model. Then, a linearly weighted reward is obtained via a dot product between the preference vector and the multi-dimensional reward. Next, the RRG model is optimized to align with the preference vector by optimizing this reward via RL. In the training stage, we randomly sample diverse preference vectors from the preference space and align the model by optimizing the weighted multi-objective rewards, which leads to an optimal policy over the entire preference space. At inference time, our model can generate reports aligned with specific preferences without further fine-tuning. Extensive experiments on two public datasets show that the proposed method can generate reports that cater to different preferences in a single model and achieve state-of-the-art performance.

* 11 pages, 3 figures
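The scalarization step described in the abstract, sampling a preference vector and taking its dot product with a multi-dimensional reward, is simple enough to write out. This sketch assumes preference vectors are drawn uniformly from the probability simplex (a flat Dirichlet, generated by normalizing exponential draws); the abstract only says vectors are sampled randomly from the preference space, and the two reward dimension names are hypothetical, chosen to match its fluency/accuracy example.

```python
import random
from typing import Dict, List

# Hypothetical reward dimensions, echoing the fluency vs. clinical accuracy example.
DIMENSIONS = ["fluency", "clinical_accuracy"]


def sample_preference(k: int, rng: random.Random) -> List[float]:
    """Uniform sample from the (k-1)-simplex: normalized i.i.d. Exp(1) draws ~ Dirichlet(1, ..., 1)."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def weighted_reward(pref: List[float], rewards: Dict[str, float]) -> float:
    """Linearly weighted reward: dot product of the preference vector and the reward vector."""
    return sum(w * rewards[name] for w, name in zip(pref, DIMENSIONS))


rng = random.Random(0)
pref = sample_preference(len(DIMENSIONS), rng)
r = weighted_reward(pref, {"fluency": 0.8, "clinical_accuracy": 0.4})
```

In the method as described, the same `pref` vector is also fed to the report generator as a condition, so a single trained policy can later be steered by choosing weights such as [0.2, 0.8] at inference time without retraining.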

5G NR monostatic positioning with array impairments: Data-and-model-driven framework and experiment results

Dec 11, 2024

Shengheng Liu, Hao Wang, Mengguan Pan, Peng Liu, Yahui Ma, Yongming Huang

Abstract: In this article, we present an intelligent framework for 5G new radio (NR) indoor positioning under a monostatic configuration. The primary objective is to estimate both the angle of arrival and the time of arrival simultaneously. This requires capturing the pertinent information from both the antenna and subcarrier dimensions of the received signals. To tackle the challenges posed by the complexity of the high-dimensional information matrix, coupled with the impact of irregular array errors, we design a deep learning scheme. Recognizing that the phase difference between any two subcarriers and antennas encodes spatial information about the target, we contend that the transformer network is better suited to this problem than the convolutional neural network, which excels in local feature extraction. To further enhance the network's fitting capability, we integrate the transformer with a model-based multiple signal classification (MUSIC) region decision mechanism. Numerical results and field tests demonstrate the effectiveness of the proposed framework in accurately calibrating the irregular angle-dependent array error and improving positioning accuracy.

* Presented at MobiCom 2023
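The observation that phase differences across antennas and subcarriers encode the target's angle and delay can be made concrete with an idealized single-path model. Assuming a noiseless channel H[n][k] = exp(-j2π(n·(d/λ)·sinθ + k·Δf·τ)) on a half-wavelength uniform linear array, adjacent-antenna phase differences give θ and adjacent-subcarrier phase differences give τ. This is a worked illustration, not the paper's transformer or MUSIC pipeline; the 30 kHz subcarrier spacing is an assumed numerology, and the irregular array errors the paper targets are ignored.

```python
import cmath
import math
from typing import List, Tuple

C = 299_792_458.0  # speed of light, m/s


def synth_channel(theta_deg: float, tau_s: float, n_ant: int, n_sc: int,
                  d_over_lambda: float = 0.5, delta_f: float = 30e3) -> List[List[complex]]:
    """Idealized single-path channel matrix: rows are antennas, columns are subcarriers."""
    s = d_over_lambda * math.sin(math.radians(theta_deg))
    return [[cmath.exp(-2j * math.pi * (n * s + k * delta_f * tau_s)) for k in range(n_sc)]
            for n in range(n_ant)]


def estimate_angle_delay(h: List[List[complex]], d_over_lambda: float = 0.5,
                         delta_f: float = 30e3) -> Tuple[float, float]:
    """Estimate (theta in degrees, tau in seconds) from averaged adjacent phase differences."""
    n_ant, n_sc = len(h), len(h[0])
    # Correlate neighbours along each axis; the phase of the sum is the per-step phase shift.
    ant_corr = sum(h[n + 1][k] * h[n][k].conjugate() for n in range(n_ant - 1) for k in range(n_sc))
    sc_corr = sum(h[n][k + 1] * h[n][k].conjugate() for n in range(n_ant) for k in range(n_sc - 1))
    dphi_ant = -cmath.phase(ant_corr)  # = 2*pi*(d/lambda)*sin(theta)
    dphi_sc = -cmath.phase(sc_corr)    # = 2*pi*delta_f*tau, unambiguous for tau < 1/(2*delta_f)
    theta = math.degrees(math.asin(dphi_ant / (2 * math.pi * d_over_lambda)))
    tau = dphi_sc / (2 * math.pi * delta_f)
    return theta, tau


theta, tau = estimate_angle_delay(synth_channel(25.0, 40e-9, n_ant=4, n_sc=16))
range_m = C * tau / 2  # monostatic: the measured delay covers the round trip
```

With real hardware, each antenna adds an unknown angle-dependent phase offset, so the antenna-axis phase step is no longer constant; this is one way to see why the paper learns a calibration rather than relying on such closed-form estimates.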

Yi-Lightning Technical Report

Dec 03, 2024

01.AI: Alan Wake, Albert Wang, Bei Chen, C. X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng(+33 more)

Abstract: This technical report presents Yi-Lightning, our latest flagship large language model (LLM). It achieves exceptional performance, ranking 6th overall on Chatbot Arena, with particularly strong results (2nd to 4th place) in specialized categories including Chinese, Math, Coding, and Hard Prompts. Yi-Lightning leverages an enhanced Mixture-of-Experts (MoE) architecture, featuring advanced expert segmentation and routing mechanisms coupled with optimized KV-caching techniques. Our development process encompasses comprehensive pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), where we devise deliberate strategies for multi-stage training, synthetic data construction, and reward modeling. Furthermore, we implement RAISE (Responsible AI Safety Engine), a four-component framework to address safety issues across pre-training, post-training, and serving phases. Empowered by our scalable super-computing infrastructure, all these innovations substantially reduce training, deployment and inference costs while maintaining high-performance standards. With further evaluations on public academic benchmarks, Yi-Lightning demonstrates competitive performance against top-tier LLMs, while we observe a notable disparity between traditional, static benchmark results and real-world, dynamic human preferences. This observation prompts a critical reassessment of conventional benchmarks' utility in guiding the development of more intelligent and powerful AI systems for practical applications. Yi-Lightning is now available through our developer platform at https://platform.lingyiwanwu.com.
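Yi-Lightning's specific expert segmentation and routing mechanisms are not detailed in the abstract, so the sketch below shows only the generic Mixture-of-Experts pattern they build on: a softmax gate scores every expert, the top-k experts are kept, their gate weights are renormalized, and the layer output is the weighted sum of those experts' outputs. The toy experts, the gate logits (passed in directly rather than computed by a learned router from the token), and k=2 are all assumptions for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[List[float]], List[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_route(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their gate probabilities."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x: List[float], gate_logits: Sequence[float],
                experts: List[Expert], k: int = 2) -> List[float]:
    """Combine only the selected experts' outputs; unselected experts are never evaluated."""
    out = [0.0] * len(x)
    for i, w in top_k_route(gate_logits, k):
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Toy experts that scale their input; in a real model each would be a feed-forward network.
experts: List[Expert] = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
y = moe_forward([1.0, -1.0], gate_logits=[0.1, 2.0, 0.3, 2.0], experts=experts, k=2)
```

Since only k experts run per token, compute per token stays roughly constant as experts are added, which is the cost argument behind MoE architectures in general.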

SkillTree: Explainable Skill-Based Deep Reinforcement Learning for Long-Horizon Control Tasks

Nov 19, 2024

Yongyan Wen, Siyuan Li, Rongchang Zuo, Lei Yuan, Hangyu Mao, Peng Liu

Abstract: Deep reinforcement learning (DRL) has achieved remarkable success in various research domains. However, its reliance on neural networks results in a lack of transparency, which limits its practical applications. To achieve explainability, decision trees have emerged as a popular and promising alternative to neural networks. Nonetheless, due to their limited expressiveness, traditional decision trees struggle with high-dimensional, long-horizon continuous control tasks. In this paper, we propose SkillTree, a novel framework that reduces complex continuous action spaces to discrete skill spaces. Our hierarchical approach integrates a differentiable decision tree within the high-level policy to generate skill embeddings, which subsequently guide the low-level policy in executing skills. By making skill decisions explainable, we achieve skill-level explainability, enhancing the understanding of the decision-making process in complex tasks. Experimental results demonstrate that our method achieves performance comparable to skill-based neural networks in complex robotic arm control domains. Furthermore, SkillTree offers explanations at the skill level, thereby increasing the transparency of the decision-making process.
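A differentiable decision tree replaces each hard split with a sigmoid, so every leaf receives a probability and the tree can be trained by gradient descent; taking the most probable leaf at decision time recovers a readable, discrete choice. The sketch below is a generic soft decision tree whose leaves hold skill embeddings, assuming a breadth-first node layout and fixed, hand-set parameters. SkillTree's actual architecture, training procedure, and embedding sizes are not given in the abstract, so the class and its parameters are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class SoftDecisionTree:
    """Complete binary tree; internal node i routes right with prob sigmoid(w_i . s + b_i)."""

    def __init__(self, depth: int, weights: List[List[float]], biases: List[float],
                 leaf_skills: List[List[float]]):
        assert len(weights) == len(biases) == 2 ** depth - 1
        assert len(leaf_skills) == 2 ** depth
        self.depth, self.w, self.b, self.skills = depth, weights, biases, leaf_skills

    def leaf_probs(self, state: List[float]) -> List[float]:
        """Probability of reaching each leaf = product of branch probabilities on its path."""
        probs = [1.0]
        for level in range(self.depth):
            nxt: List[float] = []
            for j, p in enumerate(probs):
                node = 2 ** level - 1 + j  # breadth-first index of the j-th node on this level
                r = sigmoid(sum(wi * si for wi, si in zip(self.w[node], state)) + self.b[node])
                nxt += [p * (1.0 - r), p * r]  # left child, then right child
            probs = nxt
        return probs

    def soft_skill(self, state: List[float]) -> List[float]:
        """Differentiable skill embedding: probability-weighted mix of leaf embeddings."""
        probs = self.leaf_probs(state)
        dim = len(self.skills[0])
        return [sum(p * emb[d] for p, emb in zip(probs, self.skills)) for d in range(dim)]

    def hard_skill(self, state: List[float]) -> Tuple[int, List[float]]:
        """Explainable decision: index and embedding of the most probable leaf."""
        probs = self.leaf_probs(state)
        best = max(range(len(probs)), key=probs.__getitem__)
        return best, self.skills[best]


tree = SoftDecisionTree(
    depth=2,
    weights=[[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]],  # root tests s[0]; children test s[1]
    biases=[0.0, 0.0, 0.0],
    leaf_skills=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]],
)
leaf, skill = tree.hard_skill([5.0, 5.0])  # leaf 2: s[0] > 0 goes right, then s[1] > 0 goes left
```

Each hard decision can be explained by listing the split conditions along the chosen path (here "s[0] > 0, then s[1] > 0"), while the soft embedding keeps the whole tree trainable end to end.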
