In this paper, we propose a personalized neural speech codec, envisioning that personalization can reduce model complexity or improve perceptual speech quality. Although speech codecs are commonly used in settings where only a single talker is involved on each side of the communication, personalizing a codec for a specific user has rarely been explored in the literature. First, we assume that speakers can be grouped into smaller subsets based on their perceptual similarity. We then postulate that a group-specific codec can focus on the group's speech characteristics to improve perceptual quality and computational efficiency. To this end, we first develop a Siamese network that learns speaker embeddings from the LibriSpeech dataset, which are then grouped into underlying speaker clusters. Finally, we retrain the LPCNet-based speech codec baselines on each of the speaker clusters. Subjective listening tests show that the proposed personalization scheme achieves model compression while maintaining speech quality; equivalently, at the same model complexity, the personalized codecs produce better speech quality.
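The grouping step described above, clustering learned speaker embeddings into speaker subsets, can be illustrated with a minimal sketch. This is a toy stand-in, assuming 2-D embeddings and plain k-means; the abstract does not specify the paper's actual clustering method or embedding dimensionality.

```python
import numpy as np

def kmeans(embeddings, k, iters=50, seed=0):
    """Cluster speaker embeddings into k groups with plain k-means."""
    rng = np.random.default_rng(seed)
    # initialize centers from k randomly chosen embeddings
    centers = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        # assign each embedding to its nearest center
        d = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its assigned embeddings
        for j in range(k):
            if (labels == j).any():
                centers[j] = embeddings[labels == j].mean(axis=0)
    return labels, centers

# toy data: two well-separated "speaker" groups of 10 embeddings each
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(0.0, 0.1, (10, 2)),
                 rng.normal(5.0, 0.1, (10, 2))])
labels, _ = kmeans(emb, k=2)
```

In the personalization scheme, each resulting cluster would then get its own retrained codec.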
This paper provides a computational analysis of poetry reading audio signals at a large scale to unveil the musicality within professionally read poems. Although the acoustic characteristics of other types of spoken language have been extensively studied, most of the literature is limited to narrative speech or the singing voice and to how the two differ from each other. In this work, we develop signal processing methods tailored to capture the unique acoustic characteristics of poetry reading, based on silence patterns, temporal variations of local pitch, and beat stability. Our large-scale statistical analyses of three large corpora, covering narration (LibriSpeech), singing voice (Intonation), and poetry reading (from The Poetry Foundation), reveal that poetry reading does share some musical characteristics with the singing voice, although it may also resemble narrative speech.
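Of the three acoustic characteristics named above, the silence pattern is the simplest to make concrete. A toy illustration, assuming plain frame-energy thresholding on a synthetic signal (the paper's actual feature extraction is not reproduced here), might look like:

```python
import numpy as np

def silence_ratio(x, frame=400, threshold=1e-3):
    """Fraction of fixed-size frames whose mean energy is below a threshold."""
    n = len(x) // frame
    frames = x[:n * frame].reshape(n, frame)
    energy = (frames ** 2).mean(axis=1)
    return (energy < threshold).mean()

# synthetic input: 1 s of a 220 Hz tone followed by 1 s of silence at 16 kHz
sr = 16000
t = np.arange(sr) / sr
speech = 0.5 * np.sin(2 * np.pi * 220 * t)
x = np.concatenate([speech, np.zeros(sr)])
ratio = silence_ratio(x)  # half the frames are silent
```

Statistics of such silence patterns could then be compared across narration, singing, and poetry corpora.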
The limited capacity of fronthaul links in a cell-free massive multiple-input multiple-output (MIMO) system can cause quantization errors at the central processing unit (CPU) during data transmission, complicating the centralized rate optimization problem. To address this challenge, we propose a harmony search (HS)-based algorithm that renders the combinatorial non-convex problem tractable. A distinctive feature of our algorithm is its hierarchical structure: it first allocates resources at the access point (AP) level and subsequently optimizes for user equipment (UE), ensuring a more efficient and structured approach to resource allocation. Our algorithm handles rigorous conditions, such as asymmetric fronthaul bit allocation and distinct quantization error levels at each AP, which were not considered in previous works. We derive a closed-form expression of the signal-to-interference-plus-noise ratio (SINR), in which the additive quantization noise model (AQNM)-based distortion error is taken into account, to define the spectral efficiency (SE) of each UE. We also analyze the computational complexity and convergence of the proposed algorithm to investigate its practicality. By considering various performance metrics such as total SE and max-min fairness, we demonstrate that the proposed algorithm can adaptively optimize the fronthaul bit allocation depending on system requirements. Finally, simulation results show that the proposed algorithm achieves satisfactory performance at low computational complexity compared with the exhaustive search method.
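For context, the additive quantization noise model (AQNM) referred to above is commonly written in the following standard textbook form (this is not necessarily the paper's exact notation):

```latex
\mathbf{y}_q = \alpha\,\mathbf{y} + \mathbf{n}_q, \qquad \alpha = 1 - \rho,
```

where $\mathbf{y}$ is the unquantized signal, $\rho$ is the distortion factor determined by the number of quantization bits, and the quantization noise $\mathbf{n}_q$ is uncorrelated with $\mathbf{y}$ and has covariance $\alpha(1-\alpha)\,\mathrm{diag}\!\left(\mathbb{E}\!\left[\mathbf{y}\mathbf{y}^{\mathsf H}\right]\right)$. Distinct bit allocations per AP translate into distinct $\alpha$ values, which is what makes the asymmetric setting harder to optimize.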
Creating personalized hand avatars is important for offering a realistic experience to users on AR/VR platforms. While most prior studies focused on reconstructing 3D hand shapes, some recent work has tackled the reconstruction of hand textures on top of shapes. However, these methods are often limited to capturing pixels on the visible side of a hand, requiring diverse views of the hand in a video or multiple images as input. In this paper, we propose a novel method, BiTT (Bi-directional Texture reconstruction of Two hands), which is the first end-to-end trainable method for relightable, pose-free texture reconstruction of two interacting hands from only a single RGB image, built on three novel components: 1) bi-directional (left $\leftrightarrow$ right) texture reconstruction using the texture symmetry of the left and right hands, 2) a texture parametric model for hand texture recovery, and 3) an overall coarse-to-fine pipeline for reconstructing the personalized texture of two interacting hands. BiTT first estimates the scene light condition and the albedo image from an input image, then reconstructs the texture of both hands through the texture parametric model and the bi-directional texture reconstructor. In experiments using the InterHand2.6M and RGB2Hands datasets, our method significantly outperforms state-of-the-art hand texture reconstruction methods both quantitatively and qualitatively. The code is available at https://github.com/yunminjin2/BiTT
The indoor scenes we live in tend to be visually homogeneous or textureless, yet they inherently have structural forms that provide enough structural priors for 3D scene reconstruction. Motivated by this fact, we propose a structure-aware online signed distance fields (SDF) reconstruction framework for indoor scenes, in particular under the Atlanta world (AW) assumption, and dub this incremental SDF reconstruction for AW AiSDF. Within the online framework, we infer the underlying Atlanta structure of a given scene and then estimate planar surfel regions supporting that structure. This Atlanta-aware surfel representation provides an explicit planar map for a given scene. In addition, based on these Atlanta planar surfel regions, we adaptively sample and constrain the structural regularity in the SDF reconstruction, which enables us to improve the reconstruction quality by maintaining high-level structure while enhancing the details of a given scene. We evaluate the proposed AiSDF on the ScanNet and ReplicaCAD datasets, demonstrating that the proposed framework is capable of reconstructing fine details of objects implicitly, as well as structures explicitly, in room-scale scenes.
Satellite networks are emerging as vital solutions for global connectivity beyond 5G. As companies such as SpaceX, OneWeb, and Amazon are poised to launch a large number of satellites into low Earth orbit, the heightened inter-satellite interference caused by mega-constellations has become a significant concern. To address this challenge, recent works have introduced the concept of satellite cluster networks, in which multiple satellites in a cluster collaborate to enhance network performance. To investigate the performance of these networks, we provide a mathematical analysis that models the locations of satellites and users as Poisson point processes, building on the success of stochastic geometry-based analyses for satellite networks. In particular, we derive lower and upper bounds of the coverage probability as functions of the system parameters, including satellite density, satellite altitude, satellite cluster area, path loss exponent, and Nakagami parameter $m$. We validate the analytical expressions by comparing them with simulation results. Our analyses can be used to design reliable satellite cluster networks by effectively estimating the impact of the system parameters on coverage performance.
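The kind of Monte Carlo simulation used to validate such stochastic geometry expressions can be sketched in a heavily simplified form. The sketch below assumes a planar Poisson point process of interferers in a disk, Rayleigh fading (the Nakagami $m=1$ special case), a serving transmitter fixed at unit distance, and negligible noise, so coverage reduces to an SIR threshold test; the paper's actual model (satellites on a sphere, cluster geometry, general $m$) is considerably richer.

```python
import numpy as np

def coverage_prob(density, radius, alpha, theta, n_trials=2000, seed=0):
    """Monte Carlo SIR coverage probability for a planar PPP of interferers.

    density: interferers per unit area; radius: disk radius around the user;
    alpha: path loss exponent; theta: SIR threshold.
    """
    rng = np.random.default_rng(seed)
    area = np.pi * radius ** 2
    hits = 0
    for _ in range(n_trials):
        n = rng.poisson(density * area)         # number of interferers
        r = radius * np.sqrt(rng.random(n))     # uniform locations in the disk
        h = rng.exponential(size=n)             # Rayleigh power fading
        interference = np.sum(h * np.maximum(r, 1e-3) ** (-alpha))
        signal = rng.exponential()              # serving link at unit distance
        if signal > theta * interference:
            hits += 1
    return hits / n_trials

sparse = coverage_prob(0.001, 50.0, 4.0, 1.0)
dense = coverage_prob(0.1, 50.0, 4.0, 1.0)
```

As expected, coverage degrades as the interferer density grows, which is the qualitative trend the analytical bounds are meant to capture.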
In this work, we explore the task of hierarchical distance-based speech separation defined on a hyperbolic manifold. Motivated by the recent advent of audio-related tasks performed in non-Euclidean spaces, we propose to make use of the Poincar\'e ball to effectively unveil the inherent hierarchical structure found in complex speaker mixtures. We design two sets of experiments in which the distance-based parent sound classes, namely "near" and "far", can each contain up to two or three speakers (i.e., children). We show that our hyperbolic approach is suitable for unveiling the hierarchical structure implied by the problem definition, resulting in improved child-level separation. We further show that a clear correlation emerges between the notion of hyperbolic certainty (i.e., the distance to the ball's origin) and acoustic semantics such as speaker density, inter-source location, and microphone-to-speaker distance.
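The geodesic distance on the Poincar\'e ball underlying such hyperbolic approaches is standard; a minimal implementation (independent of the paper's model) is:

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball:
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))."""
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    nu2 = sum(a * a for a in u)
    nv2 = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * diff2 / ((1.0 - nu2) * (1.0 - nv2)))
```

Near the origin the metric is almost Euclidean, while distances blow up toward the boundary; this is why the distance to the origin can serve as a certainty measure, with confident embeddings pushed outward.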
In low-bitrate speech coding, end-to-end speech coding networks aim to learn compact yet expressive features and a powerful decoder within a single network. Such a challenging problem results in an unwelcome increase in complexity and inferior speech quality. In this paper, we propose to separate the representation learning and information reconstruction tasks. We leverage an end-to-end codec for learning low-dimensional discrete tokens and employ a latent diffusion model to de-quantize the coded features into a high-dimensional continuous space, relieving the decoder of the burden of de-quantizing and upsampling. To mitigate the issue of over-smooth generation, we introduce midway-infilling with less noise reduction and stronger conditioning. In ablation studies, we investigate the hyperparameters for midway-infilling and latent diffusion spaces of different dimensions. Subjective listening tests show that our model outperforms the state-of-the-art at two low bitrates, 1.5 and 3 kbps. Code and samples of this work are available on our webpage.
Spectral sub-bands do not carry the same perceptual relevance. In audio coding, it is therefore desirable to have independent control over each of the constituent bands so that bit allocation and signal reconstruction can be performed efficiently. In this work, we present a novel neural audio coding network that natively supports a multi-band coding paradigm. Our model extends the idea of compressed skip connections in the U-Net-based codec, allowing for independent control over both the core-band and high-band reconstructions and their bit allocation. Our system reconstructs the full-band signal mainly from the condensed core-band code, exploiting and showcasing its bandwidth extension capabilities to the fullest. Meanwhile, the low-bitrate high-band code aids the high-band reconstruction, similarly to spectral band replication in MPEG audio codecs. MUSHRA tests show that the proposed model not only improves the quality of the core band by explicitly assigning more bits to it, but also retains good quality in the high band.
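The core-band/high-band decomposition at the heart of such multi-band coding can be illustrated with an ideal FFT-mask two-band split. This is a toy stand-in: the model above learns band-specific codes inside the network rather than using ideal filters.

```python
import numpy as np

def split_bands(x, sr, cutoff):
    """Split a signal into core (low) and high bands with an ideal FFT mask.

    The two masks partition the spectrum, so core + high reconstructs x.
    """
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    low = X * (freqs < cutoff)
    high = X * (freqs >= cutoff)
    return np.fft.irfft(low, n=len(x)), np.fft.irfft(high, n=len(x))

# synthetic input: strong 300 Hz tone plus a weak 6 kHz tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 6000 * t)
core, high = split_bands(x, sr, cutoff=4000)
```

In a multi-band codec, the core band would receive most of the bits while the weaker high band is coded cheaply and used mainly to guide the reconstruction.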