Abstract:Speech tokenization is crucial in digital speech processing, converting continuous speech signals into discrete units for various computational tasks. This paper introduces a novel speech tokenizer with broad applicability across downstream tasks. While recent advances in residual vector quantization (RVQ) have incorporated semantic elements, they often neglect critical acoustic features. We propose an advanced approach that simultaneously encodes both linguistic and acoustic information, preserving prosodic and emotional content. Our method significantly enhances speech representation fidelity across diverse applications. Empirical evaluations demonstrate its effectiveness in speech coding, voice conversion, emotion recognition, and multimodal language modeling, without requiring additional training. This versatility underscores its potential as a key tool for advancing AI-driven speech processing.
Abstract:Partial Adaptation (PDA) addresses a practical scenario in which the target domain contains only a subset of classes in the source domain. While PDA should take into account both class-level and sample-level to mitigate negative transfer, current approaches mostly rely on only one of them. In this paper, we propose a novel approach to fully exploit multi-level associations that can arise in PDA. Our Associative Partial Domain Adaptation (APDA) utilizes intra-domain association to actively select out non-trivial anomaly samples in each source-private class that sample-level weighting cannot handle. Additionally, our method considers inter-domain association to encourage positive transfer by mapping between nearby target samples and source samples with high label-commonness. For this, we exploit feature propagation in a proposed label space consisting of source ground-truth labels and target probabilistic labels. We further propose a geometric guidance loss based on the label commonness of each source class to encourage positive transfer. Our APDA consistently achieves state-of-the-art performance across public datasets.
Abstract:This paper proposes key instance selection based on video saliency covering objectness and dynamics for unsupervised video object segmentation (UVOS). Our method takes frames sequentially and extracts object proposals with corresponding masks for each frame. We link objects according to their similarity until the M-th frame and then assign them unique IDs (i.e., instances). Similarity measure takes into account multiple properties such as ReID descriptor, expected trajectory, and semantic co-segmentation result. After M-th frame, we select K IDs based on video saliency and frequency of appearance; then only these key IDs are tracked through the remaining frames. Thanks to these technical contributions, our results are ranked third on the leaderboard of UVOS DAVIS challenge.