Bhiksha Raj

Language Technologies Institute, Carnegie Mellon University, Mohammed bin Zayed University of AI

Tessellated Linear Model for Age Prediction from Voice

Jan 16, 2025

Dareen Alharthi, Mahsa Zamani, Bhiksha Raj, Rita Singh

Abstract: Voice biometric tasks, such as age estimation, require modeling the often complex relationship between voice features and the biometric variable. While deep learning models can handle such complexity, they typically require large amounts of accurately labeled data to perform well. Such data are often scarce for biometric tasks such as voice-based age prediction. On the other hand, simpler models like linear regression can work with smaller datasets but often fail to generalize to the underlying non-linear patterns present in the data. In this paper, we propose the Tessellated Linear Model (TLM), a piecewise linear approach that combines the simplicity of linear models with the capacity of non-linear functions. TLM tessellates the feature space into convex regions and fits a linear model within each region. We optimize the tessellation and the linear models using hierarchical greedy partitioning. We evaluated TLM on the TIMIT dataset on the task of age prediction from voice, where it outperformed state-of-the-art deep learning models.
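
The idea of tessellating the feature space and fitting one linear model per region can be illustrated with a small sketch. This is not the authors' implementation: the axis-aligned cuts, the quantile candidate thresholds, and the names `fit_linear`, `fit_tree`, and `predict_one` are all assumptions made for illustration, and the paper's convex regions need not be axis-aligned.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y ~ [X, 1] @ w; returns the weights and the sum of squared errors."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w, float(np.sum((A @ w - y) ** 2))

def fit_tree(X, y, depth=2, min_points=10):
    """Greedy hierarchical partitioning (assumed form): cut a region in two
    only if two local linear fits have lower total error than one."""
    w, sse = fit_linear(X, y)
    node = {"w": w}
    if depth == 0:
        return node
    best = None
    for j in range(X.shape[1]):                      # candidate feature
        for t in np.quantile(X[:, j], [0.25, 0.5, 0.75]):  # candidate thresholds
            left = X[:, j] <= t
            if left.sum() < min_points or (~left).sum() < min_points:
                continue
            total = fit_linear(X[left], y[left])[1] + fit_linear(X[~left], y[~left])[1]
            if best is None or total < best[0]:
                best = (total, j, t, left)
    if best is not None and best[0] < sse:
        _, j, t, left = best
        node["split"] = (j, t)
        node["left"] = fit_tree(X[left], y[left], depth - 1, min_points)
        node["right"] = fit_tree(X[~left], y[~left], depth - 1, min_points)
    return node

def predict_one(node, x):
    """Walk down to the region containing x and apply that region's linear model."""
    while "split" in node:
        j, t = node["split"]
        node = node["left"] if x[j] <= t else node["right"]
    return float(np.append(x, 1.0) @ node["w"])
```

On a target such as y = |x|, a single linear model fits poorly, while one cut near zero lets two linear pieces fit it almost exactly, which is the behaviour the abstract attributes to piecewise linear models.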

SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

Dec 14, 2024

Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, Emad Barsoum

Abstract: Efficient image tokenization with high compression ratios remains a critical challenge for training generative models. We present SoftVQ-VAE, a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token, substantially increasing the representation capacity of the latent space. When applied to Transformer-based architectures, our approach compresses 256x256 and 512x512 images using as few as 32 or 64 1-dimensional tokens. Not only does SoftVQ-VAE show consistent and high-quality reconstruction, but, more importantly, it also achieves state-of-the-art and significantly faster image generation results across different denoising-based generative models. Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256x256 images and 55x for 512x512 images while achieving competitive FID scores of 1.78 and 2.21 for SiT-XL. It also improves the training efficiency of the generative models by reducing the number of training iterations by 2.3x while maintaining comparable performance. With its fully-differentiable design and semantic-rich latent space, our experiments demonstrate that SoftVQ-VAE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models. Code and model are released.

* Code and model: https://github.com/Hhhhhhao/continuous_tokenizer
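
As a rough illustration of aggregating several codewords into one latent token through a soft categorical posterior, the sketch below weights every codeword by a softmax over negative squared distances. The distance-based logits, the `temperature` parameter, and the function name `soft_quantize` are assumptions for illustration; the abstract does not specify how the posterior is computed, and the released code may differ.

```python
import numpy as np

def soft_quantize(z, codebook, temperature=1.0):
    """Return soft-quantized latents and the posterior over codewords.

    z:        (N, D) encoder outputs
    codebook: (K, D) codewords
    """
    # Squared Euclidean distance from every latent to every codeword: (N, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    # Assumed posterior: softmax of negative distance, scaled by temperature.
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Each token is a convex combination of codewords, so it stays continuous
    # and the whole operation is differentiable in z and in the codebook.
    return probs @ codebook, probs
```

With a very small temperature the posterior approaches one-hot and the result approaches ordinary nearest-codeword vector quantization; with a larger temperature each token blends several codewords.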

XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

Dec 02, 2024

Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Jindong Wang, Zhe Lin, Bhiksha Raj

Abstract: Image tokenizers play a critical role in shaping the performance of subsequent generative models. Since the introduction of VQ-GAN, discrete image tokenization has undergone remarkable advancements. Improvements in architecture, quantization techniques, and training recipes have significantly enhanced both image reconstruction and the downstream generation quality. In this paper, we present XQ-GAN, an image tokenization framework designed for both image reconstruction and generation tasks. Our framework integrates state-of-the-art quantization techniques, including vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), lookup-free quantization (LFQ), and binary spherical quantization (BSQ), within a highly flexible and customizable training environment. On the standard ImageNet 256x256 benchmark, our released model achieves an rFID of 0.64, significantly surpassing MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID). Furthermore, we demonstrate that using XQ-GAN as a tokenizer improves gFID metrics alongside rFID. For instance, with the same VAR architecture, XQ-GAN+VAR achieves a gFID of 2.6, outperforming VAR's 3.3 gFID by a notable margin. To support further research, we provide pre-trained weights of different image tokenizers so that the community can train subsequent generative models directly on them or fine-tune them for specialized tasks.

* Code: https://github.com/lxa9867/ImageFolder
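
Of the quantizers listed in the abstract, residual quantization (RQ) is easy to show in a few lines: each stage quantizes what the previous stages left unexplained. The sketch below is a generic textbook version in NumPy, not code from the XQ-GAN repository; the function name `residual_quantize` and the use of one separate codebook per stage are assumptions.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize z with a sequence of codebooks applied to successive residuals.

    z:         (N, D) vectors to quantize
    codebooks: list of (K, D) arrays, one per stage (assumed layout)
    Returns the (N, stages) code indices and the (N, D) reconstruction.
    """
    residual = z.astype(float)          # astype makes a copy, so z is untouched
    recon = np.zeros_like(residual)
    codes = []
    for cb in codebooks:
        # Nearest codeword to the current residual.
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)
        codes.append(idx)
        recon += cb[idx]                # accumulate the approximation
        residual -= cb[idx]             # pass on what is still unexplained
    return np.stack(codes, axis=1), recon
```

Plain vector quantization is the special case with a single stage; adding stages refines the reconstruction at the cost of more indices per vector.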

Perturbation Ontology based Graph Attention Networks

Nov 27, 2024

Yichen Wang, Jie Wang, Fulin Wang, Xiang Li, Hao Yin, Bhiksha Raj

Abstract: In recent years, graph representation learning has undergone a paradigm shift, driven by the emergence and proliferation of graph neural networks (GNNs) and their heterogeneous counterparts. Heterogeneous GNNs have shown remarkable success in extracting low-dimensional embeddings from complex graphs that encompass diverse entity types and relationships. While meta-path-based techniques have long been recognized for their ability to capture semantic affinities among nodes, their dependence on manual specification poses a significant limitation. In contrast, matrix-focused methods accelerate processing by utilizing structural cues but often overlook contextual richness. In this paper, we challenge the current paradigm by introducing ontology as a fundamental semantic primitive within complex graphs. Our goal is to integrate the strengths of both matrix-centric and meta-path-based approaches into a unified framework. We propose Perturbation Ontology-based Graph Attention Networks (POGAT), a novel methodology that combines ontology subgraphs with an advanced self-supervised learning paradigm to achieve a deep contextual understanding. The core innovation of POGAT lies in our enhanced homogeneous perturbing scheme designed to generate rigorous negative samples, encouraging the model to explore minimal contextual features more thoroughly. Through extensive empirical evaluations, we demonstrate that POGAT significantly outperforms state-of-the-art baselines, achieving a groundbreaking improvement of up to 10.78% in F1-score for the critical task of link prediction and 12.01% in Micro-F1 for the critical task of node classification.

MACE: Leveraging Audio for Evaluating Audio Captioning Systems

Nov 05, 2024

Satvik Dixit, Soham Deshmukh, Bhiksha Raj

Abstract: The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, the metrics should take into account audio events, acoustic scenes, paralinguistics, signal characteristics, and other audio information. Traditional AAC evaluation relies on natural language generation metrics like ROUGE and BLEU, image captioning metrics such as SPICE and CIDEr, or Sentence-BERT embedding similarity. However, these metrics only compare generated captions to human references, overlooking the audio signal itself. In this work, we propose MACE (Multimodal Audio-Caption Evaluation), a novel metric that integrates both audio and reference captions for comprehensive audio caption evaluation. MACE incorporates information from the audio as well as from the predicted and reference captions and weights it with a fluency penalty. Our experiments demonstrate MACE's superior performance in predicting human quality judgments compared to traditional metrics. Specifically, MACE achieves a 3.28% and 4.36% relative accuracy improvement over the FENSE metric on the AudioCaps-Eval and Clotho-Eval datasets, respectively. Moreover, it significantly outperforms all the previous metrics on the audio captioning evaluation task. The metric is open-sourced at https://github.com/satvik-dixit/mace

FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis

Oct 25, 2024

Naga VS Raviteja Chappa, Page Daniel Dobbs, Bhiksha Raj, Khoa Luu

Abstract: The proliferation of tobacco-related content on social media platforms poses significant challenges for public health monitoring and intervention. This paper introduces a novel multi-modal deep learning framework named Flow-Attention Adaptive Semantic Hierarchical Fusion (FLAASH) designed to analyze tobacco-related video content comprehensively. FLAASH addresses the complexities of integrating visual and textual information in short-form videos by leveraging a hierarchical fusion mechanism inspired by flow network theory. Our approach incorporates three key innovations: a flow-attention mechanism that captures nuanced interactions between visual and textual modalities, an adaptive weighting scheme that balances the contribution of different hierarchical levels, and a gating mechanism that selectively emphasizes relevant features. This multi-faceted approach enables FLAASH to effectively process and analyze diverse tobacco-related content, from product showcases to usage scenarios. We evaluate FLAASH on the Multimodal Tobacco Content Analysis Dataset (MTCAD), a large-scale collection of tobacco-related videos from popular social media platforms. Our results demonstrate significant improvements over existing methods, outperforming state-of-the-art approaches in classification accuracy, F1 score, and temporal consistency. The proposed method also shows strong generalization capabilities when tested on standard video question-answering datasets, surpassing current models. This work contributes to the intersection of public health and artificial intelligence, offering an effective tool for analyzing tobacco promotion in digital media.

* Under review at International Journal of Computer Vision; 20 pages, 4 figures, 5 tables

On the Diversity of Synthetic Data and its Impact on Training Large Language Models

Oct 19, 2024

Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, Marah I. Abdin

Abstract: The rise of Large Language Models (LLMs) has accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. While previous literature has focused predominantly on the quality and quantity of real data, our work enables the measurement of diversity in synthetic data and explores its impact on LLM performance. We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages by introducing a new diversity metric, LLM cluster-agent, designed to evaluate the diversity of synthetic datasets. Through a series of controlled experiments with models of 350M and 1.4B parameters, we demonstrate that the proposed cluster-based LLM scoring of diversity correlates positively with both pre-training and supervised fine-tuning performance. Our findings also reveal that synthetic data diversity in pre-training affects supervised fine-tuning more significantly than pre-training itself, even for smaller models. We hope this study advances our understanding of the optimal use of synthetic data in LLM training and opens new avenues for efficient data generation processes.

What Do Speech Foundation Models Not Learn About Speech?

Oct 16, 2024

Abdul Waheed, Hanin Atwany, Bhiksha Raj, Rita Singh

Abstract: Understanding how speech foundation models capture non-verbal cues is crucial for improving their interpretability and adaptability across diverse tasks. In our work, we analyze several prominent models, such as Whisper, Seamless, Wav2Vec, HuBERT, and Qwen2-Audio, focusing on their learned representations in both paralinguistic and non-paralinguistic tasks from the Dynamic-SUPERB benchmark. Our study addresses three key questions: (1) What non-verbal cues (e.g., speaker intent, emotion, environmental context) are captured? (2) How are these cues represented across different layers of the models? and (3) To what extent can these representations be effectively adapted to downstream tasks? To answer these questions, we first evaluate the models in a zero-shot setting, followed by fine-tuning on layer-wise features extracted from these models. Our results provide insights into the models' capacity for generalization, the characteristics of their layer-wise representations, and the degree of transformation required for downstream task adaptation. Our findings suggest that some of these models perform well on various tasks in zero-shot settings, despite not being explicitly trained for those tasks. We also observe that zero-shot performance correlates with better-learned representations. The analysis of layer-wise features demonstrates that some models exhibit a convex relationship between the separability of the learned representations and model depth, with different layers capturing task-specific features.

* 20 Pages

Improving Speaker Representations Using Contrastive Losses on Multi-scale Features

Oct 07, 2024

Satvik Dixit, Massa Baali, Rita Singh, Bhiksha Raj

Abstract: Speaker verification systems have seen significant advancements with the introduction of Multi-scale Feature Aggregation (MFA) architectures, such as MFA-Conformer and ECAPA-TDNN. These models leverage information from various network depths by concatenating intermediate feature maps before the pooling and projection layers, demonstrating that even shallower feature maps encode valuable speaker-specific information. Building upon this foundation, we propose a Multi-scale Feature Contrastive (MFCon) loss that directly enhances the quality of these intermediate representations. Our MFCon loss applies contrastive learning to all feature maps within the network, encouraging the model to learn more discriminative representations at the intermediate stage itself. By enforcing better feature map learning, we show that the resulting speaker embeddings exhibit increased discriminative power. Our method achieves a 9.05% improvement in equal error rate (EER) compared to the standard MFA-Conformer on the VoxCeleb-1O test set.
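
One way to picture a contrastive loss applied to every intermediate feature map is to compute a supervised contrastive term per layer and average the terms. The sketch below does this in NumPy. It is only an assumed formulation: the abstract does not give the exact MFCon loss, and the supervised-contrastive form, the time-pooled `(batch, dim)` inputs, the `temperature` value, and both function names are illustrative choices.

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss on one (batch, dim) feature matrix.
    Assumes at least one speaker label occurs twice in the batch."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Cosine similarities; an anchor is never compared with itself.
    sim = np.where(not_self, f @ f.T / temperature, -np.inf)
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - m - np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    # Positives: other utterances from the same speaker.
    pos = (labels[:, None] == labels[None, :]) & not_self
    n_pos = pos.sum(axis=1)
    has_pos = n_pos > 0
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1)[has_pos] / n_pos[has_pos]
    return float(per_anchor.mean())

def multi_scale_contrastive_loss(feature_maps, labels, temperature=0.1):
    """Average the contrastive term over feature maps from several depths.
    feature_maps: list of (batch, dim) arrays, assumed already pooled over time."""
    return float(np.mean([supcon_loss(f, labels, temperature) for f in feature_maps]))
```

In a training setup of this kind, such a term would typically be added to the usual speaker classification loss with a weighting factor, so that shallow layers are also pushed toward speaker-discriminative features.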

RelUNet: Relative Channel Fusion U-Net for Multichannel Speech Enhancement

Oct 07, 2024

Ibrahim Aldarmaki, Thamar Solorio, Bhiksha Raj, Hanan Aldarmaki

Abstract: Neural multi-channel speech enhancement models, in particular those based on the U-Net architecture, demonstrate promising performance and generalization potential. These models typically encode input channels independently and integrate the channels during later stages of the network. In this paper, we propose a novel modification of these models by incorporating relative information from the outset, where each channel is processed in conjunction with a reference channel through stacking. This input strategy exploits comparative differences to adaptively fuse information between channels, thereby capturing crucial spatial information and enhancing the overall performance. Experiments on the CHiME-3 dataset demonstrate improvements in speech enhancement metrics across various architectures.
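
The input strategy described above, pairing each channel with a reference channel by stacking, can be sketched as a simple array operation. The layout `(channels, time)`, the choice of channel 0 as the default reference, pairing the reference with itself, and the name `stack_with_reference` are assumptions for illustration rather than details taken from the paper.

```python
import numpy as np

def stack_with_reference(x, ref=0):
    """Pair every microphone channel with the reference channel.

    x:   (C, T) multichannel signal (or any (C, ...) feature array)
    ref: index of the reference channel (assumed to be chosen in advance)
    Returns an array of shape (C, 2, T): slot 0 holds the channel itself,
    slot 1 holds the reference channel.
    """
    ref_ch = np.broadcast_to(x[ref], x.shape)   # repeat the reference for every channel
    return np.stack([x, ref_ch], axis=1)
```

Each of the `C` two-slot pairs could then be fed to the same encoder, so the network sees the differences between a channel and the reference from its first layer instead of only at a later fusion stage.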
