Jason Li

SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation

Oct 13, 2023
Zhehuai Chen, He Huang, Andrei Andrusenko, Oleksii Hrinchuk, Krishna C. Puvvada, Jason Li, Subhankar Ghosh, Jagadeesh Balam, Boris Ginsburg

We present a novel Speech-Augmented Language Model (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST. Moreover, speech-supervised in-context training is proposed to bridge the gap between LLM training and downstream speech tasks, further boosting the in-context learning ability of speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
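
As a rough illustration of the architecture described above, the sketch below wires a frozen text LLM, an audio encoder, and a modality adapter together in PyTorch. The module names, dimensions, and the way speech embeddings are concatenated with the embedded task instruction are assumptions for illustration (the LLM is assumed to expose a Hugging Face-style `inputs_embeds` interface); this is not NeMo's actual SALM implementation.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Projects audio-encoder features into the LLM embedding space."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_feats)

class SpeechAugmentedLM(nn.Module):
    """Frozen text LLM + audio encoder + modality adapter (LoRA layers,
    which would be attached to the LLM in practice, are omitted here)."""
    def __init__(self, llm: nn.Module, audio_encoder: nn.Module,
                 audio_dim: int, llm_dim: int):
        super().__init__()
        self.llm = llm                      # frozen text LLM
        self.audio_encoder = audio_encoder  # e.g. a Conformer encoder
        self.adapter = ModalityAdapter(audio_dim, llm_dim)
        for p in self.llm.parameters():
            p.requires_grad = False         # keep the base LLM frozen

    def forward(self, audio: torch.Tensor, prompt_embeds: torch.Tensor):
        # Encode speech, project it into the LLM space, and prepend the
        # embedded task instruction before decoding with the frozen LLM.
        speech_embeds = self.adapter(self.audio_encoder(audio))
        inputs = torch.cat([prompt_embeds, speech_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)  # assumes an HF-style signature
```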

* Submitted to ICASSP 2024 

AstroLLaMA: Towards Specialized Foundation Models in Astronomy

Sep 12, 2023
Tuan Dung Nguyen, Yuan-Sen Ting, Ioana Ciucă, Charlie O'Neill, Ze-Chang Sun, Maja Jabłońska, Sandor Kruk, Ernest Perkowski, Jack Miller, Jason Li, Josh Peek, Kartheik Iyer, Tomasz Różański, Pranav Khetarpal, Sharaf Zaman, David Brodrick, Sergio J. Rodríguez Méndez, Thang Bui, Alyssa Goodman, Alberto Accomazzi, Jill Naiman, Jesse Cranney, Kevin Schawinski, UniverseTBD

Large language models excel in many human-language tasks but often falter in highly specialized domains like scholarly astronomy. To bridge this gap, we introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv. Optimized for traditional causal language modeling, AstroLLaMA achieves a 30% lower perplexity than LLaMA-2, showing marked domain adaptation. Our model generates more insightful and scientifically relevant text completions and embeddings than state-of-the-art foundation models, despite having significantly fewer parameters. AstroLLaMA serves as a robust, domain-specific model with broad fine-tuning potential. Its public release aims to spur astronomy-focused research, including automatic paper summarization and conversational agent development.
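
The training recipe described above is standard causal-language-model fine-tuning. A hedged sketch using the Hugging Face Transformers API is shown below; the dataset file, sequence length, and hyperparameters are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch: continue causal language modelling of LLaMA-2 on astronomy abstracts.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

base = "meta-llama/Llama-2-7b-hf"            # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical file of arXiv astronomy abstracts, one abstract per line.
abstracts = load_dataset("text", data_files={"train": "astro_abstracts.txt"})
tokenized = abstracts.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="astrollama-ft", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```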

* 6 pages, 3 figures, submitted to IJCNLP-AACL 2023. Comments are welcome. The model can be found on Hugging Face - https://huggingface.co/universeTBD/astrollama 

NutritionVerse-Thin: An Optimized Strategy for Enabling Improved Rendering of 3D Thin Food Models

Apr 12, 2023
Chi-en Amy Tai, Jason Li, Sriram Kumar, Saeejith Nair, Yuhao Chen, Pengcheng Xi, Alexander Wong

With the growth in capabilities of generative models, there has been growing interest in using photo-realistic renders of common 3D food items to improve downstream tasks such as food printing, nutrition prediction, or management of food wastage. Despite 3D modelling capabilities being more accessible than ever due to the success of NeRF-based view synthesis, such rendering methods still struggle to correctly capture thin food objects, often generating meshes with significant holes. In this study, we present an optimized strategy for enabling improved rendering of thin 3D food models, and demonstrate qualitative improvements in rendering quality. Our method generates the 3D model mesh via a proposed thin-object-optimized differentiable reconstruction method and tailors the strategy at both the data collection and training stages to better handle thin objects. While simple, we find that this technique enables quick and highly consistent capture of thin 3D objects.

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

Feb 16, 2023
Shehzeen Hussain, Paarth Neekhara, Jocelyn Huang, Jason Li, Boris Ginsburg

In this work, we propose a zero-shot voice conversion method using speech representations trained with self-supervised learning. First, we develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style. To disentangle content and speaker representations, we propose a training strategy based on Siamese networks that encourages similarity between the content representations of the original and pitch-shifted audio. Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its decomposed representation. Our framework enables controllable and speaker-adaptive synthesis for zero-shot any-to-any voice conversion, achieving state-of-the-art results on metrics evaluating speaker similarity, intelligibility, and naturalness. Using just 10 seconds of data for a target speaker, our framework can perform voice swapping and achieves a speaker verification EER of 5.5% for seen speakers and 8.4% for unseen speakers.
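
The disentanglement idea can be captured by a simple Siamese-style similarity objective: the content encoder should map an utterance and its pitch-shifted copy to nearly identical representations. A minimal sketch of such a loss term is given below; the encoder interface and batch shapes are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def content_similarity_loss(content_encoder, audio, pitch_shifted_audio):
    """Encourage the content representation to be invariant to pitch shifts."""
    z_orig = content_encoder(audio)                 # (batch, time, dim)
    z_shift = content_encoder(pitch_shifted_audio)  # same utterance, pitch-shifted
    # Per-frame cosine similarity; minimizing (1 - cos) pulls the two views together.
    cos = F.cosine_similarity(z_orig, z_shift, dim=-1)
    return (1.0 - cos).mean()
```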

* Published as a conference paper at ICASSP 2023 

Modeling Human Eye Movements with Neural Networks in a Maze-Solving Task

Dec 20, 2022
Jason Li, Nicholas Watters, Yingting Wang, Hansem Sohn, Mehrdad Jazayeri

From smoothly pursuing moving objects to rapidly shifting gazes during visual search, humans employ a wide variety of eye movement strategies in different contexts. While eye movements provide a rich window into mental processes, building generative models of eye movements is notoriously difficult, and to date the computational objectives guiding eye movements remain largely a mystery. In this work, we tackled these problems in the context of a canonical spatial planning task, maze-solving. We collected eye movement data from human subjects and built deep generative models of eye movements using a novel differentiable architecture for gaze fixations and gaze shifts. We found that human eye movements are best predicted by a model that is optimized not to perform the task as efficiently as possible but instead to run an internal simulation of an object traversing the maze. This not only provides a generative model of eye movements in this task but also suggests a computational theory for how humans solve the task, namely that humans use mental simulation.

Adapting TTS models For New Speakers using Transfer Learning

Oct 12, 2021
Paarth Neekhara, Jason Li, Boris Ginsburg

Training neural text-to-speech (TTS) models for a new speaker typically requires several hours of high quality speech data. Prior works on voice cloning attempt to address this challenge by adapting pre-trained multi-speaker TTS models for a new voice, using a few minutes of speech data of the new speaker. However, publicly available large multi-speaker datasets are often noisy, thereby resulting in TTS models that are not suitable for use in products. We address this challenge by proposing transfer-learning guidelines for adapting high quality single-speaker TTS models for a new speaker, using only a few minutes of speech data. We conduct an extensive study using different amounts of data for a new speaker and evaluate the synthesized speech in terms of naturalness and voice/style similarity to the target speaker. We find that fine-tuning a single-speaker TTS model on just 30 minutes of data can yield comparable performance to a model trained from scratch on more than 27 hours of data for both male and female target speakers.
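
A minimal sketch of this adaptation recipe is given below, assuming a generic TTS model trained with a mel-spectrogram reconstruction loss; the freezing choice, learning rate, and model interface are illustrative placeholders rather than the paper's exact guidelines.

```python
import torch
import torch.nn.functional as F

def adapt_to_new_speaker(model: torch.nn.Module, new_speaker_loader,
                         freeze_text_encoder: bool = True,
                         lr: float = 1e-4, epochs: int = 50) -> torch.nn.Module:
    """Fine-tune a pre-trained single-speaker TTS model on a new speaker."""
    if freeze_text_encoder and hasattr(model, "text_encoder"):
        for p in model.text_encoder.parameters():
            p.requires_grad = False           # keep the text encoder fixed
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    model.train()
    for _ in range(epochs):
        for text, mel in new_speaker_loader:  # ~30 min of target-speaker data
            optimizer.zero_grad()
            mel_pred = model(text)            # predicted mel-spectrogram
            loss = F.l1_loss(mel_pred, mel)   # reconstruction loss
            loss.backward()
            optimizer.step()
    return model
```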

* Submitted to ICASSP 2022 

Offensive Language and Hate Speech Detection with Deep Learning and Transfer Learning

Aug 23, 2021
Bencheng Wei, Jason Li, Ajay Gupta, Hafiza Umair, Atsu Vovor, Natalie Durzynski

Toxic online speech has become a crucial problem due to the exponential increase in internet use by people from different cultures and educational backgrounds. Differentiating whether a text message constitutes hate speech or offensive language is a key challenge in the automatic detection of toxic text content. In this paper, we propose an approach to automatically classify tweets into three classes: Hate, Offensive, and Neither. Using a public tweet dataset, we first perform experiments building Bi-LSTM models with randomly initialized embeddings, and then try the same neural network architecture with pre-trained GloVe embeddings. Next, we introduce a transfer learning approach for hate speech detection using existing pre-trained language models: BERT (Bidirectional Encoder Representations from Transformers), DistilBERT (a distilled version of BERT), and GPT-2 (Generative Pre-Training). We perform a hyperparameter tuning analysis of our best model (Bi-LSTM), considering different neural network architectures, learning rates, and normalization methods. After tuning the model with the best combination of parameters, we achieve over 92 percent accuracy on the test data. We also create a class module that contains the main functionality, including text classification, sentiment checking, and text data augmentation. This model could serve as an intermediate module between users and Twitter.
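
For concreteness, the sketch below shows a Bi-LSTM classifier of the kind described, predicting the three classes Hate, Offensive, and Neither from tokenized tweets. The vocabulary size, dimensions, and randomly initialized embeddings are assumptions; the embedding layer could equally be loaded from pre-trained GloVe vectors.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 100,
                 hidden_dim: int = 128, num_classes: int = 3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)      # (batch, seq, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # hidden: (2, batch, hidden_dim)
        pooled = torch.cat([hidden[0], hidden[1]], dim=-1)  # both directions
        return self.classifier(pooled)            # logits over the 3 classes
```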

A Lightweight Algorithm to Uncover Deep Relationships in Data Tables

Sep 07, 2020
Jin Cao, Yibo Zhao, Linjun Zhang, Jason Li

Much of the data we collect today is in tabular form, with rows as records and columns as attributes associated with each record. Understanding the structural relationships in tabular data can greatly facilitate the data science process. Traditionally, much of this relational information is stored in table schema and maintained by its creators, usually domain experts. In this paper, we develop automated methods to uncover deep relationships in a single data table without expert or domain knowledge. Our method can decompose a data table into layers of smaller tables, revealing its deep structure. The key to our approach is a computationally lightweight forward addition algorithm that we developed to recursively extract the functional dependencies between table columns, and which scales to tables with many columns. With our solution, data scientists will be provided with automatically generated, data-driven insights when exploring new data sets.
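
The building block behind this search is the classical notion of a functional dependency between columns. The pandas snippet below illustrates only that dependency test, not the authors' forward addition algorithm itself; the toy table is hypothetical.

```python
import pandas as pd

def functionally_determines(df: pd.DataFrame, x_cols: list, y_col: str) -> bool:
    """Return True if the columns in x_cols functionally determine y_col,
    i.e. every distinct combination of x_cols maps to exactly one y_col value."""
    return bool((df.groupby(x_cols)[y_col].nunique(dropna=False) <= 1).all())

# Toy example: "city" determines "country", but not the other way around.
toy = pd.DataFrame({"city": ["Paris", "Paris", "Lyon", "Kyoto"],
                    "country": ["FR", "FR", "FR", "JP"]})
print(functionally_determines(toy, ["city"], "country"))   # True
print(functionally_determines(toy, ["country"], "city"))   # False
```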

* 9 pages, 4 figures, paper presented at AutoML 2019 (The Third International Workshop on Automation in Machine Learning) 

Cycle Text-To-Image GAN with BERT

Mar 26, 2020
Trevor Tsue, Samir Sen, Jason Li

We explore novel approaches to the task of generating images from their respective captions, building on state-of-the-art GAN architectures. In particular, we baseline our models against attention-based GANs that learn attention mappings from words to image features. To better capture the features of the descriptions, we then build a novel cyclic design that learns an inverse function mapping the image back to the original caption. Additionally, we incorporate recently developed pre-trained BERT word embeddings as our initial text featurizer and observe a noticeable improvement in qualitative and quantitative performance compared to the Attention GAN baseline.
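
The text-conditioning step described above can be sketched as follows: encode the caption with pre-trained BERT and feed the pooled embedding, together with a noise vector, to a generator. The toy generator here is a placeholder for illustration, not the attention-based GAN architecture used in the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

class ToyGenerator(nn.Module):
    def __init__(self, text_dim: int = 768, noise_dim: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + noise_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * 64 * 64), nn.Tanh())   # 64x64 RGB image

    def forward(self, text_embed: torch.Tensor, noise: torch.Tensor):
        out = self.net(torch.cat([text_embed, noise], dim=-1))
        return out.view(-1, 3, 64, 64)

caption = "a small yellow bird with black wings"
with torch.no_grad():
    tokens = tokenizer(caption, return_tensors="pt")
    text_embed = bert(**tokens).pooler_output   # (1, 768) caption embedding

image = ToyGenerator()(text_embed, torch.randn(1, 100))
```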
