There has been an increasing interest in large speech models that can perform multiple speech processing tasks in a single model. Such models usually adopt the encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Though prior studies observed promising results of non-autoregressive models for certain tasks at small scales, it remains unclear if they can be scaled to speech-to-text generation in diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID). Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 25% relative improvement on ST, while it is more robust and 3 to 4 times faster for inference. OWSM-CTC also improves the long-form ASR result with 20x speed-up. We will publicly release our codebase, pre-trained model, and training logs to promote open science in speech foundation models.
Artificial intelligence (AI) systems have the potential to revolutionize clinical practices, including improving diagnostic accuracy and surgical decision-making, while also reducing costs and manpower. However, it is important to recognize that these systems may perpetuate social inequities or demonstrate biases, such as those based on race or gender. Such biases can occur before, during, or after the development of AI models, making it critical to understand and address potential biases to enable the accurate and reliable application of AI models in clinical settings. To mitigate bias concerns during model development, we surveyed recent publications on different debiasing methods in the fields of biomedical natural language processing (NLP) or computer vision (CV). Then we discussed the methods that have been applied in the biomedical domain to address bias. We performed our literature search on PubMed, ACM digital library, and IEEE Xplore of relevant articles published between January 2018 and December 2023 using multiple combinations of keywords. We then filtered the result of 10,041 articles automatically with loose constraints, and manually inspected the abstracts of the remaining 890 articles to identify the 55 articles included in this review. Additional articles in the references are also included in this review. We discuss each method and compare its strengths and weaknesses. Finally, we review other potential methods from the general domain that could be applied to biomedicine to address bias and improve fairness.The bias of AIs in biomedicine can originate from multiple sources. Existing debiasing methods that focus on algorithms can be categorized into distributional or algorithmic.
Recent advancements in language models have significantly enhanced performance in multiple speech-related tasks. Existing speech language models typically utilize task-dependent prompt tokens to unify various speech tasks in a single model. However, this design omits the intrinsic connections between different speech tasks, which can potentially boost the performance of each task. In this work, we propose a novel decoder-only speech language model, SpeechComposer, that can unify common speech tasks by composing a fixed set of prompt tokens. Built upon four primary tasks -- speech synthesis, speech recognition, speech language modeling, and text language modeling -- SpeechComposer can easily extend to more speech tasks via compositions of well-designed prompt tokens, like voice conversion and speech enhancement. The unification of prompt tokens also makes it possible for knowledge sharing among different speech tasks in a more structured manner. Experimental results demonstrate that our proposed SpeechComposer can improve the performance of both primary tasks and composite tasks, showing the effectiveness of the shared prompt tokens. Remarkably, the unified decoder-only model achieves a comparable and even better performance than the baselines which are expert models designed for single tasks.
Recent studies have advocated for fully open foundation models to promote transparency and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) reproduced OpenAI's Whisper using publicly available data and open-source toolkits. With the aim of reproducing Whisper, the previous OWSM v1 through v3 models were still based on Transformer, which might lead to inferior performance compared to other state-of-the-art speech encoders. In this work, we aim to improve the performance and efficiency of OWSM without extra training data. We present E-Branchformer based OWSM v3.1 models at two scales, i.e., 100M and 1B. The 1B model is the largest E-Branchformer based speech model that has been made publicly available. It outperforms the previous OWSM v3 in a vast majority of evaluation benchmarks, while demonstrating up to 25% faster inference speed. We publicly release the data preparation scripts, pre-trained models and training logs.
Purpose: Limited studies exploring concrete methods or approaches to tackle and enhance model fairness in the radiology domain. Our proposed AI model utilizes supervised contrastive learning to minimize bias in CXR diagnosis. Materials and Methods: In this retrospective study, we evaluated our proposed method on two datasets: the Medical Imaging and Data Resource Center (MIDRC) dataset with 77,887 CXR images from 27,796 patients collected as of April 20, 2023 for COVID-19 diagnosis, and the NIH Chest X-ray (NIH-CXR) dataset with 112,120 CXR images from 30,805 patients collected between 1992 and 2015. In the NIH-CXR dataset, thoracic abnormalities include atelectasis, cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, pneumothorax, consolidation, edema, emphysema, fibrosis, pleural thickening, or hernia. Our proposed method utilizes supervised contrastive learning with carefully selected positive and negative samples to generate fair image embeddings, which are fine-tuned for subsequent tasks to reduce bias in chest X-ray (CXR) diagnosis. We evaluated the methods using the marginal AUC difference ($\delta$ mAUC). Results: The proposed model showed a significant decrease in bias across all subgroups when compared to the baseline models, as evidenced by a paired T-test (p<0.0001). The $\delta$ mAUC obtained by our method were 0.0116 (95\% CI, 0.0110-0.0123), 0.2102 (95% CI, 0.2087-0.2118), and 0.1000 (95\% CI, 0.0988-0.1011) for sex, race, and age on MIDRC, and 0.0090 (95\% CI, 0.0082-0.0097) for sex and 0.0512 (95% CI, 0.0512-0.0532) for age on NIH-CXR, respectively. Conclusion: Employing supervised contrastive learning can mitigate bias in CXR diagnosis, addressing concerns of fairness and reliability in deep learning-based diagnostic methods.
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V outperforms human physicians regarding multi-choice accuracy (88.0% vs. 77.0%, p=0.034). GPT-4V also performs well in cases where physicians incorrectly answer, with over 80% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (27.3%), most prominent in image comprehension (21.6%). Regardless of GPT-4V's high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such models into clinical workflows.
Nonparametric models are of great interest in various scientific and engineering disciplines. Classical kernel methods, while numerically robust and statistically sound in low-dimensional settings, become inadequate in higher-dimensional settings due to the curse of dimensionality. In this paper, we introduce a new framework called Variance-Reduced Sketching (VRS), specifically designed to estimate density functions and nonparametric regression functions in higher dimensions with a reduced curse of dimensionality. Our framework conceptualizes multivariable functions as infinite-size matrices, and facilitates a new sketching technique motivated by numerical linear algebra literature to reduce the variance in estimation problems. We demonstrate the robust numerical performance of VRS through a series of simulated experiments and real-world data applications. Notably, VRS shows remarkable improvement over existing neural network estimators and classical kernel methods in numerous density estimation and nonparametric regression models. Additionally, we offer theoretical justifications for VRS to support its ability to deliver nonparametric estimation with a reduced curse of dimensionality.
End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or developer. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data. In addition, to improve the contextualization performance during inference further, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that the proposed method consistently improves the word error rate and the character error rate of the target phrases in the bias list on both the Librispeech-960 (English) and our in-house (Japanese) dataset, respectively.
Objectives Extraction of PICO (Populations, Interventions, Comparison, and Outcomes) entities is fundamental to evidence retrieval. We present a novel method PICOX to extract overlapping PICO entities. Materials and Methods PICOX first identifies entities by assessing whether a word marks the beginning or conclusion of an entity. Then it uses a multi-label classifier to assign one or more PICO labels to a span candidate. PICOX was evaluated using one of the best-performing baselines, EBM-NLP, and three more datasets, i.e., PICO-Corpus, and RCT publications on Alzheimer's Disease or COVID-19, using entity-level precision, recall, and F1 scores. Results PICOX achieved superior precision, recall, and F1 scores across the board, with the micro F1 score improving from 45.05 to 50.87 (p << 0.01). On the PICO-Corpus, PICOX obtained higher recall and F1 scores than the baseline and improved the micro recall score from 56.66 to 67.33. On the COVID-19 dataset, PICOX also outperformed the baseline and improved the micro F1 score from 77.10 to 80.32. On the AD dataset, PICOX demonstrated comparable F1 scores with higher precision when compared to the baseline. Conclusion PICOX excels in identifying overlapping entities and consistently surpasses a leading baseline across multiple datasets. Ablation studies reveal that its data augmentation strategy effectively minimizes false positives and improves precision.