End-to-end automatic speech recognition (ASR) can achieve promising performance with large-scale training data. However, it is known that domain mismatch between training and testing data often leads to a degradation of recognition accuracy. In this work, we focus on the unsupervised domain adaptation for ASR and propose CMatch, a Character-level distribution matching method to perform fine-grained adaptation between each character in two domains. First, to obtain labels for the features belonging to each character, we achieve frame-level label assignment using the Connectionist Temporal Classification (CTC) pseudo labels. Then, we match the character-level distributions using Maximum Mean Discrepancy. We train our algorithm using the self-training technique. Experiments on the Libri-Adapt dataset show that our proposed approach achieves 14.39% and 16.50% relative Word Error Rate (WER) reduction on both cross-device and cross-environment ASR. We also comprehensively analyze the different strategies for frame-level label assignment and Transformer adaptations.
Speech recognition systems have achieved high recognition performance for several tasks. However, the performance of such systems is dependent on the tremendously costly development work of preparing vast amounts of task-matched transcribed speech data for supervised training. The key problem here is the cost of transcribing speech data. The cost is repeatedly required to support new languages and new tasks. Assuming broad network services for transcribing speech data for many users, a system would become more self-sufficient and more useful if it possessed the ability to learn from very light feedback from the users without annoying them. In this paper, we propose a general reinforcement learning framework for speech recognition systems based on the policy gradient method. As a particular instance of the framework, we also propose a hypothesis selection-based reinforcement learning method. The proposed framework provides a new view for several existing training and adaptation methods. The experimental results show that the proposed method improves the recognition performance compared to unsupervised adaptation.