Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nameer Hirschkind

REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation

Aug 07, 2025

Nameer Hirschkind, Joseph Liu, Mahesh Kumar Nandwana, Xiao Yu

Figure 1 for REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation

Figure 2 for REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation

Figure 3 for REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation

Figure 4 for REINA: Regularized Entropy Information-Based Loss for Efficient Simultaneous Speech Translation

Abstract:Simultaneous Speech Translation (SimulST) systems stream in audio while simultaneously emitting translated text or speech. Such systems face the significant challenge of balancing translation quality and latency. We introduce a strategy to optimize this tradeoff: wait for more input only if you gain information by doing so. Based on this strategy, we present Regularized Entropy INformation Adaptation (REINA), a novel loss to train an adaptive policy using an existing non-streaming translation model. We derive REINA from information theory principles and show that REINA helps push the reported Pareto frontier of the latency/quality tradeoff over prior works. Utilizing REINA, we train a SimulST model on French, Spanish and German, both from and into English. Training on only open source or synthetically generated data, we achieve state-of-the-art (SOTA) streaming results for models of comparable size. We also introduce a metric for streaming efficiency, quantitatively showing REINA improves the latency/quality trade-off by as much as 21% compared to prior approaches, normalized against non-streaming baseline BLEU scores.

Via

Access Paper or Ask Questions

Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Jun 14, 2024

Nameer Hirschkind, Xiao Yu, Mahesh Kumar Nandwana, Joseph Liu, Eloi DuBois, Dao Le, Nicolas Thiebaut, Colin Sinclair, Kyle Spence, Charles Shang(+2 more)

Figure 1 for Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Figure 2 for Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Figure 3 for Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Figure 4 for Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Abstract:We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find the diffusion-based synthesizer to improve MOS and PESQ audio quality metrics by 23\% each and speaker similarity by 5\% while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5$\times$ faster than real-time.

* Published in Interspeech 2024

Via

Access Paper or Ask Questions