John Hershey

I-Con: A Unifying Framework for Representation Learning

Apr 23, 2025

Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton

Abstract: As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.

* Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025); website: https://aka.ms/i-con
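The "integrated KL divergence between two conditional distributions" in the abstract above can be illustrated with a minimal sketch. It assumes the supervisory distribution p(j|i) is given as a row-stochastic table and the learned distribution q(j|i) is a softmax over negative squared distances between embeddings; the function names, the temperature parameter, and the distance-based choice of q are illustrative assumptions, not the paper's code.

```python
import math

def softmax_neighbors(points, i, temperature=1.0):
    """Assumed learned distribution q(j|i): softmax over negative squared distances."""
    logits = []
    for j, p in enumerate(points):
        if j == i:
            logits.append(-math.inf)  # a point is not treated as its own neighbor
        else:
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], p))
            logits.append(-d2 / temperature)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; q is clamped at eps to avoid log(0)."""
    return sum(pj * math.log(pj / max(qj, eps)) for pj, qj in zip(p, q) if pj > 0)

def averaged_kl_loss(supervisory, embeddings, temperature=1.0):
    """Average over anchors i of KL(p(.|i) || q(.|i))."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        total += kl(supervisory[i], softmax_neighbors(embeddings, i, temperature))
    return total / n

# Hypothetical toy data: points 0 and 1 share a class, as do points 2 and 3.
supervisory = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
clustered = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
scrambled = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]
loss_clustered = averaged_kl_loss(supervisory, clustered)
loss_scrambled = averaged_kl_loss(supervisory, scrambled)
```

Under these assumptions, embeddings that place same-class points close together give a lower loss than embeddings that mix the classes, which is the behavior a loss of this form is meant to reward.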


Unsupervised Improved MVDR Beamforming for Sound Enhancement

Jun 12, 2024

Jacob Kealey, John Hershey, François Grondin


Abstract: Neural networks have recently become the dominant approach to sound separation. Their good performance relies on large datasets of isolated recordings. For speech and music, isolated single-channel data are readily available; however, the same does not hold in the multi-channel case or for most other sound classes. Multi-channel methods have the potential to outperform single-channel approaches because they can exploit both spatial and spectral features, but the lack of training data remains a challenge. We propose unsupervised improved minimum variance distortionless response (UIMVDR), which enables multi-channel separation to leverage in-the-wild single-channel data through unsupervised training and beamforming. Results show that UIMVDR generalizes well and improves separation performance compared to supervised models, particularly in cases with limited supervised data. By using data available online, it also reduces the effort required to gather data for multi-channel approaches.

* Accepted at INTERSPEECH 2024
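For readers unfamiliar with the beamformer named in the abstract above, the following is a minimal sketch of the classical MVDR weight formula, w = R^{-1} d / (d^H R^{-1} d), for one frequency bin. It is the textbook beamformer only, not the paper's UIMVDR training procedure; the helper names, the noise covariance R, and the steering vector d are hypothetical inputs chosen for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (complex entries allowed)."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def mvdr_weights(noise_cov, steering):
    """Classical MVDR weights: w = R^{-1} d / (d^H R^{-1} d)."""
    r_inv_d = solve(noise_cov, steering)
    denom = sum(d.conjugate() * v for d, v in zip(steering, r_inv_d))
    return [v / denom for v in r_inv_d]

def beamform(weights, mic_frame):
    """Beamformer output for one time-frequency bin: y = w^H x."""
    return sum(w.conjugate() * x for w, x in zip(weights, mic_frame))

# Hypothetical two-microphone example (Hermitian, positive-definite noise covariance).
noise_cov = [[2 + 0j, 0.5j], [-0.5j, 1 + 0j]]
steering = [1 + 0j, 1j]
w = mvdr_weights(noise_cov, steering)
response_to_target = beamform(w, steering)
```

The "distortionless" constraint means a signal arriving along the steering vector passes with unit gain, so `response_to_target` should be numerically close to 1 while noise power is minimized.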


VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Oct 27, 2018

Quan Wang, Hannah Muckenhirn, Kevin Wilson, Prashant Sridhar, Zelin Wu, John Hershey, Rif A. Saurous, Ron J. Weiss, Ye Jia, Ignacio Lopez Moreno


Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) a speaker recognition network that produces speaker-discriminative embeddings; (2) a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input and produces a mask. Our system significantly reduces the speech recognition word error rate (WER) on multi-speaker signals, with minimal WER degradation on single-speaker signals.

* To be submitted to ICASSP 2019
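The data flow described in the abstract above (noisy spectrogram plus speaker embedding in, mask out, mask applied to the spectrogram) can be sketched as follows. The single linear layer with a sigmoid is a toy stand-in assumed for illustration, not the masking network from the paper, and all names, shapes, and weight values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def masking_network(noisy_frame, speaker_embedding, weights, bias):
    """Toy stand-in: one linear layer plus sigmoid on the concatenation [frame ; embedding]."""
    features = list(noisy_frame) + list(speaker_embedding)
    return [
        sigmoid(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(weights, bias)
    ]

def apply_speaker_conditioned_mask(noisy_spectrogram, speaker_embedding, weights, bias):
    """Predict a mask per frame from (frame, embedding) and multiply it into the frame."""
    enhanced = []
    for frame in noisy_spectrogram:
        mask = masking_network(frame, speaker_embedding, weights, bias)
        enhanced.append([m * x for m, x in zip(mask, frame)])
    return enhanced

# Hypothetical shapes: 2 frames x 3 frequency bins, 2-dimensional speaker embedding.
noisy = [[1.0, 0.5, 2.0], [0.2, 0.0, 1.5]]
embedding = [0.3, -0.7]
weights = [[0.1, -0.2, 0.0, 0.5, 0.4],
           [0.0, 0.3, -0.1, -0.6, 0.2],
           [0.2, 0.0, 0.1, 0.1, -0.3]]
bias = [0.0, 0.1, -0.1]
enhanced = apply_speaker_conditioned_mask(noisy, embedding, weights, bias)
```

Because the sigmoid keeps every mask value between 0 and 1, each enhanced magnitude is at most the corresponding noisy magnitude; the speaker embedding shifts which bins are kept.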
