Lior Wolf

High Fidelity Speech Regeneration with Application to Speech Enhancement

Jan 31, 2021
Adam Polyak, Lior Wolf, Yossi Adi, Ori Kabeli, Yaniv Taigman

Speech enhancement has seen great improvement in recent years mainly through contributions in denoising, speaker separation, and dereverberation methods that mostly deal with environmental effects on vocal audio. To enhance speech beyond the limitations of the original signal, we take a regeneration approach, in which we recreate the speech from its essence, including the semi-recognized speech, prosody features, and identity. We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time and which utilizes a compact speech representation, composed of ASR and identity features, to achieve a higher level of intelligibility. Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source using an auxiliary identity network. Perceptual acoustic metrics and subjective tests show that the method obtains valuable improvements over recent baselines.

HyperSeg: Patch-wise Hypernetwork for Real-time Semantic Segmentation

Dec 21, 2020
Yuval Nirkin, Lior Wolf, Tal Hassner

We present a novel, real-time, semantic segmentation network in which the encoder both encodes and generates the parameters (weights) of the decoder. Furthermore, to allow maximal adaptivity, the weights at each decoder block vary spatially. For this purpose, we design a new type of hypernetwork, composed of a nested U-Net for drawing higher level context features, a multi-headed weight generating module which generates the weights of each block in the decoder immediately before they are consumed, for efficient memory utilization, and a primary network that is composed of novel dynamic patch-wise convolutions. Despite the usage of less-conventional blocks, our architecture obtains real-time performance. In terms of the runtime vs. accuracy trade-off, we surpass state of the art (SotA) results on popular semantic segmentation benchmarks: PASCAL VOC 2012 (val. set) and real-time semantic segmentation on Cityscapes, and CamVid.

Transformer Interpretability Beyond Attention Visualization

Dec 17, 2020
Hila Chefer, Shir Gur, Lior Wolf

Self-attention techniques, and specifically Transformers, are dominating the field of text processing and are becoming increasingly popular in computer vision classification tasks. In order to visualize the parts of the image that led to a certain classification, existing methods either rely on the obtained attention maps, or employ heuristic propagation along the attention graph. In this work, we propose a novel way to compute relevancy for Transformer networks. The method assigns local relevance based on the deep Taylor decomposition principle and then propagates these relevancy scores through the layers. This propagation involves attention layers and skip connections, which challenge existing methods. Our solution is based on a specific formulation that is shown to maintain the total relevancy across layers. We benchmark our method on very recent visual Transformer networks, as well as on a text classification problem, and demonstrate a clear advantage over the existing explainability methods.

Visualization of Supervised and Self-Supervised Neural Networks via Attribution Guided Factorization

Dec 03, 2020
Shir Gur, Ameen Ali, Lior Wolf

Neural network visualization techniques mark image locations by their relevancy to the network's classification. Existing methods are effective in highlighting the regions that affect the resulting classification the most. However, as we show, these methods are limited in their ability to identify the support for alternative classifications, an effect we name the saliency bias hypothesis. In this work, we integrate two lines of research: gradient-based methods and attribution-based methods, and develop an algorithm that provides per-class explainability. The algorithm back-projects the per-pixel local influence, in a manner that is guided by the local attributions, while correcting for salient features that would otherwise bias the explanation. In an extensive battery of experiments, we demonstrate the ability of our method to produce class-specific visualizations, and not just for the predicted label. Remarkably, the method obtains state-of-the-art results in benchmarks that are commonly applied to gradient-based methods as well as in those that are employed mostly for evaluating attribution methods. Using a new unsupervised procedure, our method is also successful in demonstrating that self-supervised methods learn semantic information.

Single-Shot Freestyle Dance Reenactment

Dec 02, 2020
Oran Gafni, Oron Ashual, Lior Wolf

The task of motion transfer between a source dancer and a target person is a special case of the pose transfer problem, in which the target person changes their pose in accordance with the motions of the dancer. In this work, we propose a novel method that can reanimate a single image by arbitrary video sequences, unseen during training. The method combines three networks: (i) a segmentation-mapping network, (ii) a realistic frame-rendering network, and (iii) a face refinement network. By separating this task into three stages, we are able to attain a novel sequence of realistic frames, capturing natural motion and appearance. Our method obtains significantly better visual quality than previous methods and is able to animate diverse body types and appearances, which are captured in challenging poses, as shown in the experiments and supplementary video.

Image Animation with Perturbed Masks

Nov 18, 2020
Yoav Shalev, Lior Wolf

We present a novel approach for image-animation of a source image by a driving video, both depicting the same type of object. We do not assume the existence of pose models and our method is able to animate arbitrary objects without knowledge of the object's structure. Furthermore, both the driving video and the source image are only seen during test-time. Our method is based on a shared mask generator, which separates the foreground object from its background, and captures the object's general pose and shape. A mask-refinement module then replaces, in the mask extracted from the driver image, the identity of the driver with the identity of the source. Conditioned on the source image, the transformed mask is then decoded by a multi-scale generator that renders a realistic image, in which the content of the source frame is animated by the pose in the driving video. Due to the lack of fully supervised data, we train on the task of reconstructing frames from the same video the source image is taken from. In order to control the source of the identity of the output frame, we employ perturbations during training that remove the unwanted identity information. Our method is shown to greatly outperform the state-of-the-art methods on multiple benchmarks. Our code and samples are available at https://github.com/itsyoavshalev/Image-Animation-with-Perturbed-Masks

Single channel voice separation for unknown number of speakers under reverberant and noisy settings

Nov 04, 2020
Shlomo E. Chazan, Lior Wolf, Eliya Nachmani, Yossi Adi

We present a unified network for voice separation of an unknown number of speakers. The proposed approach is composed of several separation heads optimized together with a speaker classification branch. The separation is carried out in the time domain, together with parameter sharing between all separation heads. The classification branch estimates the number of speakers while each head is specialized in separating a different number of speakers. We evaluate the proposed model under both clean and noisy reverberant settings. Results suggest that the proposed approach is superior to the baseline model by a significant margin. Additionally, we present a new noisy and reverberant dataset of up to five different speakers speaking simultaneously.
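
The interplay between the classification branch and the separation heads can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say how the head is chosen at inference time, so the sketch simply routes the mixture to the head whose speaker count has the highest classifier probability. The names `select_head`, `separate`, and `make_placeholder_head` are hypothetical, and the placeholder heads perform no real separation.

```python
def select_head(count_probs, candidate_counts):
    """Return the speaker count with the highest classifier probability."""
    best = max(range(len(count_probs)), key=lambda i: count_probs[i])
    return candidate_counts[best]


def separate(mixture, count_probs, heads):
    """Route the mixture to the head matching the estimated speaker count.

    `heads` maps a speaker count to a callable; `count_probs` holds one
    probability per count, ordered by ascending speaker count.
    """
    counts = sorted(heads)
    n_speakers = select_head(count_probs, counts)
    return heads[n_speakers](mixture)


def make_placeholder_head(n_speakers):
    """Hypothetical stand-in for a trained head: splits the signal evenly."""
    def head(mixture):
        return [[s / n_speakers for s in mixture] for _ in range(n_speakers)]
    return head


# Assumed range of two to five speakers, one head per count.
heads = {n: make_placeholder_head(n) for n in (2, 3, 4, 5)}
```

For example, `separate([0.3, -0.6, 0.9], [0.1, 0.7, 0.1, 0.1], heads)` dispatches to the three-speaker head and returns three channels.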

Generating Correct Answers for Progressive Matrices Intelligence Tests

Nov 01, 2020
Niv Pekar, Yaniv Benny, Lior Wolf

Raven's Progressive Matrices are multiple-choice intelligence tests, where one tries to complete the missing location in a $3\times 3$ grid of abstract images. Previous attempts to address this test have focused solely on selecting the right answer out of the multiple choices. In this work, we focus, instead, on generating a correct answer given the grid, without seeing the choices, which is a harder task, by definition. The proposed neural model combines multiple advances in generative models, including employing multiple pathways through the same network, using the reparameterization trick along two pathways to make their encoding compatible, a dynamic application of variational losses, and a complex perceptual loss that is coupled with a selective backpropagation procedure. Our algorithm is able not only to generate a set of plausible answers, but also to be competitive with state-of-the-art methods in multiple-choice tests.

* To appear in the 34th Conference on Neural Information Processing Systems (NeurIPS 2020)

Permuted AdaIN: Enhancing the Representation of Local Cues in Image Classifiers

Oct 09, 2020
Oren Nuriel, Sagie Benaim, Lior Wolf

Recent work has shown that convolutional neural network classifiers overly rely on texture at the expense of shape cues, which adversely affects the classifier's performance in shifted domains. In this work, we make a similar but different distinction between local image cues, including shape and texture, and global image statistics. We provide a method that enhances the representation of local cues in the hidden layers of image classifiers. Our method, called Permuted Adaptive Instance Normalization (pAdaIN), samples a random permutation $\pi$ that rearranges the samples in a given batch. Adaptive Instance Normalization (AdaIN) is then applied between the activations of each (non-permuted) sample $i$ and the corresponding activations of the sample $\pi(i)$, thus swapping statistics between the samples of the batch. Since the global image statistics are distorted, this swapping procedure causes the network to rely on the local image cues. By choosing the random permutation with probability $p$ and the identity permutation otherwise, one can control the strength of this effect. With the correct choice of $p$, selected without considering the test data, our method consistently outperforms baseline methods in image classification, as well as in the setting of domain generalization.

* 8 pages, 3 figures
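
The pAdaIN procedure described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: activations are represented as nested lists (sample, then channel, then flattened spatial values), statistics are computed per channel, and `EPS`, `channel_stats`, `adain`, and `permuted_adain` are hypothetical names chosen here.

```python
import random
import statistics

EPS = 1e-5  # assumed stabiliser to avoid division by zero


def channel_stats(channel):
    """Mean and standard deviation of one channel's flattened activations."""
    mean = statistics.fmean(channel)
    std = (statistics.pvariance(channel, mu=mean) + EPS) ** 0.5
    return mean, std


def adain(content, style):
    """Re-normalise each channel of `content` to the mean/std of `style`."""
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = channel_stats(c_chan)
        s_mean, s_std = channel_stats(s_chan)
        out.append([s_std * (v - c_mean) / c_std + s_mean for v in c_chan])
    return out


def permuted_adain(batch, p, rng=random):
    """With probability p, swap channel statistics along a random permutation."""
    if rng.random() >= p:
        return batch  # identity permutation: activations are left unchanged
    perm = list(range(len(batch)))
    rng.shuffle(perm)
    # Sample i keeps its normalised content but takes the statistics of perm[i].
    return [adain(batch[i], batch[perm[i]]) for i in range(len(batch))]
```

Calling `permuted_adain(batch, p=0.0)` always returns the batch untouched, while `p=1.0` always applies a random permutation, which matches the role of $p$ as a knob for the strength of the effect.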
