
Israel Cohen

PCMC-T1: Free-breathing myocardial T1 mapping with Physically-Constrained Motion Correction

Aug 22, 2023
Eyal Hanania, Ilya Volovik, Lilach Barkat, Israel Cohen, Moti Freiman

T1 mapping is a quantitative magnetic resonance imaging (qMRI) technique that has emerged as a valuable tool in the diagnosis of diffuse myocardial diseases. However, prevailing approaches have relied heavily on breath-hold sequences to eliminate respiratory motion artifacts. This limitation hinders accessibility and effectiveness for patients who cannot tolerate breath-holding. Image registration can be used to enable free-breathing T1 mapping. Yet, inherent intensity differences between the different time points make the registration task challenging. We introduce PCMC-T1, a physically-constrained deep-learning model for motion correction in free-breathing T1 mapping. We incorporate the signal decay model into the network architecture to encourage physically-plausible deformations along the longitudinal relaxation axis. We compared PCMC-T1 to baseline deep-learning-based image registration approaches using a 5-fold experimental setup on a publicly available dataset of 210 patients. PCMC-T1 demonstrated superior model fitting quality (R2: 0.955) and achieved the highest clinical impact (clinical score: 3.93) compared to baseline methods (0.941, 0.946 and 3.34, 3.62 respectively). Anatomical alignment results were comparable (Dice score: 0.9835 vs. 0.984, 0.988). Our code and trained models are available at https://github.com/eyalhana/PCMC-T1.
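As a rough, hedged illustration of the physical constraint the abstract describes, the sketch below fits the common three-parameter inversion-recovery model S(TI) = A - B*exp(-TI/T1*) to a stack of motion-corrected frames and reports the mean per-pixel R2, the same fit-quality measure quoted in the results. The exact signal model and its integration into the registration network follow the paper and repository, not this snippet; the function and parameter names here are illustrative.

```python
# Hedged sketch: per-pixel three-parameter T1 recovery fit used as a
# physically-based fit-quality measure. The model S(TI) = A - B*exp(-TI/T1*)
# is the common inversion-recovery parameterization; the paper's exact
# formulation and its use inside the network may differ.
import numpy as np
from scipy.optimize import curve_fit

def t1_signal(ti, a, b, t1star):
    """Three-parameter inversion-recovery signal model."""
    return a - b * np.exp(-ti / t1star)

def mean_fit_r2(frames, inversion_times):
    """frames: (T, H, W) motion-corrected magnitude images at T inversion times.
    Returns the mean R^2 of the per-pixel model fit (higher values indicate
    deformations that are more consistent with the relaxation physics)."""
    t, h, w = frames.shape
    r2 = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            y = frames[:, i, j]
            try:
                popt, _ = curve_fit(t1_signal, inversion_times, y,
                                    p0=(y.max(), 2 * y.max(), 1000.0), maxfev=2000)
            except RuntimeError:
                continue  # leave R^2 = 0 where the fit fails
            residual = y - t1_signal(inversion_times, *popt)
            ss_res = np.sum(residual ** 2)
            ss_tot = np.sum((y - y.mean()) ** 2) + 1e-12
            r2[i, j] = 1.0 - ss_res / ss_tot
    return r2.mean()
```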

* Accepted to MICCAI 2023 

Challenges and Opportunities in Multi-device Speech Processing

Jun 27, 2022
Gregory Ciccarelli, Jarred Barber, Arun Nair, Israel Cohen, Tao Zhang

We review current solutions and technical challenges for automatic speech recognition, keyword spotting, device arbitration, speech enhancement, and source localization in multi-device home environments to provide context for the INTERSPEECH 2022 special session, "Challenges and opportunities for signal processing and machine learning for multiple smart devices". We also identify the datasets needed to support these research areas. Based on the review and our research experience in the multi-device domain, we conclude with an outlook on the future evolution

* Accepted for INTERSPEECH 2022 

Objective Metrics to Evaluate Residual-Echo Suppression During Double-Talk

Jul 15, 2021
Amir Ivry, Israel Cohen, Baruch Berdugo

Human subjective evaluation is optimal to assess speech quality for human perception. The recently introduced deep noise suppression mean opinion score (DNSMOS) metric was shown to estimate human ratings with great accuracy. The signal-to-distortion ratio (SDR) metric is widely used to evaluate residual-echo suppression (RES) systems by estimating speech quality during double-talk. However, since the SDR is affected by both speech distortion and residual-echo presence, it does not correlate well with human ratings according to the DNSMOS. To address that, we introduce two objective metrics to separately quantify the desired-speech maintained level (DSML) and residual-echo suppression level (RESL) during double-talk. These metrics are evaluated using a deep learning-based RES-system with a tunable design parameter. Using 280 hours of real and simulated recordings, we show that the DSML and RESL correlate well with the DNSMOS with high generalization to various setups. Also, we empirically investigate the relation between tuning the RES-system design parameter and the DSML-RESL tradeoff it creates and offer a practical design scheme for dynamic system requirements.
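For context, here is a minimal sketch of the standard SDR on a synthetic double-talk segment, showing how desired-speech attenuation and leftover echo fall into the same error term, which is the conflation the DSML and RESL metrics are designed to avoid. The signals and scaling below are illustrative only; the DSML and RESL definitions themselves are given in the paper and are not reproduced here.

```python
# Minimal sketch of the conventional SDR on a double-talk segment: both
# desired-speech distortion and residual echo end up in a single error term.
# Signal names and levels are illustrative, not from the paper.
import numpy as np

def sdr_db(desired, estimate):
    """SDR = 10*log10(||s||^2 / ||s - s_hat||^2)."""
    err = desired - estimate
    return 10.0 * np.log10(np.sum(desired ** 2) / (np.sum(err ** 2) + 1e-12))

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)            # near-end (desired) speech
echo_residual = 0.1 * rng.standard_normal(16000)
distortion = 0.1 * s                      # attenuation of the desired speech
s_hat = s - distortion + echo_residual    # RES output during double-talk
print(f"SDR: {sdr_db(s, s_hat):.1f} dB")  # one number for two distinct impairments
```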

* Accepted to WASPAA 

Convolutional Sparse Coding Fast Approximation with Application to Seismic Reflectivity Estimation

Jun 29, 2021
Deborah Pereg, Israel Cohen, Anthony A. Vassiliou

In sparse coding, we attempt to extract features of input vectors, assuming that the data is inherently structured as a sparse superposition of basic building blocks. Similarly, neural networks perform a given task by learning features of the training data set. Recently, both data-driven and model-driven feature-extraction methods have become extremely popular and have achieved remarkable results. Nevertheless, practical implementations are often too slow to be employed in real-life scenarios, especially for real-time applications. We propose an accelerated version of the classic iterative thresholding algorithm that produces a good approximation of the convolutional sparse code within 2-5 iterations. The speed advantage is gained mostly from the observation that most solvers are slowed down by inefficient global thresholding. The main idea is to normalize each data point by the local receptive field energy before applying a threshold. This way, the natural inclination towards strong feature expressions is suppressed, so that one can rely on a global threshold that can be easily approximated, or learned during training. The proposed algorithm can be employed with a known predetermined dictionary, or with a trained dictionary. The trained version is implemented as a neural net designed as the unfolding of the proposed solver. The performance of the proposed solution is demonstrated via the seismic inversion problem in both synthetic and real data scenarios. We also provide theoretical guarantees for stable support recovery; namely, we prove that under certain conditions the true support is perfectly recovered within the first iteration.
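The sketch below is a hedged, simplified rendering of the stated idea: an ISTA-style update for 1D convolutional sparse coding in which the update is normalized by the local receptive-field energy before a single global soft threshold is applied, run for only a few iterations. Kernel shapes, the step size, and the exact normalization are assumptions for illustration; the authors' formulation is in the paper and code.

```python
# Hedged sketch: a few normalized iterative soft-thresholding steps for 1D
# convolutional sparse coding (e.g., a seismic trace). Illustrative only.
import numpy as np

def local_energy(y, filt_len):
    """Sliding-window L2 energy of the signal over each receptive field."""
    window = np.ones(filt_len)
    return np.sqrt(np.convolve(y ** 2, window, mode="same")) + 1e-8

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def fast_csc(y, dictionary, lam=0.1, n_iter=3, step=0.5):
    """y: (N,) observed trace; dictionary: (K, L) bank of 1D atoms (wavelets)."""
    k, filt_len = dictionary.shape
    codes = np.zeros((k, len(y)))
    energy = local_energy(y, filt_len)          # receptive-field normalization
    for _ in range(n_iter):                     # 2-5 iterations typically suffice
        recon = sum(np.convolve(codes[i], dictionary[i], mode="same") for i in range(k))
        resid = y - recon
        for i in range(k):
            grad = np.convolve(resid, dictionary[i][::-1], mode="same")  # correlation
            codes[i] = soft_threshold(codes[i] + step * grad / energy, lam)
    return codes
```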


Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets

Jun 25, 2021
Amir Ivry, Baruch Berdugo, Israel Cohen

We address voice activity detection in acoustic environments of transients and stationary noises, which often occur in real-life scenarios. We exploit unique spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structure. This process is done through a deep encoder-decoder based neural network architecture. The encoder maps spectral features with temporal information to their low-dimensional representations, which are generated by applying the diffusion maps method. The encoder feeds a decoder that maps the embedded data back into the high-dimensional space. A deep neural network, which is trained to separate speech from non-speech frames, is obtained by concatenating the decoder to the encoder, resembling the known Diffusion Nets architecture. Experimental results show enhanced performance compared to competing voice activity detection methods. The improvement is achieved in accuracy, robustness, and generalization ability. Our model operates in real time and can be integrated into audio-based communication systems. We also present a batch algorithm that obtains an even higher accuracy for offline applications.
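Below is a hedged sketch of the standard diffusion-maps construction that the encoder is trained to reproduce from spectral features; the kernel bandwidth heuristic and hyperparameters are illustrative and not the paper's exact settings.

```python
# Hedged sketch of a diffusion-maps embedding of per-frame spectral features;
# the encoder in the described architecture maps features to such coordinates.
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_maps(features, n_components=3, epsilon=None, t=1):
    """features: (n_frames, n_features) spectral features with temporal context.
    Returns low-dimensional diffusion coordinates for each frame."""
    d2 = cdist(features, features, metric="sqeuclidean")
    if epsilon is None:
        epsilon = np.median(d2)                 # common bandwidth heuristic
    kernel = np.exp(-d2 / epsilon)
    p = kernel / kernel.sum(axis=1, keepdims=True)   # Markov transition matrix
    eigvals, eigvecs = np.linalg.eig(p)
    order = np.argsort(-np.abs(eigvals))
    eigvals, eigvecs = eigvals[order].real, eigvecs[:, order].real
    # Skip the trivial constant eigenvector; scale by eigenvalues^t.
    return eigvecs[:, 1:n_components + 1] * (eigvals[1:n_components + 1] ** t)
```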

* Accepted to IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 254--264, 2019 

Nonlinear Acoustic Echo Cancellation with Deep Learning

Jun 25, 2021
Amir Ivry, Israel Cohen, Baruch Berdugo

We propose a nonlinear acoustic echo cancellation system, which aims to model the echo path from the far-end signal to the near-end microphone in two parts. Inspired by the physical behavior of modern hands-free devices, we first introduce a novel neural network architecture that is specifically designed to model the nonlinear distortions these devices induce between receiving and playing the far-end signal. To account for variations between devices, we construct this network with trainable memory length and nonlinear activation functions that are not parameterized in advance, but are rather optimized during the training stage using the training data. Second, the network is succeeded by a standard adaptive linear filter that constantly tracks the echo path between the loudspeaker output and the microphone. During training, the network and filter are jointly optimized to learn the network parameters. The system requires 17 thousand parameters, consumes 500 million floating-point operations per second, and occupies 40 kilobytes of memory. It also satisfies hands-free communication timing requirements on a standard neural processor, which renders it adequate for embedding on hands-free communication devices. In experiments with 280 hours of real and synthetic data, the system shows advantageous performance compared to competing methods.
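As a hedged sketch of the second stage only, the snippet below implements a standard NLMS adaptive linear filter tracking the echo path; in the described system its input would be the output of the neural front-end with trainable memory and activations, which is not shown here, so the far-end signal is fed directly for illustration.

```python
# Hedged sketch of a standard NLMS adaptive linear echo-path filter (second
# stage). The neural nonlinear front-end from the paper is omitted; here the
# far-end signal is used directly, which is a simplification.
import numpy as np

def nlms_echo_canceler(far_end, mic, filt_len=256, mu=0.5, eps=1e-6):
    """Returns the near-end estimate mic - estimated_echo, sample by sample."""
    w = np.zeros(filt_len)
    buf = np.zeros(filt_len)
    out = np.zeros_like(mic, dtype=float)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]                       # most recent far-end sample first
        echo_hat = w @ buf
        err = mic[n] - echo_hat
        w += mu * err * buf / (buf @ buf + eps)   # normalized LMS update
        out[n] = err
    return out
```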

* Accepted to Interspeech 2021 

Deep Residual Echo Suppression with A Tunable Tradeoff Between Signal Distortion and Echo Suppression

Jun 25, 2021
Amir Ivry, Israel Cohen, Baruch Berdugo

In this paper, we propose a residual echo suppression method using a UNet neural network that directly maps the outputs of a linear acoustic echo canceler to the desired signal in the spectral domain. This system embeds a design parameter that allows a tunable tradeoff between desired-signal distortion and residual echo suppression in double-talk scenarios. The system employs 136 thousand parameters, and requires 1.6 giga floating-point operations per second and 10 megabytes of memory. The implementation satisfies both the timing requirements of the AEC challenge and the computational and memory limitations of on-device applications. Experiments are conducted with 161 hours of data from the AEC challenge database and from real independent recordings. We demonstrate the performance of the proposed system in real-life conditions and compare it with two competing methods regarding echo suppression and desired-signal distortion, generalization to various environments, and robustness to high echo levels.
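One generic way to realize such a tunable tradeoff is a weighted two-term spectral loss, sketched below under stated assumptions: one term penalizes distortion of the desired near-end speech, the other penalizes residual echo, and a parameter alpha trades them off. This is an illustration, not the paper's exact design parameter or training loss.

```python
# Hedged sketch of a tunable two-term spectral loss; alpha is an illustrative
# stand-in for the design parameter described in the abstract.
import numpy as np

def tunable_res_loss(pred_spec, desired_spec, echo_only_pred, alpha=1.0):
    """pred_spec/desired_spec: system output vs. clean near-end speech spectra on
    double-talk frames; echo_only_pred: system output on echo-only frames.
    alpha > 1 favors echo suppression, alpha < 1 favors low distortion."""
    distortion = np.mean(np.abs(pred_spec - desired_spec) ** 2)
    residual_echo = np.mean(np.abs(echo_only_pred) ** 2)
    return distortion + alpha * residual_echo
```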

* Accepted to ICASSP 2021, pp. 126--130 

Evaluation of Deep-Learning-Based Voice Activity Detectors and Room Impulse Response Models in Reverberant Environments

Jun 25, 2021
Amir Ivry, Israel Cohen, Baruch Berdugo

State-of-the-art deep-learning-based voice activity detectors (VADs) are often trained with anechoic data. However, real acoustic environments are generally reverberant, which causes the performance to significantly deteriorate. To mitigate this mismatch between training data and real data, we simulate an augmented training set that contains nearly five million utterances. This extension comprises anechoic utterances and their reverberant modifications, generated by convolving the anechoic utterances with a variety of room impulse responses (RIRs). We consider five different models to generate RIRs, and five different VADs that are trained with the augmented training set. We test all trained systems in three different real reverberant environments. Experimental results show a 20% increase on average in accuracy, precision, and recall for all detectors and response models, compared to anechoic training. Furthermore, one of the RIR models consistently yields better performance than the other models for all the tested VADs, and one of the VADs consistently outperforms the other VADs in all experiments.
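A minimal sketch of the augmentation step described above, assuming the utterance and RIR are already loaded as arrays; the scaling choice is an assumption made here so that frame-level speech/non-speech labels stay usable.

```python
# Minimal sketch: create a reverberant copy of an anechoic utterance by
# convolving it with a room impulse response. How the RIRs are generated
# (five different models in the paper) is not shown here.
import numpy as np
from scipy.signal import fftconvolve

def reverberate(anechoic, rir):
    """anechoic: (N,) clean utterance; rir: (M,) room impulse response."""
    wet = fftconvolve(anechoic, rir, mode="full")[: len(anechoic)]
    # Rescale to the original peak level so existing labels remain aligned
    # with comparable signal levels (an assumption for this sketch).
    return wet * (np.max(np.abs(anechoic)) / (np.max(np.abs(wet)) + 1e-12))
```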

* Accepted to ICASSP 2020 

Data-Driven Tree Transforms and Metrics

Aug 18, 2017
Gal Mishne, Ronen Talmon, Israel Cohen, Ronald R. Coifman, Yuval Kluger

We consider the analysis of high-dimensional data given in the form of a matrix with columns consisting of observations and rows consisting of features. Often the data is such that the observations do not reside on a regular grid, and the given order of the features is arbitrary and does not convey a notion of locality. Therefore, traditional transforms and metrics cannot be used for data organization and analysis. In this paper, our goal is to organize the data by defining an appropriate representation and metric such that they respect the smoothness and structure underlying the data. We also aim to generalize the joint clustering of observations and features to the case where the data does not fall into clear disjoint groups. For this purpose, we propose multiscale data-driven transforms and metrics based on trees. Their construction is implemented in an iterative refinement procedure that exploits the co-dependencies between features and observations. Beyond the organization of a single dataset, our approach enables us to transfer the organization learned from one dataset to another and to integrate several datasets together. We present an application to breast cancer gene expression analysis: learning metrics on the genes to cluster the tumor samples into cancer sub-types and validating the joint organization of both the genes and the samples. We demonstrate that using our approach to combine information from multiple gene expression cohorts, acquired by different profiling technologies, improves the clustering of tumor samples.
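The sketch below gives a much-simplified, hedged illustration of such an iterative refinement: a hierarchical tree on the features induces a smoothed representation of the observations, and vice versa, alternating for a few rounds. The folder-averaging "transform" and the clustering choices here are stand-ins for the paper's multiscale tree transforms and metrics.

```python
# Hedged sketch of alternating tree refinement on features and observations.
# The averaging-by-folder transform is a simplified stand-in for the paper's
# multiscale tree transforms; clustering parameters are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def tree_transform(data, axis_labels):
    """Average each row of `data` over the column folders given by axis_labels."""
    folders = np.unique(axis_labels)
    return np.stack([data[:, axis_labels == f].mean(axis=1) for f in folders], axis=1)

def refine(data, n_feature_folders=10, n_obs_folders=5, n_iters=3):
    """data: (n_observations, n_features). Alternate between a tree on the
    features and a tree on the observations, each built in the representation
    induced by the other tree."""
    feat_repr = data.T
    for _ in range(n_iters):
        feat_tree = linkage(feat_repr, method="average", metric="correlation")
        feat_labels = fcluster(feat_tree, n_feature_folders, criterion="maxclust")
        obs_repr = tree_transform(data, feat_labels)       # observations smoothed over feature folders
        obs_tree = linkage(obs_repr, method="average", metric="correlation")
        obs_labels = fcluster(obs_tree, n_obs_folders, criterion="maxclust")
        feat_repr = tree_transform(data.T, obs_labels)     # features smoothed over observation folders
    return obs_labels, feat_labels
```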

* 16 pages, 5 figures. Accepted to IEEE Transactions on Signal and Information Processing over Networks 

Kernel-based Sensor Fusion with Application to Audio-Visual Voice Activity Detection

Apr 11, 2016
David Dov, Ronen Talmon, Israel Cohen

In this paper, we address the problem of multiple-view data fusion in the presence of noise and interferences. Recent studies have approached this problem using kernel methods, relying in particular on a product of kernels constructed separately for each view. From a graph-theory point of view, we analyze this fusion approach in a discrete setting. More specifically, based on a statistical model for the connectivity between data points, we propose an algorithm for selecting the kernel bandwidth, a parameter which, as we show, has important implications for the robustness of this fusion approach to interferences. Then, we consider the fusion of audio-visual speech signals measured by a single microphone and by a video camera pointed at the face of the speaker. Specifically, we address the task of voice activity detection, i.e., the detection of speech and non-speech segments, in the presence of structured interferences such as keyboard taps and office noise. We propose an algorithm for voice activity detection based on the audio-visual signal. Simulation results show that the proposed algorithm outperforms competing fusion and voice activity detection approaches. In addition, we demonstrate that a proper selection of the kernel bandwidth indeed leads to improved performance.
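For illustration, here is a hedged sketch of the kernel-product fusion the paper builds on: one Gaussian affinity kernel per view, multiplied element-wise, then embedded via the normalized product kernel. The median-based bandwidth is a common placeholder only; the paper's contribution is a dedicated bandwidth-selection algorithm, which is not reproduced here.

```python
# Hedged sketch of kernel-product fusion of two views (e.g., audio and video
# features per frame). Bandwidth choice is a placeholder, not the paper's rule.
import numpy as np
from scipy.spatial.distance import cdist

def view_kernel(x, bandwidth=None):
    d2 = cdist(x, x, metric="sqeuclidean")
    if bandwidth is None:
        bandwidth = np.median(d2)        # placeholder; see the paper's selection algorithm
    return np.exp(-d2 / bandwidth)

def fused_embedding(audio_feats, video_feats, n_components=2):
    """audio_feats: (n_frames, d_a), video_feats: (n_frames, d_v)."""
    k = view_kernel(audio_feats) * view_kernel(video_feats)   # product of kernels
    p = k / k.sum(axis=1, keepdims=True)                      # row-stochastic normalization
    eigvals, eigvecs = np.linalg.eig(p)
    order = np.argsort(-np.abs(eigvals))
    eigvecs = eigvecs[:, order].real
    return eigvecs[:, 1:n_components + 1]                     # skip the trivial component
```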
