Junichi Yamagishi

LaughNet: synthesizing laughter utterances from waveform silhouettes and a single laughter example

Oct 11, 2021

Hieu-Thi Luong, Junichi Yamagishi

Abstract: Emotional and controllable speech synthesis is a topic that has received much attention. However, most studies have focused on improving expressiveness and controllability in the context of linguistic content, even though natural verbal human communication is inseparable from spontaneous non-speech expressions such as laughter, crying, or grunting. We propose a model called LaughNet for synthesizing laughter by using waveform silhouettes as inputs. The motivation is not simply synthesizing new laughter utterances but testing a novel synthesis-control paradigm that uses an abstract representation of the waveform. We conducted basic listening test experiments, and the results showed that LaughNet can synthesize laughter utterances with moderate quality and retain the characteristics of the training example. More importantly, the generated waveforms have shapes similar to the input silhouettes. For future work, we will test the same method on other types of human nonverbal expressions and integrate it into more elaborate synthesis systems.

* Submitted to ICASSP 2022

Estimating the confidence of speech spoofing countermeasure

Oct 10, 2021

Xin Wang, Junichi Yamagishi

Abstract: Conventional speech spoofing countermeasures (CMs) are designed to make a binary decision on an input trial. However, a CM trained on a closed-set database is theoretically not guaranteed to perform well on unknown spoofing attacks. In some scenarios, an alternative strategy is to let the CM defer a decision when it is not confident. The question is then how to estimate a CM's confidence regarding an input trial. We investigated a few confidence estimators that can be easily plugged into a CM. On the ASVspoof2019 logical access database, the results demonstrate that an energy-based estimator and a neural-network-based one achieved acceptable performance in identifying unknown attacks in the test set. On a test set with additional unknown attacks and bona fide trials from other databases, the confidence estimators performed moderately well, and the CMs better discriminated bona fide and spoofed trials that had a high confidence score. Additional results also revealed the difficulty in enhancing a confidence estimator by adding unknown attacks to the training set.

* Work in progress. Comments are welcome
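
The idea of deferring a decision when confidence is low can be illustrated with a minimal sketch. It assumes the commonly used energy score from the out-of-distribution detection literature (the log-sum-exp of the classifier logits), which may differ from the estimator used in the paper. The function names, the two-class logit order (bona fide first, spoof second), and the threshold value are all hypothetical.

```python
import math


def energy_score(logits, temperature=1.0):
    """Return T * logsumexp(logits / T); a higher value is read as higher confidence.

    Assumption: this is the generic energy score, not necessarily the
    paper's exact formulation.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def decide_or_defer(logits, threshold):
    """Make a binary decision only when the energy score reaches the threshold.

    Assumption: logits[0] is the bona fide class and logits[1] the spoof class.
    """
    if energy_score(logits) < threshold:
        return "defer"
    return "bonafide" if logits[0] >= logits[1] else "spoof"


if __name__ == "__main__":
    print(decide_or_defer([5.0, -5.0], threshold=1.0))
    print(decide_or_defer([0.1, -0.1], threshold=1.0))
```

In practice the threshold would be tuned on held-out data; the value used here is only a placeholder.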

On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis

Oct 04, 2021

Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian, Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David Cox (+1 more)

Abstract: Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explored several aspects of TTS pruning: amount of finetuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation and pruning. Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, with similar prosody. All of our experiments are conducted on publicly available models, and findings in this work are backed by large-scale subjective tests and objective measures. Code and 200 pruned models are made available to facilitate future research on efficiency in TTS.

DDS: A new device-degraded speech dataset for speech enhancement

Sep 28, 2021

Haoyu Li, Junichi Yamagishi

Abstract: A large and growing amount of speech content in real-life scenarios is being recorded on common consumer devices in uncontrolled environments, resulting in degraded speech quality. Transforming such low-quality device-degraded speech into high-quality speech is a goal of speech enhancement (SE). This paper introduces a new speech dataset, DDS, to facilitate research on SE. DDS provides aligned parallel recordings of high-quality speech (recorded in professional studios) and a number of versions of low-quality speech, producing approximately 2,000 hours of speech data. The DDS dataset covers 27 realistic recording conditions by combining diverse acoustic environments and microphone devices, and each version of a condition consists of multiple recordings from six different microphone positions to simulate various signal-to-noise ratio (SNR) and reverberation levels. We also test several SE baseline systems on the DDS dataset and show the impact of recording diversity on performance.

* Submitted to ICASSP 2022

Master Face Attacks on Face Recognition Systems

Sep 08, 2021

Huy H. Nguyen, Sébastien Marcel, Junichi Yamagishi, Isao Echizen

Abstract: Face authentication is now widely used, especially on mobile devices, rather than authentication using a personal identification number or an unlock pattern, due to its convenience. It has thus become a tempting target for attackers using a presentation attack. Traditional presentation attacks use facial images or videos of the victim. Previous work has proven the existence of master faces, i.e., faces that match multiple enrolled templates in face recognition systems, and their existence extends the ability of presentation attacks. In this paper, we perform an extensive study on latent variable evolution (LVE), a method commonly used to generate master faces. We run an LVE algorithm for various scenarios and with more than one database and/or face recognition system to study the properties of the master faces and to understand under which conditions strong master faces could be generated. Moreover, through analysis, we hypothesize that master faces come from some dense areas in the embedding spaces of the face recognition systems. Last but not least, simulated presentation attacks using generated master faces generally preserve the false-matching ability of their original digital forms, thus demonstrating that the existence of master faces poses an actual threat.

* This paper is an extension of the IJCB paper published in 2019 (Generating Master Faces for Use in Performing Wolf Attacks on Face Recognition Systems) and its first version was initially submitted to T-BIOM journal on Dec 25, 2020

The VoicePrivacy 2020 Challenge: Results and findings

Sep 01, 2021

Natalia Tomashenko, Xin Wang, Emmanuel Vincent, Jose Patino, Brij Mohan Lal Srivastava, Paul-Gauthier Noé, Andreas Nautsch, Nicholas Evans, Junichi Yamagishi, Benjamin O'Brien (+4 more)

Abstract: This paper presents the results and analyses stemming from the first VoicePrivacy 2020 Challenge, which focuses on developing anonymization solutions for speech technology. We provide a systematic overview of the challenge design with an analysis of submitted systems and evaluation results. In particular, we describe the voice anonymization task and datasets used for system development and evaluation. Also, we present different attack models and the associated objective and subjective evaluation metrics. We introduce two anonymization baselines and provide a summary description of the anonymization systems developed by the challenge participants. We report objective and subjective evaluation results for baseline and submitted systems. In addition, we present experimental results for alternative privacy metrics and attack models developed as a part of the post-evaluation analysis. Finally, we summarize our insights and observations that will influence the design of the next VoicePrivacy challenge edition and some directions for future voice anonymization research.

* Submitted to the Special Issue on Voice Privacy (Computer Speech and Language Journal - Elsevier); under review

ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection

Sep 01, 2021

Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans (+1 more)

Abstract: ASVspoof 2021 is the fourth edition in the series of bi-annual challenges which aim to promote the study of spoofing and the design of countermeasures to protect automatic speaker verification systems from manipulation. In addition to a continued focus upon logical and physical access tasks in which there are a number of advances compared to previous editions, ASVspoof 2021 introduces a new task involving deepfake speech detection. This paper describes all three tasks, the new databases for each of them, the evaluation metrics, four challenge baselines, the evaluation platform and a summary of challenge results. Despite the introduction of channel and compression variability which compound the difficulty, results for the logical access and deepfake tasks are close to those from previous ASVspoof editions. Results for the physical access task show the difficulty in detecting attacks in real, variable physical spaces. With ASVspoof 2021 being the first edition for which participants were not provided with any matched training or development data and with this reflecting real conditions in which the nature of spoofed and deepfake speech can never be predicted with confidence, the results are extremely encouraging and demonstrate the substantial progress made in the field in recent years.

* Accepted to the ASVspoof 2021 Workshop

ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan

Sep 01, 2021

Héctor Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Jose Patino, Md Sahidullah, Massimiliano Todisco, Xin Wang (+1 more)

Abstract: The automatic speaker verification spoofing and countermeasures (ASVspoof) challenge series is a community-led initiative which aims to promote the consideration of spoofing and the development of countermeasures. ASVspoof 2021 is the 4th in a series of bi-annual, competitive challenges where the goal is to develop countermeasures capable of discriminating between bona fide and spoofed or deepfake speech. This document provides a technical description of the ASVspoof 2021 challenge, including details of training, development and evaluation data, metrics, baselines, evaluation rules, submission procedures and the schedule.

* http://www.asvspoof.org

Benchmarking and challenges in security and privacy for voice biometrics

Sep 01, 2021

Jean-Francois Bonastre, Hector Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Paul-Gauthier Noe, Jose Patino, Md Sahidullah (+6 more)

Abstract: For many decades, research in speech technologies has focused upon improving reliability. With this now meeting user expectations for a range of diverse applications, speech technology is today omnipresent. As a result, a focus on security and privacy has now come to the fore. Here, the research effort is in its relative infancy and progress calls for greater, multidisciplinary collaboration with security, privacy, legal and ethical experts among others. Such collaboration is now underway. To help catalyse the efforts, this paper provides a high-level overview of some related research. It targets the non-speech audience and describes the benchmarking methodology that has spearheaded progress in traditional research and which now drives recent security and privacy initiatives related to voice biometrics. We describe: the ASVspoof challenge relating to the development of spoofing countermeasures; the VoicePrivacy initiative which promotes research in anonymisation for privacy preservation.

* Submitted to the symposium of the ISCA Security & Privacy in Speech Communications (SPSC) special interest group

OpenForensics: Large-Scale Challenging Dataset For Multi-Face Forgery Detection And Segmentation In-The-Wild

Jul 30, 2021

Trung-Nghia Le, Huy H. Nguyen, Junichi Yamagishi, Isao Echizen

Abstract: The proliferation of deepfake media is raising concerns among the public and relevant authorities. It has become essential to develop countermeasures against forged faces in social media. This paper presents a comprehensive study on two new countermeasure tasks: multi-face forgery detection and segmentation in-the-wild. Localizing forged faces among multiple human faces in unrestricted natural scenes is far more challenging than the traditional deepfake recognition task. To promote these new tasks, we have created the first large-scale dataset posing a high level of challenges that is designed with face-wise rich annotations explicitly for face forgery detection and segmentation, namely OpenForensics. With its rich annotations, our OpenForensics dataset has great potential for research in both deepfake prevention and general human face detection. We have also developed a suite of benchmarks for these tasks by conducting an extensive evaluation of state-of-the-art instance detection and segmentation methods on our newly constructed dataset in various scenarios. The dataset, benchmark results, code, and supplementary materials will be publicly available on our project page: https://sites.google.com/view/ltnghia/research/openforensics

* Accepted to ICCV 2021. Project page: https://sites.google.com/view/ltnghia/research/openforensics
