Sayaka Shiota

The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels

Jan 31, 2026

Ayuto Tsutsumi, Kohei Tanaka, Sayaka Shiota

Abstract: In this paper, we propose a submission to the x-to-audio alignment (XACLE) challenge. The goal is to predict semantic alignment of a given general audio and text pair. The proposed system is based on a large audio language model (LALM) architecture. We employ a three-stage training pipeline: automated audio captioning pretraining, pretraining with CLAP pseudo-labels, and fine-tuning on the XACLE dataset. Our experiments show that pretraining with CLAP pseudo-labels is the primary performance driver. On the XACLE test set, our system reaches an SRCC of 0.632, significantly outperforming the baseline system (0.334) and securing third place in the challenge team ranking. Code and models can be found at https://github.com/shiotalab-tmu/tmu-xacle2026

* 3 pages; 2 figures; 2 tables; Accepted at ICASSP 2026 Workshop (SP Grand Challenges, GC-12: XACLE)
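
The abstract above reports results as SRCC (Spearman's rank correlation coefficient). As a rough illustration of what that metric measures, the following minimal sketch computes it as the Pearson correlation of average ranks. The function names and the toy scores are assumptions for illustration only; this is not the challenge's official scoring code.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def srcc(predicted, reference):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(reference))


# Hypothetical model scores and human alignment ratings for five pairs.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
reference = [2.0, 1.0, 4.0, 3.0, 5.0]
score = srcc(predicted, reference)
```

Because only the ordering matters, a system can reach a high SRCC even if its raw scores are on a different scale from the reference ratings.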

Speech privacy-preserving methods using secret key for convolutional neural network models and their robustness evaluation

Aug 07, 2024

Shoko Niwa, Sayaka Shiota, Hitoshi Kiya

Abstract: In this paper, we propose privacy-preserving methods with a secret key for convolutional neural network (CNN)-based models in speech processing tasks. In environments where untrusted third parties, like cloud servers, provide CNN-based systems, ensuring the privacy of speech queries becomes essential. This paper proposes encryption methods for speech queries using secret keys and a model structure that allows for encrypted queries to be accepted without decryption. Our approach introduces three types of secret keys: Shuffling, Flipping, and random orthogonal matrix (ROM). In experiments, we demonstrate that when the proposed methods are used with the correct key, identification performance does not degrade. Conversely, when an incorrect key is used, the performance decreases significantly. In particular, with the use of ROM, we show that even with a relatively small key space, high privacy-preserving performance can be maintained in many speech processing tasks. Furthermore, we also demonstrate the difficulty of recovering original speech from encrypted queries in various robustness evaluations.
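
The three key types named in the abstract above can be illustrated with a toy sketch. It assumes each key is derived from an integer seed and applied to a plain feature vector; the function names, the seed-based key derivation, and the vector-level application are all assumptions for illustration, not the authors' implementation.

```python
import math
import random


def make_shuffle_key(n, seed):
    """Shuffling key: a seed-determined permutation of n positions."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return perm


def shuffle(x, perm):
    return [x[p] for p in perm]


def unshuffle(y, perm):
    x = [0.0] * len(y)
    for i, p in enumerate(perm):
        x[p] = y[i]
    return x


def make_flip_key(n, seed):
    """Flipping key: a seed-determined vector of +1/-1 signs."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def flip(x, signs):
    # Applying the same signs twice restores the input.
    return [s * v for s, v in zip(signs, x)]


def make_rom_key(n, seed):
    """ROM key: orthonormal rows built by Gram-Schmidt on Gaussian vectors."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


x = [0.5, -1.0, 2.0, 3.5, -0.25, 1.0]  # hypothetical feature vector

perm = make_shuffle_key(len(x), seed=42)
signs = make_flip_key(len(x), seed=42)
rom = make_rom_key(len(x), seed=42)

encrypted = matvec(rom, x)
# An orthogonal matrix is inverted by its transpose.
recovered = matvec(transpose(rom), encrypted)
# Decrypting with a key from a different seed does not recover x.
wrong = matvec(transpose(make_rom_key(len(x), seed=7)), encrypted)
```

All three transforms are orthogonal, so they preserve the vector's norm while changing its coordinates, which is one reason such keys can be undone exactly by the key holder.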

YODAS: Youtube-Oriented Dataset for Audio and Speech

Jun 02, 2024

Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, Shinji Watanabe

Abstract: In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset currently comprising over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets. The labeled subsets, including manual or automatic subtitles, facilitate supervised model training. Conversely, the unlabeled subsets are apt for self-supervised learning applications. YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license. We introduce the collection methodology utilized for YODAS, which contributes to the large-scale speech dataset construction. Subsequently, we provide a comprehensive analysis of the speech and text contained within the dataset. Finally, we describe the speech recognition baselines over the top-15 languages.

* ASRU 2023

A Random Ensemble of Encrypted Vision Transformers for Adversarially Robust Defense

Feb 11, 2024

Ryota Iijima, Sayaka Shiota, Hitoshi Kiya

Abstract: Deep neural networks (DNNs) are well known to be vulnerable to adversarial examples (AEs). In previous studies, the use of models encrypted with a secret key was demonstrated to be robust against white-box attacks, but not against black-box ones. In this paper, we propose a novel method using the vision transformer (ViT) that is a random ensemble of encrypted models for enhancing robustness against both white-box and black-box attacks. In addition, a benchmark attack method, called AutoAttack, is applied to models to test adversarial robustness objectively. In experiments, the method was demonstrated to be robust against not only white-box attacks but also black-box ones in an image classification task on the CIFAR-10 and ImageNet datasets. The method was also compared with the state-of-the-art in a standardized benchmark for adversarial robustness, RobustBench, and it was verified to outperform conventional defenses in terms of clean accuracy and robust accuracy.

* 9 pages

Efficient Fine-Tuning with Domain Adaptation for Privacy-Preserving Vision Transformer

Jan 10, 2024

Teru Nagamori, Sayaka Shiota, Hitoshi Kiya

Abstract: We propose a novel method for privacy-preserving deep neural networks (DNNs) with the Vision Transformer (ViT). The method allows us not only to train and test models with visually protected images but also to avoid the performance degradation caused by the use of encrypted images, whereas conventional methods cannot avoid the influence of image encryption. A domain adaptation method is used to efficiently fine-tune ViT with encrypted images. In experiments, the method is demonstrated to outperform conventional methods in an image classification task on the CIFAR-10 and ImageNet datasets in terms of classification accuracy.

* Under review by APSIPA Transactions on Signal and Information Processing. arXiv admin note: substantial text overlap with arXiv:2309.02556

A Random Ensemble of Encrypted models for Enhancing Robustness against Adversarial Examples

Jan 05, 2024

Ryota Iijima, Sayaka Shiota, Hitoshi Kiya

Abstract: Deep neural networks (DNNs) are well known to be vulnerable to adversarial examples (AEs). In addition, AEs have adversarial transferability, which means AEs generated for a source model can fool another black-box model (target model) with a non-trivial probability. In previous studies, it was confirmed that the vision transformer (ViT) is more robust against the property of adversarial transferability than convolutional neural network (CNN) models such as ConvMixer, and moreover encrypted ViT is more robust than ViT without any encryption. In this article, we propose a random ensemble of encrypted ViT models to achieve much more robust models. In experiments, the proposed scheme is verified to be more robust against not only black-box attacks but also white-box ones than conventional methods.

* 4 pages

A privacy-preserving method using secret key for convolutional neural network-based speech classification

Oct 06, 2023

Shoko Niwa, Sayaka Shiota, Hitoshi Kiya

Abstract: In this paper, we propose a privacy-preserving method with a secret key for convolutional neural network (CNN)-based speech classification tasks. Recently, many methods related to privacy preservation have been developed in image classification research fields. In contrast, in speech classification research fields, little research has considered these risks. To promote research on privacy preservation for speech classification, we provide an encryption method with a secret key in CNN-based speech classification systems. The encryption method is based on an invertible random matrix. The encrypted speech data with a correct key can be accepted by a model with an encrypted kernel generated using the inverse of the random matrix. Although the encrypted speech data is strongly distorted, the classification tasks can be correctly performed when a correct key is provided. Additionally, in this paper, we evaluate the difficulty of reconstructing the original information from the encrypted spectrograms and waveforms. In our experiments, the proposed encryption methods are applied to automatic speech recognition (ASR) and automatic speaker verification (ASV) tasks. The results show that the encrypted data can be used in exactly the same way as the original data when a correct secret key is provided in the transformer-based ASR and x-vector-based ASV with self-supervised front-end systems. The robustness of the encrypted data against reconstruction attacks is also illustrated.

* To appear in the 31st European Signal Processing Conference (EUSIPCO 2023)
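
The idea in the abstract above, that encrypted input can be accepted by a kernel built from the inverse of the key matrix, can be sketched on a single linear layer. This is a simplified stand-in under stated assumptions: the paper works with CNN kernels on speech features, whereas the sketch below uses a small dense weight matrix, a seed-derived Gaussian key, and hypothetical function names.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def invert(m):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def make_key(n, seed):
    """Assumed key format: an n x n Gaussian matrix derived from a seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


# Hypothetical plain layer (2 outputs, 4 inputs) and input features.
weights = [[0.2, -0.5, 1.0, 0.3],
           [0.7, 0.1, -0.4, 0.9]]
x = [1.0, 2.0, -1.0, 0.5]

key = make_key(4, seed=123)
x_encrypted = matvec(key, x)                       # client side: M x
weights_encrypted = matmul(weights, invert(key))   # model side: W M^-1

plain_out = matvec(weights, x)
encrypted_out = matvec(weights_encrypted, x_encrypted)  # (W M^-1)(M x) = W x

# A kernel built from a different key gives a different output.
wrong_weights = matmul(weights, invert(make_key(4, seed=456)))
wrong_out = matvec(wrong_weights, x_encrypted)
```

With the matching key the two outputs agree up to floating-point error, so the layer never needs to see the unencrypted input.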

Domain Adaptation for Efficiently Fine-tuning Vision Transformer with Encrypted Images

Sep 07, 2023

Teru Nagamori, Sayaka Shiota, Hitoshi Kiya

Abstract: In recent years, deep neural networks (DNNs) trained with transformed data have been applied to various applications such as privacy-preserving learning, access control, and adversarial defenses. However, the use of transformed data decreases the performance of models. Accordingly, in this paper, we propose a novel method for fine-tuning models with transformed images under the use of the vision transformer (ViT). The proposed domain adaptation method does not cause the accuracy degradation of models, and it is carried out on the basis of the embedding structure of ViT. In experiments, we confirmed that the proposed method prevents accuracy degradation even when using encrypted images with the CIFAR-10 and CIFAR-100 datasets.

* Accepted by APSIPA 2023

Enhanced Security against Adversarial Examples Using a Random Ensemble of Encrypted Vision Transformer Models

Jul 26, 2023

Ryota Iijima, Miki Tanaka, Sayaka Shiota, Hitoshi Kiya

* 4 pages, 3 figures

Access Control of Semantic Segmentation Models Using Encrypted Feature Maps

Jun 11, 2022

Hiroki Ito, AprilPyone MaungMaung, Sayaka Shiota, Hitoshi Kiya

Abstract: In this paper, we propose, for the first time, an access control method with a secret key for semantic segmentation models so that unauthorized users without a secret key cannot benefit from the performance of trained models. The method enables us not only to provide a high segmentation performance to authorized users but also to degrade the performance for unauthorized users. We first point out that, for the application of semantic segmentation, conventional access control methods which use encrypted images for classification tasks are not directly applicable due to performance degradation. Accordingly, in this paper, selected feature maps are encrypted with a secret key for training and testing models, instead of input images. In an experiment, the protected models allowed authorized users to obtain almost the same performance as that of non-protected models while remaining robust against unauthorized access without a key.
