Radu Tudor Ionescu

RoDia: A New Dataset for Romanian Dialect Identification from Speech

Sep 12, 2023

Codrut Rotaru, Nicolae-Catalin Ristea, Radu Tudor Ionescu

Abstract: Dialect identification is a critical task in speech processing and language technology, enhancing various applications such as speech recognition, speaker verification, and many others. While most research studies have been dedicated to dialect identification in widely spoken languages, limited attention has been given to dialect identification in low-resource languages, such as Romanian. To address this research gap, we introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We publicly release our dataset and code at https://github.com/codrut2/RoDia.
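
The abstract reports both a macro and a micro F1 score. As a minimal sketch of how the two differ for single-label multi-class predictions, the following computes both from scratch. The function name `f1_scores` and the toy labels are illustrative assumptions; they are not taken from the RoDia code or data.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p gets a false positive
            fn[t] += 1  # true class t gets a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy labels (hypothetical class names, not from the dataset).
y_true = ["A", "A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "A", "B", "B", "C", "C"]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro_f1:.4f}, micro F1 = {micro_f1:.4f}")
```

In the single-label setting, micro F1 equals plain accuracy, while macro F1 drops when minority classes are handled poorly.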

CL-MAE: Curriculum-Learned Masked Autoencoders

Aug 31, 2023

Neelu Madan, Nicolae-Catalin Ristea, Kamal Nasrollahi, Thomas B. Moeslund, Radu Tudor Ionescu

Abstract: Masked image modeling has been demonstrated as a powerful pretext task for generating robust representations that can be effectively generalized across multiple downstream tasks. Typically, this approach involves randomly masking patches (tokens) in input images, with the masking strategy remaining unchanged during training. In this paper, we propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task. We conjecture that, by gradually increasing the task complexity, the model can learn more sophisticated and transferable representations. To facilitate this, we introduce a novel learnable masking module that possesses the capability to generate masks of different complexities, and integrate the proposed module into masked autoencoders (MAE). Our module is jointly trained with the MAE, while adjusting its behavior during training, transitioning from a partner to the MAE (optimizing the same reconstruction loss) to an adversary (optimizing the opposite loss), while passing through a neutral state. The transition between these behaviors is smooth, being regulated by a factor that is multiplied with the reconstruction loss of the masking module. The resulting training procedure generates an easy-to-hard curriculum. We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE. The empirical results on five downstream tasks confirm our conjecture, demonstrating that curriculum learning can be successfully used to self-supervise masked autoencoders.
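
The partner-to-adversary transition described above can be illustrated with a small sketch. The abstract only states that a factor multiplies the masking module's reconstruction loss and changes smoothly; the linear schedule, the range from +1 to -1, and the function names below are assumptions made for illustration, not the paper's actual schedule.

```python
def curriculum_factor(step, total_steps):
    """Hypothetical linear schedule: +1 (partner) -> 0 (neutral) -> -1 (adversary)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - 2.0 * progress


def masking_module_loss(reconstruction_loss, step, total_steps):
    """Scale the reconstruction loss seen by the masking module by the curriculum factor."""
    return curriculum_factor(step, total_steps) * reconstruction_loss


# With a fixed reconstruction loss of 0.5, the masking module's objective
# flips sign over training: it first helps, then opposes, the autoencoder.
for step in (0, 50, 100):
    print(step, masking_module_loss(0.5, step, total_steps=100))
```

With a positive factor, the masking module minimizes the same loss as the autoencoder and tends to produce easy masks; with a negative factor, it is rewarded for masks that are hard to reconstruct.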

JEDI: Joint Expert Distillation in a Semi-Supervised Multi-Dataset Student-Teacher Scenario for Video Action Recognition

Aug 09, 2023

Lucian Bicsi, Bogdan Alexe, Radu Tudor Ionescu, Marius Leordeanu

Abstract: We propose JEDI, a multi-dataset semi-supervised learning method, which efficiently combines knowledge from multiple experts, learned on different datasets, to train and improve the performance of individual per-dataset student models. Our approach achieves this by addressing two important problems in current machine learning research: generalization across datasets and limitations of supervised training due to scarcity of labeled data. We start with an arbitrary number of experts, pretrained on their own specific dataset, which form the initial set of student models. The teachers are immediately derived by concatenating the feature representations from the penultimate layers of the students. We then train all models in a student-teacher semi-supervised learning scenario until convergence. In our efficient approach, student-teacher training is carried out jointly and end-to-end, showing that both students and teachers improve their generalization capacity during training. We validate our approach on four video action recognition datasets. By simultaneously considering all datasets within a unified semi-supervised setting, we demonstrate significant improvements over the initial experts.

* Accepted in ICCV 2023 Workshops
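
The JEDI abstract says that teachers are derived by concatenating the penultimate-layer features of the students. A minimal sketch of that concatenation step, using plain Python lists, is given here. The function name, the feature sizes, and the values are hypothetical; the classifier heads and the semi-supervised training loop are not shown.

```python
def teacher_features(student_features):
    """Concatenate per-student penultimate feature vectors into one teacher input."""
    combined = []
    for feats in student_features:
        combined.extend(feats)
    return combined


# Hypothetical penultimate-layer features for one video from three students.
students = [[0.1, 0.2], [0.3, 0.4, 0.5], [0.6]]
teacher_input = teacher_features(students)
print(len(teacher_input), teacher_input)
```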

Reverse Stable Diffusion: What prompt was used to generate this image?

Aug 02, 2023

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah

Abstract: Text-to-image diffusion models such as Stable Diffusion have recently attracted the interest of many researchers, and inverting the diffusion process can play an important role in better understanding the generative process and how to engineer prompts in order to obtain the desired images. To this end, we introduce the new task of predicting the text prompt given an image generated by a generative diffusion model. We combine a series of white-box and black-box models (with and without access to the weights of the diffusion network) to deal with the proposed task. We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective that generates improved prompts. To further improve our method, we employ a curriculum learning procedure that promotes the learning of image-prompt pairs with lower labeling noise (i.e. that are better aligned), and an unsupervised domain-adaptive kernel learning method that uses the similarities between samples in the source and target domains as extra features. We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion. Our novel learning framework produces excellent results on the aforementioned task, yielding the highest gains when applied to the white-box model. In addition, we make an interesting discovery: training a diffusion model on the prompt generation task can make the model generate images that are much better aligned with the input prompts, when the model is directly reused for text-to-image generation.

Cascaded Cross-Modal Transformer for Request and Complaint Detection

Jul 27, 2023

Nicolae-Catalin Ristea, Radu Tudor Ionescu

Abstract: We propose a novel cascaded cross-modal transformer (CCMT) that combines speech and text transcripts to detect customer requests and complaints in phone conversations. Our approach leverages a multimodal paradigm by transcribing the speech using automatic speech recognition (ASR) models and translating the transcripts into different languages. Subsequently, we combine language-specific BERT-based models with Wav2Vec2.0 audio features in a novel cascaded cross-attention transformer model. We apply our system to the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge, reaching unweighted average recalls (UAR) of 65.41% and 85.87% for the complaint and request classes, respectively.

* Accepted at ACMMM 2023
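
The results above are reported as unweighted average recall (UAR), which is the mean of the per-class recalls. A minimal sketch of the metric follows; the function name and the toy labels are illustrative assumptions and do not come from the challenge data.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally regardless of its size."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Toy imbalanced example: 8 positives, 2 negatives (hypothetical labels).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.2f}, accuracy = {accuracy:.2f}")
```

In this toy case the accuracy is higher than the UAR because the minority class is recalled only half of the time, which is why UAR is often preferred for imbalanced classes.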

Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors

Jun 21, 2023

Nicolae-Catalin Ristea, Florinel-Alin Croitoru, Radu Tudor Ionescu, Marius Popescu, Fahad Shahbaz Khan, Mubarak Shah

Abstract: We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level. The novelty of the proposed model is threefold. First, we introduce an approach to weight tokens based on motion gradients, thus avoiding learning to reconstruct the static background scene. Second, we integrate a teacher decoder and a student decoder into our architecture, leveraging the discrepancy between the outputs given by the two decoders to improve anomaly detection. Third, we generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames (without anomalies) and the corresponding pixel-level anomaly maps. Our design leads to an efficient and effective model, as demonstrated by the extensive experiments carried out on three benchmarks: Avenue, ShanghaiTech and UCSD Ped2. The empirical results show that our model achieves an excellent trade-off between speed and accuracy, obtaining competitive AUC scores, while processing 1670 FPS. Hence, our model is between 8 and 70 times faster than competing methods. We also conduct an ablation study to justify our design.
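
To make the first idea above more concrete, here is a rough sketch of weighting tokens (patches) by motion. It uses the mean absolute difference between two consecutive frames as a stand-in for the motion gradients, which is an assumption: the abstract does not specify how the gradients are computed. The function name, the patch size, and the toy frames are hypothetical, and the frame dimensions are assumed to be divisible by the patch size.

```python
def motion_token_weights(prev_frame, next_frame, patch):
    """Weight each patch by the mean absolute intensity change between two frames.

    Assumes both frames are 2D lists of equal size whose height and width
    are divisible by `patch`.
    """
    height, width = len(prev_frame), len(prev_frame[0])
    weights = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            diff = 0.0
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    diff += abs(next_frame[y][x] - prev_frame[y][x])
            row.append(diff / (patch * patch))
        weights.append(row)
    return weights


# Toy 4x4 frames: only the top-left pixel changes between frames.
prev_frame = [[0] * 4 for _ in range(4)]
next_frame = [[0] * 4 for _ in range(4)]
next_frame[0][0] = 4
weights = motion_token_weights(prev_frame, next_frame, patch=2)
print(weights)
```

Patches covering static background receive zero weight, so a reconstruction loss scaled by these weights would ignore them.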

Class Anchor Margin Loss for Content-Based Image Retrieval

Jun 03, 2023

Alexandru Ghita, Radu Tudor Ionescu

Abstract: The performance of neural networks in content-based image retrieval (CBIR) is highly influenced by the chosen loss (objective) function. The majority of objective functions for neural models can be divided into metric learning and statistical learning. Metric learning approaches require a pair mining strategy that often lacks efficiency, while statistical learning approaches do not generate highly compact features due to their indirect feature optimization. To this end, we propose a novel repeller-attractor loss that falls in the metric learning paradigm, yet directly optimizes for the L2 metric without the need to generate pairs. Our loss consists of three components. One leading objective ensures that the learned features are attracted to each designated learnable class anchor. The second loss component regulates the anchors and forces them to be separable by a margin, while the third objective ensures that the anchors do not collapse to zero. Furthermore, we develop a more efficient two-stage retrieval system by harnessing the learned class anchors during the first stage of the retrieval process, eliminating the need to compare the query with every image in the database. We establish a set of four datasets (CIFAR-100, Food-101, SVHN, and Tiny ImageNet) and evaluate the proposed objective in the context of few-shot and full-set training on the CBIR task, by using both convolutional and transformer architectures. Compared to existing objective functions, our empirical evidence shows that the proposed objective generates superior and more consistent results.
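
The three loss components named in the abstract (attraction to class anchors, a margin between anchors, and a guard against anchors collapsing to zero) can be sketched as follows. This is only one plausible reading: the squared-distance attractor, the squared hinge terms, the `margin` and `min_norm` values, and all names are assumptions, not the formulation from the paper.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def anchor_loss_terms(features, labels, anchors, margin=2.0, min_norm=1.0):
    """Return the (attractor, repeller, norm) terms of a hypothetical anchor loss."""
    # Attractor: pull each feature toward the anchor of its own class.
    attract = sum(l2(f, anchors[y]) ** 2 for f, y in zip(features, labels)) / len(features)
    # Repeller: penalize pairs of anchors that are closer than the margin.
    repel = 0.0
    for i in range(len(anchors)):
        for j in range(i + 1, len(anchors)):
            repel += max(0.0, margin - l2(anchors[i], anchors[j])) ** 2
    # Norm term: penalize anchors whose norm falls below min_norm (collapse to zero).
    origin = [0.0] * len(anchors[0])
    norm = sum(max(0.0, min_norm - l2(a, origin)) ** 2 for a in anchors)
    return attract, repel, norm


# Well-separated anchors: only the attractor term is active.
anchors = [[2.0, 0.0], [0.0, 2.0]]
features = [[2.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
good = anchor_loss_terms(features, labels, anchors)

# Collapsed anchors: the repeller and norm terms become active.
collapsed = anchor_loss_terms(features, labels, [[0.0, 0.0], [0.0, 0.0]])
print(good, collapsed)
```

No pairs of samples are mined here: each feature is compared only with its class anchor, which matches the abstract's claim of avoiding pair generation.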

iQPP: A Benchmark for Image Query Performance Prediction

Feb 21, 2023

Eduard Poesina, Radu Tudor Ionescu, Josiane Mothe

Abstract: To date, query performance prediction (QPP) in the context of content-based image retrieval remains a largely unexplored task, especially in the query-by-example scenario, where the query is an image. To boost the exploration of the QPP task in image retrieval, we propose the first benchmark for image query performance prediction (iQPP). First, we establish a set of four data sets (PASCAL VOC 2012, Caltech-101, ROxford5k and RParis6k) and estimate the ground-truth difficulty of each query as the average precision or the precision@k, using two state-of-the-art image retrieval models. Next, we propose and evaluate novel pre-retrieval and post-retrieval query performance predictors, comparing them with existing or adapted (from text to image) predictors. The empirical results show that most predictors do not generalize across evaluation scenarios. Our comprehensive experiments indicate that iQPP is a challenging benchmark, revealing an important research gap that needs to be addressed in future work. We release our code and data as open source at https://github.com/Eduard6421/iQPP, to foster future research.
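
The ground-truth query difficulty mentioned above is measured with average precision or precision@k. A minimal sketch of both metrics over a ranked list of binary relevance labels is shown here. The function names and the toy ranking are illustrative, and the average precision variant below assumes that every relevant item appears somewhere in the ranked list.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Mean of precision values taken at the rank of each relevant item.

    Assumes all relevant items are present in the ranked list.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical ranked result list for one query: 1 = relevant, 0 = not relevant.
ranked = [1, 0, 1, 0, 0]
p_at_2 = precision_at_k(ranked, 2)
ap = average_precision(ranked)
print(f"P@2 = {p_at_2:.2f}, AP = {ap:.4f}")
```

A query with a low score under either metric is one on which the retrieval model performs poorly, which is what a query performance predictor tries to anticipate.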

FreCDo: A Large Corpus for French Cross-Domain Dialect Identification

Dec 15, 2022

Mihaela Gaman, Adrian-Gabriel Chifu, William Domingues, Radu Tudor Ionescu

Abstract: We present a novel corpus for French dialect identification comprising 413,522 French text samples collected from public news websites in Belgium, Canada, France and Switzerland. To ensure an accurate estimation of the dialect identification performance of models, we designed the corpus to eliminate potential biases related to topic, writing style, and publication source. More precisely, the training, validation and test splits are collected from different news websites, while searching for different keywords (topics). This leads to a French cross-domain (FreCDo) dialect identification task. We conduct experiments with four competitive baselines: a fine-tuned CamemBERT model, an XGBoost based on fine-tuned CamemBERT features, a Support Vector Machine (SVM) classifier based on fine-tuned CamemBERT features, and an SVM based on word n-grams. Aside from presenting quantitative results, we also analyze the most discriminative features learned by CamemBERT. Our corpus is available at https://github.com/MihaelaGaman/FreCDo.

Audiovisual Masked Autoencoders

Dec 09, 2022

Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab

Abstract: Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state of the art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
