Humans are arguably one of the most important subjects in video streams, many real-world applications such as video summarization or video editing workflows often require the automatic search and retrieval of a person of interest. Despite tremendous efforts in the person reidentification and retrieval domains, few works have developed audiovisual search strategies. In this paper, we present the Audiovisual Person Search dataset (APES), a new dataset composed of untrimmed videos whose audio (voices) and visual (faces) streams are densely annotated. APES contains over 1.9K identities labeled along 36 hours of video, making it the largest dataset available for untrimmed audiovisual person search. A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity. To showcase the potential of our new dataset, we propose an audiovisual baseline and benchmark for person retrieval. Our study shows that modeling audiovisual cues benefits the recognition of people's identities. To enable reproducibility and promote future research, the dataset annotations and baseline code are available at: https://github.com/fuankarion/audiovisual-person-search
We present FedScale, a diverse set of challenging and realistic benchmark datasets to facilitate scalable, comprehensive, and reproducible federated learning (FL) research. FedScale datasets are large-scale, encompassing a diverse range of important FL tasks, such as image classification, object detection, language modeling, speech recognition, and reinforcement learning. For each dataset, we provide a unified evaluation protocol using realistic data splits and evaluation metrics. To meet the pressing need for reproducing realistic FL at scale, we have also built an efficient evaluation platform to simplify and standardize the process of FL experimental setup and model evaluation. Our evaluation platform provides flexible APIs to implement new FL algorithms and include new execution backends with minimal developer efforts. Finally, we perform indepth benchmark experiments on these datasets. Our experiments suggest that FedScale presents significant challenges of heterogeneity-aware co-optimizations of the system and statistical efficiency under realistic FL characteristics, indicating fruitful opportunities for future research. FedScale is open-source with permissive licenses and actively maintained, and we welcome feedback and contributions from the community.
Most existing beamforming methods are frequency-domain methods, and are designed for enhancing a farfield target source over a narrow frequency band. They have found diverse applications and are still under active development. However, they struggle to achieve desired performance if the target source is in the nearfield with a broadband output. This paper proposes a time-domain nearfield frequency-invariant beamforming method. The time-domain implementation makes the beamformer output suitable for further use by real-time applications, the nearfield focusing enables the beamforming method to suppress an interference even if it is in the same direction as the target source, and the frequency-invariant beampattern makes the beamforming method suitable for enhancing the target source over a broad frequency band. These three features together make the beamforming method suitable for real-time broadband nearfield source enhancement, such as speech enhancement in room environments. The beamformer design process is separated from the sound field measurement process, and such that a designed beamformer applies to sensor arrays with various structures. The beamformer design process is further simplified by decomposing it into several independent parts. Simulation results confirm the performance of the proposed beamforming method.
Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker. Previous works have made progress on voice conversion with parallel training data and pre-known speakers. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a challenging problem. We propose a novel zero-shot voice transfer method via disentangled representation learning. The proposed method first encodes speaker-related style and voice content of each input voice into separated low-dimensional embedding spaces, and then transfers to a new voice by combining the source content embedding and target style embedding through a decoder. With information-theoretic guidance, the style and content embedding spaces are representative and (ideally) independent of each other. On real-world VCTK datasets, our method outperforms other baselines and obtains state-of-the-art results in terms of transfer accuracy and voice naturalness for voice style transfer experiments under both many-to-many and zero-shot setups.
Intent classification and slot filling are two critical tasks for natural language understanding. Traditionally the two tasks have been deemed to proceed independently. However, more recently, joint models for intent classification and slot filling have achieved state-of-the-art performance, and have proved that there exists a strong relationship between the two tasks. This article is a compilation of past work in natural language understanding, especially joint intent classification and slot filling. We observe three milestones in this research so far: Intent detection to identify the speaker's intention, slot filling to label each word token in the speech/text, and finally, joint intent classification and slot filling tasks. In this article, we describe trends, approaches, issues, data sets, evaluation metrics in intent classification and slot filling. We also discuss representative performance values, describe shared tasks, and provide pointers to future work, as given in prior works. To interpret the state-of-the-art trends, we provide multiple tables that describe and summarise past research along different dimensions, including the types of features, base approaches, and dataset domain used.
We propose dynamic curriculum learning via data parameters for noise robust keyword spotting. Data parameter learning has recently been introduced for image processing, where weight parameters, so-called data parameters, for target classes and instances are introduced and optimized along with model parameters. The data parameters scale logits and control importance over classes and instances during training, which enables automatic curriculum learning without additional annotations for training data. Similarly, in this paper, we propose using this curriculum learning approach for acoustic modeling, and train an acoustic model on clean and noisy utterances with the data parameters. The proposed approach automatically learns the difficulty of the classes and instances, e.g. due to low speech to noise ratio (SNR), in the gradient descent optimization and performs curriculum learning. This curriculum learning leads to overall improvement of the accuracy of the acoustic model. We evaluate the effectiveness of the proposed approach on a keyword spotting task. Experimental results show 7.7% relative reduction in false reject ratio with the data parameters compared to a baseline model which is simply trained on the multiconditioned dataset.
Researches on Indonesian named entity (NE) tagger have been conducted since years ago. However, most did not use deep learning and instead employed traditional machine learning algorithms such as association rule, support vector machine, random forest, na\"ive bayes, etc. In those researches, word lists as gazetteers or clue words were provided to enhance the accuracy. Here, we attempt to employ deep learning in our Indonesian NE tagger. We use long short-term memory (LSTM) as the topology since it is the state-of-the-art of NE tagger. By using LSTM, we do not need a word list in order to enhance the accuracy. Basically, there are two main things that we investigate. The first is the output layer of the network: Softmax vs conditional random field (CRF). The second is the usage of part of speech (POS) tag embedding input layer. Using 8400 sentences as the training data and 97 sentences as the evaluation data, we find that using POS tag embedding as additional input improves the performance of our Indonesian NE tagger. As for the comparison between Softmax and CRF, we find that both architectures have a weakness in classifying an NE tag.
Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources. Indeed, we observe that the square loss produces better results in the dominant majority of NLP and ASR experiments. Cross-entropy appears to have a slight edge on computer vision tasks. We argue that there is little compelling empirical or theoretical evidence indicating a clear-cut advantage to the cross-entropy loss. Indeed, in our experiments, performance on nearly all non-vision tasks can be improved, sometimes significantly, by switching to the square loss. We posit that training using the square loss for classification needs to be a part of best practices of modern deep learning on equal footing with cross-entropy.