Deep Learning (DL) is a surprisingly successful branch of machine learning. The success of DL is usually explained by focusing analysis on a particular recent algorithm and its traits. Instead, we propose that an explanation of the success of DL must look at the population of all algorithms in the field and how they have evolved over time. We argue that cultural evolution is a useful framework to explain the success of DL. In analogy to biology, we use `development' to mean the process converting the pseudocode or text description of an algorithm into a fully trained model. This includes writing the programming code, compiling and running the program, and training the model. If all parts of the process don't align well then the resultant model will be useless (if the code runs at all!). This is a constraint. A core component of evolutionary developmental biology is the concept of deconstraints -- these are modification to the developmental process that avoid complete failure by automatically accommodating changes in other components. We suggest that many important innovations in DL, from neural networks themselves to hyperparameter optimization and AutoGrad, can be seen as developmental deconstraints. These deconstraints can be very helpful to both the particular algorithm in how it handles challenges in implementation and the overall field of DL in how easy it is for new ideas to be generated. We highlight how our perspective can both advance DL and lead to new insights for evolutionary biology.
Large multi-label text classification is a challenging Natural Language Processing (NLP) problem that is concerned with text classification for datasets with thousands of labels. We tackle this problem in the legal domain, where datasets, such as JRC-Acquis and EURLEX57K labeled with the EuroVoc vocabulary were created within the legal information systems of the European Union. The EuroVoc taxonomy includes around 7000 concepts. In this work, we study the performance of various recent transformer-based models in combination with strategies such as generative pretraining, gradual unfreezing and discriminative learning rates in order to reach competitive classification performance, and present new state-of-the-art results of 0.661 (F1) for JRC-Acquis and 0.754 for EURLEX57K. Furthermore, we quantify the impact of individual steps, such as language model fine-tuning or gradual unfreezing in an ablation study, and provide reference dataset splits created with an iterative stratification algorithm.
Previous works on expressive speech synthesis mainly focus on current sentence. The context in adjacent sentences is neglected, resulting in inflexible speaking style for the same text, which lacks speech variations. In this paper, we propose a hierarchical framework to model speaking style from context. A hierarchical context encoder is proposed to explore a wider range of contextual information considering structural relationship in context, including inter-phrase and inter-sentence relations. Moreover, to encourage this encoder to learn style representation better, we introduce a novel training strategy with knowledge distillation, which provides the target for encoder training. Both objective and subjective evaluations on a Mandarin lecture dataset demonstrate that the proposed method can significantly improve the naturalness and expressiveness of the synthesized speech.
Mass utilization of body-worn cameras has led to a huge corpus of available egocentric video. Existing video summarization algorithms can accelerate browsing such videos by selecting (visually) interesting shots from them. Nonetheless, since the system user still has to watch the summary videos, browsing large video databases remain a challenge. Hence, in this work, we propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric videos. Users can read the short text to gain insight about the video, and more importantly, efficiently search through the content of a large video database using text queries. Since egocentric videos are long and contain many activities and events, using video-to-text algorithms results in thousands of descriptions, many of which are incorrect. Therefore, we propose a multi-task learning scheme to simultaneously generate descriptions for video segments and summarize the resulting descriptions in an end-to-end fashion. We Input a set of video shots and the network generates a text description for each shot. Next, visual-language content matching unit that is trained with a weakly supervised objective, identifies the correct descriptions. Finally, the last component of our network, called purport network, evaluates the descriptions all together to select the ones containing crucial information. Out of thousands of descriptions generated for the video, a few informative sentences are returned to the user. We validate our framework on the challenging UT Egocentric video dataset, where each video is between 3 to 5 hours long, associated with over 3000 textual descriptions on average. The generated textual summaries, including only 5 percent (or less) of the generated descriptions, are compared to groundtruth summaries in text domain using well-established metrics in natural language processing.
Data is fundamental to machine learning-based products and services and is considered strategic due to its externalities for businesses, governments, non-profits, and more generally for society. It is renowned that the value of organizations (businesses, government agencies and programs, and even industries) scales with the volume of available data. What is often less appreciated is that the data value in making useful organizational predictions will range widely and is prominently a function of data characteristics and underlying algorithms. In this research, our goal is to study how the value of data changes over time and how this change varies across contexts and business areas (e.g. next word prediction in the context of history, sports, politics). We focus on data from Reddit.com and compare the value's time-dependency across various Reddit topics (Subreddits). We make this comparison by measuring the rate at which user-generated text data loses its relevance to the algorithmic prediction of conversations. We show that different subreddits have different rates of relevance decline over time. Relating the text topics to various business areas of interest, we argue that competing in a business area in which data value decays rapidly alters strategies to acquire competitive advantage. When data value decays rapidly, access to a continuous flow of data will be more valuable than access to a fixed stock of data. In this kind of setting, improving user engagement and increasing user-base help creating and maintaining a competitive advantage.
We present a method for exploiting weakly annotated images to improve text extraction pipelines. The approach exploits an arbitrary existing end-to-end text recognition system to obtain text region proposals and their, possibly erroneous, transcriptions. A process that includes imprecise transcription to annotation matching and edit distance guided neighbourhood search produces nearly error-free, localised instances of scene text, which we treat as pseudo ground truth used for training. We apply the method to two weakly-annotated datasets and show that the process consistently improves the accuracy of a state of the art recognition model across different benchmark datasets (image domains) as well as providing a significant performance boost on the same dataset.
CAPTCHAs are designed to prevent malicious bot programs from abusing websites. Most online service providers deploy audio CAPTCHAs as an alternative to text and image CAPTCHAs for visually impaired users. However, prior research investigating the security of audio CAPTCHAs found them highly vulnerable to automated attacks using Automatic Speech Recognition (ASR) systems. To improve the robustness of audio CAPTCHAs against automated abuses, we present the design and implementation of an audio adversarial CAPTCHA (aaeCAPTCHA) system in this paper. The aaeCAPTCHA system exploits audio adversarial examples as CAPTCHAs to prevent the ASR systems from automatically solving them. Furthermore, we conducted a rigorous security evaluation of our new audio CAPTCHA design against five state-of-the-art DNN-based ASR systems and three commercial Speech-to-Text (STT) services. Our experimental evaluations demonstrate that aaeCAPTCHA is highly secure against these speech recognition technologies, even when the attacker has complete knowledge of the current attacks against audio adversarial examples. We also conducted a usability evaluation of the proof-of-concept implementation of the aaeCAPTCHA scheme. Our results show that it achieves high robustness at a moderate usability cost compared to normal audio CAPTCHAs. Finally, our extensive analysis highlights that aaeCAPTCHA can significantly enhance the security and robustness of traditional audio CAPTCHA systems while maintaining similar usability.
Decision forest algorithms model data by learning a binary tree structure recursively where every node splits the feature space into two regions, sending examples into the left or right branches. This ``decision'' is the result of the evaluation of a condition. For example, a node may split input data by applying a threshold to a numerical feature value. Such decisions are learned using (often greedy) algorithms that attempt to optimize a local loss function. Crucially, whether an algorithm exists to find and evaluate splits for a feature type (e.g., text) determines whether a decision forest algorithm can model that feature type at all. In this work, we set out to devise such an algorithm for textual features, thereby equipping decision forests with the ability to directly model text without the need for feature transformation. Our algorithm is efficient during training and the resulting splits are fast to evaluate with our extension of the QuickScorer inference algorithm. Experiments on benchmark text classification datasets demonstrate the utility and effectiveness of our proposal.
With the proliferation of social media, many studies resort to social media to construct datasets for developing social meaning understanding systems. For the popular case of Twitter, most researchers distribute tweet IDs without the actual text contents due to the data distribution policy of the platform. One issue is that the posts become increasingly inaccessible over time, which leads to unfair comparisons and a temporal bias in social media research. To alleviate this challenge of data decay, we leverage a paraphrase model to propose a new persistent English Twitter dataset for social meaning (PTSM). PTSM consists of $17$ social meaning datasets in $10$ categories of tasks. We experiment with two SOTA pre-trained language models and show that our PTSM can substitute the actual tweets with paraphrases with marginal performance loss.
Training a multi-speaker Text-to-Speech (TTS) model from scratch is computationally expensive and adding new speakers to the dataset requires the model to be re-trained. The naive solution of sequential fine-tuning of a model for new speakers can cause the model to have poor performance on older speakers. This phenomenon is known as catastrophic forgetting. In this paper, we look at TTS modeling from a continual learning perspective where the goal is to add new speakers without forgetting previous speakers. Therefore, we first propose an experimental setup and show that serial fine-tuning for new speakers can result in the forgetting of the previous speakers. Then we exploit two well-known techniques for continual learning namely experience replay and weight regularization and we reveal how one can mitigate the effect of degradation in speech synthesis diversity in sequential training of new speakers using these methods. Finally, we present a simple extension to improve the results in extreme setups.