In this article, we present the elitist particle filter based on evolutionary strategies (EPFES) as an efficient approach for nonlinear system identification. The EPFES is derived from the frequently-employed state-space model, where the relevant information of the nonlinear system is captured by an unknown state vector. Similar to classical particle filtering, the EPFES consists of a set of particles and respective weights which represent different realizations of the latent state vector and their likelihood of being the solution of the optimization problem. As main innovation, the EPFES includes an evolutionary elitist-particle selection which combines long-term information with instantaneous sampling from an approximated continuous posterior distribution. In this article, we propose two advancements of the previously-published elitist-particle selection process. Further, the EPFES is shown to be a generalization of the widely-used Gaussian particle filter and thus evaluated with respect to the latter for two completely different scenarios: First, we consider the so-called univariate nonstationary growth model with time-variant latent state variable, where the evolutionary selection of elitist particles is evaluated for non-recursively calculated particle weights. Second, the problem of nonlinear acoustic echo cancellation is addressed in a simulated scenario with speech as input signal: By using long-term fitness measures, we highlight the efficacy of the well-generalizing EPFES in estimating the nonlinear system even for large search spaces. Finally, we illustrate similarities between the EPFES and evolutionary algorithms to outline future improvements by fusing the achievements of both fields of research.
Deep learning models are mostly used in an offline inference fashion. However, this strongly limits the use of these models inside audio generation setups, as most creative workflows are based on real-time digital signal processing. Although approaches based on recurrent networks can be naturally adapted to this buffer-based computation, the use of convolutions still poses some serious challenges. To tackle this issue, the use of causal streaming convolutions have been proposed. However, this requires specific complexified training and can impact the resulting audio quality. In this paper, we introduce a new method allowing to produce non-causal streaming models. This allows to make any convolutional model compatible with real-time buffer-based processing. As our method is based on a post-training reconfiguration of the model, we show that it is able to transform models trained without causal constraints into a streaming model. We show how our method can be adapted to fit complex architectures with parallel branches. To evaluate our method, we apply it on the recent RAVE model, which provides high-quality real-time audio synthesis. We test our approach on multiple music and speech datasets and show that it is faster than overlap-add methods, while having no impact on the generation quality. Finally, we introduce two open-source implementation of our work as Max/MSP and PureData externals, and as a VST audio plugin. This allows to endow traditional digital audio workstation with real-time neural audio synthesis on a laptop CPU.
Deep learning (DL) has proven its unprecedented success in diverse fields such as computer vision, natural language processing, and speech recognition by its strong representation ability and ease of computation. As we move forward to a thoroughly intelligent society with 6G wireless networks, new applications and use-cases have been emerging with stringent requirements for next-generation wireless communications. Therefore, recent studies have focused on the potential of DL approaches in satisfying these rigorous needs and overcoming the deficiencies of existing model-based techniques. The main objective of this article is to unveil the state-of-the-art advancements in the field of DL-based physical layer (PHY) methods to pave the way for fascinating applications of 6G. In particular, we have focused our attention on four promising PHY concepts foreseen to dominate next-generation communications, namely massive multiple-input multiple-output (MIMO) systems, sophisticated multi-carrier (MC) waveform designs, reconfigurable intelligent surface (RIS)-empowered communications, and PHY security. We examine up-to-date developments in DL-based techniques, provide comparisons with state-of-the-art methods, and introduce a comprehensive guide for future directions. We also present an overview of the underlying concepts of DL, along with the theoretical background of well-known DL techniques. Furthermore, this article provides programming examples for a number of DL techniques and the implementation of a DL-based MIMO by sharing user-friendly code snippets, which might be useful for interested readers.
Detecting and labeling stance in social media text is strongly motivated by hate speech detection, poll prediction, engagement forecasting, and concerted propaganda detection. Today's best neural stance detectors need large volumes of training data, which is difficult to curate given the fast-changing landscape of social media text and issues on which users opine. Homophily properties over the social network provide strong signal of coarse-grained user-level stance. But semi-supervised approaches for tweet-level stance detection fail to properly leverage homophily. In light of this, We present SANDS, a new semi-supervised stance detector. SANDS starts from very few labeled tweets. It builds multiple deep feature views of tweets. It also uses a distant supervision signal from the social network to provide a surrogate loss signal to the component learners. We prepare two new tweet datasets comprising over 236,000 politically tinted tweets from two demographics (US and India) posted by over 87,000 users, their follower-followee graph, and over 8,000 tweets annotated by linguists. SANDS achieves a macro-F1 score of 0.55 (0.49) on US (India)-based datasets, outperforming 17 baselines (including variants of SANDS) substantially, particularly for minority stance labels and noisy text. Numerous ablation experiments on SANDS disentangle the dynamics of textual and network-propagated stance signals.
Deep neural networks (DNNs) have the advantage that they can take into account a large number of parameters, which enables them to solve complex tasks. In computer vision and speech recognition, they have a better accuracy than common algorithms, and in some tasks, they boast an even higher accuracy than human experts. With the progress of DNNs in recent years, many other fields of application such as diagnosis of diseases and autonomous driving are taking advantage of them. The trend at DNNs is clear: The network size is growing exponentially, which leads to an exponential increase in computational effort and required memory size. For this reason, optimized hardware accelerators are used to increase the performance of the inference of neuronal networks. However, there are various neural network hardware accelerator platforms, such as graphics processing units (GPUs), application specific integrated circuits (ASICs) and field programmable gate arrays (FPGAs). Each of these platforms offer certain advantages and disadvantages. Also, there are various methods for reducing the computational effort of DNNs, which are differently suitable for each hardware accelerator. In this article an overview of existing neural network hardware accelerators and acceleration methods is given. Their strengths and weaknesses are shown and a recommendation of suitable applications is given. In particular, we focus on acceleration of the inference of convolutional neural networks (CNNs) used for image recognition tasks. Given that there exist many different hardware architectures. FPGA-based implementations are well-suited to show the effect of DNN optimization methods on accuracy and throughput. For this reason, the focus of this work is more on FPGA-based implementations.
Artificial Intelligence (AI) applications based on Deep Neural Networks (DNN) or Deep Learning (DL) have become popular due to their success in solving problems likeimage analysis and speech recognition. Training a DNN is computationally intensive and High Performance Computing(HPC) has been a key driver in AI growth. Virtualisation and container technology have led to the convergence of cloud and HPC infrastructure. These infrastructures with diverse hardware increase the complexity of deploying and optimising AI training workloads. AI training deployments in HPC or cloud can be optimised with target-specific libraries, graph compilers, andby improving data movement or IO. Graph compilers aim to optimise the execution of a DNN graph by generating an optimised code for a target hardware/backend. As part of SODALITE (a Horizon 2020 project), MODAK tool is developed to optimise application deployment in software defined infrastructures. Using input from the data scientist and performance modelling, MODAK maps optimal application parameters to a target infrastructure and builds an optimised container. In this paper, we introduce MODAK and review container technologies and graph compilers for AI. We illustrate optimisation of AI training deployments using graph compilers and Singularity containers. Evaluation using MNIST-CNN and ResNet50 training workloads shows that custom built optimised containers outperform the official images from DockerHub. We also found that the performance of graph compilers depends on the target hardware and the complexity of the neural network.
In the era of advanced technologies, mobile devices are equipped with computing and sensing capabilities that gather excessive amounts of data. These amounts of data are suitable for training different learning models. Cooperated with advancements in Deep Learning (DL), these learning models empower numerous useful applications, e.g., image processing, speech recognition, healthcare, vehicular network and many more. Traditionally, Machine Learning (ML) approaches require data to be centralised in cloud-based data-centres. However, this data is often large in quantity and privacy-sensitive which prevents logging into these data-centres for training the learning models. In turn, this results in critical issues of high latency and communication inefficiency. Recently, in light of new privacy legislations in many countries, the concept of Federated Learning (FL) has been introduced. In FL, mobile users are empowered to learn a global model by aggregating their local models, without sharing the privacy-sensitive data. Usually, these mobile users have slow network connections to the data-centre where the global model is maintained. Moreover, in a complex and large scale network, heterogeneous devices that have various energy constraints are involved. This raises the challenge of communication cost when implementing FL at large scale. To this end, in this research, we begin with the fundamentals of FL, and then, we highlight the recent FL algorithms and evaluate their communication efficiency with detailed comparisons. Furthermore, we propose a set of solutions to alleviate the existing FL problems both from communication perspective and privacy perspective.
This paper presents a novel Dialect Identification (DID) system developed for the Fifth Edition of the Multi-Genre Broadcast challenge, the task of Fine-grained Arabic Dialect Identification (MGB-5 ADI Challenge). The system improves upon traditional DNN x-vector performance by employing a Convolutional and Long Short Term Memory-Recurrent (CLSTM) architecture to combine the benefits of a convolutional neural network front-end for feature extraction and a back-end recurrent neural to capture longer temporal dependencies. Furthermore we investigate intensive augmentation of one low resource dialect in the highly unbalanced training set using time-scale modification (TSM). This converts an utterance to several time-stretched or time-compressed versions, subsequently used to train the CLSTM system without using any other corpus. In this paper, we also investigate speech augmentation using MUSAN and the RIR datasets to increase the quantity and diversity of the existing training data in the normal way. Results show firstly that the CLSTM architecture outperforms a traditional DNN x-vector implementation. Secondly, adopting TSM-based speed perturbation yields a small performance improvement for the unbalanced data, finally that traditional data augmentation techniques yield further benefit, in line with evidence from related speaker and language recognition tasks. Our system achieved 2nd place ranking out of 15 entries in the MGB-5 ADI challenge, presented at ASRU 2019.