In this work we have analyzed a novel concept of sequential binding based learning capable network based on the coupling of recurrent units with Bayesian prior definition. The coupling structure encodes to generate efficient tensor representations that can be decoded to generate efficient sentences and can describe certain events. These descriptions are derived from structural representations of visual features of images and media. An elaborated study of the different types of coupling recurrent structures are studied and some insights of their performance are provided. Supervised learning performance for natural language processing is judged based on statistical evaluations, however, the truth is perspective, and in this case the qualitative evaluations reveal the real capability of the different architectural strengths and variations. Bayesian prior definition of different embedding helps in better characterization of the sentences based on the natural language structure related to parts of speech and other semantic level categorization in a form which is machine interpret-able and inherits the characteristics of the Tensor Representation binding and unbinding based on the mutually orthogonality. Our approach has surpassed some of the existing basic works related to image captioning.
Energy disaggregation, also referred to as a Non-Intrusive Load Monitoring (NILM), is the task of using an aggregate energy signal, for example coming from a whole-home power monitor, to make inferences about the different individual loads of the system. In this paper, we present a novel approach based on the encoder-decoder deep learning framework with an attention mechanism for solving NILM. The attention mechanism is inspired by the temporal attention mechanism that has been recently applied to get state-of-the-art results in neural machine translation, text summarization and speech recognition. The experiments have been conducted on two publicly available datasets AMPds and UK-DALE in seen and unseen conditions. The results show that our proposed deep neural network outperforms the state-of-the-art Denoising Auto-Encoder (DAE) proposed initially by Kelly and Knottenbely (2015) and its extended and improved architecture by Bonfigli et al. (2018), in all the addressed experimental conditions. We also show that modeling attention translates into the ability to correctly detect the state change of each appliance, that is of extreme interest in the field of energy disaggregation.
A head-mounted display (HMD) is a portable and interactive display device. With the development of 5G technology, it may become a general-purpose computing platform in the future. Human-computer interaction (HCI) technology for HMDs has also been of significant interest in recent years. In addition to tracking gestures and speech, tracking human eyes as a means of interaction is highly effective. In this paper, we propose two UnityEyes-based convolutional neural network models, UEGazeNet and UEGazeNet*, which can be used for input images with low resolution and high resolution, respectively. These models can perform rapid interactions by classifying gaze trajectories (GTs), and a GTgestures dataset containing data for 10,200 "eye-painting gestures" collected from 15 individuals is established with our gaze-tracking method. We evaluated the performance both indoors and outdoors and the UEGazeNet can obtaine results 52\% and 67\% better than those of state-of-the-art networks. The generalizability of our GTgestures dataset using a variety of gaze-tracking models is evaluated, and an average recognition rate of 96.71\% is obtained by our method.
Artificial intelligence and machine learning systems have demonstrated huge improvements and human-level parity in a range of activities, including speech recognition, face recognition and speaker verification. However, these diverse tasks share a key commonality that is not true in affective computing: the ground truth information that is inferred can be unambiguously represented. This observation provides some hints as to why affective computing, despite having attracted the attention of researchers for years, may not still be considered a mature field of research. A key reason for this is the lack of a common mathematical framework to describe all the relevant elements of emotion representations. This paper proposes the AMBiguous Emotion Representation (AMBER) framework to address this deficiency. AMBER is a unified framework that explicitly describes categorical, numerical and ordinal representations of emotions, including time varying representations. In addition to explaining the core elements of AMBER, the paper also discusses how some of the commonly employed emotion representation schemes can be viewed through the AMBER framework, and concludes with a discussion of how the proposed framework can be used to reason about current and future affective computing systems.
Manually annotated corpora for low-resource languages are usually small in quantity (gold), or large but distantly supervised (silver). Inspired by recent progress of injecting pre-trained language model (LM) on many Natural Language Processing (NLP) task, we proposed to fine-tune pre-trained language model from high-resources languages to low-resources languages to improve the performance of both scenarios. Our empirical experiment demonstrates significant improvement when fine-tuning pre-trained language model in cross-lingual transfer scenarios for small gold corpus and competitive results in large silver compare to supervised cross-lingual transfer, which will be useful when there is no parallel annotation in the same task to begin. We compare our proposed method of cross-lingual transfer using pre-trained LM to different sources of transfer such as mono-lingual LM and Part-of-Speech tagging (POS) in the downstream task of both large silver and small gold NER dataset by exploiting character-level input of bi-directional language model task.
Storytelling is fundamental to language, including culture, conversation and communication in their broadest senses. It thus emerges as an essential component of intelligent systems, including systems where natural language is not a primary focus or where we do not usually think of a story being involved. In this paper we explore the emergence of storytelling as a requirement in embodied conversational agents, including its role in educational and health interventions, as well as in a general-purpose computer interface for people with disabilities or other constraints that prevent the use of traditional keyboard and speech interfaces. We further present a characterization of storytelling as an inventive fleshing out of detail according to a particular personal perspective, and propose the DREAMT model to focus attention on the different layers that need to be present in a character-driven storytelling system. Most if not all aspects of the DREAMT model have arisen from or been explored in some aspect of our implemented research systems, but currently only at a primitive and relatively unintegrated level. However, this experience leads us to formalize and elaborate the DREAMT model mnemonically as follows: - Description/Dialogue/Definition/Denotation - Realization/Representation/Role - Explanation/Education/Entertainment - Actualization/Activation - Motivation/Modelling - Topicalization/Transformation
Multilingual training has been shown to improve acoustic modeling performance by sharing and transferring knowledge in modeling different languages. Knowledge sharing is usually achieved by using common lower-level layers for different languages in a deep neural network. Recently, the domain adversarial network was proposed to reduce domain mismatch of training data and learn domain-invariant features. It is thus worth exploring whether adversarial training can further promote knowledge sharing in multilingual models. In this work, we apply the domain adversarial network to encourage the shared layers of a multilingual model to learn language-invariant features. Bidirectional Long Short-Term Memory (LSTM) recurrent neural networks (RNN) are used as building blocks. We show that shared layers learned this way contain less language identification information and lead to better performance. In an automatic speech recognition task for seven languages, the resultant acoustic model improves the word error rate (WER) of the multilingual model by 4% relative on average, and the monolingual models by 10%.
In recent years, deep neural network is introduced in recommender systems to solve the collaborative filtering problem, which has achieved immense success on computer vision, speech recognition and natural language processing. On one hand, deep neural network can be used to model the auxiliary information in recommender systems. On the other hand, it is also capable of modeling nonlinear relationships between users and items. One advantage of deep neural network is that the performance of the algorithm can be easily enhanced by augmenting the depth of the neural network. However, two potential problems may emerge when the deep neural work is exploited to model relationships between users and items. The fundamental problem is that the complexity of the algorithm grows significantly with the increment in the depth of the neural network. The second one is that a deeper neural network may undermine the accuracy of the algorithm. In order to alleviate these problems, we propose a hybrid neural network that combines heterogeneous neural networks with different structures. The experimental results on real datasets reveal that our method is superior to the state-of-the-art methods in terms of the item ranking.
Labeling of sequential data is a prevalent meta-problem for a wide range of real world applications. While the first-order Hidden Markov Models (HMM) provides a fundamental approach for unsupervised sequential labeling, the basic model does not show satisfying performance when it is directly applied to real world problems, such as part-of-speech tagging (PoS tagging) and optical character recognition (OCR). Aiming at improving performance, important extensions of HMM have been proposed in the literatures. One of the common key features in these extensions is the incorporation of proper prior information. In this paper, we propose a new extension of HMM, termed diversified Hidden Markov Models (dHMM), which utilizes a diversity-encouraging prior over the state-transition probabilities and thus facilitates more dynamic sequential labellings. Specifically, the diversity is modeled by a continuous determinantal point process prior, which we apply to both unsupervised and supervised scenarios. Learning and inference algorithms for dHMM are derived. Empirical evaluations on benchmark datasets for unsupervised PoS tagging and supervised OCR confirmed the effectiveness of dHMM, with competitive performance to the state-of-the-art.
Named entity recognition (NER) is an important task in NLP, which is all the more challenging in conversational domain with their noisy facets. Moreover, conversational texts are often available in limited amount, making supervised tasks infeasible. To learn from small data, strong inductive biases are required. Previous work relied on hand-crafted features to encode these biases until transfer learning emerges. Here, we explore a transfer learning method, namely language model pretraining, on NER task in Indonesian conversational texts. We utilize large unlabeled data (generic domain) to be transferred to conversational texts, enabling supervised training on limited in-domain data. We report two transfer learning variants, namely supervised model fine-tuning and unsupervised pretrained LM fine-tuning. Our experiments show that both variants outperform baseline neural models when trained on small data (100 sentences), yielding an absolute improvement of 32 points of test F1 score. Furthermore, we find that the pretrained LM encodes part-of-speech information which is a strong predictor for NER.