Existing works on disentangled representation learning usually rely on a common assumption: all factors in disentangled representations should be independent. This assumption concerns only the inner property of disentangled representations and ignores their relation with external data. To tackle this problem, we propose another assumption that establishes a relation between data and its disentangled representations via mutual information: the mutual information between each factor of disentangled representations and data should be invariant to other factors. We formulate this assumption mathematically and theoretically bridge it with independence and conditional independence of factors. Meanwhile, we show that conditional independence is satisfied in encoders of VAEs due to factorized noise in reparameterization. To highlight the importance of our proposed assumption, we show in experiments that violating the assumption leads to a dramatic decline in disentanglement. Based on this assumption, we further propose to split the deeper layers of the encoder so that parameters in these layers are not shared across factors. The proposed encoder, called Split Encoder, can be applied to models that penalize total correlation, and shows significant improvement in unsupervised learning of disentangled representations and reconstructions.
We study the problems of offline and online contextual optimization with feedback information, where instead of observing the loss, we observe, after the fact, the optimal action an oracle with full knowledge of the objective function would have taken. We aim to minimize regret, which is defined as the difference between our losses and the ones incurred by an all-knowing oracle. In the offline setting, the decision-maker has information available from past periods and needs to make one decision, while in the online setting, the decision-maker optimizes decisions dynamically over time based on a new set of feasible actions and contextual functions in each period. For the offline setting, we characterize the optimal minimax policy, establishing the performance that can be achieved as a function of the underlying geometry of the information induced by the data. In the online setting, we leverage this geometric characterization to optimize the cumulative regret. We develop an algorithm that yields the first regret bound for this problem that is logarithmic in the time horizon.
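The regret notion used above can be illustrated with a minimal sketch. It only accumulates the per-period gap between the decision-maker's loss and the oracle's loss, as the abstract defines regret; it does not implement the paper's algorithm. The function name `cumulative_regret` and the sample loss values are hypothetical, chosen for illustration.

```python
def cumulative_regret(our_losses, oracle_losses):
    """Return the running cumulative regret after each period.

    Regret in a period is our loss minus the loss of an all-knowing
    oracle in that same period (hypothetical helper, not from the paper).
    """
    if len(our_losses) != len(oracle_losses):
        raise ValueError("loss sequences must have the same length")
    running = []
    total = 0.0
    for ours, best in zip(our_losses, oracle_losses):
        total += ours - best  # per-period regret
        running.append(total)
    return running
```

For example, losses of 3.0, 2.0, 1.5 against oracle losses of 1.0, 2.0, 1.0 give per-period regrets of 2.0, 0.0 and 0.5. A bound that is logarithmic in the time horizon means this running total grows at most like the logarithm of the number of periods.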
Human action recognition is a fundamental task in artificial vision that has gained great importance in recent years due to its multiple applications in different areas. In this context, this paper describes an approach for real-time human action recognition from raw depth image sequences provided by an RGB-D camera. The proposal is based on a 3D fully convolutional neural network, named 3DFCNN, which automatically encodes spatio-temporal patterns from depth sequences without pre-processing. Furthermore, the described 3D-CNN allows action classification from the spatial and temporal encoded information of depth sequences. The use of depth data ensures that action recognition is carried out while protecting people's privacy, since their identities cannot be recognized from these data. 3DFCNN has been evaluated and its results compared to those of other state-of-the-art methods on three widely used datasets with different characteristics (resolution, sensor type, number of views, camera location, etc.). The obtained results validate the proposal, showing that it outperforms several state-of-the-art approaches based on classical computer vision techniques. Furthermore, it achieves action recognition accuracy comparable to deep learning based state-of-the-art methods with a lower computational cost, which allows its use in real-time applications.
Social networking and micro-blogging services, such as Twitter, play an important role in sharing digital information. Despite the popularity and usefulness of social media, there have been many instances where corrupt users found ways to abuse it, for instance by raising or lowering a user's credibility. As a result, while social media facilitates an unprecedented ease of access to information, it also introduces a new challenge: ascertaining the credibility of shared information. Currently, there is no automated way of determining which news or users are credible and which are not. Hence, establishing a system that can measure a social media user's credibility has become an issue of great importance. Assigning a credibility score to a user has piqued the interest of not only the research community but also most of the big players on both sides, such as Facebook on the side of industry and political parties on the societal one. In this work, we created a model which, we hope, will ultimately facilitate and support the increase of trust in social network communities. We collected data on and analysed the behaviour of~50,000 politicians on Twitter. An influence score, based on several chosen features, was assigned to each evaluated user. Further, we classified the political Twitter users as either trusted or untrusted using random forest, multilayer perceptron, and support vector machine classifiers. An active learning model was used to classify any unlabelled ambiguous records from our dataset. Finally, to measure the performance of the proposed model, we used precision, recall, F1 score, and accuracy as the main evaluation metrics.
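The four evaluation metrics named above follow directly from the counts of true and false positives and negatives. The sketch below is a plain-Python illustration under stated assumptions: the function name `classification_metrics` and the choice of "trusted" as the positive class are hypothetical, and this is not the authors' evaluation code.

```python
def classification_metrics(y_true, y_pred, positive="trusted"):
    """Compute precision, recall, F1 and accuracy for a binary labelling.

    `positive` names the class treated as the positive one (an assumption
    made for this illustration).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

Precision and recall are reported alongside accuracy because accuracy alone can be misleading when the trusted and untrusted classes are imbalanced.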
This paper presents a model that uses the information that sellers publish on real estate market websites to predict whether a property has a higher or lower price than the average price of similar properties. The model learns the correlation between price and information (text descriptions and features) of real estate properties through automatic identification of latent semantic content given by a machine learning model based on doc2vec and xgboost. The proposed model was evaluated with a data set of 57,516 publications of real estate properties in Bogot\'a city, collected from 2016 to 2018. Results show that the accuracy of a classifier that involves text descriptions is slightly higher than that of a classifier that only uses features of the real estate properties, as text descriptions tend to contain detailed information about the property.
The free energy principle from neuroscience provides a brain-inspired perception scheme through a data-driven model learning algorithm called Dynamic Expectation Maximization (DEM). This paper introduces an experimental design to provide the first experimental confirmation of the usefulness of DEM as a state and input estimator for real robots. Through a series of quadcopter flight experiments under unmodelled wind dynamics, we show that DEM can leverage the information from colored noise for accurate state and input estimation through the use of generalized coordinates. We demonstrate the superior performance of DEM for state estimation under colored noise with respect to other benchmarks like State Augmentation, SMIKF and Kalman Filtering through its minimal estimation error. We demonstrate the similarities in the performance of DEM and Unknown Input Observer (UIO) for input estimation. The paper concludes by showing the influence of prior beliefs in shaping the accuracy-complexity trade-off during DEM's estimation.
Deep learning has transformed computer vision, natural language processing, and speech recognition~\cite{badrinarayanan2017segnet, dong2016image, ren2017faster, ji20133d}. However, two critical questions remain obscure: (1) why do deep neural networks generalize better than shallow networks; and (2) does it always hold that a deeper network leads to better performance? Specifically, letting $L$ be the number of convolutional and pooling layers in a deep neural network, and $n$ be the size of the training sample, we derive an upper bound on the expected generalization error for this network, i.e., \begin{eqnarray*} \mathbb{E}[R(W)-R_S(W)] \leq \exp{\left(-\frac{L}{2}\log{\frac{1}{\eta}}\right)}\sqrt{\frac{2\sigma^2}{n}I(S,W) } \end{eqnarray*} where $\sigma >0$ is a constant depending on the loss function, $0<\eta<1$ is a constant depending on the information loss for each convolutional or pooling layer, and $I(S, W)$ is the mutual information between the training sample $S$ and the output hypothesis $W$. This upper bound shows that as the number of convolutional and pooling layers $L$ increases in the network, the expected generalization error will decrease exponentially to zero. Layers with strict information loss, such as the convolutional layers, reduce the generalization error for the whole network; this answers the first question. However, an algorithm with zero expected generalization error does not necessarily have a small test error $\mathbb{E}[R(W)]$. This is because the training error $\mathbb{E}[R_S(W)]$ is large when the information for fitting the data is lost as the number of layers increases. This suggests that the claim `the deeper the better' is conditioned on a small training error $\mathbb{E}[R_S(W)]$. Finally, we show that deep learning satisfies a weak notion of stability and the sample complexity of deep neural networks will decrease as $L$ increases.
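To make the exponential decay in $L$ concrete, the right-hand side of the bound above can be evaluated numerically. The sketch below is only an illustration: the function name `generalization_bound` is hypothetical, and any values passed for $\eta$, $\sigma$, $n$ and $I(S,W)$ are assumptions rather than quantities reported by the paper. Note that $\exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right) = \eta^{L/2}$, so each additional pair of layers multiplies the bound by $\eta$.

```python
import math


def generalization_bound(num_layers, eta, sigma, n, mutual_info):
    """Evaluate exp(-(L/2) * log(1/eta)) * sqrt(2 * sigma^2 / n * I(S, W)).

    All argument values are illustrative assumptions supplied by the caller.
    """
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie strictly between 0 and 1")
    if sigma <= 0 or n <= 0 or mutual_info < 0 or num_layers < 0:
        raise ValueError("sigma and n must be positive; "
                         "num_layers and mutual_info must be non-negative")
    # Contraction factor; algebraically equal to eta ** (num_layers / 2).
    contraction = math.exp(-(num_layers / 2.0) * math.log(1.0 / eta))
    return contraction * math.sqrt(2.0 * sigma ** 2 / n * mutual_info)
```

With $L = 0$ the contraction factor is 1 and only the $\sqrt{2\sigma^2 I(S,W)/n}$ term remains. Increasing $L$ shrinks the bound geometrically, although, as the abstract notes, this says nothing about the training error $\mathbb{E}[R_S(W)]$.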
Document-level relation extraction aims to extract relations among multiple entity pairs from a document. Previously proposed graph-based or transformer-based models utilize the entities independently, regardless of global information among relational triples. This paper approaches the problem by predicting an entity-level relation matrix to capture local and global information, parallel to the semantic segmentation task in computer vision. Herein, we propose a Document U-shaped Network for document-level relation extraction. Specifically, we leverage an encoder module to capture the context information of entities and a U-shaped segmentation module over the image-style feature map to capture global interdependency among triples. Experimental results show that our approach can obtain state-of-the-art performance on three benchmark datasets DocRED, CDR, and GDA.
The video game industry has seen rapid growth over the last decade. Thousands of video games are released and played by millions of people every year, creating a large community of players. Steam is a leading gaming platform and social networking site, which allows its users to purchase and store games. A by-product of Steam is a large database of information about games, players, and gaming behavior. In this paper, we take recent video games released on Steam and aim to discover the relation between game popularity and a game's features that can be acquired through Steam. We approach this task by predicting the popularity of Steam games in the early stages after their release and we use a Bayesian approach to understand the influence of a game's price, size, supported languages, release date, and genres on its player count. We implement several models and discover that a genre-based hierarchical approach achieves the best performance. We further analyze the model and interpret its coefficients, which indicate that games released at the beginning of the month and games of certain genres correlate with game popularity.
Knowledge of population distribution is critical for building infrastructure, distributing resources, and monitoring the progress of sustainable development goals. Although censuses can provide this information, they are typically conducted every ten years with some countries having forgone the process for several decades. Population can change in the intercensal period due to rapid migration, development, urbanisation, natural disasters, and conflicts. Census-independent population estimation approaches using alternative data sources, such as satellite imagery, have shown promise in providing frequent and reliable population estimates locally. Existing approaches, however, require significant human supervision, for example annotating buildings and accessing various public datasets, and therefore, are not easily reproducible. We explore recent representation learning approaches, and assess the transferability of representations to population estimation in Mozambique. Using representation learning reduces required human supervision, since features are extracted automatically, making the process of population estimation more sustainable and likely to be transferable to other regions or countries. We compare the resulting population estimates to existing population products from GRID3, Facebook (HRSL) and WorldPop. We observe that our approach matches the most accurate of these maps, and is interpretable in the sense that it recognises built-up areas to be an informative indicator of population.