Most existing recommender systems are based only on rating data, ignoring other sources of information that could improve the quality of recommendations, such as textual reviews or user and item characteristics. Moreover, the majority of these systems are applicable only to small datasets (with thousands of observations) and are unable to handle large datasets (with millions of observations). We propose a recommender algorithm that combines a rating modelling technique (i.e., the Latent Factor Model) with a topic modelling method based on textual reviews (i.e., Latent Dirichlet Allocation), and we extend the algorithm so that it allows adding extra user- and item-specific information to the system. We evaluate the performance of the algorithm using Amazon.com datasets of different sizes, corresponding to 23 product categories. After comparing the built model to four other models, we found that combining textual reviews with ratings leads to better recommendations. Moreover, we found that adding extra user and item features to the model increases its prediction accuracy, especially for medium-sized and large datasets.
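As a rough illustration of this kind of hybrid model, the sketch below couples a bias-aware latent factor model with LDA topic proportions learned from item reviews via gensim, tying item factors to topic proportions through a penalty term. The toy data, the coupling penalty `tie`, and all hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: matrix factorization whose item factors are seeded by,
# and softly tied to, LDA topic proportions of the item's reviews.
import numpy as np
from gensim import corpora, models

# Toy data: (user, item, rating) triples and tokenized reviews per item.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 2.0), (2, 1, 1.0)]
item_reviews = [
    ["great", "battery", "fast", "screen"],
    ["poor", "battery", "slow"],
    ["fast", "shipping", "great", "value"],
]

K = 2  # latent dimension = number of topics, so factors and topics align
dictionary = corpora.Dictionary(item_reviews)
corpus = [dictionary.doc2bow(doc) for doc in item_reviews]
lda = models.LdaModel(corpus, num_topics=K, id2word=dictionary, random_state=0)

# theta[i] = topic proportions of item i's reviews.
theta = np.array([
    [p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
    for bow in corpus
])

n_users, n_items = 3, 3
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, K))   # user factors
Q = theta.copy()                               # item factors seeded by topics
b_u, b_i = np.zeros(n_users), np.zeros(n_items)
mu = np.mean([r for _, _, r in ratings])

lr, reg, tie = 0.02, 0.05, 0.1                 # 'tie' pulls Q back toward theta
for epoch in range(200):
    for u, i, r in ratings:
        err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        P[u], Q[i] = (
            P[u] + lr * (err * Q[i] - reg * P[u]),
            Q[i] + lr * (err * P[u] - reg * Q[i] - tie * (Q[i] - theta[i])),
        )

print("predicted r(0,2):", mu + b_u[0] + b_i[2] + P[0] @ Q[2])
```

Extra user and item features could be appended to `P` and `Q` in the same way the topic proportions are tied in here.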
Human Activity Recognition (HAR) is an ongoing research topic with applications in medical support, sports, fitness, social networking, human-computer interfaces, senior care, entertainment, and surveillance, among many other domains. Traditionally, computer vision methods were employed for HAR, but they suffer from numerous problems such as privacy concerns, the influence of environmental factors, limited mobility, higher running costs, and occlusion. Lately, a new trend has emerged in the use of sensors, especially inertial sensors, and employing sensor data as an alternative to traditional computer vision algorithms offers several advantages. Many of the limitations of computer vision algorithms have been documented in the literature, alongside research on Deep Neural Network (DNN) and Machine Learning (ML) approaches for activity categorization utilizing sensor data. We examined and analyzed different Machine Learning and Deep Learning approaches for Human Activity Recognition using smartphone inertial sensor data, in order to identify which approach is best suited for this application.
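For concreteness, a classical-ML HAR pipeline of the kind being compared typically segments the inertial stream into fixed windows, extracts simple statistics, and fits a classifier. The sketch below is illustrative only: the synthetic accelerometer data, window length, features, and choice of random forest are assumptions, not the surveyed systems.

```python
# Illustrative HAR pipeline: windowed 3-axis accelerometer data -> simple
# statistical features -> random forest classifier. Data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
fs, win = 50, 128                      # 50 Hz sampling, 128-sample windows
t = np.arange(win) / fs

def make_window(activity):
    """Synthetic 3-axis window: walking ~2 Hz oscillation, standing ~flat."""
    base = np.sin(2 * np.pi * 2.0 * t) if activity == "walking" else np.zeros(win)
    return np.stack([base + 0.1 * rng.normal(size=win) for _ in range(3)])

def features(w):
    """Per-axis mean and std, plus overall signal magnitude area."""
    return np.concatenate([w.mean(1), w.std(1), [np.abs(w).sum() / win]])

X = np.array([features(make_window(a)) for a in ["walking", "standing"] * 200])
y = np.array(["walking", "standing"] * 200)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
print("held-out accuracy:", clf.score(Xte, yte))
```

A deep-learning variant would replace the hand-crafted `features` step with, e.g., a 1D CNN or recurrent network operating on the raw windows.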
Indiscriminate data poisoning attacks, which add imperceptible perturbations to training data in order to maximize the test error of trained models, have become a trendy topic because they are thought to be capable of preventing unauthorized use of data. In this work, we investigate why these perturbations work in principle. We find that the perturbations of advanced poisoning attacks are almost \textbf{linearly separable} when assigned the target labels of the corresponding samples, and hence can work as \emph{shortcuts} for the learning objective. This important population property has not been unveiled before. Moreover, we further verify that linear separability is indeed the workhorse of poisoning attacks. We synthesize linearly separable data as perturbations and show that such synthetic perturbations are as powerful as the deliberately crafted attacks. Our finding suggests that the \emph{shortcut learning} problem is more serious than previously believed, as deep learning heavily relies on shortcuts even if they are of an imperceptible scale and mixed together with the normal features. This finding also suggests that pre-trained feature extractors can disable these poisoning attacks effectively.
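The synthesis experiment can be conveyed in a few lines: give each class a fixed random pattern of small norm and add it to every training image of that class, so the patterns themselves form a linearly separable shortcut. The sketch below assumes CIFAR-10-like shapes and an $L_\infty$ budget of 8/255; it illustrates the idea, not the paper's exact construction.

```python
# Sketch: class-wise random sign patterns as linearly separable perturbations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, d, eps = 10, 3 * 32 * 32, 8 / 255   # CIFAR-10-like, L_inf budget

# One fixed random sign pattern per class, scaled to an imperceptible norm.
patterns = eps * np.sign(rng.normal(size=(n_classes, d)))

def poison(images, labels):
    """images: (n, d) in [0, 1]; labels: (n,) ints. Returns poisoned copies."""
    return np.clip(images + patterns[labels], 0.0, 1.0)

# The perturbations are linearly separable by construction: a linear model
# trained on {pattern -> class} classifies jittered patterns perfectly.
Xp = patterns + 0.01 * eps * rng.normal(size=(n_classes, d))
clf = LogisticRegression(max_iter=1000).fit(patterns, np.arange(n_classes))
print("accuracy on jittered patterns:", clf.score(Xp, np.arange(n_classes)))
```

Because the patterns correlate perfectly with the labels, a trained network can latch onto them instead of the true image features, which is exactly the shortcut behavior described above.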
The remarkable performance of pre-trained language models (LMs) using self-supervised learning has led to a major paradigm shift in the study of natural language processing. In line with these changes, improving the performance of speech recognition systems with massive deep learning-based LMs is a major topic of speech recognition research. Among the various methods of applying LMs to speech recognition systems, in this paper we focus on a cross-modal knowledge distillation method that transfers knowledge between two types of deep neural networks with different modalities. We propose an acoustic model structure with multiple auxiliary output layers for cross-modal distillation and demonstrate that the proposed method effectively compensates for the shortcomings of the existing label-interpolation-based distillation method. In addition, we extend the proposed method to a hierarchical distillation method using LMs trained on different units (senones, monophones, and subwords) and reveal the effectiveness of the hierarchical distillation method through an ablation study.
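The sketch below shows one plausible shape for such an acoustic model: an encoder that emits predictions at several depths, with each auxiliary head trained against soft targets from an LM teacher via a label-interpolation-style loss. Layer placement, unit inventories, and loss weights are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: acoustic encoder with auxiliary output heads at several depths,
# each distillable against LM-provided soft labels for its unit type.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMWithAuxHeads(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_units=(500, 100, 2000)):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.LSTM(feat_dim if i == 0 else hidden, hidden, batch_first=True)
            for i in range(3)
        ])
        # One head per depth, e.g. senones / monophones / subwords.
        self.heads = nn.ModuleList([nn.Linear(hidden, n) for n in n_units])

    def forward(self, x):
        outs = []
        for lstm, head in zip(self.layers, self.heads):
            x, _ = lstm(x)
            outs.append(head(x))     # logits at this depth
        return outs

def distill_loss(student_logits, teacher_probs, hard_labels, alpha=0.5, T=2.0):
    """Interpolate CE on hard labels with KL to the teacher's soft labels."""
    ce = F.cross_entropy(student_logits.transpose(1, 2), hard_labels)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  teacher_probs, reduction="batchmean") * T * T
    return alpha * ce + (1 - alpha) * kl

model = AMWithAuxHeads()
outs = model(torch.randn(4, 50, 80))            # (batch, frames, features)
teacher = F.softmax(torch.randn_like(outs[-1]) / 2.0, dim=-1)
labels = torch.randint(0, 2000, (4, 50))
print(distill_loss(outs[-1], teacher, labels))
```

In a hierarchical setup, each head would receive soft targets from an LM trained on its own unit inventory, with per-head loss weights summed into the total objective.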
In this paper, I will discuss the work I am currently doing as a Ph.D. student at the University of Potsdam, under the supervision of T. Schaub. I'm currently looking into action description in ASP. More precisely, my goal is to explore how to represent actions with durations in ASP, in different contexts. Right now, I'm focused on Multi-Agent Path Finding (MAPF), looking at how to represent different speeds for different agents and contexts. Before tackling duration, I wanted to explore and compare different representations of action taking in ASP. For this, I started comparing different simple encodings tackling the MAPF problem. Even in simple encodings, choices and assumptions are made during their creation. The objective of my work is to present the consequences of those design decisions in terms of performance and knowledge representation. As far as I know, there is no current research on this topic. Besides that, I'm also exploring different ways to represent duration and to solve related problems. I plan to compare them in the same way as described above. I also want this work to help me find innovative and effective ways to solve problems with durations.
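To give a flavor of the kind of encoding choices under comparison, here is a toy single-agent MAPF encoding run through clingo's Python API. The graph, horizon, and the move/wait rule style are my own illustrative choices, and collision constraints between agents are omitted for brevity; they are not any specific encoding from the study.

```python
# Toy MAPF encoding (single agent, no collision constraints) via clingo.
import clingo

PROGRAM = """
#const horizon=3.
vertex(1..4).
edge(1,2). edge(2,3). edge(3,4).
edge(X,Y) :- edge(Y,X).
agent(a). start(a,1). goal(a,4).

at(A,V,0) :- start(A,V).
% at each step an agent either moves along one edge or waits
{ move(A,U,V,T) : edge(U,V) } 1 :- agent(A), at(A,U,T), T < horizon.
at(A,V,T+1) :- move(A,_,V,T).
at(A,V,T+1) :- at(A,V,T), T < horizon, not moved(A,T).
moved(A,T) :- move(A,_,_,T).
:- goal(A,V), not at(A,V,horizon).
"""

ctl = clingo.Control(["1"])          # compute one stable model (one plan)
ctl.add("base", [], PROGRAM)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print("plan:", [
    str(a) for a in m.symbols(shown=True) if a.name == "move"]))
```

Already in this tiny program there are design decisions with performance and representational consequences, e.g. whether waiting is implicit (inertia, as here) or an explicit `wait/2` action, and whether moves are encoded over edges or over positions.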
In recent years, computer-aided diagnosis has become an increasingly popular topic. Methods based on convolutional neural networks have achieved good performance in medical image segmentation and classification, but due to the limitations of the convolution operation, long-range spatial features are often not accurately captured. Hence, we propose TransClaw U-Net, a network structure that combines the convolution operation with the transformer operation in the encoding part. The convolution part is applied to extract shallow spatial features and facilitate the recovery of image resolution after upsampling. The transformer part is used to encode the patches, and the self-attention mechanism is used to obtain global information between sequences. The decoding part retains the bottom upsampling structure for better detail segmentation performance. Experimental results on the Synapse Multi-organ Segmentation dataset show that TransClaw U-Net outperforms other network structures. Ablation experiments also demonstrate the generalization ability of TransClaw U-Net.
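The following schematic shows the general shape of such a hybrid encoder: shallow convolutions keep full-resolution spatial features for the decoder skips, while a transformer encodes downsampled patch tokens for global context. Channel sizes, patch size, and depths here are illustrative, not the paper's exact configuration.

```python
# Schematic hybrid encoder: conv features for skips + transformer tokens.
import torch
import torch.nn as nn

class HybridEncoderBlock(nn.Module):
    def __init__(self, in_ch=1, conv_ch=64, d_model=256, n_layers=2):
        super().__init__()
        self.conv = nn.Sequential(                     # shallow spatial features
            nn.Conv2d(in_ch, conv_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(conv_ch, conv_ch, 3, padding=1), nn.ReLU(),
        )
        self.down = nn.Conv2d(conv_ch, d_model, 4, stride=4)  # 4x4 patch tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):
        skip = self.conv(x)                            # kept for the decoder skip
        tok = self.down(skip)                          # (B, D, H/4, W/4)
        B, D, h, w = tok.shape
        tok = tok.flatten(2).transpose(1, 2)           # (B, h*w, D) sequence
        tok = self.transformer(tok)                    # global self-attention
        return skip, tok.transpose(1, 2).reshape(B, D, h, w)

skip, ctx = HybridEncoderBlock()(torch.randn(2, 1, 64, 64))
print(skip.shape, ctx.shape)   # (2, 64, 64, 64) and (2, 256, 16, 16)
```

A U-Net-style decoder would then upsample `ctx` and concatenate `skip` at matching resolutions to recover detail.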
In this paper, we investigate the parameter identification problem in dynamical systems through a deep learning approach. Focusing mainly on second-order, linear time-invariant dynamical systems, we study the problem of damping factor identification. By utilizing a six-layer deep neural network with different recurrent cells, namely GRUs, LSTMs, or BiLSTMs, and by feeding it input-output sequence pairs captured from a dynamical system simulator, we search for an effective deep recurrent architecture to solve the damping factor identification problem. Our results show that, although previously not utilized for this task in the literature, bidirectional gated recurrent cells (BiLSTMs) provide better parameter identification results than unidirectional gated recurrent memory cells such as GRUs and LSTMs. This indicates that an input-output sequence pair of finite length, collected from a dynamical system and observed anachronistically, may carry information in both time directions for the prediction of a dynamical system's parameters.
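A minimal version of this setup can be sketched as follows: simulate step responses of a second-order system $\ddot{x} + 2\zeta\omega_n\dot{x} + \omega_n^2 x = \omega_n^2 u$ for random damping ratios $\zeta$, then regress $\zeta$ from the response sequence with a BiLSTM. The single-layer network, training budget, and Euler integrator below are illustrative simplifications of the six-layer architecture described.

```python
# Sketch: regress the damping ratio of a simulated second-order system
# from its step response using a bidirectional LSTM.
import numpy as np
import torch
import torch.nn as nn

def step_response(zeta, wn=2.0, T=200, dt=0.05):
    """Forward-Euler simulation of the damped oscillator's step response."""
    x, v, out = 0.0, 0.0, []
    for _ in range(T):
        a = wn**2 * (1.0 - x) - 2.0 * zeta * wn * v
        v += dt * a
        x += dt * v
        out.append(x)
    return np.array(out, dtype=np.float32)

rng = np.random.default_rng(0)
zetas = rng.uniform(0.1, 0.9, size=256).astype(np.float32)
X = torch.tensor(np.stack([step_response(z) for z in zetas]))[..., None]
y = torch.tensor(zetas)[:, None]

class BiLSTMRegressor(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(1, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)
    def forward(self, x):
        h, _ = self.rnn(x)
        return self.head(h[:, -1])    # predict zeta from the final state

model = BiLSTMRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
print("final training MSE:", float(loss))
```

Swapping `bidirectional=True` for `False` (or `nn.LSTM` for `nn.GRU`) reproduces the unidirectional baselines being compared.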
Characterizing the privacy degradation over compositions, i.e., privacy accounting, is a fundamental topic in differential privacy (DP) with many applications to differentially private machine learning and federated learning. We propose a unification of recent advances (Renyi DP, privacy profiles, $f$-DP and the PLD formalism) via the characteristic function ($\phi$-function) of a certain ``worst-case'' privacy loss random variable. We show that our approach allows natural adaptive composition like Renyi DP, provides exactly tight privacy accounting like PLD, and can be (often losslessly) converted to privacy profile and $f$-DP, thus providing $(\epsilon,\delta)$-DP guarantees and interpretable tradeoff functions. Algorithmically, we propose an analytical Fourier accountant that represents the complex logarithm of $\phi$-functions symbolically and uses Gaussian quadrature for numerical computation. On several popular DP mechanisms and their subsampled counterparts, we demonstrate the flexibility and tightness of our approach in theory and experiments.
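A small numerical illustration of the characteristic-function view: for the Gaussian mechanism with sensitivity 1 and noise multiplier $\sigma$, the privacy loss random variable is $L \sim \mathcal{N}(1/(2\sigma^2), 1/\sigma^2)$, so its $\phi$-function can be computed by Gauss-Hermite quadrature and checked against the closed form. This sketches only the quadrature idea, not the paper's full symbolic Fourier accountant.

```python
# Compute phi(t) = E[exp(i t L)] for the Gaussian mechanism's privacy
# loss RV by Gauss-Hermite quadrature; verify against the closed form.
import numpy as np

sigma = 1.0                              # noise multiplier, sensitivity 1
mu, s = 1 / (2 * sigma**2), 1 / sigma    # L ~ N(mu, s^2)

nodes, weights = np.polynomial.hermite.hermgauss(80)

def phi(t):
    """E[exp(i t L)] with L(x) = (1 - 2x) / (2 sigma^2), x ~ N(0, sigma^2)."""
    x = np.sqrt(2.0) * sigma * nodes                 # change of variables
    L = (1.0 - 2.0 * x) / (2.0 * sigma**2)
    return (weights * np.exp(1j * t * L)).sum() / np.sqrt(np.pi)

t = 0.7
print(phi(t), np.exp(1j * t * mu - 0.5 * (t * s) ** 2))  # should agree

# Under n-fold adaptive composition the loss RVs add, so phi just powers:
n = 10
print(phi(t) ** n, np.exp(n * (1j * t * mu - 0.5 * (t * s) ** 2)))
```

The multiplicative behavior under composition in the last two lines is what makes the $\phi$-function a convenient object for tight accounting.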
First person action recognition is an increasingly researched topic because of the growing popularity of wearable cameras. This is bringing to light cross-domain issues that are yet to be addressed in this context. Indeed, the information extracted from learned representations suffers from an intrinsic environmental bias. This strongly affects the ability to generalize to unseen scenarios, limiting the application of current methods in real settings where trimmed labeled data are not available during training. In this work, we propose to leverage the intrinsic complementary nature of audio-visual signals to learn a representation that works well on data seen during training while also generalizing across different domains. To this end, we introduce an audio-visual loss that aligns the contributions of the two modalities by acting on the magnitude of their feature norms. This new loss, plugged into a minimal multi-modal action recognition architecture, leads to strong results in cross-domain first person action recognition, as demonstrated by extensive experiments on the popular EPIC-Kitchens dataset.
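One simple way to act on feature-norm magnitudes is to penalize the mismatch between the mean norms of the two streams so that neither modality dominates, as in the hedged sketch below. The exact form of the loss described above may differ; this only illustrates the mechanism.

```python
# Sketch: penalize the ratio of mean audio/visual feature norms so the
# two modality branches contribute with comparable magnitude.
import torch

def norm_alignment_loss(feat_audio, feat_video, eps=1e-8):
    """feat_*: (batch, dim) embeddings from the two modality branches."""
    na = feat_audio.norm(p=2, dim=1).mean()
    nv = feat_video.norm(p=2, dim=1).mean()
    return (na / (nv + eps) - 1.0) ** 2   # 0 when the mean norms match

fa, fv = torch.randn(8, 512), 3.0 * torch.randn(8, 512)
print(float(norm_alignment_loss(fa, fv)))   # large: video norms dominate
# Total objective: classification loss + lambda * norm_alignment_loss.
```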
Human volumetric capture is a long-standing topic in computer vision and computer graphics. Although high-quality results can be achieved using sophisticated off-line systems, real-time human volumetric capture of complex scenarios, especially with lightweight setups, remains challenging. In this paper, we propose a human volumetric capture method that combines temporal volumetric fusion with deep implicit functions. To achieve high-quality and temporally continuous reconstruction, we propose dynamic sliding fusion, which fuses neighboring depth observations together with topology consistency. Moreover, for detailed and complete surface generation, we propose detail-preserving deep implicit functions for RGBD input, which not only preserve the geometric details of the depth inputs but also generate more plausible texturing results. Results and experiments show that our method outperforms existing methods in terms of view sparsity, generalization capacity, reconstruction quality, and run-time efficiency.
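For readers unfamiliar with deep implicit functions, the schematic below shows the basic query pattern: an MLP maps a 3D point plus a pixel-aligned image feature to an occupancy value, and the surface is extracted as the 0.5 level set. The feature sampling and network size are illustrative; the RGBD-specific, detail-preserving design described above is more involved.

```python
# Schematic pixel-aligned implicit function: (point, sampled feature) -> occupancy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitFunction(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, points, feat_map, uv):
        """points: (B,N,3) queries; feat_map: (B,C,H,W); uv: (B,N,2) in [-1,1]."""
        # Sample the pixel-aligned feature at each query's image projection.
        f = F.grid_sample(feat_map, uv.unsqueeze(2), align_corners=True)
        f = f.squeeze(-1).transpose(1, 2)            # (B, N, C)
        return torch.sigmoid(self.mlp(torch.cat([points, f], dim=-1)))

net = ImplicitFunction()
occ = net(torch.randn(2, 100, 3), torch.randn(2, 64, 32, 32),
          torch.rand(2, 100, 2) * 2 - 1)
print(occ.shape)    # (2, 100, 1): occupancy per query point
```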