Inspired by the InstaHide challenge [Huang, Song, Li and Arora'20], [Chen, Song and Zhuo'20] recently provided a mathematical formulation of the InstaHide attack problem under a Gaussian image distribution. They show that $O(n_{\mathsf{priv}}^{k_{\mathsf{priv}} - 2/(k_{\mathsf{priv}} + 1)})$ samples suffice to recover one private image in $n_{\mathsf{priv}}^{O(k_{\mathsf{priv}})} + \mathrm{poly}(n_{\mathsf{pub}})$ time for any integer $k_{\mathsf{priv}}$, where $n_{\mathsf{priv}}$ and $n_{\mathsf{pub}}$ denote the number of images used in the private and the public dataset to generate a mixed image sample. Under the current setup for the InstaHide challenge of mixing two private images ($k_{\mathsf{priv}} = 2$), this means $n_{\mathsf{priv}}^{4/3}$ samples are sufficient to recover a private image. In this work, we show that $n_{\mathsf{priv}} \log ( n_{\mathsf{priv}} )$ samples are sufficient (information-theoretically) for recovering all the private images.
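As a quick check of the arithmetic in the abstract above, substituting $k_{\mathsf{priv}} = 2$ into the exponent of the earlier bound gives
$$k_{\mathsf{priv}} - \frac{2}{k_{\mathsf{priv}} + 1} = 2 - \frac{2}{3} = \frac{4}{3},$$
which matches the $n_{\mathsf{priv}}^{4/3}$ sample count quoted for the challenge setting. Since $\log(n_{\mathsf{priv}}) = o(n_{\mathsf{priv}}^{1/3})$, the $n_{\mathsf{priv}} \log(n_{\mathsf{priv}})$ bound is asymptotically smaller.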
The formation of majorities in public discussions often depends on individuals who shift their opinion over time. The detection and characterization of this type of individual is therefore extremely important for the political analysis of social networks. In this paper, we study changes in individuals' affiliations on Twitter using natural language processing techniques and graph machine learning algorithms. In particular, we collected 9 million Twitter messages from 1.5 million users and constructed the retweet networks. We identified communities with explicit political orientation and the topics of discussion associated with them, which together provide a topological representation of the political map on Twitter in the analyzed periods. With that data, we present a machine learning framework for social media user classification which efficiently detects "shifting users" (i.e., users that may change their affiliation over time). Moreover, this machine learning framework allows us to identify not only which topics are more persuasive (using low-dimensional topic embedding), but also which individuals are more likely to change their affiliation given their topological properties in a Twitter graph.
Nowadays, organizations collect vast quantities of sensitive information in `Enterprise Resource Planning' (ERP) systems, such as accounting-relevant transactions, customer master data, or strategic sales price information. The leakage of such information poses a severe threat to companies, as the number of incidents and the reputational damage to those experiencing them continue to increase. At the same time, discoveries in deep learning research have revealed that machine learning models can be maliciously misused to create new attack vectors. Understanding the nature of such attacks is becoming increasingly important for (internal) audit and fraud examination practice. Such awareness is particularly needed for fraudulent data leakage using deep learning-based steganographic techniques, which might remain undetected by state-of-the-art `Computer Assisted Audit Techniques' (CAATs). In this work, we introduce a real-world `threat model' designed to leak sensitive accounting data. In addition, we show that a deep steganographic process, constituted by three neural networks, can be trained to hide such data in unobtrusive `day-to-day' images. Finally, we provide qualitative and quantitative evaluations on two publicly available real-world payment datasets.
Bitcoin and its decentralized computing paradigm for digital currency trading are among the most disruptive technologies of the 21st century. This paper presents a novel approach to developing a Bitcoin transaction forecast model, DLForecast, by leveraging deep neural networks for learning Bitcoin transaction network representations. DLForecast makes three original contributions. First, we explore three interesting properties between Bitcoin transaction accounts: the topological connectivity pattern of Bitcoin accounts, the transaction amount pattern, and transaction dynamics. Second, we construct a time-decaying reachability graph and a time-decaying transaction pattern graph, aiming at capturing different types of spatial-temporal Bitcoin transaction patterns. Third, we employ node embedding on both graphs and develop a Bitcoin transaction forecasting system between user accounts based on historical transactions with a built-in time-decaying factor. To maintain effective transaction forecasting performance, we leverage the multiplicative model update (MMU) ensemble to combine prediction models built on different transaction features extracted from each corresponding Bitcoin transaction graph. Evaluated on real-world Bitcoin transaction data, our spatial-temporal forecasting model is efficient, with fast runtime, and effective, with forecasting accuracy over 60\%; it improves the prediction performance by 50\% compared with a forecasting model built on the static graph baseline.
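The idea of a time-decaying transaction graph can be illustrated with a minimal sketch. The abstract does not specify the decay function, so the exponential form, the `decay_rate` parameter, and the function name `time_decayed_edge_weights` below are all assumptions made purely for illustration, not the DLForecast implementation.

```python
import math
from collections import defaultdict


def time_decayed_edge_weights(transactions, now, decay_rate=0.1):
    """Aggregate transactions into directed edge weights that fade with age.

    transactions: iterable of (sender, receiver, timestamp, amount).
    now: reference time in the same units as the timestamps.
    decay_rate: hypothetical exponential decay constant (an assumption).
    """
    weights = defaultdict(float)
    for sender, receiver, timestamp, amount in transactions:
        age = now - timestamp
        if age < 0:
            # Ignore transactions dated after the reference time.
            continue
        # Older transactions contribute exponentially less to the edge.
        weights[(sender, receiver)] += amount * math.exp(-decay_rate * age)
    return dict(weights)


example = [("a", "b", 10, 1.0), ("a", "b", 0, 1.0), ("b", "c", 5, 2.0)]
edge_weights = time_decayed_edge_weights(example, now=10, decay_rate=0.1)
```

In this sketch a recent transfer counts at nearly full value while an old one is discounted, so node embeddings learned on the weighted graph would emphasize recent activity.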
Drug discovery aims at designing novel molecules with specific desired properties for clinical trials. Over the past decades, drug discovery and development have been a costly and time-consuming process. Driven by big chemical data and AI, deep generative models show great potential to accelerate the drug discovery process. Existing works investigate different deep generative frameworks for molecular generation; however, less attention has been paid to visualization tools for quickly demonstrating and evaluating a model's results. Here, we propose a visualization framework that provides interactive tools to visualize molecules generated during the encoding and decoding process of deep graph generative models, and that provides real-time molecular optimization functionalities. Our work aims to equip black-box AI-driven drug discovery models with some visual interpretability.
Recently, sequence-to-sequence (seq2seq) models with the Transformer architecture have achieved remarkable performance on various conditional text generation tasks, such as machine translation. However, most of them are trained with teacher forcing, with the ground truth label given at each time step, and are never exposed to incorrectly generated tokens during training. This hurts their generalization to unseen inputs and is known as the "exposure bias" problem. In this work, we propose to mitigate this problem in conditional text generation by contrasting positive pairs with negative pairs, such that the model is exposed to various valid or incorrect perturbations of the inputs, for improved generalization. However, training the model with a naive contrastive learning framework using random non-target sequences as negative examples is suboptimal, since they are easily distinguishable from the correct output, especially with models pretrained on large text corpora. Also, generating positive examples requires domain-specific augmentation heuristics which may not generalize over diverse domains. To tackle this problem, we propose a principled method to generate positive and negative samples for contrastive learning of seq2seq models. Specifically, we generate negative examples by adding small perturbations to the input sequence to minimize its conditional likelihood, and positive examples by adding large perturbations while enforcing it to have a high conditional likelihood. Such "hard" positive and negative pairs generated using our method guide the model to better distinguish correct outputs from incorrect ones. We empirically show that our proposed method significantly improves the generalization of seq2seq models on three text generation tasks: machine translation, text summarization, and question generation.
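To illustrate why random negatives give a weak training signal, here is a minimal sketch of a generic softmax-based contrastive loss over cosine similarities. The loss form, the `temperature` value, and the toy vectors are assumptions for illustration only; the abstract does not state the exact objective used.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of the positive among all candidates."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = np.array(sims) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_positive)


anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.1])
easy_negatives = [np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
hard_negatives = [np.array([1.0, 0.2])]

easy_loss = contrastive_loss(anchor, positive, easy_negatives)
hard_loss = contrastive_loss(anchor, positive, hard_negatives)
```

With the easily separable negatives the loss is close to zero, so there is little left to learn from them, whereas the negative that lies near the anchor yields a much larger loss.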
We state theoretical properties for $k$-means clustering of Symmetric Positive Definite (SPD) matrices in a non-Euclidean space that provides a natural and favourable representation of these data. We then provide a novel application of this method to time-series clustering of pixels in a sequence of Synthetic Aperture Radar images, via their finite-lag autocovariance matrices.
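A finite-lag autocovariance matrix for a single pixel's time series can be sketched as follows. This is one plausible reading of the construction (a Toeplitz matrix of sample autocovariances); the function name, the biased estimator, and the toy series are assumptions for illustration, not the authors' procedure.

```python
import numpy as np


def autocovariance_matrix(series, max_lag):
    """Toeplitz matrix whose (i, j) entry is the sample autocovariance at lag |i - j|."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = x.size
    # Biased estimator (divide by n): keeps the matrix positive semidefinite.
    gamma = np.array([np.dot(x[: n - h], x[h:]) / n for h in range(max_lag + 1)])
    idx = np.arange(max_lag + 1)
    lags = np.abs(np.subtract.outer(idx, idx))
    return gamma[lags]


pixel_series = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
acov = autocovariance_matrix(pixel_series, max_lag=2)
```

Each pixel then contributes one symmetric matrix, and clustering operates on these matrices rather than on the raw series.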
Historical corpora are known to contain errors introduced by the OCR (optical character recognition) methods used in the digitization process, and these errors are often said to degrade the performance of NLP systems. Correcting them manually is time-consuming, and many automatic approaches have relied on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model for OCR error correction designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.
We review mathematical foundations of convolutional neural nets (CNNs) with the goals of: i) highlighting connections with techniques from statistics, signal processing, linear algebra, differential equations, and optimization, ii) demystifying underlying computations, and iii) identifying new types of applications. CNNs are powerful machine learning models that highlight features from grid data to make predictions (regression and classification). The grid data object can be represented as vectors (in 1D), matrices (in 2D), or tensors (in 3D or higher dimensions) and can incorporate multiple channels (thus providing high flexibility in the input data representation). For example, an image can be represented as a 2D grid data object that contains red, green, and blue (RGB) channels (each channel is a 2D matrix). Similarly, a video can be represented as a 3D grid data object (two spatial dimensions plus time) with RGB channels (each channel is a 3D tensor). CNNs highlight features from the grid data by performing convolution operations with different types of operators. The operators highlight different types of features (e.g., patterns, gradients, geometrical features) and are learned by using optimization techniques. In other words, CNNs seek to identify optimal operators that best map the input data to the output data. A common misconception is that CNNs are only capable of processing image or video data, but their application scope is much wider; specifically, datasets encountered in diverse applications can be expressed as grid data. Here, we show how to apply CNNs to new types of applications such as optimal control, flow cytometry, multivariate process monitoring, and molecular simulations.
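The way a convolution operator highlights a feature in grid data can be sketched with a fixed, hand-picked operator. In a CNN the operator entries would be learned; the Sobel-type kernel, the function name `convolve2d_valid`, and the toy image below are assumptions chosen only to make the computation concrete. As in most deep learning libraries, the kernel is not flipped, so this is strictly a cross-correlation.

```python
import numpy as np


def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum elementwise products."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Horizontal-gradient operator (Sobel type), used here instead of a learned one.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Single-channel 5x5 grid whose intensity increases from left to right.
image = np.tile(np.arange(5.0), (5, 1))
response = convolve2d_valid(image, sobel_x)
```

Because the toy image has a constant horizontal slope, the response is the same at every position; an image with no horizontal variation would give a response of zero everywhere.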
Reduction of combinatorial filters involves compressing state representations that robots use. Such optimization arises in automating the construction of minimalist robots. But exact combinatorial filter reduction is an NP-complete problem and all current techniques are either inexact or formalized with exponentially many constraints. This paper proposes a new formalization needing only a polynomial number of constraints, and characterizes these constraints in three different forms: nonlinear, linear, and conjunctive normal form. Empirical results show that constraints in conjunctive normal form capture the problem most effectively, leading to a method that outperforms the others. Further examination indicates that a substantial proportion of constraints remain inactive during iterative filter reduction. To leverage this observation, we introduce just-in-time generation of such constraints, which yields improvements in efficiency and has the potential to minimize large filters.