
"Recommendation": models, code, and papers

How Much Data Analytics is Enough? The ROI of Machine Learning Classification and its Application to Requirements Dependency Classification

Sep 28, 2021
Gouri Deshpande, Guenther Ruhe, Chad Saunders

Machine Learning (ML) can substantially improve the efficiency and effectiveness of organizations and is widely used for different purposes within Software Engineering. However, the selection and implementation of ML techniques rely almost exclusively on accuracy criteria. For organizations wishing to realize the benefits of ML investments, this narrow approach ignores the anticipated costs of activities across the ML lifecycle and fails to account for the benefits likely to accrue from the proposed activity. We present findings for an approach that addresses this gap by enhancing the accuracy criterion with return on investment (ROI) considerations. Specifically, we analyze the performance of two state-of-the-art ML techniques, Random Forest and Bidirectional Encoder Representations from Transformers (BERT), based on accuracy and ROI for two publicly available data sets, comparing decision-making on requirements dependency extraction (i) based exclusively on accuracy and (ii) extended to include ROI analysis. Based on these results, we propose recommendations for selecting ML classification techniques according to the amount of training data used. Our findings indicate that considering ROI as an additional criterion can drastically influence ML selection compared to decisions based on accuracy as the sole criterion.
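The trade-off the abstract describes can be sketched with toy numbers (the costs, per-prediction value, and accuracies below are hypothetical, not the authors' data): a cheaper but slightly less accurate classifier can win on ROI while losing on accuracy.

```python
def ml_benefit(accuracy, value_per_correct, n_predictions):
    """Expected benefit if each correct prediction is worth a fixed value."""
    return accuracy * value_per_correct * n_predictions

def roi(benefit, cost):
    """Return on investment: net gain relative to cost."""
    return (benefit - cost) / cost

# All figures below are illustrative, not from the paper.
candidates = {
    "BERT": {"accuracy": 0.91, "cost": 50_000},
    "Random Forest": {"accuracy": 0.85, "cost": 8_000},
}

for name, m in candidates.items():
    benefit = ml_benefit(m["accuracy"], value_per_correct=1.0,
                         n_predictions=100_000)
    print(f"{name}: accuracy={m['accuracy']:.2f}, "
          f"ROI={roi(benefit, m['cost']):.2f}")
```

With these made-up figures, the Random Forest's ROI is an order of magnitude higher than BERT's despite the lower accuracy, which is the kind of reversal the abstract describes.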

* 15 pages, 13 figures, Journal 

Counterfactual Explanations as Interventions in Latent Space

Jun 14, 2021
Riccardo Crupi, Alessandro Castelnovo, Daniele Regoli, Beatriz San Miguel Gonzalez

Explainable Artificial Intelligence (XAI) is a set of techniques that allows the understanding of both technical and non-technical aspects of Artificial Intelligence (AI) systems. XAI is crucial to help satisfy the increasingly important demand for trustworthy AI, characterized by fundamental properties such as respect for human autonomy, prevention of harm, transparency, and accountability. Within XAI, counterfactual explanations aim to provide end users with a set of features (and their corresponding values) that need to be changed in order to achieve a desired outcome. Current approaches rarely take into account the feasibility of the actions needed to achieve the proposed explanations, and in particular they fall short of considering the causal impact of such actions. In this paper, we present Counterfactual Explanations as Interventions in Latent Space (CEILS), a methodology to generate counterfactual explanations that captures by design the underlying causal relations in the data while providing feasible recommendations to reach the proposed profile. Moreover, our methodology can be set on top of existing counterfactual generator algorithms, thus minimising the complexity of imposing additional causal constraints. We demonstrate the effectiveness of our approach with a set of experiments using synthetic and real datasets (including a proprietary dataset from the financial domain).
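The latent-intervention idea can be illustrated on a toy two-variable structural causal model (the SCM, its coefficient, and all values below are invented for illustration; the paper's method sits on top of real counterfactual generators):

```python
# Toy linear SCM (illustrative, not the paper's): x2 := 0.8 * x1 + u2.
# CEILS-style idea: search for counterfactuals in the space of exogenous
# (latent) variables u, so downstream causal effects of a change
# propagate automatically instead of being edited feature by feature.
A = 0.8

def abduct(x1, x2):
    """Recover exogenous noise consistent with the observation."""
    u1 = x1
    u2 = x2 - A * x1
    return u1, u2

def decode(u1, u2):
    """Map latent variables back to features via the SCM."""
    x1 = u1
    x2 = A * x1 + u2
    return x1, x2

x1, x2 = 1.0, 1.3
u1, u2 = abduct(x1, x2)
# Intervene in latent space: shift u1 (e.g. the root cause behind x1);
# x2 then changes through the causal link, unlike a feature-level edit.
cx1, cx2 = decode(u1 + 0.5, u2)
```

Here the counterfactual moves both features coherently: changing the latent input behind `x1` drags `x2` along the causal edge, which is the feasibility property the abstract emphasizes.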

* 34 pages, 4 figures, 4 tables 

The Power of Subsampling in Submodular Maximization

Apr 06, 2021
Christopher Harshaw, Ehsan Kazemi, Moran Feldman, Amin Karbasi

We propose subsampling as a unified algorithmic technique for submodular maximization in centralized and online settings. The idea is simple: independently sample elements from the ground set, and use simple combinatorial techniques (such as greedy or local search) on these sampled elements. We show that this approach leads to optimal/state-of-the-art results despite being much simpler than existing methods. In the usual offline setting, we present SampleGreedy, which obtains a $(p + 2 + o(1))$-approximation for maximizing a submodular function subject to a $p$-extendible system using $O(n + nk/p)$ evaluation and feasibility queries, where $k$ is the size of the largest feasible set. The approximation ratio improves to $p+1$ and $p$ for monotone submodular and linear objectives, respectively. In the streaming setting, we present SampleStreaming, which obtains a $(4p +2 - o(1))$-approximation for maximizing a submodular function subject to a $p$-matchoid using $O(k)$ memory and $O(km/p)$ evaluation and feasibility queries per element, where $m$ is the number of matroids defining the $p$-matchoid. The approximation ratio improves to $4p$ for monotone submodular objectives. We empirically demonstrate the effectiveness of our algorithms on video summarization, location summarization, and movie recommendation tasks.
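A minimal sketch of the subsampling idea, specialised to a cardinality constraint and a toy coverage objective (the sampling probability and data are illustrative; the paper's SampleGreedy handles general p-extendible systems):

```python
import random

def sample_greedy(ground, f, k, q, seed=0):
    """Subsample each element independently with probability q, then run
    plain greedy under a cardinality constraint k (a special case of the
    p-extendible systems considered in the paper)."""
    rng = random.Random(seed)
    sampled = [e for e in ground if rng.random() < q]
    chosen = []
    for _ in range(k):
        gains = [(f(chosen + [e]) - f(chosen), e)
                 for e in sampled if e not in chosen]
        if not gains:
            break
        gain, best = max(gains)
        if gain <= 0:
            break
        chosen.append(best)
    return chosen

# Monotone submodular objective: coverage of a toy universe.
SETS = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"a"}}

def coverage(S):
    return len(set().union(*(SETS[e] for e in S))) if S else 0

picked = sample_greedy(list(SETS), coverage, k=2, q=0.8, seed=1)
```

With q=1.0 (no subsampling), greedy on this toy instance picks sets 3 and 1, covering all five elements; subsampling trades a little coverage for a smaller working set and far fewer oracle queries.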

* arXiv admin note: text overlap with arXiv:1802.07098 

NUANCED: Natural Utterance Annotation for Nuanced Conversation with Estimated Distributions

Oct 24, 2020
Zhiyu Chen, Honglei Liu, Hu Xu, Seungwhan Moon, Hao Zhou, Bing Liu

Existing conversational systems are mostly agent-centric: they assume that user utterances will closely follow the system ontology (for NLU or dialogue state tracking). In real-world scenarios, however, it is highly desirable that users can speak freely in their own way, and it is extremely hard, if not impossible, for them to adapt to an unknown system ontology. In this work, we attempt to build a user-centric dialogue system. Since there is no clean mapping from a user's free-form utterance to an ontology, we first model user preferences as estimated distributions over the system ontology and map the user's utterances to such distributions. Learning such a mapping poses new challenges in reasoning over existing knowledge, ranging from factoid and commonsense knowledge to the users' own situations. To this end, we build a new dataset named NUANCED that focuses on such realistic settings for conversational recommendation. Collected via dialogue simulation and paraphrasing, NUANCED contains 5.1k dialogues and 26k turns of high-quality user responses. Our experiments show both the usefulness and the challenges of this problem setting. We believe NUANCED can serve as a valuable resource to push existing research from agent-centric systems toward user-centric systems. The code and data will be made publicly available.
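The distribution-over-ontology idea can be sketched as follows; the keyword lexicon below stands in for the learned mapping and is entirely made up:

```python
import math

# Instead of forcing a free-form utterance onto a single ontology value,
# represent the user's preference as a distribution over ontology values.
# Scores come from a toy hand-written lexicon; the paper learns this
# mapping from data.
ONTOLOGY = ["cheap", "moderate", "expensive"]
LEXICON = {
    "student budget": {"cheap": 2.0, "moderate": 0.5},
    "nothing fancy": {"cheap": 1.0, "moderate": 1.0},
    "special occasion": {"expensive": 2.0, "moderate": 0.5},
}

def preference_distribution(utterance):
    """Softmax over keyword-matched scores for each ontology value."""
    scores = {v: 0.0 for v in ONTOLOGY}
    for phrase, weights in LEXICON.items():
        if phrase in utterance.lower():
            for value, w in weights.items():
                scores[value] += w
    z = sum(math.exp(s) for s in scores.values())
    return {v: math.exp(s) / z for v, s in scores.items()}

dist = preference_distribution("I'm on a student budget, nothing fancy")
```

The result is a soft dialogue state: "cheap" dominates but "moderate" keeps non-zero mass, mirroring the estimated distributions NUANCED annotates.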

Data Readiness Report

Oct 15, 2020
Shazia Afzal, Rajmohan C, Manish Kesarwani, Sameep Mehta, Hima Patel

Data exploration and quality analysis is an important yet tedious process in the AI pipeline. Current practices of data cleaning and data readiness assessment for machine learning tasks are mostly conducted in an ad hoc manner, which limits their reuse and results in lost productivity. We introduce the concept of a Data Readiness Report as accompanying documentation for a dataset that gives data consumers detailed insight into the quality of the input data. Data characteristics and challenges along various quality dimensions are identified and documented with the principles of transparency and explainability in mind. The Data Readiness Report also serves as a record of all data assessment operations, including applied transformations, providing detailed lineage for data governance and management. In effect, the report captures and documents the actions taken by the various personas in a data readiness and assessment workflow. Over time, this becomes a repository of best practices and can potentially drive a recommendation system for building automated data readiness workflows along the lines of AutoML [8]. We anticipate that, together with Datasheets [9], the Dataset Nutrition Label [11], FactSheets [1], and Model Cards [15], the Data Readiness Report makes significant progress towards Data and AI lifecycle documentation.
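One possible minimal structure for such a report (field and persona names are illustrative, not taken from the paper):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Operation:
    """One logged action in the readiness workflow."""
    persona: str          # who acted: data steward, data scientist, ...
    transformation: str   # what was applied to the data
    timestamp: str

@dataclass
class DataReadinessReport:
    dataset: str
    quality_dimensions: dict = field(default_factory=dict)
    lineage: list = field(default_factory=list)

    def record(self, persona, transformation):
        """Append an operation, preserving lineage for governance."""
        self.lineage.append(Operation(
            persona, transformation,
            datetime.now(timezone.utc).isoformat()))

report = DataReadinessReport("loans.csv")
report.quality_dimensions["missing_values"] = 0.03
report.record("data_scientist", "imputed missing income with median")
```

Accumulating such records across projects is what would let the report double as the repository of best practices the abstract envisions.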

Multilingual Music Genre Embeddings for Effective Cross-Lingual Music Item Annotation

Sep 16, 2020
Elena V. Epure, Guillaume Salha, Romain Hennequin

Annotating music items with music genres is crucial for music recommendation and information retrieval, yet challenging given that music genres are subjective concepts. Recently, in order to explicitly account for this subjectivity, the annotation of music items was modeled as a translation task: predict for a music item its music genres within a target vocabulary or taxonomy (tag system) from a set of music genre tags originating from other tag systems. However, without a parallel corpus, previous solutions could not handle tag systems in other languages and were limited to English only. Here, by learning multilingual music genre embeddings, we enable cross-lingual music genre translation without relying on a parallel corpus. First, we apply compositionality functions to pre-trained word embeddings to represent multi-word tags. Second, we adapt the tag representations to the music domain by leveraging multilingual music genre graphs with a modified retrofitting algorithm. Experiments show that our method 1) is effective in translating music genres across tag systems in multiple languages (English, French, and Spanish) and 2) outperforms the previous baseline in an English-language multi-source translation task. We publicly release the new multilingual data and code.
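The compositionality step can be sketched with made-up two-dimensional "embeddings" (a real system would use pre-trained multilingual word vectors, and the paper further retrofits them to genre graphs):

```python
import math

# Toy vectors; in practice these come from pre-trained multilingual
# word embeddings aligned across languages.
EMB = {
    "hip": [0.9, 0.1], "hop": [0.8, 0.2],
    "rap": [0.85, 0.15],
    "classique": [0.1, 0.9],   # French tag
    "classical": [0.12, 0.88],
}

def tag_vector(tag):
    """Compositionality by averaging: a multi-word tag's vector is the
    mean of its word vectors."""
    words = [EMB[w] for w in tag.split() if w in EMB]
    return [sum(c) / len(words) for c in zip(*words)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Cross-lingual translation = nearest tag in the target vocabulary.
src = tag_vector("classique")
targets = ["hip hop", "classical"]
best = max(targets, key=lambda t: cosine(src, tag_vector(t)))
```

Because no tag pair is ever observed together, this nearest-neighbour translation needs no parallel corpus, which is the property the abstract highlights.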

* 21st International Society for Music Information Retrieval Conference (ISMIR 2020) 

Distributed Learning of Deep Neural Networks using Independent Subnet Training

Oct 04, 2019
Binhang Yuan, Anastasios Kyrillidis, Christopher M. Jermaine

Stochastic gradient descent (SGD) is the method of choice for distributed machine learning, by virtue of its light per-iteration complexity on compute nodes, leading to almost linear speedups in theory. Nevertheless, such speedups are rarely observed in practice, due to high communication overheads during synchronization steps. We alleviate this problem by introducing independent subnet training: a simple, jointly model-parallel and data-parallel approach to distributed training for fully connected, feed-forward neural networks. During subnet training, neurons are stochastically partitioned without replacement, and each partition is sent to a single worker. This reduces the overall synchronization overhead, as each worker receives only the weights associated with the subnetwork it has been assigned. Subnet training also reduces synchronization frequency: since workers train disjoint portions of the network, training can proceed for long periods of time before synchronization, similar to local SGD approaches. We empirically evaluate our approach on real-world speech recognition and product recommendation applications, where we observe that subnet training i) accelerates training compared to state-of-the-art distributed models, and ii) often boosts test accuracy, as it implicitly combines dropout and batch normalization regularizations during training.
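The partition-without-replacement step can be sketched as follows (layer sizes and worker count are arbitrary, and real implementations shard every layer, not just one weight matrix):

```python
import random

def partition_neurons(n_neurons, n_workers, seed=0):
    """Stochastically partition neuron indices without replacement:
    each neuron is assigned to exactly one worker."""
    rng = random.Random(seed)
    idx = list(range(n_neurons))
    rng.shuffle(idx)
    size = -(-n_neurons // n_workers)   # ceiling division
    return [idx[i:i + size] for i in range(0, n_neurons, size)]

def subnet_weights(W, neuron_idx):
    """The only columns (outgoing units) a worker needs for its subnet."""
    return [[row[j] for j in neuron_idx] for row in W]

# Toy 4 -> 6 fully connected layer, split across 3 workers.
W = [[0.01 * (i * 6 + j) for j in range(6)] for i in range(4)]
parts = partition_neurons(6, n_workers=3)
shards = [subnet_weights(W, p) for p in parts]
```

Each worker trains only its shard between synchronizations, so both the message size and the synchronization frequency drop, which is the source of the speedups the abstract reports.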

The Power of Communities: A Text Classification Model with Automated Labeling Process Using Network Community Detection

Sep 25, 2019
Minjun Kim, Hiroki Sayama

Text classification is one of the most critical areas in machine learning and artificial intelligence research. It has been actively adopted in many business applications such as conversational intelligence systems, news article categorization, sentiment analysis, emotion detection systems, and many other recommendation systems in our daily life. One problem in supervised text classification models is that their performance depends heavily on the quality of data labeling, which is typically done by humans. In this study, we propose a new network community detection-based approach to automatically label and classify text data into multiclass value spaces. Specifically, we build a network with sentences as the network nodes and pairwise cosine similarities between TF-IDF vector representations of the sentences as the network link weights. We use the Louvain method to detect the communities in the sentence network. We train and test support vector machine and random forest models on both the human-labeled data and the network community detection-labeled data. Results show that models trained on the data labeled by network community detection outperformed models trained on the human-labeled data by 2.68-3.75% in classification accuracy. Our method may support the development of more accurate conversational intelligence systems and other text classification systems.
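A stdlib-only sketch of the pipeline; note two simplifications: the paper uses the Louvain method, which is replaced here by thresholded connected components, and the TF-IDF weighting is a basic variant:

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Sparse TF-IDF vectors (dicts) with a simple tf * (log(n/df) + 1)
    weighting."""
    docs = [s.lower().split() for s in sentences]
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    return [{t: c * (math.log(n / df[t]) + 1) for t, c in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv)

def communities(sentences, threshold=0.3):
    """Label each sentence by its connected component in the similarity
    network (a stand-in for Louvain community detection)."""
    vecs = tfidf_vectors(sentences)
    n = len(vecs)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vecs[i], vecs[j]) >= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

sents = ["cats chase mice", "mice fear cats",
         "stocks fell sharply", "markets and stocks fell"]
labels = communities(sents)
```

On this toy corpus the two animal sentences and the two finance sentences land in separate components, and those component ids become the automatic class labels.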

* 14 pages, 6 figures, 1 table. Submitted for NetSci-X 2020 Tokyo 

Graph Embeddings at Scale

Jul 03, 2019
C. Bayan Bruss, Anish Khazane, Jonathan Rider, Richard Serpe, Saurabh Nagrecha, Keegan E. Hines

Graph embedding is a popular algorithmic approach for creating vector representations for individual vertices in networks. Training these algorithms at scale is important for creating embeddings that can be used for classification, ranking, recommendation and other common applications in industry. While industrial systems exist for training graph embeddings on large datasets, many of these distributed architectures are forced to partition copious amounts of data and model logic across many worker nodes. In this paper, we propose a distributed infrastructure that completely avoids graph partitioning, dynamically creates size constrained computational graphs across worker nodes, and uses highly efficient indexing operations for updating embeddings that allow the system to function at scale. We show that our system can scale an existing embeddings algorithm - skip-gram - to train on the open-source Friendster network (68 million vertices) and on an internal heterogeneous graph (50 million vertices). We measure the performance of our system on two key quantitative metrics: link-prediction accuracy and rate of convergence. We conclude this work by analyzing how a greater number of worker nodes actually improves our system's performance on the aforementioned metrics and discuss our next steps for rigorously evaluating the embedding vectors produced by our system.
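The row-indexed update at the heart of such systems can be sketched for skip-gram with negative sampling (dimensions, learning rate, and the toy pairs are illustrative; production systems batch these updates across workers):

```python
import math
import random

# One skip-gram-with-negative-sampling step reads and writes only the
# embedding rows appearing in the batch, so a worker never needs to
# materialise the full embedding table. All sizes here are toy values.
rng = random.Random(0)
DIM = 8
emb = {v: [rng.gauss(0, 0.1) for _ in range(DIM)] for v in range(100)}
ctx = {v: [rng.gauss(0, 0.1) for _ in range(DIM)] for v in range(100)}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(center, context):
    return sigmoid(sum(a * b for a, b in zip(emb[center], ctx[context])))

def sgns_update(center, context, label, lr=0.1):
    """Gradient step for one (center, context) pair with label 1.0
    (observed edge) or 0.0 (negative sample); exactly two rows change."""
    u, v = emb[center], ctx[context]
    g = lr * (label - sigmoid(sum(a * b for a, b in zip(u, v))))
    emb[center] = [a + g * b for a, b in zip(u, v)]
    ctx[context] = [b + g * a for a, b in zip(u, v)]

before = score(1, 5)
sgns_update(1, 5, label=1.0)
after = score(1, 5)            # the positive pair moved closer
sgns_update(1, 7, label=0.0)   # a negative sample pushes (1, 7) apart
```

Because each step touches only the rows it indexes, updates for disjoint vertex sets never conflict, which is what lets the system scale without partitioning the graph itself.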

* Workshop on Mining and Learning with Graphs 2019 

Interpreting OWL Complex Classes in AutomationML based on Bidirectional Translation

Jun 04, 2019
Yingbing Hua, Björn Hein

The World Wide Web Consortium (W3C) has published several recommendations for building and storing ontologies, including the most recent OWL 2 Web Ontology Language (OWL). These initiatives have been followed by practical implementations that popularize OWL in various domains. For example, OWL has been used for conceptual modeling in industrial engineering, and its reasoning facilities are used to provide a wealth of services, e.g. model diagnosis, automated code generation, and semantic integration. More specifically, recent studies have shown that OWL is well suited for harmonizing information of engineering tools stored as AutomationML (AML) files. However, OWL and its tools can be cumbersome for direct use by engineers such that an ontology expert is often required in practice. Although much attention has been paid in the literature to overcome this issue by transforming OWL ontologies from/to AML models automatically, dealing with OWL complex classes remains an open research question. In this paper, we introduce the AML concept models for representing OWL complex classes in AutomationML, and present algorithms for the bidirectional translation between OWL complex classes and their corresponding AML concept models. We show that this approach provides an efficient and intuitive interface for nonexperts to visualize, modify, and create OWL complex classes.

* As accepted to IEEE 24th International Conference on Emerging Technologies and Factory Automation (ETFA 2019) 
