
Vincent Renkens


MARRS: Multimodal Reference Resolution System

Nov 03, 2023
Halim Cagri Ates, Shruti Bhargava, Site Li, Jiarui Lu, Siddhardha Maddula, Joel Ruben Antony Moniz, Anil Kumar Nalamalapu, Roman Hoang Nguyen, Melis Ozyildirim, Alkesh Patel, Dhivya Piraviperumal, Vincent Renkens, Ankit Samal, Thy Tran, Bo-Hsiang Tseng, Hong Yu, Yuan Zhang, Rong Zou

Successfully handling context is essential for any dialog understanding task. This context may be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system, responsible for handling conversational, visual, and background context. In particular, we present different machine learning models for handling contextual queries: one for reference resolution, and one for handling context via query rewriting. We also describe how these models complement each other to form a unified, coherent, lightweight system that can understand context while preserving user privacy.

* Sixth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2023) 
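The abstract gives no implementation details, but the split between a reference resolver and a query rewriter can be illustrated with a toy sketch. The names below (ContextSignals, ReferenceResolver, QueryRewriter) and the heuristics they use are hypothetical, invented only for illustration, and are not the MARRS API.

```python
# Hypothetical sketch: two contextual models behind one lightweight,
# on-device understanding entry point. Not the actual MARRS interface.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ContextSignals:
    """Conversational, visual, and background context available on device."""
    previous_queries: list[str] = field(default_factory=list)
    screen_entities: list[str] = field(default_factory=list)    # e.g. text extracted from the screen
    background_events: list[str] = field(default_factory=list)  # e.g. "alarm_ringing"


class ReferenceResolver:
    """Maps referring expressions ("that number", "this address") to context entities."""
    def resolve(self, query: str, context: ContextSignals) -> Optional[str]:
        if "that" in query and context.screen_entities:
            return context.screen_entities[-1]  # toy heuristic: most salient on-screen entity
        return None


class QueryRewriter:
    """Rewrites a context-dependent query into a self-contained one."""
    def rewrite(self, query: str, context: ContextSignals) -> str:
        if context.previous_queries and query.lower().startswith("what about"):
            topic = query[len("what about"):].strip(" ?")
            # Toy rewrite: swap the new topic into the previous query.
            return context.previous_queries[-1].replace("Paris", topic)
        return query


def understand(query: str, context: ContextSignals) -> str:
    """Try reference resolution first; fall back to query rewriting."""
    referent = ReferenceResolver().resolve(query, context)
    if referent is not None:
        return f"{query} [referent -> {referent}]"
    return QueryRewriter().rewrite(query, context)


if __name__ == "__main__":
    ctx = ContextSignals(previous_queries=["What's the weather in Paris?"],
                         screen_entities=["+1 555 0100"])
    print(understand("call that number", ctx))   # resolves the on-screen phone number
    print(understand("what about London?", ctx)) # rewrites using conversational context
```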

Referring to Screen Texts with Voice Assistants

Jun 10, 2023
Shruti Bhargava, Anand Dhoot, Ing-Marie Jonsson, Hoang Long Nguyen, Alkesh Patel, Hong Yu, Vincent Renkens

Voice assistants help users make phone calls, send messages, create events, navigate, and do a lot more. However, assistants have limited capacity to understand their users' context. In this work, we aim to take a step in this direction. Our work dives into a new experience for users to refer to phone numbers, addresses, email addresses, URLs, and dates on their phone screens. Our focus lies in reference understanding, which becomes particularly interesting when multiple similar texts are present on screen, similar to visual grounding. We collect a dataset and propose a lightweight general-purpose model for this novel experience. Due to the high cost of consuming pixels directly, our system is designed to rely on the extracted text from the UI. Our model is modular, thus offering flexibility, improved interpretability, and efficient runtime memory utilization.

* 7 pages, Accepted to ACL Industry Track 2023 
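As a rough illustration of reference understanding over extracted screen text (not the paper's model), the sketch below matches a spoken query against typed text entities pulled from the UI. The entity types, ordinal heuristic, and keyword lists are assumptions made for this example.

```python
# Illustrative sketch: resolving a spoken reference against entities
# extracted from the screen's text (no pixel input), with ordinal
# disambiguation when several similar texts are on screen.
import re
from dataclasses import dataclass


@dataclass
class ScreenEntity:
    kind: str   # "phone", "address", "email", "url", or "date"
    text: str
    order: int  # top-to-bottom position on screen


ORDINALS = {"first": 0, "second": 1, "third": 2, "last": -1}

KIND_KEYWORDS = {
    "phone": ["number", "phone"],
    "address": ["address"],
    "email": ["email"],
    "url": ["link", "url", "website"],
    "date": ["date", "day"],
}


def resolve_reference(query: str, entities: list[ScreenEntity]) -> ScreenEntity | None:
    """Pick the screen entity a query like 'call the second number' refers to."""
    query = query.lower()
    # 1) Filter candidates by the entity type the query mentions.
    candidates = entities
    for kind, keywords in KIND_KEYWORDS.items():
        if any(k in query for k in keywords):
            candidates = [e for e in entities if e.kind == kind]
            break
    if not candidates:
        return None
    # 2) Disambiguate with an ordinal if several similar texts are present.
    candidates = sorted(candidates, key=lambda e: e.order)
    for word, index in ORDINALS.items():
        if re.search(rf"\b{word}\b", query):
            return candidates[index] if -1 <= index < len(candidates) else None
    return candidates[0]  # default: the first matching entity


if __name__ == "__main__":
    screen = [
        ScreenEntity("phone", "+1 555 0100", 0),
        ScreenEntity("phone", "+1 555 0199", 1),
        ScreenEntity("url", "https://example.com", 2),
    ]
    print(resolve_reference("call the second number", screen).text)  # +1 555 0199
```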

MMIU: Dataset for Visual Intent Understanding in Multimodal Assistants

Oct 31, 2021
Alkesh Patel, Joel Ruben Antony Moniz, Roman Nguyen, Nick Tzou, Hadas Kotek, Vincent Renkens

In a multimodal assistant, where vision is also one of the input modalities, identifying user intent becomes challenging, as visual input can influence the outcome. Current digital assistants take spoken input and try to determine the user intent from conversational or device context. Thus, a dataset that includes visual input (i.e., images or videos) for the corresponding questions targeted at multimodal assistant use cases is not readily available. The research in visual question answering (VQA) and visual question generation (VQG) is a great step forward. However, it does not capture the questions that a visually-abled person would ask multimodal assistants. Moreover, many of these questions do not seek information from external knowledge. In this paper, we provide a new dataset, MMIU (MultiModal Intent Understanding), that contains questions and corresponding intents provided by human annotators while looking at images. We then use this dataset for the intent classification task in a multimodal digital assistant. We also experiment with various approaches for combining vision and language features, including the use of a multimodal transformer, to classify image-question pairs into 14 intents. We provide benchmark results and discuss the role of visual and text features in the intent classification task on our dataset.

* Extended abstract accepted for WeCNLP 2021 
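A minimal sketch of one way to combine vision and language features for the 14-way intent classification described above. This late-fusion head is illustrative only, not the paper's architecture; the feature dimensions and layer sizes are assumptions.

```python
# Minimal late-fusion sketch for image-question intent classification.
# Precomputed image and text features are concatenated and classified
# into 14 intents, matching the setup described in the abstract.
import torch
import torch.nn as nn


class LateFusionIntentClassifier(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, hidden=512, num_intents=14):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden),  # concatenate vision + language features
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, num_intents),           # one logit per intent class
        )

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([image_feats, text_feats], dim=-1))


if __name__ == "__main__":
    model = LateFusionIntentClassifier()
    # Stand-ins for features from an image encoder and a text encoder.
    image_feats = torch.randn(4, 2048)
    text_feats = torch.randn(4, 768)
    logits = model(image_feats, text_feats)
    print(logits.shape)  # torch.Size([4, 14])
```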

User-Initiated Repetition-Based Recovery in Multi-Utterance Dialogue Systems

Aug 02, 2021
Hoang Long Nguyen, Vincent Renkens, Joris Pelemans, Srividya Pranavi Potharaju, Anil Kumar Nalamalapu, Murat Akbacak

Recognition errors are common in human communication. Similar errors often lead to unwanted behaviour in dialogue systems or virtual assistants. In human communication, we can recover from them by repeating misrecognized words or phrases; however, in human-machine communication this recovery mechanism is not available. In this paper, we attempt to bridge this gap and present a system that allows a user to correct speech recognition errors in a virtual assistant by repeating misunderstood words. When a user repeats part of the phrase, the system rewrites the original query to incorporate the correction. This rewrite allows the virtual assistant to understand the original query successfully. We present an end-to-end 2-step attention pointer network that generates the rewritten query by merging the incorrectly understood utterance with the correction follow-up. We evaluate the model on data collected for this task and compare the proposed model to a rule-based baseline and a standard pointer network. We show that rewriting the original query is an effective way to handle repetition-based recovery and that the proposed model outperforms the rule-based baseline, reducing Word Error Rate by 19% relative at a 2% False Alarm Rate on annotated data.

* Will be published in Interspeech 2021 
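The pointer-network model itself is not reproduced here, but the repetition-based rewrite idea, in the spirit of the rule-based baseline the paper compares against, can be sketched: fuzzily align the correction follow-up against the original query and substitute the best-matching span. The matching scheme and threshold below are assumptions for illustration.

```python
# Toy rule-based rewrite: slide the correction over the original query,
# find the span it most resembles, and replace that span.
from difflib import SequenceMatcher


def rewrite_with_repetition(original: str, correction: str) -> str:
    """Replace the span of `original` that best (fuzzily) matches `correction`."""
    orig_tokens = original.split()
    corr_tokens = correction.split()
    n = len(corr_tokens)
    best_score, best_start = 0.0, None
    # Slide a window the size of the correction over the original query.
    for start in range(len(orig_tokens) - n + 1):
        window = " ".join(orig_tokens[start:start + n])
        score = SequenceMatcher(None, window.lower(), correction.lower()).ratio()
        if score > best_score:
            best_score, best_start = score, start
    if best_start is None or best_score < 0.5:  # no plausible repetition found
        return original
    merged = orig_tokens[:best_start] + corr_tokens + orig_tokens[best_start + n:]
    return " ".join(merged)


if __name__ == "__main__":
    # ASR heard "Senna" instead of "Sina"; the user repeats the misrecognized name.
    print(rewrite_with_repetition("Call Senna on her mobile", "Sina"))
    # -> "Call Sina on her mobile"
```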

State Gradients for RNN Memory Analysis

Jun 18, 2018
Lyan Verwimp, Hugo Van hamme, Vincent Renkens, Patrick Wambacq

We present a framework for analyzing what the state in RNNs remembers from its input embeddings. Our approach is inspired by backpropagation, in the sense that we compute the gradients of the states with respect to the input embeddings. The gradient matrix is decomposed with Singular Value Decomposition to analyze which directions in the embedding space are best transferred to the hidden state space, characterized by the largest singular values. We apply our approach to LSTM language models and investigate to what extent and for how long certain classes of words are remembered on average for a certain corpus. Additionally, the extent to which a specific property or relationship is remembered by the RNN can be tracked by comparing a vector characterizing that property with the direction(s) in embedding space that are best preserved in hidden state space.

* Accepted for Interspeech 2018 
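The state-gradient analysis lends itself to a short sketch: compute the Jacobian of a later hidden state with respect to an earlier input embedding and inspect its singular value decomposition. The toy LSTM, dimensions, and delay below are illustrative, not the paper's experimental setup.

```python
# Sketch of state-gradient analysis: Jacobian of the final LSTM hidden state
# with respect to the embedding at an earlier time step, followed by SVD.
import torch

torch.manual_seed(0)

emb_dim, hidden_dim, seq_len = 16, 32, 10
embedding = torch.nn.Embedding(100, emb_dim)
lstm = torch.nn.LSTM(emb_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, 100, (1, seq_len))
embs = embedding(tokens).detach().requires_grad_(True)  # track grads w.r.t. embeddings

output, (h_n, _) = lstm(embs)
state = h_n[-1, 0]  # final hidden state, shape (hidden_dim,)

# Jacobian d(state)/d(embedding at position t): one backward pass per state unit.
t = 2  # how far back in the sequence the measured embedding sits
jacobian = torch.stack([
    torch.autograd.grad(state[i], embs, retain_graph=True)[0][0, t]
    for i in range(hidden_dim)
])  # shape (hidden_dim, emb_dim)

# Large singular values mark embedding-space directions (right singular vectors)
# that are best transferred into the hidden-state space after this delay.
U, S, Vh = torch.linalg.svd(jacobian)
print(S[:5])
```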