Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ahmed Aly

Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models

Jan 07, 2025

Malak Mansour, Ahmed Aly, Bahey Tharwat, Sarim Hashmi, Dong An, Ian Reid

Figure 1 for Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models

Figure 2 for Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models

Figure 3 for Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models

Figure 4 for Language and Planning in Robotic Navigation: A Multilingual Evaluation of State-of-the-Art Models

Abstract:Large Language Models (LLMs) such as GPT-4, trained on huge amount of datasets spanning multiple domains, exhibit significant reasoning, understanding, and planning capabilities across various tasks. This study presents the first-ever work in Arabic language integration within the Vision-and-Language Navigation (VLN) domain in robotics, an area that has been notably underexplored in existing research. We perform a comprehensive evaluation of state-of-the-art multi-lingual Small Language Models (SLMs), including GPT-4o mini, Llama 3 8B, and Phi-3 medium 14B, alongside the Arabic-centric LLM, Jais. Our approach utilizes the NavGPT framework, a pure LLM-based instruction-following navigation agent, to assess the impact of language on navigation reasoning through zero-shot sequential action prediction using the R2R dataset. Through comprehensive experiments, we demonstrate that our framework is capable of high-level planning for navigation tasks when provided with instructions in both English and Arabic. However, certain models struggled with reasoning and planning in the Arabic language due to inherent limitations in their capabilities, sub-optimal performance, and parsing issues. These findings highlight the importance of enhancing planning and reasoning capabilities in language models for effective navigation, emphasizing this as a key area for further development while also unlocking the potential of Arabic-language models for impactful real-world applications.

Via

Access Paper or Ask Questions

CoDi: Conversational Distillation for Grounded Question Answering

Aug 20, 2024

Patrick Huber, Arash Einolghozati, Rylan Conway, Kanika Narang, Matt Smith, Waqar Nayyar, Adithya Sagar, Ahmed Aly, Akshat Shrivastava

Figure 1 for CoDi: Conversational Distillation for Grounded Question Answering

Figure 2 for CoDi: Conversational Distillation for Grounded Question Answering

Figure 3 for CoDi: Conversational Distillation for Grounded Question Answering

Figure 4 for CoDi: Conversational Distillation for Grounded Question Answering

Abstract:Distilling conversational skills into Small Language Models (SLMs) with approximately 1 billion parameters presents significant challenges. Firstly, SLMs have limited capacity in their model parameters to learn extensive knowledge compared to larger models. Secondly, high-quality conversational datasets are often scarce, small, and domain-specific. Addressing these challenges, we introduce a novel data distillation framework named CoDi (short for Conversational Distillation, pronounced "Cody"), allowing us to synthesize large-scale, assistant-style datasets in a steerable and diverse manner. Specifically, while our framework is task agnostic at its core, we explore and evaluate the potential of CoDi on the task of conversational grounded reasoning for question answering. This is a typical on-device scenario for specialist SLMs, allowing for open-domain model responses, without requiring the model to "memorize" world knowledge in its limited weights. Our evaluations show that SLMs trained with CoDi-synthesized data achieve performance comparable to models trained on human-annotated data in standard metrics. Additionally, when using our framework to generate larger datasets from web data, our models surpass larger, instruction-tuned models in zero-shot conversational grounded reasoning tasks.

* 13 pages

Via

Access Paper or Ask Questions

PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding

Jun 12, 2024

Trang Le, Daniel Lazar, Suyoun Kim, Shan Jiang, Duc Le, Adithya Sagar, Aleksandr Livshits, Ahmed Aly, Akshat Shrivastava

Figure 1 for PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding

Figure 2 for PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding

Figure 3 for PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding

Figure 4 for PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding

Abstract:Spoken Language Understanding (SLU) is a critical component of voice assistants; it consists of converting speech to semantic parses for task execution. Previous works have explored end-to-end models to improve the quality and robustness of SLU models with Deliberation, however these models have remained autoregressive, resulting in higher latencies. In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. We show that PRoDeliberation achieves the latency reduction of parallel decoding (2-10x improvement over autoregressive models) while retaining the ability to correct Automatic Speech Recognition (ASR) mistranscriptions of autoregressive deliberation systems. We further show that the design of the denoising training allows PRoDeliberation to overcome the limitations of small ASR devices, and we provide analysis on the necessity of each component of the system.

Via

Access Paper or Ask Questions

Small But Funny: A Feedback-Driven Approach to Humor Distillation

Feb 28, 2024

Sahithya Ravi, Patrick Huber, Akshat Shrivastava, Aditya Sagar, Ahmed Aly, Vered Shwartz, Arash Einolghozati

Figure 1 for Small But Funny: A Feedback-Driven Approach to Humor Distillation

Figure 2 for Small But Funny: A Feedback-Driven Approach to Humor Distillation

Figure 3 for Small But Funny: A Feedback-Driven Approach to Humor Distillation

Figure 4 for Small But Funny: A Feedback-Driven Approach to Humor Distillation

Abstract:The emergence of Large Language Models (LLMs) has brought to light promising language generation capabilities, particularly in performing tasks like complex reasoning and creative writing. Consequently, distillation through imitation of teacher responses has emerged as a popular technique to transfer knowledge from LLMs to more accessible, Small Language Models (SLMs). While this works well for simpler tasks, there is a substantial performance gap on tasks requiring intricate language comprehension and creativity, such as humor generation. We hypothesize that this gap may stem from the fact that creative tasks might be hard to learn by imitation alone and explore whether an approach, involving supplementary guidance from the teacher, could yield higher performance. To address this, we study the effect of assigning a dual role to the LLM - as a "teacher" generating data, as well as a "critic" evaluating the student's performance. Our experiments on humor generation reveal that the incorporation of feedback significantly narrows the performance gap between SLMs and their larger counterparts compared to merely relying on imitation. As a result, our research highlights the potential of using feedback as an additional dimension to data when transferring complex language abilities via distillation.

Via

Access Paper or Ask Questions

Retrieve-and-Fill for Scenario-based Task-Oriented Semantic Parsing

Feb 02, 2022

Akshat Shrivastava, Shrey Desai, Anchit Gupta, Ali Elkahky, Aleksandr Livshits, Alexander Zotov, Ahmed Aly

Figure 1 for Retrieve-and-Fill for Scenario-based Task-Oriented Semantic Parsing

Figure 2 for Retrieve-and-Fill for Scenario-based Task-Oriented Semantic Parsing

Figure 3 for Retrieve-and-Fill for Scenario-based Task-Oriented Semantic Parsing

Figure 4 for Retrieve-and-Fill for Scenario-based Task-Oriented Semantic Parsing

Abstract:Task-oriented semantic parsing models have achieved strong results in recent years, but unfortunately do not strike an appealing balance between model size, runtime latency, and cross-domain generalizability. We tackle this problem by introducing scenario-based semantic parsing: a variant of the original task which first requires disambiguating an utterance's "scenario" (an intent-slot template with variable leaf spans) before generating its frame, complete with ontology and utterance tokens. This formulation enables us to isolate coarse-grained and fine-grained aspects of the task, each of which we solve with off-the-shelf neural modules, also optimizing for the axes outlined above. Concretely, we create a Retrieve-and-Fill (RAF) architecture comprised of (1) a retrieval module which ranks the best scenario given an utterance and (2) a filling module which imputes spans into the scenario to create the frame. Our model is modular, differentiable, interpretable, and allows us to garner extra supervision from scenarios. RAF achieves strong results in high-resource, low-resource, and multilingual settings, outperforming recent approaches by wide margins despite, using base pre-trained encoders, small sequence lengths, and parallel decoding.

Via

Access Paper or Ask Questions

AutoNLU: Detecting, root-causing, and fixing NLU model errors

Oct 12, 2021

Pooja Sethi, Denis Savenkov, Forough Arabshahi, Jack Goetz, Micaela Tolliver, Nicolas Scheffer, Ilknur Kabul, Yue Liu, Ahmed Aly

Figure 1 for AutoNLU: Detecting, root-causing, and fixing NLU model errors

Figure 2 for AutoNLU: Detecting, root-causing, and fixing NLU model errors

Figure 3 for AutoNLU: Detecting, root-causing, and fixing NLU model errors

Figure 4 for AutoNLU: Detecting, root-causing, and fixing NLU model errors

Abstract:Improving the quality of Natural Language Understanding (NLU) models, and more specifically, task-oriented semantic parsing models, in production is a cumbersome task. In this work, we present a system called AutoNLU, which we designed to scale the NLU quality improvement process. It adds automation to three key steps: detection, attribution, and correction of model errors, i.e., bugs. We detected four times more failed tasks than with random sampling, finding that even a simple active learning sampling method on an uncalibrated model is surprisingly effective for this purpose. The AutoNLU tool empowered linguists to fix ten times more semantic parsing bugs than with prior manual processes, auto-correcting 65% of all identified bugs.

* 8 pages, 5 figures

Via

Access Paper or Ask Questions

Assessing Data Efficiency in Task-Oriented Semantic Parsing

Jul 10, 2021

Shrey Desai, Akshat Shrivastava, Justin Rill, Brian Moran, Safiyyah Saleem, Alexander Zotov, Ahmed Aly

Figure 1 for Assessing Data Efficiency in Task-Oriented Semantic Parsing

Figure 2 for Assessing Data Efficiency in Task-Oriented Semantic Parsing

Figure 3 for Assessing Data Efficiency in Task-Oriented Semantic Parsing

Figure 4 for Assessing Data Efficiency in Task-Oriented Semantic Parsing

Abstract:Data efficiency, despite being an attractive characteristic, is often challenging to measure and optimize for in task-oriented semantic parsing; unlike exact match, it can require both model- and domain-specific setups, which have, historically, varied widely across experiments. In our work, as a step towards providing a unified solution to data-efficiency-related questions, we introduce a four-stage protocol which gives an approximate measure of how much in-domain, "target" data a parser requires to achieve a certain quality bar. Specifically, our protocol consists of (1) sampling target subsets of different cardinalities, (2) fine-tuning parsers on each subset, (3) obtaining a smooth curve relating target subset (%) vs. exact match (%), and (4) referencing the curve to mine ad-hoc (target subset, exact match) points. We apply our protocol in two real-world case studies -- model generalizability and intent complexity -- illustrating its flexibility and applicability to practitioners in task-oriented semantic parsing.

Via

Access Paper or Ask Questions

Latency-Aware Neural Architecture Search with Multi-Objective Bayesian Optimization

Jun 25, 2021

David Eriksson, Pierce I-Jen Chuang, Samuel Daulton, Peng Xia, Akshat Shrivastava, Arun Babu, Shicong Zhao, Ahmed Aly, Ganesh Venkatesh, Maximilian Balandat

Figure 1 for Latency-Aware Neural Architecture Search with Multi-Objective Bayesian Optimization

Figure 2 for Latency-Aware Neural Architecture Search with Multi-Objective Bayesian Optimization

Figure 3 for Latency-Aware Neural Architecture Search with Multi-Objective Bayesian Optimization

Figure 4 for Latency-Aware Neural Architecture Search with Multi-Objective Bayesian Optimization

Abstract:When tuning the architecture and hyperparameters of large machine learning models for on-device deployment, it is desirable to understand the optimal trade-offs between on-device latency and model accuracy. In this work, we leverage recent methodological advances in Bayesian optimization over high-dimensional search spaces and multi-objective Bayesian optimization to efficiently explore these trade-offs for a production-scale on-device natural language understanding model at Facebook.

* To Appear at the 8th ICML Workshop on Automated Machine Learning, ICML 2021

Via

Access Paper or Ask Questions

Diagnosing Transformers in Task-Oriented Semantic Parsing

May 27, 2021

Shrey Desai, Ahmed Aly

Figure 1 for Diagnosing Transformers in Task-Oriented Semantic Parsing

Figure 2 for Diagnosing Transformers in Task-Oriented Semantic Parsing

Figure 3 for Diagnosing Transformers in Task-Oriented Semantic Parsing

Figure 4 for Diagnosing Transformers in Task-Oriented Semantic Parsing

Abstract:Modern task-oriented semantic parsing approaches typically use seq2seq transformers to map textual utterances to semantic frames comprised of intents and slots. While these models are empirically strong, their specific strengths and weaknesses have largely remained unexplored. In this work, we study BART and XLM-R, two state-of-the-art parsers, across both monolingual and multilingual settings. Our experiments yield several key results: transformer-based parsers struggle not only with disambiguating intents/slots, but surprisingly also with producing syntactically-valid frames. Though pre-training imbues transformers with syntactic inductive biases, we find the ambiguity of copying utterance spans into frames often leads to tree invalidity, indicating span extraction is a major bottleneck for current parsers. However, as a silver lining, we show transformer-based parsers give sufficient indicators for whether a frame is likely to be correct or incorrect, making them easier to deploy in production settings.

* Accepted to Findings of ACL 2021

Via

Access Paper or Ask Questions

Span Pointer Networks for Non-Autoregressive Task-Oriented Semantic Parsing

Apr 16, 2021

Akshat Shrivastava, Pierce Chuang, Arun Babu, Shrey Desai, Abhinav Arora, Alexander Zotov, Ahmed Aly

Figure 1 for Span Pointer Networks for Non-Autoregressive Task-Oriented Semantic Parsing

Figure 2 for Span Pointer Networks for Non-Autoregressive Task-Oriented Semantic Parsing

Figure 3 for Span Pointer Networks for Non-Autoregressive Task-Oriented Semantic Parsing

Figure 4 for Span Pointer Networks for Non-Autoregressive Task-Oriented Semantic Parsing

Abstract:An effective recipe for building seq2seq, non-autoregressive, task-oriented parsers to map utterances to semantic frames proceeds in three steps: encoding an utterance $x$, predicting a frame's length |y|, and decoding a |y|-sized frame with utterance and ontology tokens. Though empirically strong, these models are typically bottlenecked by length prediction, as even small inaccuracies change the syntactic and semantic characteristics of resulting frames. In our work, we propose span pointer networks, non-autoregressive parsers which shift the decoding task from text generation to span prediction; that is, when imputing utterance spans into frame slots, our model produces endpoints (e.g., [i, j]) as opposed to text (e.g., "6pm"). This natural quantization of the output space reduces the variability of gold frames, therefore improving length prediction and, ultimately, exact match. Furthermore, length prediction is now responsible for frame syntax and the decoder is responsible for frame semantics, resulting in a coarse-to-fine model. We evaluate our approach on several task-oriented semantic parsing datasets. Notably, we bridge the quality gap between non-autogressive and autoregressive parsers, achieving 87 EM on TOPv2 (Chen et al. 2020). Furthermore, due to our more consistent gold frames, we show strong improvements in model generalization in both cross-domain and cross-lingual transfer in low-resource settings. Finally, due to our diminished output vocabulary, we observe 70% reduction in latency and 83% reduction in memory at beam size 5 compared to prior non-autoregressive parsers.

Via

Access Paper or Ask Questions