Parisa Kordjamshidi

SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models

Oct 04, 2024

Yue Zhang, Zhiyang Xu, Ying Shen, Parisa Kordjamshidi, Lifu Huang

Abstract:Integrating the 3D world into large language models (3D-based LLMs) has been a promising research direction for 3D scene understanding. However, current 3D-based LLMs fall short in situated understanding due to two key limitations: 1) existing 3D datasets are constructed from a global perspective of the 3D scenes and lack situated context; 2) the architectures of existing 3D-based LLMs lack explicit alignment between the spatial representations of 3D scenes and natural language, limiting their performance in tasks requiring precise spatial reasoning. We address these issues by introducing a scalable situated 3D dataset, named Spartun3D, that incorporates various situated spatial reasoning tasks. Furthermore, we propose Spartun3D-LLM, built on an existing 3D-based LLM but integrated with a novel situated spatial alignment module, aiming to enhance the alignment between 3D visual representations and their corresponding textual descriptions. Experimental results demonstrate that both our proposed dataset and alignment module significantly enhance the situated spatial understanding of 3D-based LLMs.


Learning vs Retrieval: The Role of In-Context Examples in Regression with LLMs

Sep 06, 2024

Aliakbar Nafar, Kristen Brent Venable, Parisa Kordjamshidi

Abstract:Generative Large Language Models (LLMs) are capable of being in-context learners. However, the underlying mechanism of in-context learning (ICL) is still a major research question, and experimental research results about how models exploit ICL are not always consistent. In this work, we focus on regression tasks and propose a framework for evaluating in-context learning mechanisms, which we claim are a combination of retrieving internal knowledge and learning from in-context examples. First, we show that LLMs can perform regression on real-world datasets and then design experiments to measure the extent to which the LLM retrieves its internal knowledge versus learning from in-context examples. We argue that this process lies on a spectrum between these two extremes. We provide an in-depth analysis of the degrees to which these mechanisms are triggered depending on various factors, such as prior knowledge about the tasks and the type and richness of the information provided by the in-context examples. We employ three LLMs and utilize multiple datasets to corroborate the robustness of our findings. Our results shed light on how to engineer prompts to leverage meta-learning from in-context examples and foster knowledge retrieval depending on the problem being addressed.


Narrowing the Gap between Vision and Action in Navigation

Aug 19, 2024

Yue Zhang, Parisa Kordjamshidi

Abstract:The existing methods for Vision and Language Navigation in the Continuous Environment (VLN-CE) commonly incorporate a waypoint predictor to discretize the environment. This simplifies the navigation actions into a view selection task and improves navigation performance significantly compared to direct training using low-level actions. However, the VLN-CE agents are still far from the real robots since there are gaps between their visual perception and executed actions. First, VLN-CE agents that discretize the visual environment are primarily trained with high-level view selection, which causes them to ignore crucial spatial reasoning within the low-level action movements. Second, in these models, the existing waypoint predictors neglect object semantics and their attributes related to passability, which can be informative in indicating the feasibility of actions. To address these two issues, we introduce a low-level action decoder jointly trained with high-level action prediction, enabling the current VLN agent to learn and ground the selected visual view to the low-level controls. Moreover, we enhance the current waypoint predictor by utilizing visual representations containing rich semantic information and explicitly masking obstacles based on humans' prior knowledge about the feasibility of actions. Empirically, our agent can improve navigation performance metrics compared to the strong baselines on both high-level and low-level actions.


Prompt2DeModel: Declarative Neuro-Symbolic Modeling with Natural Language

Jul 30, 2024

Hossein Rajaby Faghihi, Aliakbar Nafar, Andrzej Uszok, Hamid Karimian, Parisa Kordjamshidi

Abstract:This paper presents a conversational pipeline for crafting domain knowledge for complex neuro-symbolic models through natural language prompts. It leverages large language models to generate declarative programs in the DomiKnowS framework. The programs in this framework express concepts and their relationships as a graph in addition to logical constraints between them. The graph, later, can be connected to trainable neural models according to those specifications. Our proposed pipeline utilizes techniques like dynamic in-context demonstration retrieval, model refinement based on feedback from a symbolic parser, visualization, and user interaction to generate the tasks' structure and formal knowledge representation. This approach empowers domain experts, even those not well-versed in ML/AI, to formally declare their knowledge to be incorporated in customized neural models in the DomiKnowS framework.

* Accepted in NeSy 2024 Conference


Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models

Jul 09, 2024

Yue Zhang, Ziqiao Ma, Jialu Li, Yanyuan Qiao, Zun Wang, Joyce Chai, Qi Wu, Mohit Bansal, Parisa Kordjamshidi

Abstract:Vision-and-Language Navigation (VLN) has gained increasing attention over recent years, and many approaches have emerged to advance its development. The remarkable achievements of foundation models have shaped the challenges and proposed methods for VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning, and emphasizes the current methods and future opportunities leveraging foundation models to address VLN challenges. We hope our in-depth discussions can provide valuable resources and insights: on one hand, to mark milestones in the progress and explore opportunities and potential roles for foundation models in this field, and on the other, to organize different challenges and solutions in VLN for foundation model researchers.

* Authors contributed equally to this work, and supervisors contributed equal advising to this work


SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding

Jul 06, 2024

Zixu Cheng, Yujiang Pu, Shaogang Gong, Parisa Kordjamshidi, Yu Kong

Abstract:Temporal grounding, a.k.a. video moment retrieval, aims at locating video segments corresponding to a given query sentence. The compositional nature of natural language enables localization beyond predefined events, posing a challenge to the compositional generalizability of existing methods. Recent studies establish the correspondence between videos and queries in a decompose-reconstruct manner to achieve compositional generalization. However, they only consider dominant primitives and build negative queries through random sampling and recombination, resulting in semantically implausible negatives that hinder the models from learning rational compositions. In addition, recent DETR-based methods still underperform in compositional temporal grounding, showing irrational saliency responses when given negative queries that have subtle differences from positive queries. To address these limitations, we first propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo to generate semantically plausible hard negative queries. Subsequently, we introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries to boost compositional generalization. Extensive experiments on two challenging benchmarks validate the effectiveness and generalizability of our proposed method. Our code is available at https://github.com/zxccade/SHINE.

* Accepted to ECCV 2024


Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA

Jun 27, 2024

Elham J. Barezi, Parisa Kordjamshidi

Abstract:We study the Knowledge-Based visual question-answering problem, in which, given a question, the models need to ground it into the visual modality to find the answer. Although many recent works use question-dependent captioners to verbalize the given image and use Large Language Models to solve the VQA problem, the research results show that they do not perform well on multi-hop questions. Our study shows that replacing a complex question with several simpler questions helps to extract more relevant information from the image and provides a stronger comprehension of it. Moreover, we analyze the decomposed questions to find out the modality of the information that is required to answer them and use a captioner for the visual questions and LLMs as a general knowledge source for the non-visual KB-based questions. Our results demonstrate the positive impact of using simple questions before retrieving visual or non-visual information. We have provided results and analysis on three well-known VQA datasets including OKVQA, A-OKVQA, and KRVQA, and achieved up to 2% improvement in accuracy.


Neuro-symbolic Training for Reasoning over Spatial Language

Jun 19, 2024

Tanawan Premsri, Parisa Kordjamshidi

Abstract:Recent research shows that more data and larger models can provide more accurate solutions to natural language problems requiring reasoning. However, models can easily fail to provide solutions in unobserved complex input compositions due to not achieving the level of abstraction required for generalizability. To alleviate this issue, we propose training the language models with neuro-symbolic techniques that can exploit the logical rules of reasoning as constraints and provide additional supervision sources to the model. Training models to adhere to the regulations of reasoning pushes them to make more effective abstractions needed for generalizability and transfer learning. We focus on a challenging problem of spatial reasoning over text. Our results on various benchmarks using multiple language models confirm our hypothesis of effective domain transfer based on neuro-symbolic training.


A Survey on Compositional Learning of AI Models: Theoretical and Experimental Practices

Jun 13, 2024

Sania Sinha, Tanawan Premsri, Parisa Kordjamshidi

Abstract:Compositional learning, mastering the ability to combine basic concepts and construct more intricate ones, is crucial for human cognition, especially in human language comprehension and visual perception. This notion is tightly connected to generalization over unobserved situations. Despite its integral role in intelligence, there is a lack of systematic theoretical and experimental research methodologies, making it difficult to analyze the compositional learning abilities of computational models. In this paper, we survey the literature on compositional learning of AI models and the connections made to cognitive studies. We identify abstract concepts of compositionality in cognitive and linguistic studies and connect these to the computational challenges faced by language and vision models in compositional reasoning. We overview the formal definitions, tasks, evaluation benchmarks, variety of computational models, and theoretical findings. We cover modern studies on large language models to provide a deeper understanding of the cutting-edge compositional capabilities exhibited by state-of-the-art AI models and pinpoint important directions for future research.


Find The Gap: Knowledge Base Reasoning For Visual Question Answering

Apr 16, 2024

Elham J. Barezi, Parisa Kordjamshidi

Abstract:We analyze knowledge-based visual question answering, in which, given a question, the models need to ground it into the visual modality and retrieve the relevant knowledge from a given large knowledge base (KB) to be able to answer. Our analysis is twofold: one part is based on designing neural architectures and training them from scratch, and the other on large pre-trained language models (LLMs). Our research questions are: 1) Can we effectively augment models by explicit supervised retrieval of the relevant KB information to solve the KB-VQA problem? 2) How do task-specific and LLM-based models perform in the integration of visual and external knowledge, and multi-hop reasoning over both sources of information? 3) Is the implicit knowledge of LLMs sufficient for KB-VQA, and to what extent can it replace the explicit KB? Our results demonstrate the positive impact of empowering task-specific and LLM models with supervised external and visual knowledge retrieval models. Our findings show that although LLMs are stronger in 1-hop reasoning, they suffer in 2-hop reasoning in comparison with our fine-tuned NN model, even if the relevant information from both modalities is available to the model. Moreover, we observed that LLMs outperform the NN model for KB-related questions, which confirms the effectiveness of implicit knowledge in LLMs; however, they do not alleviate the need for an external KB.
