Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hung-Ting Su

Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses

Sep 22, 2024

Hung-Ting Su, Ya-Ching Hsu, Xudong Lin, Xiang-Qian Shi, Yulei Niu, Han-Yuan Hsu, Hung-yi Lee, Winston H. Hsu

Figure 1 for Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses

Figure 2 for Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses

Figure 3 for Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses

Figure 4 for Unveiling Narrative Reasoning Limits of Large Language Models with Trope in Movie Synopses

Abstract:Large language models (LLMs) equipped with chain-of-thoughts (CoT) prompting have shown significant multi-step reasoning capabilities in factual content like mathematics, commonsense, and logic. However, their performance in narrative reasoning, which demands greater abstraction capabilities, remains unexplored. This study utilizes tropes in movie synopses to assess the abstract reasoning abilities of state-of-the-art LLMs and uncovers their low performance. We introduce a trope-wise querying approach to address these challenges and boost the F1 score by 11.8 points. Moreover, while prior studies suggest that CoT enhances multi-step reasoning, this study shows CoT can cause hallucinations in narrative content, reducing GPT-4's performance. We also introduce an Adversarial Injection method to embed trope-related text tokens into movie synopses without explicit tropes, revealing CoT's heightened sensitivity to such injections. Our comprehensive analysis provides insights for future research directions.

* EMNLP 2024 Findings. The first two authors contributed equally. Code: https://github.com/Shelley1214/Trope

Via

Access Paper or Ask Questions

Context-Aware Replanning with Pre-explored Semantic Map for Object Navigation

Sep 07, 2024

Hung-Ting Su, Ching-Yuan Chen, Po-Chen Ko, Jia-Fong Yeh, Min Sun, Winston H. Hsu

Figure 1 for Context-Aware Replanning with Pre-explored Semantic Map for Object Navigation

Figure 2 for Context-Aware Replanning with Pre-explored Semantic Map for Object Navigation

Figure 3 for Context-Aware Replanning with Pre-explored Semantic Map for Object Navigation

Figure 4 for Context-Aware Replanning with Pre-explored Semantic Map for Object Navigation

Abstract:Pre-explored Semantic Maps, constructed through prior exploration using visual language models (VLMs), have proven effective as foundational elements for training-free robotic applications. However, existing approaches assume the map's accuracy and do not provide effective mechanisms for revising decisions based on incorrect maps. To address this, we introduce Context-Aware Replanning (CARe), which estimates map uncertainty through confidence scores and multi-view consistency, enabling the agent to revise erroneous decisions stemming from inaccurate maps without requiring additional labels. We demonstrate the effectiveness of our proposed method by integrating it with two modern mapping backbones, VLMaps and OpenMask3D, and observe significant performance improvements in object navigation tasks. More details can be found on the project page: https://carmaps.github.io/supplements/.

* CoRL 2024. The first three authors contributed equally, and their order of authorship is interchangeable. Project page: https://carmaps.github.io/supplements/

Via

Access Paper or Ask Questions

Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding

Aug 30, 2024

Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Winston H. Hsu, Shang-Hong Lai

Figure 1 for Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding

Figure 2 for Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding

Figure 3 for Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding

Figure 4 for Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding

Abstract:While existing research often treats long-form videos as extended short videos, we propose a novel approach that more accurately reflects human cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for Long-Form Video Understanding, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels. Second, we propose a Semantics reTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. Extensive experiments demonstrate that BREASE achieves state-of-the-art performance across multiple long video understanding benchmarks in both zero-shot and fully-supervised settings. The project page and code are at: https://joslefaure.github.io/assets/html/hermes.html.

* Accepted to the EVAL-FoMo Workshop at ECCV'24. Project page: https://joslefaure.github.io/assets/html/hermes.html

Via

Access Paper or Ask Questions

Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Jun 16, 2024

Hung-Ting Su, Chun-Tong Chao, Ya-Ching Hsu, Xudong Lin, Yulei Niu, Hung-Yi Lee, Winston H. Hsu

Figure 1 for Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Figure 2 for Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Figure 3 for Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Figure 4 for Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Abstract:Large Language Models (LLMs) have demonstrated effectiveness not only in language tasks but also in video reasoning. This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills: (1) Abstract Perception: understanding and tokenizing abstract concepts in videos, and (2) Long-range Compositional Reasoning: planning and integrating intermediate reasoning steps for understanding long-range videos with numerous frames. Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches. Our experiments show that current methods, including Captioner-Reasoner, Large Multimodal Model Instruction Fine-tuning, and Visual Programming, only marginally outperform a random baseline when tackling the challenges of Abstract Perception and Long-range Compositional Reasoning. To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR), which enhance Visual Programming by fostering role interaction awareness and progressively refining movie contexts and trope queries during reasoning processes, significantly improving performance by 15 F1 points. However, this performance still lags behind human levels (40 vs. 65 F1). Additionally, we introduce a new protocol to evaluate the necessity of Abstract Perception and Long-range Compositional Reasoning for task resolution. This is done by analyzing the code generated through Visual Programming using an Abstract Syntax Tree (AST), thereby confirming the increased complexity of TiM. The dataset and code are available at: https://ander1119.github.io/TiM

* Project page: https://ander1119.github.io/TiM

Via

Access Paper or Ask Questions

Enhancing Sustainable Urban Mobility Prediction with Telecom Data: A Spatio-Temporal Framework Approach

May 26, 2024

ChungYi Lin, Shen-Lung Tung, Hung-Ting Su, Winston H. Hsu

Figure 1 for Enhancing Sustainable Urban Mobility Prediction with Telecom Data: A Spatio-Temporal Framework Approach

Figure 2 for Enhancing Sustainable Urban Mobility Prediction with Telecom Data: A Spatio-Temporal Framework Approach

Figure 3 for Enhancing Sustainable Urban Mobility Prediction with Telecom Data: A Spatio-Temporal Framework Approach

Figure 4 for Enhancing Sustainable Urban Mobility Prediction with Telecom Data: A Spatio-Temporal Framework Approach

Abstract:Traditional traffic prediction, limited by the scope of sensor data, falls short in comprehensive traffic management. Mobile networks offer a promising alternative using network activity counts, but these lack crucial directionality. Thus, we present the TeltoMob dataset, featuring undirected telecom counts and corresponding directional flows, to predict directional mobility flows on roadways. To address this, we propose a two-stage spatio-temporal graph neural network (STGNN) framework. The first stage uses a pre-trained STGNN to process telecom data, while the second stage integrates directional and geographic insights for accurate prediction. Our experiments demonstrate the framework's compatibility with various STGNN models and confirm its effectiveness. We also show how to incorporate the framework into real-world transportation systems, enhancing sustainable urban mobility.

* 8 Figures, 5 Tables. Just accepted by IJCAI (to appear)

Via

Access Paper or Ask Questions

Tracking-Assisted Object Detection with Event Cameras

Mar 27, 2024

Ting-Kang Yen, Igor Morawski, Shusil Dangi, Kai He, Chung-Yi Lin, Jia-Fong Yeh, Hung-Ting Su, Winston Hsu

Figure 1 for Tracking-Assisted Object Detection with Event Cameras

Figure 2 for Tracking-Assisted Object Detection with Event Cameras

Figure 3 for Tracking-Assisted Object Detection with Event Cameras

Figure 4 for Tracking-Assisted Object Detection with Event Cameras

Abstract:Event-based object detection has recently garnered attention in the computer vision community due to the exceptional properties of event cameras, such as high dynamic range and no motion blur. However, feature asynchronism and sparsity cause invisible objects due to no relative motion to the camera, posing a significant challenge in the task. Prior works have studied various memory mechanisms to preserve as many features as possible at the current time, guided by temporal clues. While these implicit-learned memories retain some short-term information, they still struggle to preserve long-term features effectively. In this paper, we consider those invisible objects as pseudo-occluded objects and aim to reveal their features. Firstly, we introduce visibility attribute of objects and contribute an auto-labeling algorithm to append additional visibility labels on an existing event camera dataset. Secondly, we exploit tracking strategies for pseudo-occluded objects to maintain their permanence and retain their bounding boxes, even when features have not been available for a very long time. These strategies can be treated as an explicit-learned memory guided by the tracking objective to record the displacements of objects across frames. Lastly, we propose a spatio-temporal feature aggregation module to enrich the latent features and a consistency loss to increase the robustness of the overall pipeline. We conduct comprehensive experiments to verify our method's effectiveness where still objects are retained but real occluded objects are discarded. The results demonstrate that (1) the additional visibility labels can assist in supervised training, and (2) our method outperforms state-of-the-art approaches with a significant improvement of 7.9% absolute mAP.

Via

Access Paper or Ask Questions

AED: Adaptable Error Detection for Few-shot Imitation Policy

Feb 06, 2024

Jia-Fong Yeh, Kuo-Han Hung, Pang-Chi Lo, Chi-Ming Chung, Tsung-Han Wu, Hung-Ting Su, Yi-Ting Chen, Winston H. Hsu

Figure 1 for AED: Adaptable Error Detection for Few-shot Imitation Policy

Figure 2 for AED: Adaptable Error Detection for Few-shot Imitation Policy

Figure 3 for AED: Adaptable Error Detection for Few-shot Imitation Policy

Figure 4 for AED: Adaptable Error Detection for Few-shot Imitation Policy

Abstract:We study how to report few-shot imitation (FSI) policies' behavior errors in novel environments, a novel task named adaptable error detection (AED). The potential to cause serious damage to surrounding areas limits the application of FSI policies in real-world scenarios. Thus, a robust system is necessary to notify operators when FSI policies are inconsistent with the intent of demonstrations. We develop a cross-domain benchmark for the challenging AED task, consisting of 329 base and 158 novel environments. This task introduces three challenges, including (1) detecting behavior errors in novel environments, (2) behavior errors occurring without revealing notable changes, and (3) lacking complete temporal information of the rollout due to the necessity of online detection. To address these challenges, we propose Pattern Observer (PrObe) to parse discernible patterns in the policy feature representations of normal or error states, whose effectiveness is verified in the proposed benchmark. Through our comprehensive evaluation, PrObe consistently surpasses strong baselines and demonstrates a robust capability to identify errors arising from a wide range of FSI policies. Moreover, we conduct comprehensive ablations and experiments (error correction, demonstration quality, etc.) to validate the practicality of our proposed task and methodology.

Via

Access Paper or Ask Questions

TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling

Jan 06, 2024

ChungYi Lin, Shen-Lung Tung, Hung-Ting Su, Winston H. Hsu

Figure 1 for TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling

Figure 2 for TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling

Figure 3 for TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling

Figure 4 for TelTrans: Applying Multi-Type Telecom Data to Transportation Evaluation and Prediction via Multifaceted Graph Modeling

Abstract:To address the limitations of traffic prediction from location-bound detectors, we present Geographical Cellular Traffic (GCT) flow, a novel data source that leverages the extensive coverage of cellular traffic to capture mobility patterns. Our extensive analysis validates its potential for transportation. Focusing on vehicle-related GCT flow prediction, we propose a graph neural network that integrates multivariate, temporal, and spatial facets for improved accuracy. Experiments reveal our model's superiority over baselines, especially in long-term predictions. We also highlight the potential for GCT flow integration into transportation systems.

* 7 pages, 7 figures, 4 tables. Accepted by AAAI-24-IAAI, to appear

Via

Access Paper or Ask Questions

Unsupervised Adversarial Detection without Extra Model: Training Loss Should Change

Aug 07, 2023

Chien Cheng Chyou, Hung-Ting Su, Winston H. Hsu

Figure 1 for Unsupervised Adversarial Detection without Extra Model: Training Loss Should Change

Figure 2 for Unsupervised Adversarial Detection without Extra Model: Training Loss Should Change

Figure 3 for Unsupervised Adversarial Detection without Extra Model: Training Loss Should Change

Figure 4 for Unsupervised Adversarial Detection without Extra Model: Training Loss Should Change

Abstract:Adversarial robustness poses a critical challenge in the deployment of deep learning models for real-world applications. Traditional approaches to adversarial training and supervised detection rely on prior knowledge of attack types and access to labeled training data, which is often impractical. Existing unsupervised adversarial detection methods identify whether the target model works properly, but they suffer from bad accuracies owing to the use of common cross-entropy training loss, which relies on unnecessary features and strengthens adversarial attacks. We propose new training losses to reduce useless features and the corresponding detection method without prior knowledge of adversarial attacks. The detection rate (true positive rate) against all given white-box attacks is above 93.9% except for attacks without limits (DF($\infty$)), while the false positive rate is barely 2.5%. The proposed method works well in all tested attack types and the false positive rates are even better than the methods good at certain types.

* AdvML in ICML 2023 code:https://github.com/CycleBooster/Unsupervised-adversarial-detection-without-extra-model

Via

Access Paper or Ask Questions

Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

Apr 07, 2023

Hung-Ting Su, Yulei Niu, Xudong Lin, Winston H. Hsu, Shih-Fu Chang

Figure 1 for Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

Figure 2 for Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

Figure 3 for Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

Figure 4 for Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

Abstract:Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video. Existing question synthesis methods pre-trained question generation (QG) systems on reading comprehension datasets with text descriptions as inputs. However, QG models only learn to ask association questions (e.g., ``what is someone doing...'') and result in inferior performance due to the poor transfer of association knowledge to CVidQA, which focuses on causal questions like ``why is someone doing ...''. Observing this, we proposed to exploit causal knowledge to generate question-answer pairs, and proposed a novel framework, Causal Knowledge Extraction from Language Models (CaKE-LM), leveraging causal commonsense knowledge from language models to tackle CVidQA. To extract knowledge from LMs, CaKE-LM generates causal questions containing two events with one triggering another (e.g., ``score a goal'' triggers ``soccer player kicking ball'') by prompting LM with the action (soccer player kicking ball) to retrieve the intention (to score a goal). CaKE-LM significantly outperforms conventional methods by 4% to 6% of zero-shot CVidQA accuracy on NExT-QA and Causal-VidQA datasets. We also conduct comprehensive analyses and provide key findings for future research.

* CVPR 2023 Workshop L3D-IVU

Via

Access Paper or Ask Questions