Long Chen

University of Kaiserslautern-Landau, MODE Collaboration

SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos

Mar 03, 2024

Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, Shih-Fu Chang

Abstract: We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations. The motivation of this problem is to learn a structured and plannable state and action space. Recent works succeeded in sequence modeling of steps with only sequence-level annotations accessible during training, which overlooked the roles of states in the procedures. In this work, we point out that State CHangEs MAtter (SCHEMA) for procedure planning in instructional videos. We aim to establish a more structured state space by investigating the causal relations between steps and states in procedures. Specifically, we explicitly represent each step as state changes and track the state changes in procedures. For step representation, we leverage the commonsense knowledge in large language models (LLMs) to describe the state changes of steps via our designed chain-of-thought prompting. For state change tracking, we align visual state observations with language state descriptions via cross-modal contrastive learning, and explicitly model the intermediate states of the procedure using LLM-generated state descriptions. Experiments on CrossTask, COIN, and NIV benchmark datasets demonstrate that our proposed SCHEMA model achieves state-of-the-art performance and obtains explainable visualizations.

* Accepted by ICLR 2024
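
The cross-modal alignment mentioned in the abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style contrastive loss, which is a common formulation but is not confirmed by the abstract as the exact loss used in SCHEMA; the function name info_nce, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(visual, text, temperature=0.07):
    """InfoNCE-style loss (an assumption, not the paper's confirmed loss).

    visual[i] and text[i] are treated as a matching pair; every other text
    row in the batch serves as a negative for visual[i].
    """
    v = [normalize(x) for x in visual]
    t = [normalize(x) for x in text]
    total = 0.0
    for i, vi in enumerate(v):
        # Cosine similarities to every text embedding, scaled by temperature.
        logits = [dot(vi, tj) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair as the target class.
        total += log_z - logits[i]
    return total / len(v)


# Hypothetical toy embeddings: two visual states and two state descriptions.
visual_states = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = info_nce(visual_states, aligned_text)
swapped_loss = info_nce(visual_states, swapped_text)
```

The loss is small when each visual embedding is closest to its own description and large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward.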

GenAD: Generative End-to-End Autonomous Driving

Feb 20, 2024

Wenzhao Zheng, Ruiqi Song, Xianda Guo, Long Chen

Abstract: Directly producing planning results from raw sensors has been a long-desired solution for autonomous driving and has attracted increasing attention recently. Most existing end-to-end autonomous driving methods factorize this problem into perception, motion prediction, and planning. However, we argue that the conventional progressive pipeline still cannot comprehensively model the entire traffic evolution process, e.g., the future interaction between the ego car and other traffic participants and the structural trajectory prior. In this paper, we explore a new paradigm for end-to-end autonomous driving, where the key is to predict how the ego car and the surroundings evolve given past scenes. We propose GenAD, a generative framework that casts autonomous driving into a generative modeling problem. We propose an instance-centric scene tokenizer that first transforms the surrounding scenes into map-aware instance tokens. We then employ a variational autoencoder to learn the future trajectory distribution in a structural latent space for trajectory prior modeling. We further adopt a temporal model to capture the agent and ego movements in the latent space to generate more effective future trajectories. GenAD finally simultaneously performs motion prediction and planning by sampling distributions in the learned structural latent space conditioned on the instance tokens and using the learned temporal model to generate futures. Extensive experiments on the widely used nuScenes benchmark show that the proposed GenAD achieves state-of-the-art performance on vision-centric end-to-end autonomous driving with high efficiency. Code: https://github.com/wzzheng/GenAD.

* Code is available at: https://github.com/wzzheng/GenAD

Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning

Jan 28, 2024

Yuhang Zheng, Zhen Wang, Long Chen

Abstract: Being widely used in learning unbiased visual question answering (VQA) models, Data Augmentation (DA) helps mitigate language biases by generating extra training samples beyond the original samples. While today's DA methods can generate robust samples, the augmented training set, significantly larger than the original dataset, often exhibits redundancy in terms of difficulty or content repetition, leading to inefficient model training and even compromising the model performance. To this end, we design an Effective Curriculum Learning strategy (ECL) to enhance DA-based VQA methods. Intuitively, ECL trains VQA models on relatively "easy" samples first, then gradually moves to "harder" samples, while less-valuable samples are dynamically removed. Compared to training on the entire augmented dataset, our ECL strategy can further enhance VQA models' performance with fewer training samples. Extensive ablations have demonstrated the effectiveness of ECL on various methods.
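
The easy-to-hard schedule described in the abstract can be pictured with a rough sketch. It is not the ECL algorithm itself: the abstract does not say how difficulty or sample value is measured, so the per-sample difficulty scores, the function name curriculum_stages, and the sliding-window rule (which stands in for "dynamically removing less-valuable samples" by dropping the easiest samples in later stages) are all assumptions made for illustration.

```python
def curriculum_stages(difficulty, num_stages, keep_fraction=0.5):
    """Return, for each stage, the indices of the samples to train on.

    Assumption: samples are sorted by a given difficulty score and a
    fixed-size window slides from the easy end to the hard end, so each
    stage trains on only a fraction of the augmented set.
    """
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    n = len(order)
    window = max(1, int(n * keep_fraction))
    stages = []
    for s in range(num_stages):
        # The window start moves linearly from 0 to n - window.
        if num_stages == 1:
            start = 0
        else:
            start = round(s * (n - window) / (num_stages - 1))
        stages.append(order[start:start + window])
    return stages


# Hypothetical difficulty scores for four augmented samples.
scores = [0.9, 0.1, 0.5, 0.3]
stages = curriculum_stages(scores, num_stages=3, keep_fraction=0.5)
```

With these toy scores, the first stage holds the two easiest samples and the last stage the two hardest, and every stage uses half of the data, which mirrors the claim of training with fewer samples than the full augmented set.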

Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

Jan 26, 2024

Jinhan Wang, Long Chen, Aparna Khare, Anirudh Raju, Pranav Dheram, Di He, Minhua Wu, Andreas Stolcke, Venkatesh Ravichandran

Abstract: We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combining LLMs and acoustic models for a more natural and conversational interaction between humans and speech-enabled AI agents.

* To appear in IEEE ICASSP 2024

Boundary and Relation Distillation for Semantic Segmentation

Jan 24, 2024

Dong Zhang, Pingcheng Dong, Xinting Hu, Long Chen, Kwang-Ting Cheng

Abstract: Recently, it has been revealed that small semantic segmentation (SS) models exhibit a tendency to make errors in maintaining boundary region completeness and preserving target region connectivity, despite their effective segmentation of the main object regions. To address these errors, we propose a targeted boundary and relation distillation (BRD) strategy using knowledge distillation from large teacher models to small student models. Specifically, the boundary distillation extracts explicit object boundaries from the hierarchical feature maps of the backbone network, subsequently enhancing the student model's mask quality in boundary regions. Concurrently, the relation distillation transfers implicit relations from the teacher model to the student model using pixel-level self-relation as a bridge, ensuring that the student's mask has strong target region connectivity. The proposed BRD is designed concretely for SS and is characterized by simplicity and efficiency. Through experimental evaluations on multiple SS datasets, including Pascal VOC 2012, Cityscapes, ADE20K, and COCO-Stuff 10K, we demonstrate that BRD significantly surpasses the current methods without increasing the inference costs, generating crisp region boundaries and smooth connecting regions that are challenging for small models.

Two-pass Endpoint Detection for Speech Recognition

Jan 17, 2024

Anirudh Raju, Aparna Khare, Di He, Ilya Sklyar, Long Chen, Sam Alptekin, Viet Anh Trinh, Zhe Zhang, Colin Vaz, Venkatesh Ravichandran(+2 more)

Abstract: Endpoint (EP) detection is a key component of far-field speech recognition systems that assist the user through voice commands. The endpoint detector has to trade off accuracy against latency, since waiting longer reduces the cases of users being cut off early but increases latency. We propose a novel two-pass solution for endpointing, where the utterance endpoint detected by a first-pass endpointer is verified by a second-pass model termed EP Arbitrator. Our method improves the trade-off between early cut-offs and latency over a baseline endpointer, as tested on datasets including voice-assistant transactional queries, conversational speech, and the public SLURP corpus. We demonstrate that our method shows improvements regardless of the first-pass EP model used.

* ASRU 2023
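
The two-pass control flow described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the first pass is reduced to a trailing-silence counter, and the second-pass EP Arbitrator is replaced by a hypothetical callable that returns a confidence that the utterance is complete. The function name, parameters, and thresholds are invented for the example and are not taken from the paper.

```python
def two_pass_endpoint(frames_is_speech, arbitrator, silence_frames=3, accept=0.5):
    """Return the frame index at which the endpoint is declared, or None.

    First pass (assumed): fire a candidate endpoint after `silence_frames`
    consecutive non-speech frames.
    Second pass (hypothetical stand-in for the EP Arbitrator): accept the
    candidate only if arbitrator(t) >= accept; otherwise keep listening.
    """
    silence = 0
    for t, is_speech in enumerate(frames_is_speech):
        silence = 0 if is_speech else silence + 1
        if silence >= silence_frames:
            # The first pass proposed an endpoint at frame t.
            if arbitrator(t) >= accept:
                return t
            # Candidate rejected: reset the counter and wait for more audio.
            silence = 0
    return None


# Toy utterance with a mid-sentence pause (frames 2-4) and a final pause.
frames = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]


def toy_arbitrator(t):
    """Hypothetical arbitrator that only trusts late candidates."""
    return 0.9 if t >= 8 else 0.1


verified = two_pass_endpoint(frames, toy_arbitrator)
first_pass_only = two_pass_endpoint(frames, lambda t: 1.0)
```

In this toy run the first pass alone would cut the user off at the mid-sentence pause, while the verified version waits for the final pause, which is the kind of early cut-off the second pass is meant to prevent.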

Multiperson Detection and Vital-Sign Sensing Empowered by Space-Time-Coding RISs

Jan 15, 2024

Xinyu Li, Jian Wei You, Ze Gu, Qian Ma, Jingyuan Zhang, Long Chen, Tie Jun Cui

Abstract: Passive human sensing using wireless signals has attracted increasing attention due to its advantages of contactless operation and robustness under various lighting conditions. However, when multiple human individuals are present, their reflected signals could be intertwined in the time, frequency and spatial domains, making it challenging to separate them. To address this issue, this paper proposes a novel system for multiperson detection and monitoring of vital signs (i.e., respiration and heartbeat) with the assistance of space-time-coding (STC) reconfigurable intelligent metasurfaces (RISs). Specifically, the proposed system scans the area of interest (AoI) for human detection by using the harmonic beams generated by the STC RIS. Simultaneously, frequency-orthogonal beams are assigned to each detected person for accurate estimation of their respiration rate (RR) and heartbeat rate (HR). Furthermore, to efficiently extract the respiration signal and the much weaker heartbeat signal, we propose an improved variational mode decomposition (VMD) algorithm to accurately decompose the complex reflected signals into a smaller number of intrinsic mode functions (IMFs). We build a prototype to validate the proposed multiperson detection and vital-sign monitoring system. Experimental results demonstrate that the proposed system can simultaneously monitor the vital signs of up to four persons. The errors of RR and HR estimation using the improved VMD algorithm are below 1 RPM (respirations per minute) and 5 BPM (beats per minute), respectively. Further analysis reveals that the flexible beam controlling mechanism empowered by the STC RIS can reduce the noise reflected from other irrelevant objects at the physical layer, and improve the signal-to-noise ratio of echoes from the human chest.

SoundCount: Sound Counting from Raw Audio with Dyadic Decomposition Neural Network

Dec 26, 2023

Yuhang He, Zhuangzhuang Dai, Long Chen, Niki Trigoni, Andrew Markham

Abstract: In this paper, we study an underexplored, yet important and challenging problem: counting the number of distinct sounds in raw audio characterized by a high degree of polyphonicity. We do so by systematically proposing a novel end-to-end trainable neural network (which we call DyDecNet, consisting of a dyadic decomposition front-end and backbone network), and quantifying the difficulty level of counting depending on sound polyphonicity. The dyadic decomposition front-end progressively decomposes the raw waveform dyadically along the frequency axis to obtain a time-frequency representation in a multi-stage, coarse-to-fine manner. Each intermediate waveform convolved by a parent filter is further processed by a pair of child filters that evenly split the parent filter's carried frequency response, with the higher-half child filter encoding the detail and the lower-half child filter encoding the approximation. We further introduce an energy gain normalization to normalize sound loudness variance and spectrum overlap, and apply it to each intermediate parent waveform before feeding it to the two child filters. To better quantify sound counting difficulty level, we further design three polyphony-aware metrics: polyphony ratio, max polyphony and mean polyphony. We test DyDecNet on various datasets to show its superiority, and we further show that the dyadic decomposition network can be used as a general front-end to tackle other acoustic tasks.

* AAAI2024 Paper

LingoQA: Video Question Answering for Autonomous Driving

Dec 21, 2023

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton(+1 more)

Abstract: Autonomous driving has long faced a challenge with public acceptance due to the lack of explainability in the decision-making process. Video question-answering (QA) in natural language provides the opportunity for bridging this gap. Nonetheless, evaluating the performance of Video QA models has proved particularly tough due to the absence of comprehensive benchmarks. To fill this gap, we introduce LingoQA, a benchmark specifically for autonomous driving Video QA. The LingoQA trainable metric demonstrates a 0.95 Spearman correlation coefficient with human evaluations. We introduce a Video QA dataset of central London consisting of 419k samples that we release with the paper. We establish a baseline vision-language model and run extensive ablation studies to understand its performance.

* Benchmark and dataset are available at https://github.com/wayveai/LingoQA/
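
For readers unfamiliar with the statistic quoted in the abstract, the sketch below shows how a Spearman correlation between an automatic metric and human ratings can be computed: it is the Pearson correlation of the two rank vectors. The scores are made-up toy values, not LingoQA data, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values only: metric scores and human ratings for four answers.
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [1, 3, 2, 4]
rho = spearman(metric_scores, human_ratings)
```

Because only the ordering matters, a metric can reach a high Spearman coefficient even when its raw values are on a different scale from the human ratings.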

Beneath the Surface: Unveiling Harmful Memes with Multimodal Reasoning Distilled from Large Language Models

Dec 09, 2023

Hongzhan Lin, Ziyang Luo, Jing Ma, Long Chen

Abstract: The age of social media is rife with memes. Understanding and detecting harmful memes pose a significant challenge due to their implicit meaning that is not explicitly conveyed through the surface text and image. However, existing harmful meme detection approaches only recognize superficial harm-indicative signals in an end-to-end classification manner but ignore in-depth cognition of the meme text and image. In this paper, we attempt to detect harmful memes based on advanced reasoning over the interplay of multimodal information in memes. Inspired by the success of Large Language Models (LLMs) on complex reasoning, we first conduct abductive reasoning with LLMs. Then we propose a novel generative framework to learn reasonable thoughts from LLMs for better multimodal fusion and lightweight fine-tuning, which consists of two training stages: 1) Distill multimodal reasoning knowledge from LLMs; and 2) Fine-tune the generative framework to infer harmfulness. Extensive experiments conducted on three meme datasets demonstrate that our proposed approach outperforms state-of-the-art methods on the harmful meme detection task.

* The 2023 Conference on Empirical Methods in Natural Language Processing
* The first work to alleviate the issue of superficial understanding for harmful meme detection by explicitly utilizing commonsense knowledge, from a fresh perspective on harnessing advanced Large Language Models
