Jianbing Shen

Multi-threshold Deep Metric Learning for Facial Expression Recognition

Jun 24, 2024

Wenwu Yang, Jinyi Yu, Tuo Chen, Zhenguang Liu, Xun Wang, Jianbing Shen

Abstract: Effective expression feature representations generated by triplet-based deep metric learning are highly advantageous for facial expression recognition (FER). The performance of triplet-based deep metric learning is contingent upon identifying the best threshold for the triplet loss. Threshold validation, however, is challenging, as the ideal threshold changes among datasets and even across classes within the same dataset. In this paper, we present the multi-threshold deep metric learning technique, which not only avoids the difficult threshold validation but also vastly increases the capacity of triplet loss learning to construct expression feature representations. We find that each threshold of the triplet loss intrinsically determines a distinctive distribution of inter-class variations and thus corresponds to a unique expression feature representation. Therefore, rather than selecting a single optimal threshold from a valid threshold range, we thoroughly sample thresholds across the range, allowing the representation characteristics manifested by thresholds within the range to be fully extracted and leveraged for FER. To realize this approach, we partition the embedding layer of the deep metric learning network into a collection of slices and model the training of these embedding slices as an end-to-end multi-threshold deep metric learning problem. Each embedding slice corresponds to a sampled threshold and is learned by enforcing the corresponding triplet loss, yielding a set of distinct expression features, one for each embedding slice. This makes the embedding layer, which is composed of a set of slices, a more informative and discriminative feature, hence enhancing FER accuracy. Extensive evaluations demonstrate the superior performance of the proposed approach on both posed and spontaneous facial expression datasets.

* accepted by Pattern Recognition
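
The slicing idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean distance, the equal-size contiguous slices, the unweighted sum over slices, and all function names are assumptions made for illustration only.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def split_into_slices(embedding, num_slices):
    """Split a flat embedding into equal-length contiguous slices (assumed layout)."""
    if len(embedding) % num_slices != 0:
        raise ValueError("embedding length must be divisible by num_slices")
    size = len(embedding) // num_slices
    return [embedding[i * size:(i + 1) * size] for i in range(num_slices)]


def triplet_loss(anchor, positive, negative, threshold):
    """Standard hinge-style triplet loss with the threshold as margin."""
    return max(0.0, l2_distance(anchor, positive)
               - l2_distance(anchor, negative) + threshold)


def multi_threshold_triplet_loss(anchor, positive, negative, thresholds):
    """Apply one triplet loss per embedding slice, each with its own threshold.

    Summing the per-slice losses without weights is an assumption.
    """
    k = len(thresholds)
    slices = zip(split_into_slices(anchor, k),
                 split_into_slices(positive, k),
                 split_into_slices(negative, k))
    return sum(triplet_loss(a, p, n, t) for (a, p, n), t in zip(slices, thresholds))


if __name__ == "__main__":
    anchor = [0.0, 0.0, 0.0, 0.0]
    positive = [0.0, 0.0, 0.0, 0.0]
    negative = [1.0, 0.0, 1.0, 0.0]
    # Two slices, two sampled thresholds: the small margin is already
    # satisfied, the large one is not.
    print(multi_threshold_triplet_loss(anchor, positive, negative, [0.5, 2.0]))
```

In this toy case the same triplet satisfies the small threshold but violates the large one, so only the second slice receives a training signal. This is one way to picture how different thresholds could shape different slices.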

Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?

May 28, 2024

Yifan Bai, Dongming Wu, Yingfei Liu, Fan Jia, Weixin Mao, Ziheng Zhang, Yucheng Zhao, Jianbing Shen, Xing Wei, Tiancai Wang(+1 more)

Abstract: Rapid advancements in Autonomous Driving (AD) tasks have driven a significant shift toward end-to-end approaches, particularly the use of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to integrate 2D vision tokenizers and a large language model (LLM) for ego-car planning, and thus lack 3D geometric priors as a cornerstone of reliable planning. Naturally, this observation raises a critical concern: Can a 2D-tokenized LLM accurately perceive the 3D environment? Our evaluation of current VLM-based methods across 3D object detection, vectorized map construction, and environmental caption suggests that the answer is, unfortunately, NO. In other words, a 2D-tokenized LLM fails to provide reliable autonomous driving. In response, we introduce DETR-style 3D perceptrons as 3D tokenizers, which connect to the LLM through a one-layer linear projector. This simple yet elegant strategy, termed Atlas, harnesses the inherent priors of the 3D physical world, enabling it to simultaneously process high-resolution multi-view images and employ spatiotemporal modeling. Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks on the nuScenes dataset, proving that a 3D-tokenized LLM is the key to reliable autonomous driving. The code and datasets will be released.
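
The "one-layer linear projector" mentioned above can be pictured with a small sketch: a single affine map applied to each 3D token so that it matches the LLM embedding width. This is not the Atlas code. The function name, the dimensions, and the weight values are hypothetical placeholders; in practice the weights would be learned.

```python
def project_tokens(tokens, weight, bias):
    """Apply one linear layer (y = W x + b) to every token independently.

    tokens: list of vectors of length in_dim (e.g. 3D query features)
    weight: out_dim rows, each of length in_dim
    bias:   vector of length out_dim
    """
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in tokens
    ]


if __name__ == "__main__":
    # Two hypothetical 3D tokens of width 2, projected to width 3.
    tokens = [[1.0, 2.0], [3.0, 4.0]]
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # placeholder values
    bias = [0.0, 0.0, 0.5]
    for projected in project_tokens(tokens, weight, bias):
        print(projected)
```

The number of tokens is unchanged and only the feature width changes, which is what lets the projected tokens sit alongside ordinary text embeddings in the LLM input sequence.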

IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection

Mar 22, 2024

Junbo Yin, Jianbing Shen, Runnan Chen, Wei Li, Ruigang Yang, Pascal Frossard, Wenguan Wang

Abstract: Bird's eye view (BEV) representation has emerged as a dominant solution for describing 3D space in autonomous driving scenarios. However, objects in the BEV representation typically exhibit small sizes, and the associated point cloud context is inherently sparse, which leads to great challenges for reliable 3D perception. In this paper, we propose IS-Fusion, an innovative multimodal fusion framework that jointly captures the Instance- and Scene-level contextual information. IS-Fusion essentially differs from existing approaches that only focus on BEV scene-level fusion by explicitly incorporating instance-level multimodal information, thus facilitating instance-centric tasks like 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene feature and yield an instance-aware BEV representation. On the challenging nuScenes benchmark, IS-Fusion outperforms all published multimodal works to date. Code is available at: https://github.com/yinjunbo/IS-Fusion.

* Accepted to CVPR 2024; Code: https://github.com/yinjunbo/IS-Fusion

Visual In-Context Learning for Large Vision-Language Models

Feb 18, 2024

Yucheng Zhou, Xiang Li, Qianning Wang, Jianbing Shen

Abstract: In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities. To overcome these challenges, we introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition. Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarizes images with task intent and task-specific visual parsing, and composes language-based demonstrations that reduce token count and alleviate the cross-modal interaction problem. Experimental evaluations on five visual reasoning datasets demonstrate the effectiveness of our method. Moreover, our extensive experiments leverage information flow analysis to elucidate the effectiveness of our method, and investigate the impact of the length and position of demonstrations for LVLMs. The use of in-context unlearning further shows promise in resetting specific model knowledge without retraining.

* 13 pages, 7 figures

DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving

Jan 08, 2024

Wencheng Han, Dongqian Guo, Cheng-Zhong Xu, Jianbing Shen

Abstract: In the field of autonomous driving, two important features of autonomous driving systems are the explainability of decision logic and the accuracy of environmental perception. This paper introduces DME-Driver, a new autonomous driving system that enhances the performance and reliability of autonomous driving systems. DME-Driver utilizes a powerful vision language model as the decision-maker and a planning-oriented perception model as the control signal generator. To ensure explainable and reliable driving decisions, the logical decision-maker is constructed based on a large vision language model. This model follows the logic employed by experienced human drivers and makes decisions in a similar manner. On the other hand, the generation of accurate control signals relies on precise and detailed environmental perception, which is where 3D scene perception models excel. Therefore, a planning-oriented perception model is employed as the signal generator. It translates the logical decisions made by the decision-maker into accurate control signals for the self-driving car. To effectively train the proposed model, a new dataset for autonomous driving was created. This dataset encompasses a diverse range of human driver behaviors and their underlying motivations. By leveraging this dataset, our model achieves high planning accuracy through a logical thinking process.

DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection

Dec 25, 2023

Li Xiang, Junbo Yin, Wei Li, Cheng-Zhong Xu, Ruigang Yang, Jianbing Shen

Abstract: Vehicle-to-Everything (V2X) collaborative perception has recently gained significant attention due to its capability to enhance scene understanding by integrating information from various agents, e.g., vehicles and infrastructure. However, current works often treat the information from each agent equally, ignoring the inherent domain gap caused by the different LiDAR sensors used by each agent, thus leading to suboptimal performance. In this paper, we propose DI-V2X, which aims to learn Domain-Invariant representations through a new distillation framework to mitigate the domain discrepancy in the context of V2X 3D object detection. DI-V2X comprises three essential components: a domain-mixing instance augmentation (DMA) module, a progressive domain-invariant distillation (PDD) module, and a domain-adaptive fusion (DAF) module. Specifically, DMA builds a domain-mixing 3D instance bank for the teacher and student models during training, resulting in aligned data representation. Next, PDD encourages the student models from different domains to gradually learn a domain-invariant feature representation towards the teacher, where the overlapping regions between agents are employed as guidance to facilitate the distillation process. Furthermore, DAF closes the domain gap between the students by incorporating calibration-aware domain-adaptive attention. Extensive experiments on the challenging DAIR-V2X and V2XSet benchmark datasets demonstrate that DI-V2X achieves remarkable performance, outperforming all previous V2X models. Code is available at https://github.com/Serenos/DI-V2X

* AAAI 2024

Thread of Thought Unraveling Chaotic Contexts

Nov 15, 2023

Yucheng Zhou, Xiubo Geng, Tao Shen, Chongyang Tao, Guodong Long, Jian-Guang Lou, Jianbing Shen

Abstract: Large Language Models (LLMs) have ushered in a transformative era in the field of natural language processing, excelling in tasks related to text comprehension and generation. Nevertheless, they encounter difficulties when confronted with chaotic contexts (e.g., distractors rather than long irrelevant context), leading to the inadvertent omission of certain details within the chaotic context. In response to these challenges, we introduce the "Thread of Thought" (ThoT) strategy, which draws inspiration from human cognitive processes. ThoT systematically segments and analyzes extended contexts while adeptly selecting pertinent information. This strategy serves as a versatile "plug-and-play" module, seamlessly integrating with various LLMs and prompting techniques. In the experiments, we utilize the PopQA and EntityQ datasets, as well as a Multi-Turn Conversation Response dataset (MTCR) we collected, to illustrate that ThoT significantly improves reasoning performance compared to other prompting techniques.

* 11 pages, 7 figures, 5 tables
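
As a rough illustration of a "plug-and-play" prompting strategy of this kind, the sketch below assembles a prompt that asks a model to work through a cluttered context segment by segment before answering. The trigger sentence, the passage labels, and the function names are illustrative assumptions, not wording or code confirmed by the abstract, and no model is called.

```python
# Illustrative trigger sentence; the exact wording is an assumption.
SEGMENT_AND_ANALYZE_TRIGGER = (
    "Walk me through this context in manageable parts step by step, "
    "summarizing and analyzing as we go."
)


def build_analysis_prompt(passages, question, trigger=SEGMENT_AND_ANALYZE_TRIGGER):
    """First stage: present the (possibly noisy) passages and ask for a stepwise analysis."""
    context = "\n".join(
        f"Passage {i}: {text}" for i, text in enumerate(passages, start=1)
    )
    return f"{context}\nQ: {question}\n{trigger}\nA:"


def build_answer_prompt(analysis_prompt, analysis):
    """Second stage: append the model's analysis and ask for the final answer."""
    return f"{analysis_prompt} {analysis}\nTherefore, the answer is:"


if __name__ == "__main__":
    passages = [
        "The museum opened in 1901.",
        "A nearby cafe serves espresso.",  # distractor
    ]
    prompt = build_analysis_prompt(passages, "When did the museum open?")
    print(prompt)
    print(build_answer_prompt(prompt, "Passage 1 states the opening year."))
```

Because the strategy only changes the prompt text, a builder like this can wrap any LLM call without touching the model itself.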

TopoMLP: A Simple yet Strong Pipeline for Driving Topology Reasoning

Oct 10, 2023

Dongming Wu, Jiahao Chang, Fan Jia, Yingfei Liu, Tiancai Wang, Jianbing Shen

Abstract: Topology reasoning aims to comprehensively understand road scenes and present drivable routes in autonomous driving. It requires detecting road centerlines (lanes) and traffic elements, and further reasoning about their topology relationships, i.e., lane-lane topology and lane-traffic topology. In this work, we first show that the topology score relies heavily on detection performance on lanes and traffic elements. Therefore, we introduce a powerful 3D lane detector and an improved 2D traffic element detector to extend the upper limit of topology performance. Further, we propose TopoMLP, a simple yet high-performance pipeline for driving topology reasoning. Based on the impressive detection performance, we develop two simple MLP-based heads for topology generation. TopoMLP achieves state-of-the-art performance on the OpenLane-V2 benchmark, i.e., 41.2% OLS with a ResNet-50 backbone. It is also the 1st-place solution in the 1st OpenLane Topology in Autonomous Driving Challenge. We hope such a simple and strong pipeline can provide some new insights to the community. Code is at https://github.com/wudongming97/TopoMLP.

* The 1st solution for 1st OpenLane Topology in Autonomous Driving Challenge. Code is at https://github.com/wudongming97/TopoMLP

Decoupling the Curve Modeling and Pavement Regression for Lane Detection

Sep 19, 2023

Wencheng Han, Jianbing Shen

Abstract: The curve-based lane representation is a popular approach in many lane detection methods, as it allows for the representation of lanes as a whole object and maximizes the use of holistic information about the lanes. However, the curves produced by these methods may not fit well with irregular lines, which can lead to gaps in performance compared to indirect representations such as segmentation-based or point-based methods. We have observed that these lanes are not intended to be irregular, but they appear zigzagged in the perspective view due to being drawn on uneven pavement. In this paper, we propose a new approach to the lane detection task by decomposing it into two parts: curve modeling and ground height regression. Specifically, we use a parameterized curve to represent lanes in the BEV space to reflect the original distribution of lanes. For the second part, since ground heights are determined by natural factors such as road conditions and are less holistic, we regress the ground heights of key points separately from the curve modeling. Additionally, we have unified the 2D and 3D lane detection tasks by designing a new framework and a series of losses to guide the optimization of models with or without 3D lane labels. Our experiments on 2D lane detection benchmarks (TuSimple and CULane), as well as the recently proposed 3D lane detection datasets (ONCE-3Dlane and OpenLane), have shown significant improvements. We will make our well-documented source code publicly available.
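
The decomposition described above, a smooth curve in BEV plus separately regressed ground heights, can be sketched as follows. The polynomial curve family, the coordinate convention (x lateral, y forward, z up), and the function names are assumptions for illustration; the abstract does not specify them.

```python
def eval_polynomial(coeffs, y):
    """Evaluate x = c0 + c1*y + c2*y^2 + ... using Horner's rule."""
    x = 0.0
    for c in reversed(coeffs):
        x = x * y + c
    return x


def lane_points_3d(curve_coeffs, key_ys, heights):
    """Combine a BEV curve with per-key-point ground heights into 3D lane points.

    curve_coeffs: coefficients of the lateral offset x as a polynomial in y
    key_ys:       forward distances of the key points
    heights:      one ground height z per key point, regressed independently
    """
    if len(key_ys) != len(heights):
        raise ValueError("need exactly one height per key point")
    return [(eval_polynomial(curve_coeffs, y), y, z) for y, z in zip(key_ys, heights)]


if __name__ == "__main__":
    # A gently curving lane on uneven ground (all values are made up).
    coeffs = [1.0, 0.0, 0.5]        # x = 1 + 0.5 * y^2
    key_ys = [0.0, 2.0]
    heights = [0.0, 0.3]
    for point in lane_points_3d(coeffs, key_ys, heights):
        print(point)
```

Keeping the heights outside the curve parameters means the BEV curve can stay smooth even when the pavement is uneven, which matches the motivation given in the abstract.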

Language Prompt for Autonomous Driving

Sep 08, 2023

Dongming Wu, Wencheng Han, Tiancai Wang, Yingfei Liu, Xiangyu Zhang, Jianbing Shen

Abstract: A new trend in the computer vision community is to capture objects of interest following flexible human commands represented by a natural language prompt. However, progress in using language prompts in driving scenarios is held back by the scarcity of paired prompt-instance data. To address this challenge, we propose the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, named NuPrompt. It expands the nuScenes dataset by constructing a total of 35,367 language descriptions, each referring to an average of 5.3 object tracks. Based on the object-text pairs from the new benchmark, we formulate a new prompt-based driving task, i.e., employing a language prompt to predict the described object trajectory across views and frames. Furthermore, we provide a simple end-to-end baseline model based on Transformer, named PromptTrack. Experiments show that our PromptTrack achieves impressive performance on NuPrompt. We hope this work can provide more new insights for the autonomous driving community. Dataset and code will be made public at https://github.com/wudongming97/Prompt4Driving.
