Abstract: Training a Large Visual Language Model (LVLM) from scratch, as was done for GPT-4, is resource-intensive. Our paper proposes an alternative method, LMEye, a plug-and-play interactive perception network for Large Language Models (LLMs), aiming to improve the accuracy of image understanding for LVLMs. Previous methods that infuse visual information into LLMs rely on a static visual mapping network and lack dynamic interaction between the LLM and the visual input. LMEye addresses this issue by allowing the LLM to incorporate visual information that is aligned with human instructions. Specifically, LMEye consists of a static visual mapping network that provides the LLM with a basic perception of the image, together with additional linear layers responsible for acquiring requests from the LLM, decomposing image features accordingly, and transmitting the interleaved information back to the LLM. In this way, the LLM is in charge of understanding human instructions, sending the corresponding requests to the interactive perception network, and generating responses based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal question answering and reasoning tasks, demonstrating that it significantly improves the zero-shot performance of LLMs on multimodal tasks compared to previous methods.
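
To make the request-based interaction concrete, here is a minimal PyTorch sketch of how such an interactive perception module could be wired up; the class name InteractivePerception, the layer layout, and all dimensions are illustrative assumptions rather than the released LMEye implementation:

    # Illustrative sketch of an LMEye-style interactive perception module (not the official code).
    import torch
    import torch.nn as nn

    class InteractivePerception(nn.Module):
        def __init__(self, vis_dim=1024, llm_dim=4096):
            super().__init__()
            self.static_map = nn.Linear(vis_dim, llm_dim)     # basic image perception fed to the LLM
            self.request_proj = nn.Linear(llm_dim, vis_dim)   # projects the LLM's instruction-aware request
            self.decompose = nn.Linear(vis_dim, vis_dim)      # decomposes image features per request
            self.send_back = nn.Linear(vis_dim, llm_dim)      # maps interacted features back to LLM space

        def forward(self, image_feats, llm_request):
            # image_feats: (B, N, vis_dim) patch features; llm_request: (B, llm_dim) hidden state of a request token
            base = self.static_map(image_feats)                      # static visual prompt for the LLM
            query = self.request_proj(llm_request).unsqueeze(1)      # (B, 1, vis_dim)
            attn = torch.softmax((query * self.decompose(image_feats)).sum(-1), dim=-1)  # (B, N)
            focused = torch.einsum('bn,bnd->bd', attn, image_feats)  # instruction-conditioned visual evidence
            return base, self.send_back(focused)                     # both are consumed by the frozen LLM

    feats, req = torch.randn(2, 257, 1024), torch.randn(2, 4096)
    base, extra = InteractivePerception()(feats, req)
    print(base.shape, extra.shape)  # torch.Size([2, 257, 4096]) torch.Size([2, 4096])
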
Abstract: Pretrained Vision-Language Models (VLMs) have achieved remarkable performance in image retrieval from text. However, their performance drops drastically when confronted with linguistically complex texts that they struggle to comprehend. Inspired by the divide-and-conquer algorithm and dual-process theory, in this paper we regard linguistically complex texts as compound propositions composed of multiple simple proposition sentences and propose an end-to-end Neural Divide-and-Conquer Reasoning framework, dubbed NDCR. It contains three main components: 1) Divide: a proposition generator splits the compound proposition text into simple proposition sentences and produces their corresponding representations; 2) Conquer: a pretrained VLM-based visual-linguistic interactor handles the interaction between the decomposed proposition sentences and images; 3) Combine: a neural-symbolic reasoner combines the resulting reasoning states to obtain the final solution via a neural logic reasoning approach. Under dual-process theory, the visual-linguistic interactor and the neural-symbolic reasoner can be regarded as the analogical reasoning System 1 and the logical reasoning System 2, respectively. We conduct extensive experiments on a challenging dataset for image retrieval from contextual descriptions. Experimental results and analyses indicate that NDCR significantly improves performance on this complex image-text reasoning problem. Code link: https://github.com/YunxinLi/NDCR.
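
As a toy illustration of the "Combine" step, the sketch below fuses per-proposition image scores with a product t-norm as a soft logical AND; the function name and the choice of fuzzy conjunction are assumptions for illustration, not NDCR's actual neural logic reasoner:

    # Hedged sketch of combining per-proposition scores with a differentiable conjunction.
    import torch

    def combine_propositions(prop_scores: torch.Tensor) -> torch.Tensor:
        """prop_scores: (num_images, num_propositions) in [0, 1], one score per simple proposition.
        Returns a single compound-proposition score per candidate image."""
        # Product t-norm as a soft AND: an image must satisfy every decomposed proposition.
        return prop_scores.clamp(1e-6, 1.0).log().sum(dim=-1).exp()

    scores = torch.tensor([[0.9, 0.8, 0.95],   # image consistent with all three simple propositions
                           [0.9, 0.1, 0.95]])  # image violating the second proposition
    print(combine_propositions(scores))  # the first image ranks clearly higher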




Abstract: Recently, the zero-shot semantic segmentation problem has attracted increasing attention, and the best-performing methods are based on two-stream networks: one stream for proposal mask generation and the other for segment classification using a pre-trained visual-language model. However, existing two-stream methods require passing a large number of image crops (up to a hundred) through the visual-language model, which is highly inefficient. To address this problem, we propose a network that needs only a single pass through the visual-language model for each input image. Specifically, we first propose a novel network adaptation approach, termed patch severance, to restrict harmful interference between patch embeddings in the pre-trained visual encoder. We then propose classification anchor learning to encourage the network to spatially focus on more discriminative features for classification. Extensive experiments demonstrate that the proposed method achieves outstanding performance, surpassing state-of-the-art methods while being 4 to 7 times faster at inference. We release our code at https://github.com/CongHan0808/DeOP.git.
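
One way to picture patch severance, under our reading of the abstract, is as masking patch-to-patch attention in late encoder blocks so that patch embeddings stop interfering with one another; the sketch below is only that interpretation, not the DeOP code:

    # Interpretation-only sketch: patches attend to themselves and to [CLS], but not to other patches.
    import torch
    import torch.nn.functional as F

    def severed_attention(q, k, v):
        """q, k, v: (B, N, D) where token 0 is [CLS] and tokens 1..N-1 are patches."""
        B, N, D = q.shape
        scores = q @ k.transpose(-2, -1) / D ** 0.5                 # (B, N, N)
        mask = torch.eye(N, dtype=torch.bool, device=q.device)      # self-attention allowed
        mask[:, 0] = True                                           # every token may attend to [CLS]
        mask[0, :] = True                                           # [CLS] still sees all patches
        scores = scores.masked_fill(~mask, float('-inf'))
        return F.softmax(scores, dim=-1) @ v

    x = torch.randn(2, 197, 64)
    print(severed_attention(x, x, x).shape)  # torch.Size([2, 197, 64])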




Abstract: Object re-identification (ReID) aims to find instances with the same identity as a given probe in a large gallery. Pairwise losses play an important role in training a strong ReID network. Existing pairwise losses densely exploit each instance as an anchor and sample its triplets in a mini-batch. This dense sampling mechanism inevitably introduces positive pairs that share few visual similarities, which can be harmful to training. To address this problem, we propose a novel loss paradigm termed Sparse Pairwise (SP) loss that leverages only a few appropriate pairs for each class in a mini-batch, and we empirically demonstrate that this is sufficient for ReID tasks. Based on the proposed loss framework, we further propose an adaptive positive mining strategy that can dynamically adapt to diverse intra-class variations. Extensive experiments show that SP loss and its adaptive variant, AdaSP loss, outperform other pairwise losses and achieve state-of-the-art performance across several ReID benchmarks. Code is available at https://github.com/Astaxanthin/AdaSP.
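
The following sketch conveys the sparse pairwise idea: per class in the mini-batch, only one positive pair and one hardest negative contribute to the loss. The mining rule and margin here are simplified placeholders, not the exact SP/AdaSP formulation:

    # Simplified sparse pairwise loss: one positive pair and one hardest negative per class.
    import torch
    import torch.nn.functional as F

    def sparse_pairwise_loss(features, labels, margin=0.3):
        feats = F.normalize(features, dim=1)
        sim = feats @ feats.t()                                   # cosine similarity matrix
        losses = []
        for c in labels.unique():
            pos_idx = (labels == c).nonzero(as_tuple=True)[0]
            neg_idx = (labels != c).nonzero(as_tuple=True)[0]
            if len(pos_idx) < 2 or len(neg_idx) == 0:
                continue
            pos_sim = sim[pos_idx][:, pos_idx]
            pos_sim = pos_sim[~torch.eye(len(pos_idx), dtype=torch.bool, device=sim.device)]
            hardest_pos = pos_sim.min()                           # least-similar positive pair of this class
            hardest_neg = sim[pos_idx][:, neg_idx].max()          # most-similar cross-class pair
            losses.append(F.softplus(hardest_neg - hardest_pos + margin))  # smooth hinge
        return torch.stack(losses).mean()

    f = torch.randn(8, 128, requires_grad=True)
    y = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
    print(sparse_pairwise_loss(f, y))
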
Abstract: In this paper, we present TriDet, a one-stage framework for temporal action detection. Existing methods often suffer from imprecise boundary predictions due to ambiguous action boundaries in videos. To alleviate this problem, we propose a novel Trident-head that models each action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer to mitigate the rank-loss problem of self-attention on video features and to aggregate information across different temporal granularities. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks (THUMOS14, HACS, and EPIC-KITCHENS 100) with lower computational cost than previous methods. For example, TriDet reaches an average mAP of 69.3% on THUMOS14, outperforming the previous best by 2.5% while requiring only 74.6% of its latency. The code is released at https://github.com/sssste/TriDet.
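
A hedged sketch of boundary modeling via a relative probability distribution: the head predicts logits over a small set of candidate offsets per boundary and regresses the expectation. The bin count and head layout are assumptions, not TriDet's exact Trident-head:

    # Illustrative distribution-style boundary head: expected offset under a predicted categorical distribution.
    import torch
    import torch.nn as nn

    class DistributionBoundaryHead(nn.Module):
        def __init__(self, dim=256, num_bins=16):
            super().__init__()
            self.num_bins = num_bins
            self.logits = nn.Conv1d(dim, 2 * num_bins, kernel_size=3, padding=1)  # start / end bins

        def forward(self, feats):
            # feats: (B, dim, T) temporal features; returns expected start/end offsets, shape (B, 2, T)
            p = self.logits(feats).view(feats.size(0), 2, self.num_bins, -1).softmax(dim=2)
            bins = torch.arange(self.num_bins, dtype=feats.dtype, device=feats.device).view(1, 1, -1, 1)
            return (p * bins).sum(dim=2)   # soft, differentiable boundary offsets in "bin" units

    x = torch.randn(2, 256, 100)
    print(DistributionBoundaryHead()(x).shape)  # torch.Size([2, 2, 100])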




Abstract: The deployment of 3D detectors poses one of the major challenges in real-world self-driving scenarios. Existing Bird's-Eye-View (BEV) based detectors favor sparse convolution (SPConv) to speed up training and inference, which puts up a hard barrier for deployment, especially for on-device applications. In this paper, we tackle the problem of efficient 3D object detection from LiDAR point clouds with deployment in mind. To reduce the computational burden, we propose a pillar-based 3D detector with high performance from an industry perspective, termed FastPillars. Compared with previous methods, we introduce a more effective Max-and-Attention Pillar Encoding (MAPE) module and redesign a powerful yet lightweight backbone, CRVNet, imbued with the Cross Stage Partial (CSP) network in a reparameterization style, forming a compact feature representation framework. Extensive experiments demonstrate that FastPillars surpasses state-of-the-art 3D detectors in both on-device speed and accuracy. Specifically, FastPillars can be effectively deployed through TensorRT, achieving real-time performance (24 FPS) on a single RTX 3070 Ti GPU with 64.6 mAP on the nuScenes test set. Our code is publicly available at https://github.com/StiphyJay/FastPillars.
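
A rough sketch of a max-plus-attention pillar encoder in the spirit of MAPE: each pillar's point features are reduced by both max pooling and a learned attention pooling, then fused. Layer sizes and the fusion rule are illustrative assumptions, not the FastPillars implementation:

    # Illustrative pillar encoder combining max pooling and attention pooling over points per pillar.
    import torch
    import torch.nn as nn

    class MAPE(nn.Module):
        def __init__(self, in_dim=9, out_dim=64):
            super().__init__()
            self.point_mlp = nn.Linear(in_dim, out_dim)
            self.attn = nn.Linear(out_dim, 1)

        def forward(self, points, mask):
            # points: (P, N, in_dim) points per pillar; mask: (P, N) with 1 for real points, 0 for padding
            x = torch.relu(self.point_mlp(points))                                  # (P, N, out_dim)
            x_max = x.masked_fill(mask.unsqueeze(-1) == 0, float('-inf')).max(dim=1).values
            w = self.attn(x).squeeze(-1).masked_fill(mask == 0, float('-inf')).softmax(dim=1)
            x_att = torch.einsum('pn,pnd->pd', w, x)
            return x_max + x_att                                                    # fused pillar feature, (P, out_dim)

    pts = torch.randn(128, 20, 9)
    msk = (torch.rand(128, 20) > 0.3).float()
    msk[:, 0] = 1.0                              # ensure every pillar has at least one real point
    print(MAPE()(pts, msk).shape)                # torch.Size([128, 64])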




Abstract: In person re-identification (ReID), many works explore the learning of part features to improve performance over global image features. Existing methods extract part features in an explicit manner, either using a hand-designed image division or keypoints obtained from external visual systems. In this work, we propose to learn Discriminative implicit Parts (DiPs), which are decoupled from explicit body parts. DiPs can therefore learn to extract any discriminative features that help distinguish identities, going beyond predefined body parts (such as accessories). Moreover, we propose a novel implicit position that gives a geometric interpretation for each DiP. The implicit position also serves as a learning signal to encourage DiPs to be more position-equivariant with respect to the identity in the image. Lastly, a set of attributes and auxiliary losses is introduced to further improve the learning of DiPs. Extensive experiments show that the proposed method achieves state-of-the-art performance on multiple person ReID benchmarks.
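
As a rough illustration of implicit parts, the sketch below uses learnable part queries that attend over the spatial feature map and derives each part's "implicit position" as the attention-weighted coordinate; the query count, naming, and this exact formulation are our assumptions, not the paper's design:

    # Illustrative implicit-part extraction with learnable queries and attention-weighted positions.
    import torch
    import torch.nn as nn

    class ImplicitParts(nn.Module):
        def __init__(self, dim=256, num_parts=6):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_parts, dim))

        def forward(self, fmap):
            # fmap: (B, dim, H, W) backbone features
            B, D, H, W = fmap.shape
            tokens = fmap.flatten(2).transpose(1, 2)                                        # (B, H*W, D)
            attn = (self.queries @ tokens.transpose(1, 2) / D ** 0.5).softmax(dim=-1)       # (B, P, H*W)
            parts = attn @ tokens                                                           # (B, P, D) implicit part features
            ys, xs = torch.meshgrid(torch.arange(H, dtype=fmap.dtype, device=fmap.device),
                                    torch.arange(W, dtype=fmap.dtype, device=fmap.device), indexing='ij')
            coords = torch.stack([xs, ys], dim=-1).view(1, H * W, 2)                        # (1, H*W, 2)
            positions = attn @ coords.expand(B, -1, -1)                                     # (B, P, 2) implicit positions
            return parts, positions

    parts, pos = ImplicitParts()(torch.randn(2, 256, 24, 8))
    print(parts.shape, pos.shape)  # torch.Size([2, 6, 256]) torch.Size([2, 6, 2])
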
Abstract: This is a brief technical report on our proposed method for the Multiple-Object Tracking (MOT) Challenge in Complex Environments. We treat MOT as a two-stage task consisting of human detection and trajectory matching. Specifically, we design an improved human detector and associate most of the detections to guarantee the integrity of motion trajectories. We also propose a location-wise matching matrix to obtain more accurate trajectory matching. Without any model merging, our method achieves 66.672 HOTA and 93.971 MOTA on the DanceTrack challenge dataset.
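
A minimal sketch of a location-aware association step: a cost matrix mixes (1 - IoU) with a normalized center-distance term and is solved by Hungarian assignment. The weighting and normalization are illustrative, not the report's exact location-wise matching matrix:

    # Illustrative track-detection association with an IoU + center-distance cost and Hungarian matching.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def iou(a, b):
        # a: (N, 4), b: (M, 4) boxes in (x1, y1, x2, y2)
        tl = np.maximum(a[:, None, :2], b[None, :, :2])
        br = np.minimum(a[:, None, 2:], b[None, :, 2:])
        inter = np.clip(br - tl, 0, None).prod(-1)
        area_a = (a[:, 2:] - a[:, :2]).prod(-1)[:, None]
        area_b = (b[:, 2:] - b[:, :2]).prod(-1)[None, :]
        return inter / (area_a + area_b - inter + 1e-6)

    def match(tracks, dets, img_diag=1500.0, w_loc=0.3):
        centers_t = (tracks[:, :2] + tracks[:, 2:]) / 2
        centers_d = (dets[:, :2] + dets[:, 2:]) / 2
        dist = np.linalg.norm(centers_t[:, None] - centers_d[None, :], axis=-1) / img_diag
        cost = (1 - iou(tracks, dets)) + w_loc * dist
        rows, cols = linear_sum_assignment(cost)
        return list(zip(rows.tolist(), cols.tolist()))

    tracks = np.array([[100, 100, 150, 220], [400, 120, 450, 240]], dtype=float)
    dets = np.array([[402, 118, 452, 239], [101, 103, 151, 222]], dtype=float)
    print(match(tracks, dets))  # [(0, 1), (1, 0)]
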
Abstract: Recent LSS-based multi-view 3D object detection has made tremendous progress by processing features in Bird's-Eye-View (BEV) with a convolutional detector. However, typical convolution ignores the radial symmetry of BEV features and increases the difficulty of detector optimization. To preserve this inherent property of BEV features and ease optimization, we propose an azimuth-equivariant convolution (AeConv) and an azimuth-equivariant anchor. The sampling grid of AeConv always lies along the radial direction, so it can learn azimuth-invariant BEV features. The proposed anchor enables the detection head to learn to predict azimuth-irrelevant targets. In addition, we introduce a camera-decoupled virtual depth to unify depth prediction across images with different camera intrinsics. The resulting detector is dubbed the Azimuth-equivariant Detector (AeDet). Extensive experiments on nuScenes show that AeDet achieves 62.0% NDS, surpassing recent multi-view 3D object detectors such as PETRv2 (58.2% NDS) and BEVDepth (60.0% NDS) by a large margin. Project page: https://fcjian.github.io/aedet.
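
To illustrate azimuth-aligned sampling, the sketch below rotates a 3x3 sampling grid at every BEV location so that it follows the radial direction from the center, then gathers features with grid_sample; this only mimics the spirit of AeConv and is not the released implementation:

    # Illustrative azimuth-aligned sampling, realized as explicit rotated offsets plus grid_sample.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AzimuthAlignedConv(nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.proj = nn.Conv2d(in_ch * 9, out_ch, kernel_size=1)   # fuses the 9 rotated samples
            ky, kx = torch.meshgrid(torch.arange(-1., 2.), torch.arange(-1., 2.), indexing='ij')
            self.register_buffer('offsets', torch.stack([kx, ky], dim=-1).reshape(9, 2))  # 3x3 kernel offsets

        def forward(self, x):
            B, C, H, W = x.shape
            ys, xs = torch.meshgrid(torch.arange(H, dtype=x.dtype, device=x.device),
                                    torch.arange(W, dtype=x.dtype, device=x.device), indexing='ij')
            az = torch.atan2(ys - (H - 1) / 2, xs - (W - 1) / 2)      # azimuth of each BEV cell w.r.t. the center
            cos, sin = az.cos(), az.sin()
            samples = []
            for dx, dy in self.offsets:                               # rotate every kernel offset by the azimuth
                ox, oy = cos * dx - sin * dy, sin * dx + cos * dy
                gx = (xs + ox) / (W - 1) * 2 - 1                      # normalized coordinates for grid_sample
                gy = (ys + oy) / (H - 1) * 2 - 1
                grid = torch.stack([gx, gy], dim=-1).unsqueeze(0).repeat(B, 1, 1, 1)
                samples.append(F.grid_sample(x, grid, align_corners=True))
            return self.proj(torch.cat(samples, dim=1))

    print(AzimuthAlignedConv(64, 64)(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])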




Abstract: This paper tackles an emerging and challenging vision-language task: 3D visual grounding on point clouds. Many recent works benefit from the Transformer and its well-known attention mechanism, leading to a tremendous breakthrough on this task. However, we find that they achieve this by relying on various pre-training schemes or multi-stage processing. To simplify the pipeline, we carefully investigate 3D visual grounding and summarize three fundamental problems in developing an end-to-end, high-performance model for this task. To address these problems, we introduce a novel Hierarchical Attention Model (HAM), offering multi-granularity representation and efficient augmentation for both the given texts and the multi-modal visual inputs. Extensive experimental results demonstrate the superiority of the proposed HAM model. Specifically, HAM ranks first on the large-scale ScanRefer challenge, outperforming all existing methods by a significant margin. Code will be released after acceptance.
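
A speculative sketch of multi-granularity cross-attention: text tokens attend to visual tokens at both a fine (token-level) and a coarse (pooled, scene-level) granularity before fusion. This illustrates the general idea of hierarchical attention only, not HAM's actual architecture:

    # Illustrative two-granularity cross-attention between text tokens and visual tokens.
    import torch
    import torch.nn as nn

    class HierarchicalCrossAttention(nn.Module):
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.fine = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.coarse = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.fuse = nn.Linear(2 * dim, dim)

        def forward(self, text_tokens, visual_tokens):
            # text_tokens: (B, L, D); visual_tokens: (B, N, D), e.g. point-cloud object proposals
            fine, _ = self.fine(text_tokens, visual_tokens, visual_tokens)   # token-level grounding
            pooled = visual_tokens.mean(dim=1, keepdim=True)                 # (B, 1, D) scene-level summary
            coarse, _ = self.coarse(text_tokens, pooled, pooled)             # coarse, scene-level context
            return self.fuse(torch.cat([fine, coarse], dim=-1))

    out = HierarchicalCrossAttention()(torch.randn(2, 12, 256), torch.randn(2, 64, 256))
    print(out.shape)  # torch.Size([2, 12, 256])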