Songtao Liu

Dynamic Grained Encoder for Vision Transformers

Jan 10, 2023

Lin Song, Songyang Zhang, Songtao Liu, Zeming Li, Xuming He, Hongbin Sun, Jian Sun, Nanning Zheng

Abstract: Transformers, the de facto standard for language modeling, have recently been applied to vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and reduce computational cost. Specifically, we propose a Dynamic Grained Encoder for vision transformers, which adaptively assigns a suitable number of queries to each spatial region. It thus achieves a fine-grained representation in discriminative regions while remaining highly efficient. In addition, the dynamic grained encoder is compatible with most vision transformer frameworks. Without bells and whistles, our encoder allows state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification. Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach. Code is available at https://github.com/StevenGrove/vtpack.

* Accepted by NeurIPS2021
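To make region-wise query allocation concrete, here is a minimal Python sketch; it is not the authors' vtpack code. It assumes square regions of 4 x 4 scalar tokens and candidate grains of 1 x 1, 2 x 2 and 4 x 4 queries. The helpers `pool`, `choose_grain` and `encode` are hypothetical, and the variance-threshold gate is only a stand-in for the learned gate described in the paper.

```python
from statistics import pvariance


def pool(region, g):
    """Average-pool a square n x n region into g x g cells; each cell becomes one query."""
    n = len(region)
    step = n // g
    queries = []
    for i in range(g):
        for j in range(g):
            cell = [region[r][c]
                    for r in range(i * step, (i + 1) * step)
                    for c in range(j * step, (j + 1) * step)]
            queries.append(sum(cell) / len(cell))
    return queries


def choose_grain(region, low=0.01, high=0.1):
    """Hypothetical hand-written gate: pick 1x1, 2x2 or 4x4 queries from region variance.

    The paper learns its gate end to end; this threshold rule only stands in for it.
    """
    var = pvariance([v for row in region for v in row])
    if var < low:
        return 1
    if var < high:
        return 2
    return 4


def encode(regions):
    """Return the sparse queries of every region and the total query count."""
    per_region = [pool(r, choose_grain(r)) for r in regions]
    return per_region, sum(len(q) for q in per_region)


# Toy inputs: a flat region, a high-contrast checkerboard, and a low-contrast one.
flat = [[0.5] * 4 for _ in range(4)]
checker = [[float((r + c) % 2) for c in range(4)] for r in range(4)]
mid = [[0.5 + 0.4 * ((r + c) % 2) for c in range(4)] for r in range(4)]
per_region, total = encode([flat, checker, mid])
```

Here the flat region collapses to a single query while the checkerboard keeps all 16, so the three regions need 21 queries instead of the dense 48. The query count, and with it the attention cost, follows image content rather than a fixed grid.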

Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation

Dec 03, 2022

En Yu, Songtao Liu, Zhuoling Li, Jinrong Yang, Zeming Li, Shoudong Han, Wenbing Tao

Abstract: Although existing multi-object tracking (MOT) algorithms have achieved competitive performance on various benchmarks, almost all of them train and validate models on the same domain. The domain generalization problem of MOT has hardly been studied. To bridge this gap, we first observe that the high-level information contained in natural language is invariant across tracking domains. Based on this observation, we propose introducing natural language representation into visual MOT models to boost their domain generalization ability. However, it is infeasible to label every tracking target with a textual description. To tackle this problem, we design two modules, namely visual context prompting (VCP) and visual-language mixing (VLM). Specifically, VCP generates visual prompts based on the input frames. VLM joins the information in the generated visual prompts with the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions, which are invariant across tracking scenes. By training models on MOT17 and validating them on MOT20, we observe that the pseudo textual descriptions generated by our proposed modules improve the generalization performance of query-based trackers by large margins.

* Accepted by AAAI2023
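The VLM step can be pictured as attention over a bank of text embeddings. The sketch below is a hypothetical simplification, not the paper's module: it assumes each visual prompt and each Trackbook entry is already a fixed-length vector, scores entries by dot product, and returns the softmax-weighted average as the pseudo textual description. `pseudo_text` and the toy two-entry Trackbook are illustrative names and values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pseudo_text(visual_prompt, trackbook, temperature=1.0):
    """Hypothetical mixing step.

    Weights every Trackbook text embedding by its similarity to the visual
    prompt and returns (weighted average embedding, weights).
    """
    weights = softmax([dot(visual_prompt, t) / temperature for t in trackbook])
    dim = len(trackbook[0])
    mixed = [sum(w * t[d] for w, t in zip(weights, trackbook)) for d in range(dim)]
    return mixed, weights


trackbook = [[1.0, 0.0], [0.0, 1.0]]  # toy text embeddings
mixed, weights = pseudo_text([10.0, 0.0], trackbook)
```

Because the output is a convex combination of Trackbook entries, it stays in the text-embedding space even though no target was ever labeled with text.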

Metro: Memory-Enhanced Transformer for Retrosynthetic Planning via Reaction Tree

Sep 30, 2022

Songtao Liu, Rex Ying, Zuobai Zhang, Peilin Zhao, Jian Tang, Lu Lin, Dinghao Wu

Abstract: Retrosynthetic planning plays a critical role in drug discovery and organic chemistry. Starting from a target molecule as the root node, it aims to find a complete reaction tree subject to the constraint that all leaf nodes belong to a set of starting materials. Multi-step reactions are crucial because they determine the production flow chart in the organic chemical industry. However, existing datasets lack curated tree-structured multi-step reactions and fail to provide such reaction trees, limiting models' understanding of organic molecule transformations. In this work, we first develop a benchmark curated for the retrosynthetic planning task, which consists of 124,869 reaction trees retrieved from the public USPTO-full dataset. On top of that, we propose Metro: Memory-Enhanced Transformer for RetrOsynthetic planning. Specifically, the dependency among molecules in the reaction tree is captured as context information for multi-step retrosynthesis predictions through transformers with a memory module. Extensive experiments show that Metro outperforms existing single-step retrosynthesis models by at least 10.7% in top-1 accuracy, demonstrating the benefit of exploiting context information in the retrosynthetic planning task. Moreover, the proposed model can be used directly for synthetic accessibility analysis, as it is trained on reaction trees with the shortest depths. Our work is a first step towards a new formulation of retrosynthetic planning in terms of data construction, model design, and evaluation. Code is available at https://github.com/SongtaoLiu0823/metro.
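The planning target described above, a reaction tree whose leaves must all be starting materials, maps onto a simple recursive data structure. This Python sketch only illustrates that definition and is not Metro's code. `MoleculeNode`, `is_complete` and `depth` are hypothetical names, and the one-step aspirin example (salicylic acid plus acetic anhydride, with the acetic acid by-product omitted) is a textbook reaction chosen for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MoleculeNode:
    """A molecule in a reaction tree; `reactants` are the children it is made from."""
    smiles: str
    reactants: list[MoleculeNode] = field(default_factory=list)


def leaves(node: MoleculeNode) -> list[str]:
    """SMILES of every leaf, i.e. molecules that are not synthesized further."""
    if not node.reactants:
        return [node.smiles]
    out: list[str] = []
    for child in node.reactants:
        out.extend(leaves(child))
    return out


def is_complete(root: MoleculeNode, starting_materials: set[str]) -> bool:
    """A route is complete only if every leaf is an available starting material."""
    return all(s in starting_materials for s in leaves(root))


def depth(node: MoleculeNode) -> int:
    """Number of reaction steps on the longest root-to-leaf path."""
    if not node.reactants:
        return 0
    return 1 + max(depth(c) for c in node.reactants)


# Aspirin (root) from salicylic acid and acetic anhydride: a one-step tree.
salicylic_acid = MoleculeNode("O=C(O)c1ccccc1O")
acetic_anhydride = MoleculeNode("CC(=O)OC(C)=O")
aspirin = MoleculeNode("CC(=O)Oc1ccccc1C(=O)O", [salicylic_acid, acetic_anhydride])
```

A route whose leaves are not all purchasable is rejected by `is_complete`, and `depth` gives the quantity that the abstract says the training trees minimize.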

How Powerful is Implicit Denoising in Graph Neural Networks

Sep 29, 2022

Songtao Liu, Rex Ying, Hanze Dong, Lu Lin, Jinghui Chen, Dinghao Wu

Abstract: Graph Neural Networks (GNNs), which aggregate features from neighbors, are widely used for processing graph-structured data due to their powerful representation learning capabilities. It is generally believed that GNNs can implicitly remove non-predictive noise. However, the analysis of the implicit denoising effect in graph neural networks remains open. In this work, we conduct a comprehensive theoretical study and analyze when and why implicit denoising happens in GNNs. Specifically, we study the convergence properties of the noise matrix. Our theoretical analysis suggests that implicit denoising largely depends on the connectivity, the graph size, and the GNN architecture. Moreover, we formally define and propose the adversarial graph signal denoising (AGSD) problem by extending the graph signal denoising problem. By solving this problem, we derive a robust graph convolution that enhances both the smoothness of the node representations and the implicit denoising effect. Extensive empirical evaluations verify our theoretical analyses and the effectiveness of our proposed model.
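A toy numerical check of the connectivity claim: if each node carries independent zero-mean Gaussian noise and one propagation step averages a node with its d neighbors (self-loop included), the noise variance drops to about 1/(d+1) of its original value. The ring graph, the mean aggregator and the helper names are assumptions made for illustration; this is not the paper's analysis or its robust graph convolution.

```python
import random
from statistics import pvariance


def ring_neighbors(n, k):
    """Neighbors of each node in a ring where node i links to the k nodes on each side."""
    return [[(i + o) % n for o in range(-k, k + 1) if o != 0] for i in range(n)]


def mean_aggregate(x, neighbors):
    """One GNN-style propagation step: average each node with its neighbors (self-loop included)."""
    return [(x[i] + sum(x[j] for j in nbrs)) / (len(nbrs) + 1)
            for i, nbrs in enumerate(neighbors)]


random.seed(0)
n = 2000
noise = [random.gauss(0.0, 1.0) for _ in range(n)]  # pure noise signal, variance 1
raw_var = pvariance(noise)
var_k2 = pvariance(mean_aggregate(noise, ring_neighbors(n, 2)))    # degree 4
var_k10 = pvariance(mean_aggregate(noise, ring_neighbors(n, 10)))  # degree 20
```

The expected variances are about 1, 1/5 and 1/21, so better-connected nodes average away more noise in a single step. The paper's analysis goes further and also accounts for graph size and architecture.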

DBQ-SSD: Dynamic Ball Query for Efficient 3D Object Detection

Jul 22, 2022

Jinrong Yang, Lin Song, Songtao Liu, Zeming Li, Xiaoping Li, Hongbin Sun, Jian Sun, Nanning Zheng

Abstract: Many point-based 3D detectors adopt point-feature sampling strategies that drop some points for efficient inference. These strategies are typically based on fixed, handcrafted rules, which makes it difficult to handle complicated scenes. In contrast, we propose a Dynamic Ball Query (DBQ) network that adaptively selects a subset of input points according to the input features and assigns each selected point a feature transform with a suitable receptive field. It can be embedded into several state-of-the-art 3D detectors and trained end to end, significantly reducing computational cost. Extensive experiments demonstrate that our method reduces latency by 30%-60% on the KITTI and Waymo datasets. Specifically, our detector reaches inference speeds of 162 FPS on KITTI and 30 FPS on Waymo with negligible performance degradation.
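For reference, a plain ball query groups the points that fall within a fixed radius of each query point. The Python sketch below contrasts it with a hypothetical dynamic variant in which some query points are skipped and each remaining one gets its own radius. In DBQ both decisions are predicted from point features by the network; here they are passed in as `keep` and `radius_choice`, which are illustrative names, not the paper's API.

```python
import math


def ball_query(points, center, radius, max_samples):
    """Standard fixed-radius ball query: indices of up to max_samples points within radius."""
    idx = []
    for i, p in enumerate(points):
        if math.dist(p, center) <= radius:
            idx.append(i)
            if len(idx) == max_samples:
                break
    return idx


def dynamic_ball_query(points, keep, radius_choice, radii, max_samples=16):
    """Hypothetical dynamic variant.

    keep[i] says whether point i is used as a query at all; radius_choice[i]
    indexes into radii to pick its receptive field. Dropped points can still
    appear as neighbors of kept ones.
    """
    groups = {}
    for i, p in enumerate(points):
        if not keep[i]:
            continue  # dropped query: no grouping or feature transform is computed for it
        groups[i] = ball_query(points, p, radii[radius_choice[i]], max_samples)
    return groups


points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
groups = dynamic_ball_query(points, keep=[True, False, True, False],
                            radius_choice=[0, 0, 1, 0], radii=(1.0, 3.5))
```

Only two of the four points trigger a grouping step, and they use different radii, which is where the savings and the adaptive receptive field come from.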

StreamYOLO: Real-time Object Detection for Streaming Perception

Jul 21, 2022

Jinrong Yang, Songtao Liu, Zeming Li, Xiaoping Li, Jian Sun

Abstract: The perception models of autonomous driving require fast, low-latency inference for safety. While existing works ignore the inevitable environmental changes that occur during processing, streaming perception combines latency and accuracy into a single metric for online video perception, guiding previous works to search for trade-offs between accuracy and speed. In this paper, we explore the performance of real-time models on this metric and endow the models with the capacity to predict the future, significantly improving the results for streaming perception. Specifically, we build a simple framework with two effective modules. One is a Dual Flow Perception (DFP) module, which consists of a dynamic flow and a static flow in parallel to capture the moving tendency and basic detection features, respectively. The other is a Trend-Aware Loss (TAL), which adaptively generates a loss weight for each object according to its moving speed. For a more realistic setting, we consider driving scenes at multiple velocities and further propose Velocity-aware streaming AP (VsAP) to jointly evaluate accuracy across them. In this setting, we design an efficient mix-velocity training strategy that guides the detector to perceive objects at any velocity. Our simple method achieves state-of-the-art performance on the Argoverse-HD dataset and improves sAP and VsAP by 4.7% and 8.2%, respectively, compared to a strong baseline, validating its effectiveness.

* Extended version of arXiv:2203.12338
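One common way to emulate a faster-moving scene from ordinary driving video is to skip frames. The sketch below assumes that mix-velocity training samples a (previous, current, future) frame triplet with a random stride, so a larger stride looks like faster motion. The paper's exact sampling scheme may differ, and `sample_training_triplet` and the stride set (1, 2, 3) are hypothetical.

```python
import random


def sample_training_triplet(num_frames, rng, strides=(1, 2, 3)):
    """Hypothetical mix-velocity sampler.

    Picks a stride s (a larger s emulates a faster-moving scene), then returns
    indices (t - s, t, t + s): the two past frames feed the detector, and the
    future frame supplies the target it must forecast.
    """
    s = rng.choice(strides)
    t = rng.randrange(s, num_frames - s)
    return t - s, t, t + s


rng = random.Random(0)
samples = [sample_training_triplet(20, rng) for _ in range(100)]
```

Mixing strides in one training run exposes a single detector to several apparent velocities, rather than training a separate model per speed.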

Dense Teacher: Dense Pseudo-Labels for Semi-supervised Object Detection

Jul 19, 2022

Hongyu Zhou, Zheng Ge, Songtao Liu, Weixin Mao, Zeming Li, Haiyan Yu, Jian Sun

Abstract: To date, the most powerful semi-supervised object detectors (SS-OD) are based on pseudo-boxes, which require a sequence of post-processing steps with fine-tuned hyper-parameters. In this work, we propose replacing the sparse pseudo-boxes with dense predictions as a unified and straightforward form of pseudo-label. Compared to pseudo-boxes, our Dense Pseudo-Label (DPL) does not involve any post-processing and thus retains richer information. We also introduce a region selection technique to highlight the key information while suppressing the noise carried by dense labels. We name our proposed SS-OD algorithm that leverages the DPL Dense Teacher. On COCO and VOC, Dense Teacher outperforms pseudo-box-based methods under various settings.

* ECCV2022
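The contrast with pseudo-boxes can be sketched as follows. The teacher's per-location scores are used directly as soft targets, with no NMS or score threshold. A simple region-selection rule keeps the top-ranked fraction of locations and zeroes the rest. The `top_ratio` value, the binary cross-entropy loss and the helper names are assumptions for illustration, not the paper's exact recipe.

```python
import math


def dense_pseudo_labels(teacher_scores, top_ratio=0.01):
    """Keep the teacher's soft scores at the top-ranked locations and zero out the rest.

    No NMS or score threshold is involved, unlike pseudo-box pipelines.
    """
    k = max(1, int(len(teacher_scores) * top_ratio))
    ranked = sorted(range(len(teacher_scores)), key=lambda i: teacher_scores[i], reverse=True)
    selected = set(ranked[:k])
    targets = [s if i in selected else 0.0 for i, s in enumerate(teacher_scores)]
    return targets, selected


def soft_bce(student_probs, targets, eps=1e-7):
    """Mean binary cross-entropy of student probabilities against soft targets."""
    total = 0.0
    for p, t in zip(student_probs, targets):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(targets)


teacher = [0.1, 0.9, 0.3, 0.7, 0.05]  # toy per-location teacher scores
targets, selected = dense_pseudo_labels(teacher, top_ratio=0.4)
```

The student is pulled toward the teacher's graded confidence at the selected locations instead of toward hard 0/1 boxes, which is the "richer information" the abstract refers to.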

Real-time Object Detection for Streaming Perception

Mar 29, 2022

Jinrong Yang, Songtao Liu, Zeming Li, Xiaoping Li, Jian Sun

Abstract: Autonomous driving requires the model to perceive the environment and (re)act with low latency for safety. While past works ignore the inevitable changes in the environment during processing, streaming perception has been proposed to combine latency and accuracy into a single metric for online video perception. In this paper, instead of searching for trade-offs between accuracy and speed like previous works, we point out that endowing real-time models with the ability to predict the future is the key to dealing with this problem. We build a simple and effective framework for streaming perception. It is equipped with a novel DualFlow Perception module (DFP), which includes dynamic and static flows to capture the moving trend and basic detection features for streaming prediction. Further, we introduce a Trend-Aware Loss (TAL) combined with a trend factor to generate adaptive weights for objects with different moving speeds. Our simple method achieves competitive performance on the Argoverse-HD dataset and improves AP by 4.9% compared to a strong baseline, validating its effectiveness. Our code will be made available at https://github.com/yancie-yjr/StreamYOLO.

* CVPR 2022 Accepted Paper (Oral)
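Streaming evaluation can be illustrated with a small simulation. Frames arrive at a fixed interval, and a detector with fixed latency always grabs the newest frame when it becomes free. Each frame timestamp is then scored against the most recent prediction that has already finished. The integer time units, the single-worker scheduling and the helper names are simplifying assumptions, not the benchmark's exact protocol.

```python
def simulate(num_frames, frame_interval, latency):
    """Return (finish_time, frame_index) for a detector that, whenever idle,
    processes the most recent frame that has already arrived."""
    outputs = []
    t, last = 0, -1
    while last < num_frames - 1:
        # Wait until a frame newer than the last processed one has arrived.
        t = max(t, (last + 1) * frame_interval)
        frame = min(t // frame_interval, num_frames - 1)
        t += latency
        outputs.append((t, frame))
        last = frame
    return outputs


def matched_frames(outputs, num_frames, frame_interval):
    """For each frame timestamp, the frame whose prediction is available at that time
    (latest output finished at or before it), or None if nothing has finished yet."""
    result = []
    for i in range(num_frames):
        ts = i * frame_interval
        done = [f for finish, f in outputs if finish <= ts]
        result.append(done[-1] if done else None)
    return result


outputs = simulate(num_frames=6, frame_interval=10, latency=15)
used = matched_frames(outputs, num_frames=6, frame_interval=10)
```

With a latency of 1.5 frame intervals, the prediction available at frame 5 was computed from frame 3, so the scene has moved on by two frames when it is scored. That lag is what predicting the future is meant to compensate for.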

Local Augmentation for Graph Neural Networks

Sep 08, 2021

Songtao Liu, Hanze Dong, Lanqing Li, Tingyang Xu, Yu Rong, Peilin Zhao, Junzhou Huang, Dinghao Wu

Abstract: Data augmentation has been widely used for image and linguistic data but remains under-explored on graph-structured data. Existing methods focus on augmenting graph data from a global perspective and largely fall into two genres: structural manipulation and adversarial training with feature noise injection. However, the structural manipulation approach suffers from information loss, while the adversarial training approach may degrade feature quality by injecting noise. In this work, we introduce local augmentation, which enhances node features using their local subgraph structures. Specifically, we model data augmentation as a feature generation process. Given the central node's feature, our local augmentation approach learns the conditional distribution of its neighbors' features and generates the neighbors' optimal features to boost the performance of downstream tasks. Based on local augmentation, we further design a novel framework, LA-GNN, which can be applied to any GNN model in a plug-and-play manner. Extensive experiments and analyses show that local augmentation consistently improves performance for various GNN architectures across a diverse set of benchmarks. Code is available at https://github.com/Soughing0823/LAGNN.

* 16 pages, 5 figures
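The plug-and-play idea, generating neighbor-like features conditioned on each node and feeding them alongside the originals, can be sketched with a deliberately crude generator. LA-GNN learns a conditional generative model; this sketch substitutes a per-dimension Gaussian fitted to neighbor-minus-center offsets. `fit_offset_gaussian` and `local_augment` are hypothetical helpers, and concatenation is one assumed way to combine the features.

```python
import random
from statistics import mean, pstdev


def fit_offset_gaussian(features, edges):
    """Fit a per-dimension Gaussian to (neighbor - center) offsets over directed edges.

    Crude stand-in for LA-GNN's learned conditional generator.
    """
    dim = len(features[0])
    offsets = [[features[j][d] - features[i][d] for d in range(dim)] for i, j in edges]
    mus = [mean(o[d] for o in offsets) for d in range(dim)]
    sigmas = [pstdev([o[d] for o in offsets]) for d in range(dim)]
    return mus, sigmas


def local_augment(features, mus, sigmas, rng):
    """Sample one generated 'neighbor' feature per node, conditioned on the node's own
    feature, and concatenate it to the original: a doubled-width input for any GNN."""
    out = []
    for x in features:
        generated = [x[d] + rng.gauss(mus[d], sigmas[d]) for d in range(len(x))]
        out.append(x + generated)
    return out


features = [[0.0], [1.0], [2.0]]  # toy 1-D node features on a path graph
edges = [(0, 1), (1, 2)]
mus, sigmas = fit_offset_gaussian(features, edges)
augmented = local_augment(features, mus, sigmas, random.Random(0))
```

Because the generator only needs node features and edges, the augmented matrix can be handed to an unchanged downstream GNN, which is the property the abstract describes as plug-and-play.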

YOLOX: Exceeding YOLO Series in 2021

Aug 06, 2021

Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, Jian Sun

Abstract: In this report, we present several empirical improvements to the YOLO series, forming a new high-performance detector: YOLOX. We switch the YOLO detector to an anchor-free manner and adopt other advanced detection techniques, i.e., a decoupled head and the leading label assignment strategy SimOTA, to achieve state-of-the-art results across a wide range of model scales: for YOLO-Nano with only 0.91M parameters and 1.08G FLOPs, we get 25.3% AP on COCO, surpassing NanoDet by 1.8% AP; for YOLOv3, one of the most widely used detectors in industry, we boost it to 47.3% AP on COCO, outperforming the current best practice by 3.0% AP; for YOLOX-L, with roughly the same number of parameters as YOLOv4-CSP and YOLOv5-L, we achieve 50.0% AP on COCO at a speed of 68.9 FPS on a Tesla V100, exceeding YOLOv5-L by 1.8% AP. Further, we won 1st place in the Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model. We hope this report can provide useful experience for developers and researchers in practical scenarios, and we also provide deployment versions supporting ONNX, TensorRT, NCNN, and OpenVINO. Source code is at https://github.com/Megvii-BaseDetection/YOLOX.
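SimOTA's dynamic-k step can be sketched in a few lines. This is a simplified Python illustration loosely following the public description of YOLOX, not its code. The cost and IoU matrices are taken as given (in YOLOX the cost combines classification and IoU terms), and the center-region candidate pre-filtering is omitted. `simota_assign` and the toy numbers are hypothetical.

```python
def simota_assign(cost, ious, candidate_k=10):
    """Simplified dynamic-k assignment.

    For each ground truth g, k = max(1, floor(sum of its top candidate_k IoUs));
    g takes its k lowest-cost anchors. An anchor claimed by several ground truths
    is kept only by the one with the lowest cost. Returns {anchor: gt}.
    """
    num_gt, num_anchor = len(cost), len(cost[0])
    matches = {}  # anchor -> list of ground truths that picked it
    for g in range(num_gt):
        q = min(candidate_k, num_anchor)
        top_ious = sorted(ious[g], reverse=True)[:q]
        k = max(1, int(sum(top_ious)))
        chosen = sorted(range(num_anchor), key=lambda a: cost[g][a])[:k]
        for a in chosen:
            matches.setdefault(a, []).append(g)
    return {a: min(gs, key=lambda g: cost[g][a]) for a, gs in matches.items()}


# Two ground truths (rows) and four anchors (columns).
ious = [[0.9, 0.8, 0.4, 0.0],
        [0.0, 0.8, 0.9, 0.5]]
cost = [[0.2, 0.5, 1.0, 3.0],
        [2.0, 0.4, 0.3, 1.5]]
assignment = simota_assign(cost, ious)
```

Both ground truths get k = 2 here, and anchor 1 is claimed by both; the lower-cost ground truth 1 keeps it, so every anchor ends up with at most one target. Deriving k from IoU mass lets well-covered objects receive more positives without a hand-tuned fixed k.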
