Yudong Li

VisorGPT: Learning Visual Prior via Generative Pre-Training

May 24, 2023
Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin Qinghong Lin, Yefeng Zheng, Linlin Shen, Mike Zheng Shou

Various stuff and things in visual data possess specific traits, which can be learned by deep neural networks and are implicitly represented as a visual prior, e.g., object location and shape, in the model. Such a prior potentially impacts many vision tasks. For example, in conditional image synthesis, spatial conditions that fail to adhere to the prior can result in visually inaccurate synthetic results. This work aims to explicitly learn the visual prior and enable customized sampling. Inspired by advances in language modeling, we propose to learn Visual prior via Generative Pre-Training, dubbed VisorGPT. By discretizing visual locations of objects, e.g., bounding boxes, human pose, and instance masks, into sequences, VisorGPT can model the visual prior through likelihood maximization. In addition, prompt engineering is investigated to unify various visual locations and enable customized sampling of sequential outputs from the learned prior. Experimental results demonstrate that VisorGPT can effectively model the visual prior, which can be employed in many vision tasks, such as customizing accurate human poses for conditional image synthesis models like ControlNet. Code will be released at https://github.com/Sierkinhane/VisorGPT.

* Project web-page: https://sierkinhane.github.io/visor-gpt/ 
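As a concrete illustration of the sequence modeling described above, the following is a minimal sketch of how object locations might be quantized and serialized into tokens for GPT-style likelihood training. The bin count, prompt tokens, and overall layout are assumptions for illustration, not VisorGPT's actual encoding.

```python
# Hypothetical sketch: quantize box coordinates into discrete bins and
# serialize a scene as a token sequence that a GPT-style model could be
# trained on with likelihood maximization. Bins, prompt tokens, and the
# layout are assumptions, not the paper's actual encoding.
def box_to_tokens(box, num_bins=512, image_size=512):
    """Quantize (x0, y0, x1, y1) pixel coordinates into discrete bins."""
    return [int(round(c / image_size * (num_bins - 1))) for c in box]

def encode_scene(category_names, boxes, num_bins=512, image_size=512):
    """Build a prompt-style sequence: task tag, then per-object tokens."""
    tokens = ["<box>"]  # assumed task-prompt token
    for name, box in zip(category_names, boxes):
        tokens.append(name)
        tokens.extend(str(t) for t in box_to_tokens(box, num_bins, image_size))
    tokens.append("<eos>")
    return " ".join(tokens)

print(encode_scene(["person", "dog"], [(24, 60, 210, 480), (230, 300, 400, 470)]))
```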

TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities

Dec 13, 2022
Zhe Zhao, Yudong Li, Cheng Hou, Jing Zhao, Rong Tian, Weijie Liu, Yiren Chen, Ningyuan Sun, Haoyan Liu, Weiquan Mao, Han Guo, Weigang Guo, Taiqiang Wu, Tao Zhu, Wenhang Shi, Chen Chen, Shan Huang, Sihong Chen, Liqun Liu, Feifei Li, Xiaoshuai Chen, Xingwu Sun, Zhanhui Kang, Xiaoyong Du, Linlin Shen, Kimmo Yan

Recently, the success of pre-training in the text domain has been fully extended to vision, audio, and cross-modal scenarios. The pre-training models proposed for different modalities show a rising trend of homogeneity in their model structures, which brings the opportunity to implement different pre-training models within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is its modular design. The toolkit uniformly divides pre-training models into five components: embedding, encoder, target embedding, decoder, and target. As almost all common modules are provided in each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new ones. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.
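To illustrate the modular composition described above, here is a small sketch that assembles an embedding, an encoder, and a target head into one model. The class names and configuration keys are illustrative assumptions, not TencentPretrain's actual API.

```python
# A minimal sketch of composing a pre-training model from interchangeable
# components (embedding, encoder, target). Names and config keys are
# illustrative assumptions, not TencentPretrain's actual API.
import torch
import torch.nn as nn

class MLMTarget(nn.Module):
    """Masked-language-modeling head: project hidden states to vocab logits."""
    def __init__(self, hidden, vocab):
        super().__init__()
        self.proj = nn.Linear(hidden, vocab)

    def forward(self, hidden_states):
        return self.proj(hidden_states)

class ComposedModel(nn.Module):
    """Embedding -> encoder -> target, each chosen independently."""
    def __init__(self, embedding, encoder, target):
        super().__init__()
        self.embedding, self.encoder, self.target = embedding, encoder, target

    def forward(self, token_ids):
        return self.target(self.encoder(self.embedding(token_ids)))

config = {"vocab": 30522, "hidden": 256, "layers": 4}
model = ComposedModel(
    nn.Embedding(config["vocab"], config["hidden"]),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(config["hidden"], nhead=8, batch_first=True),
        num_layers=config["layers"],
    ),
    MLMTarget(config["vocab"] if False else config["hidden"], config["vocab"]),
)
logits = model(torch.randint(0, config["vocab"], (2, 16)))  # (batch, seq, vocab)
```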

Spikeformer: A Novel Architecture for Training High-Performance Low-Latency Spiking Neural Network

Nov 19, 2022
Yudong Li, Yunlin Lei, Xu Yang

Spiking neural networks (SNNs) have made great progress in both performance and efficiency over the last few years, but their unique working pattern makes it hard to train a high-performance, low-latency SNN. Thus, the development of SNNs still lags behind that of traditional artificial neural networks (ANNs). To compensate for this gap, many extraordinary works have been proposed. Nevertheless, these works are mainly based on the same kind of network structure (i.e., CNNs) and their performance is worse than that of their ANN counterparts, which limits the applications of SNNs. To this end, we propose a novel Transformer-based SNN, termed "Spikeformer", which outperforms its ANN counterpart on both static and neuromorphic datasets and may be an alternative architecture to CNNs for training high-performance SNNs. First, to deal with the "data hungry" problem and the unstable training period exhibited in the vanilla model, we design the Convolutional Tokenizer (CT) module, which improves the accuracy of the original model on DVS-Gesture by more than 16%. Besides, to better incorporate the attention mechanism inside the Transformer and the spatio-temporal information inherent to SNNs, we adopt spatio-temporal attention (STA) instead of spatial-wise or temporal-wise attention. With our proposed method, we achieve competitive or state-of-the-art (SOTA) SNN performance on the DVS-CIFAR10, DVS-Gesture, and ImageNet datasets with the fewest simulation time steps (i.e., low latency). Remarkably, our Spikeformer outperforms other SNNs on ImageNet by a large margin (i.e., more than 5%) and even outperforms its ANN counterpart by 3.1% and 2.2% on DVS-Gesture and ImageNet, respectively, indicating that Spikeformer is a promising architecture for training large-scale SNNs and may be more suitable for SNNs than CNNs. We believe that this work shall keep the development of SNNs in step with ANNs as much as possible. Code will be available.
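The following is an illustrative sketch of the distinction drawn above between spatial-wise attention (applied independently at each time step) and joint spatio-temporal attention over all time-space tokens. The shapes and the plain multi-head attention module are assumptions, not the paper's actual Spikeformer blocks.

```python
# Illustrative contrast between spatial-wise attention (per time step) and
# joint spatio-temporal attention (over all time-space tokens). Shapes and
# the attention module are assumptions, not the Spikeformer implementation.
import torch
import torch.nn as nn

T, B, N, D = 4, 2, 49, 64            # time steps, batch, spatial tokens, feature dim
x = torch.randn(T, B, N, D)          # per-time-step token features
attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

# Spatial-wise attention: each time step attends only within its own frame.
spatial_out = torch.stack([attn(x[t], x[t], x[t])[0] for t in range(T)])

# Spatio-temporal attention: fold time into the token axis so every token
# can attend across both space and time.
xt = x.permute(1, 0, 2, 3).reshape(B, T * N, D)
st_out, _ = attn(xt, xt, xt)
print(spatial_out.shape, st_out.shape)  # (T, B, N, D) and (B, T*N, D)
```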

Tenrec: A Large-scale Multipurpose Benchmark Dataset for Recommender Systems

Oct 20, 2022
Guanghu Yuan, Fajie Yuan, Yudong Li, Beibei Kong, Shujie Li, Lei Chen, Min Yang, Chenyun YU, Bo Hu, Zang Li, Yu Xu, Xiaohu Qie

Existing benchmark datasets for recommender systems (RS) are either created at a small scale or involve very limited forms of user feedback. RS models evaluated on such datasets often lack practical value for large-scale real-world applications. In this paper, we describe Tenrec, a novel and publicly available data collection for RS that records various forms of user feedback from four different recommendation scenarios. To be specific, Tenrec has the following five characteristics: (1) it is large-scale, containing around 5 million users and 140 million interactions; (2) it has not only positive user feedback but also true negative feedback (vs. one-class recommendation); (3) it contains overlapping users and items across the four scenarios; (4) it contains various types of positive user feedback, in the form of clicks, likes, shares, follows, etc.; (5) it contains additional features beyond the user IDs and item IDs. We verify Tenrec on ten diverse recommendation tasks by running several classical baseline models per task. Tenrec has the potential to become a useful benchmark dataset for a majority of popular recommendation tasks.
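As an illustration of working with the kind of multi-scenario, multi-feedback logs described above, here is a hypothetical sketch that carves out a click-prediction task and finds users overlapping across scenarios. The column names and scenario labels are assumptions, not Tenrec's actual schema.

```python
# Hypothetical sketch of slicing multi-feedback interaction logs into
# tasks. Column names and scenario labels are assumptions for
# illustration, not Tenrec's actual schema.
import pandas as pd

logs = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "item_id":  [10, 11, 10, 12],
    "scenario": ["news", "news", "video", "news"],
    "click":    [1, 0, 1, 1],   # explicit negatives (click = 0) are kept
    "like":     [0, 0, 1, 0],
})

# CTR-style task: keep one scenario and use clicks, including true negatives.
ctr = logs.loc[logs["scenario"] == "news", ["user_id", "item_id", "click"]]

# Transfer-learning task: users who appear in more than one scenario.
scenario_counts = logs.groupby("user_id")["scenario"].nunique()
cross_scenario_users = scenario_counts[scenario_counts > 1].index.tolist()
print(ctr, cross_scenario_users, sep="\n")
```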

CSL: A Large-scale Chinese Scientific Literature Dataset

Sep 12, 2022
Yudong Li, Yuqing Zhang, Zhe Zhao, Linlin Shen, Weijie Liu, Weiquan Mao, Hui Zhang

Scientific literature serves as a high-quality corpus that supports a great deal of Natural Language Processing (NLP) research. However, existing datasets are centered around the English language, which restricts the development of Chinese scientific NLP. In this work, we present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords, and academic fields of 396k papers. To our knowledge, CSL is the first scientific document dataset in Chinese. CSL can serve as a Chinese corpus. Moreover, this semi-structured data provides natural annotations that can support many supervised NLP tasks. Based on CSL, we present a benchmark to evaluate the performance of models across scientific-domain tasks, i.e., summarization, keyword generation, and text classification. We analyze the behavior of existing text-to-text models on the evaluation tasks and reveal the challenges of Chinese scientific NLP, which provides a valuable reference for future research. Data and code are available at https://github.com/ydli-ai/CSL

* to be published in COLING 2022 
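To show how semi-structured records like those described above can act as natural annotations, here is a hedged sketch that turns one record into text-to-text examples for summarization, keyword generation, and classification. The field names and prompt prefixes are illustrative assumptions, not the released data format.

```python
# Hedged sketch: turn one semi-structured paper record into text-to-text
# training examples. Field names and prompt prefixes are illustrative
# assumptions, not CSL's released format.
paper = {
    "title":    "An Example Title About Example Things",
    "abstract": "This paper studies an example problem and reports example results.",
    "keywords": ["example", "dataset"],
    "field":    "Computer Science",
}

def make_examples(p):
    return [
        # Summarization: abstract -> title
        {"input": "summarize: " + p["abstract"], "target": p["title"]},
        # Keyword generation: abstract -> comma-joined keywords
        {"input": "keywords: " + p["abstract"], "target": ", ".join(p["keywords"])},
        # Text classification: title -> academic field
        {"input": "classify: " + p["title"], "target": p["field"]},
    ]

for example in make_examples(paper):
    print(example)
```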

RODNet: A Real-Time Radar Object Detection Network Cross-Supervised by Camera-Radar Fused Object 3D Localization

Feb 09, 2021
Yizhou Wang, Zhongyu Jiang, Yudong Li, Jenq-Neng Hwang, Guanbin Xing, Hui Liu

Various autonomous or assisted driving strategies have been facilitated by accurate and reliable perception of the environment around a vehicle. Among the commonly used sensors, radar has usually been considered a robust and cost-effective solution even in adverse driving scenarios, e.g., weak/strong lighting or bad weather. Instead of fusing potentially unreliable information from all available sensors, perception from pure radar data becomes a valuable alternative that is worth exploring. In this paper, we propose a deep radar object detection network, named RODNet, which is cross-supervised by a camera-radar fused algorithm without laborious annotation efforts, to effectively detect objects from radio frequency (RF) images in real time. First, the raw signals captured by millimeter-wave radars are transformed into RF images in range-azimuth coordinates. Second, our proposed RODNet takes a sequence of RF images as input to predict the likelihood of objects in the radar field of view (FoV). Two customized modules are also added to handle multi-chirp information and relative object motion. Instead of using human-labeled ground truth for training, the proposed RODNet is cross-supervised by a novel 3D localization of detected objects using a camera-radar fusion (CRF) strategy in the training stage. Finally, we propose a method to evaluate the object detection performance of the RODNet. Because no public dataset is available for our task, we create a new dataset, named CRUW, which contains synchronized RGB and RF image sequences in various driving scenarios. With intensive experiments, our proposed cross-supervised RODNet achieves 86% average precision and 88% average recall in object detection, which shows its robustness to noisy scenarios in various driving conditions.

* IEEE Journal of Selected Topics in Signal Processing Special Issue on Recent Advances in Automotive Radar Signal Processing. arXiv admin note: text overlap with arXiv:2003.01816 
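The range-azimuth transform mentioned above can be sketched as a pair of FFTs: a range FFT over fast-time samples and an angle FFT over the receive antennas. The array sizes, the lack of windowing, and the synthetic frame below are illustrative assumptions, not the paper's exact preprocessing.

```python
# Rough sketch of turning one raw radar frame into a range-azimuth RF
# "image": range FFT over fast-time samples, angle FFT over RX antennas.
# Sizes and the synthetic frame are assumptions, not the exact pipeline.
import numpy as np

samples, antennas = 256, 8                     # fast-time samples x RX antennas
frame = np.random.randn(samples, antennas) + 1j * np.random.randn(samples, antennas)

range_fft = np.fft.fft(frame, axis=0)                                      # range bins
angle_fft = np.fft.fftshift(np.fft.fft(range_fft, n=128, axis=1), axes=1)  # azimuth bins
rf_image  = np.abs(angle_fft)                                              # range x azimuth magnitude
print(rf_image.shape)                                                      # (256, 128)
```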

CLUE: A Chinese Language Understanding Evaluation Benchmark

Apr 14, 2020
Liang Xu, Xuanwei Zhang, Lu Li, Hai Hu, Chenjie Cao, Weitang Liu, Junyi Li, Yudong Li, Kai Sun, Yechen Xu, Yiming Cui, Cong Yu, Qianqian Dong, Yin Tian, Dian Yu, Bo Shi, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Zhenzhong Lan

We introduce CLUE, a Chinese Language Understanding Evaluation benchmark. It contains eight different tasks, including single-sentence classification, sentence-pair classification, and machine reading comprehension. We evaluate a number of existing full-network pre-trained models for Chinese on CLUE. We also include a small hand-crafted diagnostic test set designed to probe specific linguistic phenomena using different models, some of which are unique to Chinese. Along with CLUE, we release a large clean crawled raw-text corpus that can be used for model pre-training. We release CLUE, the baselines, and the pre-training dataset on GitHub.

* 9 pages, 4 figures 
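For multi-task benchmarks like the one above, per-task metrics are typically aggregated into a single score. The toy sketch below averages made-up task scores; it is not CLUE's official scoring script, and the task names are placeholders.

```python
# Toy sketch of aggregating per-task metrics into a single benchmark
# score. Task names and numbers are placeholders, not CLUE results.
task_scores = {
    "single_sentence_classification": 0.71,
    "sentence_pair_classification":   0.68,
    "machine_reading_comprehension":  0.74,
}
overall = sum(task_scores.values()) / len(task_scores)
print(f"average score: {overall:.3f}")
```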