Yuhong Li

Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models

Nov 20, 2022
Xichen Pan, Pengda Qin, Yuhong Li, Hui Xue, Wenhu Chen

Conditioned diffusion models have demonstrated state-of-the-art text-to-image synthesis capacity. Most recent works, however, focus on synthesizing independent images, while in real-world applications it is common and necessary to generate a series of coherent images for storytelling. In this work, we focus on the story visualization and continuation tasks and propose AR-LDM, a latent diffusion model auto-regressively conditioned on history captions and generated images. Moreover, AR-LDM can generalize to new characters through adaptation. To the best of our knowledge, this is the first work to successfully leverage diffusion models for coherent visual story synthesis. Quantitative results show that AR-LDM achieves SoTA FID scores on PororoSV, FlintstonesSV, and the newly introduced, challenging VIST dataset containing natural images. Large-scale human evaluations show that AR-LDM has superior performance in terms of quality, relevance, and consistency.
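
The auto-regressive conditioning can be pictured with a toy loop in which each frame's denoiser sees the pooled history of caption and image features. A minimal sketch under that reading of the abstract; the real AR-LDM uses a latent-diffusion U-Net with multimodal encoders, and all module names and dimensions below are illustrative stand-ins:

```python
import torch
import torch.nn as nn

class ToyARLDM(nn.Module):
    """Illustrative stand-in for AR-LDM's auto-regressive conditioning:
    each frame is generated conditioned on the running history of caption
    features and previously generated frame latents. Not the paper's
    actual architecture."""

    def __init__(self, text_dim=32, latent_dim=16, ctx_dim=48):
        super().__init__()
        self.latent_dim = latent_dim
        self.text_proj = nn.Linear(text_dim, ctx_dim)
        self.img_proj = nn.Linear(latent_dim, ctx_dim)
        self.denoiser = nn.Sequential(          # stand-in for a diffusion U-Net
            nn.Linear(latent_dim + ctx_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )

    def generate_story(self, caption_embs):
        """caption_embs: (T, text_dim), one embedding per story sentence."""
        history, frames = [], []
        for cap in caption_embs:
            history.append(self.text_proj(cap))
            ctx = torch.stack(history).mean(dim=0)   # pool caption + image history
            z = torch.randn(self.latent_dim)         # noise latent (1-step "diffusion")
            z = self.denoiser(torch.cat([z, ctx]))   # denoise conditioned on history
            history.append(self.img_proj(z))         # generated frame joins the history
            frames.append(z)
        return torch.stack(frames)                   # (T, latent_dim)

model = ToyARLDM()
story_latents = model.generate_story(torch.randn(5, 32))  # a 5-frame story
print(story_latents.shape)  # torch.Size([5, 16])
```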

* Technical Report 

Extensible Proxy for Efficient NAS

Oct 17, 2022
Yuhong Li, Jiajie Li, Cong Han, Pan Li, Jinjun Xiong, Deming Chen

Neural Architecture Search (NAS) has become a de facto approach in the recent trend of AutoML to design deep neural networks (DNNs). Efficient or near-zero-cost NAS proxies have been proposed to address the demanding computational cost of NAS, where each candidate architecture requires only one iteration of backpropagation. The values obtained from such proxies are treated as predictions of architecture performance on downstream tasks. However, two significant drawbacks hinder the wider use of efficient NAS proxies: (1) they are not adaptive to various search spaces, and (2) they are not extensible to multi-modality downstream tasks. Based on these observations, we design an Extensible proxy (Eproxy) that utilizes self-supervised, few-shot training (i.e., 10 iterations of backpropagation), which yields near-zero cost. The key component that makes Eproxy efficient is an untrainable convolution layer, termed the barrier layer, which adds non-linearities to the optimization space so that Eproxy can discriminate among architectures' performance at an early stage. Furthermore, to make Eproxy adaptive to different downstream tasks/search spaces, we propose Discrete Proxy Search (DPS) to find optimized training settings for Eproxy with only a handful of benchmarked architectures on the target tasks. Our extensive experiments confirm the effectiveness of both Eproxy and Eproxy+DPS. Code is available at https://github.com/leeyeehoo/GenNAS-Zero.
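
To illustrate the mechanism described above, here is a rough sketch of a frozen "barrier layer" inside a few-shot regression proxy. The target signal, layer sizes, and training settings are placeholders, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

class BarrierProxyHead(nn.Module):
    """Sketch of an Eproxy-style head: a frozen random conv (the "barrier
    layer") adds non-linearity to the regression landscape so few-shot
    training can separate good architectures from bad ones."""

    def __init__(self, in_ch=16, hidden=32):
        super().__init__()
        self.barrier = nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1)
        for p in self.barrier.parameters():
            p.requires_grad = False              # untrainable by design
        self.head = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, feats):
        return self.head(torch.tanh(self.barrier(feats)))

def eproxy_score(backbone, target, steps=10, lr=1e-2):
    """Few-shot proxy: ~10 backprop iterations regressing a (here random,
    in practice self-supervised) target; lower final loss is taken as
    higher predicted architecture quality."""
    head = BarrierProxyHead()
    params = list(backbone.parameters()) + list(head.head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    x = torch.randn(8, 3, 32, 32)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(head(backbone(x)), target)
        loss.backward()
        opt.step()
    return -loss.item()   # negate so that higher score = better

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
print(eproxy_score(backbone, torch.randn(8, 1, 32, 32)))
```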


What Makes Convolutional Models Great on Long Sequence Modeling?

Oct 17, 2022
Yuhong Li, Tianle Cai, Yi Zhang, Deming Chen, Debadeepta Dey

Convolutional models have been widely used in multiple domains. However, most existing models use only local convolution, making them unable to handle long-range dependencies efficiently. Attention overcomes this problem by aggregating global information, but also makes the computational complexity quadratic in the sequence length. Recently, Gu et al. [2021] proposed a model called S4, inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. It can model much longer sequences than Transformers and achieves significant gains over the SoTA on several long-range tasks. Despite its empirical success, S4 is involved: it requires sophisticated parameterization and initialization schemes, making it less intuitive and hard to use. Here we aim to demystify S4 and extract the basic principles behind its success as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles, enjoyed by S4, that suffice to make up an effective global convolutional model: 1) the parameterization of the convolutional kernel must be efficient, in the sense that the number of parameters scales sub-linearly with sequence length; 2) the kernel must have a decaying structure in which the weights for convolving with closer neighbors are larger than those for more distant ones. Based on these two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance on several tasks: 1) with faster speed, SGConv surpasses S4 on the Long Range Arena and Speech Command datasets; 2) when plugged into standard language and vision models, it shows the potential to improve both efficiency and performance.
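
The two principles translate fairly directly into code: build the long kernel from a few small sub-kernels upsampled to cover doubling spans (a sub-linear parameter count), scale each span by a decaying factor, then convolve via FFT. A minimal sketch under those assumptions; the decay rate, sub-kernel size, and initialization below are placeholders, not the published SGConv settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSGConv(nn.Module):
    """Sketch of SGConv's two principles: O(log L) parameters (small
    sub-kernels upsampled to cover doubling spans) and an explicit decay
    so nearby positions receive larger weights."""

    def __init__(self, seq_len=1024, dim=1, sub_len=16, decay=0.5):
        super().__init__()
        self.seq_len = seq_len
        n_scales = (seq_len // sub_len).bit_length()
        # sub-linear: n_scales * sub_len parameters for a length-L kernel
        self.sub_kernels = nn.Parameter(torch.randn(n_scales, dim, sub_len) * 0.02)
        self.decay = decay

    def kernel(self):
        pieces = []
        for i, k in enumerate(self.sub_kernels):
            k = F.interpolate(k.unsqueeze(0), scale_factor=2 ** i,
                              mode="linear", align_corners=False)[0]
            pieces.append(k * self.decay ** i)   # farther span, smaller weight
        return torch.cat(pieces, dim=-1)[..., :self.seq_len]

    def forward(self, x):
        # x: (batch, dim, seq_len); global convolution in O(L log L) via FFT
        L = x.size(-1)
        k_f = torch.fft.rfft(self.kernel(), n=2 * L)
        x_f = torch.fft.rfft(x, n=2 * L)
        return torch.fft.irfft(x_f * k_f, n=2 * L)[..., :L]

conv = SimpleSGConv()
print(conv(torch.randn(4, 1, 1024)).shape)  # torch.Size([4, 1, 1024])
```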

* The code is available at https://github.com/ctlllll/SGConv 

Efficient Machine Learning, Compilers, and Optimizations for Embedded Systems

Jun 06, 2022
Xiaofan Zhang, Yao Chen, Cong Hao, Sitao Huang, Yuhong Li, Deming Chen

Deep Neural Networks (DNNs) have achieved great success in a massive number of artificial intelligence (AI) applications by delivering high-quality computer vision, natural language processing, and virtual reality capabilities. However, these emerging AI applications also come with increasing computation and memory demands, which are especially challenging to handle on embedded systems with their limited computation/memory resources, tight power budgets, and small form factors. Challenges also come from diverse application-specific requirements, including real-time responses, high-throughput performance, and reliable inference accuracy. To address these challenges, this book chapter introduces a series of effective design methods enabling efficient algorithms, compilers, and various optimizations for embedded systems.

* This article will appear as a book chapter in a new book: Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, Springer Nature 

Generic Neural Architecture Search via Regression

Aug 04, 2021
Yuhong Li, Cong Hao, Pan Li, Jinjun Xiong, Deming Chen

Most existing neural architecture search (NAS) algorithms are dedicated to downstream tasks, e.g., image classification in computer vision. However, extensive experiments have shown that prominent neural architectures, such as ResNet in computer vision and LSTM in natural language processing, are generally good at extracting patterns from the input data and perform well across different downstream tasks. These observations inspire us to ask: Is it necessary to use the performance of specific downstream tasks to evaluate and search for good neural architectures? Can we perform NAS effectively and efficiently while remaining agnostic to the downstream task? In this work, we attempt to answer both questions affirmatively and improve on the state-of-the-art NAS solution by proposing a novel and generic NAS framework, termed Generic NAS (GenNAS). GenNAS does not use task-specific labels but instead adopts regression on a set of manually designed synthetic signal bases for architecture evaluation. Such a self-supervised regression task can effectively evaluate the intrinsic power of an architecture to capture and transform the input signal patterns, while making fuller use of training samples. We then propose an automatic task search that optimizes the combination of synthetic signals using limited downstream-task-specific labels, further improving GenNAS's performance. We also thoroughly evaluate GenNAS's generality and end-to-end NAS performance on all search spaces, where it outperforms almost all existing works with significant speedup.
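
As a rough illustration of the regression-based evaluation, the sketch below scores a candidate architecture by how well a few training steps let it regress simple synthetic sine-wave targets from random inputs. The basis construction and training recipe here are stand-ins, not GenNAS's actual design:

```python
import torch
import torch.nn as nn

def synthetic_targets(batch, ch, h, w, freqs=(1, 4, 16)):
    """Toy synthetic signal basis: a sum of 2-D sine waves at several
    frequencies (illustrative; GenNAS designs its bases more carefully)."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h),
                            torch.linspace(0, 1, w), indexing="ij")
    sig = torch.stack([torch.sin(2 * torch.pi * f * (xs + ys)) for f in freqs]).sum(0)
    return sig.expand(batch, ch, h, w).contiguous()

def gennas_score(backbone, out_ch=8, steps=20, lr=1e-3):
    """Score an architecture by how well it regresses the synthetic
    signals from random inputs -- no task labels needed."""
    x = torch.randn(8, 3, 32, 32)
    y = synthetic_targets(8, out_ch, 32, 32)
    head = nn.Conv2d(16, out_ch, 1)   # assumes the backbone emits 16 channels
    opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(head(backbone(x)), y)
        loss.backward()
        opt.step()
    return -loss.item()   # lower regression error = better architecture

cand = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
print(gennas_score(cand))
```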


TVDIM: Enhancing Image Self-Supervised Pretraining via Noisy Text Data

Jun 13, 2021
Pengda Qin, Yuhong Li, Kefeng Deng, Qiang Wu

Among the ubiquitous multimodal data in the real world, text is the modality generated by humans, while images reflect the physical world faithfully. In visual understanding applications, machines are expected to understand images as humans do. Inspired by this, we propose a novel self-supervised learning method, named Text-enhanced Visual Deep InfoMax (TVDIM), that learns better visual representations by fully utilizing naturally occurring multimodal data. Our core idea is to maximize, to a rational degree, the mutual information between features extracted from multiple views of a shared context. Unlike previous methods that only consider multiple views from a single modality, our work produces multiple views from different modalities and jointly optimizes the mutual information for both intra-modality and inter-modality feature pairs. Considering the information gap between inter-modality feature pairs caused by data noise, we adopt ranking-based contrastive learning to optimize the mutual information. During evaluation, we directly use the pre-trained visual representations to complete various image classification tasks. Experimental results show that TVDIM significantly outperforms previous visual self-supervised methods when processing the same set of images.
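
A hedged sketch of the training signal described above: a standard contrastive term for two intra-modality image views, plus a ranking-based term for the noisier image-text pairs that only asks each matched pair to beat mismatched pairs by a margin. The function name and hyperparameters are illustrative, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def tvdim_style_loss(img_a, img_b, txt, temp=0.1, margin=0.2):
    """img_a, img_b: two views of the same images; txt: paired text
    features. All inputs are (batch, feat_dim)."""
    b = img_a.size(0)
    img_a, img_b, txt = (F.normalize(z, dim=-1) for z in (img_a, img_b, txt))

    # intra-modality: InfoNCE-style contrastive loss between image views
    intra = F.cross_entropy(img_a @ img_b.t() / temp, torch.arange(b))

    # inter-modality: ranking loss tolerant of the image-text information gap
    sim = img_a @ txt.t()
    pos = sim.diag().unsqueeze(1)                           # matched pairs  (b, 1)
    neg = sim[~torch.eye(b, dtype=torch.bool)].view(b, -1)  # mismatched (b, b-1)
    inter = F.relu(margin + neg - pos).mean()

    return intra + inter

loss = tvdim_style_loss(torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64))
print(loss.item())
```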


InfoBehavior: Self-supervised Representation Learning for Ultra-long Behavior Sequence via Hierarchical Grouping

Jun 13, 2021
Runshi Liu, Pengda Qin, Yuhong Li, Weigao Wen, Dong Li, Kefeng Deng, Qiang Wu

E-commerce companies have to deal with abnormal sellers who sell potentially risky products. Typically, the risk can be identified by jointly considering product content (e.g., title and image) and seller behavior. This work focuses on behavior feature extraction, since behavior sequences reflect sellers' operation habits and thus provide valuable clues for risk discovery. Traditional feature extraction techniques depend heavily on domain experts and adapt poorly to new tasks. In this paper, we propose a self-supervised method, InfoBehavior, to automatically extract meaningful representations from ultra-long raw behavior sequences, replacing the costly feature selection procedure. InfoBehavior uses a bidirectional Transformer as its feature encoder due to its excellent capability for modeling long-term dependencies. However, this is intractable on commodity GPUs because the time and memory required by the Transformer grow quadratically with sequence length. Thus, we propose a hierarchical grouping strategy that aggregates ultra-long raw behavior sequences into high-level embedding sequences of manageable length. Moreover, we introduce two types of pretext task. The sequence-related pretext task defines a contrastive training objective: correctly select the masked-out coarse-grained/fine-grained behavior sequences against other "distractor" behavior sequences. The domain-related pretext task defines a classification training objective: correctly predict the domain-specific statistics of anomalous behavior. We show that behavior representations from the pre-trained InfoBehavior can be used directly or integrated with features from other side information to support a wide range of downstream tasks. Experimental results demonstrate that InfoBehavior significantly improves the performance of Product Risk Management and Intellectual Property Protection.
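
The hierarchical grouping strategy can be sketched as pooling fixed-size groups of raw events into coarse embeddings before the Transformer, so the quadratic cost applies only to the much shorter group sequence. Group size, depth, and dimensions below are illustrative guesses, not the paper's configuration:

```python
import torch
import torch.nn as nn

class HierarchicalGrouping(nn.Module):
    """Sketch: pool each fixed-size group of raw behavior events into one
    coarse embedding, then run a Transformer encoder over the short
    group sequence instead of the ultra-long event sequence."""

    def __init__(self, event_dim=32, d_model=64, group=100):
        super().__init__()
        self.group = group
        self.event_proj = nn.Linear(event_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, events):
        # events: (batch, L, event_dim), with L up to hundreds of thousands
        b, L, _ = events.shape
        L = (L // self.group) * self.group                      # drop ragged tail
        x = self.event_proj(events[:, :L])
        x = x.view(b, -1, self.group, x.size(-1)).mean(dim=2)   # (b, L/group, d)
        return self.encoder(x)             # quadratic cost only on L/group

model = HierarchicalGrouping()
print(model(torch.randn(2, 10_000, 32)).shape)  # torch.Size([2, 100, 64])
```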


Effective Algorithm-Accelerator Co-design for AI Solutions on Edge Devices

Oct 15, 2020
Cong Hao, Yao Chen, Xiaofan Zhang, Yuhong Li, Jinjun Xiong, Wen-mei Hwu, Deming Chen

High-quality AI solutions require joint optimization of AI algorithms, such as deep neural networks (DNNs), and their hardware accelerators. To improve the overall solution quality and boost design productivity, efficient algorithm-accelerator co-design methodologies are indispensable. In this paper, we first discuss the motivations and challenges of the algorithm/accelerator co-design problem and then present several effective solutions. In particular, we highlight three leading works on effective co-design methodologies: 1) the first simultaneous DNN/FPGA co-design method; 2) a bi-directional lightweight DNN and accelerator co-design method; 3) a differentiable and efficient DNN and accelerator co-search method. We demonstrate the effectiveness of the proposed co-design approaches through extensive experiments on both FPGAs and GPUs, with comparisons to existing works. This paper emphasizes the importance and efficacy of algorithm-accelerator co-design and calls for more research breakthroughs in this interesting and demanding area.

* GLSVLSI, September 7-9, 2020 

GAP++: Learning to generate target-conditioned adversarial examples

Jun 09, 2020
Xiaofeng Mao, Yuefeng Chen, Yuhong Li, Yuan He, Hui Xue

Adversarial examples are perturbed inputs that pose a serious threat to machine learning models. Finding these perturbations is hard enough that iterative search methods are usually required. For computational efficiency, recent works use adversarial generative networks to directly model the distribution of universal or image-dependent perturbations. However, these methods generate perturbations that depend only on the input image. In this work, we propose a more general-purpose framework that infers target-conditioned perturbations dependent on both the input image and the target label. Unlike previous single-target attack models, our model can conduct target-conditioned attacks by learning the relation between the attack target and the image semantics. Through extensive experiments on the MNIST and CIFAR10 datasets, we show that our method achieves performance superior to single-target attack models and obtains high fooling rates with small perturbation norms.
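
As a rough sketch of target-conditioned generation, the model below concatenates a learned label embedding onto the image channels and emits a bounded perturbation, so a single generator covers all attack targets. The architecture and epsilon bound are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class TargetConditionedGenerator(nn.Module):
    """Sketch of a GAP++-style generator: maps an image plus a target-label
    embedding to an L_inf-bounded perturbation (illustrative layers)."""

    def __init__(self, n_classes=10, emb_dim=8, eps=8 / 255):
        super().__init__()
        self.eps = eps
        self.label_emb = nn.Embedding(n_classes, emb_dim)
        self.net = nn.Sequential(
            nn.Conv2d(3 + emb_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, x, target):
        # x: (b, 3, H, W); target: (b,) desired misclassification labels
        b, _, h, w = x.shape
        cond = self.label_emb(target).view(b, -1, 1, 1).expand(b, -1, h, w)
        delta = torch.tanh(self.net(torch.cat([x, cond], dim=1)))  # in [-1, 1]
        return (x + self.eps * delta).clamp(0, 1)  # bounded adversarial image

gen = TargetConditionedGenerator()
adv = gen(torch.rand(4, 3, 32, 32), torch.tensor([1, 2, 3, 4]))
print(adv.shape)  # torch.Size([4, 3, 32, 32])
```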

* Accepted to IJCAI 2019 AIBS Workshop 