Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bo Liu

Image Processing Center, Beihang University, Beijing, China

Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts

Oct 12, 2023

Lizhang Chen, Bo Liu, Kaizhao Liang, Qiang Liu

Figure 1 for Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts

Figure 2 for Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts

Figure 3 for Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts

Figure 4 for Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts

Abstract:Lion (Evolved Sign Momentum), a new optimizer discovered through program search, has shown promising results in training large AI models. It performs comparably or favorably to AdamW but with greater memory efficiency. As we can expect from the results of a random search program, Lion incorporates elements from several existing algorithms, including signed momentum, decoupled weight decay, Polak, and Nesterov momentum, but does not fit into any existing category of theoretically grounded optimizers. Thus, even though Lion appears to perform well as a general-purpose optimizer for a wide range of tasks, its theoretical basis remains uncertain. This lack of theoretical clarity limits opportunities to further enhance and expand Lion's efficacy. This work aims to demystify Lion. Based on both continuous-time and discrete-time analysis, we demonstrate that Lion is a theoretically novel and principled approach for minimizing a general loss function $f(x)$ while enforcing a bound constraint $\|x\|_\infty \leq 1/\lambda$. Lion achieves this through the incorporation of decoupled weight decay, where $\lambda$ represents the weight decay coefficient. Our analysis is made possible by the development of a new Lyapunov function for the Lion updates. It applies to a broader family of Lion-$\kappa$ algorithms, where the $\text{sign}(\cdot)$ operator in Lion is replaced by the subgradient of a convex function $\kappa$, leading to the solution of a general composite optimization problem of $\min_x f(x) + \kappa^*(x)$. Our findings provide valuable insights into the dynamics of Lion and pave the way for further improvements and extensions of Lion-related algorithms.

* 26 pages, 6 figures

Via

Access Paper or Ask Questions

UCF-Crime Annotation: A Benchmark for Surveillance Video-and-Language Understanding

Sep 25, 2023

Tongtong Yuan, Xuange Zhang, Kun Liu, Bo Liu, Jian Jin, Zhenzhen Jiao

Abstract:Surveillance videos are an essential component of daily life with various critical applications, particularly in public security. However, current surveillance video tasks mainly focus on classifying and localizing anomalous events. Existing methods are limited to detecting and classifying the predefined events with unsatisfactory generalization ability and semantic understanding, although they have obtained considerable performance. To address this issue, we propose constructing the first multimodal surveillance video dataset by manually annotating the real-world surveillance dataset UCF-Crime with fine-grained event content and timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), provides a novel benchmark for multimodal surveillance video analysis. It not only describes events in detailed descriptions but also provides precise temporal grounding of the events in 0.1-second intervals. UCA contains 20,822 sentences, with an average length of 23 words, and its annotated videos are as long as 102 hours. Furthermore, we benchmark the state-of-the-art models of multiple multimodal tasks on this newly created dataset, including temporal sentence grounding in videos, video captioning, and dense video captioning. Through our experiments, we found that mainstream models used in previously publicly available datasets perform poorly on multimodal surveillance video scenarios, which highlights the necessity of constructing this dataset. The link to our dataset and code is provided at: https://github.com/Xuange923/UCA-dataset.

Via

Access Paper or Ask Questions

Causal-Story: Local Causal Attention Utilizing Parameter-Efficient Tuning For Visual Story Synthesis

Sep 21, 2023

Tianyi Song, Jiuxin Cao, Kun Wang, Bo Liu, Xiaofeng Zhang

Abstract:The excellent text-to-image synthesis capability of diffusion models has driven progress in synthesizing coherent visual stories. The current state-of-the-art method combines the features of historical captions, historical frames, and the current captions as conditions for generating the current frame. However, this method treats each historical frame and caption as the same contribution. It connects them in order with equal weights, ignoring that not all historical conditions are associated with the generation of the current frame. To address this issue, we propose Causal-Story. This model incorporates a local causal attention mechanism that considers the causal relationship between previous captions, frames, and current captions. By assigning weights based on this relationship, Causal-Story generates the current frame, thereby improving the global consistency of story generation. We evaluated our model on the PororoSV and FlintstonesSV datasets and obtained state-of-the-art FID scores, and the generated frames also demonstrate better storytelling in visuals.

* Submitted to ICASSP 2024

Via

Access Paper or Ask Questions

Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance

Sep 14, 2023

Lei Fan, Bo Liu, Haoxiang Li, Ying Wu, Gang Hua

Figure 1 for Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance

Figure 2 for Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance

Figure 3 for Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance

Figure 4 for Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance

Abstract:In real-world scenarios, typical visual recognition systems could fail under two major causes, i.e., the misclassification between known classes and the excusable misbehavior on unknown-class images. To tackle these deficiencies, flexible visual recognition should dynamically predict multiple classes when they are unconfident between choices and reject making predictions when the input is entirely out of the training distribution. Two challenges emerge along with this novel task. First, prediction uncertainty should be separately quantified as confusion depicting inter-class uncertainties and ignorance identifying out-of-distribution samples. Second, both confusion and ignorance should be comparable between samples to enable effective decision-making. In this paper, we propose to model these two sources of uncertainty explicitly with the theory of Subjective Logic. Regarding recognition as an evidence-collecting process, confusion is then defined as conflicting evidence, while ignorance is the absence of evidence. By predicting Dirichlet concentration parameters for singletons, comprehensive subjective opinions, including confusion and ignorance, could be achieved via further evidence combinations. Through a series of experiments on synthetic data analysis, visual recognition, and open-set detection, we demonstrate the effectiveness of our methods in quantifying two sources of uncertainties and dealing with flexible recognition.

* Accepted by ICCV23

Via

Access Paper or Ask Questions

How Good Are Large Language Models at Out-of-Distribution Detection?

Aug 23, 2023

Bo Liu, Liming Zhan, Zexin Lu, Yujie Feng, Lei Xue, Xiao-Ming Wu

Figure 1 for How Good Are Large Language Models at Out-of-Distribution Detection?

Figure 2 for How Good Are Large Language Models at Out-of-Distribution Detection?

Figure 3 for How Good Are Large Language Models at Out-of-Distribution Detection?

Figure 4 for How Good Are Large Language Models at Out-of-Distribution Detection?

Abstract:Out-of-distribution (OOD) detection plays a vital role in enhancing the reliability of machine learning (ML) models. The emergence of large language models (LLMs) has catalyzed a paradigm shift within the ML community, showcasing their exceptional capabilities across diverse natural language processing tasks. While existing research has probed OOD detection with relative small-scale Transformers like BERT, RoBERTa and GPT-2, the stark differences in scales, pre-training objectives, and inference paradigms call into question the applicability of these findings to LLMs. This paper embarks on a pioneering empirical investigation of OOD detection in the domain of LLMs, focusing on LLaMA series ranging from 7B to 65B in size. We thoroughly evaluate commonly-used OOD detectors, scrutinizing their performance in both zero-grad and fine-tuning scenarios. Notably, we alter previous discriminative in-distribution fine-tuning into generative fine-tuning, aligning the pre-training objective of LLMs with downstream tasks. Our findings unveil that a simple cosine distance OOD detector demonstrates superior efficacy, outperforming other OOD detectors. We provide an intriguing explanation for this phenomenon by highlighting the isotropic nature of the embedding spaces of LLMs, which distinctly contrasts with the anisotropic property observed in smaller BERT family models. The new insight enhances our understanding of how LLMs detect OOD data, thereby enhancing their adaptability and reliability in dynamic environments.

* Work in progress

Via

Access Paper or Ask Questions

DMCVR: Morphology-Guided Diffusion Model for 3D Cardiac Volume Reconstruction

Aug 18, 2023

Xiaoxiao He, Chaowei Tan, Ligong Han, Bo Liu, Leon Axel, Kang Li, Dimitris N. Metaxas

Abstract:Accurate 3D cardiac reconstruction from cine magnetic resonance imaging (cMRI) is crucial for improved cardiovascular disease diagnosis and understanding of the heart's motion. However, current cardiac MRI-based reconstruction technology used in clinical settings is 2D with limited through-plane resolution, resulting in low-quality reconstructed cardiac volumes. To better reconstruct 3D cardiac volumes from sparse 2D image stacks, we propose a morphology-guided diffusion model for 3D cardiac volume reconstruction, DMCVR, that synthesizes high-resolution 2D images and corresponding 3D reconstructed volumes. Our method outperforms previous approaches by conditioning the cardiac morphology on the generative model, eliminating the time-consuming iterative optimization process of the latent code, and improving generation quality. The learned latent spaces provide global semantics, local cardiac morphology and details of each 2D cMRI slice with highly interpretable value to reconstruct 3D cardiac shape. Our experiments show that DMCVR is highly effective in several aspects, such as 2D generation and 3D reconstruction performance. With DMCVR, we can produce high-resolution 3D cardiac MRI reconstructions, surpassing current techniques. Our proposed framework has great potential for improving the accuracy of cardiac disease diagnosis and treatment planning. Code can be accessed at https://github.com/hexiaoxiao-cs/DMCVR.

* Accepted in MICCAI 2023

Via

Access Paper or Ask Questions

Exploring Winograd Convolution for Cost-effective Neural Network Fault Tolerance

Aug 16, 2023

Xinghua Xue, Cheng Liu, Bo Liu, Haitong Huang, Ying Wang, Tao Luo, Lei Zhang, Huawei Li, Xiaowei Li

Figure 1 for Exploring Winograd Convolution for Cost-effective Neural Network Fault Tolerance

Figure 2 for Exploring Winograd Convolution for Cost-effective Neural Network Fault Tolerance

Figure 3 for Exploring Winograd Convolution for Cost-effective Neural Network Fault Tolerance

Figure 4 for Exploring Winograd Convolution for Cost-effective Neural Network Fault Tolerance

Abstract:Winograd is generally utilized to optimize convolution performance and computational efficiency because of the reduced multiplication operations, but the reliability issues brought by winograd are usually overlooked. In this work, we observe the great potential of winograd convolution in improving neural network (NN) fault tolerance. Based on the observation, we evaluate winograd convolution fault tolerance comprehensively from different granularities ranging from models, layers, and operation types for the first time. Then, we explore the use of inherent fault tolerance of winograd convolution for cost-effective NN protection against soft errors. Specifically, we mainly investigate how winograd convolution can be effectively incorporated with classical fault-tolerant design approaches including triple modular redundancy (TMR), fault-aware retraining, and constrained activation functions. According to our experiments, winograd convolution can reduce the fault-tolerant design overhead by 55.77\% on average without any accuracy loss compared to standard convolution, and further reduce the computing overhead by 17.24\% when the inherent fault tolerance of winograd convolution is considered. When it is applied on fault-tolerant neural networks enhanced with fault-aware retraining and constrained activation functions, the resulting model accuracy generally shows significant improvement in presence of various faults.

Via

Access Paper or Ask Questions

Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification

Jul 31, 2023

Wenhao Tang, Sheng Huang, Xiaoxian Zhang, Fengtao Zhou, Yi Zhang, Bo Liu

Figure 1 for Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification

Figure 2 for Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification

Figure 3 for Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification

Figure 4 for Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification

Abstract:The whole slide image (WSI) classification is often formulated as a multiple instance learning (MIL) problem. Since the positive tissue is only a small fraction of the gigapixel WSI, existing MIL methods intuitively focus on identifying salient instances via attention mechanisms. However, this leads to a bias towards easy-to-classify instances while neglecting hard-to-classify instances. Some literature has revealed that hard examples are beneficial for modeling a discriminative boundary accurately. By applying such an idea at the instance level, we elaborate a novel MIL framework with masked hard instance mining (MHIM-MIL), which uses a Siamese structure (Teacher-Student) with a consistency constraint to explore the potential hard instances. With several instance masking strategies based on attention scores, MHIM-MIL employs a momentum teacher to implicitly mine hard instances for training the student model, which can be any attention-based MIL model. This counter-intuitive strategy essentially enables the student to learn a better discriminating boundary. Moreover, the student is used to update the teacher with an exponential moving average (EMA), which in turn identifies new hard instances for subsequent training iterations and stabilizes the optimization. Experimental results on the CAMELYON-16 and TCGA Lung Cancer datasets demonstrate that MHIM-MIL outperforms other latest methods in terms of performance and training cost. The code is available at: https://github.com/DearCaat/MHIM-MIL.

* Accepted by ICCV2023

Via

Access Paper or Ask Questions

Click-Conversion Multi-Task Model with Position Bias Mitigation for Sponsored Search in eCommerce

Jul 29, 2023

Yibo Wang, Yanbing Xue, Bo Liu, Musen Wen, Wenting Zhao, Stephen Guo, Philip S. Yu

Figure 1 for Click-Conversion Multi-Task Model with Position Bias Mitigation for Sponsored Search in eCommerce

Figure 2 for Click-Conversion Multi-Task Model with Position Bias Mitigation for Sponsored Search in eCommerce

Figure 3 for Click-Conversion Multi-Task Model with Position Bias Mitigation for Sponsored Search in eCommerce

Figure 4 for Click-Conversion Multi-Task Model with Position Bias Mitigation for Sponsored Search in eCommerce

Abstract:Position bias, the phenomenon whereby users tend to focus on higher-ranked items of the search result list regardless of the actual relevance to queries, is prevailing in many ranking systems. Position bias in training data biases the ranking model, leading to increasingly unfair item rankings, click-through-rate (CTR), and conversion rate (CVR) predictions. To jointly mitigate position bias in both item CTR and CVR prediction, we propose two position-bias-free CTR and CVR prediction models: Position-Aware Click-Conversion (PACC) and PACC via Position Embedding (PACC-PE). PACC is built upon probability decomposition and models position information as a probability. PACC-PE utilizes neural networks to model product-specific position information as embedding. Experiments on the E-commerce sponsored product search dataset show that our proposed models have better ranking effectiveness and can greatly alleviate position bias in both CTR and CVR prediction.

* In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1884-1888. 2023
* Modified some typos of the published SIGIR version

Via

Access Paper or Ask Questions

Hierarchical Spatio-Temporal Representation Learning for Gait Recognition

Jul 19, 2023

Lei Wang, Bo Liu, Fangfang Liang, Bincheng Wang

Figure 1 for Hierarchical Spatio-Temporal Representation Learning for Gait Recognition

Figure 2 for Hierarchical Spatio-Temporal Representation Learning for Gait Recognition

Figure 3 for Hierarchical Spatio-Temporal Representation Learning for Gait Recognition

Figure 4 for Hierarchical Spatio-Temporal Representation Learning for Gait Recognition

Abstract:Gait recognition is a biometric technique that identifies individuals by their unique walking styles, which is suitable for unconstrained environments and has a wide range of applications. While current methods focus on exploiting body part-based representations, they often neglect the hierarchical dependencies between local motion patterns. In this paper, we propose a hierarchical spatio-temporal representation learning (HSTL) framework for extracting gait features from coarse to fine. Our framework starts with a hierarchical clustering analysis to recover multi-level body structures from the whole body to local details. Next, an adaptive region-based motion extractor (ARME) is designed to learn region-independent motion features. The proposed HSTL then stacks multiple ARMEs in a top-down manner, with each ARME corresponding to a specific partition level of the hierarchy. An adaptive spatio-temporal pooling (ASTP) module is used to capture gait features at different levels of detail to perform hierarchical feature mapping. Finally, a frame-level temporal aggregation (FTA) module is employed to reduce redundant information in gait sequences through multi-scale temporal downsampling. Extensive experiments on CASIA-B, OUMVLP, GREW, and Gait3D datasets demonstrate that our method outperforms the state-of-the-art while maintaining a reasonable balance between model accuracy and complexity.

* Accepted to ICCV2023

Via

Access Paper or Ask Questions