Zhuang Liu

Semantic Graph Representation Learning for Handwritten Mathematical Expression Recognition

Aug 21, 2023
Zhuang Liu, Ye Yuan, Zhilong Ji, Jingfeng Bai, Xiang Bai

Handwritten mathematical expression recognition (HMER) has attracted extensive attention recently. However, current methods cannot explicitly study the interactions between different symbols, and may therefore fail when faced with similar symbols. To alleviate this issue, we propose a simple but efficient method to enhance semantic interaction learning (SIL). Specifically, we first construct a semantic graph based on statistical symbol co-occurrence probabilities. Then we design a semantic-aware module (SAM), which projects the visual and classification features into a semantic space; the cosine distance between different projected vectors indicates the correlation between symbols. Jointly optimizing HMER and SIL explicitly enhances the model's understanding of symbol relationships. In addition, SAM can be easily plugged into existing attention-based models for HMER and consistently brings improvements. Extensive experiments on public benchmark datasets demonstrate that the proposed module effectively enhances recognition performance, and our method outperforms prior art on both the CROHME and HME100K datasets.
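Below is a minimal sketch of how a semantic-aware projection of this kind could be implemented in PyTorch. The module name, feature dimensions, and the MSE-style auxiliary objective against the co-occurrence graph are illustrative assumptions rather than the paper's exact SAM design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAwareModule(nn.Module):
    """Hypothetical sketch: project visual and classification features into a
    shared semantic space and compare their pairwise cosine similarities
    against a statistical symbol co-occurrence graph."""

    def __init__(self, feat_dim, sem_dim, num_classes):
        super().__init__()
        self.vis_proj = nn.Linear(feat_dim, sem_dim)      # visual feature -> semantic space
        self.cls_proj = nn.Linear(num_classes, sem_dim)   # classification logits -> semantic space

    def forward(self, vis_feat, cls_feat, cooccur):
        # vis_feat: (T, feat_dim), cls_feat: (T, num_classes),
        # cooccur: (T, T) target correlations from co-occurrence statistics
        v = F.normalize(self.vis_proj(vis_feat), dim=-1)
        c = F.normalize(self.cls_proj(cls_feat), dim=-1)
        sim_v = v @ v.t()          # cosine similarity between projected visual vectors
        sim_c = c @ c.t()          # cosine similarity between projected class vectors
        # auxiliary loss pulling predicted similarities toward graph statistics
        return F.mse_loss(sim_v, cooccur) + F.mse_loss(sim_c, cooccur)
```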

* 12 Pages 

A Simple and Effective Pruning Approach for Large Language Models

Jun 20, 2023
Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter

As their size increases, Large Language Models (LLMs) are natural candidates for network pruning methods: approaches that drop a subset of network weights while striving to preserve performance. Existing methods, however, require either retraining, which is rarely affordable for billion-scale LLMs, or solving a weight reconstruction problem reliant on second-order information, which may also be computationally expensive. In this paper, we introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs. Motivated by the recent observation of emergent large-magnitude features in LLMs, our approach prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis. Notably, Wanda requires no retraining or weight update, and the pruned LLM can be used as is. We conduct a thorough evaluation of our method on LLaMA across various language benchmarks. Wanda significantly outperforms the established baseline of magnitude pruning and competes favorably against recent methods involving intensive weight updates. Code is available at https://github.com/locuslab/wanda.
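The scoring rule described above is simple enough to sketch for a single linear layer. The snippet below is a hedged illustration of per-output pruning by weight magnitude times input-activation norm; the function name and calibration interface are assumptions, not the released Wanda code.

```python
import torch

def wanda_prune_layer(weight, activations, sparsity=0.5):
    """Illustrative sketch of Wanda-style pruning for one linear layer.

    weight:      (out_features, in_features) pretrained weight matrix
    activations: (n_samples, in_features) calibration inputs to this layer
    sparsity:    fraction of weights to zero out in each output row
    """
    # per-input-feature activation norm over the calibration samples
    act_norm = activations.norm(p=2, dim=0)               # (in_features,)
    score = weight.abs() * act_norm.unsqueeze(0)           # |W| * ||X||, elementwise
    k = int(weight.shape[1] * sparsity)
    # per-output comparison group: drop the k lowest-scoring weights in each row
    idx = torch.argsort(score, dim=1)[:, :k]
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, idx, False)
    return weight * mask                                   # no retraining or weight update
```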

* Technical Report 

One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning

Jun 13, 2023
Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, Zhiqiang Shen

We present Generalized LoRA (GLoRA), an advanced approach for universal parameter-efficient fine-tuning tasks. Enhancing Low-Rank Adaptation (LoRA), GLoRA employs a generalized prompt module to optimize pre-trained model weights and adjust intermediate activations, providing more flexibility and capability across diverse tasks and datasets. Moreover, GLoRA facilitates efficient parameter adaptation by employing a scalable, modular, layer-wise structure search that learns an individual adapter for each layer. Originating from a unified mathematical formulation, GLoRA exhibits strong transfer learning, few-shot learning, and domain generalization abilities, as it adjusts to new tasks through additional dimensions on weights and activations. Comprehensive experiments demonstrate that GLoRA outperforms all previous methods on natural, specialized, and structured benchmarks, achieving superior accuracy with fewer parameters and computations on various datasets. Furthermore, our structural re-parameterization design ensures that GLoRA incurs no extra inference cost, rendering it a practical solution for resource-limited applications. Code is available at: https://github.com/Arnav0400/ViT-Slim/tree/master/GLoRA.
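As a rough illustration of adapting both weights and activations while keeping the pretrained layer frozen, the sketch below combines a low-rank weight update with a learned per-channel scale and shift. The class name and the specific adaptation terms are hypothetical simplifications; GLoRA's unified formulation and layer-wise structure search are more general than this.

```python
import torch
import torch.nn as nn

class GLoRALinearSketch(nn.Module):
    """Hypothetical sketch: frozen pretrained weight W0 adapted by a low-rank
    update, plus a learned scale and shift on the output activations."""

    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # keep pretrained weights frozen
        in_f, out_f = base.in_features, base.out_features
        self.A = nn.Parameter(torch.zeros(out_f, rank))   # low-rank weight update: W0 + A @ B
        self.B = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.scale = nn.Parameter(torch.ones(out_f))      # per-channel activation scaling
        self.shift = nn.Parameter(torch.zeros(out_f))     # learned activation shift

    def forward(self, x):
        delta_w = self.A @ self.B
        y = nn.functional.linear(x, self.base.weight + delta_w, self.base.bias)
        return y * self.scale + self.shift
```

Because the adaptation is expressed as additive and multiplicative terms on the frozen layer, it can in principle be folded back into the base weights after training, which is the intuition behind incurring no extra inference cost.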

* Technical report 

ImageBind: One Embedding Space To Bind Them All

May 09, 2023
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that not all combinations of paired data are necessary to train such a joint embedding; image-paired data alone is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, and extends their zero-shot capabilities to new modalities simply by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation. The emergent capabilities improve with the strength of the image encoder, and we set a new state of the art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks.
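A minimal sketch of the image-paired binding idea: embeddings of a non-image modality are aligned with the embeddings of the images they co-occur with via a symmetric InfoNCE loss. Function and variable names are illustrative; the actual ImageBind training setup involves per-modality encoders and details not shown here.

```python
import torch
import torch.nn.functional as F

def image_pair_contrastive_loss(img_emb, other_emb, temperature=0.07):
    """Align another modality (audio, depth, thermal, IMU, ...) to the image
    embedding space using only image-paired samples in the batch."""
    img = F.normalize(img_emb, dim=-1)       # (B, D) image embeddings
    oth = F.normalize(other_emb, dim=-1)     # (B, D) paired other-modality embeddings
    logits = img @ oth.t() / temperature     # (B, B) cosine similarities
    targets = torch.arange(img.shape[0], device=img.device)
    # each image matches its paired sample and vice versa (symmetric loss)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```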

* CVPR 2023 (Highlighted Paper). Website: https://imagebind.metademolab.com/ Code/Models: https://github.com/facebookresearch/ImageBind 

An optimization framework for herbal prescription planning based on deep reinforcement learning

Apr 25, 2023
Kuo Yang, Zecong Yu, Xin Su, Xiong He, Ning Wang, Qiguang Zheng, Feidie Yu, Zhuang Liu, Tiancai Wen, Xuezhong Zhou

Treatment planning for chronic diseases is a critical task in medical artificial intelligence, particularly in traditional Chinese medicine (TCM). However, generating optimized sequential treatment strategies for patients with chronic diseases in different clinical encounters remains a challenging issue that requires further exploration. In this study, we proposed a TCM herbal prescription planning framework based on deep reinforcement learning for chronic disease treatment (PrescDRL). PrescDRL is a sequential herbal prescription optimization model that focuses on long-term effectiveness rather than achieving maximum reward at every step, thereby ensuring better patient outcomes. We constructed a high-quality benchmark dataset for sequential diagnosis and treatment of diabetes and evaluated PrescDRL against this benchmark. Our results showed that PrescDRL achieved a higher curative effect, with the single-step reward improving by 117% and 153% compared to doctors. Furthermore, PrescDRL outperformed the benchmark in prescription prediction, with precision improving by 40.5% and recall improving by 63%. Overall, our study demonstrates the potential of using artificial intelligence to improve clinical intelligent diagnosis and treatment in TCM.
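For illustration, the snippet below sketches the kind of temporal-difference update that optimizes discounted long-term return rather than the immediate single-step reward. The network interfaces, state encoding, and reward definition are placeholders, not PrescDRL's actual design.

```python
import torch
import torch.nn.functional as F

def td_update(q_net, target_net, batch, gamma=0.99):
    """Hypothetical sketch: value a prescription action by its discounted
    future curative effect, not only the reward at the current encounter.

    batch: (state, action, reward, next_state, done) tensors for a minibatch
    """
    state, action, reward, next_state, done = batch
    q = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1).values
        target = reward + gamma * next_q * (1.0 - done)   # long-horizon objective
    return F.smooth_l1_loss(q, target)
```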

* 13 pages, 4 figures 

Dropout Reduces Underfitting

Mar 02, 2023
Zhuang Liu, Zhiqiu Xu, Joseph Jin, Zhiqiang Shen, Trevor Darrell

Introduced by Hinton et al. in 2012, dropout has stood the test of time as a regularizer for preventing overfitting in neural networks. In this study, we demonstrate that dropout can also mitigate underfitting when used at the start of training. During the early phase, we find dropout reduces the directional variance of gradients across mini-batches and helps align the mini-batch gradients with the entire dataset's gradient. This helps counteract the stochasticity of SGD and limit the influence of individual batches on model training. Our findings lead us to a solution for improving performance in underfitting models - early dropout: dropout is applied only during the initial phases of training, and turned off afterwards. Models equipped with early dropout achieve lower final training loss compared to their counterparts without dropout. Additionally, we explore a symmetric technique for regularizing overfitting models - late dropout, where dropout is not used in the early iterations and is only activated later in training. Experiments on ImageNet and various vision tasks demonstrate that our methods consistently improve generalization accuracy. Our results encourage more research on understanding regularization in deep learning and our methods can be useful tools for future neural network training, especially in the era of large data. Code is available at https://github.com/facebookresearch/dropout .
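Early dropout as described here amounts to a simple schedule: keep dropout active for an initial portion of training and then set its rate to zero. A minimal sketch follows; the training callable and schedule values are placeholders, not the released configuration.

```python
import torch.nn as nn

def set_dropout_rate(model: nn.Module, p: float) -> None:
    """Set the rate of every nn.Dropout module in the model."""
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = p

def train_with_early_dropout(model, train_one_epoch, total_epochs=100,
                             early_epochs=20, drop_rate=0.1):
    """Apply dropout only for the first `early_epochs` epochs, then disable it.
    `train_one_epoch` is a user-supplied callable (hypothetical here).
    Late dropout is the mirror image: keep p = 0 early and enable it later."""
    for epoch in range(total_epochs):
        set_dropout_rate(model, drop_rate if epoch < early_epochs else 0.0)
        train_one_epoch(model, epoch)
```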

* 16 pages 

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

Jan 02, 2023
Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie

Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
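A compact sketch of a GRN-style layer is shown below for a channels-last tensor layout: a global per-channel L2 response is computed, normalized across channels, and used to recalibrate features with a residual connection. This follows the commonly cited formulation; exact details may differ from the official ConvNeXt V2 release.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization sketch: promotes inter-channel feature
    competition by normalizing each channel's global response."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):                                       # x: (N, H, W, C)
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)       # global response per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)    # normalize across channels
        return self.gamma * (x * nx) + self.beta + x            # recalibrate with residual
```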

* Code and models available at https://github.com/facebookresearch/ConvNeXt-V2 

1st Place Solutions for UG2+ Challenge 2022 ATMOSPHERIC TURBULENCE MITIGATION

Oct 30, 2022
Zhuang Liu, Zhichao Zhao, Ye Yuan, Zhi Qiao, Jinfeng Bai, Zhilong Ji

In this technical report, we briefly introduce the solution of our team "summer" for Atmospheric Turbulence Mitigation in the UG$^2$+ Challenge at CVPR 2022. For this task, we propose a unified end-to-end framework to reconstruct a high-quality image from distorted frames, which mainly consists of a Restormer-based image reconstruction module and a NIMA-based image quality assessment module. Our framework is efficient and generic, adapting to both hot-air images and text patterns. Moreover, we elaborately synthesize more than 10 thousand images to simulate atmospheric turbulence, and these images improve the robustness of the model. Finally, we achieve an average accuracy of 98.53% on the reconstruction of the text patterns, ranking 1st on the final leaderboard.


Lumen Shape Reconstruction using a Soft Robotic Balloon Catheter and Electrical Impedance Tomography

Jul 25, 2022
James Avery, Mark Runciman, Cristina Fiani, Elena Monfort Sanchez, Saina Akhond, Zhuang Liu, Kirill Aristovich, George Mylonas

Incorrectly sized balloon catheters can lead to increased post-surgical complications, yet even with preoperative imaging, correct selection remains a challenge. With limited feedback during surgery, it is difficult to verify correct deployment. We propose the use of integrated impedance measurements and Electrical Impedance Tomography (EIT) imaging to assess the deformation of the balloon and determine the size and shape of the surrounding lumen. Previous work using single impedance measurements, or pressure data and analytical models, whilst demonstrating high sizing accuracy, has assumed a circular cross-section. Here we extend these methods by adding a multitude of electrodes to detect elliptical and occluded lumens and obtain EIT images to localise deformations. Using a 14 Fr (5.3 mm) catheter as an example, numerical simulations were performed to find the optimal electrode configuration of two rings of 8 electrodes spaced 10 mm apart. The simulations predicted that the maximum detectable aspect ratio decreased from 0.9 for a 14 mm balloon to 0.5 at 30 mm. The sizing and ellipticity detection results were verified experimentally. A prototype robotic balloon catheter was constructed to automatically inflate a compliant balloon while simultaneously recording EIT and pressure data. Data were collected in experiments replicating stenotic vessels with elliptical and asymmetrical profiles, and the widening of a lumen during angioplasty. After calibration, the system was able to correctly localise the occlusion and detect aspect ratios of 0.75. EIT images further localised the occlusion and visualised the dilation of the lumen during balloon inflation.

* Accepted for IROS 2022 The IEEE/RSJ International Conference on Intelligent Robots and Systems 

A ConvNet for the 2020s

Jan 10, 2022
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie

The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

* Technical report; Code: https://github.com/facebookresearch/ConvNeXt 