Zeyi Huang

A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance

Sep 21, 2023
Zeyi Huang, Andy Zhou, Zijian Lin, Mu Cai, Haohan Wang, Yong Jae Lee

Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unseen domains. The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations obtained from encoding the corresponding text descriptions of images. We introduce two designs of the loss function, absolute and relative distance, which provide specific guidance on how the training process of the student model should be regularized. We evaluate our proposed method, dubbed RISE (Regularized Invariance with Semantic Embeddings), on various benchmark datasets and show that it outperforms several state-of-the-art domain generalization methods. To our knowledge, our work is the first to leverage knowledge distillation using a large vision-language model for domain generalization. By incorporating text-based information, RISE improves the generalization capability of machine learning models.
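
The absolute-distance variant of this regularizer can be pictured with a short sketch. This is a minimal, hypothetical reading of the idea rather than the authors' released code: it assumes a frozen CLIP text encoder as the teacher, a small projection head mapping the student's image features into the teacher's embedding space, and a caption available for each training image. The relative-distance variant instead constrains distances between embeddings rather than the embeddings themselves; its exact form follows the paper.

```python
import torch
import torch.nn.functional as F

def rise_absolute_loss(student_feats, teacher_text_embeds, proj_head):
    """Absolute-distance regularizer (sketch): pull each student image
    embedding toward the frozen CLIP text embedding of its caption.
    `proj_head` is a hypothetical linear layer mapping student features
    into the CLIP embedding space."""
    z_img = F.normalize(proj_head(student_feats), dim=-1)
    z_txt = F.normalize(teacher_text_embeds, dim=-1)  # precomputed, no gradient
    return (z_img - z_txt).pow(2).sum(dim=-1).mean()

# Typical use: total loss = cross-entropy on labels + lambda * rise_absolute_loss(...)
```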

* to appear at ICCV2023 

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

Jun 09, 2023
Mu Cai, Zeyi Huang, Yuheng Li, Haohan Wang, Yong Jae Lee

Recently, large language models (LLMs) have made significant advancements in natural language understanding and generation. However, their potential in computer vision remains largely unexplored. In this paper, we introduce a new, exploratory approach that enables LLMs to process images using the Scalable Vector Graphics (SVG) format. By leveraging the XML-based textual descriptions of SVG representations instead of raster images, we aim to bridge the gap between the visual and textual modalities, allowing LLMs to directly understand and manipulate images without the need for parameterized visual components. Our method facilitates simple image classification, generation, and in-context learning using only LLM capabilities. We demonstrate the promise of our approach across discriminative and generative tasks, highlighting its (i) robustness against distribution shift, (ii) substantial improvements achieved by tapping into the in-context learning abilities of LLMs, and (iii) image understanding and generation capabilities with human guidance. Our code, data, and models can be found here https://github.com/mu-cai/svg-llm.
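
One way to picture the pipeline is as prompt construction over SVG source. The sketch below is illustrative only (the function name and prompt wording are assumptions, not the released code) and presumes the raster-to-SVG conversion has already been done with any vectorization tool.

```python
from pathlib import Path

def build_svg_classification_prompt(svg_path, candidate_labels, demos=()):
    """Assemble an in-context prompt asking an LLM to classify an image
    given only its SVG (XML) source. `demos` is an iterable of
    (svg_text, label) demonstration pairs for in-context learning."""
    parts = []
    for demo_svg, demo_label in demos:
        parts.append(f"SVG:\n{demo_svg}\nLabel: {demo_label}\n")
    parts.append(
        f"SVG:\n{Path(svg_path).read_text()}\n"
        f"Which one of {sorted(candidate_labels)} does this SVG depict? "
        "Answer with a single label."
    )
    return "\n".join(parts)

# The resulting string is sent to any text-only LLM; its answer is parsed
# back into a class label.
```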

Expeditious Saliency-guided Mix-up through Random Gradient Thresholding

Dec 17, 2022
Minh-Long Luu, Zeyi Huang, Eric P. Xing, Yong Jae Lee, Haohan Wang

Mix-up training approaches have proven effective in improving the generalization ability of Deep Neural Networks. Over the years, the research community has expanded mix-up methods along two directions, with extensive effort devoted to improving saliency-guided procedures but minimal attention paid to the random path, leaving the randomization direction largely unexplored. In this paper, inspired by the respective strengths of each direction over the other, we introduce a novel method that lies at the junction of the two routes. By combining the best elements of randomness and saliency utilization, our method balances speed, simplicity, and accuracy. We name our method R-Mix, following the concept of "Random Mix-up". We demonstrate its effectiveness in generalization, weakly supervised object localization, calibration, and robustness to adversarial attacks. Finally, to address the question of whether there exists a better decision protocol, we train a Reinforcement Learning agent that decides the mix-up policies based on the classifier's performance, reducing dependency on human-designed objectives and hyperparameter tuning. Extensive experiments further show that the agent is capable of performing at the cutting-edge level, laying the foundation for fully automatic mix-up. Our code is released at https://github.com/minhlong94/Random-Mixup.
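
A rough sketch of the combination (not the released implementation; the thresholding and label-mixing details here are simplified assumptions): compute an input-gradient saliency map, draw a random threshold, copy salient pixels from each image over a shuffled partner, and mix the labels by the retained area.

```python
import torch
import torch.nn.functional as F

def r_mix_batch(model, x, y, num_classes):
    """Saliency-guided mix-up with a random gradient threshold (sketch):
    pixels whose input-gradient magnitude exceeds a randomly drawn quantile
    are kept from the original image; the rest come from a shuffled partner."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0].abs().sum(dim=1, keepdim=True)  # saliency map

    q = torch.rand(1).item()                              # random threshold quantile
    thresh = torch.quantile(grad.flatten(1), q, dim=1).view(-1, 1, 1, 1)
    mask = (grad >= thresh).float()                       # 1 = salient, keep original pixel

    perm = torch.randperm(x.size(0))
    x_mix = mask * x.detach() + (1 - mask) * x.detach()[perm]
    lam = mask.flatten(1).mean(dim=1)                     # per-sample label mixing weight
    y_onehot = F.one_hot(y, num_classes).float()
    y_mix = lam.unsqueeze(1) * y_onehot + (1 - lam).unsqueeze(1) * y_onehot[perm]
    return x_mix, y_mix                                   # train with a soft-label cross-entropy
```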

* Accepted Long paper at 2nd Practical-DL Workshop at AAAI 2023. V2 fix typo 

Toward Learning Robust and Invariant Representations with Alignment Regularization and Data Augmentation

Jun 04, 2022
Haohan Wang, Zeyi Huang, Xindi Wu, Eric P. Xing

Data augmentation has proven to be an effective technique for developing machine learning models that are robust to known classes of distributional shift (e.g., rotations of images), and alignment regularization is often used together with data augmentation to further help the model learn representations invariant to the shifts used to augment the data. In this paper, motivated by the proliferation of alignment regularization options, we evaluate the performance of several popular design choices along the dimensions of robustness and invariance, for which we introduce a new test procedure. Our synthetic experiments speak to the benefits of squared $\ell_2$ norm regularization. Further, we formally analyze the behavior of alignment regularization, under assumptions we consider realistic, to complement our empirical study. Finally, we test the simple technique we identify (worst-case data augmentation with squared $\ell_2$ norm alignment regularization) and show that its benefits exceed those of specially designed methods. We also release a software package in both TensorFlow and PyTorch that lets users apply the method with a couple of lines of code, at https://github.com/jyanln/AlignReg.
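
The identified recipe can be summarized in a few lines. The sketch below is an assumption-laden illustration rather than the AlignReg package itself: it assumes the model returns both logits and an embedding, and that candidate augmentations are supplied as callables.

```python
import torch
import torch.nn.functional as F

def alignreg_style_loss(model, x, y, augmentations, reg_weight=1.0):
    """Sketch: pick the augmentation with the largest classification loss
    (worst case), then add a squared l2 alignment penalty between the
    embeddings of the clean and worst-case augmented views."""
    logits, emb = model(x)
    ce_clean = F.cross_entropy(logits, y)

    # Select the worst-case augmentation by classification loss.
    worst_loss, worst_emb = None, None
    for aug in augmentations:
        logits_a, emb_a = model(aug(x))
        loss_a = F.cross_entropy(logits_a, y)
        if worst_loss is None or loss_a > worst_loss:
            worst_loss, worst_emb = loss_a, emb_a

    align = (emb - worst_emb).pow(2).sum(dim=-1).mean()   # squared l2 alignment
    return ce_clean + worst_loss + reg_weight * align
```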

* to appear at KDD 2022, the software package is at https://github.com/jyanln/AlignReg. arXiv admin note: text overlap with arXiv:2011.13052 

The Two Dimensions of Worst-case Training and the Integrated Effect for Out-of-domain Generalization

Apr 09, 2022
Zeyi Huang, Haohan Wang, Dong Huang, Yong Jae Lee, Eric P. Xing

Training with an emphasis on "hard-to-learn" components of the data has proven to be an effective method for improving the generalization of machine learning models, especially in settings where robustness (e.g., generalization across distributions) is valued. The existing literature on this "hard-to-learn" concept mainly expands along either the sample dimension or the feature dimension. In this paper, we introduce a simple view that merges these two dimensions, leading to a new, simple yet effective heuristic that trains machine learning models by emphasizing the worst cases along both the sample and the feature dimensions. We name our method W2D, following the concept of "Worst-case along Two Dimensions". We validate the idea and demonstrate its empirical strength on standard benchmarks.
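
One way to read "worst case along two dimensions" as training code is sketched below. This is a hypothetical illustration: the ratios, the gradient-based saliency criterion, and the `model.features` / `model.classifier` split are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def w2d_style_loss(model, x, y, sample_ratio=0.5, feature_drop_ratio=0.3):
    """Emphasize worst cases along two dimensions (sketch): (1) keep only the
    hardest samples in the batch, then (2) mute their most salient feature
    channels before recomputing the loss."""
    z = model.features(x)                      # (B, C) embedding (illustrative API)
    per_sample = F.cross_entropy(model.classifier(z), y, reduction="none")

    # Sample dimension: keep the hardest fraction of the batch.
    k = max(1, int(sample_ratio * x.size(0)))
    hard_idx = per_sample.topk(k).indices
    z_hard, y_hard = z[hard_idx], y[hard_idx]

    # Feature dimension: locate the most salient channels via the gradient of
    # the loss w.r.t. a detached probe of the embedding, then mute them.
    z_probe = z_hard.detach().requires_grad_(True)
    probe_loss = F.cross_entropy(model.classifier(z_probe), y_hard)
    grad = torch.autograd.grad(probe_loss, z_probe)[0].abs()
    c = max(1, int(feature_drop_ratio * z_hard.size(1)))
    drop = grad.topk(c, dim=1).indices
    mask = torch.ones_like(z_hard).scatter_(1, drop, 0.0)

    return F.cross_entropy(model.classifier(z_hard * mask), y_hard)
```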

* to appear at CVPR2022 

On the Integration of Self-Attention and Convolution

Nov 29, 2021
Xuran Pan, Chunjiang Ge, Rui Lu, Shiji Song, Guanfu Chen, Zeyi Huang, Gao Huang

Convolution and self-attention are two powerful techniques for representation learning, and they are usually considered peer approaches that are distinct from each other. In this paper, we show that there exists a strong underlying relation between them, in the sense that the bulk of the computation in these two paradigms is in fact done by the same operation. Specifically, we first show that a traditional convolution with kernel size k x k can be decomposed into k^2 individual 1x1 convolutions, followed by shift and summation operations. Then, we interpret the projections of queries, keys, and values in the self-attention module as multiple 1x1 convolutions, followed by the computation of attention weights and aggregation of the values. Therefore, the first stage of both modules comprises the same operation. More importantly, the first stage accounts for the dominant computational complexity (quadratic in the channel size) compared to the second stage. This observation naturally leads to an elegant integration of these two seemingly distinct paradigms, i.e., a mixed model that enjoys the benefits of both self-Attention and Convolution (ACmix), while incurring minimal computational overhead compared to its pure convolution or self-attention counterparts. Extensive experiments show that our model achieves consistently improved results over competitive baselines on image recognition and downstream tasks. Code and pre-trained models will be released at https://github.com/Panxuran/ACmix and https://gitee.com/mindspore/models.
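
The decomposition claim is easy to verify numerically. The snippet below checks that a 3x3 convolution equals nine 1x1 convolutions followed by shifts and a summation; it is a self-contained check of the stated identity, not the ACmix module itself (borders differ only because torch.roll wraps around instead of zero-padding).

```python
import torch
import torch.nn.functional as F

def conv3x3_as_shifted_1x1(x, weight):
    """A 3x3 convolution rewritten as nine 1x1 convolutions, each followed by
    a spatial shift, summed together. `weight` has shape (C_out, C_in, 3, 3)."""
    out = 0
    for p in range(3):
        for q in range(3):
            w_1x1 = weight[:, :, p, q].unsqueeze(-1).unsqueeze(-1)  # (C_out, C_in, 1, 1)
            y = F.conv2d(x, w_1x1)                                  # 1x1 convolution
            y = torch.roll(y, shifts=(1 - p, 1 - q), dims=(2, 3))   # shift by kernel offset
            out = out + y
    return out

x = torch.randn(2, 4, 8, 8)
weight = torch.randn(6, 4, 3, 3)
ref = F.conv2d(x, weight, padding=1)
approx = conv3x3_as_shifted_1x1(x, weight)
# Interior pixels match the direct 3x3 convolution exactly (up to float error).
print(torch.allclose(ref[..., 1:-1, 1:-1], approx[..., 1:-1, 1:-1], atol=1e-5))
```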

Toward Learning Human-aligned Cross-domain Robust Models by Countering Misaligned Features

Nov 05, 2021
Haohan Wang, Zeyi Huang, Hanlin Zhang, Eric Xing

Machine learning has demonstrated remarkable prediction accuracy on i.i.d. data, but this accuracy often drops when models are tested with data from another distribution. In this paper, we offer another view of this problem, from a perspective that attributes the accuracy drop to the model's reliance on features that are not aligned with what a data annotator would consider similar across the two datasets. We refer to these features as misaligned features. We extend the conventional generalization error bound to a new one for this setup that incorporates knowledge of how the misaligned features are associated with the label. Our analysis yields a set of techniques for this problem, and these techniques are naturally linked to many previous methods in the robust machine learning literature. We also compare the empirical strength of these methods and demonstrate the performance when these previous techniques are combined.

* 10 pages of main contents 

Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length

May 31, 2021
Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang

Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens leads to higher prediction accuracy, but it also results in drastically increased computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16. In this paper, we argue that every image has its own characteristics, and ideally the token number should be conditioned on each individual input. In fact, we observe that there exist a considerable number of "easy" images which can be accurately predicted with as few as 4x4 tokens, while only a small fraction of "hard" ones need a finer representation. Inspired by this phenomenon, we propose a Dynamic Transformer that automatically configures a proper number of tokens for each input image. This is achieved by cascading multiple Transformers with increasing numbers of tokens, which are sequentially activated in an adaptive fashion at test time, i.e., the inference is terminated once a sufficiently confident prediction is produced. We further design efficient feature-reuse and relationship-reuse mechanisms across different components of the Dynamic Transformer to reduce redundant computation. Extensive empirical results on ImageNet, CIFAR-10, and CIFAR-100 demonstrate that our method significantly outperforms competitive baselines in terms of both theoretical computational efficiency and practical inference speed.
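
The adaptive inference loop can be sketched as follows. The confidence threshold, patch-grid sizes, 16x16-pixel patches, and the batch-of-one assumption are illustrative choices, not the released Dynamic Transformer code (which additionally reuses features and relations across stages).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cascaded_inference(models, image, token_grids=(4, 7, 14), threshold=0.9):
    """Run ViT classifiers with progressively more tokens and stop as soon as
    the prediction is confident enough. `image` is a (1, 3, H, W) tensor;
    each model is assumed to split its input into g x g patches of 16 pixels."""
    for model, g in zip(models, token_grids):
        side = g * 16                                   # pixels per side at this stage
        x = F.interpolate(image, size=(side, side), mode="bilinear", align_corners=False)
        probs = F.softmax(model(x), dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:                    # early exit: confident enough
            return pred.item(), g
    return pred.item(), g                               # fall back to the largest stage
```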

Squared $\ell_2$ Norm as Consistency Loss for Leveraging Augmented Data to Learn Robust and Invariant Representations

Nov 25, 2020
Haohan Wang, Zeyi Huang, Xindi Wu, Eric P. Xing

Data augmentation is one of the most popular techniques for improving the robustness of neural networks. In addition to directly training the model with original samples and augmented samples, a torrent of methods regularizing the distance between embeddings/representations of the original samples and their augmented counterparts have been introduced. In this paper, we explore these various regularization choices, seeking to provide a general understanding of how we should regularize the embeddings. Our analysis suggests the ideal choices of regularization correspond to various assumptions. With an invariance test, we argue that regularization is important if the model is to be used in a broader context than the accuracy-driven setting because non-regularized approaches are limited in learning the concept of invariance, despite equally high accuracy. Finally, we also show that the generic approach we identified (squared $\ell_2$ norm regularized augmentation) outperforms several recent methods, which are each specially designed for one task and significantly more complicated than ours, over three different tasks.
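
In its simplest form, the identified approach amounts to adding one regularization term. The sketch below uses illustrative names and assumes the model returns both logits and an embedding; it mirrors the AlignReg sketch above but without the worst-case augmentation selection.

```python
import torch.nn.functional as F

def squared_l2_consistency_loss(model, x, x_aug, y, reg_weight=1.0):
    """Cross-entropy on both views plus a squared l2 penalty between their
    embeddings. Assumes `model` returns (logits, embedding)."""
    logits, emb = model(x)
    logits_a, emb_a = model(x_aug)
    ce = F.cross_entropy(logits, y) + F.cross_entropy(logits_a, y)
    consistency = (emb - emb_a).pow(2).sum(dim=-1).mean()
    return ce + reg_weight * consistency
```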

* 12 pages and an additional 9 pages as appendix 