Xingjian Shi

First De-Trend then Attend: Rethinking Attention for Time-Series Forecasting

Dec 15, 2022
Xiyuan Zhang, Xiaoyong Jin, Karthick Gopalswamy, Gaurav Gupta, Youngsuk Park, Xingjian Shi, Hao Wang, Danielle C. Maddix, Yuyang Wang

Transformer-based models have gained large popularity and demonstrated promising results in long-term time-series forecasting in recent years. In addition to learning attention in time domain, recent works also explore learning attention in frequency domains (e.g., Fourier domain, wavelet domain), given that seasonal patterns can be better captured in these domains. In this work, we seek to understand the relationships between attention models in different time and frequency domains. Theoretically, we show that attention models in different domains are equivalent under linear conditions (i.e., linear kernel to attention scores). Empirically, we analyze how attention models of different domains show different behaviors through various synthetic experiments with seasonality, trend and noise, with emphasis on the role of softmax operation therein. Both these theoretical and empirical analyses motivate us to propose a new method: TDformer (Trend Decomposition Transformer), that first applies seasonal-trend decomposition, and then additively combines an MLP which predicts the trend component with Fourier attention which predicts the seasonal component to obtain the final prediction. Extensive experiments on benchmark time-series forecasting datasets demonstrate that TDformer achieves state-of-the-art performance against existing attention-based models.

* NeurIPS 2022 All Things Attention Workshop
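The seasonal-trend decomposition step named in the abstract above can be illustrated with a small sketch. The abstract does not say which decomposition TDformer uses, so the centered moving average below is an assumption chosen for simplicity, and the names `moving_average` and `decompose` are hypothetical. The MLP and Fourier attention branches are not shown.

```python
def moving_average(series, window):
    """Centered moving average; edges are padded by repeating the end values."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


def decompose(series, window):
    """Split a series into (trend, seasonal) so that trend + seasonal == series.

    Assumption: the trend is a moving average and the seasonal part is the
    residual. This is an illustrative choice, not the paper's stated method.
    """
    trend = moving_average(series, window)
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal


# Toy series: a linear trend plus an alternating (period-2) pattern.
series = [0.5 * t + (1.0 if t % 2 == 0 else -1.0) for t in range(12)]
trend, seasonal = decompose(series, window=5)
```

In a model of the kind the abstract describes, one predictor would be fit to `trend` and another to `seasonal`, and their outputs would be added to form the forecast.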

Are Multimodal Models Robust to Image and Text Perturbations?

Dec 15, 2022
Jielin Qiu, Yi Zhu, Xingjian Shi, Florian Wenzel, Zhiqiang Tang, Ding Zhao, Bo Li, Mu Li

Multimodal image-text models have shown remarkable performance in the past few years. However, evaluating their robustness against distribution shifts is crucial before adopting them in real-world applications. In this paper, we investigate the robustness of 9 popular open-sourced image-text models under common perturbations on five tasks (image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation). In particular, we propose several new multimodal robustness benchmarks by applying 17 image perturbation and 16 text perturbation techniques on top of existing datasets. We observe that multimodal models are not robust to image and text perturbations, especially to image perturbations. Among the tested perturbation methods, character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data. We also introduce two new robustness metrics (MMI and MOR) for proper evaluations of multimodal models. We hope our extensive study sheds light on new directions for the development of robust multimodal models.

* The project webpage is at: https://mmrobustness.github.io/
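To make "character-level perturbation" concrete, here is a minimal sketch of one such perturbation: swapping adjacent letters. The abstract does not list the 16 text perturbations, so this particular operation, the function name `swap_adjacent_chars`, and its `rate` and `seed` parameters are all assumptions for illustration.

```python
import random


def swap_adjacent_chars(text, rate, seed=0):
    """Swap neighbouring letters with probability `rate` (hypothetical helper).

    Only pairs of alphabetic characters are swapped, so spaces and
    punctuation stay in place. A fixed seed keeps the output reproducible.
    """
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)


perturbed = swap_adjacent_chars("a dog", rate=1.0)  # "a odg"
```

A robustness benchmark of the kind described would apply such a function to the text side of an existing dataset and compare model scores before and after.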

A Transformer-Based Substitute Recommendation Model Incorporating Weakly Supervised Customer Behavior Data

Nov 04, 2022
Wenting Ye, Hongfei Yang, Shuai Zhao, Haoyang Fang, Xingjian Shi, Naveen Neppalli

Substitute-based recommendation is widely used in E-commerce to offer customers better alternatives. However, existing research typically uses customer behavior signals such as co-view and view-but-purchase-another to capture the substitute relationship. Despite its intuitive soundness, we find that such an approach might ignore the functionality and characteristics of products. In this paper, we recast substitute recommendation as a language matching problem by taking the product title description as model input to account for product functionality. We design a new transformation method to de-noise the signals derived from production data. In addition, we consider multilingual support from the engineering point of view. Our proposed end-to-end transformer-based model succeeds in both offline and online experiments. The proposed model has been deployed on a large-scale E-commerce website for 11 marketplaces in 6 languages. In an online A/B experiment, the proposed model increased revenue by 19%.

* 6 pages, 3 figures, 5 tables, accepted in 21st IEEE International Conference on Machine Learning and Applications

Visual Prompt Tuning for Test-time Domain Adaptation

Oct 10, 2022
Yunhe Gao, Xingjian Shi, Yi Zhu, Hao Wang, Zhiqiang Tang, Xiong Zhou, Mu Li, Dimitris N. Metaxas

Models should be able to adapt to unseen data at test time to avoid the performance drop caused by inevitable distribution shifts in real-world deployment scenarios. In this work, we tackle the practical yet challenging test-time adaptation (TTA) problem, where a model adapts to the target domain without accessing the source data. We propose a simple recipe called data-efficient prompt tuning (DePT) with two key ingredients. First, DePT plugs visual prompts into the vision Transformer and only tunes these source-initialized prompts during adaptation. We find such parameter-efficient finetuning can efficiently adapt the model representation to the target domain without overfitting to the noise in the learning objective. Second, DePT bootstraps the source representation to the target domain by memory bank-based online pseudo labeling. A hierarchical self-supervised regularization specially designed for prompts is jointly optimized to alleviate error accumulation during self-training. With far fewer tunable parameters, DePT demonstrates not only state-of-the-art performance on major adaptation benchmarks, but also superior data efficiency, i.e., adaptation with only 1% or 10% of the data without much performance degradation compared to 100% of the data. In addition, DePT can be extended to online or multi-source TTA settings.

Earthformer: Exploring Space-Time Transformers for Earth System Forecasting

Jul 12, 2022
Zhihan Gao, Xingjian Shi, Hao Wang, Yi Zhu, Yuyang Wang, Mu Li, Dit-Yan Yeung

Conventionally, Earth system (e.g., weather and climate) forecasting relies on numerical simulation with complex physical models and is hence both computationally expensive and demanding of domain expertise. With the explosive growth of spatiotemporal Earth observation data in the past decade, data-driven models that apply Deep Learning (DL) are demonstrating impressive potential for various Earth system forecasting tasks. The Transformer as an emerging DL architecture, despite its broad success in other domains, has seen limited adoption in this area. In this paper, we propose Earthformer, a space-time Transformer for Earth system forecasting. Earthformer is based on a generic, flexible and efficient space-time attention block, named Cuboid Attention. The idea is to decompose the data into cuboids and apply cuboid-level self-attention in parallel. These cuboids are further connected with a collection of global vectors. We conduct experiments on the MovingMNIST dataset and a newly proposed chaotic N-body MNIST dataset to verify the effectiveness of cuboid attention and identify the best design of Earthformer. Experiments on two real-world benchmarks for precipitation nowcasting and El Niño/Southern Oscillation (ENSO) forecasting show that Earthformer achieves state-of-the-art performance.

* Technical report
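The "decompose the data into cuboids" step from the abstract above can be sketched as an index-grouping exercise. This is only a sketch under assumptions: it uses equally sized, non-overlapping cuboids over a (time, height, width) grid, and the function name `split_into_cuboids` is hypothetical. The self-attention inside each cuboid and the global vectors are not shown.

```python
from itertools import product


def split_into_cuboids(shape, cuboid):
    """Group the (t, h, w) positions of a grid into non-overlapping cuboids.

    Returns a dict mapping each cuboid's index to the list of grid positions
    it contains. Assumes every dimension is divisible by the cuboid size.
    """
    if any(s % c != 0 for s, c in zip(shape, cuboid)):
        raise ValueError("each dimension must be divisible by the cuboid size")
    groups = {}
    for pos in product(*(range(s) for s in shape)):
        key = tuple(p // c for p, c in zip(pos, cuboid))
        groups.setdefault(key, []).append(pos)
    return groups


# A 4x4x4 space-time grid split into 2x2x2 cuboids gives 8 groups of 8 positions.
groups = split_into_cuboids((4, 4, 4), (2, 2, 2))
```

Because the groups share no positions, attention restricted to each group can run independently, which is what allows the per-cuboid computation to proceed in parallel.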

Removing Batch Normalization Boosts Adversarial Training

Jul 04, 2022
Haotao Wang, Aston Zhang, Shuai Zheng, Xingjian Shi, Mu Li, Zhangyang Wang

Adversarial training (AT) defends deep neural networks against adversarial attacks. One challenge that limits its practical application is the performance degradation on clean samples. A major bottleneck identified by previous works is the widely used batch normalization (BN), which struggles to model the different statistics of clean and adversarial training samples in AT. Although the dominant approach is to extend BN to capture this mixture of distributions, we propose to completely eliminate this bottleneck by removing all BN layers in AT. Our normalizer-free robust training (NoFrost) method extends recent advances in normalizer-free networks to AT for their unexplored advantage in handling the mixture distribution challenge. We show that NoFrost achieves adversarial robustness with only a minor sacrifice in clean sample accuracy. On ImageNet with ResNet50, NoFrost achieves 74.06% clean accuracy, which drops merely 2.00% from standard training. In contrast, BN-based AT obtains 59.28% clean accuracy, suffering a significant 16.78% drop from standard training. In addition, NoFrost achieves 23.56% adversarial robustness against the PGD attack, which improves on the 13.57% robustness of BN-based AT. We observe better model smoothness and larger decision margins from NoFrost, which make the models less sensitive to input perturbations and thus more robust. Moreover, when incorporating more data augmentations into NoFrost, it achieves comprehensive robustness against multiple distribution shifts. Code and pre-trained models are public at https://github.com/amazon-research/normalizer-free-robust-training.

* ICML 2022
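The accuracy figures quoted in the abstract above can be cross-checked with a few lines of arithmetic. The sketch assumes the reported drops are absolute percentage points, an interpretation the abstract does not state explicitly. The variable names are illustrative.

```python
from decimal import Decimal

# Figures as quoted in the abstract (ImageNet, ResNet50).
nofrost_clean = Decimal("74.06")
nofrost_drop = Decimal("2.00")
bn_clean = Decimal("59.28")
bn_drop = Decimal("16.78")

# Assumption: drops are absolute percentage points relative to standard training.
implied_standard_from_nofrost = nofrost_clean + nofrost_drop  # 76.06
implied_standard_from_bn = bn_clean + bn_drop                 # 76.06

# Gap in PGD robustness between NoFrost and BN-based AT, in percentage points.
robustness_gain = Decimal("23.56") - Decimal("13.57")         # 9.99
```

Both routes imply the same standard-training clean accuracy of 76.06%, so the quoted numbers are mutually consistent under that reading.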

Fix Bugs with Transformer through a Neural-Symbolic Edit Grammar

Apr 13, 2022
Yaojie Hu, Xingjian Shi, Qiang Zhou, Lee Pike

We introduce NSEdit (neural-symbolic edit), a novel Transformer-based code repair method. Given only the source code that contains bugs, NSEdit predicts an editing sequence that can fix the bugs. The edit grammar is formulated as a regular language, and the Transformer uses it as a neural-symbolic scripting interface to generate editing programs. We modify the Transformer and add a pointer network to select the edit locations. An ensemble of rerankers is trained to re-rank the editing sequences generated by beam search. We fine-tune the rerankers on the validation set to reduce over-fitting. NSEdit is evaluated on various code repair datasets and achieves a new state-of-the-art accuracy (24.04%) on the Tufano small dataset of the CodeXGLUE benchmark. NSEdit performs robustly when programs vary from package to package and when buggy programs are concrete. We conduct a detailed analysis of our methods and demonstrate the effectiveness of each component.

* ICLR 2022 Deep Learning for Code workshop

Benchmarking Multimodal AutoML for Tabular Data with Text Fields

Nov 04, 2021
Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, Alexander J. Smola

We consider the use of automated supervised learning systems for data tables that not only contain numeric/categorical columns, but one or more text fields as well. Here we assemble 18 multimodal data tables that each contain some text fields and stem from a real business application. Our publicly-available benchmark enables researchers to comprehensively evaluate their own methods for supervised learning with numeric, categorical, and text features. To ensure that any single modeling strategy which performs well over all 18 datasets will serve as a practical foundation for multimodal text/tabular AutoML, the diverse datasets in our benchmark vary greatly in: sample size, problem types (a mix of classification and regression tasks), number of features (with the number of text columns ranging from 1 to 28 between datasets), as well as how the predictive signal is decomposed between text vs. numeric/categorical features (and predictive interactions thereof). Over this benchmark, we evaluate various straightforward pipelines to model such data, including standard two-stage approaches where NLP is used to featurize the text such that AutoML for tabular data can then be applied. Compared with human data science teams, the fully automated methodology that performed best on our benchmark (stack ensembling a multimodal Transformer with various tree models) also manages to rank 1st place when fit to the raw text/tabular data in two MachineHack prediction competitions and 2nd place (out of 2380 teams) in Kaggle's Mercari Price Suggestion Challenge.

* Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks 2021

Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing

Sep 23, 2021
Haoyu He, Xingjian Shi, Jonas Mueller, Zha Sheng, Mu Li, George Karypis

We aim to identify how different components in the knowledge distillation (KD) pipeline, such as the data augmentation policy, the loss function, and the intermediate representation for transferring knowledge between teacher and student, affect the resulting performance, and how much the optimal KD pipeline varies across different datasets/tasks. To tease apart their effects, we propose Distiller, a meta KD framework that systematically combines a broad range of techniques across different stages of the KD pipeline, which enables us to quantify each component's contribution. Within Distiller, we unify commonly used objectives for distillation of intermediate representations under a universal mutual information (MI) objective and propose a class of MI-α objective functions with better bias/variance trade-off for estimating the MI between the teacher and the student. On a diverse set of NLP datasets, the best Distiller configurations are identified via large-scale hyperparameter optimization. Our experiments reveal the following: 1) the approach used to distill the intermediate representations is the most important factor in KD performance, 2) among different objectives for intermediate distillation, MI-α performs the best, and 3) data augmentation provides a large boost for small training datasets or small student networks. Moreover, we find that different datasets/tasks prefer different KD algorithms, and thus propose a simple AutoDistiller algorithm that can recommend a good KD pipeline for a new dataset.

GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing

Jul 09, 2019
Jian Guo, He He, Tong He, Leonard Lausen, Mu Li, Haibin Lin, Xingjian Shi, Chenguang Wang, Junyuan Xie, Sheng Zha, Aston Zhang, Hang Zhang, Zhi Zhang, Zhongyue Zhang, Shuai Zheng

We present GluonCV and GluonNLP, the deep learning toolkits for computer vision and natural language processing based on Apache MXNet (incubating). These toolkits provide state-of-the-art pre-trained models, training scripts, and training logs, to facilitate rapid prototyping and promote reproducible research. We also provide modular APIs with flexible building blocks to enable efficient customization. Leveraging the MXNet ecosystem, the deep learning models in GluonCV and GluonNLP can be deployed onto a variety of platforms with different programming languages. Benefiting from open source under the Apache 2.0 license, GluonCV and GluonNLP have attracted 100 contributors worldwide on GitHub. Models of GluonCV and GluonNLP have been downloaded more than 1.6 million times in fewer than 10 months.
