Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ming Wu

Compressing Sentence Representation for Semantic Retrieval via Homomorphic Projective Distillation

Mar 15, 2022

Xuandong Zhao, Zhiguo Yu, Ming Wu, Lei Li

Figure 1 for Compressing Sentence Representation for Semantic Retrieval via Homomorphic Projective Distillation

Figure 2 for Compressing Sentence Representation for Semantic Retrieval via Homomorphic Projective Distillation

Figure 3 for Compressing Sentence Representation for Semantic Retrieval via Homomorphic Projective Distillation

Figure 4 for Compressing Sentence Representation for Semantic Retrieval via Homomorphic Projective Distillation

Abstract:How to learn highly compact yet effective sentence representation? Pre-trained language models have been effective in many NLP tasks. However, these models are often huge and produce large sentence embeddings. Moreover, there is a big performance gap between large and small models. In this paper, we propose Homomorphic Projective Distillation (HPD) to learn compressed sentence embeddings. Our method augments a small Transformer encoder model with learnable projection layers to produce compact representations while mimicking a large pre-trained language model to retain the sentence representation quality. We evaluate our method with different model sizes on both semantic textual similarity (STS) and semantic retrieval (SR) tasks. Experiments show that our method achieves 2.7-4.5 points performance gain on STS tasks compared with previous best representations of the same size. In SR tasks, our method improves retrieval speed (8.2$\times$) and memory usage (8.0$\times$) compared with state-of-the-art large models.

* Findings of ACL 2022

Via

Access Paper or Ask Questions

Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Jan 05, 2022

Haotian Yan, Chuang Zhang, Ming Wu

Figure 1 for Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Figure 2 for Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Figure 3 for Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Figure 4 for Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Abstract:Multi-scale representations are crucial for semantic segmentation. The community has witnessed the flourish of semantic segmentation convolutional neural networks (CNN) exploiting multi-scale contextual information. Motivated by that the vision transformer (ViT) is powerful in image classification, some semantic segmentation ViTs are recently proposed, most of them attaining impressive results but at a cost of computational economy. In this paper, we succeed in introducing multi-scale representations into semantic segmentation ViT via window attention mechanism and further improves the performance and efficiency. To this end, we introduce large window attention which allows the local window to query a larger area of context window at only a little computation overhead. By regulating the ratio of the context area to the query area, we enable the large window attention to capture the contextual information at multiple scales. Moreover, the framework of spatial pyramid pooling is adopted to collaborate with the large window attention, which presents a novel decoder named large window attention spatial pyramid pooling (LawinASPP) for semantic segmentation ViT. Our resulting ViT, Lawin Transformer, is composed of an efficient hierachical vision transformer (HVT) as encoder and a LawinASPP as decoder. The empirical results demonstrate that Lawin Transformer offers an improved efficiency compared to the existing method. Lawin Transformer further sets new state-of-the-art performance on Cityscapes (84.4\% mIoU), ADE20K (56.2\% mIoU) and COCO-Stuff datasets. The code will be released at https://github.com/yan-hao-tian/lawin.

Via

Access Paper or Ask Questions

SamplingAug: On the Importance of Patch Sampling Augmentation for Single Image Super-Resolution

Nov 30, 2021

Shizun Wang, Ming Lu, Kaixin Chen, Jiaming Liu, Xiaoqi Li, Chuang zhang, Ming Wu

Abstract:With the development of Deep Neural Networks (DNNs), plenty of methods based on DNNs have been proposed for Single Image Super-Resolution (SISR). However, existing methods mostly train the DNNs on uniformly sampled LR-HR patch pairs, which makes them fail to fully exploit informative patches within the image. In this paper, we present a simple yet effective data augmentation method. We first devise a heuristic metric to evaluate the informative importance of each patch pair. In order to reduce the computational cost for all patch pairs, we further propose to optimize the calculation of our metric by integral image, achieving about two orders of magnitude speedup. The training patch pairs are sampled according to their informative importance with our method. Extensive experiments show our sampling augmentation can consistently improve the convergence and boost the performance of various SISR architectures, including EDSR, RCAN, RDN, SRCNN and ESPCN across different scaling factors (x2, x3, x4). Code is available at https://github.com/littlepure2333/SamplingAug

* BMVC 2021

Via

Access Paper or Ask Questions

Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation

Aug 18, 2021

Jiaming Liu, Ming Lu, Kaixin Chen, Xiaoqi Li, Shizun Wang, Zhaoqing Wang, Enhua Wu, Yurong Chen, Chuang Zhang, Ming Wu

Figure 1 for Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation

Figure 2 for Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation

Figure 3 for Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation

Figure 4 for Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation

Abstract:Internet video delivery has undergone a tremendous explosion of growth over the past few years. However, the quality of video delivery system greatly depends on the Internet bandwidth. Deep Neural Networks (DNNs) are utilized to improve the quality of video delivery recently. These methods divide a video into chunks, and stream LR video chunks and corresponding content-aware models to the client. The client runs the inference of models to super-resolve the LR chunks. Consequently, a large number of models are streamed in order to deliver a video. In this paper, we first carefully study the relation between models of different chunks, then we tactfully design a joint training framework along with the Content-aware Feature Modulation (CaFM) layer to compress these models for neural video delivery. {\bf With our method, each video chunk only requires less than $1\% $ of original parameters to be streamed, achieving even better SR performance.} We conduct extensive experiments across various SR backbones, video time length, and scaling factors to demonstrate the advantages of our method. Besides, our method can be also viewed as a new approach of video coding. Our primary experiments achieve better video quality compared with the commercial H.264 and H.265 standard under the same storage cost, showing the great potential of the proposed method. Code is available at:\url{https://github.com/Neural-video-delivery/CaFM-Pytorch-ICCV2021}

* Accepted by ICCV 2021

Via

Access Paper or Ask Questions

ConTNet: Why not use convolution and transformer at the same time?

May 10, 2021

Haotian Yan, Zhe Li, Weijian Li, Changhu Wang, Ming Wu, Chuang Zhang

Figure 1 for ConTNet: Why not use convolution and transformer at the same time?

Figure 2 for ConTNet: Why not use convolution and transformer at the same time?

Figure 3 for ConTNet: Why not use convolution and transformer at the same time?

Figure 4 for ConTNet: Why not use convolution and transformer at the same time?

Abstract:Although convolutional networks (ConvNets) have enjoyed great success in computer vision (CV), it suffers from capturing global information crucial to dense prediction tasks such as object detection and segmentation. In this work, we innovatively propose ConTNet (ConvolutionTransformer Network), combining transformer with ConvNet architectures to provide large receptive fields. Unlike the recently-proposed transformer-based models (e.g., ViT, DeiT) that are sensitive to hyper-parameters and extremely dependent on a pile of data augmentations when trained from scratch on a midsize dataset (e.g., ImageNet1k), ConTNet can be optimized like normal ConvNets (e.g., ResNet) and preserve an outstanding robustness. It is also worth pointing that, given identical strong data augmentations, the performance improvement of ConTNet is more remarkable than that of ResNet. We present its superiority and effectiveness on image classification and downstream tasks. For example, our ConTNet achieves 81.8% top-1 accuracy on ImageNet which is the same as DeiT-B with less than 40% computational complexity. ConTNet-M also outperforms ResNet50 as the backbone of both Faster-RCNN (by 2.6%) and Mask-RCNN (by 3.2%) on COCO2017 dataset. We hope that ConTNet could serve as a useful backbone for CV tasks and bring new ideas for model design

Via

Access Paper or Ask Questions

Adaptive Self-training for Few-shot Neural Sequence Labeling

Oct 07, 2020

Yaqing Wang, Subhabrata Mukherjee, Haoda Chu, Yuancheng Tu, Ming Wu, Jing Gao, Ahmed Hassan Awadallah

Figure 1 for Adaptive Self-training for Few-shot Neural Sequence Labeling

Figure 2 for Adaptive Self-training for Few-shot Neural Sequence Labeling

Figure 3 for Adaptive Self-training for Few-shot Neural Sequence Labeling

Figure 4 for Adaptive Self-training for Few-shot Neural Sequence Labeling

Abstract:Neural sequence labeling is an important technique employed for many Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER), slot tagging for dialog systems and semantic parsing. Large-scale pre-trained language models obtain very good performance on these tasks when fine-tuned on large amounts of task-specific labeled data. However, such large-scale labeled datasets are difficult to obtain for several tasks and domains due to the high cost of human annotation as well as privacy and data access constraints for sensitive user applications. This is exacerbated for sequence labeling tasks requiring such annotations at token-level. In this work, we develop techniques to address the label scarcity challenge for neural sequence labeling models. Specifically, we develop self-training and meta-learning techniques for few-shot training of neural sequence taggers, namely MetaST. While self-training serves as an effective mechanism to learn from large amounts of unlabeled data -- meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels. Extensive experiments on six benchmark datasets including two massive multilingual NER datasets and four slot tagging datasets for task-oriented dialog systems demonstrate the effectiveness of our method with around 10% improvement over state-of-the-art systems for the 10-shot setting.

Via

Access Paper or Ask Questions

GINet: Graph Interaction Network for Scene Parsing

Sep 14, 2020

Tianyi Wu, Yu Lu, Yu Zhu, Chuang Zhang, Ming Wu, Zhanyu Ma, Guodong Guo

Figure 1 for GINet: Graph Interaction Network for Scene Parsing

Figure 2 for GINet: Graph Interaction Network for Scene Parsing

Figure 3 for GINet: Graph Interaction Network for Scene Parsing

Figure 4 for GINet: Graph Interaction Network for Scene Parsing

Abstract:Recently, context reasoning using image regions beyond local convolution has shown great potential for scene parsing. In this work, we explore how to incorporate the linguistic knowledge to promote context reasoning over image regions by proposing a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss). The GI unit is capable of enhancing feature representations of convolution networks over high-level semantics and learning the semantic coherency adaptively to each sample. Specifically, the dataset-based linguistic knowledge is first incorporated in the GI unit to promote context reasoning over the visual graph, then the evolved representations of the visual graph are mapped to each local representation to enhance the discriminated capability for scene parsing. GI unit is further improved by the SC-loss to enhance the semantic representations over the exemplar-based semantic graph. We perform full ablation studies to demonstrate the effectiveness of each component in our approach. Particularly, the proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.

* Accepted by ECCV 2020

Via

Access Paper or Ask Questions

FGSD: A Dataset for Fine-Grained Ship Detection in High Resolution Satellite Images

Mar 15, 2020

Kaiyan Chen, Ming Wu, Jiaming Liu, Chuang Zhang

Figure 1 for FGSD: A Dataset for Fine-Grained Ship Detection in High Resolution Satellite Images

Figure 2 for FGSD: A Dataset for Fine-Grained Ship Detection in High Resolution Satellite Images

Figure 3 for FGSD: A Dataset for Fine-Grained Ship Detection in High Resolution Satellite Images

Figure 4 for FGSD: A Dataset for Fine-Grained Ship Detection in High Resolution Satellite Images

Abstract:Ship detection using high-resolution remote sensing images is an important task, which contribute to sea surface regulation. The complex background and special visual angle make ship detection relies in high quality datasets to a certain extent. However, there is few works on giving both precise classification and accurate location of ships in existing ship detection datasets. To further promote the research of ship detection, we introduced a new fine-grained ship detection datasets, which is named as FGSD. The dataset collects high-resolution remote sensing images that containing ship samples from multiple large ports around the world. Ship samples were fine categorized and annotated with both horizontal and rotating bounding boxes. To further detailed the information of the dataset, we put forward a new representation method of ships' orientation. For future research, the dock as a new class was annotated in the dataset. Besides, rich information of images were provided in FGSD, including the source port, resolution and corresponding GoogleEarth' s resolution level of each image. As far as we know, FGSD is the most comprehensive ship detection dataset currently and it'll be available soon. Some baselines for FGSD are also provided in this paper.

Via

Access Paper or Ask Questions

The Devil is in the Channels: Mutual-Channel Loss for Fine-Grained Image Classification

Feb 11, 2020

Dongliang Chang, Yifeng Ding, Jiyang Xie, Ayan Kumar Bhunia, Xiaoxu Li, Zhanyu Ma, Ming Wu, Jun Guo, Yi-Zhe Song

Figure 1 for The Devil is in the Channels: Mutual-Channel Loss for Fine-Grained Image Classification

Figure 2 for The Devil is in the Channels: Mutual-Channel Loss for Fine-Grained Image Classification

Figure 3 for The Devil is in the Channels: Mutual-Channel Loss for Fine-Grained Image Classification

Figure 4 for The Devil is in the Channels: Mutual-Channel Loss for Fine-Grained Image Classification

Abstract:Key for solving fine-grained image categorization is finding discriminate and local regions that correspond to subtle visual traits. Great strides have been made, with complex networks designed specifically to learn part-level discriminate feature representations. In this paper, we show it is possible to cultivate subtle details without the need for overly complicated network designs or training mechanisms -- a single loss is all it takes. The main trick lies with how we delve into individual feature channels early on, as opposed to the convention of starting from a consolidated feature map. The proposed loss function, termed as mutual-channel loss (MC-Loss), consists of two channel-specific components: a discriminality component and a diversity component. The discriminality component forces all feature channels belonging to the same class to be discriminative, through a novel channel-wise attention mechanism. The diversity component additionally constraints channels so that they become mutually exclusive on spatial-wise. The end result is therefore a set of feature channels that each reflects different locally discriminative regions for a specific class. The MC-Loss can be trained end-to-end, without the need for any bounding-box/part annotations, and yields highly discriminative regions during inference. Experimental results show our MC-Loss when implemented on top of common base networks can achieve state-of-the-art performance on all four fine-grained categorization datasets (CUB-Birds, FGVC-Aircraft, Flowers-102, and Stanford-Cars). Ablative studies further demonstrate the superiority of MC-Loss when compared with other recently proposed general-purpose losses for visual classification, on two different base networks. Code available at https://github.com/dongliangchang/Mutual-Channel-Loss

Via

Access Paper or Ask Questions

C-DLinkNet: considering multi-level semantic features for human parsing

Jan 31, 2020

Yu Lu, Muyan Feng, Ming Wu, Chuang Zhang

Figure 1 for C-DLinkNet: considering multi-level semantic features for human parsing

Figure 2 for C-DLinkNet: considering multi-level semantic features for human parsing

Figure 3 for C-DLinkNet: considering multi-level semantic features for human parsing

Figure 4 for C-DLinkNet: considering multi-level semantic features for human parsing

Abstract:Human parsing is an essential branch of semantic segmentation, which is a fine-grained semantic segmentation task to identify the constituent parts of human. The challenge of human parsing is to extract effective semantic features to resolve deformation and multi-scale variations. In this work, we proposed an end-to-end model called C-DLinkNet based on LinkNet, which contains a new module named Smooth Module to combine the multi-level features in Decoder part. C-DLinkNet is capable of producing competitive parsing performance compared with the state-of-the-art methods with smaller input sizes and no additional information, i.e., achiving mIoU=53.05 on the validation set of LIP dataset.

Via

Access Paper or Ask Questions