Liping Jing

Divert More Attention to Vision-Language Object Tracking

Jul 19, 2023
Mingzhe Guo, Zhipeng Zhang, Liping Jing, Haibin Ling, Heng Fan

Multimodal vision-language (VL) learning has noticeably advanced the trend toward generic intelligence, owing to the emergence of large foundation models. However, tracking, as a fundamental vision problem, has surprisingly benefited little from the recent flourishing of VL learning. We argue that the reasons are two-fold: the lack of large-scale vision-language annotated videos, and the ineffective vision-language interaction learning of current works. These issues motivate us to design a more effective vision-language representation for tracking, while constructing a large database with language annotations for model learning. In particular, in this paper, we first propose a general attribute annotation strategy to annotate videos in six popular tracking benchmarks, contributing a large-scale vision-language tracking database with more than 23,000 videos. We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose cores are the proposed asymmetric architecture search and modality mixer (ModaMixer). To further improve the VL representation, we introduce a contrastive loss to align the different modalities. To thoroughly demonstrate the effectiveness of our method, we integrate the proposed framework into three tracking methods with different designs, i.e., the CNN-based SiamCAR, the Transformer-based OSTrack, and the hybrid-structure TransT. The experiments demonstrate that our framework can significantly improve all baselines on six benchmarks. Beyond empirical results, we theoretically analyze our approach to show its rationality. By revealing the potential of VL representation, we expect the community to divert more attention to VL tracking, and we hope to open more possibilities for future tracking with diversified multimodal messages.

* 16 pages, 9 figures 
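The contrastive alignment of modalities mentioned in the abstract can be illustrated with a symmetric InfoNCE-style objective over batch-paired embeddings. This is a minimal sketch of the general idea, not the paper's implementation; all function and parameter names are ours.

```python
import numpy as np

def contrastive_alignment_loss(vis, lang, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired vision/language
    embeddings (row i of `vis` is paired with row i of `lang`).
    Illustrative sketch, not the paper's actual implementation."""
    vis = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    lang = lang / np.linalg.norm(lang, axis=1, keepdims=True)
    logits = vis @ lang.T / temperature            # pairwise cosine similarities
    labels = np.arange(len(vis))                   # matched pairs lie on the diagonal

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)    # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average vision-to-language and language-to-vision directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls matched vision/language pairs together while pushing mismatched pairs apart, which is one common way to align two modalities in a shared space.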

Recognizable Information Bottleneck

Apr 28, 2023
Yilin Lyu, Xin Liu, Mingyang Song, Xinyue Wang, Yaxin Peng, Tieyong Zeng, Liping Jing

Information Bottlenecks (IBs) learn representations that generalize to unseen data through information compression. However, existing IBs are practically unable to guarantee generalization in real-world scenarios due to their vacuous generalization bounds. The recent PAC-Bayes IB uses information complexity instead of information compression to establish a connection with the mutual-information generalization bound. However, it requires computing expensive second-order curvature, which hinders its practical application. In this paper, we establish a connection between the recognizability of representations and the recent functional conditional mutual information (f-CMI) generalization bound, which is significantly easier to estimate. On this basis, we propose a Recognizable Information Bottleneck (RIB), which regularizes the recognizability of representations through a recognizability critic optimized by density ratio matching under the Bregman divergence. Extensive experiments on several commonly used datasets demonstrate the effectiveness of the proposed method in regularizing the model and estimating the generalization gap.

* 12 pages. To appear in IJCAI 2023 
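Density ratio matching under a Bregman divergence, used above to optimize the recognizability critic, can be sketched with its simplest instance: least-squares importance fitting (uLSIF), where the Bregman generator is the squared loss. The Gaussian basis functions and all names below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def lsif_ratio(x_nu, x_de, centers, width=1.0, lam=1e-3):
    """Least-squares density-ratio estimation (uLSIF), one instance of
    density ratio matching under a Bregman divergence. Estimates
    r(x) = p_nu(x) / p_de(x) with a linear model over Gaussian basis
    functions; the closed-form solution minimizes the squared matching loss."""
    def phi(x):
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * width ** 2))      # (n, n_centers) basis matrix

    P_de, P_nu = phi(x_de), phi(x_nu)
    H = P_de.T @ P_de / len(x_de)                  # E_de[phi phi^T]
    h = P_nu.mean(axis=0)                          # E_nu[phi]
    w = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return lambda x: phi(x) @ w                    # the fitted ratio model
```

For example, fitting the ratio between a narrow and a wide Gaussian recovers a function that is large near the shared mode and small in the tails.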

Is ChatGPT A Good Keyphrase Generator? A Preliminary Study

Mar 23, 2023
Mingyang Song, Haiyun Jiang, Shuming Shi, Songfang Yao, Shilong Lu, Yi Feng, Huafeng Liu, Liping Jing

The emergence of ChatGPT has recently garnered significant attention from the computational linguistics community. To assess its capabilities as a keyphrase generator, we conduct a preliminary evaluation of ChatGPT on the keyphrase generation task. We evaluate its performance in various aspects, including keyphrase generation prompts, keyphrase generation diversity, multi-domain keyphrase generation, and long document understanding. Our evaluation is based on six benchmark datasets; we adopt the prompt suggested by OpenAI and extend it to six candidate prompts. We find that ChatGPT performs exceptionally well with all six candidate prompts, with minor performance differences observed across the datasets. Based on our findings, we conclude that ChatGPT has great potential for keyphrase generation. However, we also discover that ChatGPT still faces challenges when generating absent keyphrases. Finally, we present some limitations and future extensions of this report.

* Technical Report, 7 pages 
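Keyphrase generation is conventionally scored by exact-match F1@k against the gold phrases. A minimal helper illustrating that standard metric (the report's exact matching protocol may differ, e.g. in stemming):

```python
def f1_at_k(predicted, gold, k=5):
    """Exact-match F1@k, the standard keyphrase evaluation metric.
    Phrases are compared after lowercasing and whitespace normalization;
    only the top-k predictions count toward precision."""
    norm = lambda p: " ".join(p.lower().split())
    preds = [norm(p) for p in predicted[:k]]
    golds = {norm(g) for g in gold}
    tp = sum(p in golds for p in preds)            # true positives among top-k
    if tp == 0:
        return 0.0
    precision = tp / len(preds)
    recall = tp / len(golds)
    return 2 * precision * recall / (precision + recall)
```

For instance, predicting ["neural network", "deep learning", "foo"] against gold ["Deep Learning", "neural network"] gives precision 2/3, recall 1, and F1 = 0.8.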

Pairwise Instance Relation Augmentation for Long-tailed Multi-label Text Classification

Nov 19, 2022
Lin Xiao, Pengyu Xu, Liping Jing, Xiangliang Zhang

Multi-label text classification (MLTC) is one of the key tasks in natural language processing. It aims to assign multiple target labels to one document. Due to the uneven popularity of labels, the number of documents per label follows a long-tailed distribution in most cases. It is much more challenging to learn classifiers for data-scarce tail labels than for data-rich head labels. The main reason is that head labels usually carry sufficient information, e.g., a large intra-class diversity, while tail labels do not. In response, we propose a Pairwise Instance Relation Augmentation Network (PIRAN) to augment tail-label documents and balance tail and head labels. PIRAN consists of a relation collector and an instance generator. The former extracts pairwise document relations from head labels. Taking these relations as perturbations, the latter generates new document instances in the high-level feature space around the limited tail-label instances. Meanwhile, two regularizers (diversity and consistency) are designed to constrain the generation process. The consistency regularizer encourages the variance of tail labels to approach that of head labels, further balancing the whole dataset, while the diversity regularizer ensures that the generated instances are diverse, avoiding redundancy. Extensive experimental results on three benchmark datasets demonstrate that PIRAN consistently outperforms the SOTA methods and dramatically improves the performance of tail labels.
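The collector-generator idea can be sketched in feature space: differences between random pairs of head-label features serve as perturbations applied to tail-label features, transplanting head-label intra-class variation onto the tail. The names, the scaling factor, and the random pairing below are our assumptions, not the paper's architecture.

```python
import numpy as np

def augment_tail_features(tail_feats, head_feats, n_new, scale=1.0, seed=0):
    """Sketch of pairwise-relation augmentation: pairwise differences of
    head-label features act as perturbations added to tail-label features,
    producing new synthetic tail instances in feature space."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(head_feats), n_new)
    j = rng.integers(0, len(head_feats), n_new)
    t = rng.integers(0, len(tail_feats), n_new)
    relations = head_feats[i] - head_feats[j]   # intra-class variation of head labels
    return tail_feats[t] + scale * relations    # perturbed copies of tail instances
```

The paper's diversity and consistency regularizers would then constrain these generated instances; here we show only the generation step itself.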

Approximate better, Attack stronger: Adversarial Example Generation via Asymptotically Gaussian Mixture Distribution

Sep 24, 2022
Zhengwei Fang, Rui Wang, Tao Huang, Liping Jing

Strong adversarial examples are the key to evaluating and enhancing the robustness of deep neural networks. Popular adversarial attack algorithms maximize a non-concave loss function using gradient ascent. However, the performance of each attack is usually sensitive to, for instance, minor image transformations, due to insufficient information (only one input example, few white-box source models, and unknown defense strategies). Hence, the crafted adversarial examples are prone to overfitting the source model, which limits their transferability to unidentified architectures. In this paper, we propose Multiple Asymptotically Normal Distribution Attacks (MultiANDA), a novel method that explicitly characterizes adversarial perturbations from a learned distribution. Specifically, we approximate the posterior distribution over the perturbations by taking advantage of the asymptotic normality property of stochastic gradient ascent (SGA), and then apply an ensemble strategy on this procedure to estimate a Gaussian mixture model for better exploration of the potential optimization space. Drawing perturbations from the learned distribution allows us to generate any number of adversarial examples for each input. The approximated posterior essentially describes the stationary distribution of the SGA iterations, which captures the geometric information around the local optimum. Thus, samples drawn from the distribution reliably maintain transferability. Through extensive experiments on seven normally trained and seven defense models, our proposed method outperforms nine state-of-the-art black-box attacks on deep learning models with or without defenses.
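The asymptotic-normality idea can be illustrated on a toy objective: run stochastic gradient ascent, fit a Gaussian to the post-burn-in iterates, and then draw any number of perturbations from it. This is a conceptual sketch under our own assumptions; the paper operates on image perturbations and adds an ensemble over runs to form a Gaussian mixture.

```python
import numpy as np

def anda_perturbations(grad_fn, dim, n_samples, steps=200, lr=0.05, seed=0):
    """Toy sketch of sampling perturbations from the SGA stationary
    distribution: collect stochastic-gradient-ascent iterates, fit a
    Gaussian to the post-burn-in tail, then sample from it.
    grad_fn(delta, rng) returns a stochastic gradient of the attack loss."""
    rng = np.random.default_rng(seed)
    delta = np.zeros(dim)
    iterates = []
    for _ in range(steps):
        delta = delta + lr * grad_fn(delta, rng)    # stochastic gradient ascent
        iterates.append(delta.copy())
    tail = np.array(iterates[steps // 2:])          # discard burn-in iterates
    mean = tail.mean(0)
    cov = np.cov(tail.T) + 1e-6 * np.eye(dim)       # regularize for sampling
    return rng.multivariate_normal(mean, cov, n_samples)
```

Because the fitted Gaussian summarizes the whole ascent trajectory rather than its last point, each drawn sample is a plausible perturbation near the local optimum, which is the intuition behind the transferability claim.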

Divert More Attention to Vision-Language Tracking

Jul 03, 2022
Mingzhe Guo, Zhipeng Zhang, Heng Fan, Liping Jing

Relying on Transformers for complex visual feature learning, object tracking has witnessed a new standard for state-of-the-art (SOTA) performance. However, this advancement comes with larger training data and longer training periods, making tracking increasingly expensive. In this paper, we demonstrate that reliance on the Transformer is not necessary: pure ConvNets are still competitive, and can even be better, more economical, and friendlier, in achieving SOTA tracking. Our solution is to unleash the power of multimodal vision-language (VL) tracking, simply using ConvNets. The essence lies in learning novel unified-adaptive VL representations with our modality mixer (ModaMixer) and asymmetrical ConvNet search. We show that our unified-adaptive VL representation, learned purely with ConvNets, is a simple yet strong alternative to Transformer visual features, improving a CNN-based Siamese tracker by a remarkable 14.5% in SUC on the challenging LaSOT (50.7% → 65.2%), even outperforming several Transformer-based SOTA trackers. Beyond empirical results, we theoretically analyze our approach to demonstrate its effectiveness. By revealing the potential of VL representation, we expect the community to divert more attention to VL tracking, and we hope to open more possibilities for future tracking beyond the Transformer. Code and models will be released at https://github.com/JudasDie/SOTS.

* 18 pages, 7 figures 
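The modality-mixing idea can be sketched as channel-wise reweighting: the language embedding is projected to per-channel gates that reweight the vision feature map. The sigmoid gating and linear projection below are our illustrative assumptions, not the released ModaMixer code.

```python
import numpy as np

def moda_mixer(vision_feat, language_feat, w, b):
    """Conceptual sketch of a modality mixer: a language embedding is
    projected to per-channel weights (via w, b) that reweight a vision
    feature map of shape (C, H, W), channel-attention style."""
    gate = 1.0 / (1.0 + np.exp(-(w @ language_feat + b)))  # (C,) channel gates
    return vision_feat * gate[:, None, None]               # broadcast over H, W
```

With a zero projection every gate is 0.5, i.e. a uniform down-weighting; a learned projection lets the language description selectively amplify channels that respond to the described target.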

Hyperbolic Relevance Matching for Neural Keyphrase Extraction

May 04, 2022
Mingyang Song, Yi Feng, Liping Jing

Keyphrase extraction is a fundamental task in natural language processing and information retrieval that aims to extract a set of phrases carrying the important information of a source document. Identifying important keyphrases is the central component of the keyphrase extraction task, and its main challenges are how to represent information comprehensively and discriminate importance accurately. To address these issues, we design a new hyperbolic matching model (HyperMatch) that represents phrases and documents in the same hyperbolic space and explicitly estimates the phrase-document relevance via the Poincaré distance as the importance score of each phrase. Specifically, to capture hierarchical syntactic and semantic structure information, HyperMatch takes advantage of the hidden representations in multiple layers of RoBERTa and integrates them into word embeddings via an adaptive mixing layer. Meanwhile, considering the hierarchical structure hidden in the document, HyperMatch embeds both phrases and documents in the same hyperbolic space via a hyperbolic phrase encoder and a hyperbolic document encoder. This strategy further enhances the estimation of phrase-document relevance thanks to the good properties of hyperbolic space. In this setting, keyphrase extraction can be cast as a matching problem and effectively implemented by minimizing a hyperbolic margin-based triplet loss. Extensive experiments conducted on six benchmarks demonstrate that HyperMatch outperforms the state-of-the-art baselines.

* 12 pages, 3 figures, Accepted by NAACL 2022 (main conference) 
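The Poincaré distance used as the relevance score is the standard distance of the Poincaré ball model; for vectors in the open unit ball it can be computed as follows (the formula is standard, the helper itself is ours; its use as a phrase importance score follows the abstract).

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Distance between two points in the Poincare ball model of
    hyperbolic space: d(u, v) = arcosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2))).
    Norms are clipped just below 1 to stay inside the open ball."""
    uu = np.clip((u * u).sum(), 0, 1 - eps)
    vv = np.clip((v * v).sum(), 0, 1 - eps)
    duv = ((u - v) ** 2).sum()
    return np.arccosh(1 + 2 * duv / ((1 - uu) * (1 - vv)))
```

Distances blow up near the boundary of the ball, which is the property that lets hyperbolic embeddings encode hierarchy: general concepts sit near the origin and specific ones near the boundary.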

Learning Target-aware Representation for Visual Tracking via Informative Interactions

Jan 07, 2022
Mingzhe Guo, Zhipeng Zhang, Heng Fan, Liping Jing, Yilin Lyu, Bing Li, Weiming Hu

We introduce a novel backbone architecture to improve the target-perception ability of feature representations for tracking. Specifically, we observe that de facto frameworks perform feature matching simply using the backbone outputs for target localization, so there is no direct feedback from the matching module to the backbone network, especially its shallow layers. More concretely, only the matching module can directly access the target information (in the reference frame), while the representation learning of the candidate frame is blind to the reference target. As a consequence, the accumulation of target-irrelevant interference in the shallow stages may degrade the feature quality of deeper layers. In this paper, we approach the problem from a different angle by conducting multiple branch-wise interactions inside the Siamese-like backbone networks (InBN). At the core of InBN is a general interaction modeler (GIM) that injects prior knowledge of the reference image into different stages of the backbone network, leading to better target perception and more robust distractor resistance in the candidate feature representation, with negligible computational cost. The proposed GIM module and InBN mechanism are general and applicable to different backbone types, including CNNs and Transformers, as evidenced by our extensive experiments on multiple benchmarks. In particular, the CNN version (based on SiamCAR) improves the baseline with absolute SUC gains of 3.2/6.9 on LaSOT/TNL2K, respectively. The Transformer version obtains SUC scores of 65.7/52.0 on LaSOT/TNL2K, on par with recent state-of-the-art methods. Code and models will be released.

* 9 pages, 6 figures 
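The branch-wise interaction can be caricatured as a reference-conditioned modulation of candidate features at an intermediate backbone stage, so that shallow layers already "see" the target. The global pooling and cosine-similarity gating below are our illustrative choices, not the actual GIM design.

```python
import numpy as np

def inject_reference(candidate, reference):
    """Toy sketch of injecting reference-target knowledge into a
    candidate feature map (C, H, W): a globally pooled reference
    descriptor yields a per-position target-similarity map that
    amplifies target-like positions."""
    ref = reference.mean(axis=(1, 2))                       # (C,) global target descriptor
    ref = ref / (np.linalg.norm(ref) + 1e-9)
    c = candidate / (np.linalg.norm(candidate, axis=0, keepdims=True) + 1e-9)
    sim = np.tensordot(ref, c, axes=(0, 0))                 # (H, W) cosine similarity map
    return candidate * (1 + np.maximum(sim, 0))             # amplify target-like positions
```

Positions whose channel pattern matches the reference descriptor are boosted, while unrelated positions pass through unchanged, mimicking the intended suppression of target-irrelevant interference.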

Topic-Aware Encoding for Extractive Summarization

Dec 17, 2021
Mingyang Song, Liping Jing

Document summarization provides an instrument for faster understanding of a collection of text documents and has several real-life applications. With the growth of online text data, numerous summarization models have been proposed recently. The Sequence-to-Sequence (Seq2Seq) based neural summarization model is the most widely used in the summarization field due to its high performance, because both semantic and structural information in the text are adequately considered during encoding. However, existing extractive summarization models pay little attention to using the central topic information to assist summary generation, so they cannot ensure that the generated summary stays on the primary topic. A lengthy document can span several topics, and a single summary cannot do justice to all of them. Therefore, the key to generating a high-quality summary is determining the central topic and building the summary around it, especially for a long document. We propose a topic-aware encoding for document summarization to deal with this issue. The model effectively combines syntactic-level and topic-level information to build a comprehensive sentence representation. Specifically, a neural topic model is added to the neural sentence-level representation learning to adequately account for the central topic information when capturing the critical content of the original document. Experimental results on three public datasets show that our model outperforms state-of-the-art models.

* 4 pages, 0 figures 

Reinforced Abstractive Summarization with Adaptive Length Controlling

Dec 17, 2021
Mingyang Song, Yi Feng, Liping Jing

Document summarization, as a fundamental task in natural language generation, aims to generate a short and coherent summary for a given document. Controllable summarization, especially of length, is an important issue for practical applications, in particular how to trade off the length constraint against information integrity. In this paper, we propose an Adaptive Length Controlling Optimization (ALCO) method to train a two-stage abstractive summarization model via reinforcement learning. ALCO incorporates the length constraint into the sentence-extraction stage to penalize overlength extracted sentences. Meanwhile, a saliency estimation mechanism is designed to preserve the salient information in the generated sentences. A series of experiments have been conducted on the widely used benchmark dataset CNN/Daily Mail. The results show that ALCO performs better than the popular baselines in terms of both length controllability and content preservation.

* 9 pages, 3 figures 
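At its core, the length-controlling idea is reward shaping: a content-quality score (e.g. ROUGE against the reference) minus a penalty that activates only beyond the length budget. The linear penalty form and its coefficient below are our assumptions, not the paper's exact reward.

```python
def length_controlled_reward(quality, length, budget, penalty=0.1):
    """Sketch of a length-controlled RL reward: the content-quality score
    is reduced by a penalty proportional to how far the output exceeds
    the length budget; outputs within budget are not penalized."""
    return quality - penalty * max(0, length - budget)
```

A policy trained against such a reward is pushed to keep salient content while staying within the length constraint, which is the trade-off the abstract describes.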