Hang Hua

VideoXum: Cross-modal Visual and Textural Summarization of Videos

Mar 21, 2023
Jingyang Lin, Hang Hua, Ming Chen, Yikang Li, Jenhao Hsiao, Chiuman Ho, Jiebo Luo

Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate from a long video both a shortened video clip and the corresponding textual summary, collectively referred to as a cross-modal summary. The generated video clip and textual narrative should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset -- VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After filtering out the videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in the reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model -- VTSUM-BLIP -- to address the challenges of the proposed task. Moreover, we propose a new metric called VT-CLIPScore to help evaluate the semantic consistency of cross-modal summaries. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.
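The abstract names VT-CLIPScore only at a high level; a plausible reading is a CLIP-based similarity between the summary video's frames and the textual summary. The sketch below is an illustrative approximation under that assumption, not the authors' implementation: the frame-sampling strategy, the off-the-shelf CLIP checkpoint, and the mean-cosine aggregation are all assumptions.

```python
# Hedged sketch of a CLIP-style cross-modal consistency score.
# Assumption: the score is approximated by the mean cosine similarity between
# CLIP embeddings of sampled summary frames and the text summary.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def vt_clip_style_score(frame_paths, text_summary):
    """Average image-text cosine similarity over sampled summary frames."""
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(text=[text_summary], images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    image_emb = torch.nn.functional.normalize(image_emb, dim=-1)
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    return (image_emb @ text_emb.T).mean().item()
```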

PromptCap: Prompt-Guided Task-Aware Image Captioning

Nov 15, 2022
Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo

Image captioning aims to describe an image with a natural language sentence, allowing powerful language models to understand images. The framework of combining image captioning with language models has been successful on various vision-language tasks. However, an image contains much more information than a single sentence, leading to underspecification of which visual entities should be described in the caption. For example, when performing visual question answering (VQA), generic image captions often miss visual details that are essential for the language model to answer correctly. To address this challenge, we propose PromptCap, a captioning model that takes a natural-language prompt to control the content of the generated caption. The prompt contains a question that the caption should help to answer, and also supports auxiliary text inputs such as scene text within the image itself. To finetune a general image captioning model for prompt-guided captioning, we propose a pipeline to synthesize and filter training examples with GPT-3 and existing VQA datasets. For evaluation, we start with an existing pipeline in which a language model is prompted with image captions to carry out VQA. With the same language model, higher QA accuracy indicates that our generated captions are more relevant to the question prompts. PromptCap outperforms generic captions by a large margin on a variety of VQA tasks and achieves state-of-the-art accuracy of 58.8% on OK-VQA and 58.0% on A-OKVQA. Zero-shot experiments on WebQA show that PromptCap generalizes well to unseen domains.
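The evaluation pipeline described above -- caption the image conditioned on the question, then let a language model answer from the caption -- can be sketched as pure control flow. Both callables below are hypothetical placeholders: the actual PromptCap checkpoint and the GPT-3 prompt format are not specified here, so this only illustrates how prompt-guided captioning plugs into caption-based VQA.

```python
# Hedged sketch of the caption-then-answer flow described in the abstract.
# `prompt_guided_caption` and `answer_with_lm` are hypothetical stand-ins for
# the actual PromptCap model and the language-model QA prompt.

def prompt_guided_caption(image, question: str) -> str:
    """Placeholder: a captioner that conditions on the question prompt."""
    raise NotImplementedError("plug in a prompt-guided captioning model here")

def answer_with_lm(question: str, caption: str) -> str:
    """Placeholder: a language model prompted with the caption as context."""
    raise NotImplementedError("plug in an in-context QA language model here")

def vqa_via_captioning(image, question: str) -> str:
    # 1) Describe the image with the question as the controlling prompt,
    #    so question-relevant details survive in the caption.
    caption = prompt_guided_caption(image, question)
    # 2) Answer the question from the caption alone, as in the existing
    #    caption-based VQA evaluation pipeline the abstract refers to.
    return answer_with_lm(question, caption)
```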

Fine-tuning Pre-trained Language Models with Noise Stability Regularization

Jun 12, 2022
Hang Hua, Xingjian Li, Dejing Dou, Cheng-Zhong Xu, Jiebo Luo

The advent of large-scale pre-trained language models has contributed greatly to the recent progress in natural language processing. Many state-of-the-art language models are first trained on a large text corpus and then fine-tuned on downstream tasks. Despite its recent success and wide adoption, fine-tuning a pre-trained language model often suffers from overfitting, which leads to poor generalizability due to the extremely high complexity of the model and the limited training samples from downstream tasks. To address this problem, we propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR). Specifically, we inject standard Gaussian noise or in-manifold noise and regularize the hidden representations of the fine-tuned model. We first provide theoretical analyses to support the efficacy of our method. We then demonstrate the advantages of the proposed method over other state-of-the-art algorithms, including L2-SP, Mixout, and SMART. While these previous works only verify their methods on relatively simple text classification tasks, we also verify our method on question answering tasks, where the target problem is much more difficult and more training examples are available. Furthermore, extensive experimental results indicate that the proposed algorithm not only enhances the in-domain performance of the language models but also improves their generalization performance on out-of-domain data.
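As a rough illustration of the regularizer described above (not the authors' code), the sketch below injects Gaussian noise into an intermediate hidden representation and penalizes the L2 distance between the clean and noisy outputs, adding that penalty to the task loss. The toy two-block encoder and the single injection point are simplifying assumptions.

```python
# Minimal PyTorch sketch of a noise stability regularizer.
# Assumptions: a toy encoder split into "lower" and "upper" blocks, Gaussian
# noise injected once at the split point, and an L2 stability penalty.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    def __init__(self, dim=128, num_classes=2):
        super().__init__()
        self.lower = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.upper = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                   nn.Linear(dim, num_classes))

    def forward(self, x, noise_std=0.0):
        h = self.lower(x)
        if noise_std > 0:
            h = h + noise_std * torch.randn_like(h)  # inject Gaussian noise
        return self.upper(h)

def lnsr_style_loss(model, x, y, noise_std=0.1, reg_weight=1.0):
    """Task loss plus an L2 penalty on how much injected noise moves the output."""
    clean = model(x)
    noisy = model(x, noise_std=noise_std)
    task = nn.functional.cross_entropy(clean, y)
    stability = (clean - noisy).pow(2).sum(dim=-1).mean()
    return task + reg_weight * stability

model = ToyEncoder()
x, y = torch.randn(8, 128), torch.randint(0, 2, (8,))
loss = lnsr_style_loss(model, x, y)
loss.backward()
```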

* Preprint. Under Review 
Noise Stability Regularization for Improving BERT Fine-tuning

Jul 10, 2021
Hang Hua, Xingjian Li, Dejing Dou, Cheng-Zhong Xu, Jiebo Luo

Fine-tuning pre-trained language models such as BERT has become a common practice dominating leaderboards across various NLP tasks. Despite its recent success and wide adoption, this process is unstable when only a small number of training samples are available, and its brittleness is often reflected in sensitivity to random seeds. In this paper, we propose to tackle this problem based on the noise stability property of deep nets, which has been investigated in recent literature (Arora et al., 2018; Sanyal et al., 2020). Specifically, we introduce a novel and effective regularization method to improve fine-tuning on NLP tasks, referred to as Layer-wise Noise Stability Regularization (LNSR). We extend the theory of adding noise to the input and prove that our method yields a more stable regularization effect. We provide supporting evidence by experimentally confirming that well-performing models show low sensitivity to noise and that fine-tuning with LNSR exhibits clearly higher generalizability and stability. Furthermore, our method also demonstrates advantages over other state-of-the-art algorithms, including L2-SP (Li et al., 2018), Mixout (Lee et al., 2020), and SMART (Jiang et al., 2020).
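The claim that well-performing fine-tuned models are less sensitive to injected noise suggests a simple diagnostic: perturb a hidden representation and measure how much the output moves. The sketch below assumes a model with the same `noise_std` forward interface as the toy encoder in the previous entry's sketch; the relative-change metric and averaging over noise draws are assumptions, not the paper's exact protocol.

```python
# Hedged sketch: output sensitivity to Gaussian noise injected at a hidden
# layer, averaged over several noise draws (an assumed measurement protocol).
import torch

@torch.no_grad()
def noise_sensitivity(model, x, noise_std=0.1, num_draws=10):
    """Mean relative change of the output when hidden noise is injected."""
    clean = model(x)
    changes = []
    for _ in range(num_draws):
        noisy = model(x, noise_std=noise_std)
        rel = (noisy - clean).norm(dim=-1) / clean.norm(dim=-1).clamp_min(1e-8)
        changes.append(rel.mean())
    return torch.stack(changes).mean().item()
```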

* Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 
Controllable Unsupervised Text Attribute Transfer via Editing Entangled Latent Representation

May 30, 2019
Ke Wang, Hang Hua, Xiaojun Wan

Unsupervised text attribute transfer automatically transforms a text to alter a specific attribute (e.g., sentiment) without using any parallel data, while preserving its attribute-independent content. The dominant approaches try to model the content-independent attribute separately, e.g., by learning separate attribute representations or using multiple attribute-specific decoders. However, this can make it inflexible to control the degree of transfer or to transfer over multiple aspects at the same time. To address these problems, we propose a more flexible unsupervised text attribute transfer framework that replaces explicit attribute modeling with minimal editing of latent representations based on an attribute classifier. Specifically, we first propose a Transformer-based autoencoder to learn an entangled latent representation for a discrete text, then we cast the attribute transfer task as an optimization problem and propose the Fast-Gradient-Iterative-Modification algorithm to edit the latent representation until it conforms to the target attribute. Extensive experimental results demonstrate that our model achieves very competitive performance on three public datasets. Furthermore, we show that our model can not only freely control the degree of transfer but also transfer over multiple aspects at the same time.
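The iterative editing step described above -- nudging the latent code along the gradient of an attribute classifier until the target attribute is predicted -- can be sketched as follows. The classifier interface, step size, step count, and confidence threshold are illustrative assumptions in the spirit of Fast-Gradient-Iterative-Modification, not the paper's exact hyperparameters.

```python
# Hedged sketch of gradient-based latent editing: move the latent code toward
# the target attribute predicted by a classifier, then hand it to the decoder.
import torch
import torch.nn.functional as F

def edit_latent(z, attr_classifier, target_attr, step_size=1.0,
                max_steps=30, threshold=0.9):
    """Iteratively modify z until the classifier is confident in the target attribute."""
    z = z.clone().detach().requires_grad_(True)
    for _ in range(max_steps):
        logits = attr_classifier(z)
        probs = F.softmax(logits, dim=-1)
        if probs[..., target_attr].min() > threshold:
            break  # every example in the batch has reached the target attribute
        target = torch.full(logits.shape[:-1], target_attr, dtype=torch.long)
        loss = F.cross_entropy(logits, target)
        grad, = torch.autograd.grad(loss, z)
        z = (z - step_size * grad).detach().requires_grad_(True)
    return z.detach()

# Example with a toy classifier (hypothetical shapes):
# classifier = torch.nn.Linear(64, 2); z = torch.randn(4, 64)
# z_edited = edit_latent(z, classifier, target_attr=1)
```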

* Under review 