Qin Jin

Explore and Tell: Embodied Visual Captioning in 3D Environments

Aug 21, 2023
Anwen Hu, Shizhe Chen, Liang Zhang, Qin Jin

While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints. Specifically, starting at a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with the Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task. The navigator predicts which actions to take in the environment, while the captioner generates a paragraph description based on the whole navigation trajectory. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, code, and models are available at https://aim3-ruc.github.io/ExploreAndTell.
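A minimal sketch of the navigate-then-caption cascade described in the abstract; the class names and method signatures below are illustrative assumptions, not the released CaBOT implementation:

```python
# Illustrative cascade for Embodied Captioning; the Navigator/Captioner
# interfaces are assumptions for exposition, not the released CaBOT code.

def embodied_captioning(env, navigator, captioner, max_steps=20):
    """Roll out a navigation trajectory, then caption the whole trajectory."""
    observation = env.reset()            # start from a random viewpoint
    trajectory = [observation]
    for _ in range(max_steps):
        action = navigator.predict(trajectory)   # e.g. move / rotate / stop
        if action == "STOP":
            break
        observation = env.step(action)           # observation at the new viewpoint
        trajectory.append(observation)
    # The captioner conditions on every visited viewpoint, not just the last one.
    return captioner.describe(trajectory)
```

The key design point is that the captioner consumes the full trajectory gathered during exploration rather than a single (possibly suboptimal) view.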

* 12 pages; 10 figures; ICCV 2023 

A Systematic Exploration of Joint-training for Singing Voice Synthesis

Aug 05, 2023
Yuning Wu, Yifeng Yu, Jiatong Shi, Tao Qian, Qin Jin

There has been growing interest in using end-to-end acoustic models for singing voice synthesis (SVS). Typically, these models require an additional vocoder to transform the generated acoustic features into the final waveform. However, since the acoustic model and the vocoder are not jointly optimized, a gap can exist between the two models, leading to suboptimal performance. Although a similar problem has been addressed in TTS systems by joint training or by replacing acoustic features with a latent representation, adopting the corresponding approaches for SVS is not straightforward, and how to improve the joint training of SVS systems has not been well explored. In this paper, we conduct a systematic investigation of how to better perform joint training of an acoustic model and a vocoder for SVS. We carry out extensive experiments and demonstrate that our joint-training strategy outperforms baselines, achieving more stable performance across different datasets while also increasing the interpretability of the entire framework.
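As a hedged sketch of what joint optimization of the two stages can look like in PyTorch (the loss terms, weights, and model interfaces are illustrative assumptions, not the paper's exact recipe), the point is that the waveform-level loss back-propagates through the vocoder into the acoustic model instead of training the stages separately:

```python
import torch

def joint_training_step(acoustic_model, vocoder, batch, optimizer,
                        lambda_feat=1.0, lambda_wave=1.0):
    """One joint update; interfaces and loss weights are illustrative."""
    optimizer.zero_grad()
    # Acoustic model: musical score / lyrics -> intermediate features (e.g. mel).
    pred_feats = acoustic_model(batch["score"])
    feat_loss = torch.nn.functional.l1_loss(pred_feats, batch["mel"])
    # The vocoder consumes the *predicted* features, so the waveform loss also
    # back-propagates into the acoustic model -- the point of joint training.
    pred_wave = vocoder(pred_feats)
    wave_loss = torch.nn.functional.l1_loss(pred_wave, batch["wave"])
    loss = lambda_feat * feat_loss + lambda_wave * wave_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```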


Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences

Jul 31, 2023
Dingyi Yang, Hongyu Chen, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Qin Jin

Stylized visual captioning aims to generate image or video descriptions with specific styles, making them more attractive and emotionally appropriate. One major challenge with this task is the lack of paired stylized captions for visual content, so most existing works focus on unsupervised methods that do not rely on parallel datasets. However, these approaches still require training with sufficient examples that have style labels, and the generated captions are limited to predefined styles. To address these limitations, we explore the problem of Few-Shot Stylized Visual Captioning, which aims to generate captions in any desired style, using only a few examples as guidance during inference, without requiring further training. We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module. Our two-step training scheme proceeds as follows: first, we train a style extractor to generate style representations on an unlabeled text-only corpus; then, we freeze the extractor and enable our decoder to generate stylized descriptions based on the extracted style vector and projected visual content vectors. During inference, our model can generate the desired stylized captions by deriving the style representation from user-supplied examples. On automatic evaluations for few-shot sentimental visual captioning, our approach outperforms state-of-the-art methods and is comparable to models fully trained on labeled style corpora. Human evaluations further confirm our model's ability to handle multiple styles.
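A minimal sketch of the inference flow described above, with the module interfaces assumed for illustration (not the authors' released code): a style vector is derived from a handful of user-supplied sentences and fed to the decoder together with projected visual features.

```python
import torch

@torch.no_grad()
def few_shot_stylized_caption(style_extractor, visual_projector, decoder,
                              style_examples, image_features):
    """Illustrative inference loop; module interfaces are assumptions."""
    # Derive a single style representation from a few stylized example sentences.
    style_vecs = [style_extractor(sentence) for sentence in style_examples]
    style_vec = torch.stack(style_vecs).mean(dim=0)
    # Project visual content into the language model's embedding space.
    visual_vecs = visual_projector(image_features)
    # The decoder conditions on both the style vector and the visual vectors.
    return decoder.generate(style=style_vec, visual=visual_vecs)
```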

* 9 pages, 6 figures 

No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection

Jul 20, 2023
Qi Zhang, Sipeng Zheng, Qin Jin

Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video. A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)": the lower the SNR, the worse the performance. Prior works have addressed this challenge with sophisticated techniques. In this paper, we propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection. The multi-scale neighboring attention restricts each video token to aggregate visual contexts only from its neighbors, enabling the extraction of the most distinguishing information with multi-scale feature hierarchies despite the high noise ratio. The zoom-in boundary detection then focuses on local discrimination among the selected top candidates for fine-grained grounding adjustment. With an end-to-end training strategy, our model achieves competitive performance on different TVG benchmarks, while also offering faster inference and fewer model parameters thanks to its lightweight architecture.
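The "neighboring attention" idea, restricting each video token to attend only within a local window, can be illustrated with a banded attention mask; the window sizes, token counts, and masking convention below are assumptions for exposition, not the paper's exact configuration.

```python
import torch

def neighboring_attention_mask(num_tokens: int, window: int) -> torch.Tensor:
    """Boolean mask where True marks pairs allowed to attend (|i - j| <= window)."""
    idx = torch.arange(num_tokens)
    return (idx[None, :] - idx[:, None]).abs() <= window

# Multi-scale variant: apply the same local restriction at several feature
# resolutions, with wider windows at coarser levels (numbers are illustrative).
masks = {scale: neighboring_attention_mask(128 // (2 ** s), window=2 ** (s + 1))
         for s, scale in enumerate(["fine", "mid", "coarse"])}
```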


Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation

Jun 27, 2023
Zihao Yue, Anwen Hu, Liang Zhang, Qin Jin

Image captioning aims to describe visual content in natural language. As "a picture is worth a thousand words", there could be various correct descriptions for an image. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction mismatches the label. For instance, when the model predicts a word expressing richer semantics than the label, it will be penalized and optimized to prefer more concise expressions, referred to as conciseness optimization. In contrast, predictions that are more concise than the labels lead to richness optimization. Such conflicting optimization directions can eventually result in the model generating overly general descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus encouraging the model to generate longer captions with more details. Extensive experiments on two mainstream image captioning datasets, MSCOCO and Flickr30K, demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works.
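The abstract does not spell out how the objective is made "semipermeable", so the snippet below is only a generic illustration of token-weighted maximum likelihood (an assumption of mine, not SMILE's published formulation): positions whose gradients would push the model toward shorter, more generic captions can be down-weighted or zeroed out while the rest of the MLE loss is kept.

```python
import torch
import torch.nn.functional as F

def weighted_mle_loss(logits: torch.Tensor, labels: torch.Tensor,
                      token_weights: torch.Tensor) -> torch.Tensor:
    """Token-weighted negative log-likelihood.

    logits: (batch, seq, vocab); labels, token_weights: (batch, seq).
    `token_weights` is a generic stand-in for a 'semipermeable' objective:
    setting a position's weight to 0 blocks its contribution entirely.
    """
    nll = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    return (nll * token_weights).sum() / token_weights.sum().clamp(min=1e-8)
```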


Movie101: A New Movie Understanding Benchmark

May 20, 2023
Zihao Yue, Qi Zhang, Anwen Hu, Liang Zhang, Ziheng Wang, Qin Jin

To help the visually impaired enjoy movies, automatic movie narrating systems are expected to narrate accurate, coherent, and role-aware plots when no actors are speaking. Existing works benchmark this challenge as a normal video captioning task via simplifications, such as removing role names and evaluating narrations with n-gram-based metrics, which makes it difficult for automatic systems to meet the needs of real application scenarios. To narrow this gap, we construct a large-scale Chinese movie benchmark named Movie101. Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips where no actors are speaking. External knowledge, such as role information and movie genres, is also provided for better movie understanding. In addition, we propose a new metric called Movie Narration Score (MNScore) for movie narration evaluation, which achieves the best correlation with human evaluation. Our benchmark also supports the Temporal Narration Grounding (TNG) task to investigate clip localization given text descriptions. For both tasks, our proposed methods leverage external knowledge well and outperform carefully designed baselines. The dataset and codes are released at https://github.com/yuezih/Movie101.

* Accepted to ACL 2023 

Edit As You Wish: Video Description Editing with Multi-grained Commands

May 15, 2023
Linli Yao, Yuanmeng Zhang, Ziheng Wang, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Qin Jin

Automatically narrating a video with natural language can assist people in grasping and managing the massive number of videos on the Internet. Video uploaders, for example, may have varied preferences for writing the desired video description to attract more potential followers, e.g., catching customers' attention for product videos. The Controllable Video Captioning task has therefore been proposed to generate a description conditioned on both the user demand and the video content. However, existing works suffer from two shortcomings: 1) the control signal is fixed and can only express single-grained control; 2) the video description cannot be further edited to meet dynamic user demands. In this paper, we propose a novel Video Description Editing (VDEdit) task to automatically revise an existing video description guided by flexible user requests. Inspired by human writing-revision habits, we design the user command as an {operation, position, attribute} triplet to cover multi-grained user requirements, which can express coarse-grained control (e.g., expand the description) as well as fine-grained control (e.g., add specified details at a specified position) in a unified format. To facilitate the VDEdit task, we first automatically construct a large-scale benchmark dataset, namely VATEX-EDIT, in the open domain describing diverse human activities. Considering real-life application scenarios, we further manually collect an e-commerce benchmark dataset called EMMAD-EDIT. We propose a unified framework that converts the {operation, position, attribute} triplet into a textual control sequence to handle multi-grained editing commands. For VDEdit evaluation, we adopt comprehensive metrics to measure three aspects of model performance: caption quality, caption-command consistency, and caption-video alignment.
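A small sketch of converting the command triplet into a textual control sequence, as the abstract describes; the tag format and example values are made up for illustration, since the paper only states that the triplet is serialized into text.

```python
def command_to_control_sequence(operation: str, position: str, attribute: str) -> str:
    """Serialize an {operation, position, attribute} command into a text prefix.

    The special tags below are hypothetical; the framework only requires that
    the triplet be expressed as a textual control sequence for the editor.
    """
    return f"<op> {operation} <pos> {position} <attr> {attribute} <sep>"

# Coarse-grained control: expand the whole description.
print(command_to_control_sequence("expand", "global", "none"))
# Fine-grained control: add a specified detail at a specified position.
print(command_to_control_sequence("add", "after sentence 2", "red dress"))
```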


InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation

May 10, 2023
Anwen Hu, Shizhe Chen, Liang Zhang, Qin Jin

Automatic image captioning evaluation is critical for benchmarking and promoting advances in image captioning research. Existing metrics only provide a single score to measure caption quality, which is less explainable and informative. In contrast, we humans can easily identify the problems of a caption in detail, e.g., which words are inaccurate and which salient objects are not described, and then rate the caption quality. To support such informative feedback, we propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC). Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at a fine-grained level, and also provide a text precision score, a vision recall score, and an overall quality score at a coarse-grained level. The coarse-grained score of InfoMetIC achieves significantly better correlation with human judgements than existing metrics on multiple benchmarks. We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation. Our code and datasets are publicly available at https://github.com/HAWLYQ/InfoMetIC.
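To make the described outputs concrete, here is a small container mirroring the fine- and coarse-grained feedback listed above; the field names and the overall-score combination (a simple harmonic mean) are assumptions for illustration, not the metric's actual definition.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CaptionEvaluationResult:
    """Hypothetical container for the kinds of feedback InfoMetIC reports."""
    incorrect_words: List[str] = field(default_factory=list)      # fine-grained text errors
    unmentioned_regions: List[int] = field(default_factory=list)  # indices of missed regions
    text_precision: float = 0.0   # fraction of caption tokens judged correct
    vision_recall: float = 0.0    # fraction of salient regions that are described

    def overall(self) -> float:
        # Illustrative combination only; the paper defines its own overall score.
        p, r = self.text_precision, self.vision_recall
        return 2 * p * r / (p + r) if (p + r) > 0 else 0.0
```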

* Accepted by ACL 2023 main conference 

Knowledge Enhanced Model for Live Video Comment Generation

Apr 28, 2023
Jieting Chen, Junkai Ding, Wenping Chen, Qin Jin

Live video commenting is popular on video media platforms, as it can create a chatting atmosphere and provide supplementary information for users while they watch videos. Automatically generating live video comments can improve the user experience and enable human-like comment generation for chatbots. Existing works mostly focus on short-video datasets while ignoring other important video types, such as long videos like movies. In this work, we collect a new Movie Live Comments (MovieLC) dataset to support research on live video comment generation for long videos. We also propose a knowledge-enhanced generation model inspired by the divergent and informative nature of live video comments. Our model adopts a pre-training encoder-decoder framework and incorporates external knowledge. Extensive experiments with both objective metrics and human evaluation demonstrate the effectiveness of our proposed model. The MovieLC dataset and our code will be released.
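One common way to incorporate external knowledge into a pre-trained encoder-decoder, sketched here purely as an assumed pattern (the separator tokens, field order, and inputs are illustrative, not the paper's design), is to concatenate retrieved knowledge with the video context on the encoder side.

```python
from typing import List

def build_commenting_input(video_context: str, recent_comments: List[str],
                           knowledge_snippets: List[str]) -> str:
    """Assemble an encoder input that mixes video context with external knowledge.

    The special markers and ordering are hypothetical; the abstract only states
    that external knowledge is incorporated into an encoder-decoder framework.
    """
    knowledge = " ".join(knowledge_snippets)
    comments = " ".join(recent_comments)
    return f"[KNOWLEDGE] {knowledge} [COMMENTS] {comments} [VIDEO] {video_context}"
```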
