Qin Jin

Renmin University of China

ECR-Chain: Advancing Generative Language Models to Better Emotion-Cause Reasoners through Reasoning Chains

May 17, 2024

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Apr 25, 2024

Think-Program-reCtify: 3D Situated Reasoning with Large Language Models

Apr 23, 2024

Movie101v2: Improved Movie Narration Benchmark

Apr 20, 2024

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Mar 19, 2024

POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World

Mar 09, 2024

SPAFormer: Sequential 3D Part Assembly with Transformers

Mar 09, 2024

Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective

Feb 22, 2024

Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2

Jan 31, 2024

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Oct 08, 2023