Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ming Cheng

Visual Zero-Shot E-Commerce Product Attribute Value Extraction

Feb 21, 2025

Jiaying Gong, Ming Cheng, Hongda Shen, Pierre-Yves Vandenbussche, Janet Jenq, Hoda Eldardiry

Abstract:Existing zero-shot product attribute value (aspect) extraction approaches in e-Commerce industry rely on uni-modal or multi-modal models, where the sellers are asked to provide detailed textual inputs (product descriptions) for the products. However, manually providing (typing) the product descriptions is time-consuming and frustrating for the sellers. Thus, we propose a cross-modal zero-shot attribute value generation framework (ViOC-AG) based on CLIP, which only requires product images as the inputs. ViOC-AG follows a text-only training process, where a task-customized text decoder is trained with the frozen CLIP text encoder to alleviate the modality gap and task disconnection. During the zero-shot inference, product aspects are generated by the frozen CLIP image encoder connected with the trained task-customized text decoder. OCR tokens and outputs from a frozen prompt-based LLM correct the decoded outputs for out-of-domain attribute values. Experiments show that ViOC-AG significantly outperforms other fine-tuned vision-language models for zero-shot attribute value extraction.

* 10 pages, 4 figures, accepted for publication in NAACL 2025 Industry Track

Via

Access Paper or Ask Questions

Learning Musical Representations for Music Performance Question Answering

Feb 10, 2025

Xingjian Diao, Chunhui Zhang, Tingxuan Wu, Ming Cheng, Zhongyu Ouyang, Weiyi Wu, Jiang Gui

Figure 1 for Learning Musical Representations for Music Performance Question Answering

Figure 2 for Learning Musical Representations for Music Performance Question Answering

Figure 3 for Learning Musical Representations for Music Performance Question Answering

Figure 4 for Learning Musical Representations for Music Performance Question Answering

Abstract:Music performances are representative scenarios for audio-visual modeling. Unlike common scenarios with sparse audio, music performances continuously involve dense audio signals throughout. While existing multimodal learning methods on the audio-video QA demonstrate impressive capabilities in general scenarios, they are incapable of dealing with fundamental problems within the music performances: they underexplore the interaction between the multimodal signals in performance and fail to consider the distinctive characteristics of instruments and music. Therefore, existing methods tend to answer questions regarding musical performances inaccurately. To bridge the above research gaps, (i) given the intricate multimodal interconnectivity inherent to music data, our primary backbone is designed to incorporate multimodal interactions within the context of music; (ii) to enable the model to learn music characteristics, we annotate and release rhythmic and music sources in the current music datasets; (iii) for time-aware audio-visual modeling, we align the model's music predictions with the temporal dimension. Our experiments show state-of-the-art effects on the Music AVQA datasets. Our code is available at https://github.com/xid32/Amuse.

* Accepted at EMNLP 2024

Via

Access Paper or Ask Questions

Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

Feb 09, 2025

Xingjian Diao, Chunhui Zhang, Weiyi Wu, Zhongyu Ouyang, Peijun Qing, Ming Cheng, Soroush Vosoughi, Jiang Gui

Figure 1 for Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

Figure 2 for Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

Figure 3 for Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

Figure 4 for Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding

Abstract:Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM optimizes the use of the model's limited capacity, enhancing its temporal modeling ability. This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across tasks such as video captioning, question answering, and video-text retrieval. By enhancing temporal modeling, TWM extends the capability of MFMs to handle complex, time-sensitive data effectively. Our code is available at https://github.com/xid32/NAACL_2025_TWM.

* Accepted at NAACL 2025

Via

Access Paper or Ask Questions

Sequence-to-Sequence Neural Diarization with Automatic Speaker Detection and Representation

Nov 21, 2024

Ming Cheng, Yuke Lin, Ming Li

Abstract:This paper proposes a novel Sequence-to-Sequence Neural Diarization (SSND) framework to perform online and offline speaker diarization. It is developed from the sequence-to-sequence architecture of our previous target-speaker voice activity detection system and then evolves into a new diarization paradigm by addressing two critical problems. 1) Speaker Detection: The proposed approach can utilize incompletely given speaker embeddings to discover the unknown speaker and predict the target voice activities in the audio signal. It does not require a prior diarization system for speaker enrollment in advance. 2) Speaker Representation: The proposed approach can adopt the predicted voice activities as reference information to extract speaker embeddings from the audio signal simultaneously. The representation space of speaker embedding is jointly learned within the whole diarization network without using an extra speaker embedding model. During inference, the SSND framework can process long audio recordings blockwise. The detection module utilizes the previously obtained speaker-embedding buffer to predict both enrolled and unknown speakers' voice activities for each coming audio block. Next, the speaker-embedding buffer is updated according to the predictions of the representation module. Assuming that up to one new speaker may appear in a small block shift, our model iteratively predicts the results of each block and extracts target embeddings for the subsequent blocks until the signal ends. Finally, the last speaker-embedding buffer can re-score the entire audio, achieving highly accurate diarization performance as an offline system. (......)

Via

Access Paper or Ask Questions

VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models

Nov 07, 2024

Ming Cheng, Jiaying Gong, Chenhan Yuan, William A. Ingram, Edward Fox, Hoda Eldardiry

Figure 1 for VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models

Figure 2 for VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models

Figure 3 for VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models

Figure 4 for VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models

Abstract:Existing text simplification or paraphrase datasets mainly focus on sentence-level text generation in a general domain. These datasets are typically developed without using domain knowledge. In this paper, we release a novel dataset, VTechAGP, which is the first academic-to-general-audience text paraphrase dataset consisting of 4,938 document-level these and dissertation academic and general-audience abstract pairs from 8 colleges authored over 25 years. We also propose a novel dynamic soft prompt generative language model, DSPT5. For training, we leverage a contrastive-generative loss function to learn the keyword vectors in the dynamic prompt. For inference, we adopt a crowd-sampling decoding strategy at both semantic and structural levels to further select the best output candidate. We evaluate DSPT5 and various state-of-the-art large language models (LLMs) from multiple perspectives. Results demonstrate that the SOTA LLMs does not provide satisfactory outcomes, while the lightweight DSPT5 can achieve competitive results. To the best of our knowledge, we are the first to build a benchmark dataset and solutions for academic-to-general-audience text paraphrase dataset.

* 21 pages, 3 figures

Via

Access Paper or Ask Questions

GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting

Aug 20, 2024

Changkun Liu, Shuai Chen, Yash Bhalgat, Siyan Hu, Zirui Wang, Ming Cheng, Victor Adrian Prisacariu, Tristan Braud

Figure 1 for GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting

Figure 2 for GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting

Figure 3 for GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting

Figure 4 for GSLoc: Efficient Camera Pose Refinement via 3D Gaussian Splatting

Abstract:We leverage 3D Gaussian Splatting (3DGS) as a scene representation and propose a novel test-time camera pose refinement framework, GSLoc. This framework enhances the localization accuracy of state-of-the-art absolute pose regression and scene coordinate regression methods. The 3DGS model renders high-quality synthetic images and depth maps to facilitate the establishment of 2D-3D correspondences. GSLoc obviates the need for training feature extractors or descriptors by operating directly on RGB images, utilizing the 3D vision foundation model, MASt3R, for precise 2D matching. To improve the robustness of our model in challenging outdoor environments, we incorporate an exposure-adaptive module within the 3DGS framework. Consequently, GSLoc enables efficient pose refinement given a single RGB query and a coarse initial pose estimation. Our proposed approach surpasses leading NeRF-based optimization methods in both accuracy and runtime across indoor and outdoor visual localization benchmarks, achieving state-of-the-art accuracy on two indoor datasets.

* The project page is available at https://gsloc.active.vision

Via

Access Paper or Ask Questions

Distortion Recovery: A Two-Stage Method for Guitar Effect Removal

Jul 23, 2024

Ying-Shuo Lee, Yueh-Po Peng, Jui-Te Wu, Ming Cheng, Li Su, Yi-Hsuan Yang

Figure 1 for Distortion Recovery: A Two-Stage Method for Guitar Effect Removal

Figure 2 for Distortion Recovery: A Two-Stage Method for Guitar Effect Removal

Figure 3 for Distortion Recovery: A Two-Stage Method for Guitar Effect Removal

Figure 4 for Distortion Recovery: A Two-Stage Method for Guitar Effect Removal

Abstract:Removing audio effects from electric guitar recordings makes it easier for post-production and sound editing. An audio distortion recovery model not only improves the clarity of the guitar sounds but also opens up new opportunities for creative adjustments in mixing and mastering. While progress have been made in creating such models, previous efforts have largely focused on synthetic distortions that may be too simplistic to accurately capture the complexities seen in real-world recordings. In this paper, we tackle the task by using a dataset of guitar recordings rendered with commercial-grade audio effect VST plugins. Moreover, we introduce a novel two-stage methodology for audio distortion recovery. The idea is to firstly process the audio signal in the Mel-spectrogram domain in the first stage, and then use a neural vocoder to generate the pristine original guitar sound from the processed Mel-spectrogram in the second stage. We report a set of experiments demonstrating the effectiveness of our approach over existing methods, through both subjective and objective evaluation metrics.

* DAFx 2024

Via

Access Paper or Ask Questions

VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark

Jul 16, 2024

Yuke Lin, Ming Cheng, Fulin Zhang, Yingying Gao, Shilei Zhang, Ming Li

Abstract:In this paper, we provide a large audio-visual speaker recognition dataset, VoxBlink2, which includes approximately 10M utterances with videos from 110K+ speakers in the wild. This dataset represents a significant expansion over the VoxBlink dataset, encompassing a broader diversity of speakers and scenarios by the grace of an optimized data collection pipeline. Afterward, we explore the impact of training strategies, data scale, and model complexity on speaker verification and finally establish a new single-model state-of-the-art EER at 0.170% and minDCF at 0.006% on the VoxCeleb1-O test set. Such remarkable results motivate us to explore speaker recognition from a new challenging perspective. We raise the Open-Set Speaker-Identification task, which is designed to either match a probe utterance with a known gallery speaker or categorize it as an unknown query. Associated with this task, we design concrete benchmark and evaluation protocols. The data and model resources can be found in http://voxblink2.github.io.

* Accepted By InterSpeech2024

Via

Access Paper or Ask Questions

MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

May 19, 2024

Jianbo Dai, Jianqiao Lu, Yunlong Feng, Rongju Ruan, Ming Cheng, Haochen Tan, Zhijiang Guo

Figure 1 for MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

Figure 2 for MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

Figure 3 for MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

Figure 4 for MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

Abstract:Recent advancements in large language models (LLMs) have greatly improved code generation, specifically at the function level. For instance, GPT-4 has achieved an 88.4% pass rate on HumanEval. However, this draws into question the adequacy of existing benchmarks in thoroughly assessing function-level code generation capabilities. Our study analyzed two common benchmarks, HumanEval and MBPP, and found that these might not thoroughly evaluate LLMs' code generation capacities due to limitations in quality, difficulty, and granularity. To resolve this, we introduce the Mostly Hard Python Problems (MHPP) dataset, consisting of 140 unique human-curated problems. By focusing on the combination of natural language and code reasoning, MHPP gauges LLMs' abilities to comprehend specifications and restrictions, engage in multi-step reasoning, and apply coding knowledge effectively. Initial evaluations of 22 LLMs using MHPP showed many high-performing models on HumanEval failed to achieve similar success on MHPP. Moreover, MHPP highlighted various previously undiscovered limitations within various LLMs, leading us to believe that it could pave the way for a better understanding of LLMs' capabilities and limitations. Dataset and code are available at https://github.com/SparksofAGI/MHPP.

* 39 pages, dataset and code are available at https://github.com/SparksofAGI/MHPP

Via

Access Paper or Ask Questions

GluMarker: A Novel Predictive Modeling of Glycemic Control Through Digital Biomarkers

Apr 19, 2024

Ziyi Zhou, Ming Cheng, Xingjian Diao, Yanjun Cui, Xiangling Li

Figure 1 for GluMarker: A Novel Predictive Modeling of Glycemic Control Through Digital Biomarkers

Figure 2 for GluMarker: A Novel Predictive Modeling of Glycemic Control Through Digital Biomarkers

Figure 3 for GluMarker: A Novel Predictive Modeling of Glycemic Control Through Digital Biomarkers

Figure 4 for GluMarker: A Novel Predictive Modeling of Glycemic Control Through Digital Biomarkers

Abstract:The escalating prevalence of diabetes globally underscores the need for diabetes management. Recent research highlights the growing focus on digital biomarkers in diabetes management, with innovations in computational frameworks and noninvasive monitoring techniques using personalized glucose metrics. However, they predominantly focus on insulin dosing and specific glucose values, or with limited attention given to overall glycemic control. This leaves a gap in expanding the scope of digital biomarkers for overall glycemic control in diabetes management. To address such a research gap, we propose GluMarker -- an end-to-end framework for modeling digital biomarkers using broader factors sources to predict glycemic control. Through the assessment and refinement of various machine learning baselines, GluMarker achieves state-of-the-art on Anderson's dataset in predicting next-day glycemic control. Moreover, our research identifies key digital biomarkers for the next day's glycemic control prediction. These identified biomarkers are instrumental in illuminating the daily factors that influence glycemic management, offering vital insights for diabetes care.

Via

Access Paper or Ask Questions