Bin Wang

and Other Contributors

First-place Solution for Streetscape Shop Sign Recognition Competition

Jan 06, 2025

Bin Wang, Li Jing

Abstract:Text recognition technology applied to street-view storefront signs is increasingly utilized across various practical domains, including map navigation, smart city planning analysis, and business value assessments in commercial districts. This technology holds significant research and commercial potential. Nevertheless, it faces numerous challenges. Street view images often contain signboards with complex designs and diverse text styles, complicating the text recognition process. A notable advancement in this field was introduced by our team in a recent competition. We developed a novel multistage approach that integrates multimodal feature fusion, extensive self-supervised training, and a Transformer-based large model. Furthermore, innovative techniques such as BoxDQN, which relies on reinforcement learning, and text rectification methods were employed, leading to impressive outcomes. Comprehensive experiments have validated the effectiveness of these methods, showcasing our potential to enhance text recognition capabilities in complex urban environments.

* technical report

Learn A Flexible Exploration Model for Parameterized Action Markov Decision Processes

Jan 06, 2025

Zijian Wang, Bin Wang, Mingwen Shao, Hongbo Dou, Boxiang Tao

Abstract:Hybrid action models are widely considered an effective approach to reinforcement learning (RL) modeling. The current mainstream method is to train agents under Parameterized Action Markov Decision Processes (PAMDPs), which performs well in specific environments. Unfortunately, these models either exhibit drastically low learning efficiency in complex PAMDPs or lose crucial information in the conversion between raw space and latent space. To enhance the learning efficiency and asymptotic performance of the agent, we propose a model-based RL (MBRL) algorithm, FLEXplore. FLEXplore learns a parameterized-action-conditioned dynamics model and employs a modified Model Predictive Path Integral control. Unlike conventional MBRL algorithms, we carefully design the dynamics loss function and reward smoothing process to learn a loose yet flexible model. Additionally, we use the variational lower bound to maximize the mutual information between the state and the hybrid action, enhancing the exploration effectiveness of the agent. We theoretically demonstrate that FLEXplore can reduce the regret of the rollout trajectory through the Wasserstein Metric under given Lipschitz conditions. Our empirical results on several standard benchmarks show that FLEXplore has outstanding learning efficiency and asymptotic performance compared to other baselines.

Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models

Jan 02, 2025

Bin Wang, Xunlong Zou, Shuo Sun, Wenyu Zhang, Yingxu He, Zhuohan Liu, Chengwei Wei, Nancy F. Chen, AiTi Aw

Abstract:Singlish, a Creole language rooted in English, is a key focus in linguistic research within multilingual and multicultural contexts. However, its spoken form remains underexplored, limiting insights into its linguistic structure and applications. To address this gap, we standardize and annotate the largest spoken Singlish corpus, introducing the Multitask National Speech Corpus (MNSC). These datasets support diverse tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), and Paralinguistic Question Answering (PQA). We release standardized splits and a human-verified test set to facilitate further research. Additionally, we propose SingAudioLLM, a multi-task multimodal model leveraging multimodal large language models to handle these tasks concurrently. Experiments reveal our model's adaptability to the Singlish context, achieving state-of-the-art performance and outperforming prior models by 10-30% in comparison with other AudioLLMs and cascaded solutions.

* Open-Source: https://github.com/AudioLLMs/Singlish

DoTA: Weight-Decomposed Tensor Adaptation for Large Language Models

Dec 30, 2024

Xiaolin Hu, Xiang Cheng, Peiyu Liu, Wei Liu, Jian Luan, Bin Wang, Yong Liu

Abstract:Low-rank adaptation (LoRA) reduces the computational and memory demands of fine-tuning large language models (LLMs) by approximating updates with low-rank matrices. However, low-rank approximation in two-dimensional space fails to capture high-dimensional structures within the target matrix. Recently, tensor decomposition methods have been explored for fine-tuning LLMs, leveraging their ability to extract structured information. Yet, these approaches primarily rely on random initialization, and the impact of initialization on tensor adaptation remains underexplored. In this paper, we reveal that random initialization significantly diverges from the validation loss achieved by full fine-tuning. To address this, we propose Weight-Decomposed Tensor Adaptation (DoTA), which leverages the Matrix Product Operator (MPO) decomposition of pre-trained weights for effective initialization in fine-tuning LLMs. Additionally, we introduce QDoTA, a quantized version of DoTA designed for 4-bit quantization. Experiments on commonsense and arithmetic reasoning tasks show that DoTA outperforms random initialization methods with fewer parameters. QDoTA further reduces memory consumption and achieves comparable performance to DoTA on commonsense reasoning tasks. We will release our code to support future research.

* 12 pages, 6 figures
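
The abstract's opening point, that LoRA cuts fine-tuning cost by approximating a weight update with low-rank matrices, can be made concrete with a parameter count. The sketch below assumes the standard LoRA factorization of a d x k update into a d x r matrix times an r x k matrix; the dimensions are illustrative values, not figures from the paper, and the code does not implement DoTA or its MPO decomposition.

```python
def full_update_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight update is learned."""
    return d * k


def lora_update_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r update B @ A,
    where B is d x r and A is r x k."""
    return r * (d + k)


# Illustrative sizes only (assumed, not taken from the paper).
d, k, r = 4096, 4096, 8

full = full_update_params(d, k)
low_rank = lora_update_params(d, k, r)

print(f"full update:   {full} parameters")
print(f"rank-{r} update: {low_rank} parameters")
print(f"reduction:     {full // low_rank}x")
```

For these assumed sizes the low-rank update trains 65,536 parameters instead of 16,777,216, a 256-fold reduction.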

Semantic Hierarchical Prompt Tuning for Parameter-Efficient Fine-Tuning

Dec 24, 2024

Haowei Zhu, Fangyuan Zhang, Rui Qin, Tianxiang Pan, Junhai Yong, Bin Wang

Abstract:As the scale of vision models continues to grow, Visual Prompt Tuning (VPT) has emerged as a parameter-efficient transfer learning technique, noted for its superior performance compared to full fine-tuning. However, indiscriminately applying prompts to every layer without considering their inherent correlations can cause significant disturbances, leading to suboptimal transferability. Additionally, VPT disrupts the original self-attention structure, affecting the aggregation of visual features, and lacks a mechanism for explicitly mining discriminative visual features, which are crucial for classification. To address these issues, we propose a Semantic Hierarchical Prompt (SHIP) fine-tuning strategy. We adaptively construct semantic hierarchies and use semantic-independent and semantic-shared prompts to learn hierarchical representations. We also integrate attribute prompts and a prompt matching loss to enhance feature discrimination and employ decoupled attention for robustness and reduced inference costs. SHIP significantly improves performance, achieving a 4.9% gain in accuracy over VPT with a ViT-B/16 backbone on VTAB-1k tasks. Our code is available at https://github.com/haoweiz23/SHIP.

* Accepted by ICASSP 2025

Path-of-Thoughts: Extracting and Following Paths for Robust Relational Reasoning with Large Language Models

Dec 23, 2024

Ge Zhang, Mohammad Ali Alomrani, Hongjian Gu, Jiaming Zhou, Yaochen Hu, Bin Wang, Qun Liu, Mark Coates, Yingxue Zhang, Jianye Hao

Abstract:Large language models (LLMs) possess vast semantic knowledge but often struggle with complex reasoning tasks, particularly in relational reasoning problems such as kinship or spatial reasoning. In this paper, we present Path-of-Thoughts (PoT), a novel framework designed to tackle relation reasoning by decomposing the task into three key stages: graph extraction, path identification, and reasoning. Unlike previous approaches, PoT efficiently extracts a task-agnostic graph that identifies crucial entities, relations, and attributes within the problem context. Subsequently, PoT identifies relevant reasoning chains within the graph corresponding to the posed question, facilitating inference of potential answers. Experimental evaluations on four benchmark datasets, demanding long reasoning chains, demonstrate that PoT surpasses state-of-the-art baselines by a significant margin (maximum 21.3%) without necessitating fine-tuning or extensive LLM calls. Furthermore, as opposed to prior neuro-symbolic methods, PoT exhibits improved resilience against LLM errors by leveraging the compositional nature of graphs.
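
The path identification stage described in this abstract can be illustrated with a plain graph search. The sketch below is not the authors' implementation: it assumes the graph has already been extracted as (head, relation, tail) triples, uses a breadth-first search as a stand-in for whatever path-finding PoT actually performs, and the function name `find_relation_path` and the kinship facts are hypothetical.

```python
from collections import deque


def find_relation_path(edges, source, target):
    """Return the shortest chain of (head, relation, tail) triples that
    leads from source to target, or None if no directed path exists."""
    # Build an adjacency list from the extracted triples.
    graph = {}
    for head, relation, tail in edges:
        graph.setdefault(head, []).append((relation, tail))

    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


# Hypothetical kinship facts standing in for an extracted graph.
edges = [
    ("Alice", "mother_of", "Bob"),
    ("Bob", "father_of", "Carol"),
    ("Alice", "sister_of", "Dana"),
]

path = find_relation_path(edges, "Alice", "Carol")
print(path)
```

The returned chain (Alice is the mother of Bob, Bob is the father of Carol) is the kind of reasoning path that a later stage would compose into a single relation between the two queried entities.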

MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models

Dec 18, 2024

Yingxu He, Zhuohan Liu, Shuo Sun, Bin Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, Ai Ti Aw

Abstract:We introduce MERaLiON-AudioLLM (Multimodal Empathetic Reasoning and Learning in One Network), the first speech-text model tailored for Singapore's multilingual and multicultural landscape. Developed under the National Large Language Models Funding Initiative, Singapore, MERaLiON-AudioLLM integrates advanced speech and text processing to address the diverse linguistic nuances of local accents and dialects, enhancing accessibility and usability in complex, multilingual environments. Our results demonstrate improvements in both speech recognition and task-specific understanding, positioning MERaLiON-AudioLLM as a pioneering solution for region-specific AI applications. We envision this release setting a precedent for future models designed to address localised linguistic and cultural contexts in a global framework.

CoinMath: Harnessing the Power of Coding Instruction for Math LLMs

Dec 16, 2024

Chengwei Wei, Bin Wang, Jung-jae Kim, Guimei Liu, Nancy F. Chen

Abstract:Large Language Models (LLMs) have shown strong performance in solving mathematical problems, with code-based solutions proving particularly effective. However, the best practice to leverage coding instruction data to enhance mathematical reasoning remains underexplored. This study investigates three key questions: (1) How do different coding styles of mathematical code-based rationales impact LLMs' learning performance? (2) Can general-domain coding instructions improve performance? (3) How does integrating textual rationales with code-based ones during training enhance mathematical reasoning abilities? Our findings reveal that code-based rationales with concise comments, descriptive naming, and hardcoded solutions are beneficial, while improvements from general-domain coding instructions and textual rationales are relatively minor. Based on these insights, we propose CoinMath, a learning strategy designed to enhance mathematical reasoning by diversifying the coding styles of code-based rationales. CoinMath generates a variety of code-based rationales incorporating concise comments, descriptive naming conventions, and hardcoded solutions. Experimental results demonstrate that CoinMath significantly outperforms its baseline model, MAmmoTH, one of the SOTA math LLMs.

GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training

Dec 16, 2024

Renqiu Xia, Mingsheng Li, Hancheng Ye, Wenjie Wu, Hongbin Zhou, Jiakang Yuan, Tianshuo Peng, Xinyu Cai, Xiangchao Yan, Bin Wang (+5 more)

Abstract:Despite their proficiency in general tasks, Multi-modal Large Language Models (MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands understanding diagrams, interpreting symbols, and performing complex reasoning. This limitation arises from their pre-training on natural images and texts, along with the lack of automated verification in the problem-solving process. Besides, current geometric specialists are limited by their task-specific designs, making them less effective for broader geometric problems. To this end, we present GeoX, a multi-modal large model focusing on geometric understanding and reasoning tasks. Given the significant differences between geometric diagram-symbol and natural image-text, we introduce unimodal pre-training to develop a diagram encoder and symbol decoder, enhancing the understanding of geometric images and corpora. Furthermore, we introduce geometry-language alignment, an effective pre-training paradigm that bridges the modality gap between unimodal geometric experts. We propose a Generator-And-Sampler Transformer (GS-Former) to generate discriminative queries and eliminate uninformative representations from unevenly distributed geometric signals. Finally, GeoX benefits from visual instruction tuning, empowering it to take geometric images and questions as input and generate verifiable solutions. Experiments show that GeoX outperforms both generalists and geometric specialists on publicly recognized benchmarks, such as GeoQA, UniGeo, Geometry3K, and PGPS9k.

* Our code is available at https://github.com/UniModal4Reasoning/GeoX

MERaLiON-AudioLLM: Technical Report

Dec 13, 2024

Yingxu He, Zhuohan Liu, Shuo Sun, Bin Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, Ai Ti Aw
