Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chen Huang

CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation

Jul 01, 2024

Yuxuan Wang, Yijun Liu, Fei Yu, Chen Huang, Kexin Li, Zhiguo Wan, Wanxiang Che

Figure 1 for CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation

Figure 2 for CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation

Figure 3 for CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation

Figure 4 for CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation

Abstract:Despite the rapid development of Chinese vision-language models (VLMs), most existing Chinese vision-language (VL) datasets are constructed on Western-centric images from existing English VL datasets. The cultural bias in the images makes these datasets unsuitable for evaluating VLMs in Chinese culture. To remedy this issue, we present a new Chinese Vision- Language Understanding Evaluation (CVLUE) benchmark dataset, where the selection of object categories and images is entirely driven by Chinese native speakers, ensuring that the source images are representative of Chinese culture. The benchmark contains four distinct VL tasks ranging from image-text retrieval to visual question answering, visual grounding and visual dialogue. We present a detailed statistical analysis of CVLUE and provide a baseline performance analysis with several open-source multilingual VLMs on CVLUE and its English counterparts to reveal their performance gap between English and Chinese. Our in-depth category-level analysis reveals a lack of Chinese cultural knowledge in existing VLMs. We also find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.

Via

Access Paper or Ask Questions

Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets

Jun 12, 2024

Duanyu Feng, Bowen Qin, Chen Huang, Youcheng Huang, Zheng Zhang, Wenqiang Lei

Figure 1 for Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets

Figure 2 for Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets

Figure 3 for Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets

Figure 4 for Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets

Abstract:The success of the reward model in distinguishing between responses with subtle safety differences depends critically on the high-quality preference dataset, which should capture the fine-grained nuances of harmful and harmless responses. This motivates the need to develop a dataset involving preference margins, which accurately quantify how harmless one response is compared to another. In this paper, we take the first step to propose an effective and cost-efficient framework to promote the margin-enhanced preference dataset development. Our framework, Legend, Leverages representation engineering to annotate preference datasets. It constructs the specific direction within the LLM's embedding space that represents safety. By leveraging this safety direction, Legend can then leverage the semantic distances of paired responses along this direction to annotate margins automatically. We experimentally demonstrate our effectiveness in both reward modeling and harmless alignment for LLMs. Legend also stands out for its efficiency, requiring only the inference time rather than additional training. This efficiency allows for easier implementation and scalability, making Legend particularly valuable for practical applications in aligning LLMs with safe conversations.

* Our code is available at https://github.com/colfeng/Legend

Via

Access Paper or Ask Questions

Selective Annotation via Data Allocation: These Data Should Be Triaged to Experts for Annotation Rather Than the Model

May 20, 2024

Chen Huang, Yang Deng, Wenqiang Lei, Jiancheng Lv, Ido Dagan

Figure 1 for Selective Annotation via Data Allocation: These Data Should Be Triaged to Experts for Annotation Rather Than the Model

Figure 2 for Selective Annotation via Data Allocation: These Data Should Be Triaged to Experts for Annotation Rather Than the Model

Figure 3 for Selective Annotation via Data Allocation: These Data Should Be Triaged to Experts for Annotation Rather Than the Model

Figure 4 for Selective Annotation via Data Allocation: These Data Should Be Triaged to Experts for Annotation Rather Than the Model

Abstract:To obtain high-quality annotations under limited budget, semi-automatic annotation methods are commonly used, where a portion of the data is annotated by experts and a model is then trained to complete the annotations for the remaining data. However, these methods mainly focus on selecting informative data for expert annotations to improve the model predictive ability (i.e., triage-to-human data), while the rest of the data is indiscriminately assigned to model annotation (i.e., triage-to-model data). This may lead to inefficiencies in budget allocation for annotations, as easy data that the model could accurately annotate may be unnecessarily assigned to the expert, and hard data may be misclassified by the model. As a result, the overall annotation quality may be compromised. To address this issue, we propose a selective annotation framework called SANT. It effectively takes advantage of both the triage-to-human and triage-to-model data through the proposed error-aware triage and bi-weighting mechanisms. As such, informative or hard data is assigned to the expert for annotation, while easy data is handled by the model. Experimental results show that SANT consistently outperforms other baselines, leading to higher-quality annotation through its proper allocation of data to both expert and model workers. We provide pioneering work on data annotation within budget constraints, establishing a landmark for future triage-based annotation studies.

* 18 pages, 4 figures

Via

Access Paper or Ask Questions

ARAIDA: Analogical Reasoning-Augmented Interactive Data Annotation

May 20, 2024

Chen Huang, Yiping Jin, Ilija Ilievski, Wenqiang Lei, Jiancheng Lv

Figure 1 for ARAIDA: Analogical Reasoning-Augmented Interactive Data Annotation

Figure 2 for ARAIDA: Analogical Reasoning-Augmented Interactive Data Annotation

Figure 3 for ARAIDA: Analogical Reasoning-Augmented Interactive Data Annotation

Figure 4 for ARAIDA: Analogical Reasoning-Augmented Interactive Data Annotation

Abstract:Human annotation is a time-consuming task that requires a significant amount of effort. To address this issue, interactive data annotation utilizes an annotation model to provide suggestions for humans to approve or correct. However, annotation models trained with limited labeled data are prone to generating incorrect suggestions, leading to extra human correction effort. To tackle this challenge, we propose Araida, an analogical reasoning-based approach that enhances automatic annotation accuracy in the interactive data annotation setting and reduces the need for human corrections. Araida involves an error-aware integration strategy that dynamically coordinates an annotation model and a k-nearest neighbors (KNN) model, giving more importance to KNN's predictions when predictions from the annotation model are deemed inaccurate. Empirical studies demonstrate that Araida is adaptable to different annotation tasks and models. On average, it reduces human correction labor by 11.02% compared to vanilla interactive data annotation methods.

* Accepted to ACL 2024

Via

Access Paper or Ask Questions

STYLE: Improving Domain Transferability of Asking Clarification Questions in Large Language Model Powered Conversational Agents

May 20, 2024

Yue Chen, Chen Huang, Yang Deng, Wenqiang Lei, Dingnan Jin, Jia Liu, Tat-Seng Chua

Abstract:Equipping a conversational search engine with strategies regarding when to ask clarification questions is becoming increasingly important across various domains. Attributing to the context understanding capability of LLMs and their access to domain-specific sources of knowledge, LLM-based clarification strategies feature rapid transfer to various domains in a post-hoc manner. However, they still struggle to deliver promising performance on unseen domains, struggling to achieve effective domain transferability. We take the first step to investigate this issue and existing methods tend to produce one-size-fits-all strategies across diverse domains, limiting their search effectiveness. In response, we introduce a novel method, called Style, to achieve effective domain transferability. Our experimental results indicate that Style bears strong domain transferability, resulting in an average search performance improvement of ~10% on four unseen domains.

* Accepted to Findings of ACL 2024

Via

Access Paper or Ask Questions

CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models

May 20, 2024

Tong Zhang, Peixin Qin, Yang Deng, Chen Huang, Wenqiang Lei, Junhong Liu, Dingnan Jin, Hongru Liang, Tat-Seng Chua

Figure 1 for CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models

Figure 2 for CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models

Figure 3 for CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models

Figure 4 for CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models

Abstract:Large language models (LLMs) are increasingly used to meet user information needs, but their effectiveness in dealing with user queries that contain various types of ambiguity remains unknown, ultimately risking user trust and satisfaction. To this end, we introduce CLAMBER, a benchmark for evaluating LLMs using a well-organized taxonomy. Building upon the taxonomy, we construct ~12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs. Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries, even enhanced by chain-of-thought (CoT) and few-shot prompting. These techniques may result in overconfidence in LLMs and yield only marginal enhancements in identifying ambiguity. Furthermore, current LLMs fall short in generating high-quality clarifying questions due to a lack of conflict resolution and inaccurate utilization of inherent knowledge. In this paper, CLAMBER presents a guidance and promotes further research on proactive and trustworthy LLMs. Our dataset is available at https://github.com/zt991211/CLAMBER

* Accepted to ACL 2024

Via

Access Paper or Ask Questions

Co-Matching: Towards Human-Machine Collaborative Legal Case Matching

May 16, 2024

Chen Huang, Xinwei Yang, Yang Deng, Wenqiang Lei, JianCheng Lv, Tat-Seng Chua

Abstract:Recent efforts have aimed to improve AI machines in legal case matching by integrating legal domain knowledge. However, successful legal case matching requires the tacit knowledge of legal practitioners, which is difficult to verbalize and encode into machines. This emphasizes the crucial role of involving legal practitioners in high-stakes legal case matching. To address this, we propose a collaborative matching framework called Co-Matching, which encourages both the machine and the legal practitioner to participate in the matching process, integrating tacit knowledge. Unlike existing methods that rely solely on the machine, Co-Matching allows both the legal practitioner and the machine to determine key sentences and then combine them probabilistically. Co-Matching introduces a method called ProtoEM to estimate human decision uncertainty, facilitating the probabilistic combination. Experimental results demonstrate that Co-Matching consistently outperforms existing legal case matching methods, delivering significant performance improvements over human- and machine-based matching in isolation (on average, +5.51% and +8.71%, respectively). Further analysis shows that Co-Matching also ensures better human-machine collaboration effectiveness. Our study represents a pioneering effort in human-machine collaboration for the matching task, marking a milestone for future collaborative matching studies.

* Draft V1: 23 pages, 7 figures

Via

Access Paper or Ask Questions

Concept -- An Evaluation Protocol on Conversation Recommender Systems with System-centric and User-centric Factors

Apr 06, 2024

Chen Huang, Peixin Qin, Yang Deng, Wenqiang Lei, Jiancheng Lv, Tat-Seng Chua

Figure 1 for Concept -- An Evaluation Protocol on Conversation Recommender Systems with System-centric and User-centric Factors

Figure 2 for Concept -- An Evaluation Protocol on Conversation Recommender Systems with System-centric and User-centric Factors

Figure 3 for Concept -- An Evaluation Protocol on Conversation Recommender Systems with System-centric and User-centric Factors

Figure 4 for Concept -- An Evaluation Protocol on Conversation Recommender Systems with System-centric and User-centric Factors

Abstract:The conversational recommendation system (CRS) has been criticized regarding its user experience in real-world scenarios, despite recent significant progress achieved in academia. Existing evaluation protocols for CRS may prioritize system-centric factors such as effectiveness and fluency in conversation while neglecting user-centric aspects. Thus, we propose a new and inclusive evaluation protocol, Concept, which integrates both system- and user-centric factors. We conceptualise three key characteristics in representing such factors and further divide them into six primary abilities. To implement Concept, we adopt a LLM-based user simulator and evaluator with scoring rubrics that are tailored for each primary ability. Our protocol, Concept, serves a dual purpose. First, it provides an overview of the pros and cons in current CRS models. Second, it pinpoints the problem of low usability in the "omnipotent" ChatGPT and offers a comprehensive reference guide for evaluating CRS, thereby setting the foundation for CRS improvement.

* 27 pages, 18 tables, and 10 figures

Via

Access Paper or Ask Questions

Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective

Apr 06, 2024

Duanyu Feng, Bowen Qin, Chen Huang, Zheng Zhang, Wenqiang Lei

Figure 1 for Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective

Abstract:Direct Preference Optimization (DPO), which derives reward signals directly from pairwise preference data, has shown its effectiveness on aligning Large Language Models (LLMs) with human preferences. Despite its widespread use across various tasks, DPO has been criticized for its sensitivity to the SFT's effectiveness and its hindrance to the learning capacity towards human-preferred responses, leading to less satisfactory performance. To overcome those limitations, the theoretical understanding of DPO are indispensable but still lacking. To this end, we take a step towards theoretically analyzing and understanding the limitations of DPO. Specifically, we provide an analytical framework using the field theory to analyze the optimization process of DPO. By analyzing the gradient vector field of the DPO loss function, we find that the DPO loss function decreases the probability of producing human dispreferred data at a faster rate than it increases the probability of producing preferred data. This provides theoretical insights for understanding the limitations of DPO discovered in the related research experiments, thereby setting the foundation for its improvement.

* Draft version

Via

Access Paper or Ask Questions

Strength Lies in Differences! Towards Effective Non-collaborative Dialogues via Tailored Strategy Planning

Mar 11, 2024

Tong Zhang, Chen Huang, Yang Deng, Hongru Liang, Jia Liu, Zujie Wen, Wenqiang Lei, Tat-Seng Chua

Figure 1 for Strength Lies in Differences! Towards Effective Non-collaborative Dialogues via Tailored Strategy Planning

Figure 2 for Strength Lies in Differences! Towards Effective Non-collaborative Dialogues via Tailored Strategy Planning

Figure 3 for Strength Lies in Differences! Towards Effective Non-collaborative Dialogues via Tailored Strategy Planning

Figure 4 for Strength Lies in Differences! Towards Effective Non-collaborative Dialogues via Tailored Strategy Planning

Abstract:We investigate non-collaborative dialogue agents that must engage in tailored strategic planning for diverse users to secure a favorable agreement. This poses challenges for existing dialogue agents due to two main reasons: their inability to integrate user-specific characteristics into their strategic planning and their training paradigm's failure to produce strategic planners that can generalize to diverse users. To address these challenges, we propose TRIP to enhance the capability in tailored strategic planning, incorporating a user-aware strategic planning module and a population-based training paradigm. Through experiments on benchmark non-collaborative dialogue tasks, we demonstrate the effectiveness of TRIP in catering to diverse users.

* A draft version and appendix will be coming soon!

Via

Access Paper or Ask Questions