Recent advancements in generative Large Language Models(LLMs) have been remarkable, however, the quality of the text generated by these models often reveals persistent issues. Evaluating the quality of text generated by these models, especially in open-ended text, has consistently presented a significant challenge. Addressing this, recent work has explored the possibility of using LLMs as evaluators. While using a single LLM as an evaluation agent shows potential, it is filled with significant uncertainty and instability. To address these issues, we propose the MATEval: A "Multi-Agent Text Evaluation framework" where all agents are played by LLMs like GPT-4. The MATEval framework emulates human collaborative discussion methods, integrating multiple agents' interactions to evaluate open-ended text. Our framework incorporates self-reflection and Chain-of-Thought (CoT) strategies, along with feedback mechanisms, enhancing the depth and breadth of the evaluation process and guiding discussions towards consensus, while the framework generates comprehensive evaluation reports, including error localization, error types and scoring. Experimental results show that our framework outperforms existing open-ended text evaluation methods and achieves the highest correlation with human evaluation, which confirms the effectiveness and advancement of our framework in addressing the uncertainties and instabilities in evaluating LLMs-generated text. Furthermore, our framework significantly improves the efficiency of text evaluation and model iteration in industrial scenarios.
Automatic methods for evaluating machine-generated texts hold significant importance due to the expanding applications of generative systems. Conventional methods tend to grapple with a lack of explainability, issuing a solitary numerical score to signify the assessment outcome. Recent advancements have sought to mitigate this limitation by incorporating large language models (LLMs) to offer more detailed error analyses, yet their applicability remains constrained, particularly in industrial contexts where comprehensive error coverage and swift detection are paramount. To alleviate these challenges, we introduce DEE, a Dual-stage Explainable Evaluation method for estimating the quality of text generation. Built upon Llama 2, DEE follows a dual-stage principle guided by stage-specific instructions to perform efficient identification of errors in generated texts in the initial stage and subsequently delves into providing comprehensive diagnostic reports in the second stage. DEE is fine-tuned on our elaborately assembled dataset AntEval, which encompasses 15K examples from 4 real-world applications of Alipay that employ generative systems. The dataset concerns newly emerged issues like hallucination and toxicity, thereby broadening the scope of DEE's evaluation criteria. Experimental results affirm that DEE's superiority over existing evaluation methods, achieving significant improvements in both human correlation as well as efficiency.
Transformers have achieved remarkable performance in multivariate time series(MTS) forecasting due to their capability to capture long-term dependencies. However, the canonical attention mechanism has two key limitations: (1) its quadratic time complexity limits the sequence length, and (2) it generates future values from the entire historical sequence. To address this, we propose a Dozer Attention mechanism consisting of three sparse components: (1) Local, each query exclusively attends to keys within a localized window of neighboring time steps. (2) Stride, enables each query to attend to keys at predefined intervals. (3) Vary, allows queries to selectively attend to keys from a subset of the historical sequence. Notably, the size of this subset dynamically expands as forecasting horizons extend. Those three components are designed to capture essential attributes of MTS data, including locality, seasonality, and global temporal dependencies. Additionally, we present the Dozerformer Framework, incorporating the Dozer Attention mechanism for the MTS forecasting task. We evaluated the proposed Dozerformer framework with recent state-of-the-art methods on nine benchmark datasets and confirmed its superior performance. The code will be released after the manuscript is accepted.
This paper presents a work-in-progress on a learn-ing system that will provide robotics students with a personalized learning environment. This addresses both the scarcity of skilled robotics instructors, particularly in community colleges and the expensive demand for training equipment. The study of robotics at the college level represents a wide range of interests, experiences, and aims. This project works to provide students the flexibility to adapt their learning to their own goals and prior experience. We are developing a system to enable robotics instruction through a web-based interface that is compatible with less expensive hardware. Therefore, the free distribution of teaching materials will empower educators. This project has the potential to increase the number of robotics courses offered at both two- and four-year schools and universities. The course materials are being designed with small units and a hierarchical dependency tree in mind; students will be able to customize their course of study based on the robotics skills they have already mastered. We present an evaluation of a five module mini-course in robotics. Students indicated that they had a positive experience with the online content. They also scored the experience highly on relatedness, mastery, and autonomy perspectives, demonstrating strong motivation potential for this approach.
Multivariate Time Series (MTS) forecasting involves modeling temporal dependencies within historical records. Transformers have demonstrated remarkable performance in MTS forecasting due to their capability to capture long-term dependencies. However, prior work has been confined to modeling temporal dependencies at either a fixed scale or multiple scales that exponentially increase (most with base 2). This limitation hinders their effectiveness in capturing diverse seasonalities, such as hourly and daily patterns. In this paper, we introduce a dimension invariant embedding technique that captures short-term temporal dependencies and projects MTS data into a higher-dimensional space, while preserving the dimensions of time steps and variables in MTS data. Furthermore, we present a novel Multi-scale Transformer Pyramid Network (MTPNet), specifically designed to effectively capture temporal dependencies at multiple unconstrained scales. The predictions are inferred from multi-scale latent representations obtained from transformers at various scales. Extensive experiments on nine benchmark datasets demonstrate that the proposed MTPNet outperforms recent state-of-the-art methods.
Recommender systems are fundamental information filtering techniques to recommend content or items that meet users' personalities and potential needs. As a crucial solution to address the difficulty of user identification and unavailability of historical information, session-based recommender systems provide recommendation services that only rely on users' behaviors in the current session. However, most existing studies are not well-designed for modeling heterogeneous user behaviors and capturing the relationships between them in practical scenarios. To fill this gap, in this paper, we propose a novel graph-based method, namely Heterogeneous Information Crossing on Graphs (HICG). HICG utilizes multiple types of user behaviors in the sessions to construct heterogeneous graphs, and captures users' current interests with their long-term preferences by effectively crossing the heterogeneous information on the graphs. In addition, we also propose an enhanced version, named HICG-CL, which incorporates contrastive learning (CL) technique to enhance item representation ability. By utilizing the item co-occurrence relationships across different sessions, HICG-CL improves the recommendation performance of HICG. We conduct extensive experiments on three real-world recommendation datasets, and the results verify that (i) HICG achieves the state-of-the-art performance by utilizing multiple types of behaviors on the heterogeneous graph. (ii) HICG-CL further significantly improves the recommendation performance of HICG by the proposed contrastive learning module.
Medication recommendation targets to provide a proper set of medicines according to patients' diagnoses, which is a critical task in clinics. Currently, the recommendation is manually conducted by doctors. However, for complicated cases, like patients with multiple diseases at the same time, it's difficult to propose a considerate recommendation even for experienced doctors. This urges the emergence of automatic medication recommendation which can help treat the diagnosed diseases without causing harmful drug-drug interactions.Due to the clinical value, medication recommendation has attracted growing research interests.Existing works mainly formulate medication recommendation as a multi-label classification task to predict the set of medicines. In this paper, we propose the Conditional Generation Net (COGNet) which introduces a novel copy-or-predict mechanism to generate the set of medicines. Given a patient, the proposed model first retrieves his or her historical diagnoses and medication recommendations and mines their relationship with current diagnoses. Then in predicting each medicine, the proposed model decides whether to copy a medicine from previous recommendations or to predict a new one. This process is quite similar to the decision process of human doctors. We validate the proposed model on the public MIMIC data set, and the experimental results show that the proposed model can outperform state-of-the-art approaches.
As a critical cue for understanding human intention, human gaze provides a key signal for Human-Computer Interaction(HCI) applications. Appearance-based gaze estimation, which directly regresses the gaze vector from eye images, has made great progress recently based on Convolutional Neural Networks(ConvNets) architecture and open-source large-scale gaze datasets. However, encoding model-based knowledge into CNN model to further improve the gaze estimation performance remains a topic that needs to be explored. In this paper, we propose HybridGazeNet(HGN), a unified framework that encodes the geometric eyeball model into the appearance-based CNN architecture explicitly. Composed of a multi-branch network and an uncertainty module, HybridGazeNet is trained using a hyridized strategy. Experiments on multiple challenging gaze datasets shows that HybridGazeNet has better accuracy and generalization ability compared with existing SOTA methods. The code will be released later.
Video moment retrieval aims to search the moment most relevant to a given language query. However, most existing methods in this community often require temporal boundary annotations which are expensive and time-consuming to label. Hence weakly supervised methods have been put forward recently by only using coarse video-level label. Despite effectiveness, these methods usually process moment candidates independently, while ignoring a critical issue that the natural temporal dependencies between candidates in different temporal scales. To cope with this issue, we propose a Multi-scale 2D Representation Learning method for weakly supervised video moment retrieval. Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates. Two dimensions in this map indicate the start and end time points of these candidates. Then, we select top-K candidates from each scale-varied map with a learnable convolutional neural network. With a newly designed Moments Evaluation Module, we obtain the alignment scores of the selected candidates. At last, the similarity between captions and language query is served as supervision for further training the candidates' selector. Experiments on two benchmark datasets Charades-STA and ActivityNet Captions demonstrate that our approach achieves superior performance to state-of-the-art results.