Evaluating the compatibility between textual descriptions and corresponding images represents a core endeavor within multi-modal research. In recent years, a proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged. Empirical evidence has substantiated that these innovative approaches exhibit a higher correlation with human judgment, marking a significant advancement in the field. However, does a higher correlation with human evaluations alone sufficiently denote the complete of a metric? In response to this question, in this paper, we study if there are any deficiencies in reference-free metrics. Specifically, inspired by the Cobra Effect, we utilize metric scores as rewards to direct the captioning model toward generating descriptions that closely align with the metric's criteria. If a certain metric has flaws, it will be exploited by the model and reflected in the generated sentences. Our findings reveal that descriptions guided by these metrics contain significant flaws, e.g. incoherent statements and excessive repetition. Subsequently, we propose a novel method termed Self-Improving to rectify the identified shortcomings within these metrics. We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance. In addition, we also introduce a challenging evaluation benchmark called Flaws Caption to evaluate reference-free image captioning metrics comprehensively. Our code is available at https://github.com/aaronma2020/robust_captioning_metric
Multimodal large language models (MLLMs) have attracted increasing attention in the past few years, but they may still generate descriptions that include objects not present in the corresponding images, a phenomenon known as object hallucination. To eliminate hallucinations, existing methods manually annotate paired responses with and without hallucinations, and then employ various alignment algorithms to improve the alignment capability between images and text. However, they not only demand considerable computation resources during the finetuning stage but also require expensive human annotation to construct paired data needed by the alignment algorithms. To address these issues, we borrow the idea of unlearning and propose an efficient fine-grained unlearning framework (EFUF), which can eliminate hallucinations without the need for paired data. Extensive experiments show that our method consistently reduces hallucinations while preserving the generation quality with modest computational overhead. Our code and datasets will be publicly available.
Multimodal Aspect-based Sentiment Analysis (MABSA) is a fine-grained Sentiment Analysis task, which has attracted growing research interests recently. Existing work mainly utilizes image information to improve the performance of MABSA task. However, most of the studies overestimate the importance of images since there are many noise images unrelated to the text in the dataset, which will have a negative impact on model learning. Although some work attempts to filter low-quality noise images by setting thresholds, relying on thresholds will inevitably filter out a lot of useful image information. Therefore, in this work, we focus on whether the negative impact of noisy images can be reduced without modifying the data. To achieve this goal, we borrow the idea of Curriculum Learning and propose a Multi-grained Multi-curriculum Denoising Framework (M2DF), which can achieve denoising by adjusting the order of training data. Extensive experimental results show that our framework consistently outperforms state-of-the-art work on three sub-tasks of MABSA.
Multimodal Entity Linking (MEL) is a task that aims to link ambiguous mentions within multimodal contexts to referential entities in a multimodal knowledge base. Recent methods for MEL adopt a common framework: they first interact and fuse the text and image to obtain representations of the mention and entity respectively, and then compute the similarity between them to predict the correct entity. However, these methods still suffer from two limitations: first, as they fuse the features of text and image before matching, they cannot fully exploit the fine-grained alignment relations between the mention and entity. Second, their alignment is static, leading to low performance when dealing with complex and diverse data. To address these issues, we propose a novel framework called Dynamic Relation Interactive Network (DRIN) for MEL tasks. DRIN explicitly models four different types of alignment between a mention and entity and builds a dynamic Graph Convolutional Network (GCN) to dynamically select the corresponding alignment relations for different input samples. Experiments on two datasets show that DRIN outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of our approach.
In-Context Learning (ICL) is a new paradigm for natural language processing (NLP), where a large language model (LLM) observes a small number of demonstrations and a test instance as its input, and directly makes predictions without updating model parameters. Previous studies have revealed that ICL is sensitive to the selection and the ordering of demonstrations. However, there are few studies regarding the impact of the demonstration number on the ICL performance within a limited input length of LLM, because it is commonly believed that the number of demonstrations is positively correlated with model performance. In this paper, we found this conclusion does not always hold true. Through pilot experiments, we discover that increasing the number of demonstrations does not necessarily lead to improved performance. Building upon this insight, we propose a Dynamic Demonstrations Controller (D$^2$Controller), which can improve the ICL performance by adjusting the number of demonstrations dynamically. The experimental results show that D$^2$Controller yields a 5.4% relative improvement on eight different sizes of LLMs across ten datasets. Moreover, we also extend our method to previous ICL models and achieve competitive results.
Safe robot motion generation is critical for practical applications from manufacturing to homes. In this work, we proposed a stochastic optimization-based motion generation method to generate collision-free and time-optimal motion for the articulated robot represented by composite signed distance field (SDF) networks. First, we propose composite SDF networks to learn the SDF for articulated robots. The learned composite SDF networks combined with the kinematics of the robot allow for quick and accurate estimates of the minimum distance between the robot and obstacles in a batch fashion. Then, a stochastic optimization-based trajectory planning algorithm generates a spatial-optimized and collision-free trajectory offline with the learned composite SDF networks. This stochastic trajectory planner is formulated as a Bayesian Inference problem with a time-normalized Gaussian process prior and exponential likelihood function. The Gaussian process prior can enforce initial and goal position constraints in Configuration Space. Besides, it can encode the correlation of waypoints in time series. The likelihood function aims at encoding task-related cost terms, such as collision avoidance, trajectory length penalty, boundary avoidance, etc. The kernel updating strategies combined with model-predictive path integral (MPPI) is proposed to solve the maximum a posteriori inference problems. Lastly, we integrate the learned composite SDF networks into the trajectory planning algorithm and apply it to a Franka Emika Panda robot. The simulation and experiment results validate the effectiveness of the proposed method.
* 8 pages, 9 figures, under review of IEEE Robotics and Automation
Aspect Sentiment Triplet Extraction (ASTE) is widely used in various applications. However, existing ASTE datasets are limited in their ability to represent real-world scenarios, hindering the advancement of research in this area. In this paper, we introduce a new dataset, named DMASTE, which is manually annotated to better fit real-world scenarios by providing more diverse and realistic reviews for the task. The dataset includes various lengths, diverse expressions, more aspect types, and more domains than existing datasets. We conduct extensive experiments on DMASTE in multiple settings to evaluate previous ASTE approaches. Empirical results demonstrate that DMASTE is a more challenging ASTE dataset. Further analyses of in-domain and cross-domain settings provide promising directions for future research. Our code and dataset are available at https://github.com/NJUNLP/DMASTE.
Multi-Label Few-Shot Aspect Category Detection (FS-ACD) is a new sub-task of aspect-based sentiment analysis, which aims to detect aspect categories accurately with limited training instances. Recently, dominant works use the prototypical network to accomplish this task, and employ the attention mechanism to extract keywords of aspect category from the sentences to produce the prototype for each aspect. However, they still suffer from serious noise problems: (1) due to lack of sufficient supervised data, the previous methods easily catch noisy words irrelevant to the current aspect category, which largely affects the quality of the generated prototype; (2) the semantically-close aspect categories usually generate similar prototypes, which are mutually noisy and confuse the classifier seriously. In this paper, we resort to the label information of each aspect to tackle the above problems, along with proposing a novel Label-Driven Denoising Framework (LDF). Extensive experimental results show that our framework achieves better performance than other state-of-the-art methods.
This article proposes a method for learning and robotic replication of dynamic collaborative tasks from offline videos. The objective is to extend the concept of learning from demonstration (LfD) to dynamic scenarios, benefiting from widely available or easily producible offline videos. To achieve this goal, we decode important dynamic information, such as the Configuration Dependent Stiffness (CDS), which reveals the contribution of arm pose to the arm endpoint stiffness, from a three-dimensional human skeleton model. Next, through encoding of the CDS via Gaussian Mixture Model (GMM) and decoding via Gaussian Mixture Regression (GMR), the robot's Cartesian impedance profile is estimated and replicated. We demonstrate the proposed method in a collaborative sawing task with leader-follower structure, considering environmental constraints and dynamic uncertainties. The experimental setup includes two Panda robots, which replicate the leader-follower roles and the impedance profiles extracted from a two-persons sawing video.
Human manipulation skills can be transferred to robots conveniently through learning from demonstrations (LfD) methods. However, most of these works either only encode motion trajectories or suffer from the complexity and incompleteness when estimating stiffness profiles. To solve these problems, we propose a simple and effective stiffness estimation method that estimates a complete end-effector stiffness matrix from the variation of demonstrations. To that end, Gaussian Mixture Regression (GMR) is applied to extract the reference pose trajectory and the variability. Quaternion logarithmic map is integrated to generate complete rotational stiffness. Besides, the Dynamic Movement Primitives (DMPs) model is further developed to encode and schedule both the movement trajectory and stiffness profiles in task space simultaneously. Finally, the effectiveness of our approach is validated on a real-world 7 DoF robot with the variable impedance controller.