IBM T. J. Watson Research Center




Abstract:Ranking items regarding individual user interests is a core technique of multiple downstream tasks such as recommender systems. Learning such a personalized ranker typically relies on the implicit feedback from users' past click-through behaviors. However, collected feedback is biased toward previously highly-ranked items and directly learning from it would result in a "rich-get-richer" phenomenon. In this paper, we propose a simple yet sufficient unbiased learning-to-rank paradigm named InfoRank that aims to simultaneously address both position and popularity biases. We begin by consolidating the impacts of those biases into a single observation factor, thereby providing a unified approach to addressing bias-related issues. Subsequently, we minimize the mutual information between the observation estimation and the relevance estimation conditioned on the input features. By doing so, our relevance estimation can be proved to be free of bias. To implement InfoRank, we first incorporate an attention mechanism to capture latent correlations within user-item features, thereby generating estimations of observation and relevance. We then introduce a regularization term, grounded in conditional mutual information, to promote conditional independence between relevance estimation and observation estimation. Experimental evaluations conducted across three extensive recommendation and search datasets reveal that InfoRank learns more precise and unbiased ranking strategies.
Abstract:With the productive evolution of large language models (LLMs) in the field of natural language processing (NLP), tons of effort has been made to effectively fine-tune common pre-trained LLMs to fulfill a variety of tasks in one or multiple specific domain. In practice, there are two prevailing ways, in which the adaptation can be achieved: (i) Multiple Independent Models: Pre-trained LLMs are fine-tuned a few times independently using the corresponding training samples from each task. (ii) An Integrated Model: Samples from all tasks are employed to fine-tune a pre-trianed LLM unitedly. To address the high computing cost and seesawing issue simultaneously, we propose a unified framework that implements a 1 + N mutli-task fine-tuning pattern in LLMs using a novel Customized Gate Control (CGC) Low-rank Adaptation (LoRA) algorithm. Our work aims to take an advantage of both MTL (i.e., CGC) and PEFT (i.e., LoRA) scheme. For a given cluster of tasks, we design an innovative layer that contains two types of experts as additional trainable parameters to make LoRA be compatible with MTL. To comprehensively evaluate the proposed framework, we conduct well-designed experiments on two public datasets. The experimental results demonstrate that the unified framework with CGC-LoRA modules achieves higher evaluation scores than all benchmarks on both two datasets.




Abstract:One of the fundamental challenges in microscopy (MS) image analysis is instance segmentation (IS), particularly when segmenting cluster regions where multiple objects of varying sizes and shapes may be connected or even overlapped in arbitrary orientations. Existing IS methods usually fail in handling such scenarios, as they rely on coarse instance representations such as keypoints and horizontal bounding boxes (h-bboxes). In this paper, we propose a novel one-stage framework named A2B-IS to address this challenge and enhance the accuracy of IS in MS images. Our approach represents each instance with a pixel-level mask map and a rotated bounding box (r-bbox). Unlike two-stage methods that use box proposals for segmentations, our method decouples mask and box predictions, enabling simultaneous processing to streamline the model pipeline. Additionally, we introduce a Gaussian skeleton map to aid the IS task in two key ways: (1) It guides anchor placement, reducing computational costs while improving the model's capacity to learn RoI-aware features by filtering out noise from background regions. (2) It ensures accurate isolation of densely packed instances by rectifying erroneous box predictions near instance boundaries. To further enhance the performance, we integrate two modules into the framework: (1) An Atrous Attention Block (A2B) designed to extract high-resolution feature maps with fine-grained multiscale information, and (2) A Semi-Supervised Learning (SSL) strategy that leverages both labeled and unlabeled images for model training. Our method has been thoroughly validated on two large-scale MS datasets, demonstrating its superiority over most state-of-the-art approaches.




Abstract:A wide range of real-world applications is characterized by their symbolic nature, necessitating a strong capability for symbolic reasoning. This paper investigates the potential application of Large Language Models (LLMs) as symbolic reasoners. We focus on text-based games, significant benchmarks for agents with natural language capabilities, particularly in symbolic tasks like math, map reading, sorting, and applying common sense in text-based worlds. To facilitate these agents, we propose an LLM agent designed to tackle symbolic challenges and achieve in-game objectives. We begin by initializing the LLM agent and informing it of its role. The agent then receives observations and a set of valid actions from the text-based games, along with a specific symbolic module. With these inputs, the LLM agent chooses an action and interacts with the game environments. Our experimental results demonstrate that our method significantly enhances the capability of LLMs as automated agents for symbolic reasoning, and our LLM agent is effective in text-based games involving symbolic tasks, achieving an average performance of 88% across all tasks.
Abstract:Vein recognition has received increasing attention due to its high security and privacy. Recently, deep neural networks such as Convolutional neural networks (CNN) and Transformers have been introduced for vein recognition and achieved state-of-the-art performance. Despite the recent advances, however, existing solutions for finger-vein feature extraction are still not optimal due to scarce training image samples. To overcome this problem, in this paper, we propose an adversarial masking contrastive learning (AMCL) approach, that generates challenging samples to train a more robust contrastive learning model for the downstream palm-vein recognition task, by alternatively optimizing the encoder in the contrastive learning model and a set of latent variables. First, a huge number of masks are generated to train a robust generative adversarial network (GAN). The trained generator transforms a latent variable from the latent variable space into a mask space. Then, we combine the trained generator with a contrastive learning model to obtain our AMCL, where the generator produces challenging masking images to increase the contrastive loss and the contrastive learning model is trained based on the harder images to learn a more robust feature representation. After training, the trained encoder in the contrastive learning model is combined with a classification layer to build a classifier, which is further fine-tuned on labeled training data for vein recognition. The experimental results on three databases demonstrate that our approach outperforms existing contrastive learning approaches in terms of improving identification accuracy of vein classifiers and achieves state-of-the-art recognition results.
Abstract:Some argue that the essence of humanity, such as creativity and sentiment, can never be mimicked by machines. This paper casts doubt on this belief by studying a vital question: Can AI compose poetry as well as humans? To answer the question, we propose ProFTAP, a novel evaluation framework inspired by Turing test to assess AI's poetry writing capability. We apply it on current large language models (LLMs) and find that recent LLMs do indeed possess the ability to write classical Chinese poems nearly indistinguishable from those of humans. We also reveal that various open-source LLMs can outperform GPT-4 on this task.




Abstract:Over the past decade, visual gaze estimation has garnered growing attention within the research community, thanks to its wide-ranging application scenarios. While existing estimation approaches have achieved remarkable success in enhancing prediction accuracy, they primarily infer gaze directions from single-image signals and discard the huge potentials of the currently dominant text guidance. Notably, visual-language collaboration has been extensively explored across a range of visual tasks, such as image synthesis and manipulation, leveraging the remarkable transferability of large-scale Contrastive Language-Image Pre-training (CLIP) model. Nevertheless, existing gaze estimation approaches ignore the rich semantic cues conveyed by linguistic signals and priors in CLIP feature space, thereby yielding performance setbacks. In pursuit of making up this gap, we delve deeply into the text-eye collaboration protocol and introduce a novel gaze estimation framework in this paper, referred to as GazeCLIP. Specifically, we intricately design a linguistic description generator to produce text signals with coarse directional cues. Additionally, a CLIP-based backbone that excels in characterizing text-eye pairs for gaze estimation is presented. This is followed by the implementation of a fine-grained multi-modal fusion module aimed at modeling the interrelationships between heterogeneous inputs. Extensive experiments on three challenging datasets demonstrate the superiority of the proposed GazeCLIP which surpasses the previous approaches and achieves the state-of-the-art estimation accuracy.




Abstract:An artificial lateral line (ALL) is a bioinspired flow sensing system of an underwater robot that consists of distributed flow sensors. The ALL has achieved great success in sensing the motion states of bioinspired underwater robots, e.g., robotic fish, that are driven by body undulation and/or tail flapping. However, the ALL has not been systematically tested and studied in the sensing of underwater robots driven by rotating propellers due to the highly dynamic and complex flow field therein. This paper makes a bold hypothesis that the distributed flow measurements sampled from the propeller wake flow, although infeasible to represent the entire flow dynamics, provides sufficient information for estimating the lateral motion states of the leader underwater robot. An experimental testbed is constructed to investigate the feasibility of such a state estimator which comprises a cylindrical ALL sensory system, a rotating leader propeller, and a water tank with a planar sliding guide. Specifically, a hybrid network that consists of a one-dimensional convolution network (1DCNN) and a bidirectional long short-term memory network (BiLSTM) is designed to extract the spatiotemporal features of the time series of distributed pressure measurements. A multi-output deep learning network is adopted to estimate the lateral motion states of the leader propeller. In addition, the state estimator is optimized using the whale optimization algorithm (WOA) considering the comprehensive estimation performance. Extensive experiments are conducted the results of which validate the proposed data-driven algorithm in estimating the motion states of the leader underwater robot by propeller wake sensing.
Abstract:Prompt treatment for melanoma is crucial. To assist physicians in identifying lesion areas precisely in a quick manner, we propose a novel skin lesion segmentation technique namely SLP-Net, an ultra-lightweight segmentation network based on the spiking neural P(SNP) systems type mechanism. Most existing convolutional neural networks achieve high segmentation accuracy while neglecting the high hardware cost. SLP-Net, on the contrary, has a very small number of parameters and a high computation speed. We design a lightweight multi-scale feature extractor without the usual encoder-decoder structure. Rather than a decoder, a feature adaptation module is designed to replace it and implement multi-scale information decoding. Experiments at the ISIC2018 challenge demonstrate that the proposed model has the highest Acc and DSC among the state-of-the-art methods, while experiments on the PH2 dataset also demonstrate a favorable generalization ability. Finally, we compare the computational complexity as well as the computational speed of the models in experiments, where SLP-Net has the highest overall superiority




Abstract:Breast cancer(BC) is a prevalent type of malignant tumor in women. Early diagnosis and treatment are vital for enhancing the patients' survival rate. Downsampling in deep networks may lead to loss of information, so for compensating the detail and edge information and allowing convolutional neural networks to pay more attention to seek the lesion region, we propose a multi-stages attention architecture based on NSNP neurons with autapses. First, unlike the single-scale attention acquisition methods of existing methods, we set up spatial attention acquisition at each feature map scale of the convolutional network to obtain an fusion global information on attention guidance. Then we introduce a new type of NSNP variants called NSNP neurons with autapses. Specifically, NSNP systems are modularized as feature encoders, recoding the features extracted from convolutional neural network as well as the fusion of attention information and preserve the key characteristic elements in feature maps. This ensures the retention of valuable data while gradually transforming high-dimensional complicated info into low-dimensional ones. The proposed method is evaluated on the public dataset BreakHis at various magnifications and classification tasks. It achieves a classification accuracy of 96.32% at all magnification cases, outperforming state-of-the-art methods. Ablation studies are also performed, verifying the proposed model's efficacy. The source code is available at XhuBobYoung/Breast-cancer-Classification.