
Zijun Long


Large Multi-modal Encoders for Recommendation

Nov 03, 2023
Zixuan Yi, Zijun Long, Iadh Ounis, Craig Macdonald, Richard Mccreadie

In recent years, the rapid growth of online multimedia services, such as e-commerce platforms, has necessitated the development of personalised recommendation approaches that can encode diverse content about each item. Indeed, modern multi-modal recommender systems exploit diverse features obtained from raw images and item descriptions to enhance recommendation performance. However, existing multi-modal recommenders primarily depend on features extracted individually from different media through pre-trained modality-specific encoders, and exhibit only shallow alignment between modalities, limiting their ability to capture the underlying relationships between the modalities. In this paper, we investigate the use of large multi-modal encoders within the specific context of recommender systems, as these have previously demonstrated state-of-the-art effectiveness when ranking items across various domains. Specifically, we tailor two state-of-the-art multi-modal encoders (CLIP and VLMo) for recommendation tasks using a range of strategies, including the exploration of pre-trained and fine-tuned encoders, as well as the assessment of end-to-end training of these encoders. We demonstrate that pre-trained large multi-modal encoders generate more aligned and effective user/item representations than existing modality-specific encoders across three multi-modal recommendation datasets. Furthermore, we show that fine-tuning these large multi-modal encoders on recommendation datasets leads to enhanced recommendation performance. Across the different training paradigms, our experiments highlight the essential role of end-to-end training of large multi-modal encoders in multi-modal recommendation systems.
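
As a rough illustration of the pre-trained setting studied here, the sketch below builds item embeddings with an off-the-shelf CLIP checkpoint from Hugging Face transformers and scores them against a user profile formed by averaging previously interacted items. The toy catalogue, blank placeholder images, and simple late-fusion averaging are assumptions for the example only, not the paper's actual recommendation pipeline.

```python
# Minimal sketch (not the paper's pipeline): embed items with a pre-trained
# CLIP encoder and rank them by dot product against an averaged user profile.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_item(image: Image.Image, description: str) -> torch.Tensor:
    """Fuse CLIP image and text features into a single item vector."""
    inputs = processor(text=[description], images=[image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    # Simple late fusion: average the L2-normalised modality embeddings.
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return ((img + txt) / 2).squeeze(0)

# Hypothetical toy catalogue: blank images stand in for real product photos.
catalogue = {name: embed_item(Image.new("RGB", (224, 224)), name)
             for name in ["red running shoes", "wireless headphones", "yoga mat"]}

# User profile = mean embedding of the items the user has interacted with.
user = torch.stack([catalogue["red running shoes"]]).mean(dim=0)
scores = {name: torch.dot(user, vec).item() for name, vec in catalogue.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```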

RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models

Oct 16, 2023
Zijun Long, George Killick, Richard McCreadie, Gerardo Aragon Camarasa

Robotic vision applications often necessitate a wide range of visual perception tasks, such as object detection, segmentation, and identification. While there have been substantial advances in these individual tasks, integrating specialized models into a unified vision pipeline presents significant engineering challenges and costs. Recently, Multimodal Large Language Models (MLLMs) have emerged as novel backbones for various downstream tasks. We argue that leveraging the pre-training capabilities of MLLMs enables the creation of a simplified framework, mitigating the need for task-specific encoders. Specifically, the large-scale pre-trained knowledge in MLLMs allows for easier fine-tuning on downstream robotic vision tasks and yields superior performance. We introduce the RoboLLM framework, equipped with a BEiT-3 backbone, to address all visual perception tasks in the ARMBench challenge, a large-scale robotic manipulation dataset of real-world warehouse scenarios. RoboLLM not only outperforms existing baselines but also substantially reduces the engineering burden associated with model selection and tuning. The source code is publicly available at https://github.com/longkukuhi/armbench.
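
The sketch below illustrates the single-backbone, multiple-head pattern the abstract argues for: one shared feature extractor feeding lightweight heads for detection, segmentation, and identification. RoboLLM itself uses a BEiT-3 backbone (see the linked repository); here a plain torchvision ResNet-18 stands in so the example stays self-contained, and the heads are deliberately simplistic placeholders.

```python
# Illustrative sketch of a shared backbone with per-task heads; not RoboLLM's
# actual architecture, which is built on BEiT-3 (see the linked repository).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class MultiTaskPerception(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        base = resnet18(weights=None)
        # Drop the average pool and classifier to keep spatial feature maps.
        self.backbone = nn.Sequential(*list(base.children())[:-2])
        self.detect_head = nn.Conv2d(512, 5, kernel_size=1)            # box + objectness per cell
        self.segment_head = nn.Conv2d(512, num_classes, kernel_size=1)  # class logits per cell
        self.identify_head = nn.Linear(512, num_classes)                # whole-image identification

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)            # (B, 512, H/32, W/32)
        pooled = feats.mean(dim=(2, 3))     # global average pool for identification
        return {"detection": self.detect_head(feats),
                "segmentation": self.segment_head(feats),
                "identification": self.identify_head(pooled)}

model = MultiTaskPerception()
out = model(torch.randn(2, 3, 224, 224))
print({k: v.shape for k, v in out.items()})
```

The design choice being sketched is that model selection and tuning happen once, at the shared backbone, rather than separately for each perception task.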

MultiWay-Adapter: Adapting large-scale multi-modal models for scalable image-text retrieval

Sep 12, 2023
Zijun Long, George Killick, Richard McCreadie, Gerardo Aragon Camarasa

As Large Multi-Modal Models (LMMs) continue to grow in size, adapting these pre-trained models to specialized tasks has become a computationally and memory-intensive challenge. Traditional fine-tuning methods require isolated, exhaustive retuning for each new task, limiting the models' versatility. Moreover, current efficient adaptation techniques often overlook modality alignment, focusing only on knowledge extraction for new tasks. To tackle these issues, we introduce the Multiway-Adapter, a framework incorporating an 'Alignment Enhancer' to deepen modality alignment, enabling high transferability without tuning pre-trained parameters. Our method adds fewer than 1.25% additional parameters to LMMs, exemplified by the BEiT-3 model in our study. This leads to superior zero-shot image-text retrieval performance compared to fully fine-tuned models, while achieving up to a 57% reduction in fine-tuning time. Our approach offers a resource-efficient and effective adaptation pathway for LMMs, broadening their applicability. The source code is publicly available at: https://github.com/longkukuhi/MultiWay-Adapter.
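
For intuition, the sketch below shows the generic bottleneck-adapter pattern that efficient adaptation methods of this kind build on: the pre-trained block is frozen and only a small residual module is trained, keeping the trainable fraction around one percent. The BottleneckAdapter and AdaptedBlock classes are illustrative placeholders; the actual MultiWay-Adapter and its Alignment Enhancer differ in design and are available in the linked repository.

```python
# Generic adapter sketch (assumed design, not the MultiWay-Adapter itself):
# freeze a pre-trained block and train only a small residual bottleneck.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, reduction: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual adapter update

class AdaptedBlock(nn.Module):
    """Wraps a frozen transformer block; only the adapter is trainable."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # keep pre-trained weights fixed
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))

frozen = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
layer = AdaptedBlock(frozen, dim=768)
out = layer(torch.randn(2, 10, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```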

When hard negative sampling meets supervised contrastive learning

Aug 28, 2023
Zijun Long, George Killick, Richard McCreadie, Gerardo Aragon Camarasa, Zaiqiao Meng

State-of-the-art image models predominantly follow a two-stage strategy: pre-training on large datasets and fine-tuning with cross-entropy loss. Many studies have shown that using cross-entropy can result in sub-optimal generalisation and stability. While the supervised contrastive loss addresses some limitations of cross-entropy loss by focusing on intra-class similarities and inter-class differences, it neglects the importance of hard negative mining. We propose that model performance can be improved by weighting negative samples according to their dissimilarity to positive counterparts. In this paper, we introduce a new supervised contrastive learning objective, SCHaNe, which incorporates hard negative sampling during the fine-tuning phase. Without requiring specialized architectures, additional data, or extra computational resources, experimental results indicate that SCHaNe outperforms the strong baseline BEiT-3 in Top-1 accuracy across various benchmarks, with significant gains of up to 3.32% in few-shot learning settings and 3.41% in full-dataset fine-tuning. Importantly, our proposed objective sets a new state-of-the-art for base models on ImageNet-1k, achieving 86.14% accuracy. Furthermore, we demonstrate that the proposed objective yields better embeddings and explains the improved effectiveness observed in our experiments.
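
A generic version of the idea, not the exact SCHaNe objective, is sketched below: a supervised contrastive loss in which each anchor's negatives are re-weighted by their similarity to that anchor, so harder negatives contribute more to the denominator. The weighting scheme and the beta hyperparameter here are assumptions made for illustration.

```python
# Sketch of a hard-negative-weighted supervised contrastive loss (an assumed
# formulation in the spirit of the abstract, not the published SCHaNe loss).
import torch
import torch.nn.functional as F

def hardneg_supcon(features: torch.Tensor, labels: torch.Tensor,
                   temperature: float = 0.1, beta: float = 1.0) -> torch.Tensor:
    feats = F.normalize(features, dim=-1)
    sim = feats @ feats.t() / temperature
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye  # same-class pairs
    neg = ~pos & ~eye                                          # different-class pairs

    exp_sim = torch.exp(sim)
    # Re-weight negatives by hardness (similarity to the anchor),
    # rescaled so the total negative mass per anchor is preserved.
    w = torch.exp(beta * sim) * neg
    w = w / w.sum(1, keepdim=True).clamp_min(1e-12) * neg.sum(1, keepdim=True)

    denom = (exp_sim * pos).sum(1) + (exp_sim * w).sum(1)
    log_prob = sim - torch.log(denom.clamp_min(1e-12)).unsqueeze(1)
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp_min(1)
    return loss.mean()

feats = torch.randn(8, 128)
labels = torch.randint(0, 3, (8,))
print(float(hardneg_supcon(feats, labels)))
```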

LaCViT: A Label-aware Contrastive Training Framework for Vision Transformers

Mar 31, 2023
Zijun Long, Zaiqiao Meng, Gerardo Aragon Camarasa, Richard McCreadie

Vision Transformers have been highly effective at computer vision tasks due to their ability to model long-range feature dependencies. By using large-scale training data and various self-supervised signals (e.g., masked random patches), vision transformers provide state-of-the-art performance on several benchmark datasets, such as ImageNet-1k and CIFAR-10. However, vision transformers pretrained on general large-scale image corpora produce only an anisotropic representation space, limiting their generalizability and transferability to downstream target tasks. In this paper, we propose a simple and effective Label-aware Contrastive Training framework, LaCViT, which improves the isotropy of the pretrained representation space for vision transformers, thereby enabling more effective transfer learning across a wide range of image classification tasks. In experiments on five standard image classification datasets, we demonstrate that LaCViT-trained models outperform the original pretrained baselines by around 9% absolute Accuracy@1, and consistent improvements are observed when applying LaCViT to each of the three vision transformers we evaluate.
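
To make the anisotropy point concrete, the sketch below computes a simple diagnostic: the mean pairwise cosine similarity of a batch of embeddings, which sits near zero for an isotropic space and approaches one for a highly anisotropic one. The timm ViT and random inputs are stand-ins; LaCViT's own training objective is label-aware contrastive, similar in spirit to the loss sketched under the SCHaNe entry above.

```python
# Sketch of an isotropy diagnostic for ViT embeddings (illustrative only;
# the metric and model choice are assumptions, not taken from the paper).
import timm
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(embeddings: torch.Tensor) -> float:
    """Average off-diagonal cosine similarity; higher means more anisotropic."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t()
    n = sim.size(0)
    off_diag = sim[~torch.eye(n, dtype=torch.bool)]
    return off_diag.mean().item()

backbone = timm.create_model("vit_small_patch16_224", pretrained=False, num_classes=0)
with torch.no_grad():
    feats = backbone(torch.randn(16, 3, 224, 224))  # stand-in for real images
print(f"mean pairwise cosine similarity: {mean_pairwise_cosine(feats):.3f}")
```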
