Abstract:Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually use LoRA to fine-tune existing LLMs, which are limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025).
Abstract:This paper proposes a novel approach to person re-identification, a fundamental task in distributed multi-camera surveillance systems. Although a variety of powerful algorithms have been presented in the past few years, most of them usually focus on designing hand-crafted features and learning metrics either individually or sequentially. Different from previous works, we formulate a unified deep ranking framework that jointly tackles both of these key components to maximize their strengths. We start from the principle that the correct match of the probe image should be positioned in the top rank within the whole gallery set. An effective learning-to-rank algorithm is proposed to minimize the cost corresponding to the ranking disorders of the gallery. The ranking model is solved with a deep convolutional neural network (CNN) that builds the relation between input image pairs and their similarity scores through joint representation learning directly from raw image pixels. The proposed framework allows us to get rid of feature engineering and does not rely on any assumption. An extensive comparative evaluation is given, demonstrating that our approach significantly outperforms all state-of-the-art approaches, including both traditional and CNN-based methods on the challenging VIPeR, CUHK-01 and CAVIAR4REID datasets. Additionally, our approach has better ability to generalize across datasets without fine-tuning.