Abstract: Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, an approach limited by the data gap and the training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs to LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with a token-level loss, embedding models use a bidirectional mask with a sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism that gradually transitions between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Intuitive and effective, with only about 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (as of May 19, 2025).
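The gradual transition between a causal and a bidirectional mask described above can be illustrated with a minimal sketch. The abstract does not specify the schedule, so the linear ramp, the function name `soft_mask`, and the parameters `step` and `total_steps` are assumptions made for illustration only, not the authors' implementation.

```python
def soft_mask(seq_len, step, total_steps):
    """Return a seq_len x seq_len attention mask with weights in [0, 1].

    Positions at or below the diagonal (past and current tokens) are always
    fully visible. Positions above the diagonal (future tokens) get a weight
    alpha that grows from 0 (causal mask) to 1 (bidirectional mask).
    The linear schedule is a hypothetical choice; total_steps must be > 0.
    """
    alpha = min(max(step / total_steps, 0.0), 1.0)
    return [
        [1.0 if j <= i else alpha for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Start of training: purely causal. End of training: fully bidirectional.
    for step in (0, 5, 10):
        print(step, soft_mask(3, step, total_steps=10))
```

At `step=0` the mask is lower-triangular, as in standard LLM pretraining; at `step=total_steps` every entry is 1, as in a bidirectional embedding model.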
Abstract: In wireless sensor networks (WSNs), coverage and deployment are the two most crucial issues when conducting detection tasks. However, the detection information collected from sensors is often neither fully utilized nor efficiently integrated. Such sensing models and deployment strategies therefore cannot reach the maximum quality of coverage, particularly when the number of sensors in a WSN grows significantly. In this article, we aim to achieve the optimal coverage quality of WSN deployment. We develop a collaborative sensing model that enhances the detection capabilities of WSNs by leveraging the collaborative information derived from the combination rule under the framework of evidence theory. In this model, the performance evaluation of evidential fusion systems is adopted as the criterion for sensor selection. A learnable sensor deployment network (LSDNet), which considers both sensor contribution and detection capability, is proposed for achieving the optimal deployment of WSNs. Moreover, we investigate in depth the algorithm for finding the minimum number of sensors required to achieve full coverage of WSNs. A series of numerical examples, along with an application to forest area monitoring, demonstrates the effectiveness and robustness of the proposed algorithms.
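To make the idea of fusing sensor evidence concrete, the following minimal sketch assumes that the combination rule referred to above is Dempster's rule of combination, the standard rule in evidence theory. The two-element frame (`"target"`, `"empty"`), the mass values, and the function name `dempster_combine` are hypothetical and are not taken from the article.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset (a subset of the frame of
    discernment) to its mass; the masses of one function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            intersection = b & c
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_c
            else:
                # Mass assigned to contradictory hypotheses.
                conflict += mass_b * mass_c
    if conflict >= 1.0:
        raise ValueError("Total conflict: the sources cannot be combined.")
    # Renormalize by the non-conflicting mass.
    return {subset: mass / (1.0 - conflict) for subset, mass in combined.items()}


if __name__ == "__main__":
    T = frozenset({"target"})
    FRAME = frozenset({"target", "empty"})  # total ignorance
    sensor_1 = {T: 0.6, FRAME: 0.4}  # illustrative masses only
    sensor_2 = {T: 0.5, FRAME: 0.5}
    print(dempster_combine(sensor_1, sensor_2))
```

In this illustrative case neither sensor alone assigns more than 0.6 to the target hypothesis, while the fused mass on it is higher, which is the intuition behind collaborative sensing.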