Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zihan Liao

ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World

May 14, 2026

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

Abstract:The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.

* Accepted by ICML 2026. The data has been released earlier in the preprint arXiv:2603.19223

Via

Access Paper or Ask Questions

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

May 06, 2026

Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, Fan Zhou, Hang Yu

Abstract:Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce \textsc{CoREB}, a contamination-limited, multitask \underline{co}de \underline{r}etrieval and r\underline{e}ranking \underline{b}enchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. \textsc{CoREB} is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: \circone code-specialised embeddings dominate code-to-code retrieval (${\sim}2{\times}$ over general encoders), yet no single model wins all three tasks; \circtwo short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; \circthree off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; \circfour our fine-tuned \textsc{CoREB-Reranker} is the first to achieve consistent gains across all three tasks. The data and model are released.

* project site: https://hq-bench.github.io/coreb-page/

Via

Access Paper or Ask Questions

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Mar 19, 2026

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

Abstract:We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.

Via

Access Paper or Ask Questions

C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

Dec 24, 2025

Jin Qin, Zihan Liao, Ziyin Zhang, Hang Yu, Peng Di, Rui Wang

Abstract:We present C2LLM - Contrastive Code Large Language Models, a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embedding from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of embedding dimension, serving as an alternative to MRL. Trained on three million publicly available data, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.

Via

Access Paper or Ask Questions

F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Oct 02, 2025

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

Figure 1 for F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Figure 2 for F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Figure 3 for F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Figure 4 for F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Abstract:We introduce F2LLM - Foundation to Feature Large Language Models, a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future works.

Via

Access Paper or Ask Questions

CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs

Nov 19, 2024

Zhehan Kan, Ce Zhang, Zihan Liao, Yapeng Tian, Wenming Yang, Junyuan Xiao, Xu Li, Dongmei Jiang, Yaowei Wang, Qingmin Liao

Figure 1 for CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs

Figure 2 for CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs

Figure 3 for CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs

Figure 4 for CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs

Abstract:Large Vision-Language Model (LVLM) systems have demonstrated impressive vision-language reasoning capabilities but suffer from pervasive and severe hallucination issues, posing significant risks in critical domains such as healthcare and autonomous systems. Despite previous efforts to mitigate hallucinations, a persistent issue remains: visual defect from vision-language misalignment, creating a bottleneck in visual processing capacity. To address this challenge, we develop Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs (CATCH), based on the Information Bottleneck theory. CATCH introduces Complementary Visual Decoupling (CVD) for visual information separation, Non-Visual Screening (NVS) for hallucination detection, and Adaptive Token-level Contrastive Decoding (ATCD) for hallucination mitigation. CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios. It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training, opening new possibilities for advancing LVLM in various challenging applications.

Via

Access Paper or Ask Questions

E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

Sep 10, 2024

Zihan Liao, Jun Wang, Hang Yu, Lingxiao Wei, Jianguo Li, Wei Zhang

Figure 1 for E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

Figure 2 for E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

Figure 3 for E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

Figure 4 for E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

Abstract:In the realm of Large Language Models (LLMs), the ability to process long contexts is increasingly crucial for tasks such as multi-round dialogues, code generation, and document summarization. This paper addresses the challenges of enhancing the long-context performance, reducing computational complexity, and leveraging pretrained models collectively termed the "impossible triangle." We introduce E2LLM (Encoder Elongated Large Language Models), a novel approach that effectively navigates this paradox. The method involves splitting long contexts into chunks, compressing each into embedding vectors via a pretrained text encoder, and utilizing an adapter to align these representations with a decoder-only LLM. Two training objectives, focusing on reconstruction of the encoder output and long-context instruction fine-tuning, are employed to facilitate the understanding of soft prompts by the LLM. Experimental results demonstrate that E2LLM achieves superior performance in long-context scenarios while balancing efficiency, performance, and compatibility with pretrained models. Our framework thus represents a significant advancement in the field, contributing to effective long-text modeling.

* 12 pages, 4 figures

Via

Access Paper or Ask Questions

Pulse Shaping for Random ISAC Signals: The Ambiguity Function Between Symbols Matters

Jul 22, 2024

Zihan Liao, Fan Liu, Shuangyang Li, Yifeng Xiong, Weijie Yuan, Christos Masouros, Marco Lops

Abstract:Integrated sensing and communications (ISAC) has emerged as a pivotal enabling technology for next-generation wireless networks. Despite the distinct signal design requirements of sensing and communication (S&C) systems, shifting the symbol-wise pulse shaping (SWiPS) framework from communication-only systems to ISAC poses significant challenges in signal design and processing This paper addresses these challenges by examining the ambiguity function (AF) of the SWiPS ISAC signal and introducing a novel pulse shaping design for single-carrier ISAC transmission. We formulate optimization problems to minimize the average integrated sidelobe level (ISL) of the AF, as well as the weighted ISL (WISL) while satisfying inter-symbol interference (ISI), out-of-band emission (OOBE), and power constraints. Our contributions include establishing the relationship between the AFs of both the random data symbols and signaling pulses, analyzing the statistical characteristics of the AF, and developing algorithmic frameworks for pulse shaping optimization using successive convex approximation (SCA) and alternating direction method of multipliers (ADMM) approaches. Numerical results are provided to validate our theoretical analysis, which demonstrate significant performance improvements in the proposed SWiPS design compared to the root-raised cosine (RRC) pulse shaping for conventional communication systems.

Via

Access Paper or Ask Questions

D2LLM: Decomposed and Distilled Large Language Models for Semantic Search

Jun 25, 2024

Zihan Liao, Hang Yu, Jianguo Li, Jun Wang, Wei Zhang

Abstract:The key challenge in semantic search is to create models that are both accurate and efficient in pinpointing relevant sentences for queries. While BERT-style bi-encoders excel in efficiency with pre-computed embeddings, they often miss subtle nuances in search tasks. Conversely, GPT-style LLMs with cross-encoder designs capture these nuances but are computationally intensive, hindering real-time applications. In this paper, we present D2LLMs-Decomposed and Distilled LLMs for semantic search-that combines the best of both worlds. We decompose a cross-encoder into an efficient bi-encoder integrated with Pooling by Multihead Attention and an Interaction Emulation Module, achieving nuanced understanding and pre-computability. Knowledge from the LLM is distilled into this model using contrastive, rank, and feature imitation techniques. Our experiments show that D2LLM surpasses five leading baselines in terms of all metrics across three tasks, particularly improving NLI task performance by at least 6.45%. The source code is available at https://github.com/codefuse-ai/D2LLM.

Via

Access Paper or Ask Questions

Improving the Ranging Performance of Random ISAC Signals Through Pulse Shaping Design

May 07, 2024

Zihan Liao, Fan Liu, Shuangyang Li, Yifeng Xiong, Weijie Yuan, Marco Lops

Abstract:In this paper, we propose a novel pulse shaping design for single-carrier integrated sensing and communication (ISAC) transmission. Due to the communication information embedded in the ISAC signal, the resulting auto-correlation function (ACF) is determined by both the information-conveying random symbol sequence and the signaling pulse, where the former leads to random fluctuations in the sidelobes of the ACF, impairing the range estimation performance. To overcome this challenge, we first analyze the statistical characteristics of the random ACF under the symbol-wise pulse shaping (SWPS) regime. As a step further, we formulate an optimization problem to design ISAC pulse shaping filters, which minimizes the average integrated sidelobe level ratio (ISLR) while meeting the Nyquist criterion, subject to power and bandwidth constraints. We then show that the problem can be recast as a convex quadratic program by expressing it in the frequency domain, which can be readily solved through standard tools. Numerical results demonstrate that the proposed pulse shaping design achieves substantial ranging sidelobe reduction compared to the celebrated root-raised cosine (RRC) pulse shaping, given that the communication throughput is unchanged.

Via

Access Paper or Ask Questions