Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Han Xiao

CoPESD: A Multi-Level Surgical Motion Dataset for Training Large Vision-Language Models to Co-Pilot Endoscopic Submucosal Dissection

Oct 10, 2024

Guankun Wang, Han Xiao, Huxin Gao, Renrui Zhang, Long Bai, Xiaoxiao Yang, Zhen Li, Hongsheng Li, Hongliang Ren

Figure 1 for CoPESD: A Multi-Level Surgical Motion Dataset for Training Large Vision-Language Models to Co-Pilot Endoscopic Submucosal Dissection

Figure 2 for CoPESD: A Multi-Level Surgical Motion Dataset for Training Large Vision-Language Models to Co-Pilot Endoscopic Submucosal Dissection

Figure 3 for CoPESD: A Multi-Level Surgical Motion Dataset for Training Large Vision-Language Models to Co-Pilot Endoscopic Submucosal Dissection

Figure 4 for CoPESD: A Multi-Level Surgical Motion Dataset for Training Large Vision-Language Models to Co-Pilot Endoscopic Submucosal Dissection

Abstract:submucosal dissection (ESD) enables rapid resection of large lesions, minimizing recurrence rates and improving long-term overall survival. Despite these advantages, ESD is technically challenging and carries high risks of complications, necessitating skilled surgeons and precise instruments. Recent advancements in Large Visual-Language Models (LVLMs) offer promising decision support and predictive planning capabilities for robotic systems, which can augment the accuracy of ESD and reduce procedural risks. However, existing datasets for multi-level fine-grained ESD surgical motion understanding are scarce and lack detailed annotations. In this paper, we design a hierarchical decomposition of ESD motion granularity and introduce a multi-level surgical motion dataset (CoPESD) for training LVLMs as the robotic \textbf{Co}-\textbf{P}ilot of \textbf{E}ndoscopic \textbf{S}ubmucosal \textbf{D}issection. CoPESD includes 17,679 images with 32,699 bounding boxes and 88,395 multi-level motions, from over 35 hours of ESD videos for both robot-assisted and conventional surgeries. CoPESD enables granular analysis of ESD motions, focusing on the complex task of submucosal dissection. Extensive experiments on the LVLMs demonstrate the effectiveness of CoPESD in training LVLMs to predict following surgical robotic motions. As the first multimodal ESD motion dataset, CoPESD supports advanced research in ESD instruction-following and surgical automation. The dataset is available at \href{https://github.com/gkw0010/CoPESD}{https://github.com/gkw0010/CoPESD.}}

Via

Access Paper or Ask Questions

jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Sep 17, 2024

Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang(+1 more)

Figure 1 for jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Figure 2 for jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Figure 3 for jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Figure 4 for jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Abstract:We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters, achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks.

* 20 pages, pp11-13 references, pp14-20 appendix and experiment tables

Via

Access Paper or Ask Questions

Frequency Diverse RIS (FD-RIS) Enhanced Wireless Communications via Joint Distance-Angle Beamforming

Sep 13, 2024

Han Xiao, Xiaoyan Hu, Wenjie Wang, Kai-Kit Wong, Kun Yang

Figure 1 for Frequency Diverse RIS (FD-RIS) Enhanced Wireless Communications via Joint Distance-Angle Beamforming

Figure 2 for Frequency Diverse RIS (FD-RIS) Enhanced Wireless Communications via Joint Distance-Angle Beamforming

Figure 3 for Frequency Diverse RIS (FD-RIS) Enhanced Wireless Communications via Joint Distance-Angle Beamforming

Figure 4 for Frequency Diverse RIS (FD-RIS) Enhanced Wireless Communications via Joint Distance-Angle Beamforming

Abstract:The conventional reconfigurable intelligent surface (RIS) assisted far-field communication systems can only implement angle beamforming, which actually limits the capability for reconfiguring the wireless propagation environment. To overcome this limitation, this paper proposes a newly designed frequency diverse RIS (FD-RIS), which can achieve joint distance-angle beamforming with the assistance of the time modulation technology. The signal processing model for FD-RIS-aided wireless communications is first derived. Then, an optimization problem aimed at maximizing the achievable rate is formulated where the frequency-time modulations are jointly optimized to achieve distance-angle beamforming. Furthermore, a novel iterative algorithm based on the cross-entropy optimization (CEO) framework is proposed to effectively handle the non-convex optimization problem. The numerical results validate that the proposed FD-RIS assisted communication scheme can achieve a notable performance improvement compared with the baseline scheme utilizing traditional RIS. In addition, the effectiveness of the proposed CEO algorithm is further verified by comparing with the benchmark using the genetic algorithm (GA).

Via

Access Paper or Ask Questions

Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

Sep 07, 2024

Michael Günther, Isabelle Mohr, Bo Wang, Han Xiao

Abstract:Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be "over-compressed" in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in suboptimal representations. In this paper, we introduce a novel method called "late chunking," which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks without the need for additional training. Moreover, our method is generic enough to be applied to any long-context embedding model.

* 4 pages, early draft

Via

Access Paper or Ask Questions

Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Sep 04, 2024

Rohan Jha, Bo Wang, Michael Günther, Georgios Mastrapas, Saba Sturua, Isabelle Mohr, Andreas Koukounas, Mohammad Kalim Akram, Nan Wang, Han Xiao

Figure 1 for Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Figure 2 for Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Figure 3 for Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Figure 4 for Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Abstract:Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context window and multilingual retrieval. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks,

* 8 pages, references at pp7,8; EMNLP workshop submission

Via

Access Paper or Ask Questions

Interference Cancellation Based Neural Receiver for Superimposed Pilot in Multi-Layer Transmission

Jun 27, 2024

Han Xiao, Wenqiang Tian, Shi Jin, Wendong Liu, Jia Shen, Zhihua Shi, Zhi Zhang

Abstract:In this paper, an interference cancellation based neural receiver for superimposed pilot (SIP) in multi-layer transmission is proposed, where the data and pilot are non-orthogonally superimposed in the same time-frequency resource. Specifically, to deal with the intra-layer and inter-layer interference of SIP under multi-layer transmission, the interference cancellation with superimposed symbol aided channel estimation is leveraged in the neural receiver, accompanied by the pre-design of pilot code-division orthogonal mechanism at transmitter. In addition, to address the complexity issue for inter-vendor collaboration and the generalization problem in practical deployments, respectively, this paper also provides a fixed SIP (F-SIP) design based on constant pilot power ratio and scalable mechanisms for different modulation and coding schemes (MCSs) and transmission layers. Simulation results demonstrate the superiority of the proposed schemes on the performance of block error rate and throughput compared with existing counterparts.

Via

Access Paper or Ask Questions

Efficient Exploration of the Rashomon Set of Rule Set Models

Jun 05, 2024

Martino Ciaperoni, Han Xiao, Aristides Gionis

Figure 1 for Efficient Exploration of the Rashomon Set of Rule Set Models

Figure 2 for Efficient Exploration of the Rashomon Set of Rule Set Models

Figure 3 for Efficient Exploration of the Rashomon Set of Rule Set Models

Figure 4 for Efficient Exploration of the Rashomon Set of Rule Set Models

Abstract:Today, as increasingly complex predictive models are developed, simple rule sets remain a crucial tool to obtain interpretable predictions and drive high-stakes decision making. However, a single rule set provides a partial representation of a learning task. An emerging paradigm in interpretable machine learning aims at exploring the Rashomon set of all models exhibiting near-optimal performance. Existing work on Rashomon-set exploration focuses on exhaustive search of the Rashomon set for particular classes of models, which can be a computationally challenging task. On the other hand, exhaustive enumeration leads to redundancy that often is not necessary, and a representative sample or an estimate of the size of the Rashomon set is sufficient for many applications. In this work, we propose, for the first time, efficient methods to explore the Rashomon set of rule set models with or without exhaustive search. Extensive experiments demonstrate the effectiveness of the proposed methods in a variety of scenarios.

Via

Access Paper or Ask Questions

Jina CLIP: Your CLIP Model Is Also Your Text Retriever

May 30, 2024

Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala(+4 more)

Figure 1 for Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Figure 2 for Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Figure 3 for Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Figure 4 for Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Abstract:Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.

* 4 pages, ICML2024 workshop submission

Via

Access Paper or Ask Questions

No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

Apr 05, 2024

Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Jiaming Liu, Han Xiao, Chaoyou Fu, Hao Dong, Peng Gao

Figure 1 for No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

Figure 2 for No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

Figure 3 for No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

Figure 4 for No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

Abstract:To reduce the reliance on large-scale datasets, recent works in 3D segmentation resort to few-shot learning. Current 3D few-shot segmentation methods first pre-train models on 'seen' classes, and then evaluate their generalization performance on 'unseen' classes. However, the prior pre-training stage not only introduces excessive time overhead but also incurs a significant domain gap on 'unseen' classes. To tackle these issues, we propose a Non-parametric Network for few-shot 3D Segmentation, Seg-NN, and its Parametric variant, Seg-PN. Without training, Seg-NN extracts dense representations by hand-crafted filters and achieves comparable performance to existing parametric models. Due to the elimination of pre-training, Seg-NN can alleviate the domain gap issue and save a substantial amount of time. Based on Seg-NN, Seg-PN only requires training a lightweight QUEry-Support Transferring (QUEST) module, which enhances the interaction between the support set and query set. Experiments suggest that Seg-PN outperforms previous state-of-the-art method by +4.19% and +7.71% mIoU on S3DIS and ScanNet datasets respectively, while reducing training time by -90%, indicating its effectiveness and efficiency.

* CVPR Highlight. Code is available at https://github.com/yangyangyang127/Seg-NN. arXiv admin note: text overlap with arXiv:2308.12961

Via

Access Paper or Ask Questions

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Mar 25, 2024

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, Hongsheng Li

Figure 1 for Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Figure 2 for Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Figure 3 for Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Figure 4 for Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Abstract:This paper presents Visual CoT, a novel pipeline that leverages the reasoning capabilities of multi-modal large language models (MLLMs) by incorporating visual Chain-of-Thought (CoT) reasoning. While MLLMs have shown promise in various visual tasks, they often lack interpretability and struggle with complex visual inputs. To address these challenges, we propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts. We collect and introduce the Visual CoT dataset comprising 373k question-answer pairs, annotated with intermediate bounding boxes highlighting key regions essential for answering the questions. Importantly, the introduced benchmark is capable of evaluating MLLMs in scenarios requiring specific local region identification. Extensive experiments demonstrate the effectiveness of our framework and shed light on better inference strategies. The Visual CoT dataset, benchmark, and pre-trained models are available to foster further research in this direction.

* Code: https://github.com/deepcs233/Visual-CoT

Via

Access Paper or Ask Questions