Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lei Li

HyperAI Team, Xiaomi Corporation

Open 3D World in Autonomous Driving

Aug 20, 2024

Xinlong Cheng, Lei Li

Abstract:The capability for open vocabulary perception represents a significant advancement in autonomous driving systems, facilitating the comprehension and interpretation of a wide array of textual inputs in real-time. Despite extensive research in open vocabulary tasks within 2D computer vision, the application of such methodologies to 3D environments, particularly within large-scale outdoor contexts, remains relatively underdeveloped. This paper presents a novel approach that integrates 3D point cloud data, acquired from LIDAR sensors, with textual information. The primary focus is on the utilization of textual data to directly localize and identify objects within the autonomous driving context. We introduce an efficient framework for the fusion of bird's-eye view (BEV) region features with textual features, thereby enabling the system to seamlessly adapt to novel textual inputs and enhancing the robustness of open vocabulary detection tasks. The effectiveness of the proposed methodology is rigorously evaluated through extensive experimentation on the newly introduced NuScenes-T dataset, with additional validation of its zero-shot performance on the Lyft Level 5 dataset. This research makes a substantive contribution to the advancement of autonomous driving technologies by leveraging multimodal data to enhance open vocabulary perception in 3D environments, thereby pushing the boundaries of what is achievable in autonomous navigation and perception.

Via

Access Paper or Ask Questions

FASST: Fast LLM-based Simultaneous Speech Translation

Aug 18, 2024

Siqi Ouyang, Xi Xu, Chinmay Dandekar, Lei Li

Figure 1 for FASST: Fast LLM-based Simultaneous Speech Translation

Figure 2 for FASST: Fast LLM-based Simultaneous Speech Translation

Figure 3 for FASST: Fast LLM-based Simultaneous Speech Translation

Figure 4 for FASST: Fast LLM-based Simultaneous Speech Translation

Abstract:Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly. Existing methods either have high latency due to recomputation of input representations, or fall behind of offline ST in translation quality. In this paper, we propose FASST, a fast large language model based method for streaming speech translation. We propose blockwise-causal speech encoding and consistency mask, so that streaming speech input can be encoded incrementally without recomputation. Furthermore, we develop a two-stage training strategy to optimize FASST for simultaneous inference. We evaluate FASST and multiple strong prior models on MuST-C dataset. Experiment results show that FASST achieves the best quality-latency trade-off. It outperforms the previous best model by an average of 1.5 BLEU under the same latency for English to Spanish translation.

Via

Access Paper or Ask Questions

CMU's IWSLT 2024 Simultaneous Speech Translation System

Aug 14, 2024

Xi Xu, Siqi Ouyang, Brian Yan, Patrick Fernandes, William Chen, Lei Li, Graham Neubig, Shinji Watanabe

Figure 1 for CMU's IWSLT 2024 Simultaneous Speech Translation System

Figure 2 for CMU's IWSLT 2024 Simultaneous Speech Translation System

Figure 3 for CMU's IWSLT 2024 Simultaneous Speech Translation System

Abstract:This paper describes CMU's submission to the IWSLT 2024 Simultaneous Speech Translation (SST) task for translating English speech to German text in a streaming manner. Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder. We employ a two-stage training approach: initially, we align the representations of speech and text, followed by full fine-tuning. Both stages are trained on MuST-c v2 data with cross-entropy loss. We adapt our offline ST model for SST using a simple fixed hold-n policy. Experiments show that our model obtains an offline BLEU score of 31.1 and a BLEU score of 29.5 under 2 seconds latency on the MuST-C-v2 tst-COMMON.

Via

Access Paper or Ask Questions

Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

Aug 13, 2024

Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh Murthy, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu(+6 more)

Abstract:Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems. The most advanced open-source SWE agent can resolve over 27% of real GitHub issues in SWE-Bench Lite. However, these sophisticated agent frameworks exhibit varying strengths, excelling in certain tasks while underperforming in others. To fully harness the diversity of these agents, we propose DEI (Diversity Empowered Intelligence), a framework that leverages their unique expertise. DEI functions as a meta-module atop existing SWE agent frameworks, managing agent collectives for enhanced problem-solving. Experimental results show that a DEI-guided committee of agents is able to surpass the best individual agent's performance by a large margin. For instance, a group of open-source SWE agents, with a maximum individual resolve rate of 27.3% on SWE-Bench Lite, can achieve a 34.3% resolve rate with DEI, making a 25% improvement and beating most closed-source solutions. Our best-performing group excels with a 55% resolve rate, securing the highest ranking on SWE-Bench Lite. Our findings contribute to the growing body of research on collaborative AI systems and their potential to solve complex software engineering challenges.

Via

Access Paper or Ask Questions

CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications

Aug 07, 2024

Tianfang Zhang, Lei Li, Yang Zhou, Wentao Liu, Chen Qian, Xiangyang Ji

Figure 1 for CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications

Figure 2 for CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications

Figure 3 for CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications

Figure 4 for CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications

Abstract:Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability. However, the pairwise token affinity and complex matrix operations limit its deployment on resource-constrained scenarios and real-time applications, such as mobile devices, although considerable efforts have been made in previous works. In this paper, we introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. Firstly, we argue that the capability of token mixers to obtain global contextual information hinges on multiple information interactions, such as spatial and channel domains. Subsequently, we construct a novel additive similarity function following this paradigm and present an efficient implementation named Convolutional Additive Token Mixer (CATM). This simplification leads to a significant reduction in computational overhead. We evaluate CAS-ViT across a variety of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. Our experiments, conducted on GPUs, ONNX, and iPhones, demonstrate that CAS-ViT achieves a competitive performance when compared to other state-of-the-art backbones, establishing it as a viable option for efficient mobile vision applications. Our code and model are available at: \url{https://github.com/Tianfang-Zhang/CAS-ViT}

Via

Access Paper or Ask Questions

MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection

Aug 01, 2024

Youjia Fu, Zihao Xu, Junsong Fu, Huixia Xue, Shuqiu Tan, Lei Li

Figure 1 for MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection

Figure 2 for MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection

Figure 3 for MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection

Figure 4 for MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection

Abstract:Recent advancements in transformer-based monocular 3D object detection techniques have exhibited exceptional performance in inferring 3D attributes from single 2D images. However, most existing methods rely on resource-intensive transformer architectures, which often lead to significant drops in computational efficiency and performance when handling long sequence data. To address these challenges and advance monocular 3D object detection technology, we propose an innovative network architecture, MonoMM, a Multi-scale \textbf{M}amba-Enhanced network for real-time Monocular 3D object detection. This well-designed architecture primarily includes the following two core modules: Focused Multi-Scale Fusion (FMF) Module, which focuses on effectively preserving and fusing image information from different scales with lower computational resource consumption. By precisely regulating the information flow, the FMF module enhances the model adaptability and robustness to scale variations while maintaining image details. Depth-Aware Feature Enhancement Mamba (DMB) Module: It utilizes the fused features from image characteristics as input and employs a novel adaptive strategy to globally integrate depth information and visual information. This depth fusion strategy not only improves the accuracy of depth estimation but also enhances the model performance under different viewing angles and environmental conditions. Moreover, the modular design of MonoMM provides high flexibility and scalability, facilitating adjustments and optimizations according to specific application needs. Extensive experiments conducted on the KITTI dataset show that our method outperforms previous monocular methods and achieves real-time detection.

Via

Access Paper or Ask Questions

ScalingGaussian: Enhancing 3D Content Creation with Generative Gaussian Splatting

Jul 26, 2024

Shen Chen, Jiale Zhou, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, Jenq-Neng Hwang, Lei Li

Figure 1 for ScalingGaussian: Enhancing 3D Content Creation with Generative Gaussian Splatting

Figure 2 for ScalingGaussian: Enhancing 3D Content Creation with Generative Gaussian Splatting

Figure 3 for ScalingGaussian: Enhancing 3D Content Creation with Generative Gaussian Splatting

Figure 4 for ScalingGaussian: Enhancing 3D Content Creation with Generative Gaussian Splatting

Abstract:The creation of high-quality 3D assets is paramount for applications in digital heritage preservation, entertainment, and robotics. Traditionally, this process necessitates skilled professionals and specialized software for the modeling, texturing, and rendering of 3D objects. However, the rising demand for 3D assets in gaming and virtual reality (VR) has led to the creation of accessible image-to-3D technologies, allowing non-professionals to produce 3D content and decreasing dependence on expert input. Existing methods for 3D content generation struggle to simultaneously achieve detailed textures and strong geometric consistency. We introduce a novel 3D content creation framework, ScalingGaussian, which combines 3D and 2D diffusion models to achieve detailed textures and geometric consistency in generated 3D assets. Initially, a 3D diffusion model generates point clouds, which are then densified through a process of selecting local regions, introducing Gaussian noise, followed by using local density-weighted selection. To refine the 3D gaussians, we utilize a 2D diffusion model with Score Distillation Sampling (SDS) loss, guiding the 3D Gaussians to clone and split. Finally, the 3D Gaussians are converted into meshes, and the surface textures are optimized using Mean Square Error(MSE) and Gradient Profile Prior(GPP) losses. Our method addresses the common issue of sparse point clouds in 3D diffusion, resulting in improved geometric structure and detailed textures. Experiments on image-to-3D tasks demonstrate that our approach efficiently generates high-quality 3D assets.

* 14 pages

Via

Access Paper or Ask Questions

LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

Jul 08, 2024

Yinquan Lu, Wenhao Zhu, Lei Li, Yu Qiao, Fei Yuan

Figure 1 for LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

Figure 2 for LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

Figure 3 for LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

Figure 4 for LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

Abstract:Large Language Models~(LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours in conducting extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance compared to existing open-source LLMs~(by more than 10 spBLEU points) and performs on-par with specialized translation model~(M2M-100-12B) on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code~\footnote{\url{https://github.com/CONE-MT/LLaMAX/.}} and models~\footnote{\url{https://huggingface.co/LLaMAX/.}} are publicly available.

Via

Access Paper or Ask Questions

Parallax-tolerant Image Stitching via Segmentation-guided Multi-homography Warping

Jun 28, 2024

Tianli Liao, Ce Wang, Lei Li, Guangen Liu, Nan Li

Figure 1 for Parallax-tolerant Image Stitching via Segmentation-guided Multi-homography Warping

Figure 2 for Parallax-tolerant Image Stitching via Segmentation-guided Multi-homography Warping

Figure 3 for Parallax-tolerant Image Stitching via Segmentation-guided Multi-homography Warping

Figure 4 for Parallax-tolerant Image Stitching via Segmentation-guided Multi-homography Warping

Abstract:Large parallax between images is an intractable issue in image stitching. Various warping-based methods are proposed to address it, yet the results are unsatisfactory. In this paper, we propose a novel image stitching method using multi-homography warping guided by image segmentation. Specifically, we leverage the Segment Anything Model to segment the target image into numerous contents and partition the feature points into multiple subsets via the energy-based multi-homography fitting algorithm. The multiple subsets of feature points are used to calculate the corresponding multiple homographies. For each segmented content in the overlapping region, we select its best-fitting homography with the lowest photometric error. For each segmented content in the non-overlapping region, we calculate a weighted combination of the linearized homographies. Finally, the target image is warped via the best-fitting homographies to align with the reference image, and the final panorama is generated via linear blending. Comprehensive experimental results on the public datasets demonstrate that our method provides the best alignment accuracy by a large margin, compared with the state-of-the-art methods. The source code is available at https://github.com/tlliao/multi-homo-warp.

* 11 pages, 9 figures

Via

Access Paper or Ask Questions

LumberChunker: Long-Form Narrative Document Segmentation

Jun 25, 2024

André V. Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo L. Oliveira

Abstract:Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content's semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 "needle in a haystack" type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro. Our Code and Data are available at https://github.com/joaodsmarques/LumberChunker

Via

Access Paper or Ask Questions