Wei Zhang

Alibaba Group

Where to Fetch: Extracting Visual Scene Representation from Large Pre-Trained Models for Robotic Goal Navigation

Aug 20, 2024

Yu Li, Dayou Li, Chenkun Zhao, Ruifeng Wang, Ran Song, Wei Zhang

Abstract: To complete a complex task in which a robot navigates to a goal object and fetches it, the robot needs a good understanding of both the instructions and the surrounding environment. Large pre-trained models have shown the capability to interpret tasks defined via language descriptions. However, previous methods that attempt to integrate large pre-trained models with daily tasks fall short in many robotic goal navigation tasks because they understand the environment poorly. In this work, we present a visual scene representation built with large-scale visual language models to form a feature representation of the environment capable of handling natural language queries. Combined with large language models, this method can parse language instructions into action sequences for a robot to follow and accomplish goal navigation by querying the scene representation. Experiments demonstrate that our method enables the robot to follow a wide range of instructions and complete complex goal navigation tasks.

Coarse-to-Fine Detection of Multiple Seams for Robotic Welding

Aug 20, 2024

Pengkun Wei, Shuo Cheng, Dayou Li, Ran Song, Yipeng Zhang, Wei Zhang

Abstract: Efficiently detecting target weld seams while ensuring sub-millimeter accuracy has always been an important challenge in autonomous welding, which has significant application in industrial practice. Previous works mostly focused on recognizing and localizing welding seams one by one, leading to inferior efficiency in modeling the workpiece. This paper proposes a novel framework capable of extracting multiple weld seams using both RGB images and 3D point clouds. The RGB image is used to obtain the region of interest by approximately localizing the weld seams, and the point cloud is used to achieve fine-edge extraction of the weld seams within the region of interest using region growth. Our method is further accelerated by using a pre-trained deep learning model to ensure both efficiency and generalization ability. The performance of the proposed method has been comprehensively tested on various workpieces featuring both linear and curved weld seams and in physical experiment systems. The results showcase considerable potential for real-world industrial applications, emphasizing the method's efficiency and effectiveness. Videos of the real-world experiments can be found at https://youtu.be/pq162HSP2D4.

UNINEXT-Cutie: The 1st Solution for LSVOS Challenge RVOS Track

Aug 19, 2024

Hao Fang, Feiyu Pan, Xiankai Lu, Wei Zhang, Runmin Cong

Abstract: Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video. This year, the LSVOS Challenge RVOS Track replaced the original YouTube-RVOS benchmark with MeViS. MeViS focuses on referring to the target object in a video through its motion descriptions instead of static attributes, posing a greater challenge to the RVOS task. In this work, we integrate the strengths of leading RVOS and VOS models to build a simple and effective pipeline for RVOS. First, we fine-tune the state-of-the-art RVOS model to obtain mask sequences that are correlated with language descriptions. Second, based on reliable, high-quality key frames, we leverage a VOS model to enhance the quality and temporal consistency of the mask results. Finally, we further improve the performance of the RVOS model using semi-supervised learning. Our solution achieved 62.57 J&F on the MeViS test set and ranked 1st in the 6th LSVOS Challenge RVOS Track.

Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track

Aug 19, 2024

Feiyu Pan, Hao Fang, Runmin Cong, Wei Zhang, Xiankai Lu

Abstract: The Video Object Segmentation (VOS) task aims to segment a particular object instance throughout an entire video sequence given only the object mask of the first frame. Recently, Segment Anything Model 2 (SAM 2) was proposed as a foundation model for promptable visual segmentation in images and videos. SAM 2 builds a data engine, which improves the model and data via user interaction, to collect the largest video segmentation dataset to date. SAM 2 is a simple transformer architecture with streaming memory for real-time video processing and, trained on this data, provides strong performance across a wide range of tasks. In this work, we evaluate the zero-shot performance of SAM 2 on the more challenging VOS datasets MOSE and LVOS. Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th in the 6th LSVOS Challenge VOS Track.

* arXiv admin note: substantial text overlap with arXiv:2408.00714

Towards Boosting LLMs-driven Relevance Modeling with Progressive Retrieved Behavior-augmented Prompting

Aug 18, 2024

Zeyuan Chen, Haiyan Wu, Kaixin Wu, Wei Chen, Mingjie Zhong, Jia Xu, Zhongyi Liu, Wei Zhang

Abstract: Relevance modeling is a critical component for enhancing user experience in search engines, with the primary objective of identifying items that align with users' queries. Traditional models only rely on the semantic congruence between queries and items to ascertain relevance. However, this approach represents merely one aspect of the relevance judgement, and is insufficient in isolation. Even powerful Large Language Models (LLMs) still cannot accurately judge the relevance of a query and an item from a semantic perspective. To augment LLMs-driven relevance modeling, this study proposes leveraging user interactions recorded in search logs to yield insights into users' implicit search intentions. The challenge lies in the effective prompting of LLMs to capture dynamic search intentions, which poses several obstacles in real-world relevance scenarios, i.e., the absence of domain-specific knowledge, the inadequacy of an isolated prompt, and the prohibitive costs associated with deploying LLMs. In response, we propose ProRBP, a novel Progressive Retrieved Behavior-augmented Prompting framework for integrating search scenario-oriented knowledge with LLMs effectively. Specifically, we perform the user-driven behavior neighbors retrieval from the daily search logs to obtain domain-specific knowledge in time, retrieving candidates that users consider to meet their expectations. Then, we guide LLMs for relevance modeling by employing advanced prompting techniques that progressively improve the outputs of the LLMs, followed by a progressive aggregation with comprehensive consideration of diverse aspects. For online serving, we have developed an industrial application framework tailored for the deployment of LLMs in relevance modeling. Experiments on real-world industry data and online A/B testing demonstrate our proposal achieves promising performance.

A Systematic Evaluation of Generated Time Series and Their Effects in Self-Supervised Pretraining

Aug 15, 2024

Audrey Der, Chin-Chia Michael Yeh, Xin Dai, Huiyuan Chen, Yan Zheng, Yujie Fan, Zhongfang Zhuang, Vivian Lai, Junpeng Wang, Liang Wang (+2 more)

Abstract: Self-supervised Pretrained Models (PTMs) have demonstrated remarkable performance in computer vision and natural language processing tasks. These successes have prompted researchers to design PTMs for time series data. In our experiments, most self-supervised time series PTMs were surpassed by simple supervised models. We hypothesize this undesired phenomenon may be caused by data scarcity. In response, we test six time series generation methods, use the generated data in pretraining in lieu of the real data, and examine the effects on classification performance. Our results indicate that replacing a real-data pretraining set with a greater volume of only generated samples produces noticeable improvement.

* To appear in CIKM 2024 as a short paper; the version here is the self-contained version that includes the non-mandatory supplementary material available on the paper's companion website

Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions

Aug 14, 2024

Quan Liu, Zhenhong Zhou, Longzhu He, Yi Liu, Wei Zhang, Sen Su

Abstract: Large language models are susceptible to jailbreak attacks, which can result in the generation of harmful content. While prior defenses mitigate these risks by perturbing or inspecting inputs, they ignore competing objectives, the underlying cause of alignment failures. In this paper, we propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. We first define the Competitive Index to quantify alignment failures and utilize feedback from self-evaluation to compute post-alignment logits. Then, AED adaptively combines AED and post-alignment logits with the original logits to obtain harmless and helpful distributions. Consequently, our method enhances safety alignment while maintaining helpfulness. We conduct experiments across five models and four common jailbreaks, with the results validating the effectiveness of our approach. Code is available at https://github.com/GIGABaozi/AED.git.

* 15 pages, 5 figures

Breaking Limits of Line-of-Sight MIMO Capacity in 6G Wireless Communications

Aug 13, 2024

Haiyue Jing, Wenchi Cheng, Wei Zhang

Abstract: Multiple-input-multiple-output (MIMO) has proved successful in fourth generation (4G) long term evolution (LTE) and is one of the key technical enablers for evolved mobile broadband (eMBB) in fifth generation (5G) wireless communications. However, as the number of antennas grows extremely large and the one-hop communication distance gradually shrinks, how to significantly increase the capacity of line-of-sight (LOS) MIMO becomes more and more urgent. In this article, we introduce MIMO wireless communications based on the quasi-fractal uniform circular array (QF-UCA) antenna structure, which can adequately exploit the potential of MIMO in the LOS channel and greatly increase the capacity with low-complexity demodulation schemes. Specifically, three advantages of QF-UCA based LOS MIMO are reviewed. Then, research challenges on transceiver alignment, low-rank channel matrix, extended dimensions of QF-UCA, and maximum number of orthogonal streams, together with the corresponding potential solutions, are discussed. Compared with traditional scattering-dependent MIMO communications, QF-UCA based LOS MIMO wireless communication can achieve highly efficient transmission in the LOS channel.

Achieving Practical OAM Based Wireless Communications With Misaligned Transceiver

Aug 13, 2024

Wenchi Cheng, Haiyue Jing, Wei Zhang, Zan Li, Hailin Zhang

Abstract: Orbital angular momentum (OAM) has attracted much attention for radio vortex wireless communications due to the orthogonality among different OAM modes. To maintain the orthogonality among different OAM modes at the receiver, strict alignment between transmit and receive antennas is required. However, it is not practical to guarantee transceiver alignment in wireless communications. The phase turbulence resulting from misaligned transceivers leads to serious inter-mode interference among different OAM modes and therefore causes signal detection of multiple OAM modes to fail at the receiver. To achieve practical OAM based wireless communications, in this paper we investigate radio vortex wireless communications with misaligned transmit and receive antennas. We propose a joint Beamforming and Pre-detection (BePre) scheme, which uses two unitary matrices to convert the channel matrix into an equivalent circulant matrix, keeping the orthogonality among OAM modes at the receiver. Then, the OAM signals can be detected with the mode-decomposition scheme at the misaligned receiver. Extensive simulations validate that our joint BePre scheme can efficiently detect the signals of multiple OAM modes for the misaligned transceiver and can significantly increase the spectrum efficiency.
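
As a small illustration of why a circulant channel matrix is a convenient target, the sketch below checks numerically that the unitary DFT matrix diagonalizes a circulant matrix, so each DFT index sees an interference-free scalar channel. It does not reproduce the BePre scheme: the channel coefficients in `first_col` are made up, and all helper names are hypothetical.

```python
import cmath

def circulant(first_col):
    """Build a circulant matrix whose first column is first_col."""
    n = len(first_col)
    return [[first_col[(i - j) % n] for j in range(n)] for i in range(n)]

def dft_matrix(n):
    """Unitary DFT matrix with entries exp(-2*pi*j*k*m/n) / sqrt(n)."""
    scale = n ** -0.5
    return [[scale * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)]
            for k in range(n)]

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def hermitian(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]

# Assumed (made-up) channel coefficients; any first column gives a circulant matrix.
first_col = [1.0 + 0.2j, 0.5 - 0.1j, 0.25j, -0.3 + 0.0j]
H = circulant(first_col)
F = dft_matrix(len(first_col))

# F H F^H is diagonal when H is circulant.
D = matmul(matmul(F, H), hermitian(F))
n = len(first_col)
max_off_diagonal = max(abs(D[i][j]) for i in range(n) for j in range(n) if i != j)
print("largest off-diagonal magnitude:", max_off_diagonal)
```

The off-diagonal entries vanish up to floating-point error, and the diagonal entries are the DFT of the first column.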

Quasi-Fractal UCA Based OAM for Highly Efficient Orthogonal Transmission

Aug 10, 2024

Wenchi Cheng, Haiyue Jing, Wei Zhang, Keyi Zhang, Hailin Zhang

Abstract: The development of orbital angular momentum (OAM)-based radio vortex transmission presents a promising opportunity for increasing the capacity of wireless communication in correlated channels due to the inherent orthogonality among different OAM modes. One of the most popular schemes for highly efficient OAM transmission is digital baseband processing combined with a uniform circular array (UCA) based transceiver. However, the periodicity of the complex-exponential feed generally restricts the maximum number of orthogonal signals carried by multiple OAM modes to the number of array elements of the UCA antenna, which poses an open question of how to employ more OAM modes given a fixed number of array elements. Furthermore, signals modulated with high-order OAM modes are difficult for the receiver to capture because they diverge severely when propagating in free space, thus severely limiting the capacity of radio vortex communications. To overcome the above challenges, in this paper we propose quasi-fractal UCA (QF-UCA) antenna based OAM multiplexing transmission, which is based on a partly element-overlapped fractal geometry layout and makes effective use of low-order OAM modes. We perform two-dimensional OAM modulation (TOM) and demodulation (TOD) schemes with the number of orthogonal OAM modes exceeding the number of array elements, which goes beyond the traditional concept of multiple-antenna wireless communications. Simulation results show that our proposed scheme can achieve a larger number of orthogonal multiplexing streams than the maximum available to traditional multiple-antenna systems.
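
The periodicity limit mentioned in the abstract can be checked with a few lines of code. The sketch below assumes the conventional UCA feed in which element n of an N-element array is driven with phase exp(j*2*pi*l*n/N) for OAM mode l; the function names and the choice N = 8 are illustrative only, and the QF-UCA layout itself is not modeled.

```python
import cmath

def oam_feed(mode, n_elements):
    """Normalized UCA feed vector for one OAM mode (assumed conventional feed)."""
    scale = n_elements ** -0.5
    return [scale * cmath.exp(2j * cmath.pi * mode * n / n_elements)
            for n in range(n_elements)]

def inner(a, b):
    """Hermitian inner product <a, b>."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

N = 8  # assumed number of array elements

# Two different modes within one period are orthogonal.
cross = abs(inner(oam_feed(1, N), oam_feed(3, N)))

# Modes that differ by N produce the same feed vector, so they cannot be separated.
alias = abs(inner(oam_feed(1, N), oam_feed(1 + N, N)))

print("overlap of modes 1 and 3:", cross)
print("overlap of modes 1 and 1+N:", alias)
```

Because modes l and l + N are indistinguishable at the feed, a plain UCA with N elements supports at most N orthogonal modes, which is the restriction the abstract sets out to overcome.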
