Wei Zhang

Alibaba Group

Learning Instruction-Guided Manipulation Affordance via Large Models for Embodied Robotic Tasks

Aug 20, 2024

Dayou Li, Chenkun Zhao, Shuo Yang, Lin Ma, Yibin Li, Wei Zhang

Abstract: We study the task of language instruction-guided robotic manipulation, in which an embodied robot is supposed to manipulate the target objects based on the language instructions. In previous studies, the predicted manipulation regions of the target object typically do not change with the language instructions, which means that language perception and manipulation prediction are separate. However, in human behavioral patterns, the manipulation regions of the same object change for different language instructions. In this paper, we propose Instruction-Guided Affordance Net (IGANet) for predicting affordance maps of instruction-guided robotic manipulation tasks by utilizing powerful priors from vision and language encoders pre-trained on large-scale datasets. We develop a Vision-Language-Models (VLMs)-based data augmentation pipeline, which can generate a large amount of data automatically for model training. In addition, with the help of Large-Language-Models (LLMs), actions can be effectively executed to finish the tasks defined by instructions. A series of real-world experiments revealed that our method can achieve better performance with generated data. Moreover, our model can generalize better to scenarios with unseen objects and language instructions.

* Accepted to ICARM 2024

MPGNet: Learning Move-Push-Grasping Synergy for Target-Oriented Grasping in Occluded Scenes

Aug 20, 2024

Dayou Li, Chenkun Zhao, Shuo Yang, Ran Song, Xiaolei Li, Wei Zhang

Abstract: This paper focuses on target-oriented grasping in occluded scenes, where the target object is specified by a binary mask and the goal is to grasp the target object with as few robotic manipulations as possible. Most existing methods rely on a push-grasping synergy to complete this task. To deliver a more powerful target-oriented grasping pipeline, we present MPGNet, a three-branch network for learning a synergy between moving, pushing, and grasping actions. We also propose a multi-stage training strategy to train MPGNet, which contains three policy networks corresponding to the three actions. The effectiveness of our method is demonstrated via both simulated and real-world experiments.

* Accepted to IROS 2024

Where to Fetch: Extracting Visual Scene Representation from Large Pre-Trained Models for Robotic Goal Navigation

Aug 20, 2024

Yu Li, Dayou Li, Chenkun Zhao, Ruifeng Wang, Ran Song, Wei Zhang

Abstract: To complete a complex task where a robot navigates to a goal object and fetches it, the robot needs to have a good understanding of the instructions and the surrounding environment. Large pre-trained models have shown capabilities to interpret tasks defined via language descriptions. However, previous methods attempting to integrate large pre-trained models with daily tasks fall short in many robotic goal navigation tasks due to poor understanding of the environment. In this work, we present a visual scene representation built with large-scale visual language models to form a feature representation of the environment capable of handling natural language queries. Combined with large language models, this method can parse language instructions into action sequences for a robot to follow, and accomplish goal navigation by querying the scene representation. Experiments demonstrate that our method enables the robot to follow a wide range of instructions and complete complex goal navigation tasks.

Coarse-to-Fine Detection of Multiple Seams for Robotic Welding

Aug 20, 2024

Pengkun Wei, Shuo Cheng, Dayou Li, Ran Song, Yipeng Zhang, Wei Zhang

Abstract: Efficiently detecting target weld seams while ensuring sub-millimeter accuracy has always been an important challenge in autonomous welding, which has significant application in industrial practice. Previous works mostly focused on recognizing and localizing weld seams one by one, leading to inferior efficiency in modeling the workpiece. This paper proposes a novel framework capable of extracting multiple weld seams using both RGB images and 3D point clouds. The RGB image is used to obtain the region of interest by approximately localizing the weld seams, and the point cloud is used to achieve the fine-edge extraction of the weld seams within the region of interest using region growth. Our method is further accelerated by using a pre-trained deep learning model to ensure both efficiency and generalization ability. The performance of the proposed method has been comprehensively tested on various workpieces featuring both linear and curved weld seams and in physical experiment systems. The results showcase considerable potential for real-world industrial applications, emphasizing the method's efficiency and effectiveness. Videos of the real-world experiments can be found at https://youtu.be/pq162HSP2D4.
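
The abstract names region growth inside a region of interest but gives no algorithmic details. As a rough illustration only, the sketch below shows generic distance-based region growing on a toy 3D point list. It is not the authors' implementation; the function name `region_grow`, the `radius` threshold, and the coordinates are all assumptions made for this example.

```python
from collections import deque
from math import dist


def region_grow(points, seed_index, radius):
    """Return sorted indices of points reachable from the seed.

    A point joins the region if it lies within `radius` of any point
    already in the region (breadth-first growth). Hypothetical helper,
    not taken from the paper.
    """
    visited = {seed_index}
    queue = deque([seed_index])
    while queue:
        current = queue.popleft()
        for j, candidate in enumerate(points):
            if j not in visited and dist(points[current], candidate) <= radius:
                visited.add(j)
                queue.append(j)
    return sorted(visited)


# Toy data (assumed units: millimeters): three points along a line
# standing in for a seam, plus one far-away outlier.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
grown = region_grow(points, seed_index=0, radius=1.5)
print(grown)  # [0, 1, 2]
```

The brute-force neighbor scan keeps the sketch dependency-free; a real point cloud would call for a spatial index such as a k-d tree.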

Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track

Aug 19, 2024

Feiyu Pan, Hao Fang, Runmin Cong, Wei Zhang, Xiankai Lu

Abstract: The Video Object Segmentation (VOS) task aims to segment a particular object instance throughout the entire video sequence given only the object mask of the first frame. Recently, Segment Anything Model 2 (SAM 2) was proposed, which is a foundation model towards solving promptable visual segmentation in images and videos. SAM 2 builds a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. SAM 2 is a simple transformer architecture with streaming memory for real-time video processing, which, trained on this data, provides strong performance across a wide range of tasks. In this work, we evaluate the zero-shot performance of SAM 2 on the more challenging VOS datasets MOSE and LVOS. Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th in the 6th LSVOS Challenge VOS Track.
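
For readers unfamiliar with the J&F score quoted above: in VOS benchmarks, J is the region similarity (intersection over union of predicted and ground-truth masks), F is a boundary F-measure, and J&F is their mean. The sketch below computes J on a toy mask pair and averages it with a made-up F value. The masks, the F value, and the function names are assumptions for illustration, and the boundary F-measure itself is not implemented here.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks.

    Masks are lists of rows containing 0/1. Two empty masks are scored
    as 1.0, a convention chosen for this sketch.
    """
    inter = sum(p and g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p or g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F is the arithmetic mean of region similarity and boundary accuracy."""
    return (j + f) / 2


# Toy 2x3 masks: 2 overlapping pixels, 4 pixels in the union.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]

j = jaccard(pred, gt)
print(j)                 # 0.5
print(j_and_f(j, 0.75))  # 0.625 (0.75 is a hypothetical F value)
```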

* arXiv admin note: substantial text overlap with arXiv:2408.00714

UNINEXT-Cutie: The 1st Solution for LSVOS Challenge RVOS Track

Aug 19, 2024

Hao Fang, Feiyu Pan, Xiankai Lu, Wei Zhang, Runmin Cong

Abstract: Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video. This year, the LSVOS Challenge RVOS Track replaced the original YouTube-RVOS benchmark with MeViS. MeViS focuses on referring to the target object in a video through its motion descriptions instead of static attributes, posing a greater challenge to the RVOS task. In this work, we integrate the strengths of leading RVOS and VOS models to build a simple and effective pipeline for RVOS. First, we finetune the state-of-the-art RVOS model to obtain mask sequences that are correlated with language descriptions. Second, based on reliable, high-quality key frames, we leverage a VOS model to enhance the quality and temporal consistency of the mask results. Finally, we further improve the performance of the RVOS model using semi-supervised learning. Our solution achieved 62.57 J&F on the MeViS test set and ranked 1st in the 6th LSVOS Challenge RVOS Track.

Towards Boosting LLMs-driven Relevance Modeling with Progressive Retrieved Behavior-augmented Prompting

Aug 18, 2024

Zeyuan Chen, Haiyan Wu, Kaixin Wu, Wei Chen, Mingjie Zhong, Jia Xu, Zhongyi Liu, Wei Zhang

Abstract: Relevance modeling is a critical component for enhancing user experience in search engines, with the primary objective of identifying items that align with users' queries. Traditional models only rely on the semantic congruence between queries and items to ascertain relevance. However, this approach represents merely one aspect of the relevance judgement, and is insufficient in isolation. Even powerful Large Language Models (LLMs) still cannot accurately judge the relevance of a query and an item from a semantic perspective. To augment LLMs-driven relevance modeling, this study proposes leveraging user interactions recorded in search logs to yield insights into users' implicit search intentions. The challenge lies in the effective prompting of LLMs to capture dynamic search intentions, which poses several obstacles in real-world relevance scenarios, i.e., the absence of domain-specific knowledge, the inadequacy of an isolated prompt, and the prohibitive costs associated with deploying LLMs. In response, we propose ProRBP, a novel Progressive Retrieved Behavior-augmented Prompting framework for integrating search scenario-oriented knowledge with LLMs effectively. Specifically, we perform the user-driven behavior neighbors retrieval from the daily search logs to obtain domain-specific knowledge in time, retrieving candidates that users consider to meet their expectations. Then, we guide LLMs for relevance modeling by employing advanced prompting techniques that progressively improve the outputs of the LLMs, followed by a progressive aggregation with comprehensive consideration of diverse aspects. For online serving, we have developed an industrial application framework tailored for the deployment of LLMs in relevance modeling. Experiments on real-world industry data and online A/B testing demonstrate our proposal achieves promising performance.

A Systematic Evaluation of Generated Time Series and Their Effects in Self-Supervised Pretraining

Aug 15, 2024

Audrey Der, Chin-Chia Michael Yeh, Xin Dai, Huiyuan Chen, Yan Zheng, Yujie Fan, Zhongfang Zhuang, Vivian Lai, Junpeng Wang, Liang Wang(+2 more)

Abstract: Self-supervised Pretrained Models (PTMs) have demonstrated remarkable performance in computer vision and natural language processing tasks. These successes have prompted researchers to design PTMs for time series data. In our experiments, most self-supervised time series PTMs were surpassed by simple supervised models. We hypothesize this undesired phenomenon may be caused by data scarcity. In response, we test six time series generation methods, use the generated data in pretraining in lieu of the real data, and examine the effects on classification performance. Our results indicate that replacing a real-data pretraining set with a greater volume of only generated samples produces noticeable improvement.

* To appear in CIKM 2024 as a short paper; the version here is the self-contained version that includes the non-mandatory supplementary material available on the paper's companion website

Alignment-Enhanced Decoding:Defending via Token-Level Adaptive Refining of Probability Distributions

Aug 14, 2024

Quan Liu, Zhenhong Zhou, Longzhu He, Yi Liu, Wei Zhang, Sen Su

Abstract: Large language models are susceptible to jailbreak attacks, which can result in the generation of harmful content. While prior defenses mitigate these risks by perturbing or inspecting inputs, they ignore competing objectives, the underlying cause of alignment failures. In this paper, we propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. We first define the Competitive Index to quantify alignment failures and utilize feedback from self-evaluation to compute post-alignment logits. Then, AED adaptively combines AED and post-alignment logits with the original logits to obtain harmless and helpful distributions. Consequently, our method enhances safety alignment while maintaining helpfulness. We conduct experiments across five models and four common jailbreaks, with the results validating the effectiveness of our approach. Code is available at https://github.com/GIGABaozi/AED.git.
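
The abstract does not state how the logits are combined, so the following is only a generic illustration of blending two logit vectors with a per-token weight. It should not be read as AED's actual rule (see the linked repository for that). The function names, the convex-combination form, the weight values, and the two-token vocabulary are all assumptions.

```python
from math import exp


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_logits(original, post_alignment, weight):
    """Convex combination of two logit vectors (hypothetical rule).

    weight = 0 keeps the original logits; weight = 1 uses only the
    post-alignment logits. How the weight would be derived from a
    quantity such as the Competitive Index is not modeled here.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [(1 - weight) * o + weight * p
            for o, p in zip(original, post_alignment)]


# Toy two-token vocabulary: index 0 = "Sure", index 1 = "Sorry".
original = [2.0, 0.0]        # base model prefers to comply
post_alignment = [0.0, 2.0]  # self-evaluated logits prefer to refuse

mixed = blend_logits(original, post_alignment, weight=0.5)
print(mixed)           # [1.0, 1.0]
print(softmax(mixed))  # [0.5, 0.5]
```

Making the weight token-dependent rather than fixed is what would make such a scheme adaptive; the fixed 0.5 here is purely for demonstration.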

* 15 pages, 5 figures

Breaking Limits of Line-of-Sight MIMO Capacity in 6G Wireless Communications

Aug 13, 2024

Haiyue Jing, Wenchi Cheng, Wei Zhang

Abstract: Multiple-input-multiple-output (MIMO) has proven successful in fourth generation (4G) long term evolution (LTE) and is one of the key technical enablers for evolved mobile broadband (eMBB) in fifth generation (5G) wireless communications. However, as the number of antennas grows extremely large and the one-hop communication distance shrinks, significantly increasing the capacity of line-of-sight (LOS) MIMO becomes increasingly urgent. In this article, we introduce the quasi-fractal uniform circular array (QF-UCA) antenna structure based MIMO wireless communications, which can adequately exploit the potential of MIMO in LOS channels and greatly increase the capacity with low complexity demodulation schemes. Specifically, three advantages regarding QF-UCA based LOS MIMO are reviewed. Then, research challenges on transceiver alignment, low-rank channel matrix, extended dimensions of QF-UCA, maximum number of orthogonal streams, and the corresponding potential solutions are discussed. Compared with traditional scattering-dependent MIMO communications, the QF-UCA based LOS MIMO wireless communication can achieve highly efficient transmission in LOS channels.
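
To see why a low-rank channel matrix limits LOS MIMO, it may help to recall the textbook capacity expression for a MIMO channel with equal power allocation across transmit antennas. The notation is an assumption of this note, not taken from the article: $\mathbf{H}$ is the $N_r \times N_t$ channel matrix, $\rho$ the receive SNR, $r$ the rank of $\mathbf{H}$, and $\lambda_i$ the nonzero eigenvalues of $\mathbf{H}\mathbf{H}^{\mathsf{H}}$.

```latex
C = \log_2 \det\!\left( \mathbf{I}_{N_r} + \frac{\rho}{N_t}\, \mathbf{H}\mathbf{H}^{\mathsf{H}} \right)
  = \sum_{i=1}^{r} \log_2\!\left( 1 + \frac{\rho}{N_t}\, \lambda_i \right)
  \quad \text{bit/s/Hz}
```

The sum has only $r$ terms, so a rank-deficient LOS channel supports few parallel streams regardless of how many antennas are deployed.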
