Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Abdeslam Boularias

OVIR-3D: Open-Vocabulary 3D Instance Retrieval Without Training on 3D Data

Nov 06, 2023

Shiyang Lu, Haonan Chang, Eric Pu Jing, Abdeslam Boularias, Kostas Bekris

Abstract:This work presents OVIR-3D, a straightforward yet effective method for open-vocabulary 3D object instance retrieval without using any 3D data for training. Given a language query, the proposed method is able to return a ranked set of 3D object instance segments based on the feature similarity of the instance and the text query. This is achieved by a multi-view fusion of text-aligned 2D region proposals into 3D space, where the 2D region proposal network could leverage 2D datasets, which are more accessible and typically larger than 3D datasets. The proposed fusion process is efficient as it can be performed in real-time for most indoor 3D scenes and does not require additional training in 3D space. Experiments on public datasets and a real robot show the effectiveness of the method and its potential for applications in robot navigation and manipulation.

Via

Access Paper or Ask Questions

LGMCTS: Language-Guided Monte-Carlo Tree Search for Executable Semantic Object Rearrangement

Sep 27, 2023

Haonan Chang, Kai Gao, Kowndinya Boyalakuntla, Alex Lee, Baichuan Huang, Harish Udhaya Kumar, Jinjin Yu, Abdeslam Boularias

Figure 1 for LGMCTS: Language-Guided Monte-Carlo Tree Search for Executable Semantic Object Rearrangement

Figure 2 for LGMCTS: Language-Guided Monte-Carlo Tree Search for Executable Semantic Object Rearrangement

Figure 3 for LGMCTS: Language-Guided Monte-Carlo Tree Search for Executable Semantic Object Rearrangement

Figure 4 for LGMCTS: Language-Guided Monte-Carlo Tree Search for Executable Semantic Object Rearrangement

Abstract:We introduce a novel approach to the executable semantic object rearrangement problem. In this challenge, a robot seeks to create an actionable plan that rearranges objects within a scene according to a pattern dictated by a natural language description. Unlike existing methods such as StructFormer and StructDiffusion, which tackle the issue in two steps by first generating poses and then leveraging a task planner for action plan formulation, our method concurrently addresses pose generation and action planning. We achieve this integration using a Language-Guided Monte-Carlo Tree Search (LGMCTS). Quantitative evaluations are provided on two simulation datasets, and complemented by qualitative tests with a real robot.

* Our code and supplementary materials are accessible at https://github.com/changhaonan/LG-MCTS

Via

Access Paper or Ask Questions

Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs

Sep 27, 2023

Haonan Chang, Kowndinya Boyalakuntla, Shiyang Lu, Siwei Cai, Eric Jing, Shreesh Keskar, Shijie Geng, Adeeb Abbas, Lifeng Zhou, Kostas Bekris(+1 more)

Figure 1 for Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs

Figure 2 for Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs

Figure 3 for Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs

Figure 4 for Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs

Abstract:We present an Open-Vocabulary 3D Scene Graph (OVSG), a formal framework for grounding a variety of entities, such as object instances, agents, and regions, with free-form text-based queries. Unlike conventional semantic-based object localization approaches, our system facilitates context-aware entity localization, allowing for queries such as ``pick up a cup on a kitchen table" or ``navigate to a sofa on which someone is sitting". In contrast to existing research on 3D scene graphs, OVSG supports free-form text input and open-vocabulary querying. Through a series of comparative experiments using the ScanNet dataset and a self-collected dataset, we demonstrate that our proposed approach significantly surpasses the performance of previous semantic-based localization techniques. Moreover, we highlight the practical application of OVSG in real-world robot navigation and manipulation experiments.

* The code and dataset used for evaluation can be found at https://github.com/changhaonan/OVSG}{https://github.com/changhaonan/OVSG. This paper has been accepted by CoRL2023

Via

Access Paper or Ask Questions

Detect Every Thing with Few Examples

Sep 22, 2023

Xinyu Zhang, Yuting Wang, Abdeslam Boularias

Abstract:Open-set object detection aims at detecting arbitrary categories beyond those seen during training. Most recent advancements have adopted the open-vocabulary paradigm, utilizing vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that employs vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve general detection ability, we transform multi-classification tasks into binary classification tasks while bypassing per-class inference, and propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmark with COCO and LVIS. For COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 in novel classes. DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot and one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at https://github.com/mlzxy/devit.

Via

Access Paper or Ask Questions

Optical Flow boosts Unsupervised Localization and Segmentation

Jul 25, 2023

Xinyu Zhang, Abdeslam Boularias

Abstract:Unsupervised localization and segmentation are long-standing robot vision challenges that describe the critical ability for an autonomous robot to learn to decompose images into individual objects without labeled data. These tasks are important because of the limited availability of dense image manual annotation and the promising vision of adapting to an evolving set of object categories in lifelong learning. Most recent methods focus on using visual appearance continuity as object cues by spatially clustering features obtained from self-supervised vision transformers (ViT). In this work, we leverage motion cues, inspired by the common fate principle that pixels that share similar movements tend to belong to the same object. We propose a new loss term formulation that uses optical flow in unlabeled videos to encourage self-supervised ViT features to become closer to each other if their corresponding spatial locations share similar movements, and vice versa. We use the proposed loss function to finetune vision transformers that were originally trained on static images. Our fine-tuning procedure outperforms state-of-the-art techniques for unsupervised semantic segmentation through linear probing, without the use of any labeled data. This procedure also demonstrates increased performance over original ViT networks across unsupervised object localization and semantic segmentation benchmarks.

* Accepted at IROS2023

Via

Access Paper or Ask Questions

Self-Supervised Learning of Object Segmentation from Unlabeled RGB-D Videos

Apr 09, 2023

Shiyang Lu, Yunfu Deng, Abdeslam Boularias, Kostas Bekris

Figure 1 for Self-Supervised Learning of Object Segmentation from Unlabeled RGB-D Videos

Figure 2 for Self-Supervised Learning of Object Segmentation from Unlabeled RGB-D Videos

Figure 3 for Self-Supervised Learning of Object Segmentation from Unlabeled RGB-D Videos

Figure 4 for Self-Supervised Learning of Object Segmentation from Unlabeled RGB-D Videos

Abstract:This work proposes a self-supervised learning system for segmenting rigid objects in RGB images. The proposed pipeline is trained on unlabeled RGB-D videos of static objects, which can be captured with a camera carried by a mobile robot. A key feature of the self-supervised training process is a graph-matching algorithm that operates on the over-segmentation output of the point cloud that is reconstructed from each video. The graph matching, along with point cloud registration, is able to find reoccurring object patterns across videos and combine them into 3D object pseudo labels, even under occlusions or different viewing angles. Projected 2D object masks from 3D pseudo labels are used to train a pixel-wise feature extractor through contrastive learning. During online inference, a clustering method uses the learned features to cluster foreground pixels into object segments. Experiments highlight the method's effectiveness on both real and synthetic video datasets, which include cluttered scenes of tabletop objects. The proposed method outperforms existing unsupervised methods for object segmentation by a large margin.

Via

Access Paper or Ask Questions

Mono-STAR: Mono-camera Scene-level Tracking and Reconstruction

Jan 30, 2023

Haonan Chang, Dhruv Metha Ramesh, Shijie Geng, Yuqiu Gan, Abdeslam Boularias

Figure 1 for Mono-STAR: Mono-camera Scene-level Tracking and Reconstruction

Figure 2 for Mono-STAR: Mono-camera Scene-level Tracking and Reconstruction

Figure 3 for Mono-STAR: Mono-camera Scene-level Tracking and Reconstruction

Figure 4 for Mono-STAR: Mono-camera Scene-level Tracking and Reconstruction

Abstract:We present Mono-STAR, the first real-time 3D reconstruction system that simultaneously supports semantic fusion, fast motion tracking, non-rigid object deformation, and topological change under a unified framework. The proposed system solves a new optimization problem incorporating optical-flow-based 2D constraints to deal with fast motion and a novel semantic-aware deformation graph (SAD-graph) for handling topology change. We test the proposed system under various challenging scenes and demonstrate that it significantly outperforms existing state-of-the-art methods.

* This paper has been accepted by ICRA2023

Via

Access Paper or Ask Questions

Scene-level Tracking and Reconstruction without Object Priors

Oct 07, 2022

Haonan Chang, Abdeslam Boularias

Figure 1 for Scene-level Tracking and Reconstruction without Object Priors

Figure 2 for Scene-level Tracking and Reconstruction without Object Priors

Figure 3 for Scene-level Tracking and Reconstruction without Object Priors

Figure 4 for Scene-level Tracking and Reconstruction without Object Priors

Abstract:We present the first real-time system capable of tracking and reconstructing, individually, every visible object in a given scene, without any form of prior on the rigidness of the objects, texture existence, or object category. In contrast with previous methods such as Co-Fusion and MaskFusion that first segment the scene into individual objects and then process each object independently, the proposed method dynamically segments the non-rigid scene as part of the tracking and reconstruction process. When new measurements indicate topology change, reconstructed models are updated in real-time to reflect that change. Our proposed system can provide the live geometry and deformation of all visible objects in a novel scene in real-time, which makes it possible to be integrated seamlessly into numerous existing robotics applications that rely on object models for grasping and manipulation. The capabilities of the proposed system are demonstrated in challenging scenes that contain multiple rigid and non-rigid objects.

* Accepted by IROS2022

Via

Access Paper or Ask Questions

Learning Category-Level Manipulation Tasks from Point Clouds with Dynamic Graph CNNs

Sep 13, 2022

Junchi Liang, Abdeslam Boularias

Figure 1 for Learning Category-Level Manipulation Tasks from Point Clouds with Dynamic Graph CNNs

Figure 2 for Learning Category-Level Manipulation Tasks from Point Clouds with Dynamic Graph CNNs

Figure 3 for Learning Category-Level Manipulation Tasks from Point Clouds with Dynamic Graph CNNs

Figure 4 for Learning Category-Level Manipulation Tasks from Point Clouds with Dynamic Graph CNNs

Abstract:This paper presents a new technique for learning category-level manipulation from raw RGB-D videos of task demonstrations, with no manual labels or annotations. Category-level learning aims to acquire skills that can be generalized to new objects, with geometries and textures that are different from the ones of the objects used in the demonstrations. We address this problem by first viewing both grasping and manipulation as special cases of tool use, where a tool object is moved to a sequence of key-poses defined in a frame of reference of a target object. Tool and target objects, along with their key-poses, are predicted using a dynamic graph convolutional neural network that takes as input an automatically segmented depth and color image of the entire scene. Empirical results on object manipulation tasks with a real robotic arm show that the proposed network can efficiently learn from real visual demonstrations to perform the tasks on novel objects within the same category, and outperforms alternative approaches.

Via

Access Paper or Ask Questions

Parallel Monte Carlo Tree Search with Batched Rigid-body Simulations for Speeding up Long-Horizon Episodic Robot Planning

Jul 14, 2022

Baichuan Huang, Abdeslam Boularias, Jingjin Yu

Figure 1 for Parallel Monte Carlo Tree Search with Batched Rigid-body Simulations for Speeding up Long-Horizon Episodic Robot Planning

Figure 2 for Parallel Monte Carlo Tree Search with Batched Rigid-body Simulations for Speeding up Long-Horizon Episodic Robot Planning

Figure 3 for Parallel Monte Carlo Tree Search with Batched Rigid-body Simulations for Speeding up Long-Horizon Episodic Robot Planning

Figure 4 for Parallel Monte Carlo Tree Search with Batched Rigid-body Simulations for Speeding up Long-Horizon Episodic Robot Planning

Abstract:We propose a novel Parallel Monte Carlo tree search with Batched Simulations (PMBS) algorithm for accelerating long-horizon, episodic robotic planning tasks. Monte Carlo tree search (MCTS) is an effective heuristic search algorithm for solving episodic decision-making problems whose underlying search spaces are expansive. Leveraging a GPU-based large-scale simulator, PMBS introduces massive parallelism into MCTS for solving planning tasks through the batched execution of a large number of concurrent simulations, which allows for more efficient and accurate evaluations of the expected cost-to-go over large action spaces. When applied to the challenging manipulation tasks of object retrieval from clutter, PMBS achieves a speedup of over $30\times$ with an improved solution quality, in comparison to a serial MCTS implementation. We show that PMBS can be directly applied to real robot hardware with negligible sim-to-real differences. Supplementary material, including video, can be found at https://github.com/arc-l/pmbs.

* Accepted for IROS 2022

Via

Access Paper or Ask Questions