Stefano Berti

A Baseline Study and Benchmark for Few-Shot Open-Set Action Recognition with Feature Residual Discrimination

Mar 04, 2026

Stefano Berti, Giulia Pasquale, Lorenzo Natale

Abstract: Few-Shot Action Recognition (FS-AR) has shown promising results but is often limited by a closed-set assumption that fails in real-world open-set scenarios. While Few-Shot Open-Set (FSOS) recognition is well-established for images, its extension to spatio-temporal video data remains underexplored. To address this, we propose an architectural extension based on a Feature-Residual Discriminator (FR-Disc), adapting previous work on skeletal data to the more complex video domain. Extensive experiments on five datasets demonstrate that while common open-set techniques provide only marginal gains, our FR-Disc significantly enhances unknown rejection capabilities without compromising closed-set accuracy, setting a new state-of-the-art for FSOS-AR. The project website, code, and benchmark are available at: https://hsp-iit.github.io/fsosar/.

The impact of Compositionality in Zero-shot Multi-label action recognition for Object-based tasks

May 14, 2024

Carmela Calabrese, Stefano Berti, Giulia Pasquale, Lorenzo Natale

Abstract: Multi-label action recognition in videos is a significant challenge for robotic applications in dynamic environments, especially when the robot must cooperate with humans in tasks that involve objects. Existing methods still struggle to recognize unseen actions or require extensive training data. To overcome these problems, we propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition. Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification. The strength of our method is that it learns only two prompts at training time, which makes it much simpler than other methods. We validate our method on the Charades dataset, which includes a majority of object-based actions, demonstrating that, despite its simplicity, our method performs favorably with respect to existing methods on the complete dataset and shows promising performance when tested on unseen actions. Our contribution emphasizes the impact of verb-object class splits when training robots for new cooperative tasks, highlighting their influence on performance and giving insights into mitigating biases.

Towards Confidence-guided Shape Completion for Robotic Applications

Sep 09, 2022

Andrea Rosasco, Stefano Berti, Fabrizio Bottarel, Michele Colledanchise, Lorenzo Natale

Abstract: Many robotic tasks involving some form of 3D visual perception greatly benefit from complete knowledge of the working environment. However, robots often have to tackle unstructured environments, and their onboard visual sensors can only provide incomplete information due to limited workspaces, clutter, or object self-occlusion. In recent years, deep learning architectures for shape completion have begun gaining traction as an effective means of inferring a complete 3D object representation from partial visual data. Nevertheless, most of the existing state-of-the-art approaches provide a fixed output resolution in the form of voxel grids, strictly tied to the size of the neural network output stage. While this is enough for some tasks, e.g., obstacle avoidance in navigation, grasping and manipulation require finer resolutions, and simply scaling up the neural network outputs is computationally expensive. In this paper, we address this limitation by proposing an object shape completion method based on an implicit 3D representation providing a confidence value for each reconstructed point. As a second contribution, we propose a gradient-based method for efficiently sampling such an implicit function at an arbitrary resolution, tunable at inference time. We experimentally validate our approach by comparing reconstructed shapes with ground truths, and by deploying our shape completion algorithm in a robotic grasping pipeline. In both cases, we compare results with a state-of-the-art shape completion approach.
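
To illustrate the general idea of gradient-based sampling from an implicit function, here is a minimal sketch. It is not the authors' method: the learned network is replaced by an analytic signed distance to a unit sphere, the per-point confidence value is omitted, and the names `implicit_fn`, `implicit_grad`, and `sample_surface` are hypothetical. The point it shows is that the number of sampled points, and hence the output resolution, is chosen at inference time rather than fixed by an output grid.

```python
import numpy as np

def implicit_fn(points, radius=1.0):
    """Hypothetical stand-in for a learned implicit function:
    signed distance from each point to a sphere of the given radius."""
    return np.linalg.norm(points, axis=1) - radius

def implicit_grad(points):
    """Analytic gradient of the sphere distance.
    With a real network this would come from automatic differentiation."""
    return points / np.linalg.norm(points, axis=1, keepdims=True)

def sample_surface(num_points, steps=5, seed=0):
    """Move random points along the gradient toward the zero level set."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(-1.5, 1.5, size=(num_points, 3))
    for _ in range(steps):
        # Step each point by its signed distance, against the gradient.
        points = points - implicit_fn(points)[:, None] * implicit_grad(points)
    return points

# The same function yields a coarse or a fine reconstruction on demand.
coarse = sample_surface(100)
fine = sample_surface(10_000)
```

For this analytic stand-in a single step already lands every point on the surface; a learned function would generally need several smaller steps.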

One-Shot Open-Set Skeleton-Based Action Recognition

Sep 09, 2022

Stefano Berti, Andrea Rosasco, Michele Colledanchise, Lorenzo Natale

Abstract: Action recognition is a fundamental capability for humanoid robots to interact and cooperate with humans. This application requires the action recognition system to be designed so that new actions can be easily added, while unknown actions are identified and ignored. In recent years, deep-learning approaches have been the principal solution to the action recognition problem. However, most models require a large dataset of manually labeled samples. In this work we target One-Shot deep-learning models, because they can work with just a single instance per class. Unfortunately, One-Shot models assume that, at inference time, the action to recognize falls into the support set, and they fail when the action lies outside it. Few-Shot Open-Set Recognition (FSOSR) solutions attempt to address that flaw, but current solutions consider only static images and not sequences of images. Static images are insufficient to discriminate actions such as sitting down and standing up. In this paper we propose a novel model that addresses the FSOSR problem with a One-Shot model that is augmented with a discriminator that rejects unknown actions. This model is useful for applications in humanoid robotics, because it allows new classes to be added easily and determines whether an input sequence is among those known to the system. We show how to train the whole model in an end-to-end fashion and we perform quantitative and qualitative analyses. Finally, we provide real-world examples.
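
The open-set setting described above can be sketched in a few lines, under strong simplifying assumptions. This is not the paper's model: the learned discriminator is replaced by a fixed cosine-similarity threshold, the embeddings are hand-written vectors rather than encoder outputs, and `classify_open_set` and the label names are hypothetical. It only shows the two behaviours the abstract asks for: picking the closest support class from one example per class, and rejecting a query that matches none of them.

```python
import numpy as np

def classify_open_set(query, support, labels, threshold=0.5):
    """Return the label of the closest support embedding, or None if rejected."""
    # Normalize so that dot products are cosine similarities.
    q = query / np.linalg.norm(query)
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    sims = s @ q
    best = int(np.argmax(sims))
    # Stand-in for a learned discriminator: a fixed similarity threshold.
    if sims[best] < threshold:
        return None
    return labels[best]

# One support embedding per class (the one-shot setting).
support = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
labels = ["sit_down", "stand_up"]

known = classify_open_set(np.array([0.9, 0.1, 0.0]), support, labels)
unknown = classify_open_set(np.array([0.0, 0.0, 1.0]), support, labels)
```

Adding a new class in this sketch only requires appending one row to `support` and one entry to `labels`, with no retraining.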

Calliope -- A Polyphonic Music Transformer

Jul 08, 2021

Andrea Valenti, Stefano Berti, Davide Bacciu

Abstract: The polyphonic nature of music makes the application of deep learning to music modelling a challenging task. On the other hand, the Transformer architecture seems to be a good fit for this kind of data. In this work, we present Calliope, a novel autoencoder model based on Transformers for the efficient modelling of multi-track sequences of polyphonic music. The experiments show that our model is able to improve the state of the art on musical sequence reconstruction and generation, with remarkably good results especially on long sequences.

* Accepted at ESANN2021
