The new trend in the multi-object tracking task is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders its progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos, detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions, and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object tracking framework called MLS-Track, where the interaction between the model and text is enhanced layer by layer through the introduction of a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of our proposed framework, which achieves state-of-the-art performance. Code and datasets will be made available.
Automated Valet Parking (AVP) requires precise localization in challenging garage conditions, including poor lighting, sparse textures, repetitive structures, dynamic scenes, and the absence of Global Positioning System (GPS) signals, which often pose problems for conventional localization methods. To address these adversities, we present AVM-SLAM, a semantic visual SLAM framework with multi-sensor fusion in a Bird's Eye View (BEV). Our framework integrates four fisheye cameras, four wheel encoders, and an Inertial Measurement Unit (IMU). The fisheye cameras form an Around View Monitor (AVM) subsystem, generating BEV images. Convolutional Neural Networks (CNNs) extract semantic features from these images, aiding in mapping and localization tasks. These semantic features provide long-term stability and perspective invariance, effectively mitigating environmental challenges. Additionally, data fusion from wheel encoders and IMU enhances system robustness by improving motion estimation and reducing drift. To validate AVM-SLAM's efficacy and robustness, we provide a large-scale, high-resolution underground garage dataset, available at https://github.com/yale-cv/avm-slam. This dataset enables researchers to further explore and assess AVM-SLAM in similar environments.
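As a rough illustration of why wheel encoders and an IMU help motion estimation, the sketch below performs planar dead reckoning: the forward speed comes from the averaged wheel-encoder speeds and the heading change from the IMU yaw rate. This is not AVM-SLAM's fusion method, which the abstract does not detail; the function name `dead_reckon`, the averaging of the four wheel speeds, and the midpoint integration are all assumptions made for this example.

```python
import math


def dead_reckon(pose, wheel_speeds, yaw_rate, dt):
    """Advance a planar pose (x, y, theta) by one time step.

    Hypothetical helper, not taken from AVM-SLAM.
    wheel_speeds: per-wheel linear speeds in m/s (e.g. from four encoders).
    yaw_rate: angular velocity around the vertical axis in rad/s (from the IMU).
    dt: time step in seconds.
    """
    x, y, theta = pose
    # Assumption: vehicle speed is approximated by the mean wheel speed.
    v = sum(wheel_speeds) / len(wheel_speeds)
    # Midpoint integration: use the heading halfway through the step.
    theta_mid = theta + 0.5 * yaw_rate * dt
    x += v * math.cos(theta_mid) * dt
    y += v * math.sin(theta_mid) * dt
    theta += yaw_rate * dt
    return (x, y, theta)


# Drive straight at 1 m/s for 1 s in ten 0.1 s steps.
pose = (0.0, 0.0, 0.0)
for _ in range(10):
    pose = dead_reckon(pose, [1.0, 1.0, 1.0, 1.0], 0.0, 0.1)
```

Dead reckoning of this kind accumulates error over time, which is why it is typically combined with map-based observations such as the semantic features described above.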
Compared with contact-based fingerprint acquisition techniques, contactless acquisition has the advantages of less skin distortion, larger fingerprint area, and hygienic acquisition. However, perspective distortion is a challenge in contactless fingerprint recognition: it changes ridge orientation, frequency, and minutiae location, and thus degrades recognition accuracy. We propose a learning-based shape-from-texture algorithm to reconstruct a 3D finger shape from a single image and unwarp the raw image to suppress perspective distortion. Experimental results on contactless fingerprint databases show that the proposed method has high 3D reconstruction accuracy. Contactless-to-contact and contactless-to-contactless matching experiments show that the proposed method improves matching accuracy.
Few-shot semantic segmentation aims at recognizing the object regions of unseen categories with only a few annotated examples as supervision. The key to few-shot segmentation is to establish a robust semantic relationship between the support and query images and to prevent overfitting. In this paper, we propose an effective Multi-similarity Hyperrelation Network (MSHNet) to tackle the few-shot semantic segmentation problem. In MSHNet, we propose a new Generative Prototype Similarity (GPS), which together with cosine similarity can establish a strong semantic relation between the support and query images. The locally generated prototype similarity based on global features is logically complementary to the global cosine similarity based on local features, and the relationship between the query image and the support image can be expressed more comprehensively by using the two similarities simultaneously. In addition, we propose a Symmetric Merging Block (SMB) in MSHNet to efficiently merge multi-layer, multi-shot and multi-similarity hyperrelational features. MSHNet is built on the basis of similarity rather than specific category features, which can achieve more general unity and effectively reduce overfitting. On two benchmark semantic segmentation datasets, Pascal-5i and COCO-20i, MSHNet achieves new state-of-the-art performance on 1-shot and 5-shot semantic segmentation tasks.
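To make the idea of a support-query similarity concrete, the following minimal sketch computes a cosine similarity score between a masked-average support prototype and each query feature vector. It illustrates only a common prototype-plus-cosine baseline from few-shot segmentation; it is not the Generative Prototype Similarity or any other MSHNet component, and the function names and toy feature values are assumptions made for this example.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def masked_average_prototype(features, mask):
    """Average the support features at foreground locations (mask == 1).

    features: one feature vector per spatial location.
    mask: 0/1 foreground labels, one per location.
    """
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no foreground locations")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def similarity_map(query_features, prototype):
    """Score every query location against the support prototype."""
    return [cosine_similarity(q, prototype) for q in query_features]


# Toy data (hypothetical): three support locations, two query locations.
support_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
support_mask = [1, 1, 0]
query_features = [[2.0, 0.1], [0.0, 3.0]]

prototype = masked_average_prototype(support_features, support_mask)
scores = similarity_map(query_features, prototype)
```

Because cosine similarity ignores vector magnitude, the first query location scores close to 1 despite its larger norm, while the second, nearly orthogonal to the prototype, scores close to 0.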
Dense registration of fingerprints is a challenging task due to elastic skin distortion, low image quality, and self-similarity of ridge patterns. To overcome the limitations of handcrafted features, we propose to train an end-to-end network to directly output the pixel-wise displacement field between two fingerprints. The proposed network includes a siamese network for feature embedding, followed by an encoder-decoder network for regressing the displacement field. By applying displacement fields reliably estimated by tracing high-quality fingerprint videos to challenging fingerprints, we synthesize a large number of training fingerprint pairs with ground-truth displacement fields. In addition, based on the proposed registration algorithm, we propose a fingerprint mosaicking method based on optimal seam selection. Registration and matching experiments on the FVC2004 databases, the Tsinghua Distorted Fingerprint (TDF) database, and the NIST SD27 latent fingerprint database show that our registration method outperforms previous dense registration methods in accuracy and efficiency. A mosaicking experiment on FVC2004 DB1 demonstrates that the proposed algorithm produces higher quality fingerprints than other algorithms, which also validates the performance of our registration algorithm.
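The sketch below shows what applying a pixel-wise displacement field to an image can look like, using backward warping with bilinear interpolation. It is a generic illustration, not the authors' implementation: the convention that output pixel (x, y) samples the input at (x + dx, y + dy), the zero fill for out-of-range samples, and the function names are all assumptions made for this example.

```python
import math


def bilinear_sample(image, x, y, fill=0.0):
    """Sample a 2D image (list of rows) at a real-valued position (x, y)."""
    h, w = len(image), len(image[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return fill  # position falls outside the image
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * image[y0][x0] + ax * image[y0][x1]
    bottom = (1 - ax) * image[y1][x0] + ax * image[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp_with_displacement(image, dx, dy, fill=0.0):
    """Backward-warp an image with per-pixel displacements.

    Assumed convention: output[y][x] = image sampled at
    (x + dx[y][x], y + dy[y][x]).
    """
    h, w = len(image), len(image[0])
    return [
        [bilinear_sample(image, x + dx[y][x], y + dy[y][x], fill) for x in range(w)]
        for y in range(h)
    ]


# Toy example: shift the sampling position one pixel to the right everywhere.
image = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
dx = [[1.0] * 3 for _ in range(3)]
dy = [[0.0] * 3 for _ in range(3)]
warped = warp_with_displacement(image, dx, dy)
```

With this uniform field, each output row is the input row shifted left by one pixel, and the last column takes the fill value because its sampling position lies outside the image.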