Abstract:Tracking-Any-Point (TAP) models aim to track any point through a video, which is a crucial task in AR/XR and robotics applications. The recently introduced TAPNext approach proposes an end-to-end, recurrent transformer architecture to track points frame-by-frame in a purely online fashion, demonstrating competitive performance at minimal latency. However, we show that TAPNext struggles with longer video sequences and also frequently fails to re-detect query points that reappear after being occluded or leaving the frame. In this work, we present TAPNext++, a model that tracks points in sequences that are orders of magnitude longer while preserving the low memory and compute footprint of the architecture. We train the recurrent video transformer using several data-driven solutions, including training on long 1024-frame sequences enabled by sequence parallelism techniques. We highlight that re-detection performance is a blind spot in the current literature and introduce a new metric, Re-Detection Average Jaccard ($AJ_{RD}$), to explicitly evaluate tracking on reappearing points. To improve re-detection of points, we introduce tailored geometric augmentations, such as a periodic roll that simulates point re-entries, and supervise occluded points. We demonstrate that recurrent transformers can be substantially improved for point tracking and set a new state of the art on multiple benchmarks. Model and code can be found at https://tap-next-plus-plus.github.io.
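The periodic roll augmentation mentioned in the abstract above can be illustrated with a small sketch. The abstract does not specify how the roll is implemented, so everything here is an assumption: the function name `periodic_roll`, the `(T, H, W, C)` video layout, the `(T, N, 2)` track layout with `(x, y)` pixel coordinates, and the constant horizontal shift per frame are all hypothetical. The idea shown is only that a wrap-around shift makes points leave the frame on one side and re-enter on the other.

```python
import numpy as np

def periodic_roll(video, tracks, shift_per_frame):
    """Hypothetical wrap-around shift; not the TAPNext++ implementation.

    video:  array of shape (T, H, W, C)
    tracks: array of shape (T, N, 2) holding (x, y) pixel coordinates
    shift_per_frame: integer horizontal shift added per frame (assumed constant)
    """
    num_frames, _, width, _ = video.shape
    out_video = np.empty_like(video)
    out_tracks = tracks.astype(np.float64).copy()
    for t in range(num_frames):
        shift = (t * shift_per_frame) % width
        # Roll the frame along its width axis; columns wrap around the border.
        out_video[t] = np.roll(video[t], shift, axis=1)
        # Move the ground-truth x coordinates by the same wrapped shift.
        out_tracks[t, :, 0] = (tracks[t, :, 0] + shift) % width
    return out_video, out_tracks
```

With a point near the right border, successive frames move it out of view and back in on the left, which is one plausible way to generate re-entry events for training.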
Abstract:We present Neural Memory Object (NeMO), a novel object-centric representation that can be used to detect, segment and estimate the 6DoF pose of objects unseen during training using RGB images. Our method consists of an encoder that requires only a few RGB template views depicting an object to generate a sparse object-like point cloud using a learned UDF containing semantic and geometric information. Next, a decoder takes the object encoding together with a query image to generate a variety of dense predictions. Through extensive experiments, we show that our method can be used for few-shot object perception without requiring any camera-specific parameters or retraining on target data. Our proposed concept of outsourcing object information in a NeMO and using a single network for multiple perception tasks enhances interaction with novel objects, improving scalability and efficiency by enabling quick object onboarding without retraining or extensive pre-processing. We report competitive and state-of-the-art results on various datasets and perception tasks of the BOP benchmark, demonstrating the versatility of our approach. https://github.com/DLR-RM/nemo
Abstract:In coherent optical communication systems, the laser phase noise is commonly modeled as a Wiener process. We propose a sliding-window-based linearization of the phase noise, enabling a novel description. We show that, by stochastically modeling the residual error introduced by this approximation, equalization-enhanced phase noise (EEPN) can be described and decomposed into four different components. Furthermore, we analyze the four components separately and provide a stochastic model for each of them. This novel model makes it possible to predict the impact of well-known algorithms in coherent digital signal processing (DSP) pipelines, such as timing recovery (TR) and carrier phase recovery (CPR), on each of the terms. It thus allows the signal affected by EEPN to be approximated after each of these DSP steps and helps to derive appropriate ways of mitigating such effects.
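As a rough illustration of the two ingredients named in the abstract above, the following sketch simulates Wiener phase noise and approximates it by a straight line inside one window, leaving a residual error. The Wiener increment variance 2π·Δν·T_s is the standard model; the window handling is an assumption, since the abstract does not state how the linearization is computed. Here a least-squares line fit is used, and the names `wiener_phase_noise` and `linearize_window`, as well as the linewidth, sample rate, and window length, are hypothetical choices.

```python
import numpy as np

def wiener_phase_noise(n_samples, linewidth_hz, sample_rate_hz, rng):
    # Standard Wiener model: independent Gaussian increments with
    # variance 2*pi*linewidth*Ts, accumulated over time.
    sigma2 = 2.0 * np.pi * linewidth_hz / sample_rate_hz
    increments = rng.normal(0.0, np.sqrt(sigma2), size=n_samples)
    return np.cumsum(increments)

def linearize_window(phase, start, length):
    # Assumed linearization: least-squares line fit inside one window.
    k = np.arange(length)
    segment = phase[start:start + length]
    slope, intercept = np.polyfit(k, segment, deg=1)
    linear = slope * k + intercept
    # The residual is the part of the phase noise the line does not capture.
    return linear, segment - linear

rng = np.random.default_rng(0)
phase = wiener_phase_noise(10_000, linewidth_hz=100e3, sample_rate_hz=64e9, rng=rng)
linear, residual = linearize_window(phase, start=2_000, length=256)
```

Sliding the window start over the sequence and collecting the residuals gives an empirical view of the approximation error that the abstract proposes to model stochastically.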




Abstract:Higher-order solitons inherently possess a spatial periodicity along the propagation axis. The pulse expands and compresses in both the frequency and time domains. This property is exploited for a bandwidth-limited receiver by sampling the optical signal at two different distances. Numerical simulations show that when pure solitons are transmitted and the second (i.e., further propagated) signal is also processed, a significant gain in terms of required receiver bandwidth is obtained. Since all pulses propagating in a nonlinear optical fiber exhibit solitonic behavior given sufficient input power and propagation distance, the above concept can also be applied to spectrally efficient Nyquist pulse shaping and higher symbol rates. Transmitter and receiver are trainable structures as part of an autoencoder, aiming to learn a suitable predistortion and post-equalization using both signals to increase the spectral efficiency.
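The spatial periodicity mentioned in the abstract above can be made concrete with the textbook relations for solitons of the nonlinear Schrödinger equation. The symbols below do not appear in the abstract and are introduced here as assumptions: pulse width $T_0$, group-velocity dispersion $\beta_2$, nonlinear coefficient $\gamma$, peak power $P_0$, dispersion length $L_D$, and soliton order $N$.

```latex
% Soliton order N and dispersion length L_D (standard definitions)
N^2 = \frac{\gamma P_0 T_0^2}{|\beta_2|}, \qquad
L_D = \frac{T_0^2}{|\beta_2|}

% For integer N >= 2 and input u(0,\tau) = N \,\mathrm{sech}(\tau),
% the pulse shape recurs with the soliton period z_0:
z_0 = \frac{\pi}{2} L_D = \frac{\pi}{2}\,\frac{T_0^2}{|\beta_2|}
```

Within one period $z_0$ the pulse compresses and broadens again in time while its spectrum does the opposite, which is the behavior that sampling at two different distances relies on.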