State Key Laboratory of Millimeter Waves, Southeast University, Nanjing, China; Purple Mountain Laboratories, Nanjing, China
Abstract: Vehicle-to-infrastructure (V2I) cooperative perception plays a crucial role in autonomous driving scenarios. Despite its potential to improve perception accuracy and robustness, the large amount of raw sensor data inevitably results in high communication overhead. To mitigate this issue, we propose TOCOM-V2I, a task-oriented communication framework for V2I cooperative perception, which reduces bandwidth consumption by transmitting only task-relevant information, instead of the raw data stream, for perceiving the surrounding environment. Our contributions are threefold. First, we propose a spatial-aware feature selection module to filter out irrelevant information based on spatial relationships and perceptual priors. Second, we introduce a hierarchical entropy model to exploit redundancy within the features for efficient compression and transmission. Finally, we utilize a scaled dot-product attention architecture to fuse vehicle-side and infrastructure-side features to improve perception performance. Experimental results demonstrate the effectiveness of TOCOM-V2I.
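As an illustration of the fusion step described above, the following is a minimal PyTorch sketch, assuming vehicle-side and infrastructure-side bird's-eye-view feature maps of identical shape; the module and tensor names are hypothetical and not taken from the TOCOM-V2I implementation.

```python
# Illustrative sketch (not the authors' code): fusing vehicle-side and
# infrastructure-side BEV features with scaled dot-product attention.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Linear(channels, channels)       # queries from vehicle features
        self.kv = nn.Linear(channels, 2 * channels)  # keys/values from infrastructure features

    def forward(self, veh_feat, infra_feat):
        # veh_feat, infra_feat: (B, C, H, W) bird's-eye-view feature maps
        B, C, H, W = veh_feat.shape
        q = self.q(veh_feat.flatten(2).transpose(1, 2))                   # (B, HW, C)
        k, v = self.kv(infra_feat.flatten(2).transpose(1, 2)).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)    # (B, HW, HW)
        fused = attn @ v + q                                              # residual fusion
        return fused.transpose(1, 2).reshape(B, C, H, W)

# usage with random feature maps
fusion = AttentionFusion(channels=64)
out = fusion(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```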
Abstract: Appearance-based supervised methods with full-face image input have made tremendous advances in recent gaze estimation tasks. However, the need for intensive human annotation prevents current methods from achieving industrial-level accuracy and robustness. Although current unsupervised pre-training frameworks have achieved success in many image recognition tasks, the deep coupling between facial and eye features means that such frameworks are still deficient at extracting useful gaze features from the full face. To alleviate these limitations, this work proposes a novel unsupervised/self-supervised gaze pre-training framework, which forces the full-face branch to learn a low-dimensional gaze embedding without gaze annotations through collaborative feature-contrast and squeeze modules. At the heart of this framework is an alternating eye-attended/unattended masking training scheme, which squeezes gaze-related information from the full-face branch into an eye-masked auto-encoder through an injection bottleneck design; this encourages the model to pay more attention to gaze direction rather than facial textures alone, while still adopting the eye self-reconstruction objective. At the same time, a novel eye/gaze-related information contrastive loss is designed to further boost the learned representation by forcing the model to focus on eye-centered regions. Extensive experimental results on several gaze benchmarks demonstrate that the proposed scheme achieves superior performance over unsupervised state-of-the-art methods.
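For intuition, below is a minimal sketch of an NT-Xent-style contrastive objective between full-face and eye-centered embeddings; this form is assumed for illustration only, and the paper's eye/gaze-related contrastive loss may differ.

```python
# Minimal NT-Xent-style contrastive loss sketch (assumed form, not the paper's exact loss).
import torch
import torch.nn.functional as F

def contrastive_loss(z_face, z_eye, temperature=0.1):
    """z_face, z_eye: (B, D) embeddings of the full-face and eye-centered views."""
    z_face = F.normalize(z_face, dim=-1)
    z_eye = F.normalize(z_eye, dim=-1)
    logits = z_face @ z_eye.t() / temperature            # (B, B) cosine similarities
    labels = torch.arange(z_face.size(0), device=z_face.device)
    # matching face/eye pairs are positives; all other pairs in the batch are negatives
    return F.cross_entropy(logits, labels)
```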
Abstract: Recently, text-to-image diffusion models have demonstrated an impressive ability to generate high-quality images conditioned on textual input. However, these models struggle to accurately adhere to textual instructions regarding spatial layout. While previous research has primarily focused on aligning cross-attention maps with layout conditions, it overlooks the impact of the initialization noise on layout guidance. To achieve better layout control, we propose leveraging spatial-aware initialization noise during the denoising process. Specifically, we find that a reference image inverted with a finite number of inversion steps contains valuable spatial awareness of the object's position, resulting in similar layouts in the generated images. Based on this observation, we develop an open-vocabulary framework to customize a spatial-aware initialization noise for each layout condition. Without modifying any module other than the initialization noise, our approach can be seamlessly integrated as a plug-and-play module into other training-free layout guidance frameworks. We evaluate our approach quantitatively and qualitatively on the publicly available Stable Diffusion model and the COCO dataset. Equipped with the spatial-aware latent initialization, our method significantly improves the effectiveness of layout guidance while preserving high-quality content.
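For reference, the following is a hedged sketch of a single deterministic DDIM inversion step of the kind commonly used to map a reference-image latent back toward noise; the function name and schedule handling are illustrative assumptions, not the paper's implementation.

```python
# Sketch of one deterministic DDIM inversion step (eta = 0). Variable names and
# schedule handling are illustrative assumptions, not the paper's code.
def ddim_invert_step(x_t, eps_pred, alpha_bar_t, alpha_bar_next):
    """Map latent x_t one step toward noise, given the predicted noise eps_pred.

    alpha_bar_t / alpha_bar_next: cumulative noise-schedule products at the
    current and next (noisier) timestep."""
    # predicted clean latent x_0 from the current noisy latent
    x0_pred = (x_t - (1 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_bar_t ** 0.5
    # re-noise x_0 toward the next timestep using the same predicted noise
    return alpha_bar_next ** 0.5 * x0_pred + (1 - alpha_bar_next) ** 0.5 * eps_pred
```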
Abstract: Document pair extraction aims to identify key and value entities, as well as their relationships, from visually-rich documents. Most existing methods divide it into two separate tasks: semantic entity recognition (SER) and relation extraction (RE). However, simply concatenating SER and RE serially can lead to severe error propagation, and it fails to handle cases such as multi-line entities in real scenarios. To address these issues, this paper introduces a novel framework, PEneo (Pair Extraction new decoder option), which performs document pair extraction in a unified pipeline, incorporating three concurrent sub-tasks: line extraction, line grouping, and entity linking. This approach alleviates the error accumulation problem and can handle the case of multi-line entities. Furthermore, to better evaluate model performance and to facilitate future research on pair extraction, we introduce RFUND, a re-annotated version of the commonly used FUNSD and XFUND datasets, which makes them more accurate and covers realistic situations. Experiments on various benchmarks demonstrate PEneo's superiority over previous pipelines, boosting performance by a large margin (e.g., 19.89%-22.91% F1 score on RFUND-EN) when combined with various backbones such as LiLT and LayoutLMv3, showing its effectiveness and generality. Code and the new annotations will be made publicly available.
Abstract: Deep neural networks (DNNs) that tackle the time series classification (TSC) task have provided a promising framework in signal processing. In real-world applications, as data-driven models, DNNs suffer from insufficient data. Few-shot learning has been studied to deal with this limitation. In this paper, we propose a novel few-shot learning framework based on data augmentation, which involves transformation into the time-frequency domain and the generation of synthetic images through random erasing. Additionally, we develop a sequence-spectrogram neural network (SSNN). This model is composed of two sub-networks: one utilizes 1D residual blocks to extract features from the input sequence, while the other employs 2D residual blocks to extract features from the spectrogram representation. In the experiments, comparison studies of different existing DNN models with and without data augmentation are conducted on an amyotrophic lateral sclerosis (ALS) dataset and a wind turbine fault (WTF) dataset. The experimental results show that our proposed method achieves a 93.75% F1 score and 93.33% accuracy on the ALS dataset, and a 95.48% F1 score and 95.59% accuracy on the WTF dataset. Our methodology demonstrates its applicability to few-shot problems in time series classification.
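The augmentation idea can be illustrated with a short sketch combining a time-frequency transform of the 1D sequence with random erasing on the resulting spectrogram; the parameter values below are illustrative assumptions rather than the paper's settings.

```python
# Sketch of the two augmentation ideas described above: a time-frequency
# transform and random erasing on the resulting spectrogram (illustrative only).
import numpy as np
from scipy.signal import stft

def sequence_to_spectrogram(x, fs=1000, nperseg=128):
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    return np.abs(Z)                                      # magnitude spectrogram (freq x time)

def random_erase(spec, max_frac=0.3, rng=None):
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    f, t = spec.shape
    fh = rng.integers(1, max(2, int(f * max_frac)))       # height of erased patch
    tw = rng.integers(1, max(2, int(t * max_frac)))       # width of erased patch
    f0 = rng.integers(0, f - fh + 1)
    t0 = rng.integers(0, t - tw + 1)
    spec[f0:f0 + fh, t0:t0 + tw] = 0.0                    # zero out a random patch
    return spec

# usage: augment a raw time series before feeding the 2D branch of the network
spec = sequence_to_spectrogram(np.random.randn(4096))
aug = random_erase(spec)
```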
Abstract: We develop an effective point cloud rendering pipeline for novel view synthesis, which enables high-fidelity local detail reconstruction, real-time rendering, and user-friendly editing. At the heart of our pipeline is an adaptive frequency modulation module called Adaptive Frequency Net (AFNet), which utilizes a hypernetwork to learn a local texture frequency encoding that is consecutively injected into adaptive frequency activation layers to modulate the implicit radiance signal. This mechanism improves the frequency expressiveness of the network with richer frequency basis support, at only a small computational cost. To further boost performance, a preprocessing module is also proposed for point cloud geometry optimization via point opacity estimation. In contrast to implicit rendering, our pipeline supports high-fidelity interactive editing based on point cloud manipulation. Extensive experimental results on the NeRF-Synthetic, ScanNet, DTU, and Tanks and Temples datasets demonstrate the superior performance achieved by our method in terms of PSNR, SSIM, and LPIPS, in comparison to the state-of-the-art.
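To make the frequency-modulation idea concrete, here is a minimal sketch in which a small hypernetwork predicts per-point frequencies that modulate a sine activation; the layer structure is assumed for illustration and is not the authors' AFNet code.

```python
# Illustrative sketch (assumed form, not the authors' AFNet): a hypernetwork
# predicts per-point frequencies that modulate a sine activation.
import torch
import torch.nn as nn

class AdaptiveFreqLayer(nn.Module):
    def __init__(self, dim_in, dim_out, dim_feat):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)
        # hypernetwork: maps a local point-cloud feature to per-channel frequencies
        self.hyper = nn.Sequential(nn.Linear(dim_feat, dim_out), nn.Softplus())

    def forward(self, x, local_feat):
        freq = self.hyper(local_feat)             # (N, dim_out) positive frequencies
        return torch.sin(freq * self.linear(x))   # frequency-modulated activation

# usage: 1024 query points with 32-dimensional local texture features
layer = AdaptiveFreqLayer(dim_in=3, dim_out=64, dim_feat=32)
out = layer(torch.randn(1024, 3), torch.randn(1024, 32))
```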
Abstract: Spatial audio, which focuses on immersive 3D sound rendering, is widely applied in the acoustic industry. One of the key problems of current spatial audio rendering methods is the lack of personalization based on the different anatomies of individuals, which is essential for producing accurate sound source positions. In this work, we address this problem from an interdisciplinary perspective. The rendering of spatial audio is strongly correlated with the 3D shape of human bodies, particularly the ears. To this end, we propose to achieve personalized spatial audio by reconstructing 3D human ears from single-view images. First, to benchmark the ear reconstruction task, we introduce AudioEar3D, a high-quality 3D ear dataset consisting of 112 point cloud ear scans with RGB images. To train a reconstruction model in a self-supervised manner, we further collect a 2D ear dataset composed of 2,000 images, each with manual annotations of occlusion and 55 landmarks, named AudioEar2D. To our knowledge, both datasets are the largest and highest-quality publicly available datasets of their kind. Further, we propose AudioEarM, a reconstruction method guided by a depth estimation network trained on synthetic data, with two loss functions tailored for ear data. Lastly, to bridge the gap between the vision and acoustics communities, we develop a pipeline to integrate the reconstructed ear mesh with an off-the-shelf 3D human body and simulate a personalized Head-Related Transfer Function (HRTF), which is the core of spatial audio rendering. Code and data are publicly available at https://github.com/seanywang0408/AudioEar.
Abstract: Wireless communication using fully passive metal reflectors is a promising technique for coverage expansion, signal enhancement, rank improvement, and blind-zone compensation, thanks to appealing features including zero energy consumption, ultra-low cost, signaling- and maintenance-free operation, easy deployment, and full compatibility with existing and future wireless systems. However, the prevalent understanding of reflection by metal plates is based on Snell's law, i.e., that a signal can only be received when the observation angle equals the incident angle, which is valid only when the electrical dimension of the metal plate is extremely large. In this paper, we rigorously derive a general reflection model that is applicable to metal reflectors of any size, any orientation, and any linear polarization. The derived model is given compactly in terms of the radar cross section (RCS) of the metal plate, as a function of its physical dimensions and orientation vectors, as well as the wave polarization and the wave deflection vector, i.e., the change of direction from the incident wave direction to the observation direction. Furthermore, experimental results based on actual field measurements are provided to validate the accuracy of the developed model and demonstrate the great potential of communications using metal reflectors.
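The paper's general RCS-based model is not reproduced here; as a familiar reference point only, the well-known specular special case for an electrically large rectangular plate of sides a and b observed at normal incidence is:

```latex
% Well-known specular special case (normal incidence, electrically large
% rectangular plate of sides a and b, wavelength \lambda); the paper's general
% model additionally covers arbitrary size, orientation, and polarization.
\sigma_{\max} = \frac{4\pi (ab)^{2}}{\lambda^{2}}
```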
Abstract: Recent years have witnessed rapid development in NeRF-based image rendering due to its high quality. However, point cloud rendering is comparatively less explored. Compared to NeRF-based rendering, which suffers from dense spatial sampling, point cloud rendering is naturally less computation-intensive, which enables its deployment on mobile computing devices. In this work, we focus on boosting the image quality of point cloud rendering with a compact model design. We first analyze the adaptation of the volume rendering formulation to point clouds. Based on this analysis, we simplify the NeRF representation to a spatial mapping function that requires only a single evaluation per pixel. Further, motivated by ray marching, we rectify the noisy raw point clouds to the estimated intersections between rays and surfaces as the queried coordinates, which avoids spatial frequency collapse and neighboring-point disturbance. Composed of rasterization, spatial mapping, and refinement stages, our method achieves state-of-the-art performance on point cloud rendering, outperforming prior works by notable margins with a smaller model size. We obtain a PSNR of 31.74 on NeRF-Synthetic, 25.88 on ScanNet, and 30.81 on DTU. Code and data will be released soon.
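As a rough illustration of single-evaluation-per-pixel spatial mapping, the sketch below evaluates one MLP at the rectified ray-surface intersection point of each pixel; the architecture is assumed for illustration and is not the authors' model.

```python
# Sketch (assumed structure, not the authors' model): a spatial mapping MLP
# evaluated once per pixel at the rectified ray-surface intersection point.
import torch
import torch.nn as nn

class SpatialMapping(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3, hidden), nn.ReLU(),   # surface point + view direction
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),    # RGB in [0, 1]
        )

    def forward(self, surface_xyz, view_dir):
        # one network evaluation per pixel, instead of dense samples along each ray
        return self.mlp(torch.cat([surface_xyz, view_dir], dim=-1))

# usage: 4096 pixels, each with a rectified surface point and a viewing direction
model = SpatialMapping()
rgb = model(torch.randn(4096, 3), torch.randn(4096, 3))
```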
Abstract: Noncontact particle manipulation (NPM) technology has significantly extended mankind's analysis capability to the micro and nano scales, which in turn has greatly promoted the development of material science and life science. Although NPM by means of electric, magnetic, and optical fields has achieved great success, from the robotic perspective it remains labor-intensive, since professional human assistance is mandatory in the early preparation stage. Therefore, developing automated noncontact trapping of moving particles is worthwhile, particularly for applications where particle samples are rare, fragile, or contact-sensitive. Taking advantage of the latest dynamic acoustic field modulation technology, and particularly by virtue of the great scalability of acoustic manipulation from the micro scale to the sub-centimeter scale, we propose automated noncontact trapping of moving micro-particles with an ultrasonic phased array system and microscopic vision in this paper. The main contribution of this work is that, for the first time to our knowledge, fully automated trapping of moving micro-particles in an acoustic NPM field is achieved through a robotic approach. In short, the particle's motion is observed and predicted by a binocular microscopic vision system, based on which the acoustic trapping zone is calculated and generated to capture and stably hold the particle. The hand-eye relationship problem of the noncontact robotic end-effector is also solved in this work. Experiments demonstrate the effectiveness of this work.
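As a toy illustration of placing a trap ahead of a moving particle, the sketch below extrapolates the particle position over the system latency under an assumed constant-velocity model; it is not the authors' tracking or control code.

```python
# Toy sketch (not the authors' tracker): predict where a moving particle will be
# after the system latency, so the acoustic trapping zone can be placed ahead of it.
# A constant-velocity motion model is assumed for illustration.
import numpy as np

def predict_trap_position(positions, timestamps, latency):
    """positions: (N, 3) recent 3D particle positions from the binocular microscope;
    timestamps: (N,) seconds; latency: vision + control + acoustics delay in seconds."""
    dt = timestamps[-1] - timestamps[0]
    velocity = (positions[-1] - positions[0]) / dt   # average velocity over the window
    return positions[-1] + velocity * latency        # extrapolated trap center

# usage with made-up measurements (millimeters and seconds)
pos = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.0, 0.0]])
t = np.array([0.00, 0.05, 0.10])
print(predict_trap_position(pos, t, latency=0.08))
```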