In this work, we build a modular-designed codebase, formulate strong training recipes, design an error diagnosis toolbox, and discuss current methods for image-based 3D object detection. In particular, different from other highly mature tasks, e.g., 2D object detection, the community of image-based 3D object detection is still evolving, where methods often adopt different training recipes and tricks resulting in unfair evaluations and comparisons. What is worse, these tricks may overwhelm their proposed designs in performance, even leading to wrong conclusions. To address this issue, we build a module-designed codebase and formulate unified training standards for the community. Furthermore, we also design an error diagnosis toolbox to measure the detailed characterization of detection models. Using these tools, we analyze current methods in-depth under varying settings and provide discussions for some open questions, e.g., discrepancies in conclusions on KITTI-3D and nuScenes datasets, which have led to different dominant methods for these datasets. We hope that this work will facilitate future research in image-based 3D object detection. Our codes will be released at \url{https://github.com/OpenGVLab/3dodi}
Fine-tuning visual models has been widely shown promising performance on many downstream visual tasks. With the surprising development of pre-trained visual foundation models, visual tuning jumped out of the standard modus operandi that fine-tunes the whole pre-trained model or just the fully connected layer. Instead, recent advances can achieve superior performance than full-tuning the whole pre-trained parameters by updating far fewer parameters, enabling edge devices and downstream applications to reuse the increasingly large foundation models deployed on the cloud. With the aim of helping researchers get the full picture and future directions of visual tuning, this survey characterizes a large and thoughtful selection of recent works, providing a systematic and comprehensive overview of existing work and models. Specifically, it provides a detailed background of visual tuning and categorizes recent visual tuning techniques into five groups: prompt tuning, adapter tuning, parameter tuning, and remapping tuning. Meanwhile, it offers some exciting research directions for prospective pre-training and various interactions in visual tuning.
A steerable parametric loudspeaker array is known for its directivity and narrow beam width. However, it often suffers from the grating lobes due to periodic array distributions. Here we propose the array configuration of hyperuniform disorder, which is short-range random while correlated at large scales, as a promising alternative distribution of acoustic antennas in phased arrays. Angle-resolved measurements reveal that the proposed array suppresses grating lobes and maintains a minimal radiation region in the vicinity of the main lobe for the primary frequency waves. These distinctive emission features benefit the secondary frequency wave in canceling the grating lobes regardless of the frequencies of the primary waves. Besides that, the hyperuniform disordered array is duplicatable, which facilitates extra-large array design without any additional computational efforts.
Table structure recognition is a crucial part of document image analysis domain. Its difficulty lies in the need to parse the physical coordinates and logical indices of each cell at the same time. However, the existing methods are difficult to achieve both these goals, especially when the table splitting lines are blurred or tilted. In this paper, we propose an accurate and end-to-end transformer-based table structure recognition method, referred to as TRUST. Transformers are suitable for table structure recognition because of their global computations, perfect memory, and parallel computation. By introducing novel Transformer-based Query-based Splitting Module and Vertex-based Merging Module, the table structure recognition problem is decoupled into two joint optimization sub-tasks: multi-oriented table row/column splitting and table grid merging. The Query-based Splitting Module learns strong context information from long dependencies via Transformer networks, accurately predicts the multi-oriented table row/column separators, and obtains the basic grids of the table accordingly. The Vertex-based Merging Module is capable of aggregating local contextual information between adjacent basic grids, providing the ability to merge basic girds that belong to the same spanning cell accurately. We conduct experiments on several popular benchmarks including PubTabNet and SynthTable, our method achieves new state-of-the-art results. In particular, TRUST runs at 10 FPS on PubTabNet, surpassing the previous methods by a large margin.
Segmenting dental plaque from images of medical reagent staining provides valuable information for diagnosis and the determination of follow-up treatment plan. However, accurate dental plaque segmentation is a challenging task that requires identifying teeth and dental plaque subjected to semantic-blur regions (i.e., confused boundaries in border regions between teeth and dental plaque) and complex variations of instance shapes, which are not fully addressed by existing methods. Therefore, we propose a semantic decomposition network (SDNet) that introduces two single-task branches to separately address the segmentation of teeth and dental plaque and designs additional constraints to learn category-specific features for each branch, thus facilitating the semantic decomposition and improving the performance of dental plaque segmentation. Specifically, SDNet learns two separate segmentation branches for teeth and dental plaque in a divide-and-conquer manner to decouple the entangled relation between them. Each branch that specifies a category tends to yield accurate segmentation. To help these two branches better focus on category-specific features, two constraint modules are further proposed: 1) contrastive constraint module (CCM) to learn discriminative feature representations by maximizing the distance between different category representations, so as to reduce the negative impact of semantic-blur regions on feature extraction; 2) structural constraint module (SCM) to provide complete structural information for dental plaque of various shapes by the supervision of an boundary-aware geometric constraint. Besides, we construct a large-scale open-source Stained Dental Plaque Segmentation dataset (SDPSeg), which provides high-quality annotations for teeth and dental plaque. Experimental results on SDPSeg datasets show SDNet achieves state-of-the-art performance.
Fine-grained object retrieval aims to learn discriminative representation to retrieve visually similar objects. However, existing top-performing works usually impose pairwise similarities on the semantic embedding spaces to continually fine-tune the entire model in limited-data regimes, thus resulting in easily converging to suboptimal solutions. In this paper, we develop Fine-grained Retrieval Prompt Tuning (FRPT), which steers a frozen pre-trained model to perform the fine-grained retrieval task from the perspectives of sample prompt and feature adaptation. Specifically, FRPT only needs to learn fewer parameters in the prompt and adaptation instead of fine-tuning the entire model, thus solving the convergence to suboptimal solutions caused by fine-tuning the entire model. Technically, as sample prompts, a structure perturbation prompt (SPP) is introduced to zoom and even exaggerate some pixels contributing to category prediction via a content-aware inhomogeneous sampling operation. In this way, SPP can make the fine-grained retrieval task aided by the perturbation prompts close to the solved task during the original pre-training. Besides, a category-specific awareness head is proposed and regarded as feature adaptation, which removes the species discrepancies in the features extracted by the pre-trained model using instance normalization, and thus makes the optimized features only include the discrepancies among subcategories. Extensive experiments demonstrate that our FRPT with fewer learnable parameters achieves the state-of-the-art performance on three widely-used fine-grained datasets.
Multi-behavior recommendation exploits multiple types of user-item interactions to alleviate the data sparsity problem faced by the traditional models that often utilize only one type of interaction for recommendation. In real scenarios, users often take a sequence of actions to interact with an item, in order to get more information about the item and thus accurately evaluate whether an item fits personal preference. Those interaction behaviors often obey a certain order, and different behaviors reveal different information or aspects of user preferences towards the target item. Most existing multi-behavior recommendation methods take the strategy to first extract information from different behaviors separately and then fuse them for final prediction. However, they have not exploited the connections between different behaviors to learn user preferences. Besides, they often introduce complex model structures and more parameters to model multiple behaviors, largely increasing the space and time complexity. In this work, we propose a lightweight multi-behavior recommendation model named Cascading Residual Graph Convolutional Network (CRGCN for short), which can explicitly exploit the connections between different behaviors into the embedding learning process without introducing any additional parameters. In particular, we design a cascading residual graph convolutional network structure, which enables our model to learn user preferences by continuously refining user embeddings across different types of behaviors. The multi-task learning method is adopted to jointly optimize our model based on different behaviors. Extensive experimental results on two real-world benchmark datasets show that CRGCN can substantially outperform state-of-the-art methods. Further studies also analyze the effects of leveraging multi-behaviors in different numbers and orders on the final performance.
3D object detection is a fundamental and challenging task for 3D scene understanding, and the monocular-based methods can serve as an economical alternative to the stereo-based or LiDAR-based methods. However, accurately detecting objects in the 3D space from a single image is extremely difficult due to the lack of spatial cues. To mitigate this issue, we propose a simple and effective scheme to introduce the spatial information from LiDAR signals to the monocular 3D detectors, without introducing any extra cost in the inference phase. In particular, we first project the LiDAR signals into the image plane and align them with the RGB images. After that, we use the resulting data to train a 3D detector (LiDAR Net) with the same architecture as the baseline model. Finally, this LiDAR Net can serve as the teacher to transfer the learned knowledge to the baseline model. Experimental results show that the proposed method can significantly boost the performance of the baseline model and ranks the $1^{st}$ place among all monocular-based methods on the KITTI benchmark. Besides, extensive ablation studies are conducted, which further prove the effectiveness of each part of our designs and illustrate what the baseline model has learned from the LiDAR Net. Our code will be released at \url{https://github.com/monster-ghost/MonoDistill}.
With the development of underwater object grabbing technology, underwater object recognition and segmentation of high accuracy has become a challenge. The existing underwater object detection technology can only give the general position of an object, unable to give more detailed information such as the outline of the object, which seriously affects the grabbing efficiency. To address this problem, we label and establish the first underwater semantic segmentation dataset of real scene(DUT-USEG:DUT Underwater Segmentation Dataset). The DUT- USEG dataset includes 6617 images, 1487 of which have semantic segmentation and instance segmentation annotations, and the remaining 5130 images have object detection box annotations. Based on this dataset, we propose a semi-supervised underwater semantic segmentation network focusing on the boundaries(US-Net: Underwater Segmentation Network). By designing a pseudo label generator and a boundary detection subnetwork, this network realizes the fine learning of boundaries between underwater objects and background, and improves the segmentation effect of boundary areas. Experiments show that the proposed method improves by 6.7% in three categories of holothurian, echinus, starfish in DUT-USEG dataset, and achieves state-of-the-art results. The DUT- USEG dataset will be released at https://github.com/baxiyi/DUT-USEG.
Underwater object detection for robot picking has attracted a lot of interest. However, it is still an unsolved problem due to several challenges. We take steps towards making it more realistic by addressing the following challenges. Firstly, the currently available datasets basically lack the test set annotations, causing researchers must compare their method with other SOTAs on a self-divided test set (from the training set). Training other methods lead to an increase in workload and different researchers divide different datasets, resulting there is no unified benchmark to compare the performance of different algorithms. Secondly, these datasets also have other shortcomings, e.g., too many similar images or incomplete labels. Towards these challenges we introduce a dataset, Detecting Underwater Objects (DUO), and a corresponding benchmark, based on the collection and re-annotation of all relevant datasets. DUO contains a collection of diverse underwater images with more rational annotations. The corresponding benchmark provides indicators of both efficiency and accuracy of SOTAs (under the MMDtection framework) for academic research and industrial applications, where JETSON AGX XAVIER is used to assess detector speed to simulate the robot-embedded environment.