Geodesic models are known as an efficient tool for solving various image segmentation problems. Most existing approaches exploit only local pointwise image features to track geodesic paths that delineate the target boundaries. However, such a segmentation strategy cannot take into account the connectivity of the image edge features, which increases the risk of the shortcut problem, especially in complicated scenarios. In this work, we introduce a new image segmentation model based on the minimal geodesic framework in conjunction with an adaptive cut-based circular optimal path computation scheme and a graph-based boundary proposals grouping scheme. Specifically, the adaptive cut disconnects the image domain such that the target contours are constrained to pass through this cut only once. The boundary proposals consist of precomputed image edge segments, providing the connectivity information for our segmentation model. These boundary proposals are then incorporated into the proposed image segmentation model, such that the target segmentation contours are made up of a set of selected boundary proposals and the corresponding geodesic paths linking them. Experimental results show that the proposed model indeed outperforms state-of-the-art minimal paths-based image segmentation approaches.
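To make the idea of tracking a minimal path through pointwise image features concrete, the following is a minimal sketch, not the model described above. It assumes a discrete 4-connected grid and uses Dijkstra's algorithm as a simple stand-in for a fast-marching Eikonal solver; the function name `minimal_path` and the toy cost grid are hypothetical.

```python
import heapq

def minimal_path(cost, start, goal):
    """Dijkstra on a 4-connected grid; a step costs the mean of the two cell costs."""
    rows, cols = len(cost), len(cost[0])
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            break
        if d > dist[(r, c)]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + 0.5 * (cost[r][c] + cost[nr][nc])
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Backtrack from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]

# Hypothetical cost grid: low values mark strong edge evidence.
cost = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 9.0, 9.0, 0.1],
    [0.1, 9.0, 9.0, 0.1],
]
path, total = minimal_path(cost, (2, 0), (2, 3))
print(path, round(total, 3))
```

In this toy grid the straight route along the bottom row is shorter in Euclidean terms but crosses high-cost cells, so the minimal path follows the low-cost cells around the top. When edge evidence is weak or cluttered, a purely pointwise cost can instead favour the short route, which is the shortcut risk the abstract refers to.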
In Video Object Detection (VID), a common practice is to leverage the rich temporal contexts from the video to enhance the object representations in each frame. Existing methods treat the temporal contexts obtained from different objects indiscriminately and ignore their different identities. Intuitively, however, aggregating local views of the same object in different frames may facilitate a better understanding of the object. Thus, in this paper, we aim to enable the model to focus on the identity-consistent temporal contexts of each object to obtain more comprehensive object representations and handle rapid object appearance variations such as occlusion and motion blur. However, realizing this goal on top of existing VID models is inefficient because of their redundant region proposals and non-parallel frame-wise prediction. To address this, we propose ClipVID, a VID model equipped with Identity-Consistent Aggregation (ICA) layers specifically designed for mining fine-grained and identity-consistent temporal contexts. It effectively reduces the redundancies through the set prediction strategy, making the ICA layers very efficient and further allowing us to design an architecture that makes parallel clip-wise predictions for the whole video clip. Extensive experimental results demonstrate the superiority of our method: a state-of-the-art (SOTA) performance (84.7% mAP) on the ImageNet VID dataset while running at a speed about 7x faster (39.3 fps) than previous SOTAs.
In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain. A critical problem for them is how to effectively capture the rich semantics inside the video using the image encoder of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal modeling techniques to fuse the text information into video frame representations. This, however, incurs severe efficiency issues in large-scale retrieval systems, as the video representations must be recomputed online for every text query. In this paper, we discard this problematic cross-modal fusion process and aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts. Concretely, we first introduce a spatial-temporal "Prompt Cube" into the CLIP image encoder and iteratively switch it within the encoder layers to efficiently incorporate the global video semantics into frame representations. We then propose to apply an auxiliary video captioning objective to train the frame representations, which facilitates the learning of detailed video semantics by providing fine-grained guidance in the semantic space. With a naive temporal fusion strategy (i.e., mean-pooling) on the enhanced frame representations, we obtain state-of-the-art performance on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
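The efficiency argument above rests on video representations that do not depend on the query. The sketch below illustrates that retrieval pattern under simplifying assumptions: frame and text embeddings are plain Python lists, similarity is cosine similarity, and all names and vectors (`mean_pool`, `rank_videos`, the two toy videos) are hypothetical. It does not model the Prompt Cube or the captioning objective.

```python
import math

def mean_pool(frames):
    """Average per-frame embeddings into one video embedding."""
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_videos(text_vec, video_index):
    """Sort video names by similarity to the text embedding, best first."""
    return sorted(video_index,
                  key=lambda name: cosine(text_vec, video_index[name]),
                  reverse=True)

# Offline step: video embeddings are computed once, independent of any query.
videos = {
    "cooking": [[1.0, 0.0], [0.8, 0.2]],
    "soccer": [[0.0, 1.0], [0.2, 0.8]],
}
index = {name: mean_pool(frames) for name, frames in videos.items()}

# Online step: each new text query only needs similarity computations.
print(rank_videos([1.0, 0.1], index))
```

Because `index` is built before any query arrives, serving a new text costs one similarity per video, whereas a text-conditioned fusion would require re-encoding every video for every query.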
Practical object detection applications can lose their effectiveness on image inputs with natural distribution shifts. This problem has led the research community to pay more attention to the robustness of detectors under Out-Of-Distribution (OOD) inputs. Existing works construct datasets to benchmark a detector's OOD robustness for a specific application scenario, e.g., autonomous driving. However, these datasets lack universality and are ill-suited to benchmarking general detectors built on common tasks such as COCO. To give a more comprehensive robustness assessment, we introduce COCO-O(ut-of-distribution), a test dataset based on COCO with 6 types of natural distribution shifts. COCO-O has a large distribution gap from the training data and results in a significant 55.7% relative performance drop on a Faster R-CNN detector. We leverage COCO-O to conduct experiments on more than 100 modern object detectors to investigate whether their improvements are credible or merely over-fit to the COCO test set. Unfortunately, most classic detectors from the early years do not exhibit strong OOD generalization. We further study the robustness effects of recent breakthroughs in detector architecture design, augmentation, and pre-training techniques. Our empirical findings include: 1) compared with the detection head or neck, the backbone is the most important part for robustness; 2) an end-to-end detection transformer design brings no enhancement, and may even reduce robustness; 3) large-scale foundation models have made a great leap in robust object detection. We hope COCO-O can provide a rich testbed for robustness studies of object detection. The dataset will be available at https://github.com/alibaba/easyrobust/tree/main/benchmarks/coco_o.
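For readers unfamiliar with the term, a relative performance drop is usually computed as the lost score divided by the in-distribution score. The snippet below is a small illustration of that arithmetic, assuming this common definition; the function name and the input scores are hypothetical and are not results from the abstract.

```python
def relative_drop(in_dist_score, ood_score):
    """Fraction of the in-distribution score that is lost on the shifted test set."""
    if in_dist_score <= 0:
        raise ValueError("in-distribution score must be positive")
    return (in_dist_score - ood_score) / in_dist_score

# Hypothetical scores, chosen only to make the arithmetic easy to follow.
drop = relative_drop(40.0, 30.0)
print(f"{drop:.1%}")  # prints 25.0%
```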
The local geometrical randomness of metal foams complicates the performance prediction of porous structures. Although the relative density is commonly regarded as the key factor, the stochasticity of internal cell sizes and shapes has an apparent effect on the behaviour of porous structures, yet measuring it is challenging. To address this issue, we aim to develop an assessment strategy for efficiently examining foam properties by combining multiscale modelling and deep learning. The multiscale modelling is based on finite element (FE) simulation employing representative volume elements (RVEs) with random cellular morphologies, mimicking the typical features of closed-cell aluminium foams. A deep learning database is constructed for training the designed convolutional neural networks (CNNs) to establish a direct link between the mesoscopic porosity characteristics and the effective Young's modulus of foams. The error range of the CNN models leads to an uncertain mechanical performance, which is further evaluated in a structural uncertainty analysis on an FG porous three-layer beam consisting of two thin high-density layers and a thick low-density one, where the imprecise CNN-predicted moduli are represented as triangular fuzzy numbers in double parametric form. The uncertain beam bending deflections under a mid-span point load are calculated with the aid of Timoshenko beam theory and the Ritz method. Our findings show that CNN models can be trained to estimate the RVE modulus from images with an average error of 5.92%. The proposed method significantly simplifies the evaluation of FG porous structures and links it to the mesoscopic cellular morphologies without requiring a mechanics model for local foams.
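As an illustration of how an imprecise modulus can be encoded, the sketch below evaluates a triangular fuzzy number in double parametric form. It assumes the commonly used definition: the alpha-cut of a triangular number (a, b, c) is the interval [a + alpha (b - a), c - alpha (c - b)], and a second parameter beta in [0, 1] sweeps across that interval. The function names and the numerical values are hypothetical and are not taken from the study.

```python
def alpha_cut(tfn, alpha):
    """Interval of a triangular fuzzy number (a, b, c) at membership level alpha."""
    a, b, c = tfn
    return a + alpha * (b - a), c - alpha * (c - b)

def double_parametric(tfn, alpha, beta):
    """Crisp value selected by beta in [0, 1] inside the alpha-cut interval."""
    lower, upper = alpha_cut(tfn, alpha)
    return lower + beta * (upper - lower)

# Hypothetical modulus: peak value 10.0 with a spread of 1.0 on each side.
modulus = (9.0, 10.0, 11.0)
print(alpha_cut(modulus, 0.5))
print(double_parametric(modulus, 0.0, 0.0), double_parametric(modulus, 0.0, 1.0))
```

At alpha = 1 the interval collapses to the peak value, while alpha = 0 with beta = 0 or beta = 1 recovers the two extreme bounds, so sweeping both parameters covers the whole uncertain range with ordinary crisp computations.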
The minimal geodesic models based on the Eikonal equations are capable of finding suitable solutions in various image segmentation scenarios. Existing geodesic-based segmentation approaches usually exploit image features in conjunction with geometric regularization terms, such as Euclidean curve length or curvature-penalized length, for computing geodesic curves. In this paper, we consider a more complicated problem: finding curvature-penalized geodesic paths with a convexity shape prior. We establish new geodesic models relying on the strategy of orientation-lifting, by which a planar curve can be mapped to a high-dimensional orientation-dependent space. The convexity shape prior serves as a constraint for the construction of local geodesic metrics encoding a particular curvature constraint. The geodesic distances and the corresponding closed geodesic paths in the orientation-lifted space can then be efficiently computed through the state-of-the-art Hamiltonian fast marching method. In addition, we apply the proposed geodesic models to active contours, leading to efficient interactive image segmentation algorithms that preserve the advantages of the convexity shape prior and curvature penalization.
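The orientation-lifting idea can be illustrated on a discrete curve. The sketch below is only an illustration, not the geodesic model itself: it lifts a closed polygon to triples (x, y, theta), where theta is the direction of the outgoing edge, and then checks convexity as a sign constraint on the change in theta. It assumes a simple (non-self-intersecting) closed polygon, and the function names are hypothetical.

```python
import math

def lift(points):
    """Map a closed polygon to (x, y, theta), theta = direction of the outgoing edge."""
    n = len(points)
    lifted = []
    for i in range(n):
        (x0, y0), (x1, y1) = points[i], points[(i + 1) % n]
        lifted.append((x0, y0, math.atan2(y1 - y0, x1 - x0)))
    return lifted

def turning_angles(lifted):
    """Change in theta between consecutive lifted points, wrapped to [-pi, pi)."""
    n = len(lifted)
    angles = []
    for i in range(n):
        d = lifted[(i + 1) % n][2] - lifted[i][2]
        angles.append((d + math.pi) % (2 * math.pi) - math.pi)
    return angles

def is_convex(points):
    """For a simple closed polygon: convex iff the turning never changes sign."""
    t = turning_angles(lift(points))
    return all(d >= 0 for d in t) or all(d <= 0 for d in t)

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
notched = [(0, 0), (2, 0), (2, 2), (1, 1), (0, 2)]
print(is_convex(square), is_convex(notched))
```

In the lifted representation, convexity becomes a condition on how theta evolves along the curve, and the turning angle is a discrete analogue of curvature. This suggests why a convexity prior can be expressed as a curvature constraint in the orientation-lifted space.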
Minimal paths are considered a powerful and efficient tool for boundary detection and image segmentation due to their global optimality and well-established numerical solvers such as the fast marching algorithm. In this paper, we introduce a flexible interactive image segmentation model based on the minimal geodesic framework in conjunction with region-based homogeneity enhancement. A key ingredient in our model is the construction of Finsler geodesic metrics, which are capable of integrating anisotropic and asymmetric edge features, region-based homogeneity and/or curvature regularization. This is done by exploiting an implicit method to incorporate the region-based homogeneity information into the metrics used. Moreover, we introduce a way to build target simple closed contours, each of which is treated as the concatenation of two disjoint open paths. Experimental results show that the proposed model indeed outperforms state-of-the-art minimal paths-based image segmentation approaches.
Tubular structure tracking is an important and difficult problem in the fields of computer vision and medical image analysis. Minimal path models have shown their power in tracing tubular structures, as a centerline can be naturally treated as a minimal path with a suitable geodesic metric. However, existing minimal path-based tubular structure tracing models still suffer from difficulties such as the shortcut and short branches combination problems, especially when dealing with images with a complicated background. We introduce a new minimal path-based model for minimally interactive tubular structure centerline extraction in conjunction with a perceptual grouping scheme. We take into account the prescribed tubular trajectories and the relevant curvature-penalized geodesic distances for minimal path extraction in a graph-based optimization framework. Experimental results on both synthetic and real images show that the proposed model indeed outperforms state-of-the-art minimal path-based tubular structure tracing algorithms.
Knowledge Distillation (KD) has made remarkable progress in the last few years and has become a popular paradigm for model compression and knowledge transfer. However, almost all existing KD algorithms are data-driven, i.e., they rely on a large amount of original training data or alternative data, which is usually unavailable in real-world scenarios. In this paper, we address this challenging problem and propose a novel adversarial distillation mechanism to craft a compact student model without any real-world data. We introduce a model discrepancy to quantitatively measure the difference between student and teacher models and construct an optimizable upper bound. In our work, the student and the teacher jointly play the role of the discriminator to reduce this discrepancy, while a generator adversarially produces "hard samples" to enlarge it. Extensive experiments demonstrate that the proposed data-free method yields performance comparable to existing data-driven methods. More strikingly, our approach can be directly extended to semantic segmentation, which is more complicated than classification, where it achieves state-of-the-art results. Code and pretrained models are available at https://github.com/VainF/Data-Free-Adversarial-Distillation.
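The min-max game described above can be sketched with a deliberately tiny toy problem. This is an illustration under strong assumptions, not the released implementation: the teacher is the scalar function 2x, the student is w * x with a single weight, the generator is reduced to one sample g clipped to [-1, 1], and the discrepancy is assumed to be the mean absolute difference between teacher and student outputs. All names are hypothetical.

```python
def discrepancy(teacher, student, samples):
    """Mean absolute difference between teacher and student outputs."""
    return sum(abs(teacher(x) - student(x)) for x in samples) / len(samples)

def sign(v):
    return (v > 0) - (v < 0)

teacher = lambda x: 2.0 * x
w = 0.0    # student weight: student(x) = w * x
g = 0.1    # the generator's single sample, kept in [-1, 1]
lr = 0.1

for _ in range(200):
    # Generator step: gradient ascent on |2x - wx| with respect to the sample x.
    g = max(-1.0, min(1.0, g + lr * abs(2.0 - w) * sign(g)))
    # Student step: gradient descent on the same quantity with respect to w.
    w += lr * sign(2.0 - w) * abs(g)

final = discrepancy(teacher, lambda x: w * x, [g])
print(round(w, 2), g, round(final, 2))
```

The generator pushes its sample toward the region where teacher and student disagree most, and the student then closes the gap on exactly that sample. No real training data appears anywhere in the loop, which is the data-free setting the abstract describes.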