Coordinate-based neural implicit representations, or implicit fields, have been widely studied for 3D geometry representation and novel view synthesis. Recently, a series of efforts have been devoted to accelerating and improving the quality of coordinate-based implicit field learning. Instead of learning heavy MLPs to predict neural implicit values for query coordinates, neural voxels or grids combined with shallow MLPs have been proposed to achieve high-quality implicit field learning with reduced optimization time. On the other hand, lightweight field representations such as the linear grid have been proposed to further improve learning speed. In this paper, we aim for both fast and high-quality implicit field learning, and propose TaylorGrid, a novel implicit field representation which can be efficiently computed via direct Taylor expansion optimization on 2D or 3D grids. As a general representation, TaylorGrid can be adapted to different implicit field learning tasks such as SDF learning or NeRF. Extensive quantitative and qualitative comparisons show that TaylorGrid strikes a balance between the linear grid and neural voxels, demonstrating its superiority in fast and high-quality implicit field learning.
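To make the idea of a Taylor-expansion grid concrete, below is a minimal sketch of storing per-cell Taylor coefficients (constant, gradient, and Hessian) and evaluating a second-order expansion at a query point. The class name, nearest-cell lookup, resolution, and coefficient layout are illustrative assumptions, not the paper's exact formulation or optimization scheme.

```python
import numpy as np

# Hypothetical second-order Taylor grid in 3D: each cell stores a constant value,
# a gradient (3,), and a Hessian (3, 3) of the implicit field around its center.
class TaylorGrid3D:
    def __init__(self, resolution, bound=1.0):
        self.res = resolution
        self.bound = bound
        self.c = np.zeros((resolution,) * 3)           # constant term per cell
        self.g = np.zeros((resolution,) * 3 + (3,))    # gradient term per cell
        self.H = np.zeros((resolution,) * 3 + (3, 3))  # Hessian term per cell

    def query(self, x):
        """Evaluate the field at a point x in [-bound, bound]^3."""
        # Locate the containing cell and its center.
        u = (x + self.bound) / (2 * self.bound) * self.res
        idx = np.clip(u.astype(int), 0, self.res - 1)
        center = (idx + 0.5) / self.res * 2 * self.bound - self.bound
        d = x - center                                  # local offset from the center
        i, j, k = idx
        # Second-order Taylor expansion around the cell center.
        return (self.c[i, j, k]
                + self.g[i, j, k] @ d
                + 0.5 * d @ self.H[i, j, k] @ d)

grid = TaylorGrid3D(resolution=64)
print(grid.query(np.array([0.1, -0.2, 0.3])))
```

In practice the coefficients would be optimized directly against the target field (e.g., SDF or radiance values), with a shallow decoder if needed; the sketch only shows the query-time evaluation.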
Two important problems in the field of Topological Data Analysis are defining practical multifiltrations on objects and showing the ability of TDA to detect geometry. Motivated by these problems, we construct three multifiltrations named multi-GENEO, multi-DGENEO and mix-GENEO, and prove the stability of both the interleaving distance and the multiparameter persistence landscape of multi-GENEO with respect to the pseudometric of the subspace of bounded functions. We also give upper bound estimates for multi-DGENEO and mix-GENEO. Finally, we provide experimental results on the MNIST dataset to demonstrate that our bifiltrations are able to detect geometric and topological differences between digital images.
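Stability statements of this kind are typically Lipschitz bounds with respect to the sup-norm. As a hedged illustration of the general form such a result takes (not the paper's exact theorem), for a filtration-valued construction F applied to bounded functions f and g:

```latex
% Generic stability bound: the interleaving distance between the persistence
% modules induced by two bounded filtering functions is controlled by their
% sup-norm distance (illustrative form only; F is a placeholder construction).
d_I\bigl(H_*(F(f)),\, H_*(F(g))\bigr) \;\le\; \|f - g\|_{\infty}
```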
Inferring missing regions from severely occluded point clouds is highly challenging. Especially for 3D shapes with rich geometric and structural details, inherent ambiguities of the unknown parts exist. Existing approaches either learn a one-to-one mapping in a supervised manner or train a generative model to synthesize the missing points for the completion of 3D point cloud shapes. These methods, however, lack controllability over the completion process, and the results are either deterministic or exhibit uncontrolled diversity. Inspired by prompt-driven data generation and editing, we propose a novel prompt-guided point cloud completion framework, coined P2M2-Net, to enable more controllable and more diverse shape completion. Given an input partial point cloud and a text prompt describing part-aware information such as the semantics and structure of the missing region, our Transformer-based completion network can efficiently fuse the multimodal features and generate diverse results following the prompt guidance. We train P2M2-Net on a new large-scale PartNet-Prompt dataset and conduct extensive experiments on two challenging shape completion benchmarks. Quantitative and qualitative results show the efficacy of incorporating prompts for more controllable part-aware point cloud completion and generation. Code and data are available at https://github.com/JLU-ICL/P2M2-Net.
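A common way to fuse prompt features with shape features in a Transformer is cross-attention from point tokens to text tokens. The following is a minimal sketch of such a fusion block; the module name, dimensions, and layer arrangement are assumptions for illustration, not the actual P2M2-Net architecture.

```python
import torch
import torch.nn as nn

# Hypothetical prompt-guided fusion block: point-cloud tokens attend to
# text-prompt tokens via cross-attention, then pass through a feed-forward layer.
class PromptFusionBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, point_tokens, text_tokens):
        # point_tokens: (B, N, dim) partial-shape features
        # text_tokens:  (B, T, dim) prompt features (e.g., from a text encoder)
        fused, _ = self.cross_attn(point_tokens, text_tokens, text_tokens)
        x = self.norm(point_tokens + fused)
        return self.norm(x + self.ffn(x))

block = PromptFusionBlock()
pts = torch.randn(2, 128, 256)   # 128 point tokens
txt = torch.randn(2, 16, 256)    # 16 prompt tokens
print(block(pts, txt).shape)     # torch.Size([2, 128, 256])
```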
Few-shot class-incremental learning (FSCIL) struggles to incrementally recognize novel classes from only a few examples without catastrophic forgetting of old classes or overfitting to new classes. We propose TLCE, which ensembles multiple pre-trained models to improve the separation of novel and old classes. TLCE minimizes interference between old and new classes by mapping old-class images to quasi-orthogonal prototypes using episodic training. It then ensembles diverse pre-trained models to better adapt to novel classes despite data imbalance. Extensive experiments on various datasets demonstrate that our transfer-learning ensemble approach outperforms state-of-the-art FSCIL methods.
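The inference side of such an ensemble can be summarized as averaging per-model similarities between a query embedding and class prototypes. The sketch below assumes cosine similarity and a simple sum of scores; the function name, two-model setup, and the omission of the episodic prototype-mapping stage are illustrative simplifications, not the exact TLCE procedure.

```python
import numpy as np

# Hedged sketch of the ensemble classification step: each pre-trained model
# contributes cosine similarities between the query embedding and class
# prototypes; the ensemble averages (here, sums) the per-model scores.
def ensemble_nn_classify(query_feats, prototypes_per_model):
    """query_feats: list of (D_m,) embeddings, one per model.
    prototypes_per_model: list of (C, D_m) prototype matrices, one per model."""
    scores = np.zeros(prototypes_per_model[0].shape[0])
    for q, protos in zip(query_feats, prototypes_per_model):
        q = q / np.linalg.norm(q)
        p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
        scores += p @ q                     # cosine similarity to each class
    return int(np.argmax(scores))

# Toy usage with two "models" and 5 classes.
rng = np.random.default_rng(0)
feats = [rng.normal(size=64), rng.normal(size=128)]
protos = [rng.normal(size=(5, 64)), rng.normal(size=(5, 128))]
print(ensemble_nn_classify(feats, protos))
```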
Nearest-Neighbor (NN) classification has been proven to be a simple and effective approach for few-shot learning. The query data can be classified efficiently by finding the nearest support class based on features extracted by pretrained deep models. However, NN-based methods are sensitive to the data distribution and may produce false predictions if the samples in the support set happen to lie around the distribution boundary of different classes. To solve this issue, we present P3DC-Shot, an improved nearest-neighbor based few-shot classification method empowered by prior-driven data calibration. Inspired by the distribution calibration technique, which utilizes the distribution or statistics of the base classes to calibrate the data for few-shot tasks, we propose a novel discrete data calibration operation that is more suitable for NN-based few-shot classification. Specifically, we treat the prototypes representing each base class as priors and calibrate each support sample based on its similarity to different base prototypes. Then, we perform NN classification using these discretely calibrated support data. Results from extensive experiments on various datasets show that our efficient non-learning based method can outperform, or at least be comparable to, SOTA methods which need additional learning steps.
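The core calibration idea, shifting each support feature toward a similarity-weighted combination of base prototypes before NN classification, can be sketched as follows. The mixing coefficient alpha, the top-k prior selection, and the softmax weighting are assumptions for illustration; the exact (discrete) calibration rule in P3DC-Shot may differ.

```python
import numpy as np

# Hedged sketch of prior-driven calibration: each support feature is pulled
# toward a weighted combination of its most similar base-class prototypes.
def calibrate_support(support, base_prototypes, alpha=0.5, top_k=3):
    support = support / np.linalg.norm(support, axis=1, keepdims=True)
    base = base_prototypes / np.linalg.norm(base_prototypes, axis=1, keepdims=True)
    sims = support @ base.T                        # (S, B) cosine similarities
    calibrated = np.empty_like(support)
    for i, (x, s) in enumerate(zip(support, sims)):
        idx = np.argsort(s)[-top_k:]               # keep the top-k base priors
        w = np.exp(s[idx]) / np.exp(s[idx]).sum()  # softmax weights over priors
        prior = w @ base[idx]                      # weighted prior direction
        calibrated[i] = (1 - alpha) * x + alpha * prior
    return calibrated / np.linalg.norm(calibrated, axis=1, keepdims=True)

rng = np.random.default_rng(1)
support = rng.normal(size=(5, 64))   # 5-way 1-shot support embeddings
base = rng.normal(size=(20, 64))     # 20 base-class prototypes as priors
print(calibrate_support(support, base).shape)   # (5, 64)
```

NN classification then proceeds as usual, but against the calibrated support features instead of the raw ones.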
Video Super-Resolution (VSR) aims to recover sequences of high-resolution (HR) frames from low-resolution (LR) frames. Previous methods mainly utilize temporally adjacent frames to assist the reconstruction of target frames. However, in real-world videos with fast scene switching, adjacent frames contain a large amount of irrelevant information, and these VSR methods cannot adaptively distinguish and select the useful information. In contrast, leveraging a transformer structure suitable for temporal tasks, we devise a novel adaptive-scenario video super-resolution method. Specifically, we use optical flow to label the patches in each video frame and compute attention only among patches with the same label. We then select the most relevant label to supplement the spatio-temporal information of the target frame. This design ensures that the supplementary information comes from the same scene as much as possible. We further propose a cross-scale feature aggregation module to better handle the scale-variation problem. Compared with other video super-resolution methods, our method not only achieves significant performance gains on single-scene videos but also exhibits better robustness on cross-scene datasets.
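Restricting attention to patches that share a label amounts to masking the attention matrix. Below is a minimal sketch of such label-masked attention; the per-patch labels are assumed to be given (e.g., from clustering optical flow), and the function name and scaling are illustrative, not the paper's exact attention design or label-selection step.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of label-restricted attention: attention scores between patches
# with different scene labels are masked out before the softmax.
def label_masked_attention(q, k, v, labels):
    """q, k, v: (B, N, D) patch features; labels: (B, N) integer scene labels."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (B, N, N) raw scores
    same = labels.unsqueeze(2) == labels.unsqueeze(1)      # (B, N, N) same-label mask
    scores = scores.masked_fill(~same, float("-inf"))      # block cross-scene attention
    return F.softmax(scores, dim=-1) @ v

B, N, D = 1, 6, 32
q, k, v = (torch.randn(B, N, D) for _ in range(3))
labels = torch.tensor([[0, 0, 1, 1, 1, 2]])                # flow-derived patch labels
print(label_masked_attention(q, k, v, labels).shape)       # torch.Size([1, 6, 32])
```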
Active learning maximizes hypothesis updates to find the most desirable unlabeled data. An inherent assumption is that this learning manner can drive those updates toward the optimal hypothesis. However, convergence may not be well guaranteed if the incremental updates are negative and disordered. In this paper, we introduce a machine teacher who provides a black-box teaching hypothesis for an active learner, where the teaching hypothesis is an effective approximation of the optimal hypothesis. Theoretically, we prove that, under the guidance of this teaching hypothesis, the learner converges to tighter generalization error and label complexity bounds than non-educated learners who receive no guidance from a teacher. We further consider two teaching scenarios, teaching a white-box learner and teaching a black-box learner, where self-improvement of teaching is proposed for the first time to improve teaching performance. Experiments verify this idea and show better performance than fundamental active learning strategies such as IWAL and IWAL-D.
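In a stream-based setting, teacher guidance can be illustrated by tying the label-query probability to the disagreement between the learner and the teaching hypothesis, with the usual importance weight equal to the inverse of that probability. The rule below is a hedged illustration of this idea only; it is neither the paper's algorithm nor IWAL itself, and the names and the floor probability p_min are assumptions.

```python
import numpy as np

# Hedged sketch of teacher-guided query selection: higher disagreement between
# the learner's and the teacher's predicted probabilities -> higher chance of
# requesting a label; queried points receive importance weight 1 / probability.
def query_probability(learner_score, teacher_score, p_min=0.1):
    """Scores are class-1 probabilities in [0, 1]."""
    disagreement = abs(learner_score - teacher_score)
    return max(p_min, min(1.0, disagreement))

rng = np.random.default_rng(2)
for learner_p, teacher_p in [(0.9, 0.85), (0.7, 0.2), (0.5, 0.5)]:
    p = query_probability(learner_p, teacher_p)
    queried = rng.random() < p
    print(f"query prob={p:.2f}, queried={queried}, importance weight={1.0 / p:.2f}")
```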
Class activation maps (CAMs) have been widely studied for visual explanation of the internal working mechanisms of convolutional neural networks. The key to existing CAM-based methods is computing effective weights to combine the activation maps in the target convolution layer. Existing gradient- and score-based weighting schemes have shown superiority in ensuring either the discriminability or the faithfulness of the CAM, but they normally cannot excel at both. In this paper, we propose a novel CAM weighting scheme, named FD-CAM, to improve both the faithfulness and the discriminability of CAM-based CNN visual explanations. First, we improve the faithfulness and discriminability of the score-based weights by performing a grouped channel switching operation. Specifically, for each channel, we compute its similarity group and switch the group of channels on or off simultaneously, using the resulting changes in the class prediction score as the weights. Then, we combine the improved score-based weights with conventional gradient-based weights so that the discriminability of the final CAM is further improved. We perform extensive comparisons with state-of-the-art CAM algorithms. The quantitative and qualitative results show that FD-CAM produces more faithful and more discriminative visual explanations of CNNs. We also conduct experiments to verify the effectiveness of the proposed grouped channel switching and weight combination scheme in improving the results. Our code is available at https://github.com/crishhh1998/FD-CAM.
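The grouped channel switching step can be sketched as follows: channels whose activation maps are similar to the current channel are zeroed together, and the drop in the target-class score serves as that channel's score-based weight. The similarity threshold, the toy classifier head, and the use of a simple score drop are illustrative assumptions; see the released code for the exact FD-CAM weighting and its combination with gradient-based weights.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of grouped channel switching for score-based CAM weights.
@torch.no_grad()
def grouped_switch_weights(model_head, activations, target_class, sim_thresh=0.8):
    """activations: (C, H, W) feature maps of the target conv layer.
    model_head: callable mapping (1, C, H, W) activations to class logits."""
    C = activations.shape[0]
    flat = F.normalize(activations.reshape(C, -1), dim=1)
    sim = flat @ flat.T                                      # (C, C) channel similarity
    base = model_head(activations.unsqueeze(0))[0, target_class]
    weights = torch.zeros(C)
    for c in range(C):
        group = sim[c] >= sim_thresh                         # channels switched together
        off = activations.clone()
        off[group] = 0                                       # switch the whole group off
        score_off = model_head(off.unsqueeze(0))[0, target_class]
        weights[c] = base - score_off                        # score drop as the weight
    return weights

# Toy usage with a random pooled classifier head (hypothetical stand-in).
head = torch.nn.Sequential(torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
                           torch.nn.Linear(16, 10))
acts = torch.relu(torch.randn(16, 7, 7))
print(grouped_switch_weights(head, acts, target_class=3).shape)   # torch.Size([16])
```

The final CAM would then combine these score-based weights with gradient-based weights before the weighted sum over activation maps.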
Image warping is a necessary step in many multimedia applications such as texture mapping, image-based rendering, panorama stitching, image resizing, and optical flow computation. Traditionally, color image warping interpolation is performed in each color channel independently. In this paper, we show that warping quality can be significantly enhanced by exploiting cross-channel correlation. We design a warping scheme that integrates intra-channel interpolation with cross-channel variation at very low computational cost, as required by interactive multimedia applications on mobile devices. The effectiveness and efficiency of our method are validated by extensive experiments.
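One well-known way to exploit cross-channel correlation during warping is to interpolate a guide channel directly and interpolate the other channels through their difference to the guide, which is smoother when the channels are correlated. The sketch below illustrates this general principle only; it is not the paper's scheme, and the guide-channel choice and function names are assumptions.

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Standard intra-channel bilinear interpolation of a single channel."""
    h, w = img.shape
    xs, ys = np.clip(xs, 0, w - 1), np.clip(ys, 0, h - 1)
    x0, y0 = np.floor(xs).astype(int), np.floor(ys).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    fx, fy = xs - x0, ys - y0
    top = img[y0, x0] * (1 - fx) + img[y0, x1] * fx
    bot = img[y1, x0] * (1 - fx) + img[y1, x1] * fx
    return top * (1 - fy) + bot * fy

def warp_color(image, xs, ys, guide=1):
    """image: (H, W, 3); xs, ys: sampling coordinates of equal shape.
    The guide channel is warped directly; other channels are warped through their
    cross-channel difference to the guide and then reconstructed."""
    guide_warp = bilinear_sample(image[..., guide], xs, ys)
    out = []
    for c in range(3):
        if c == guide:
            out.append(guide_warp)
        else:
            diff = image[..., c] - image[..., guide]   # cross-channel difference image
            out.append(guide_warp + bilinear_sample(diff, xs, ys))
    return np.stack(out, axis=-1)

rng = np.random.default_rng(3)
img = rng.random((32, 32, 3))
ys, xs = np.meshgrid(np.arange(32, dtype=float), np.arange(32, dtype=float), indexing="ij")
print(warp_color(img, xs + 0.3, ys - 0.2).shape)   # (32, 32, 3)
```

With plain bilinear weights the two routes coincide; the benefit of interpolating the difference image appears once a sharper or edge-aware interpolator is used, since the difference signal varies more slowly than the raw channels.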