This paper revisits few-shot 3D point cloud semantic segmentation (FS-PCS), with a focus on two significant issues in the state of the art: foreground leakage and sparse point distribution. The former arises from non-uniform point sampling, which allows models to exploit density disparities between foreground and background for easier segmentation. The latter results from sampling only 2,048 points, limiting semantic information and deviating from real-world practice. To address these issues, we introduce a standardized FS-PCS setting, upon which a new benchmark is built. Moreover, we propose a novel FS-PCS model. While previous methods are based on feature optimization, mainly refining support features to enhance prototypes, our method is based on correlation optimization and is referred to as Correlation Optimization Segmentation (COSeg). Specifically, we compute a Class-specific Multi-prototypical Correlation (CMC) for each query point, representing its correlations to category prototypes. We then propose the Hyper Correlation Augmentation (HCA) module to enhance the CMC. Furthermore, to counteract the base-class susceptibility that few-shot training inherently incurs, we propose learning non-parametric prototypes for the base classes during training. The learned base prototypes are used to calibrate correlations for the background class through a Base Prototypes Calibration (BPC) module. Experiments on popular datasets demonstrate the superiority of COSeg over existing methods. The code is available at: https://github.com/ZhaochongAn/COSeg
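As a concrete illustration of the CMC step, the sketch below computes cosine correlations between query-point features and multiple prototypes per category; the tensor shapes and names are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def cmc(query_feats, prototypes):
    """Minimal sketch of Class-specific Multi-prototypical Correlation (CMC).

    query_feats: (N, C)    per-point features of the query point cloud
    prototypes:  (K, M, C) M prototypes for each of K categories
    returns:     (N, K, M) cosine correlation of each point to each prototype
    """
    q = F.normalize(query_feats, dim=-1)      # unit-norm point features
    p = F.normalize(prototypes, dim=-1)       # unit-norm prototypes
    return torch.einsum('nc,kmc->nkm', q, p)  # cosine similarities
```

A module such as HCA would then refine this (N, K, M) correlation tensor before the final per-point classification.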
Difference features obtained by comparing images from two periods play an indispensable role in the change detection (CD) task. However, a pair of bi-temporal images can exhibit diverse changes, which may produce widely varying difference features. Identifying changed pixels with differing difference features as the same category is thus a challenge for CD. Most current methods acquire distinctive difference features in implicit ways, such as enhancing the image representation or the supervision information. Nevertheless, informative image features only guarantee that object semantics are modeled; they cannot guarantee that changed pixels have similar semantics in the difference feature space and are distinct from unchanged ones. In this work, a generalized representation of various changes is learned directly in the difference feature space, and a novel Changes-Aware Transformer (CAT) for refining difference features is proposed. This generalized representation can perceive which pixels are changed and which are unchanged, and further guide the update of pixels' difference features. CAT accomplishes this refinement through stacked cosine cross-attention and self-attention layers. After refinement, changed pixels in the difference feature space are closer to each other, which facilitates change detection. In addition, CAT is compatible with various backbone networks and existing CD methods. Experiments on remote sensing and street scene CD datasets show that our method achieves state-of-the-art performance and generalizes well.
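To make the cosine cross-attention idea concrete, here is a minimal sketch in which per-pixel difference features attend to a small set of learned change-representation tokens via cosine similarity. The token count, temperature, and projections are assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineCrossAttention(nn.Module):
    """Sketch of a cosine cross-attention layer: difference-feature queries
    attend to a learned changed/unchanged representation."""
    def __init__(self, dim, num_tokens=2, tau=0.1):
        super().__init__()
        self.repr = nn.Parameter(torch.randn(num_tokens, dim))  # change representation
        self.q_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.tau = tau

    def forward(self, diff_feats):                  # (B, N, C) difference features
        q = F.normalize(self.q_proj(diff_feats), dim=-1)
        k = F.normalize(self.repr, dim=-1)          # (T, C) unit-norm tokens
        attn = torch.softmax(q @ k.t() / self.tau, dim=-1)  # cosine-similarity weights
        return diff_feats + attn @ self.v_proj(self.repr)   # refined features
```

A self-attention layer over the refined per-pixel features would follow in the stacked design.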
Multi-focus image fusion is an effective way to overcome the limitations of optical lenses. Many existing methods obtain fused results by generating decision maps. However, such methods often assume that the focused areas of two source images are complementary, making it impossible to fuse multiple images simultaneously. Additionally, existing methods ignore the impact of hard pixels on fusion performance, limiting the visual quality of the fused image. To address these issues, a model combining generation and recombination, termed GRFusion, is proposed. In GRFusion, focus property detection is performed independently for each source image, enabling simultaneous fusion of multiple source images and avoiding the information loss caused by alternating fusion; this makes GRFusion independent of the number of inputs. To distinguish hard pixels in the source images, we determine them by examining the inconsistency among the focus-area detection results of the source images. Furthermore, a multi-directional gradient embedding method for generating all-in-focus images is proposed. Subsequently, a hard-pixel-guided recombination mechanism for constructing the fused result is devised, effectively integrating the complementary advantages of feature-reconstruction-based and focused-pixel-recombination-based methods. Extensive experimental results demonstrate the effectiveness and superiority of the proposed method. The source code will be released at https://github.com/xxx/xxx.
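The hard-pixel criterion can be illustrated with a short sketch: a pixel is treated as hard when the per-image focus detections disagree, i.e., no source image, or more than one, claims it as focused. The map format and threshold are assumptions for illustration.

```python
import numpy as np

def hard_pixel_mask(focus_maps, thresh=0.5):
    """Mark 'hard' pixels whose per-image focus detections are inconsistent.

    focus_maps: (K, H, W) focus probability map for each of K source images
    returns:    (H, W) boolean mask, True where the pixel is 'hard'
    """
    focused = focus_maps > thresh  # binary focus decision per source image
    votes = focused.sum(axis=0)    # how many images claim the pixel as focused
    return votes != 1              # zero or multiple claims -> inconsistent
```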
Recently, inference privacy has attracted increasing attention. The concern arises most notably in widely deployed edge-cloud video analytics systems, where the cloud needs the videos captured at the edge. The video data can contain sensitive information and is subject to attack when transmitted to the cloud for inference. Many privacy protection schemes have been proposed, yet the performance of a scheme must currently be determined by experiments or inferred by analyzing the specific case. In this paper, we propose a new metric, \textit{privacy protectability}, to characterize to what degree a video stream can be protected given a certain video analytics task. Such a metric has strong operational meaning. For example, low protectability means that it may be necessary to set up an overall secure environment. We can also evaluate a privacy protection scheme; e.g., assuming it obfuscates the video data, we can measure the level of protection the scheme achieves after obfuscation. Our definition of privacy protectability is rooted in information theory, and we develop efficient algorithms to estimate the metric. We use experiments on real data to validate that our metric is consistent with empirical measurements of how well a video stream can be protected for a video analytics task.
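The abstract does not spell out the metric's definition. Purely as an illustration of the kind of information-theoretic quantity involved, the sketch below estimates, from samples, how much mutual information a (possibly obfuscated) scalar video feature leaks about a sensitive attribute; lower leakage would suggest better protection. The estimator choice and all names are assumptions, not the paper's algorithm.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(X; Y) in bits between a scalar video feature
    x and a sensitive attribute y (both 1-D sample arrays)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                                  # joint distribution
    px = pxy.sum(axis=1, keepdims=True)               # marginal of X
    py = pxy.sum(axis=0, keepdims=True)               # marginal of Y
    nz = pxy > 0                                      # avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())
```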
3D object detection is a critical task in autonomous driving. Recently, multi-modal fusion-based 3D object detection methods, which combine the complementary advantages of LiDAR and camera, have shown great performance improvements over mono-modal methods. However, no methods so far have attempted to utilize instance-level contextual image semantics to guide 3D object detection. In this paper, we propose a simple and effective method, Painting Adaptive Instance-prior for 3D object detection (PAI3D), to fuse instance-level image semantics flexibly with point cloud features. PAI3D is a multi-modal sequential instance-level fusion framework. It first extracts instance-level semantic information from images; the extracted information, including each object's categorical label, point-to-object membership, and object position, is then used to augment each LiDAR point in the subsequent 3D detection network to guide and improve detection performance. PAI3D outperforms the state of the art by a large margin on the nuScenes dataset, achieving 71.4 mAP and 74.2 NDS on the test split. Our comprehensive experiments show that instance-level image semantics contribute the most to the performance gain, and that PAI3D works well with any good-quality instance segmentation model and any modern point cloud 3D encoder, making it a strong candidate for deployment on autonomous vehicles.
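A minimal sketch of the painting step is given below: LiDAR points are projected into instance masks and augmented with the three instance-level attributes the abstract names. The projection function, mask format, and array shapes are assumptions for illustration rather than PAI3D's exact pipeline.

```python
import numpy as np

def paint_points(points, instance_masks, labels, centroids, proj):
    """Augment LiDAR points with instance-level image semantics (sketch).

    points:         (N, 3) LiDAR points
    instance_masks: (M, H, W) boolean masks from an instance segmentation model
    labels:         (M,) categorical label per instance
    centroids:      (M, 3) estimated 3D object positions
    proj:           maps (N, 3) points to (N, 2) pixel coords (u, v)
    returns:        (N, 8) points with label, membership, and object position
    """
    uv = proj(points).astype(int)
    feats = np.zeros((len(points), 5))  # [label, membership, position(3)]
    for m, (mask, lab, c) in enumerate(zip(instance_masks, labels, centroids)):
        H, W = mask.shape
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        hit = inside.copy()
        hit[inside] = mask[uv[inside, 1], uv[inside, 0]]  # point falls in mask
        feats[hit, 0] = lab        # categorical label
        feats[hit, 1] = m + 1      # point-to-object membership (0 = background)
        feats[hit, 2:5] = c        # object position
    return np.concatenate([points, feats], axis=1)
```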
To be applicable in real-world scenarios, Boundary Attacks (BAs) were proposed, achieving a one hundred percent attack success rate with only decision information. However, existing BA methods craft adversarial examples by leveraging simple random sampling (SRS) to estimate the gradient, consuming a large number of model queries. To overcome this drawback of SRS, this paper proposes a Latin Hypercube Sampling based Boundary Attack (LHS-BA) to save query budget. Compared with SRS, LHS has better uniformity under the same limited number of random samples; the average over these samples is therefore closer to the true gradient than the SRS estimate. Experiments are conducted on benchmark datasets including MNIST, CIFAR, and ImageNet-1K, and the results demonstrate the superiority of the proposed LHS-BA over state-of-the-art BA methods in terms of query efficiency. The source code is publicly available at https://github.com/GZHU-DVL/LHS-BA.
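To illustrate why LHS helps, the sketch below draws Latin Hypercube perturbation directions and averages hard-label decisions into a gradient estimate. The sign-averaging update is a generic hard-label estimator used as a hedged stand-in, not the paper's exact attack.

```python
import numpy as np

def latin_hypercube(n, d, rng):
    """Draw n samples in [0, 1]^d by Latin Hypercube Sampling: each of the
    n strata per dimension is hit exactly once."""
    u = (rng.random((n, d)) + np.arange(n)[:, None]) / n  # one point per stratum
    for j in range(d):
        rng.shuffle(u[:, j])  # decouple strata across dimensions
    return u

def estimate_gradient(x, decision, n, rng, eps=1e-2):
    """Hard-label gradient estimate from LHS perturbations (sketch).

    x:        flattened input, shape (d,)
    decision: callable returning +1/-1 from the target model's hard label
    """
    d = x.size
    dirs = latin_hypercube(n, d, rng) * 2.0 - 1.0  # map to [-1, 1]^d
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    signs = np.array([decision(x + eps * v) for v in dirs])
    grad = (signs[:, None] * dirs).mean(axis=0)    # Monte Carlo average
    return grad / (np.linalg.norm(grad) + 1e-12)
```

Replacing `latin_hypercube` with `rng.random((n, d))` recovers the SRS baseline; the LHS variant covers the sampling space more evenly for the same n queries.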
We propose a Transformer-based NeRF (TransNeRF) to learn a generic neural radiance field conditioned on observed-view images for the novel view synthesis task. In contrast, existing MLP-based NeRFs cannot directly ingest an arbitrary number of observed views and require an auxiliary pooling-based operation to fuse source-view information, thereby missing complicated relationships between source views and the target rendering view. Furthermore, current approaches process each 3D point individually and ignore the local consistency of a radiance field's scene representation. These limitations can potentially reduce performance in challenging real-world applications where large differences between source views and a novel rendering view may exist. To address these challenges, TransNeRF utilizes the attention mechanism to naturally decode deep associations among an arbitrary number of source views into a coordinate-based scene representation. Local consistency of shape and appearance is enforced in both the ray-cast space and the surrounding-view space within a unified Transformer network. Experiments demonstrate that TransNeRF, trained on a wide variety of scenes, outperforms state-of-the-art image-based neural rendering methods in both scene-agnostic and per-scene fine-tuning scenarios, especially when there is a considerable gap between source views and the rendering view.
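The core idea of attention-based view fusion can be sketched briefly: a query 3D point attends over features gathered from a variable number of source views, avoiding a fixed pooling operation. Shapes and names below are assumptions, not TransNeRF's exact architecture.

```python
import torch
import torch.nn as nn

class SourceViewAttention(nn.Module):
    """Sketch: a 3D query point attends over an arbitrary number of
    source-view features instead of pooling them."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, point_embed, view_feats):
        # point_embed: (B, 1, C) encoding of a 3D point and view direction
        # view_feats:  (B, V, C) features of the point projected into V views
        fused, _ = self.attn(point_embed, view_feats, view_feats)
        return fused  # (B, 1, C) view-conditioned point feature
```

Because attention is permutation-invariant over the V source views, V can vary freely between scenes and batches.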
In Artificial Intelligence, interpreting the results of a Machine Learning technique, often termed a black box, is a difficult task. A counterfactual explanation of a particular "black box" attempts to find the smallest change to the input values that modifies the prediction to a particular output other than the original one. In this work we formulate the problem of finding a counterfactual explanation as an optimization problem. We propose a new "sparsity algorithm" which solves the optimization problem while also maximizing the sparsity of the counterfactual explanation. We apply the sparsity algorithm to provide simple suggestions to publicly traded companies on how to improve their credit ratings. We validate the sparsity algorithm on a synthetically generated dataset and further apply it to quarterly financial statements of companies in the financial, healthcare, and IT sectors of the US market. We provide evidence that the counterfactual explanation can capture the nature of the real statement features that changed between the current quarter and the following quarter when ratings improved. The empirical results show that the higher a company's rating, the greater the "effort" required to further improve its credit rating.
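To make the optimization view concrete, the sketch below searches for a small, sparse input change that flips a differentiable classifier's prediction, using an L1 penalty as a stand-in for sparsity. This is an illustration of the problem formulation, not the paper's exact sparsity algorithm.

```python
import torch

def counterfactual(model, x, target, lam=0.1, steps=500, lr=0.01):
    """Find a sparse change delta so that model(x + delta) predicts `target`.

    model:  differentiable classifier returning logits for input (d,)
    x:      (d,) original input; target: desired class index
    lam:    weight of the L1 sparsity penalty on the change
    """
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = model(x + delta)
        loss = torch.nn.functional.cross_entropy(
            logits.unsqueeze(0), torch.tensor([target]))
        loss = loss + lam * delta.abs().sum()  # encourage few changed features
        opt.zero_grad(); loss.backward(); opt.step()
    return (x + delta).detach()  # candidate counterfactual
```

In the credit-rating application, x would be a vector of financial statement features and `target` the improved rating class; the L1 term keeps the suggested change confined to a few features.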
On edge devices, data scarcity is a common problem for which transfer learning serves as a widely suggested remedy. Nevertheless, transfer learning imposes a heavy computation burden on resource-constrained edge devices. Existing task allocation works usually assume all submitted tasks are equally important, leading to inefficient resource allocation at the task level when applied directly to Multi-task Transfer Learning (MTL). To address these issues, we first show that it is crucial to measure the impact of tasks on overall decision performance improvement and to quantify \emph{task importance}. We then show that task allocation with task importance for MTL (TATIM) is a variant of the NP-complete knapsack problem, whose complicated computation must be conducted repeatedly under varying contexts. To solve TATIM with high computational efficiency, we propose a Data-driven Cooperative Task Allocation (DCTA) approach. Finally, we evaluate the performance of DCTA not only with a trace-driven simulation but also with a new comprehensive real-world AIOps case study that bridges model and practice via a new architecture and main-component design within the AIOps system. Extensive experiments show that, when solving TATIM, our DCTA reduces processing time by a factor of 3.24 and saves 48.4\% of energy consumption compared with the state of the art.
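The knapsack view of TATIM can be illustrated with the classic 0/1 dynamic program: task importance plays the role of value and compute cost the role of weight, under a device compute budget. This is the exact baseline the abstract's complexity claim refers to, not the data-driven DCTA solver itself.

```python
def allocate_tasks(importance, cost, budget):
    """0/1 knapsack: choose tasks maximizing total importance within budget.

    importance, cost: per-task lists (cost and budget are integers)
    returns: (best total importance, indices of chosen tasks)
    """
    n = len(importance)
    dp = [0] * (budget + 1)                        # best value per capacity
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for b in range(budget, cost[i] - 1, -1):   # reverse: each task used once
            if dp[b - cost[i]] + importance[i] > dp[b]:
                dp[b] = dp[b - cost[i]] + importance[i]
                keep[i][b] = True
    chosen, b = [], budget                         # backtrack the decisions
    for i in range(n - 1, -1, -1):
        if keep[i][b]:
            chosen.append(i)
            b -= cost[i]
    return dp[budget], chosen[::-1]
```

Since this DP must be re-solved whenever task importance or budgets change, a learned allocator such as DCTA amortizes that repeated computation.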
Deep CNN-based methods have so far achieved state-of-the-art results in multi-view 3D object reconstruction. Despite this considerable progress, the two core modules of these methods, multi-view feature extraction and fusion, are usually investigated separately, and the object relations across different views are rarely explored. In this paper, inspired by the recent success of self-attention-based Transformer models, we reformulate multi-view 3D reconstruction as a sequence-to-sequence prediction problem and propose a new framework named 3D Volume Transformer (VolT) for this task. Unlike previous CNN-based methods with separate designs, we unify feature extraction and view fusion in a single Transformer network. A natural advantage of our design lies in exploring view-to-view relationships using self-attention among multiple unordered inputs. On ShapeNet, a large-scale 3D reconstruction benchmark, our method achieves new state-of-the-art accuracy in multi-view reconstruction with $70\%$ fewer parameters than other CNN-based methods. Experimental results also suggest the strong scaling capability of our method. Our code will be made publicly available.
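A skeletal sketch of the sequence-to-sequence view: per-view tokens pass through a Transformer encoder, whose self-attention models view-to-view relations among unordered inputs, and a simple head decodes a voxel volume. Dimensions, the mean aggregation, and the occupancy head are assumptions for illustration, not VolT's exact design.

```python
import torch
import torch.nn as nn

class ViewFusionTransformer(nn.Module):
    """Sketch: fuse unordered multi-view tokens with self-attention,
    then decode an occupancy volume."""
    def __init__(self, dim=256, heads=8, layers=4, vox=32):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.head = nn.Linear(dim, vox ** 3)   # voxel occupancy logits
        self.vox = vox

    def forward(self, view_tokens):            # (B, V, dim), one token per view
        fused = self.encoder(view_tokens)      # view-to-view self-attention
        pooled = fused.mean(dim=1)             # order-invariant aggregation
        return self.head(pooled).view(-1, self.vox, self.vox, self.vox)
```

Because self-attention is permutation-invariant, the number and order of input views V can vary without any architectural change.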