Meta
Abstract:All-in-one image restoration tasks are becoming increasingly important, especially for ultra-high-definition (UHD) images. Existing all-in-one UHD image restoration methods usually boost the model's performance by introducing prompt or customized dynamized networks for different degradation types. For the inference stage, it might be friendly, but in the training stage, since the model encounters multiple degraded images of different quality in an epoch, these cluttered learning objectives might be information pollution for the model. To address this problem, we propose a new training paradigm for general image restoration models, which we name \textbf{Review Learning}, which enables image restoration models to be capable enough to handle multiple types of degradation without prior knowledge and prompts. This approach begins with sequential training of an image restoration model on several degraded datasets, combined with a review mechanism that enhances the image restoration model's memory for several previous classes of degraded datasets. In addition, we design a lightweight all-purpose image restoration network that can efficiently reason about degraded images with 4K ($3840 \times 2160$) resolution on a single consumer-grade GPU.




Abstract:Foundation models (FMs) are revolutionizing the analysis and understanding of remote sensing (RS) scenes, including aerial RGB, multispectral, and SAR images. However, hyperspectral images (HSIs), which are rich in spectral information, have not seen much application of FMs, with existing methods often restricted to specific tasks and lacking generality. To fill this gap, we introduce HyperSIGMA, a vision transformer-based foundation model for HSI interpretation, scalable to over a billion parameters. To tackle the spectral and spatial redundancy challenges in HSIs, we introduce a novel sparse sampling attention (SSA) mechanism, which effectively promotes the learning of diverse contextual features and serves as the basic block of HyperSIGMA. HyperSIGMA integrates spatial and spectral features using a specially designed spectral enhancement module. In addition, we construct a large-scale hyperspectral dataset, HyperGlobal-450K, for pre-training, which contains about 450K hyperspectral images, significantly surpassing existing datasets in scale. Extensive experiments on various high-level and low-level HSI tasks demonstrate HyperSIGMA's versatility and superior representational capability compared to current state-of-the-art methods. Moreover, HyperSIGMA shows significant advantages in scalability, robustness, cross-modal transferring capability, and real-world applicability.




Abstract:Backdoor attacks present significant threats to Large Language Models (LLMs), particularly with the rise of third-party services that offer API integration and prompt engineering. Untrustworthy third parties can plant backdoors into LLMs and pose risks to users by embedding malicious instructions into user queries. The backdoor-compromised LLM will generate malicious output when and input is embedded with a specific trigger predetermined by an attacker. Traditional defense strategies, which primarily involve model parameter fine-tuning and gradient calculation, are inadequate for LLMs due to their extensive computational and clean data requirements. In this paper, we propose a novel solution, Chain-of-Scrutiny (CoS), to address these challenges. Backdoor attacks fundamentally create a shortcut from the trigger to the target output, thus lack reasoning support. Accordingly, CoS guides the LLMs to generate detailed reasoning steps for the input, then scrutinizes the reasoning process to ensure consistency with the final answer. Any inconsistency may indicate an attack. CoS only requires black-box access to LLM, offering a practical defense, particularly for API-accessible LLMs. It is user-friendly, enabling users to conduct the defense themselves. Driven by natural language, the entire defense process is transparent to users. We validate the effectiveness of CoS through extensive experiments across various tasks and LLMs. Additionally, experiments results shows CoS proves more beneficial for more powerful LLMs.




Abstract:Despite significant progress has been made in image deraining, existing approaches are mostly carried out on low-resolution images. The effectiveness of these methods on high-resolution images is still unknown, especially for ultra-high-definition (UHD) images, given the continuous advancement of imaging devices. In this paper, we focus on the task of UHD image deraining, and contribute the first large-scale UHD image deraining dataset, 4K-Rain13k, that contains 13,000 image pairs at 4K resolution. Based on this dataset, we conduct a benchmark study on existing methods for processing UHD images. Furthermore, we develop an effective and efficient vision MLP-based architecture (UDR-Mixer) to better solve this task. Specifically, our method contains two building components: a spatial feature rearrangement layer that captures long-range information of UHD images, and a frequency feature modulation layer that facilitates high-quality UHD image reconstruction. Extensive experimental results demonstrate that our method performs favorably against the state-of-the-art approaches while maintaining a lower model complexity. The code and dataset will be available at https://github.com/cschenxiang/UDR-Mixer.




Abstract:Today's analog/mixed-signal (AMS) integrated circuit (IC) designs demand substantial manual intervention. The advent of multimodal large language models (MLLMs) has unveiled significant potential across various fields, suggesting their applicability in streamlining large-scale AMS IC design as well. A bottleneck in employing MLLMs for automatic AMS circuit generation is the absence of a comprehensive dataset delineating the schematic-netlist relationship. We therefore design an automatic technique for converting schematics into netlists, and create dataset AMSNet, encompassing transistor-level schematics and corresponding SPICE format netlists. With a growing size, AMSNet can significantly facilitate exploration of MLLM applications in AMS circuit design. We have made an initial set of netlists public, and will make both our netlist generation tool and the full dataset available upon publishing of this paper.




Abstract:A high-precision feature extraction model is crucial for change detection (CD). In the past, many deep learning-based supervised CD methods learned to recognize change feature patterns from a large number of labelled bi-temporal images, whereas labelling bi-temporal remote sensing images is very expensive and often time-consuming; therefore, we propose a coarse-to-fine semi-supervised CD method based on consistency regularization (C2F-SemiCD), which includes a coarse-to-fine CD network with a multiscale attention mechanism (C2FNet) and a semi-supervised update method. Among them, the C2FNet network gradually completes the extraction of change features from coarse-grained to fine-grained through multiscale feature fusion, channel attention mechanism, spatial attention mechanism, global context module, feature refine module, initial aggregation module, and final aggregation module. The semi-supervised update method uses the mean teacher method. The parameters of the student model are updated to the parameters of the teacher Model by using the exponential moving average (EMA) method. Through extensive experiments on three datasets and meticulous ablation studies, including crossover experiments across datasets, we verify the significant effectiveness and efficiency of the proposed C2F-SemiCD method. The code will be open at: https://github.com/ChengxiHAN/C2F-SemiCDand-C2FNet.




Abstract:Benefiting from the developments in deep learning technology, deep-learning-based algorithms employing automatic feature extraction have achieved remarkable performance on the change detection (CD) task. However, the performance of existing deep-learning-based CD methods is hindered by the imbalance between changed and unchanged pixels. To tackle this problem, a progressive foreground-balanced sampling strategy on the basis of not adding change information is proposed in this article to help the model accurately learn the features of the changed pixels during the early training process and thereby improve detection performance.Furthermore, we design a discriminative Siamese network, hierarchical attention network (HANet), which can integrate multiscale features and refine detailed features. The main part of HANet is the HAN module, which is a lightweight and effective self-attention mechanism. Extensive experiments and ablation studies on two CDdatasets with extremely unbalanced labels validate the effectiveness and efficiency of the proposed method.




Abstract:The rapid advancement of automated artificial intelligence algorithms and remote sensing instruments has benefited change detection (CD) tasks. However, there is still a lot of space to study for precise detection, especially the edge integrity and internal holes phenomenon of change features. In order to solve these problems, we design the Change Guiding Network (CGNet), to tackle the insufficient expression problem of change features in the conventional U-Net structure adopted in previous methods, which causes inaccurate edge detection and internal holes. Change maps from deep features with rich semantic information are generated and used as prior information to guide multi-scale feature fusion, which can improve the expression ability of change features. Meanwhile, we propose a self-attention module named Change Guide Module (CGM), which can effectively capture the long-distance dependency among pixels and effectively overcome the problem of the insufficient receptive field of traditional convolutional neural networks. On four major CD datasets, we verify the usefulness and efficiency of the CGNet, and a large number of experiments and ablation studies demonstrate the effectiveness of CGNet. We're going to open-source our code at https://github.com/ChengxiHAN/CGNet-CD.
Abstract:Change detection, as a research hotspot in the field of remote sensing, has witnessed continuous development and progress. However, the discrimination of boundary details remains a significant bottleneck due to the complexity of surrounding elements between change areas and backgrounds. Discriminating the boundaries of large change areas results in misalignment, while connecting boundaries occurs for small change targets. To address the above issues, a novel network based on the localization-then-refinement strategy is proposed in this paper, namely LRNet. LRNet consists of two stages: localization and refinement. In the localization stage, a three-branch encoder simultaneously extracts original image features and their differential features for interactive localization of the position of each change area. To minimize information loss during feature extraction, learnable optimal pooling (LOP) is proposed to replace the widely used max-pooling. Additionally, this process is trainable and contributes to the overall optimization of the network. To effectively interact features from different branches and accurately locate change areas of various sizes, change alignment attention (C2A) and hierarchical change alignment module (HCA) are proposed. In the refinement stage, the localization results from the localization stage are corrected by constraining the change areas and change edges through the edge-area alignment module (E2A). Subsequently, the decoder, combined with the difference features strengthened by C2A in the localization phase, refines change areas of different sizes, ultimately achieving accurate boundary discrimination of change areas. The proposed LRNet outperforms 13 other state-of-the-art methods in terms of comprehensive evaluation metrics and provides the most precise boundary discrimination results on the LEVIR-CD and WHU-CD datasets.
Abstract:Text-to-image diffusion models have achieved remarkable performance in image synthesis, while the text interface does not always provide fine-grained control over certain image factors. For instance, changing a single token in the text can have unintended effects on the image. This paper shows a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models. The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens: the positive prompt describes the image to be synthesized, and the baseline prompt serves as a "baseline" that disentangles other factors. Contrastive Guidance is a general method we illustrate whose benefits in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.