Vision-and-Language Navigation (VLN) has witnessed significant advancements in recent years, largely attributed to meticulously curated datasets and proficiently trained models. Nevertheless, when tested in diverse environments, the trained models inevitably encounter significant shifts in data distribution, highlighting that relying solely on pre-trained and fixed navigation models is insufficient. To enhance models' generalization ability, test-time adaptation (TTA) demonstrates significant potential in the computer vision field by leveraging unlabeled test samples for model updates. However, simply applying existing TTA methods to the VLN task cannot well handle the adaptability-stability dilemma of VLN models, i.e., frequent updates can result in drastic changes in model parameters, while occasional updates can make the models ill-equipped to handle dynamically changing environments. Therefore, we propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for VLN by performing decomposition-accumulation analysis for both gradients and parameters in a unified framework. Specifically, in the fast update phase, gradients generated during the recent multi-step navigation process are decomposed into components with varying levels of consistency. Then, these components are adaptively accumulated to pinpoint a concordant direction for fast model adaptation. In the slow update phase, historically recorded parameters are gathered, and a similar decomposition-accumulation analysis is conducted to revert the model to a stable state. Extensive experiments show that our method obtains impressive performance gains on four popular benchmarks.
Despite the great progress of unsupervised domain adaptation (UDA) with the deep neural networks, current UDA models are opaque and cannot provide promising explanations, limiting their applications in the scenarios that require safe and controllable model decisions. At present, a surge of work focuses on designing deep interpretable methods with adequate data annotations and only a few methods consider the distributional shift problem. Most existing interpretable UDA methods are post-hoc ones, which cannot facilitate the model learning process for performance enhancement. In this paper, we propose an inherently interpretable method, named Transferable Conceptual Prototype Learning (TCPL), which could simultaneously interpret and improve the processes of knowledge transfer and decision-making in UDA. To achieve this goal, we design a hierarchically prototypical module that transfers categorical basic concepts from the source domain to the target domain and learns domain-shared prototypes for explaining the underlying reasoning process. With the learned transferable prototypes, a self-predictive consistent pseudo-label strategy that fuses confidence, predictions, and prototype information, is designed for selecting suitable target samples for pseudo annotations and gradually narrowing down the domain gap. Comprehensive experiments show that the proposed method can not only provide effective and intuitive explanations but also outperform previous state-of-the-arts.
In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics. As the deployment of AI systems becomes more ubiquitous, the demand for transparency and comprehensibility in these systems' decision-making processes has intensified. This survey delves into the realm of interpretable cross-modal reasoning (I-CMR), where the objective is not only to achieve high predictive performance but also to provide human-understandable explanations for the results. This survey presents a comprehensive overview of the typical methods with a three-level taxonomy for I-CMR. Furthermore, this survey reviews the existing CMR datasets with annotations for explanations. Finally, this survey summarizes the challenges for I-CMR and discusses potential future directions. In conclusion, this survey aims to catalyze the progress of this emerging research area by providing researchers with a panoramic and comprehensive perspective, illuminating the state of the art and discerning the opportunities. The summarized methods, datasets, and other resources are available at https://github.com/ZuyiZhou/Awesome-Interpretable-Cross-modal-Reasoning.
In this paper, we for the first time explore helpful multi-modal contextual knowledge to understand novel categories for open-vocabulary object detection (OVD). The multi-modal contextual knowledge stands for the joint relationship across regions and words. However, it is challenging to incorporate such multi-modal contextual knowledge into OVD. The reason is that previous detection frameworks fail to jointly model multi-modal contextual knowledge, as object detectors only support vision inputs and no caption description is provided at test time. To this end, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer with diverse multi-modal masked language modeling (D-MLM) to a student detector. The diverse multi-modal masked language modeling is realized by an object divergence constraint upon traditional multi-modal masked language modeling (MLM), in order to extract fine-grained region-level visual contexts, which are vital to object detection. Extensive experiments performed upon various detection datasets show the effectiveness of our multi-modal context learning strategy, where our approach well outperforms the recent state-of-the-art methods.
Recently, the no-box adversarial attack, in which the attacker lacks access to the model's architecture, weights, and training data, become the most practical and challenging attack setup. However, there is an unawareness of the potential and flexibility inherent in the surrogate model selection process on no-box setting. Inspired by the burgeoning interest in utilizing foundational models to address downstream tasks, this paper adopts an innovative idea that 1) recasting adversarial attack as a downstream task. Specifically, image noise generation to meet the emerging trend and 2) introducing foundational models as surrogate models. Harnessing the concept of non-robust features, we elaborate on two guiding principles for surrogate model selection to explain why the foundational model is an optimal choice for this role. However, paradoxically, we observe that these foundational models underperform. Analyzing this unexpected behavior within the feature space, we attribute the lackluster performance of foundational models (e.g., CLIP) to their significant representational capacity and, conversely, their lack of discriminative prowess. To mitigate this issue, we propose the use of a margin-based loss strategy for the fine-tuning of foundational models on target images. The experimental results verify that our approach, which employs the basic Fast Gradient Sign Method (FGSM) attack algorithm, outstrips the performance of other, more convoluted algorithms. We conclude by advocating for the research community to consider surrogate models as crucial determinants in the effectiveness of adversarial attacks in no-box settings. The implications of our work bear relevance for improving the efficacy of such adversarial attacks and the overall robustness of AI systems.
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances as well as identify the corresponding event categories with only video-level category labels for training. Most previous methods pay much attention to refining the supervision for each modality or extracting fruitful cross-modality information for more reliable feature learning. None of them have noticed the imbalanced feature learning between different modalities in the task. In this paper, to balance the feature learning processes of different modalities, a dynamic gradient modulation (DGM) mechanism is explored, where a novel and effective metric function is designed to measure the imbalanced feature learning between audio and visual modalities. Furthermore, principle analysis indicates that the multimodal confusing calculation will hamper the precise measurement of multimodal imbalanced feature learning, which further weakens the effectiveness of our DGM mechanism. To cope with this issue, a modality-separated decision unit (MSDU) is designed for more precise measurement of imbalanced feature learning between audio and visual modalities. Comprehensive experiments are conducted on public benchmarks and the corresponding experimental results demonstrate the effectiveness of our proposed method.
We introduce MQ-Det, an efficient architecture and pre-training strategy design to utilize both textual description with open-set generalization and visual exemplars with rich description granularity as category queries, namely, Multi-modal Queried object Detection, for real-world detection with both open-vocabulary categories and various granularity. MQ-Det incorporates vision queries into existing well-established language-queried-only detectors. A plug-and-play gated class-scalable perceiver module upon the frozen detector is proposed to augment category text with class-wise visual information. To address the learning inertia problem brought by the frozen detector, a vision conditioned masked language prediction strategy is proposed. MQ-Det's simple yet effective architecture and training strategy design is compatible with most language-queried object detectors, thus yielding versatile applications. Experimental results demonstrate that multi-modal queries largely boost open-world detection. For instance, MQ-Det significantly improves the state-of-the-art open-set detector GLIP by +7.8% zero-shot AP on the LVIS benchmark and averagely +6.3% AP on 13 few-shot downstream tasks, with merely 3% pre-training time required by GLIP. Code is available at https://github.com/YifanXu74/MQ-Det.
Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes like material, style, layout, etc. remains a challenge, leading to a lack of disentanglement and editability. To address this, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low- to high-frequency information, providing a new perspective on representing, generating, and editing images. We develop Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer stronger disentanglement and controllability compared to existing methods. We apply ProSpect in various personalized attribute-aware image generation applications, such as image/text-guided material/style/layout transfer/editing, achieving previously unattainable results with a single image input without fine-tuning the diffusion models.
Object Re-identification (ReID) aims to retrieve the probe object from many gallery images with the ReID model inferred based on a stationary camera-free dataset by associating and collecting the identities across all camera views. When deploying the ReID algorithm in real-world scenarios, the aspect of storage, privacy constraints, and dynamic changes of cameras would degrade its generalizability and applicability. Treating each camera's data independently, we introduce a novel ReID task named Camera-Incremental Object Re-identification (CIOR) by continually optimizing the ReID mode from the incoming stream of the camera dataset. Since the identities under different camera views might describe the same object, associating and distilling the knowledge of common identities would boost the discrimination and benefit from alleviating the catastrophic forgetting. In this paper, we propose a novel Identity Knowledge Evolution (IKE) framework for CIOR, consisting of the Identity Knowledge Association (IKA), Identity Knowledge Distillation (IKD), and Identity Knowledge Update (IKU). IKA is proposed to discover the common identities between the current identity and historical identities. IKD has applied to distillate historical identity knowledge from common identities and quickly adapt the historical model to the current camera view. After each camera has been trained, IKU is applied to continually expand the identity knowledge by combining the historical and current identity memories. The evaluation of Market-CL and Veri-CL shows the Identity Knowledge Evolution (IKE) effectiveness for CIOR. code:https://github.com/htyao89/Camera-Incremental-Object-ReID