Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jian Zhao

School of Electronic Science and Engineering, Nanjing University

UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Nov 25, 2024

Guangzhao Dai, Jian Zhao, Yuantao Chen, Yusen Qin, Hao Zhao, Guosen Xie, Yazhou Yao, Xiangbo Shu, Xuelong Li

Figure 1 for UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Figure 2 for UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Figure 3 for UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Figure 4 for UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Abstract:Vision-and-Language Navigation (VLN), where an agent follows instructions to reach a target destination, has recently seen significant advancements. In contrast to navigation in discrete environments with predefined trajectories, VLN in Continuous Environments (VLN-CE) presents greater challenges, as the agent is free to navigate any unobstructed location and is more vulnerable to visual occlusions or blind spots. Recent approaches have attempted to address this by imagining future environments, either through predicted future visual images or semantic features, rather than relying solely on current observations. However, these RGB-based and feature-based methods lack intuitive appearance-level information or high-level semantic complexity crucial for effective navigation. To overcome these limitations, we introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN, which enables agents to better explore future environments by unitedly rendering high-fidelity 360 visual images and semantic features. UnitedVLN employs two key schemes: search-then-query sampling and separate-then-united rendering, which facilitate efficient exploitation of neural primitives, helping to integrate both appearance and semantic information for more robust navigation. Extensive experiments demonstrate that UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.

Via

Access Paper or Ask Questions

Embedding Space Allocation with Angle-Norm Joint Classifiers for Few-Shot Class-Incremental Learning

Nov 14, 2024

Dunwei Tu, Huiyu Yi, Tieyi Zhang, Ruotong Li, Furao Shen, Jian Zhao

Figure 1 for Embedding Space Allocation with Angle-Norm Joint Classifiers for Few-Shot Class-Incremental Learning

Figure 2 for Embedding Space Allocation with Angle-Norm Joint Classifiers for Few-Shot Class-Incremental Learning

Figure 3 for Embedding Space Allocation with Angle-Norm Joint Classifiers for Few-Shot Class-Incremental Learning

Figure 4 for Embedding Space Allocation with Angle-Norm Joint Classifiers for Few-Shot Class-Incremental Learning

Abstract:Few-shot class-incremental learning (FSCIL) aims to continually learn new classes from only a few samples without forgetting previous ones, requiring intelligent agents to adapt to dynamic environments. FSCIL combines the characteristics and challenges of class-incremental learning and few-shot learning: (i) Current classes occupy the entire feature space, which is detrimental to learning new classes. (ii) The small number of samples in incremental rounds is insufficient for fully training. In existing mainstream virtual class methods, for addressing the challenge (i), they attempt to use virtual classes as placeholders. However, new classes may not necessarily align with the virtual classes. For the challenge (ii), they replace trainable fully connected layers with Nearest Class Mean (NCM) classifiers based on cosine similarity, but NCM classifiers do not account for sample imbalance issues. To address these issues in previous methods, we propose the class-center guided embedding Space Allocation with Angle-Norm joint classifiers (SAAN) learning framework, which provides balanced space for all classes and leverages norm differences caused by sample imbalance to enhance classification criteria. Specifically, for challenge (i), SAAN divides the feature space into multiple subspaces and allocates a dedicated subspace for each session by guiding samples with the pre-set category centers. For challenge (ii), SAAN establishes a norm distribution for each class and generates angle-norm joint logits. Experiments demonstrate that SAAN can achieve state-of-the-art performance and it can be directly embedded into other SOTA methods as a plug-in, further enhancing their performance.

Via

Access Paper or Ask Questions

Region-Guided Attack on the Segment Anything Model (SAM)

Nov 05, 2024

Xiaoliang Liu, Furao Shen, Jian Zhao

Figure 1 for Region-Guided Attack on the Segment Anything Model (SAM)

Figure 2 for Region-Guided Attack on the Segment Anything Model (SAM)

Figure 3 for Region-Guided Attack on the Segment Anything Model (SAM)

Figure 4 for Region-Guided Attack on the Segment Anything Model (SAM)

Abstract:The Segment Anything Model (SAM) is a cornerstone of image segmentation, demonstrating exceptional performance across various applications, particularly in autonomous driving and medical imaging, where precise segmentation is crucial. However, SAM is vulnerable to adversarial attacks that can significantly impair its functionality through minor input perturbations. Traditional techniques, such as FGSM and PGD, are often ineffective in segmentation tasks due to their reliance on global perturbations that overlook spatial nuances. Recent methods like Attack-SAM-K and UAD have begun to address these challenges, but they frequently depend on external cues and do not fully leverage the structural interdependencies within segmentation processes. This limitation underscores the need for a novel adversarial strategy that exploits the unique characteristics of segmentation tasks. In response, we introduce the Region-Guided Attack (RGA), designed specifically for SAM. RGA utilizes a Region-Guided Map (RGM) to manipulate segmented regions, enabling targeted perturbations that fragment large segments and expand smaller ones, resulting in erroneous outputs from SAM. Our experiments demonstrate that RGA achieves high success rates in both white-box and black-box scenarios, emphasizing the need for robust defenses against such sophisticated attacks. RGA not only reveals SAM's vulnerabilities but also lays the groundwork for developing more resilient defenses against adversarial threats in image segmentation.

Via

Access Paper or Ask Questions

Approximate attention with MLP: a pruning strategy for attention-based model in multivariate time series forecasting

Oct 31, 2024

Suhan Guo, Jiahong Deng, Yi Wei, Hui Dou, Furao Shen, Jian Zhao

Figure 1 for Approximate attention with MLP: a pruning strategy for attention-based model in multivariate time series forecasting

Figure 2 for Approximate attention with MLP: a pruning strategy for attention-based model in multivariate time series forecasting

Figure 3 for Approximate attention with MLP: a pruning strategy for attention-based model in multivariate time series forecasting

Figure 4 for Approximate attention with MLP: a pruning strategy for attention-based model in multivariate time series forecasting

Abstract:Attention-based architectures have become ubiquitous in time series forecasting tasks, including spatio-temporal (STF) and long-term time series forecasting (LTSF). Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we have shown empirically that the entire attention mechanism in the encoder can be reduced to an MLP formed by feedforward, skip-connection, and layer normalization operations for temporal and/or spatial modeling in multivariate time series forecasting. Specifically, the Q, K, and V projection, the attention score calculation, the dot-product between the attention score and the V, and the final projection can be removed from the attention-based networks without significantly degrading the performance that the given network remains the top-tier compared to other SOTA methods. For spatio-temporal networks, the MLP-replace-attention network achieves a reduction in FLOPS of $62.579\%$ with a loss in performance less than $2.5\%$; for LTSF, a reduction in FLOPs of $42.233\%$ with a loss in performance less than $2\%$.

Via

Access Paper or Ask Questions

Survey of User Interface Design and Interaction Techniques in Generative AI Applications

Oct 28, 2024

Reuben Luera, Ryan A. Rossi, Alexa Siu, Franck Dernoncourt, Tong Yu, Sungchul Kim, Ruiyi Zhang, Xiang Chen, Hanieh Salehy, Jian Zhao(+3 more)

Figure 1 for Survey of User Interface Design and Interaction Techniques in Generative AI Applications

Figure 2 for Survey of User Interface Design and Interaction Techniques in Generative AI Applications

Figure 3 for Survey of User Interface Design and Interaction Techniques in Generative AI Applications

Figure 4 for Survey of User Interface Design and Interaction Techniques in Generative AI Applications

Abstract:The applications of generative AI have become extremely impressive, and the interplay between users and AI is even more so. Current human-AI interaction literature has taken a broad look at how humans interact with generative AI, but it lacks specificity regarding the user interface designs and patterns used to create these applications. Therefore, we present a survey that comprehensively presents taxonomies of how a human interacts with AI and the user interaction patterns designed to meet the needs of a variety of relevant use cases. We focus primarily on user-guided interactions, surveying interactions that are initiated by the user and do not include any implicit signals given by the user. With this survey, we aim to create a compendium of different user-interaction patterns that can be used as a reference for designers and developers alike. In doing so, we also strive to lower the entry barrier for those attempting to learn more about the design of generative AI applications.

Via

Access Paper or Ask Questions

A New Approach to Solving SMAC Task: Generating Decision Tree Code from Large Language Models

Oct 21, 2024

Yue Deng, Weiyu Ma, Yuxin Fan, Yin Zhang, Haifeng Zhang, Jian Zhao

Abstract:StarCraft Multi-Agent Challenge (SMAC) is one of the most commonly used experimental environments in multi-agent reinforcement learning (MARL), where the specific task is to control a set number of allied units to defeat enemy forces. Traditional MARL algorithms often require interacting with the environment for up to 1 million steps to train a model, and the resulting policies are typically non-interpretable with weak transferability. In this paper, we propose a novel approach to solving SMAC tasks called LLM-SMAC. In our framework, agents leverage large language models (LLMs) to generate decision tree code by providing task descriptions. The model is further self-reflection using feedback from the rewards provided by the environment. We conduct experiments in the SMAC and demonstrate that our method can produce high-quality, interpretable decision trees with minimal environmental exploration. Moreover, these models exhibit strong transferability, successfully applying to similar SMAC environments without modification. We believe this approach offers a new direction for solving decision-making tasks in the future.

Via

Access Paper or Ask Questions

Autonomous Driving in Unstructured Environments: How Far Have We Come?

Oct 10, 2024

Chen Min, Shubin Si, Xu Wang, Hanzhang Xue, Weizhong Jiang, Yang Liu, Juan Wang, Qingtian Zhu, Qi Zhu, Lun Luo(+27 more)

Figure 1 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Figure 2 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Figure 3 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Figure 4 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Abstract:Research on autonomous driving in unstructured outdoor environments is less advanced than in structured urban settings due to challenges like environmental diversities and scene complexity. These environments-such as rural areas and rugged terrains-pose unique obstacles that are not common in structured urban areas. Despite these difficulties, autonomous driving in unstructured outdoor environments is crucial for applications in agriculture, mining, and military operations. Our survey reviews over 250 papers for autonomous driving in unstructured outdoor environments, covering offline mapping, pose estimation, environmental perception, path planning, end-to-end autonomous driving, datasets, and relevant challenges. We also discuss emerging trends and future research directions. This review aims to consolidate knowledge and encourage further research for autonomous driving in unstructured environments. To support ongoing work, we maintain an active repository with up-to-date literature and open-source projects at: https://github.com/chaytonmin/Survey-Autonomous-Driving-in-Unstructured-Environments.

* Survey paper; 38 pages

Via

Access Paper or Ask Questions

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Oct 09, 2024

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, Xie Chen

Figure 1 for F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Figure 2 for F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Figure 3 for F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Figure 4 for F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Abstract:This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. Demo samples can be found at https://SWivid.github.io/F5-TTS. We release all code and checkpoints to promote community development.

Via

Access Paper or Ask Questions

SurANet: Surrounding-Aware Network for Concealed Object Detection via Highly-Efficient Interactive Contrastive Learning Strategy

Oct 09, 2024

Yuhan Kang, Qingpeng Li, Leyuan Fang, Jian Zhao, Xuelong Li

Figure 1 for SurANet: Surrounding-Aware Network for Concealed Object Detection via Highly-Efficient Interactive Contrastive Learning Strategy

Figure 2 for SurANet: Surrounding-Aware Network for Concealed Object Detection via Highly-Efficient Interactive Contrastive Learning Strategy

Figure 3 for SurANet: Surrounding-Aware Network for Concealed Object Detection via Highly-Efficient Interactive Contrastive Learning Strategy

Figure 4 for SurANet: Surrounding-Aware Network for Concealed Object Detection via Highly-Efficient Interactive Contrastive Learning Strategy

Abstract:Concealed object detection (COD) in cluttered scenes is significant for various image processing applications. However, due to that concealed objects are always similar to their background, it is extremely hard to distinguish them. Here, the major obstacle is the tiny feature differences between the inside and outside object boundary region, which makes it trouble for existing COD methods to achieve accurate results. In this paper, considering that the surrounding environment information can be well utilized to identify the concealed objects, and thus, we propose a novel deep Surrounding-Aware Network, namely SurANet, for COD tasks, which introduces surrounding information into feature extraction and loss function to improve the discrimination. First, we enhance the semantics of feature maps using differential fusion of surrounding features to highlight concealed objects. Next, a Surrounding-Aware Contrastive Loss is applied to identify the concealed object via learning surrounding feature maps contrastively. Then, SurANet can be trained end-to-end with high efficiency via our proposed Spatial-Compressed Correlation Transmission strategy after our investigation of feature dynamics, and extensive experiments improve that such features can be well reserved respectively. Finally, experimental results demonstrate that the proposed SurANet outperforms state-of-the-art COD methods on multiple real datasets. Our source code will be available at https://github.com/kyh433/SurANet.

Via

Access Paper or Ask Questions

Causal Deciphering and Inpainting in Spatio-Temporal Dynamics via Diffusion Model

Sep 29, 2024

Yifan Duan, Jian Zhao, pengcheng, Junyuan Mao, Hao Wu, Jingyu Xu, shilong wang, Caoyuan Ma, Kai Wang, Kun Wang(+1 more)

Figure 1 for Causal Deciphering and Inpainting in Spatio-Temporal Dynamics via Diffusion Model

Figure 2 for Causal Deciphering and Inpainting in Spatio-Temporal Dynamics via Diffusion Model

Figure 3 for Causal Deciphering and Inpainting in Spatio-Temporal Dynamics via Diffusion Model

Figure 4 for Causal Deciphering and Inpainting in Spatio-Temporal Dynamics via Diffusion Model

Abstract:Spatio-temporal (ST) prediction has garnered a De facto attention in earth sciences, such as meteorological prediction, human mobility perception. However, the scarcity of data coupled with the high expenses involved in sensor deployment results in notable data imbalances. Furthermore, models that are excessively customized and devoid of causal connections further undermine the generalizability and interpretability. To this end, we establish a causal framework for ST predictions, termed CaPaint, which targets to identify causal regions in data and endow model with causal reasoning ability in a two-stage process. Going beyond this process, we utilize the back-door adjustment to specifically address the sub-regions identified as non-causal in the upstream phase. Specifically, we employ a novel image inpainting technique. By using a fine-tuned unconditional Diffusion Probabilistic Model (DDPM) as the generative prior, we in-fill the masks defined as environmental parts, offering the possibility of reliable extrapolation for potential data distributions. CaPaint overcomes the high complexity dilemma of optimal ST causal discovery models by reducing the data generation complexity from exponential to quasi-linear levels. Extensive experiments conducted on five real-world ST benchmarks demonstrate that integrating the CaPaint concept allows models to achieve improvements ranging from 4.3% to 77.3%. Moreover, compared to traditional mainstream ST augmenters, CaPaint underscores the potential of diffusion models in ST enhancement, offering a novel paradigm for this field. Our project is available at https://anonymous.4open.science/r/12345-DFCC.

Via

Access Paper or Ask Questions