Yue Xu

Take A Step Back: Rethinking the Two Stages in Visual Reasoning

Jul 29, 2024

Mingyu Zhang, Jiting Cai, Mingyu Liu, Yue Xu, Cewu Lu, Yong-Lu Li

Abstract: Visual reasoning, as a prominent research area, plays a crucial role in AI by facilitating concept formation and interaction with the world. However, current works are usually carried out separately on small datasets and thus lack generalization ability. Through rigorous evaluation on diverse benchmarks, we demonstrate the shortcomings of existing ad-hoc methods in achieving cross-domain reasoning and their tendency to fit data biases. In this paper, we revisit visual reasoning from a two-stage perspective: (1) symbolization and (2) logical reasoning given symbols or their representations. We find that the reasoning stage generalizes better than symbolization. Thus, it is more efficient to implement symbolization via separate encoders for different data domains while using a shared reasoner. Given our findings, we establish design principles for visual reasoning frameworks that follow separated symbolization and shared reasoning. The proposed two-stage framework achieves impressive generalization ability on various visual reasoning tasks, including puzzles, physical prediction, and visual question answering (VQA), encompassing both 2D and 3D modalities. We believe our insights will pave the way for generalizable visual reasoning.

* ECCV 2024, Project page: https://mybearyzhang.github.io/projects/TwoStageReason/
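The "separated symbolization, shared reasoner" layout described in the abstract can be pictured with a toy sketch. This is not the authors' model: the random linear maps, the domain names, and the sizes `SYMBOL_DIM` and `NUM_ANSWERS` are all hypothetical stand-ins chosen only to show how inputs of different shapes can feed one shared second stage.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Return a random linear map as a list of out_dim weight rows."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

def apply_linear(weights, x):
    """Multiply a weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

rng = random.Random(0)
SYMBOL_DIM = 4   # hypothetical size of the shared symbol representation
NUM_ANSWERS = 3  # hypothetical number of output scores

# Stage 1: one symbolization encoder per data domain (input sizes differ).
encoders = {
    "puzzle_2d": make_linear(6, SYMBOL_DIM, rng),
    "physics_3d": make_linear(9, SYMBOL_DIM, rng),
}
# Stage 2: a single reasoner shared by every domain.
shared_reasoner = make_linear(SYMBOL_DIM, NUM_ANSWERS, rng)

def predict(domain, features):
    """Encode with the domain's own encoder, then reason with the shared module."""
    symbols = apply_linear(encoders[domain], features)
    return apply_linear(shared_reasoner, symbols)

scores_2d = predict("puzzle_2d", [0.1] * 6)
scores_3d = predict("physics_3d", [0.2] * 9)
```

Both calls pass through the same `shared_reasoner`, so supporting a new domain in this sketch only requires registering another encoder.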

FAFE: Immune Complex Modeling with Geodesic Distance Loss on Noisy Group Frames

Jul 01, 2024

Ruidong Wu, Ruihan Guo, Rui Wang, Shitong Luo, Yue Xu, Jiahan Li, Jianzhu Ma, Qiang Liu, Yunan Luo, Jian Peng

Abstract: Despite the striking success of general protein folding models such as AlphaFold2 (AF2; Jumper et al., 2021), the accurate computational modeling of antibody-antigen complexes remains a challenging task. In this paper, we first analyze AF2's primary loss function, known as the Frame Aligned Point Error (FAPE), and raise a previously overlooked issue: FAPE tends to suffer from vanishing gradients on high-rotational-error targets. To address this fundamental limitation, we propose a novel geodesic loss called Frame Aligned Frame Error (FAFE, denoted as F2E to distinguish it from FAPE), which enables the model to better optimize both the rotational and translational errors between two frames. We then prove that F2E can be reformulated as a group-aware geodesic loss, which translates the optimization of the residue-to-residue error into optimizing the group-to-group geodesic frame distance. By fine-tuning AF2 with our proposed loss function, we attain a correct rate of 52.3% (DockQ > 0.23) on an evaluation set and a 43.8% correct rate on a subset with low homology, substantial improvements over AF2 of 182% and 100%, respectively.
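For readers unfamiliar with geodesic distances between rotations, the sketch below computes the standard geodesic angle between two 3D rotation matrices, arccos((trace(R1^T R2) - 1) / 2). It is not the F2E loss from the paper, which also involves translations and group frames; the helper names `rotation_z` and `geodesic_angle` are hypothetical and exist only for this illustration.

```python
import math

def rotation_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def geodesic_angle(r1, r2):
    """Angle in [0, pi] of the relative rotation r1^T r2."""
    # trace(r1^T r2) equals the element-wise (Frobenius) inner product.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)

identity = rotation_z(0.0)
quarter_turn = geodesic_angle(identity, rotation_z(math.pi / 2))
half_turn = geodesic_angle(identity, rotation_z(math.pi))
```

In this sketch the distance grows in proportion to the rotation angle all the way up to a half turn, which is the property that makes a geodesic measure informative for large rotational errors.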

Unsupervised Monocular Depth Estimation Based on Hierarchical Feature-Guided Diffusion

Jun 14, 2024

Runze Liu, Dongchen Zhu, Guanghui Zhang, Yue Xu, Wenjun Shi, Xiaolin Zhang, Lei Wang, Jiamao Li

Abstract: Unsupervised monocular depth estimation has received widespread attention because of its capability to train without ground truth. In real-world scenarios, images may be blurry or noisy due to weather conditions and the inherent limitations of the camera. Therefore, it is particularly important to develop a robust depth estimation model. Benefiting from the training strategies of generative networks, generative-based methods often exhibit enhanced robustness. In light of this, we employ a well-converging diffusion model among generative networks for unsupervised monocular depth estimation. Additionally, we propose a hierarchical feature-guided denoising module. This module significantly enriches the model's capacity for learning and interpreting depth distribution by fully leveraging image features to guide the denoising process. Furthermore, we explore the implicit depth within reprojection and design an implicit depth consistency loss. This loss function enhances the performance of the model and ensures the scale consistency of depth within a video sequence. We conduct experiments on the KITTI, Make3D, and our self-collected SIMIT datasets. The results indicate that our approach stands out among generative-based models while also showcasing remarkable robustness.

Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding

Jun 13, 2024

Yue Xu, Kaizhi Yang, Jiebo Luo, Xuejin Chen

Abstract: 3D visual grounding is an emerging research area dedicated to making connections between the 3D physical world and natural language, which is crucial for achieving embodied intelligence. In this paper, we propose DASANet, a Dual Attribute-Spatial relation Alignment Network that separately models and aligns object attributes and spatial relation features between language and 3D vision modalities. We decompose both the language and 3D point cloud input into two separate parts and design a dual-branch attention module to separately model the decomposed inputs while preserving global context in attribute-spatial feature fusion through cross attention. Our DASANet achieves the highest grounding accuracy, 65.1%, on the Nr3D dataset, 1.3% higher than the best competitor. In addition, the visualization of the two branches shows that our method is efficient and highly interpretable.

CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models

Jun 10, 2024

Peng Xia, Ze Chen, Juanxi Tian, Yangrui Gong, Ruibo Hou, Yue Xu, Zhenbang Wu, Zhiyuan Fan, Yiyang Zhou, Kangyu Zhu (+14 more)

Abstract: Artificial intelligence has significantly impacted medical applications, particularly with the advent of Medical Large Vision Language Models (Med-LVLMs), sparking optimism for the future of automated and personalized healthcare. However, the trustworthiness of Med-LVLMs remains unverified, posing significant risks for future model deployment. In this paper, we introduce CARES, which aims to comprehensively evaluate the trustworthiness of Med-LVLMs across the medical domain. We assess the trustworthiness of Med-LVLMs across five dimensions: trustfulness, fairness, safety, privacy, and robustness. CARES comprises about 41K question-answer pairs in both closed and open-ended formats, covering 16 medical image modalities and 27 anatomical regions. Our analysis reveals that the models consistently exhibit concerns regarding trustworthiness, often displaying factual inaccuracies and failing to maintain fairness across different demographic groups. Furthermore, they are vulnerable to attacks and demonstrate a lack of privacy awareness. We publicly release our benchmark and code at https://github.com/richard-peng-xia/CARES.

Low-Rank Similarity Mining for Multimodal Dataset Distillation

Jun 06, 2024

Yue Xu, Zhilin Lin, Yusong Qiu, Cewu Lu, Yong-Lu Li

Abstract: Though dataset distillation has witnessed rapid development in recent years, the distillation of multimodal data, e.g., image-text pairs, poses unique and under-explored challenges. Unlike unimodal data, image-text contrastive learning (ITC) data lack inherent categorization and should instead place greater emphasis on modality correspondence. In this work, we propose Low-Rank Similarity Mining (LoRS) for multimodal dataset distillation, which concurrently distills a ground truth similarity matrix with image-text pairs and leverages low-rank factorization for efficiency and scalability. The proposed approach brings significant improvement to the existing algorithms, marking a significant contribution to the field of visual-language dataset distillation. We advocate adopting LoRS as a foundational synthetic data setup for image-text dataset distillation. Our code is available at https://github.com/silicx/LoRS_Distill.

* Accepted at ICML 2024
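To see why a low-rank factorization helps with scalability, the sketch below builds an N x N similarity matrix from two thin N x r factors and compares parameter counts. The form S = I + U V^T is an assumption made for illustration; the abstract does not state the exact parameterization used by LoRS, and the function names are hypothetical.

```python
def low_rank_similarity(u, v):
    """Build S = I + U V^T from two N x r factor matrices (assumed form)."""
    n = len(u)
    return [
        [(1.0 if i == j else 0.0) + sum(a * b for a, b in zip(u[i], v[j]))
         for j in range(n)]
        for i in range(n)
    ]

def parameter_counts(n, r):
    """Return (full-matrix parameters, low-rank parameters) for N pairs, rank r."""
    return n * n, 2 * n * r

# Three image-text pairs, rank-2 factors (toy values).
u = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]
similarity = low_rank_similarity(u, v)
full_params, factored_params = parameter_counts(1000, 10)
```

With 1000 pairs and rank 10, the factored form in this sketch stores 20,000 values instead of 1,000,000, while off-diagonal entries can still express soft matches between different pairs.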

SimAD: A Simple Dissimilarity-based Approach for Time Series Anomaly Detection

May 18, 2024

Zhijie Zhong, Zhiwen Yu, Xing Xi, Yue Xu, Jiahui Chen, Kaixiang Yang

Abstract: Despite the prevalence of reconstruction-based deep learning methods, time series anomaly detection remains challenging. Existing approaches often struggle with limited temporal contexts, inadequate representation of normal patterns, and flawed evaluation metrics, hindering their effectiveness in identifying aberrant behavior. To address these issues, we introduce SimAD, a Simple dissimilarity-based approach for time series Anomaly Detection. SimAD incorporates an advanced feature extractor adept at processing extended temporal windows, utilizes the EmbedPatch encoder to integrate normal behavioral patterns comprehensively, and introduces an innovative ContrastFusion module designed to accentuate distributional divergences between normal and abnormal data, thereby enhancing the robustness of anomaly discrimination. Additionally, we propose two robust evaluation metrics, UAff and NAff, addressing the limitations of existing metrics and demonstrating their reliability through theoretical and experimental analyses. Experiments across seven diverse time series datasets demonstrate SimAD's superior performance compared to state-of-the-art methods, achieving relative improvements of 19.85% on F1, 4.44% on Aff-F1, 77.79% on NAff-F1, and 9.69% on AUC on six multivariate datasets. Code and pre-trained models are available at https://github.com/EmorZz1G/SimAD.

* 18 pages, 12 figures, 7 tables, Under review

LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models

Apr 09, 2024

Yue Xu, Wenjie Wang

Abstract: Prompt-based learning is a new language model training paradigm that adapts Pre-trained Language Models (PLMs) to downstream tasks and has revitalized performance benchmarks across various natural language processing (NLP) tasks. Instead of using a fixed prompt template to fine-tune the model, some research demonstrates the effectiveness of searching for the prompt via optimization. This prompt optimization process also gives insight into generating adversarial prompts to mislead the model, raising concerns about the adversarial vulnerability of this paradigm. Recent studies have shown that universal adversarial triggers (UATs) can be generated to alter not only the predictions of the target PLMs but also the predictions of the corresponding Prompt-based Fine-tuning Models (PFMs) under the prompt-based learning paradigm. However, UATs found in previous works are often unreadable tokens or characters and can be easily distinguished from natural texts with adaptive defenses. In this work, we consider the naturalness of the UATs and develop LinkPrompt, an adversarial attack algorithm that generates UATs with a gradient-based beam search that not only effectively attacks the target PLMs and PFMs but also maintains naturalness among the trigger tokens. Extensive results demonstrate the effectiveness of LinkPrompt, as well as the transferability of UATs generated by LinkPrompt to the open-source Large Language Model (LLM) Llama2 and the API-accessed LLM GPT-3.5-turbo. The resource is available at https://github.com/SavannahXu79/LinkPrompt.

* Accepted to the main conference of NAACL 2024

Macro Graph Neural Networks for Online Billion-Scale Recommender Systems

Jan 26, 2024

Hao Chen, Yuanchen Bei, Qijie Shen, Yue Xu, Sheng Zhou, Wenbing Huang, Feiran Huang, Senzhang Wang, Xiao Huang

Abstract: Predicting Click-Through Rate (CTR) in billion-scale recommender systems poses a long-standing challenge for Graph Neural Networks (GNNs) due to the overwhelming computational complexity involved in aggregating billions of neighbors. To tackle this, GNN-based CTR models usually sample hundreds of neighbors out of the billions to facilitate efficient online recommendations. However, sampling only a small portion of neighbors results in severe sampling bias and a failure to encompass the full spectrum of user or item behavioral patterns. To address this challenge, we refer to the conventional user-item recommendation graph as the "micro recommendation graph" and introduce a more suitable MAcro Recommendation Graph (MAG) for billion-scale recommendations. MAG resolves the computational complexity problems in the infrastructure by reducing the node count from billions to hundreds. Specifically, MAG groups micro nodes (users and items) with similar behavior patterns to form macro nodes. Subsequently, we introduce tailored Macro Graph Neural Networks (MacGNN) to aggregate information on a macro level and revise the embeddings of macro nodes. MacGNN has already served Taobao's homepage feed for two months, providing recommendations for over one billion users. Extensive offline experiments on three public benchmark datasets and an industrial dataset show that MacGNN significantly outperforms twelve CTR baselines while remaining computationally efficient. In addition, online A/B tests confirm MacGNN's superiority in billion-scale recommender systems.

* 11 pages, 7 figures, accepted by The Web Conference 2024
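The idea of collapsing many micro nodes into a few macro nodes can be illustrated with a small sketch. The abstract does not specify how nodes are grouped or how MacGNN aggregates them, so the precomputed cluster assignment, the mean pooling, and the name `build_macro_nodes` are all assumptions made only for this example.

```python
from collections import defaultdict

def build_macro_nodes(embeddings, cluster_of):
    """Average the embeddings of micro nodes that share a cluster id."""
    groups = defaultdict(list)
    for node, vector in embeddings.items():
        groups[cluster_of[node]].append(vector)
    # zip(*vectors) iterates over embedding dimensions column by column.
    return {
        cluster: [sum(column) / len(vectors) for column in zip(*vectors)]
        for cluster, vectors in groups.items()
    }

# Three micro nodes (toy user embeddings) grouped into two macro nodes.
embeddings = {"u1": [1.0, 0.0], "u2": [3.0, 2.0], "u3": [0.0, 4.0]}
cluster_of = {"u1": 0, "u2": 0, "u3": 1}  # assumed to come from a prior grouping step
macro_nodes = build_macro_nodes(embeddings, cluster_of)
```

Any later neighbor aggregation in this sketch would iterate over the entries of `macro_nodes` rather than over every micro node, which is where the reduction in node count pays off.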

Dancing with Images: Video Distillation via Static-Dynamic Disentanglement

Dec 01, 2023

Ziyu Wang, Yue Xu, Cewu Lu, Yong-Lu Li

Abstract: Recently, dataset distillation has paved the way towards efficient machine learning, especially for image datasets. However, distillation for videos, which are characterized by an additional temporal dimension, remains an underexplored domain. In this work, we provide the first systematic study of video distillation and introduce a taxonomy to categorize temporal compression. Our investigation reveals that the temporal information is usually not well learned during distillation, and the temporal dimension of synthetic data contributes little. These observations motivate our unified framework for disentangling the dynamic and static information in videos. It first distills the videos into still images as static memory and then compensates for the dynamic and motion information with a learnable dynamic memory block. Our method achieves state-of-the-art performance on video datasets at different scales, with notably smaller storage expenditure. Our code will be publicly available.
