Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Peng Wang

Smart-LLaMA: Two-Stage Post-Training of Large Language Models for Smart Contract Vulnerability Detection and Explanation

Nov 09, 2024

Lei Yu, Shiqi Chen, Hang Yuan, Peng Wang, Zhirong Huang, Jingyuan Zhang, Chenjie Shen, Fengjun Zhang, Li Yang, Jiajia Ma

Abstract:With the rapid development of blockchain technology, smart contract security has become a critical challenge. Existing smart contract vulnerability detection methods face three main issues: (1) Insufficient quality of datasets, lacking detailed explanations and precise vulnerability locations. (2) Limited adaptability of large language models (LLMs) to the smart contract domain, as most LLMs are pre-trained on general text data but minimal smart contract-specific data. (3) Lack of high-quality explanations for detected vulnerabilities, as existing methods focus solely on detection without clear explanations. These limitations hinder detection performance and make it harder for developers to understand and fix vulnerabilities quickly, potentially leading to severe financial losses. To address these problems, we propose Smart-LLaMA, an advanced detection method based on the LLaMA language model. First, we construct a comprehensive dataset covering four vulnerability types with labels, detailed explanations, and precise vulnerability locations. Second, we introduce Smart Contract-Specific Continual Pre-Training, using raw smart contract data to enable the LLM to learn smart contract syntax and semantics, enhancing their domain adaptability. Furthermore, we propose Explanation-Guided Fine-Tuning, which fine-tunes the LLM using paired vulnerable code and explanations, enabling both vulnerability detection and reasoned explanations. We evaluate explanation quality through LLM and human evaluation, focusing on Correctness, Completeness, and Conciseness. Experimental results show that Smart-LLaMA outperforms state-of-the-art baselines, with average improvements of 6.49% in F1 score and 3.78% in accuracy, while providing reliable explanations.

Via

Access Paper or Ask Questions

Dynamic Textual Prompt For Rehearsal-free Lifelong Person Re-identification

Nov 09, 2024

Hongyu Chen, Bingliang Jiao, Wenxuan Wang, Peng Wang

Figure 1 for Dynamic Textual Prompt For Rehearsal-free Lifelong Person Re-identification

Figure 2 for Dynamic Textual Prompt For Rehearsal-free Lifelong Person Re-identification

Figure 3 for Dynamic Textual Prompt For Rehearsal-free Lifelong Person Re-identification

Figure 4 for Dynamic Textual Prompt For Rehearsal-free Lifelong Person Re-identification

Abstract:Lifelong person re-identification attempts to recognize people across cameras and integrate new knowledge from continuous data streams. Key challenges involve addressing catastrophic forgetting caused by parameter updating and domain shift, and maintaining performance in seen and unseen domains. Many previous works rely on data memories to retain prior samples. However, the amount of retained data increases linearly with the number of training domains, leading to continually increasing memory consumption. Additionally, these methods may suffer significant performance degradation when data preservation is prohibited due to privacy concerns. To address these limitations, we propose using textual descriptions as guidance to encourage the ReID model to learn cross-domain invariant features without retaining samples. The key insight is that natural language can describe pedestrian instances with an invariant style, suggesting a shared textual space for any pedestrian images. By leveraging this shared textual space as an anchor, we can prompt the ReID model to embed images from various domains into a unified semantic space, thereby alleviating catastrophic forgetting caused by domain shifts. To achieve this, we introduce a task-driven dynamic textual prompt framework in this paper. This model features a dynamic prompt fusion module, which adaptively constructs and fuses two different textual prompts as anchors. This effectively guides the ReID model to embed images into a unified semantic space. Additionally, we design a text-visual feature alignment module to learn a more precise mapping between fine-grained visual and textual features. We also developed a learnable knowledge distillation module that allows our model to dynamically balance retaining existing knowledge with acquiring new knowledge. Extensive experiments demonstrate that our method outperforms SOTAs under various settings.

* 12 pages, 6 figures

Via

Access Paper or Ask Questions

Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning

Nov 03, 2024

Fei Zhou, Peng Wang, Lei Zhang, Zhenghua Chen, Wei Wei, Chen Ding, Guosheng Lin, Yanning Zhang

Figure 1 for Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning

Figure 2 for Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning

Figure 3 for Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning

Figure 4 for Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning

Abstract:Meta-learning offers a promising avenue for few-shot learning (FSL), enabling models to glean a generalizable feature embedding through episodic training on synthetic FSL tasks in a source domain. Yet, in practical scenarios where the target task diverges from that in the source domain, meta-learning based method is susceptible to over-fitting. To overcome this, we introduce a novel framework, Meta-Exploiting Frequency Prior for Cross-Domain Few-Shot Learning, which is crafted to comprehensively exploit the cross-domain transferable image prior that each image can be decomposed into complementary low-frequency content details and high-frequency robust structural characteristics. Motivated by this insight, we propose to decompose each query image into its high-frequency and low-frequency components, and parallel incorporate them into the feature embedding network to enhance the final category prediction. More importantly, we introduce a feature reconstruction prior and a prediction consistency prior to separately encourage the consistency of the intermediate feature as well as the final category prediction between the original query image and its decomposed frequency components. This allows for collectively guiding the network's meta-learning process with the aim of learning generalizable image feature embeddings, while not introducing any extra computational cost in the inference phase. Our framework establishes new state-of-the-art results on multiple cross-domain few-shot learning benchmarks.

Via

Access Paper or Ask Questions

Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation

Oct 30, 2024

Wei Dong, Yuan Sun, Yiting Yang, Xing Zhang, Zhijun Lin, Qingsen Yan, Haokui Zhang, Peng Wang, Yang Yang, Hengtao Shen

Figure 1 for Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation

Figure 2 for Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation

Figure 3 for Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation

Figure 4 for Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation

Abstract:A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning a low-rank adaptation matrix. This matrix is decomposed into a product of down-projection and up-projection matrices, with the bottleneck dimensionality being crucial for reducing the number of learnable parameters, as exemplified by prevalent methods like LoRA and Adapter. However, these low-rank strategies typically employ a fixed bottleneck dimensionality, which limits their flexibility in handling layer-wise variations. To address this limitation, we propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix. SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix. We utilize Householder transformations to construct orthogonal matrices that efficiently mimic the unitary matrices, requiring only a vector. The diagonal values are learned in a layer-wise manner, allowing them to flexibly capture the unique properties of each layer. This approach enables the generation of adaptation matrices with varying ranks across different layers, providing greater flexibility in adapting pre-trained models. Experiments on standard downstream vision tasks demonstrate that our method achieves promising fine-tuning performance.

Via

Access Paper or Ask Questions

Two Birds With One Stone: Enhancing Communication and Sensing via Multi-Functional RIS

Oct 09, 2024

Wanli Ni, Wen Wang, Ailing Zheng, Peng Wang, Changsheng You, Yonina C. Eldar, Dusit Niyato, Robert Schober

Figure 1 for Two Birds With One Stone: Enhancing Communication and Sensing via Multi-Functional RIS

Figure 2 for Two Birds With One Stone: Enhancing Communication and Sensing via Multi-Functional RIS

Figure 3 for Two Birds With One Stone: Enhancing Communication and Sensing via Multi-Functional RIS

Figure 4 for Two Birds With One Stone: Enhancing Communication and Sensing via Multi-Functional RIS

Abstract:In this article, we propose new network architectures that integrate multi-functional reconfigurable intelligent surfaces (MF-RISs) into 6G networks to enhance both communication and sensing capabilities. Firstly, we elaborate how to leverage MF-RISs for improving communication performance in different communication modes including unicast, mulitcast, and broadcast and for different multi-access schemes. Next, we emphasize synergistic benefits of integrating MF-RISs with wireless sensing, enabling more accurate and efficient target detection in 6G networks. Furthermore, we present two schemes that utilize MF-RISs to enhance the performance of integrated sensing and communication (ISAC). We also study multi-objective optimization to achieve the optimal trade-off between communication and sensing performance. Finally, we present numerical results to show the performance improvements offered by MF-RISs compared to conventional RISs in ISAC. We also outline key research directions for MF-RIS under the ambition of 6G.

* 8 pages, 5 figures, submitted to IEEE

Via

Access Paper or Ask Questions

Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information

Oct 06, 2024

Yongheng Zhang, Qiguang Chen, Jingxuan Zhou, Peng Wang, Jiasheng Si, Jin Wang, Wenpeng Lu, Libo Qin

Figure 1 for Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information

Figure 2 for Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information

Figure 3 for Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information

Figure 4 for Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information

Abstract:Chain-of-Thought (CoT) has become a vital technique for enhancing the performance of Large Language Models (LLMs), attracting increasing attention from researchers. One stream of approaches focuses on the iterative enhancement of LLMs by continuously verifying and refining their reasoning outputs for desired quality. Despite its impressive results, this paradigm faces two critical issues: (1) Simple verification methods: The current paradigm relies solely on a single verification method. (2) Wrong Information Ignorance: Traditional paradigms directly ignore wrong information during reasoning and refine the logic paths from scratch each time. To address these challenges, we propose Wrong-of-Thought (WoT), which includes two core modules: (1) Multi-Perspective Verification: A multi-perspective verification method for accurately refining the reasoning process and result, and (2) Wrong Information Utilization: Utilizing wrong information to alert LLMs and reduce the probability of LLMs making same mistakes. Experiments on 8 popular datasets and 5 LLMs demonstrate that WoT surpasses all previous baselines. In addition, WoT exhibits powerful capabilities in difficult computation tasks.

* Accepted by EMNLP 2024 Findings

Via

Access Paper or Ask Questions

Empirical Insights on Fine-Tuning Large Language Models for Question-Answering

Sep 24, 2024

Junjie Ye, Yuming Yang, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan

Abstract:Large language models (LLMs) encode extensive world knowledge through pre-training on massive datasets, which can then be fine-tuned for the question-answering (QA) task. However, effective strategies for fine-tuning LLMs for the QA task remain largely unexplored. To address this gap, we categorize supervised fine-tuning (SFT) data based on the extent of knowledge memorized by the pretrained LLMs and conduct a series of empirical analyses. Our experiments, involving four LLMs from three different model families, focus on three key factors: the amount of data required for SFT, the impact of different SFT datasets on model performance, and how data requirements vary across LLMs. The results show that as few as 60 data points during the SFT stage can activate the knowledge encoded during pre-training, enabling LLMs to perform the QA task. Additionally, SFT with data of varying memory levels has a significant impact on LLM performance, with the optimal dataset differing based on the specific model being fine-tuned. Future research will delve deeper into the mechanisms underlying these phenomena.

Via

Access Paper or Ask Questions

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Sep 18, 2024

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge(+9 more)

Figure 1 for Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Figure 2 for Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Figure 3 for Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Figure 4 for Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Abstract:We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at \url{https://github.com/QwenLM/Qwen2-VL}.

* Code is available at https://github.com/QwenLM/Qwen2-VL

Via

Access Paper or Ask Questions

VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

Sep 18, 2024

Humen Zhong, Zhibo Yang, Zhaohai Li, Peng Wang, Jun Tang, Wenqing Cheng, Cong Yao

Figure 1 for VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

Figure 2 for VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

Figure 3 for VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

Figure 4 for VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

Abstract:Text recognition is an inherent integration of vision and language, encompassing the visual texture in stroke patterns and the semantic context among the character sequences. Towards advanced text recognition, there are three key challenges: (1) an encoder capable of representing the visual and semantic distributions; (2) a decoder that ensures the alignment between vision and semantics; and (3) consistency in the framework during pre-training, if it exists, and fine-tuning. Inspired by masked autoencoding, a successful pre-training strategy in both vision and language, we propose an innovative scene text recognition approach, named VL-Reader. The novelty of the VL-Reader lies in the pervasive interplay between vision and language throughout the entire process. Concretely, we first introduce a Masked Visual-Linguistic Reconstruction (MVLR) objective, which aims at simultaneously modeling visual and linguistic information. Then, we design a Masked Visual-Linguistic Decoder (MVLD) to further leverage masked vision-language context and achieve bi-modal feature interaction. The architecture of VL-Reader maintains consistency from pre-training to fine-tuning. In the pre-training stage, VL-Reader reconstructs both masked visual and text tokens, while in the fine-tuning stage, the network degrades to reconstruct all characters from an image without any masked regions. VL-reader achieves an average accuracy of 97.1% on six typical datasets, surpassing the SOTA by 1.1%. The improvement was even more significant on challenging datasets. The results demonstrate that vision and language reconstructor can serve as an effective scene text recognizer.

* Accepted by ACM-MM2024

Via

Access Paper or Ask Questions

VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding

Sep 14, 2024

Hongyu Sun, Yongcai Wang, Peng Wang, Haoran Deng, Xudong Cai, Deying Li

Figure 1 for VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding

Figure 2 for VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding

Figure 3 for VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding

Figure 4 for VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding

Abstract:View-based methods have demonstrated promising performance in 3D shape understanding. However, they tend to make strong assumptions about the relations between views or learn the multi-view correlations indirectly, which limits the flexibility of exploring inter-view correlations and the effectiveness of target tasks. To overcome the above problems, this paper investigates flexible organization and explicit correlation learning for multiple views. In particular, we propose to incorporate different views of a 3D shape into a permutation-invariant set, referred to as \emph{View Set}, which removes rigid relation assumptions and facilitates adequate information exchange and fusion among views. Based on that, we devise a nimble Transformer model, named \emph{VSFormer}, to explicitly capture pairwise and higher-order correlations of all elements in the set. Meanwhile, we theoretically reveal a natural correspondence between the Cartesian product of a view set and the correlation matrix in the attention mechanism, which supports our model design. Comprehensive experiments suggest that VSFormer has better flexibility, efficient inference efficiency and superior performance. Notably, VSFormer reaches state-of-the-art results on various 3d recognition datasets, including ModelNet40, ScanObjectNN and RGBD. It also establishes new records on the SHREC'17 retrieval benchmark. The code and datasets are available at \url{https://github.com/auniquesun/VSFormer}.

* accepted by TVCG 2024

Via

Access Paper or Ask Questions