Feng Wang

M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

Nov 15, 2024

Sucheng Ren, Yaodong Yu, Nataniel Ruiz, Feng Wang, Alan Yuille, Cihang Xie

Abstract: Recent work in computer vision, named VAR, proposes a new autoregressive paradigm for image generation. Diverging from vanilla next-token prediction, VAR structurally reformulates image generation as coarse-to-fine next-scale prediction. In this paper, we show that this scale-wise autoregressive framework can be effectively decoupled into intra-scale modeling, which captures local spatial dependencies within each scale, and inter-scale modeling, which models cross-scale relationships progressively from coarse to fine scales. This decoupled structure allows us to rebuild VAR in a more computationally efficient manner. Specifically, for intra-scale modeling, which is crucial for generating high-fidelity images, we retain the original bidirectional self-attention design to ensure comprehensive modeling; for inter-scale modeling, which semantically connects different scales but is computationally intensive, we apply linear-complexity mechanisms like Mamba to substantially reduce computational overhead. We term this new framework M-VAR. Extensive experiments demonstrate that our method outperforms existing models in both image quality and generation speed. For example, our 1.5B model, with fewer parameters and faster inference speed, outperforms the largest VAR-d30-2B. Moreover, our largest model M-VAR-d32 registers 1.78 FID on ImageNet 256×256 and outperforms the prior-art autoregressive models LlamaGen/VAR by 0.4/0.19 and popular diffusion models LDM/DiT by 1.82/0.49, respectively. Code is available at https://github.com/OliverRensu/MVAR.
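To make the cost argument in the M-VAR abstract concrete, the sketch below counts query-key pairs for two attention layouts over a coarse-to-fine token pyramid. It is only an illustration under stated assumptions: the scale schedule is hypothetical (the abstract does not give one), and the "joint" layout assumes each scale attends to itself and to all coarser scales. It is not the authors' implementation.

```python
def attention_pairs(side_lengths):
    """Count query-key pairs for two attention layouts.

    side_lengths: side length of the token map at each scale, coarse to fine
    (a hypothetical schedule chosen for illustration).
    Returns (intra, joint):
      intra - each scale attends only within itself
      joint - each scale attends to itself and all coarser scales (assumed layout)
    """
    tokens = [s * s for s in side_lengths]
    intra = sum(n * n for n in tokens)
    joint = 0
    prefix = 0  # running total of tokens in this scale and all coarser ones
    for n in tokens:
        prefix += n
        joint += n * prefix
    return intra, joint


if __name__ == "__main__":
    intra, joint = attention_pairs((1, 2, 4, 8, 16))  # hypothetical schedule
    print(f"intra-scale pairs: {intra}, joint pairs: {joint}")
```

The gap between the two counts is the cross-scale portion, which is the part the abstract says is handled by a linear-complexity mechanism instead of attention.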

An Overview on IRS-Enabled Sensing and Communications for 6G: Architectures, Fundamental Limits, and Joint Beamforming Designs

Nov 11, 2024

Xianxin Song, Yuan Fang, Feng Wang, Zixiang Ren, Xianghao Yu, Ye Zhang, Fan Liu, Jie Xu, Derrick Wing Kwan Ng, Rui Zhang(+1 more)

Abstract: This paper presents an overview of intelligent reflecting surface (IRS)-enabled sensing and communication for the forthcoming sixth-generation (6G) wireless networks, in which IRSs are strategically deployed to proactively reconfigure wireless environments to improve both sensing and communication (S&C) performance. First, we exploit a single IRS to enable wireless sensing in the base station's (BS's) non-line-of-sight (NLoS) area. In particular, we present three IRS-enabled NLoS target sensing architectures with fully-passive, semi-passive, and active IRSs, respectively. We compare their pros and cons by analyzing the fundamental sensing performance limits for target detection and parameter estimation. Next, we consider a single IRS to facilitate integrated sensing and communication (ISAC), in which the transmit signals at the BS are used for achieving both S&C functionalities, aided by the IRS through reflective beamforming. We present joint transmit signal and receiver processing designs for realizing efficient ISAC, and jointly optimize the transmit beamforming at the BS and reflective beamforming at the IRS to balance the fundamental performance tradeoff between S&C. Furthermore, we discuss multi-IRS networked ISAC, focusing in particular on multi-IRS-enabled multi-link ISAC, multi-region ISAC, and ISAC signal routing. Finally, we highlight various promising research topics in this area to motivate future work.

* 22 pages, 7 figures

Semi-supervised Chinese Poem-to-Painting Generation via Cycle-consistent Adversarial Networks

Oct 25, 2024

Zhengyang Lu, Tianhao Guo, Feng Wang

Abstract: Classical Chinese poetry and painting represent the epitome of artistic expression, but the abstract and symbolic nature of their relationship poses a significant challenge for computational translation. Most existing methods rely on large-scale paired datasets, which are scarce in this domain. In this work, we propose a semi-supervised approach using cycle-consistent adversarial networks to leverage both the limited paired data and a large unpaired corpus of poems and paintings. The key insight is to learn bidirectional mappings that enforce semantic alignment between the visual and textual modalities. We introduce novel evaluation metrics to assess the quality, diversity, and consistency of the generated poems and paintings. Extensive experiments are conducted on a new Chinese Painting Description Dataset (CPDD). The proposed model outperforms previous methods, showing promise in capturing the symbolic essence of artistic expression. Code is available at https://github.com/Mnster00/poemtopainting.

Causal Image Modeling for Efficient Visual Understanding

Oct 10, 2024

Feng Wang, Timing Yang, Yaodong Yu, Sucheng Ren, Guoyizhe Wei, Angtian Wang, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie

Abstract: In this work, we present a comprehensive analysis of causal image modeling and introduce the Adventurer series of models, in which we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length, which can effectively address the memory and computation explosion issues posed by high-resolution and fine-grained images. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers. Extensive empirical studies demonstrate the significant efficiency and effectiveness of this causal image modeling paradigm. For example, our base-sized Adventurer model attains a competitive test accuracy of 84.0% on the standard ImageNet-1k benchmark with 216 images/s training throughput, which is 5.3 times more efficient than vision transformers to achieve the same result.
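The two designs named in the Adventurer abstract can be pictured with a small list-based sketch. The details here are assumptions, not taken from the paper: the global token is computed as a plain mean of the patch vectors, and the flip reverses the patch order while leaving the global token at the front. Both function names are hypothetical.

```python
def prepend_global_token(patches):
    """Prepend a mean-pooled token to a sequence of patch vectors.

    Assumption: the global pooling token is the element-wise mean of all patches.
    """
    dim = len(patches[0])
    mean = [sum(p[i] for p in patches) / len(patches) for i in range(dim)]
    return [mean] + patches


def flip_patches(tokens):
    """Reverse the patch order, keeping the first (global) token in place.

    Assumption: the flip applies to patch tokens only.
    """
    return tokens[:1] + tokens[:0:-1]


if __name__ == "__main__":
    seq = prepend_global_token([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    print(flip_patches(seq))
```

Because a uni-directional model sees only earlier tokens, alternating the scan direction between layers is one way to let every patch eventually receive information from both sides.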

PEAR: Position-Embedding-Agnostic Attention Re-weighting Enhances Retrieval-Augmented Generation with Zero Inference Overhead

Sep 29, 2024

Tao Tan, Yining Qian, Ang Lv, Hongzhan Lin, Songhao Wu, Yongbo Wang, Feng Wang, Jingtong Wu, Xin Lu, Rui Yan

Abstract: Large language models (LLMs) enhanced with retrieval-augmented generation (RAG) have introduced a new paradigm for web search. However, the limited context awareness of LLMs degrades their performance on RAG tasks. Existing methods to enhance context awareness are often inefficient, incurring time or memory overhead during inference, and many are tailored to specific position embeddings. In this paper, we propose Position-Embedding-Agnostic attention Re-weighting (PEAR), which enhances the context awareness of LLMs with zero inference overhead. Specifically, on a proxy task focused on context copying, we first detect heads that suppress the model's context awareness, thereby diminishing RAG performance. To weaken the impact of these heads, we re-weight their outputs with learnable coefficients. The LLM (with frozen parameters) is optimized by adjusting these coefficients to minimize loss on the proxy task. As a result, the coefficients are optimized to values less than one, thereby reducing the heads' tendency to suppress RAG performance. During inference, the optimized coefficients are fixed to re-weight these heads, regardless of the specific task at hand. Our proposed PEAR offers two major advantages over previous approaches: (1) It introduces zero additional inference overhead in terms of memory usage or inference time, while outperforming competitive baselines in accuracy and efficiency across various RAG tasks. (2) It is independent of position embedding algorithms, ensuring broader applicability.
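The re-weighting step described in the PEAR abstract amounts to scaling the outputs of selected attention heads by fixed scalars at inference time. The sketch below shows only that scaling on plain Python lists; the function name, the data layout, and the coefficient values are hypothetical, and how the coefficients are learned or how heads are detected is not shown.

```python
def reweight_heads(head_outputs, coefficients):
    """Scale each head's output vector by its coefficient.

    head_outputs: list of per-head output vectors (lists of floats).
    coefficients: dict mapping head index -> scalar; heads not listed
                  keep a weight of 1.0 (i.e. they are left unchanged).
    """
    return [
        [coefficients.get(h, 1.0) * x for x in out]
        for h, out in enumerate(head_outputs)
    ]


if __name__ == "__main__":
    outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    # Hypothetical learned coefficients below one for heads 1 and 2.
    print(reweight_heads(outputs, {1: 0.5, 2: 0.25}))
```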

* preprint

jina-embeddings-v3: Multilingual Embeddings With Task LoRA

Sep 17, 2024

Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang(+1 more)

Abstract: We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks.
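The "flexible truncation of embedding dimensions" mentioned in the jina-embeddings-v3 abstract can be sketched as keeping the leading components of a vector. Re-normalizing to unit length afterwards is an assumption made here (a common choice when embeddings are compared by cosine similarity), not something the abstract states, and the function name is hypothetical.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit L2 norm.

    The re-normalization is an assumption for illustration; a zero
    vector is returned unchanged to avoid dividing by zero.
    """
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [3.0, 4.0, 12.0, 0.5]  # toy vector, not a real model output
    print(truncate_embedding(full, 2))
```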

* 20 pages, pp11-13 references, pp14-20 appendix and experiment tables

Rethinking the Key Factors for the Generalization of Remote Sensing Stereo Matching Networks

Aug 14, 2024

Liting Jiang, Feng Wang, Wenyi Zhang, Peifeng Li, Hongjian You, Yuming Xiang

Abstract: Stereo matching, a critical step of 3D reconstruction, has fully shifted towards deep learning due to its strong feature representation of remote sensing images. However, ground truth for the stereo matching task relies on expensive airborne LiDAR data, making it difficult to obtain enough samples for supervised learning. To improve the generalization ability of stereo matching networks on cross-domain data from different sensors and scenarios, in this paper we study key training factors from three perspectives. (1) For the selection of the training dataset, it is important to select data with a regional target distribution similar to that of the test set instead of utilizing data from the same sensor. (2) For model structure, a cascaded structure that flexibly adapts to different sizes of features is preferred. (3) For the training scheme, unsupervised methods generalize better than supervised methods, and we design an unsupervised early-stop strategy to help retain the best model with pre-trained weights as the basis. Extensive experiments are conducted to support these findings, on the basis of which we present an unsupervised stereo matching network with good generalization performance. We release the source code and the datasets at https://github.com/Elenairene/RKF_RSSM to reproduce the results and encourage future work.

* submitted to IEEE jstars

Unsupervised Stereo Matching Network For VHR Remote Sensing Images Based On Error Prediction

Aug 14, 2024

Liting Jiang, Yuming Xiang, Feng Wang, Hongjian You

Abstract: Stereo matching in remote sensing has recently garnered increased attention, primarily focusing on supervised learning. However, datasets with ground truth generated by expensive airborne LiDAR are limited in quantity and diversity, constraining the effectiveness of supervised networks. In contrast, unsupervised learning methods can leverage the increasing availability of very-high-resolution (VHR) remote sensing images, offering considerable potential in the realm of stereo matching. Motivated by this intuition, we propose a novel unsupervised stereo matching network for VHR remote sensing images. A light-weight module to bridge confidence with predicted error is introduced to refine the core model. Robust unsupervised losses are formulated to enhance network convergence. The experimental results on US3D and WHU-Stereo datasets demonstrate that the proposed network achieves superior accuracy compared to other unsupervised networks and exhibits better generalization capabilities than supervised models. Our code will be available at https://github.com/Elenairene/CBEM.

* Accepted to International Geoscience and Remote Sensing Symposium (IGARSS), 2024

Respiratory Subtraction for Pulmonary Microwave Ablation Evaluation

Aug 08, 2024

Wan Li, Xinyun Zhong, Wei Li, Song Zhang, Moheng Rong, Yan Xi, Peng Yuan, Zechen Wang, Xiaolei Jiang, Rongxi Yi(+5 more)

Abstract: Lung cancer is currently a leading cause of global cancer mortality, often necessitating minimally invasive interventions. Microwave ablation (MWA) is extensively utilized for both primary and secondary lung tumors. Although numerous clinical guidelines and standards for MWA have been established, the clinical evaluation of ablation surgery remains challenging and requires long-term patient follow-up for confirmation. In this paper, we propose a method termed respiratory subtraction to evaluate lung tumor ablation therapy performance based on pre- and post-operative image guidance. Initially, preoperative images undergo coarse rigid registration to their corresponding postoperative positions, followed by further non-rigid registration. Subsequently, subtraction images are generated by subtracting the registered preoperative images from the postoperative ones. Furthermore, to enhance the clinical assessment of MWA treatment performance, we devise a quantitative analysis metric to evaluate ablation efficacy by comparing differences between tumor areas and treatment areas. To the best of our knowledge, this is the first work in the field to facilitate the assessment of MWA surgery performance on pulmonary tumors. Extensive experiments involving 35 clinical cases confirm the effectiveness of the respiratory subtraction method and the proposed quantitative evaluation metric in assessing lung tumor treatment.
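The subtraction step in the respiratory subtraction abstract (postoperative image minus registered preoperative image) can be sketched on small 2D grids. The registration itself is not shown, and the abstract does not define its quantitative metric, so the `tumor_coverage` function below is a hypothetical stand-in: the fraction of tumor pixels that fall inside the treatment area.

```python
def subtract_images(post, registered_pre):
    """Pixel-wise difference: postoperative minus registered preoperative."""
    return [
        [p - q for p, q in zip(row_post, row_pre)]
        for row_post, row_pre in zip(post, registered_pre)
    ]


def tumor_coverage(tumor_mask, treatment_mask):
    """Fraction of tumor pixels covered by the treatment area.

    Hypothetical metric for illustration only; masks hold 0/1 values.
    Returns 0.0 when the tumor mask is empty.
    """
    tumor_pixels = 0
    covered_pixels = 0
    for row_tumor, row_treat in zip(tumor_mask, treatment_mask):
        for t, a in zip(row_tumor, row_treat):
            if t:
                tumor_pixels += 1
                if a:
                    covered_pixels += 1
    return covered_pixels / tumor_pixels if tumor_pixels else 0.0


if __name__ == "__main__":
    print(subtract_images([[5, 7], [9, 4]], [[1, 2], [3, 4]]))
    print(tumor_coverage([[1, 1], [0, 0]], [[1, 0], [1, 0]]))
```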

Intermittent Semi-working Mask: A New Masking Paradigm for LLMs

Aug 01, 2024

Mingcong Lu, Jiangcai Zhu, Wang Hao, Zheng Li, Shusheng Zhang, Kailai Shao, Chao Chen, Nan Li, Feng Wang, Xin Lu

Abstract: Multi-turn dialogues are a key interaction method between humans and Large Language Models (LLMs). As conversations extend over multiple rounds, maintaining high generation quality and low latency is a challenge. Mainstream LLMs can be grouped into two categories based on masking strategy: causal LLMs and prefix LLMs. Several works have demonstrated that prefix LLMs tend to outperform causal ones in scenarios that heavily depend on historical context, such as multi-turn dialogues or in-context learning, thanks to their bidirectional attention on prefix sequences. However, prefix LLMs are inherently inefficient to train on multi-turn dialogue datasets. In addition, the attention mechanism of a prefix LLM makes it unable to reuse the Key-Value Cache (KV Cache) across dialogue rounds to reduce generation latency. In this paper, we propose a novel masking scheme called Intermittent Semi-working Mask (ISM) to address these problems. Specifically, we apply alternating bidirectional and unidirectional attention on queries and answers in the dialogue history. In this way, ISM is able to maintain the high quality of a prefix LLM and the low generation latency of a causal LLM simultaneously. Extensive experiments illustrate that our ISM achieves strong performance.
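One possible reading of the alternating mask in the ISM abstract is sketched below: tokens inside a query segment attend to each other bidirectionally, answer tokens attend causally, and every token sees all earlier positions. This interpretation, the segment encoding, and the function name are assumptions for illustration; the paper's exact mask may differ.

```python
def ism_mask(segments):
    """Build a boolean attention mask for a dialogue.

    segments: list of (kind, length) pairs in dialogue order, where kind is
              'q' (query) or 'a' (answer).
    mask[i][j] is True when token i may attend to token j:
      - always when j <= i (causal part);
      - also when i and j lie in the same query segment (bidirectional part).
    """
    seg_ids, kinds = [], []
    for idx, (kind, length) in enumerate(segments):
        seg_ids += [idx] * length
        kinds += [kind] * length
    n = len(seg_ids)
    return [
        [j <= i or (kinds[i] == "q" and seg_ids[i] == seg_ids[j]) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two rounds: query (2 tokens), answer (2), query (2), answer (1).
    for row in ism_mask([("q", 2), ("a", 2), ("q", 2), ("a", 1)]):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this reading, no token in an earlier round ever attends to a later round, which is the property that would let cached keys and values from previous rounds be reused.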
