Yi Jiang

Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Dec 05, 2024

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu

Abstract: We present Infinity, a Bitwise Visual AutoRegressive Modeling approach capable of generating high-resolution, photorealistic images following language instruction. Infinity redefines the visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary tokenizer and classifier and a bitwise self-correction mechanism, remarkably improving the generation capacity and details. By theoretically scaling the tokenizer vocabulary size to infinity and concurrently scaling the transformer size, our method significantly unleashes powerful scaling capabilities compared to vanilla VAR. Infinity sets a new record for autoregressive text-to-image models, outperforming top-tier diffusion models like SD3-Medium and SDXL. Notably, Infinity surpasses SD3-Medium by improving the GenEval benchmark score from 0.62 to 0.73 and the ImageReward benchmark score from 0.87 to 0.96, achieving a win rate of 66%. Without extra optimization, Infinity generates a high-quality 1024x1024 image in 0.8 seconds, making it 2.6x faster than SD3-Medium and establishing it as the fastest text-to-image model. Models and code will be released to promote further exploration of Infinity for visual generation and unified tokenizer modeling.

* 17 pages, 14 figures
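
To illustrate why predicting bits instead of vocabulary indices lets the vocabulary grow so large, here is a minimal sketch. It is not the authors' implementation: the sign-based quantizer, the function names, and the assumption of one two-way classifier per bit are all hypothetical choices made for illustration.

```python
def bits_from_latent(latent):
    """Hypothetical binary quantizer: one bit per channel, taken from the sign."""
    return [1 if value > 0 else 0 for value in latent]


def index_from_bits(bits):
    """Pack a list of bits (most significant first) into a vocabulary index."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


def classifier_head_params(hidden_dim, num_bits):
    """Compare output-layer weight counts (biases ignored).

    index_head:   one logit per vocabulary entry, so 2**num_bits outputs.
    bitwise_head: assumed two logits (0 or 1) per bit, so 2 * num_bits outputs.
    """
    index_head = hidden_dim * (2 ** num_bits)
    bitwise_head = hidden_dim * 2 * num_bits
    return index_head, bitwise_head


if __name__ == "__main__":
    bits = bits_from_latent([0.3, -1.2, 0.8, -0.1])
    print(bits, index_from_bits(bits))
    print(classifier_head_params(hidden_dim=2048, num_bits=16))
```

Under these assumptions the index-based head grows exponentially with the number of bits, while the bitwise head grows linearly, which is the intuition behind scaling the vocabulary without scaling the classifier to match.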

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Dec 04, 2024

Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, Xinglong Wu

Abstract: We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior research attempts to employ a single reconstruction-targeted Vector Quantization (VQ) encoder for unifying these two tasks. We observe that understanding and generation require fundamentally different granularities of visual information. This leads to a critical trade-off, particularly compromising performance in multimodal understanding tasks. TokenFlow addresses this challenge through an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment via a shared mapping mechanism. This design enables direct access to both high-level semantic representations crucial for understanding tasks and fine-grained visual features essential for generation through shared indices. Our extensive experiments demonstrate TokenFlow's superiority across multiple dimensions. Leveraging TokenFlow, we demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance, achieving a 7.2% average improvement. For image reconstruction, we achieve a strong FID score of 0.63 at 384×384 resolution. Moreover, TokenFlow establishes state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution, achieving comparable results to SDXL.

* https://byteflow-ai.github.io/TokenFlow/

MapEval: Towards Unified, Robust and Efficient SLAM Map Evaluation Framework

Nov 26, 2024

Xiangcheng Hu, Jin Wu, Mingkai Jia, Hongyu Yan, Yi Jiang, Binqian Jiang, Wei Zhang, Wei He, Ping Tan

Abstract: Evaluating massive-scale point cloud maps in Simultaneous Localization and Mapping (SLAM) remains challenging, primarily due to the absence of unified, robust and efficient evaluation frameworks. We present MapEval, an open-source framework for comprehensive quality assessment of point cloud maps, specifically addressing SLAM scenarios where the ground truth map is inherently sparse compared to the mapped environment. Through systematic analysis of existing evaluation metrics in SLAM applications, we identify their fundamental limitations and establish clear guidelines for consistent map quality assessment. Building upon these insights, we propose a novel Gaussian-approximated Wasserstein distance in voxelized space, enabling two complementary metrics under the same error standard: Voxelized Average Wasserstein Distance (AWD) for global geometric accuracy and Spatial Consistency Score (SCS) for local consistency evaluation. This theoretical foundation leads to significant improvements in both robustness against noise and computational efficiency compared to conventional metrics. Extensive experiments on both simulated and real-world datasets demonstrate that MapEval is at least 100-500 times faster while maintaining evaluation integrity. The MapEval library (https://github.com/JokerJohn/Cloud_Map_Evaluation) will be publicly available to promote standardized map evaluation practices in the robotics community.

* 8 pages, 7 figures, 7 tables
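
The idea of a Gaussian-approximated Wasserstein distance in voxelized space can be sketched as follows. This is an illustrative assumption, not MapEval's code: it fits a Gaussian to the points in each voxel, applies the standard closed-form 2-Wasserstein distance between two Gaussians, and averages over voxels occupied in both clouds. The function names, `voxel_size`, and `min_points` are hypothetical.

```python
import numpy as np


def _sqrtm_psd(mat):
    """Square root of a symmetric positive semi-definite matrix via eigh."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2(points_a, points_b):
    """2-Wasserstein distance between Gaussians fitted to two (N, 3) point sets."""
    mu_a, mu_b = points_a.mean(axis=0), points_b.mean(axis=0)
    cov_a = np.cov(points_a, rowvar=False)
    cov_b = np.cov(points_b, rowvar=False)
    root_b = _sqrtm_psd(cov_b)
    cross = _sqrtm_psd(root_b @ cov_a @ root_b)
    w2_sq = np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2.0 * cross)
    return float(np.sqrt(max(w2_sq, 0.0)))


def voxel_average_w2(cloud_a, cloud_b, voxel_size=1.0, min_points=5):
    """Average per-voxel Gaussian W2 over voxels occupied in both clouds."""
    def group(cloud):
        keys = np.floor(cloud / voxel_size).astype(int)
        buckets = {}
        for key, point in zip(map(tuple, keys), cloud):
            buckets.setdefault(key, []).append(point)
        # Drop sparse voxels, where a covariance estimate is unreliable.
        return {k: np.asarray(v) for k, v in buckets.items() if len(v) >= min_points}

    vox_a, vox_b = group(cloud_a), group(cloud_b)
    shared = vox_a.keys() & vox_b.keys()
    if not shared:
        return float("nan")
    return float(np.mean([gaussian_w2(vox_a[k], vox_b[k]) for k in shared]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.uniform(0.0, 2.0, size=(2000, 3))
    noisy = reference + rng.normal(scale=0.02, size=reference.shape)
    print(voxel_average_w2(reference, noisy))
```

Because each voxel is summarized by a mean and a covariance, the comparison avoids per-point nearest-neighbour searches, which is one plausible reason a voxelized metric of this kind can be cheap to evaluate.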

NACNet: A Histology Context-aware Transformer Graph Convolution Network for Predicting Treatment Response to Neoadjuvant Chemotherapy in Triple Negative Breast Cancer

Nov 14, 2024

Qiang Li, George Teodoro, Yi Jiang, Jun Kong

Abstract: Neoadjuvant chemotherapy (NAC) response prediction for triple negative breast cancer (TNBC) patients is a challenging task clinically as it requires understanding complex histology interactions within the tumor microenvironment (TME). Digital whole slide images (WSIs) capture detailed tissue information, but their giga-pixel size necessitates computational methods based on multiple instance learning, which typically analyze small, isolated image tiles without the spatial context of the TME. To address this limitation and incorporate TME spatial histology interactions in predicting NAC response for TNBC patients, we developed a histology context-aware transformer graph convolution network (NACNet). Our deep learning method identifies the histopathological labels on individual image tiles from WSIs, constructs a spatial TME graph, and represents each node with features derived from tissue texture and social network analysis. It predicts NAC response using a transformer graph convolution network model enhanced with graph isomorphism network layers. We evaluate our method with WSIs of a cohort of TNBC patients (N=105) and compare its performance with multiple state-of-the-art machine learning and deep learning models, including both graph and non-graph approaches. Our NACNet achieves 90.0% accuracy, 96.0% sensitivity, 88.0% specificity, and an AUC of 0.82, through eight-fold cross-validation, outperforming baseline models. These comprehensive experimental results suggest that NACNet holds strong potential for stratifying TNBC patients by NAC response, thereby helping to prevent overtreatment, improve patient quality of life, reduce treatment cost, and enhance clinical outcomes, marking an important advancement toward personalized breast cancer treatment.

* This paper is accepted by Computerized Medical Imaging and Graphics (Nov 07 2024)

PEAR: A Robust and Flexible Automation Framework for Ptychography Enabled by Multiple Large Language Model Agents

Oct 11, 2024

Xiangyu Yin, Chuqiao Shi, Yimo Han, Yi Jiang

Abstract: Ptychography is an advanced computational imaging technique in X-ray and electron microscopy. It has been widely adopted across scientific research fields, including physics, chemistry, biology, and materials science, as well as in industrial applications such as semiconductor characterization. In practice, obtaining high-quality ptychographic images requires simultaneous optimization of numerous experimental and algorithmic parameters. Traditionally, parameter selection often relies on trial and error, leading to low-throughput workflows and potential human bias. In this work, we develop the "Ptychographic Experiment and Analysis Robot" (PEAR), a framework that leverages large language models (LLMs) to automate data analysis in ptychography. To ensure high robustness and accuracy, PEAR employs multiple LLM agents for tasks including knowledge retrieval, code generation, parameter recommendation, and image reasoning. Our study demonstrates that PEAR's multi-agent design significantly improves the workflow success rate, even with smaller open-weight models such as LLaMA 3.1 8B. PEAR also supports various automation levels and is designed to work with customized local knowledge bases, ensuring flexibility and adaptability across different research environments.

* 18 pages, 5 figures, technical preview report

Copiloting Diagnosis of Autism in Real Clinical Scenarios via LLMs

Oct 10, 2024

Yi Jiang, Qingyang Shen, Shuzhong Lai, Shunyu Qi, Qian Zheng, Lin Yao, Yueming Wang, Gang Pan

Abstract: Autism spectrum disorder (ASD) is a pervasive developmental disorder that significantly impacts the daily functioning and social participation of individuals. Despite the abundance of research focused on supporting the clinical diagnosis of ASD, there is still a lack of systematic and comprehensive exploration of methods based on Large Language Models (LLMs), particularly regarding real-world clinical diagnostic scenarios based on the Autism Diagnostic Observation Schedule, Second Edition (ADOS-2). Therefore, we have proposed a framework called ADOS-Copilot, which strikes a balance between scoring and explanation, and explored the factors that influence the performance of LLMs in this task. The experimental results indicate that our proposed framework is competitive with the diagnostic results of clinicians, with a minimum MAE of 0.4643, binary classification F1-score of 81.79%, and ternary classification F1-score of 78.37%. Furthermore, we have systematically elucidated the strengths and limitations of current LLMs in this task from the perspectives of ADOS-2, LLMs' capabilities, language, and model scale, aiming to inspire and guide the future application of LLMs in broader fields of mental health disorders. We hope for more research to be transferred into real clinical practice, opening a window of kindness to the world for eccentric children.

SIMRP: Self-Interference Mitigation Using RIS and Phase Shifter Network

Sep 13, 2024

Zhang Wei, Chen Ding, Bin Zhou, Yi Jiang, Zhiyong Bu

Abstract: Strong self-interference due to the co-located transmitter is the bottleneck for implementing an in-band full-duplex (IBFD) system. If not adequately mitigated, the strong interference can saturate the receiver's analog-digital converters (ADCs) and hence void the digital processing. This paper considers utilizing a reconfigurable intelligent surface (RIS), together with a receiving (Rx) phase shifter network (PSN), to mitigate the strong self-interference through jointly optimizing their phases. This method, named self-interference mitigation using RIS and PSN (SIMRP), can suppress self-interference to avoid ADC saturation effectively and therefore improve the sum rate performance of communication systems, as verified by the simulation studies.

* 6 pages, 4 figures, accepted by IEEE WCSP 2024

Phases Calibration of RIS Using Backpropagation Algorithm

Jul 16, 2024

Wei Zhang, Bin Zhou, Tianyi Zhang, Yi Jiang, Zhiyong Bu

Abstract: Reconfigurable intelligent surface (RIS) technology has emerged in recent years as a promising solution to the ever-increasing demand for wireless communication capacity. In practice, however, elements of RIS may suffer from phase deviations, which need to be properly estimated and calibrated. This paper models the problem of over-the-air (OTA) estimation of the RIS elements as a quasi-neural network (QNN) so that the phase estimates can be obtained using the classic backpropagation (BP) algorithm. We also derive the Cramér-Rao Bounds (CRBs) for the phases of the RIS elements as a benchmark of the proposed approach. The simulation results verify the effectiveness of the proposed algorithm by showing that the root mean square errors (RMSEs) of the phase estimates are close to the CRBs.

* 5 pages, 5 figures, accepted by IEEE/CIC ICCC 2024

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Jun 13, 2024

Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, Yu-Gang Jiang

Abstract: The tokenizer, serving as a translator to map the intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window and causal attention for spatial and temporal modeling. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data at a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data at multiple resolutions to learn the temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a unified framework and proves the possibility of realizing their synergy. Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. We also show that when integrated with OmniTokenizer, both language model-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method. Code is available at https://github.com/FoundationVision/OmniTokenizer.

ReCon1M: A Large-scale Benchmark Dataset for Relation Comprehension in Remote Sensing Imagery

Jun 10, 2024

Xian Sun, Qiwei Yan, Chubo Deng, Chenglong Liu, Yi Jiang, Zhongyan Hou, Wanxuan Lu, Fanglong Yao, Xiaoyu Liu, Lingxiang Hao (+1 more)

Abstract: Scene Graph Generation (SGG) is a high-level visual understanding and reasoning task aimed at extracting entities (such as objects) and their interrelationships from images. Significant progress has been made in the study of SGG in natural images in recent years, but its exploration in the domain of remote sensing images remains very limited. The complex characteristics of remote sensing images necessitate higher time and manual interpretation costs for annotation compared to natural images. The lack of a large-scale public SGG benchmark is a major impediment to the advancement of SGG-related research in aerial imagery. In this paper, we introduce ReCon1M, the first publicly available large-scale, million-level relation dataset in the field of remote sensing images. Specifically, our dataset is built upon Fair1M and comprises 21,392 images. It includes annotations for 859,751 object bounding boxes across 60 different categories, and 1,149,342 relation triplets across 64 categories based on these bounding boxes. We provide a detailed description of the dataset's characteristics and statistical information. We conducted two object detection tasks and three sub-tasks within SGG on this dataset, assessing the performance of mainstream methods on these tasks.
