Zhen Li

LMO, CELESTE, HEC Paris

UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge

May 23, 2024

Chuanhao Li, Zhen Li, Chenchen Jing, Shuo Liu, Wenqi Shao, Yuwei Wu, Ping Luo, Yu Qiao, Kaipeng Zhang

Abstract: Large vision-language models (LVLMs), such as the LLaVA series, are ignorant of up-to-date knowledge because they cannot be updated frequently due to the large amount of resources required, and they therefore fail in many cases. For example, an LVLM released in January 2024 would not know the detailed plot of the new movie Dune 2, which was not released until February 2024. To solve the problem, a promising solution is to provide LVLMs with up-to-date knowledge via internet search during inference, i.e., internet-augmented generation (IAG), which is already integrated in some closed-source commercial LVLMs such as GPT-4V. However, the specific mechanics underpinning them remain a mystery. In this paper, we propose a plug-and-play framework, dubbed UDKAG, for augmenting existing LVLMs in handling visual question answering (VQA) about up-to-date knowledge. A hierarchical filtering model is trained to effectively and efficiently find the most helpful content from the websites returned by a search engine to prompt LVLMs with up-to-date knowledge. To train the model and evaluate our framework's performance, we propose a pipeline to automatically generate news-related VQA samples to construct a dataset, dubbed UDK-VQA. A multi-model voting mechanism is introduced to label the usefulness of website/content for VQA samples to construct the training set. Experimental results demonstrate the effectiveness of our framework, outperforming GPT-4V by about 25% in accuracy.

* 12 pages, 6 figures, a framework to augment large vision-language models with up-to-date knowledge

Generalize Polyp Segmentation via Inpainting across Diverse Backgrounds and Pseudo-Mask Refinement

May 21, 2024

Jiajian Ma, Fangqi Lu, Silin Huang, Song Wu, Zhen Li

Abstract: Inpainting lesions within different normal backgrounds is a potential method of addressing the generalization problem, which is crucial for polyp segmentation models. However, seamlessly introducing polyps into complex endoscopic environments while simultaneously generating accurate pseudo-masks remains a challenge for current inpainting methods. To address these issues, we first leverage the pre-trained Stable Diffusion Inpaint and ControlNet to introduce a robust generative model capable of inpainting polyps across different backgrounds. Second, we use the prior that synthetic polyps are confined to the inpainted region to establish an inpainted region-guided pseudo-mask refinement network. We also propose a sample selection strategy that prioritizes well-aligned and hard synthetic cases for further model fine-tuning. Experiments demonstrate that our inpainting model outperforms baseline methods both qualitatively and quantitatively in inpainting quality. Moreover, our data augmentation strategy significantly enhances the performance of polyp segmentation models on external datasets, achieving or surpassing the level of fully supervised training benchmarks in that domain. Our code is available at https://github.com/497662892/PolypInpainter.

SEGAN: semi-supervised learning approach for missing data imputation

May 21, 2024

Xiaohua Pan, Weifeng Wu, Peiran Liu, Zhen Li, Peng Lu, Peijian Cao, Jianfeng Zhang, Xianfei Qiu, YangYang Wu

Abstract: In many practical real-world applications, missing data is a very common phenomenon, making the development of data-driven artificial intelligence theory and technology increasingly difficult. Data completion is an important method for missing data preprocessing. Most existing missing data completion models directly use the known information in the missing data set but ignore the impact of the label information contained in the data set on the completion model. To this end, this paper proposes SEGAN, a missing data completion model based on semi-supervised learning, which mainly includes three important modules: a generator, a discriminator, and a classifier. In the SEGAN model, the classifier enables the generator to make fuller use of known data and its label information when predicting missing data values. In addition, the SEGAN model introduces a missing hint matrix to allow the discriminator to more effectively distinguish between known data and data filled in by the generator. This paper theoretically proves that the SEGAN model, with its classifier and missing hint matrix, can learn the real known data distribution characteristics when reaching Nash equilibrium. Finally, a large number of experiments were conducted, and the experimental results show that, compared with the current state-of-the-art multivariate data completion method, the performance of the SEGAN model is improved by more than 3%.
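The role of a hint matrix can be illustrated with a minimal sketch. It assumes the formulation commonly used in GAN-based imputation, where each entry of the observation mask is revealed to the discriminator with some probability and is otherwise replaced by an uninformative 0.5. Whether SEGAN uses exactly this form is an assumption, and the names `make_hint_matrix` and `hint_rate` are hypothetical.

```python
import random


def make_hint_matrix(mask, hint_rate=0.9, rng=None):
    """Build a hint matrix from an observation mask (assumed formulation).

    mask[i][j] is 1 where the value is observed and 0 where it is missing.
    Each entry is revealed with probability hint_rate; hidden entries are
    set to 0.5, which tells the discriminator nothing about that entry.
    """
    rng = rng or random.Random()
    hint = []
    for row in mask:
        hint_row = []
        for m in row:
            revealed = rng.random() < hint_rate
            hint_row.append(float(m) if revealed else 0.5)
        hint.append(hint_row)
    return hint


# Toy mask for a 2 x 3 data matrix with two missing entries.
mask = [[1, 0, 1], [0, 1, 1]]
hint = make_hint_matrix(mask, hint_rate=0.9, rng=random.Random(0))
```

With `hint_rate=1.0` the hint equals the mask, and with `hint_rate=0.0` every entry is 0.5, so the rate controls how much of the mask the discriminator is told.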

Time Evidence Fusion Network: Multi-source View in Long-Term Time Series Forecasting

May 10, 2024

Tianxiang Zhan, Yuanpeng He, Zhen Li, Yong Deng

Abstract: In real-world scenarios, time series forecasting often demands timeliness, making research on model backbones a perennially hot topic. To meet these performance demands, we propose a novel backbone from the perspective of information fusion. Introducing the Basic Probability Assignment (BPA) Module and the Time Evidence Fusion Network (TEFN), based on evidence theory, allows us to achieve superior performance. In addition, the perspective of multi-source information fusion effectively improves the accuracy of forecasting. Because BPA is generated by fuzzy theory, TEFN also has considerable interpretability. In real-data experiments, TEFN partially achieved state-of-the-art performance, with low errors comparable to PatchTST and operating efficiency surpassing that of performance-oriented models such as DLinear. Meanwhile, TEFN shows high robustness and small error fluctuations under random hyperparameter selection. TEFN is not a model that achieves the ultimate in any single aspect, but a model that balances performance, accuracy, stability, and interpretability.

Instance-free Text to Point Cloud Localization with Relative Position Awareness

Apr 27, 2024

Lichao Wang, Zhihao Yuan, Jinke Ren, Shuguang Cui, Zhen Li

Abstract: Text-to-point-cloud cross-modal localization is an emerging vision-language task critical for future robot-human collaboration. It seeks to localize a position from a city-scale point cloud scene based on a few natural language instructions. In this paper, we address two key limitations of existing approaches: 1) their reliance on ground-truth instances as input; and 2) their neglect of the relative positions among potential instances. Our proposed model follows a two-stage pipeline, including a coarse stage for text-cell retrieval and a fine stage for position estimation. In both stages, we introduce an instance query extractor, in which the cells are encoded by a 3D sparse convolution U-Net to generate the multi-scale point cloud features, and a set of queries iteratively attend to these features to represent instances. In the coarse stage, a row-column relative position-aware self-attention (RowColRPA) module is designed to capture the spatial relations among the instance queries. In the fine stage, a multi-modal relative position-aware cross-attention (RPCA) module is developed to fuse the text and point cloud features along with spatial relations for improving fine position estimation. Experimental results on the KITTI360Pose dataset demonstrate that our model achieves competitive performance with the state-of-the-art models without taking ground-truth instances as input.

* 12 pages, 10 figures, conference

GauU-Scene V2: Assessing the Reliability of Image-Based Metrics with Expansive Lidar Image Dataset Using 3DGS and NeRF

Apr 13, 2024

Butian Xiong, Nanjun Zheng, Junhua Liu, Zhen Li

Abstract: We introduce a novel, multimodal large-scale scene reconstruction benchmark that utilizes newly developed 3D representation approaches: Gaussian Splatting and Neural Radiance Fields (NeRF). Our expansive U-Scene dataset surpasses any previously existing real large-scale outdoor LiDAR and image dataset in both area and point count. GauU-Scene encompasses over 6.5 square kilometers and features a comprehensive RGB dataset coupled with LiDAR ground truth. Additionally, we are the first to propose a LiDAR and image alignment method for a drone-based dataset. Our assessment of GauU-Scene includes a detailed analysis across various novel viewpoints, employing image-based metrics such as SSIM, LPIPS, and PSNR on NeRF and Gaussian Splatting based methods. This analysis reveals contradictory results when applying geometric-based metrics like Chamfer distance. The experimental results on our multimodal dataset highlight the unreliability of current image-based metrics and reveal significant drawbacks in geometric reconstruction using the current Gaussian Splatting-based method, further illustrating the necessity of our dataset for assessing geometry reconstruction tasks. We also provide detailed supplementary information on data collection protocols and make the dataset available on the following anonymous project page

Extracting Clean and Balanced Subset for Noisy Long-tailed Classification

Apr 10, 2024

Zhuo Li, He Zhao, Zhen Li, Tongliang Liu, Dandan Guo, Xiang Wan

Abstract: Real-world datasets are usually class-imbalanced and corrupted by label noise. To solve the joint issue of long-tailed distribution and label noise, most previous works aim to design a noise detector to distinguish the noisy and clean samples. Despite their effectiveness, they may be limited in handling the joint issue effectively in a unified way. In this work, we develop a novel pseudo-labeling method using class prototypes from the perspective of distribution matching, which can be solved with optimal transport (OT). By setting a manually specified probability measure and using a learned transport plan to pseudo-label the training samples, the proposed method can reduce the side effects of noisy and long-tailed data simultaneously. Then we introduce a simple yet effective filter criterion that combines the observed labels and pseudo-labels to obtain a more balanced and less noisy subset for robust model training. Extensive experiments demonstrate that our method can extract this class-balanced subset with clean labels, which brings effective performance gains for long-tailed classification with label noise.
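The idea of pseudo-labeling through a transport plan can be sketched with entropic-regularized OT solved by Sinkhorn iterations. This is an illustrative sketch only: the abstract does not say which OT solver or cost the authors use, so the Sinkhorn solver, the squared-distance cost, the 1-D toy features, and the names `sinkhorn_plan` and `pseudo_labels` are all assumptions.

```python
import math


def sinkhorn_plan(cost, class_marginal, eps=0.1, n_iters=200):
    """Entropic OT between uniformly weighted samples and class prototypes.

    cost[i][j] is the cost of assigning sample i to class j, and
    class_marginal is the manually specified mass each class should receive.
    """
    n, k = len(cost), len(class_marginal)
    sample_marginal = [1.0 / n] * n
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match both marginals.
        u = [sample_marginal[i] / sum(kernel[i][j] * v[j] for j in range(k))
             for i in range(n)]
        v = [class_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


def pseudo_labels(plan):
    """Label each sample with the class that receives most of its mass."""
    return [max(range(len(row)), key=row.__getitem__) for row in plan]


# Toy 1-D features; the class prototypes are assumed to sit at 0.0 and 1.0.
features = [0.0, 0.1, 0.2, 0.9]
prototypes = [0.0, 1.0]
cost = [[(x - p) ** 2 for p in prototypes] for x in features]
plan = sinkhorn_plan(cost, class_marginal=[0.5, 0.5])
labels = pseudo_labels(plan)
```

In this toy case three of the four samples lie nearest the first prototype, but the balanced class marginal forces half of the mass onto each class, so the sample at 0.2 is pushed to the second class. That is the sense in which a chosen marginal can counteract a long-tailed distribution.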

VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs

Apr 09, 2024

Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Shaoling Dong, Xing Zhou, Wenbin Jiang

Abstract: Automatically generating UI code from webpage design visions can significantly alleviate the burden on developers, enabling beginner developers or designers to directly generate Web pages from design diagrams. Prior research has accomplished the objective of generating UI code from rudimentary design visions or sketches by designing deep neural networks. Inspired by the groundbreaking advancements achieved by Multimodal Large Language Models (MLLMs), the automatic generation of UI code from high-fidelity design images is now emerging as a viable possibility. Nevertheless, our investigation reveals that existing MLLMs are hampered by the scarcity of authentic, high-quality, and large-scale datasets, leading to unsatisfactory performance in automated UI code generation. To mitigate this gap, we present a novel dataset, termed VISION2UI, extracted from real-world scenarios, augmented with comprehensive layout information, and tailored specifically for finetuning MLLMs in UI code generation. Specifically, this dataset is derived through a series of operations, encompassing collecting, cleaning, and filtering of the open-source Common Crawl dataset. To uphold its quality, a neural scorer trained on labeled samples is used to refine the data, retaining higher-quality instances. Ultimately, this process yields a dataset comprising 2,000 (much more is coming soon) parallel samples encompassing design visions and UI code. The dataset is available at https://huggingface.co/datasets/xcodemind/vision2ui.

Random Walk in Random Permutation Set Theory

Apr 05, 2024

Jiefeng Zhou, Zhen Li, Yong Deng

Abstract: Random walk is an explainable approach for modeling natural processes at the molecular level. The Random Permutation Set Theory (RPST) serves as a framework for uncertainty reasoning, extending the applicability of Dempster-Shafer Theory. Recent explorations indicate a promising link between RPST and random walk. In this study, we conduct an analysis and construct a random walk model based on the properties of RPST, with Monte Carlo simulations of such random walk. Our findings reveal that the random walk generated through RPST exhibits characteristics similar to those of a Gaussian random walk and can be transformed into a Wiener process through a specific limiting scaling procedure. This investigation establishes a novel connection between RPST and random walk theory, thereby not only expanding the applicability of RPST, but also demonstrating the potential for combining the strengths of both approaches to improve problem-solving abilities.

* 27 pages, 8 figures
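The limiting scaling mentioned in the abstract can be illustrated with a plain Gaussian random walk. The sketch below is not the RPST-generated walk from the paper. It only shows the standard diffusive scaling, in which the sum of n unit-variance steps is divided by sqrt(n), so that the endpoint has variance close to 1, matching a Wiener process at time t = 1. The function names are hypothetical.

```python
import math
import random


def scaled_walk_endpoint(n_steps, rng):
    """Endpoint of a Gaussian random walk after diffusive scaling."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n_steps))
    return total / math.sqrt(n_steps)


def endpoint_variance(n_steps, n_paths, seed=0):
    """Monte Carlo estimate of the variance of the scaled endpoint."""
    rng = random.Random(seed)
    samples = [scaled_walk_endpoint(n_steps, rng) for _ in range(n_paths)]
    mean = sum(samples) / n_paths
    return sum((x - mean) ** 2 for x in samples) / (n_paths - 1)


# The estimate should be close to 1 regardless of the number of steps.
estimate = endpoint_variance(n_steps=50, n_paths=2000)
```

Repeating the estimate for several values of `n_steps` is a quick way to check that the scaled endpoint variance stays near 1 as the walk is refined.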

An Open-source End-to-End Logic Optimization Framework for Large-scale Boolean Network with Reinforcement Learning

Mar 26, 2024

Zhen Li, Kaixiang Zhu, Xuegong Zhou, Lingli Wang

Abstract: We propose an open-source end-to-end logic optimization framework for large-scale Boolean networks with reinforcement learning.

* 5 pages, 4 figures, 1 table
