Lin Chen

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Nov 28, 2023
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin

In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a pioneering large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from 100K high-quality captions curated from advanced GPT4-Vision and is expanded to 1.2M with a superb caption model trained on this subset. ShareGPT4V first demonstrates its effectiveness in the Supervised Fine-Tuning (SFT) phase: substituting an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions significantly enhances LMMs such as LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple architecture that achieves remarkable performance across a majority of multi-modal benchmarks. This project is available at https://ShareGPT4V.github.io and serves as a pivotal resource for advancing the LMMs community.

* Project: https://ShareGPT4V.github.io 

CurriculumLoc: Enhancing Cross-Domain Geolocalization through Multi-Stage Refinement

Nov 20, 2023
Boni Hu, Lin Chen, Runjian Chen, Shuhui Bu, Pengcheng Han, Haowei Li

Visual geolocalization is a cost-effective and scalable task that involves matching one or more query images, taken at an unknown location, to a set of geo-tagged reference images. Existing methods focus on semantic feature representations and have evolved toward robustness against the wide variations between query and reference images, including illumination and viewpoint changes as well as scale and seasonal variations. However, practical visual geolocalization approaches must remain robust under appearance changes and extreme viewpoint variation while still providing accurate global location estimates. We therefore take inspiration from curriculum design, in which humans learn general knowledge first and then move on to professional expertise: we first recognize the semantic scene and then measure its geometric structure. Our approach, termed CurriculumLoc, couples a carefully designed multi-stage refinement pipeline with a novel keypoint detection and description scheme that combines global semantic awareness and local geometric verification. Based on these keypoints and their descriptors, we rerank candidates and solve a cross-domain perspective-n-point (PnP) problem, refining the position estimate incrementally. Extensive experiments on our collected dataset, TerraTrack, and on the benchmark dataset ALTO demonstrate that our approach attains the desirable characteristics of a practical visual geolocalization solution. Additionally, we achieve new high recall@1 scores of 62.6% and 94.5% on ALTO under two different distance metrics. The dataset, code, and trained models are publicly available at https://github.com/npupilab/CurriculumLoc.

* 14 pages, 15 figures 
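
A minimal sketch of the re-ranking and cross-domain PnP refinement step described above, assuming detected keypoints, descriptors, and geo-referenced 3D points for each candidate are already available; the function name, data layout, and camera intrinsics K are illustrative assumptions, not the released code:

```python
import cv2
import numpy as np

def rerank_and_refine(query_desc, query_kpts, candidates, K):
    """query_kpts: (N, 2) pixel coordinates; candidates: list of dicts with
    'desc' (M, D) descriptors and 'kpts_3d' (M, 3) geo-referenced points."""
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    best = None
    for cand in candidates:
        matches = matcher.match(query_desc, cand["desc"])
        if len(matches) < 6:
            continue
        pts_2d = np.float32([query_kpts[m.queryIdx] for m in matches])
        pts_3d = np.float32([cand["kpts_3d"][m.trainIdx] for m in matches])
        # Local geometric verification: solve PnP with RANSAC, score by inlier count.
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None)
        if ok and inliers is not None and (best is None or len(inliers) > best[0]):
            best = (len(inliers), rvec, tvec)
    return best  # (inlier_count, rotation, translation) of the best-ranked candidate
```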

Greedy PIG: Adaptive Integrated Gradients

Nov 10, 2023
Kyriakos Axiotis, Sami Abu-al-haija, Lin Chen, Matthew Fahrbach, Gang Fu

Deep learning has become the standard approach for most machine learning tasks. While its impact is undeniable, interpreting the predictions of deep learning models from a human perspective remains a challenge. In contrast to model training, model interpretability is harder to quantify and to pose as an explicit optimization problem. Inspired by the AUC softmax information curve (AUC SIC) metric for evaluating feature attribution methods, we propose a unified discrete optimization framework for feature attribution and feature selection based on subset selection. This leads to a natural adaptive generalization of the path integrated gradients (PIG) method for feature attribution, which we call Greedy PIG. We demonstrate the success of Greedy PIG on a wide variety of tasks, including image feature attribution, graph compression/explanation, and post-hoc feature selection on tabular data. Our results show that introducing adaptivity is a powerful and versatile way to make attribution methods more effective.
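
To make the adaptive idea concrete, here is a rough sketch of how path integrated gradients might be combined with greedy subset selection on a tabular feature vector; the masking scheme and function names are illustrative assumptions, not the authors' implementation:

```python
import torch

def integrated_gradients(model, x, baseline, steps=32):
    """Riemann approximation of path integrated gradients from `baseline` to `x`
    for a 1-D feature vector, attributing the summed model output."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    path = baseline + alphas * (x - baseline)          # (steps, n_features)
    path.requires_grad_(True)
    grads = torch.autograd.grad(model(path).sum(), path)[0]
    return (x - baseline) * grads.mean(dim=0)          # per-feature attribution

def greedy_pig_selection(model, x, baseline, k=5, steps=32):
    """Greedily pick k features, re-attributing after each choice."""
    selected = []
    for _ in range(k):
        # Adaptive baseline: reveal already-selected features, keep the rest masked.
        masked = baseline.clone()
        for i in selected:
            masked[i] = x[i]
        attr = integrated_gradients(model, x, masked, steps)
        if selected:
            attr[selected] = float("-inf")             # never pick a feature twice
        selected.append(int(attr.argmax()))
    return selected
```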


It's an Alignment, Not a Trade-off: Revisiting Bias and Variance in Deep Models

Oct 13, 2023
Lin Chen, Michal Lukasik, Wittawat Jitkrittum, Chong You, Sanjiv Kumar

Classical wisdom in machine learning holds that the generalization error can be decomposed into bias and variance, and these two terms exhibit a \emph{trade-off}. However, in this paper, we show that for an ensemble of deep learning based classification models, bias and variance are \emph{aligned} at a sample level, where squared bias is approximately \emph{equal} to variance for correctly classified sample points. We present empirical evidence confirming this phenomenon in a variety of deep learning models and datasets. Moreover, we study this phenomenon from two theoretical perspectives: calibration and neural collapse. We first show theoretically that under the assumption that the models are well calibrated, we can observe the bias-variance alignment. Second, starting from the picture provided by the neural collapse theory, we show an approximate correlation between bias and variance.
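
As a concrete illustration of the quantity being measured, the sketch below estimates per-sample squared bias and variance from an ensemble of softmax classifiers under the squared loss; the array layout is an assumption, and the alignment claim above is that the two returned vectors are approximately equal on correctly classified samples:

```python
import numpy as np

def per_sample_bias_variance(probs, onehot_labels):
    """probs: (n_models, n_samples, n_classes) softmax outputs of an ensemble;
    onehot_labels: (n_samples, n_classes). Returns per-sample squared bias and
    variance under the squared loss."""
    mean_pred = probs.mean(axis=0)                               # E_f[f(x)]
    bias_sq = ((mean_pred - onehot_labels) ** 2).sum(axis=-1)    # ||E_f[f(x)] - y||^2
    variance = ((probs - mean_pred) ** 2).sum(axis=-1).mean(axis=0)
    return bias_sq, variance   # alignment: bias_sq ≈ variance on correct samples
```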


Roulette: A Semantic Privacy-Preserving Device-Edge Collaborative Inference Framework for Deep Learning Classification Tasks

Sep 06, 2023
Jingyi Li, Guocheng Liao, Lin Chen, Xu Chen

Deep learning classifiers are crucial in the age of artificial intelligence. Device-edge collaborative inference has been widely adopted as an efficient framework for deploying them in IoT and 5G/6G networks, but it suffers from accuracy degradation under non-i.i.d. data distributions and from privacy disclosure. For accuracy degradation, directly applying transfer learning or split learning is costly, and privacy issues remain. For privacy disclosure, cryptography-based approaches incur a huge overhead, while other lightweight methods assume that the ground truth is non-sensitive and can be exposed; yet for many applications, the ground truth is precisely the user's privacy-sensitive information. In this paper, we propose Roulette, a task-oriented, semantic privacy-preserving collaborative inference framework for deep learning classifiers. Beyond the input data, we treat the ground truth of the data as private information. We develop a novel split-learning paradigm in which the back-end DNN is frozen and the front-end DNN is retrained to act as both a feature extractor and an encryptor. Moreover, we provide a differential privacy guarantee and analyze the hardness of ground-truth inference attacks. To validate Roulette, we conduct extensive performance evaluations on realistic datasets, which demonstrate that it effectively defends against various attacks while maintaining good model accuracy. In a setting where the data distribution is severely non-i.i.d., Roulette improves inference accuracy by 21% averaged over benchmarks, while reducing the accuracy of discrimination attacks to nearly that of random guessing.
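
The split-learning setup described above (frozen back-end, retrained front-end) might look roughly like the following PyTorch sketch; the toy networks and the plain cross-entropy objective are placeholders, and the paper's actual privacy regularization and differential privacy mechanism are not reproduced here:

```python
import torch
import torch.nn as nn

# Toy split: a device-side front-end and a frozen edge-side back-end.
front_end = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(8), nn.Flatten())
back_end = nn.Sequential(nn.Linear(16 * 8 * 8, 10))   # kept frozen on the edge

for p in back_end.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(front_end.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    feats = front_end(x)         # device-side features sent to the edge server
    logits = back_end(feats)     # frozen back-end only runs forward
    loss = criterion(logits, y)  # task loss; a privacy regularizer would be added here
    opt.zero_grad()
    loss.backward()              # gradients update only the front-end
    opt.step()
    return loss.item()
```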


Efficient Commercial Bank Customer Credit Risk Assessment Based on LightGBM and Feature Engineering

Aug 17, 2023
Yanjie Sun, Zhike Gong, Quan Shi, Lin Chen

Effective control of credit risk is a key link in the steady operation of commercial banks. This paper builds on a Kaggle customer-information dataset from a foreign commercial bank and uses the LightGBM algorithm to train a classifier that helps the bank judge the likelihood of customer credit default. The paper focuses on feature engineering, including missing-value handling, encoding, and imbalanced-sample treatment, which greatly improves the machine learning results. Its main contribution is the construction of new feature attributes on top of the original dataset, which raises the classifier's accuracy to 0.734 and its AUC to 0.772, outperforming many classifiers built on the same dataset. The model can serve as a reference for commercial banks' credit granting and also offers feature-processing ideas for other similar studies.
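
A minimal sketch of this kind of pipeline using the public LightGBM and scikit-learn APIs; the file name, label column, and hyperparameters are illustrative assumptions rather than the paper's exact setup:

```python
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("credit_data.csv")                    # hypothetical file name
df = df.fillna(df.median(numeric_only=True))           # simple missing-value handling
X = pd.get_dummies(df.drop(columns=["default"]))       # "default" = assumed 0/1 label
y = df["default"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05,
                         class_weight="balanced")      # mitigate class imbalance
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
print("accuracy:", accuracy_score(y_te, proba > 0.5))
print("AUC:", roc_auc_score(y_te, proba))
```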


An End-to-End Framework of Road User Detection, Tracking, and Prediction from Monocular Images

Aug 09, 2023
Hao Cheng, Mengmeng Liu, Lin Chen

Perception, which involves multi-object detection and tracking, and trajectory prediction are two major tasks of autonomous driving. However, they are currently mostly studied separately, so most trajectory prediction modules are developed on ground-truth trajectories without accounting for the fact that trajectories extracted from detection and tracking modules in real-world scenarios are noisy. These noisy trajectories can significantly degrade the trajectory predictor and lead to serious prediction errors. In this paper, we build an end-to-end framework for detection, tracking, and trajectory prediction called ODTP (Online Detection, Tracking and Prediction). It adopts the state-of-the-art online multi-object tracking model QD-3DT for perception and trains the trajectory predictor DCENet++ directly on the detection results rather than relying purely on ground-truth trajectories. We evaluate ODTP on the widely used nuScenes autonomous driving dataset. Extensive experiments show that ODTP achieves high-performance end-to-end trajectory prediction. DCENet++, with its enhanced dynamic maps, predicts more accurate trajectories than its base model and is also more robust than other generative and deterministic trajectory prediction models trained on noisy detection results.
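
Conceptually, the key training choice is to fit the predictor on the tracker's (noisy) output tracks instead of ground-truth trajectories; the sketch below is a hypothetical illustration of that loop, with `predictor` standing in for DCENet++ and `noisy_tracks` for trajectories extracted from the QD-3DT output, not the released code:

```python
import torch
import torch.nn.functional as F

def train_on_tracked_trajectories(noisy_tracks, predictor, optimizer):
    """noisy_tracks: iterable of (observed, future) trajectory tensors extracted
    from the online tracker's output, not from ground-truth annotations."""
    predictor.train()
    for observed, future in noisy_tracks:
        forecast = predictor(observed)      # forecast the same horizon as `future`
        loss = F.mse_loss(forecast, future)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```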


Imbalanced Large Graph Learning Framework for FPGA Logic Elements Packing Prediction

Aug 07, 2023
Zhixiong Di, Runzhe Tao, Lin Chen, Qiang Wu, Yibo Lin

Packing is a required step in a typical FPGA CAD flow and has a high impact on the performance of FPGA placement and routing. Early prediction of packing results can guide design optimization and expedite design closure. In this work, we propose an imbalanced large graph learning framework, ImLG, to predict whether logic elements will be packed after placement. Specifically, we propose dedicated feature extraction and feature aggregation methods to enhance the node representation learning of circuit graphs. Given the imbalanced distribution of packed and unpacked logic elements, we further propose techniques such as graph oversampling and mini-batch training for this imbalanced learning task on large circuit graphs. Experimental results demonstrate that our framework improves the F1 score by 42.82% compared with the most recent Gaussian-based prediction method. Physical design results show that the proposed method can help the placer improve routed wirelength by 0.93% and SLICE occupation by 0.89%.
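
As an illustration of the imbalanced-learning ingredients mentioned above (oversampling plus mini-batch training on a large graph), the sketch below builds class-balanced node-index batches; the function name is an assumption and the GNN forward pass is omitted:

```python
import numpy as np
import torch

def balanced_node_batches(labels, batch_size, seed=0):
    """Yield node-index mini-batches in which packed/unpacked classes appear
    equally often (minority-class indices are oversampled with replacement)."""
    rng = np.random.default_rng(seed)
    idx_by_class = [np.where(labels == c)[0] for c in np.unique(labels)]
    per_class = batch_size // len(idx_by_class)
    while True:
        batch = np.concatenate([rng.choice(idx, per_class, replace=True)
                                for idx in idx_by_class])
        yield torch.as_tensor(batch, dtype=torch.long)
```

In practice, each batch of seed nodes would then be expanded to its sampled neighborhood in the circuit graph before the GNN forward pass.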


FreeDrag: Point Tracking is Not What You Need for Interactive Point-based Image Editing

Jul 29, 2023
Pengyang Ling, Lin Chen, Pan Zhang, Huaian Chen, Yi Jin

To serve the intricate and varied demands of image editing, precise and flexible manipulation of image content is indispensable. Recently, DragGAN has achieved impressive editing results through point-based manipulation. However, we observe that DragGAN struggles with miss tracking, in which it has difficulty effectively tracking the desired handle points, and ambiguous tracking, in which the tracked points sit in other regions that resemble the handle points. To address these issues, we propose FreeDrag, which adopts a feature-oriented approach to free the burden on point tracking within DragGAN's point-oriented methodology. FreeDrag incorporates adaptive template features, line search, and fuzzy localization techniques to perform stable and efficient point-based image editing. Extensive experiments demonstrate that our method is superior to DragGAN and enables stable point-based editing in challenging scenarios with similar structures, fine details, or multi-point targets.

* 8 pages, 7 figures 
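
To give a rough sense of the feature-oriented idea (an adaptive template feature plus a small line search instead of explicit point tracking), here is a hypothetical sketch; the feature sampling, step sizes, and momentum-style template update are illustrative assumptions, not the released FreeDrag code:

```python
import torch
import torch.nn.functional as F

def sample_feat(feat_map, pt):
    """Bilinearly sample a (C, H, W) feature map at a point in [-1, 1] coords."""
    grid = pt.view(1, 1, 1, 2)
    return F.grid_sample(feat_map.unsqueeze(0), grid, align_corners=True).view(-1)

def freedrag_step(feat_map, handle, target, template, step=0.05, momentum=0.8):
    """One drag step: move the handle toward the target via a small line search
    against the template feature, then adapt the template."""
    direction = F.normalize(target - handle, dim=0)
    candidates = [handle + s * step * direction for s in (0.5, 1.0, 2.0)]
    dists = [torch.norm(sample_feat(feat_map, c) - template) for c in candidates]
    new_handle = candidates[int(torch.stack(dists).argmin())]
    # Adaptive template: blend in the feature at the updated handle position.
    new_template = momentum * template + (1 - momentum) * sample_feat(feat_map, new_handle)
    return new_handle, new_template
```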