Soochow University
Abstract:As the volume of image data grows, data-oriented cloud computing in Internet of Video Things (IoVT) systems encounters latency issues. Task-oriented edge computing addresses this by shifting data analysis to the edge. However, limited computational power of edge devices poses challenges for executing visual tasks. Existing methods struggle to balance high model performance with low resource consumption; lightweight neural networks often underperform, while device-specific models designed by Neural Architecture Search (NAS) fail to adapt to heterogeneous devices. For these issues, we propose a novel co-design framework to optimize neural network architecture and deployment strategies during inference for high-throughput. Specifically, it implements a dynamic model structure based on re-parameterization, coupled with a Roofline-based model partitioning strategy to enhance the computational performance of edge devices. We also employ a multi-objective co-optimization approach to balance throughput and accuracy. Additionally, we derive mathematical consistency and convergence of partitioned models. Experimental results demonstrate significant improvements in throughput (12.05\% on MNIST, 18.83\% on ImageNet) and superior classification accuracy compared to baseline algorithms. Our method consistently achieves stable performance across different devices, underscoring its adaptability. Simulated experiments further confirm its efficacy in high-accuracy, real-time detection for small objects in IoVT systems.




Abstract:This paper proposes an unsupervised deep-learning (DL) approach by integrating transformer and Kolmogorov-Arnold networks (KAN) termed KANsformer to realize scalable beamforming for mobile communication systems. Specifically, we consider a classic multi-input-single-output energy efficiency maximization problem subject to the total power budget. The proposed KANsformer first extracts hidden features via a multi-head self-attention mechanism and then reads out the desired beamforming design via KAN. Numerical results are provided to evaluate the KANsformer in terms of generalization performance, transfer learning and ablation experiment. Overall, the KANsformer outperforms existing benchmark DL approaches, and is adaptable to the change in the number of mobile users with real-time and near-optimal inference.




Abstract:As trustworthy AI continues to advance, the fairness issue in recommendations has received increasing attention. A recommender system is considered unfair when it produces unequal outcomes for different user groups based on user-sensitive attributes (e.g., age, gender). Some researchers have proposed data augmentation-based methods aiming at alleviating user-level unfairness by altering the skewed distribution of training data among various user groups. Despite yielding promising results, they often rely on fairness-related assumptions that may not align with reality, potentially reducing the data quality and negatively affecting model effectiveness. To tackle this issue, in this paper, we study how to implement high-quality data augmentation to improve recommendation fairness. Specifically, we propose FairDgcl, a dynamic graph adversarial contrastive learning framework aiming at improving fairness in recommender system. First, FairDgcl develops an adversarial contrastive network with a view generator and a view discriminator to learn generating fair augmentation strategies in an adversarial style. Then, we propose two dynamic, learnable models to generate contrastive views within contrastive learning framework, which automatically fine-tune the augmentation strategies. Meanwhile, we theoretically show that FairDgcl can simultaneously generate enhanced representations that possess both fairness and accuracy. Lastly, comprehensive experiments conducted on four real-world datasets demonstrate the effectiveness of the proposed FairDgcl.




Abstract:The widespread deployment of sensing devices leads to a surge in data for spatio-temporal forecasting applications such as traffic flow, air quality, and wind energy. Although spatio-temporal graph neural networks have achieved success in modeling various static spatio-temporal forecasting scenarios, real-world spatio-temporal data are typically received in a streaming manner, and the network continuously expands with the installation of new sensors. Thus, spatio-temporal forecasting in streaming scenarios faces dual challenges: the inefficiency of retraining models over newly arrived data and the detrimental effects of catastrophic forgetting over long-term history. To address these challenges, we propose a novel prompt tuning-based continuous forecasting method, following two fundamental tuning principles guided by empirical and theoretical analysis: expand and compress, which effectively resolve the aforementioned problems with lightweight tuning parameters. Specifically, we integrate the base spatio-temporal graph neural network with a continuous prompt pool, utilizing stored prompts (i.e., few learnable parameters) in memory, and jointly optimize them with the base spatio-temporal graph neural network. This method ensures that the model sequentially learns from the spatio-temporal data stream to accomplish tasks for corresponding periods. Extensive experimental results on multiple real-world datasets demonstrate the multi-faceted superiority of our method over the state-of-the-art baselines, including effectiveness, efficiency, universality, etc.




Abstract:Most text-to-image customization techniques fine-tune models on a small set of \emph{personal concept} images captured in minimal contexts. This often results in the model becoming overfitted to these training images and unable to generalize to new contexts in future text prompts. Existing customization methods are built on the success of effectively representing personal concepts as textual embeddings. Thus, in this work, we resort to diversifying the context of these personal concepts \emph{solely} within the textual space by simply creating a contextually rich set of text prompts, together with a widely used self-supervised learning objective. Surprisingly, this straightforward and cost-effective method significantly improves semantic alignment in the textual space, and this effect further extends to the image space, resulting in higher prompt fidelity for generated images. Additionally, our approach does not require any architectural modifications, making it highly compatible with existing text-to-image customization methods. We demonstrate the broad applicability of our approach by combining it with four different baseline methods, achieving notable CLIP score improvements.




Abstract:Chart images, such as bar charts, pie charts, and line charts, are explosively produced due to the wide usage of data visualizations. Accordingly, knowledge mining from chart images is becoming increasingly important, which can benefit downstream tasks like chart retrieval and knowledge graph completion. However, existing methods for chart knowledge mining mainly focus on converting chart images into raw data and often ignore their visual encodings and semantic meanings, which can result in information loss for many downstream tasks. In this paper, we propose ChartKG, a novel knowledge graph (KG) based representation for chart images, which can model the visual elements in a chart image and semantic relations among them including visual encodings and visual insights in a unified manner. Further, we develop a general framework to convert chart images to the proposed KG-based representation. It integrates a series of image processing techniques to identify visual elements and relations, e.g., CNNs to classify charts, yolov5 and optical character recognition to parse charts, and rule-based methods to construct graphs. We present four cases to illustrate how our knowledge-graph-based representation can model the detailed visual elements and semantic relations in charts, and further demonstrate how our approach can benefit downstream applications such as semantic-aware chart retrieval and chart question answering. We also conduct quantitative evaluations to assess the two fundamental building blocks of our chart-to-KG framework, i.e., object recognition and optical character recognition. The results provide support for the usefulness and effectiveness of ChartKG.
Abstract:Document understanding is a challenging task to process and comprehend large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. However, existing methods typically focus on either plain text or a limited number of document images, struggling to handle long PDF documents with interleaved text and images, especially in academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler is integrated with the MLLM's image encoder and selects the paragraphs or diagrams most pertinent to user queries for processing by the language model. To effectively train and evaluate our model, we construct PaperPDF, a dataset consisting of a broad collection of academic papers sourced from arXiv, multiple strategies are proposed to generate automatically 1M QA pairs along with their corresponding evidence sources. Experimental results demonstrate the superiority and high efficiency of our approach over other models on the task of long multimodal PDF understanding, surpassing proprietary products by an average of 8.6% on F1. Our code and dataset will be released at https://github.com/yh-hust/PDF-Wukong.




Abstract:Understanding long text is of great demands in practice but beyond the reach of most language-image pre-training (LIP) models. In this work, we empirically confirm that the key reason causing such an issue is that the training images are usually paired with short captions, leaving certain tokens easily overshadowed by salient tokens. Towards this problem, our initial attempt is to relabel the data with long captions, however, directly learning with which may lead to performance degradation in understanding short text (e.g., in the image classification task). Then, after incorporating corner tokens to aggregate diverse textual information, we manage to help the model catch up to its original level of short text understanding yet greatly enhance its capability of long text understanding. We further look into whether the model can continuously benefit from longer captions and notice a clear trade-off between the performance and the efficiency. Finally, we validate the effectiveness of our approach using a self-constructed large-scale dataset, which consists of 100M long caption oriented text-image pairs. It is noteworthy that, on the task of long-text image retrieval, we beat the competitor using long captions with 11.1% improvement (i.e., from 72.62% to 83.72%). We will release the code, the model, and the new dataset to facilitate the reproducibility and further research. The project page is available at https://wuw2019.github.io/lotlip.




Abstract:In recent advancements, multimodal large language models (MLLMs) have been fine-tuned on specific medical image datasets to address medical visual question answering (Med-VQA) tasks. However, this common approach of task-specific fine-tuning is costly and necessitates separate models for each downstream task, limiting the exploration of zero-shot capabilities. In this paper, we introduce MC-CoT, a modular cross-modal collaboration Chain-of-Thought (CoT) framework designed to enhance the zero-shot performance of MLLMs in Med-VQA by leveraging large language models (LLMs). MC-CoT improves reasoning and information extraction by integrating medical knowledge and task-specific guidance, where LLM provides various complex medical reasoning chains and MLLM provides various observations of medical images based on instructions of the LLM. Our experiments on datasets such as SLAKE, VQA-RAD, and PATH-VQA show that MC-CoT surpasses standalone MLLMs and various multimodality CoT frameworks in recall rate and accuracy. These findings highlight the importance of incorporating background information and detailed guidance in addressing complex zero-shot Med-VQA tasks.




Abstract:This paper applies graph neural networks (GNN) in UAV communications to optimize the placement and transmission design. We consider a multiple-user multiple-input-single-output UAV communication system where a UAV intends to find a placement to hover and serve users with maximum energy efficiency (EE). To facilitate the GNN-based learning, we adopt the hybrid maximum ratio transmission and zero forcing scheme to design the beamforming vectors and a feature augment is implemented by manually setting edge features. Furthermore, we propose a two-stage GNN-based model where the first stage and the second stage yield the placement and the transmission design, respectively. The two stages are connected via a residual and their learnable weights are jointly optimized by via unsupervised learning. Numerical results illustrate the effectiveness and validate the scalability to both UAV antennas and users of the proposed model.