De-Chuan Zhan

Weight Scope Alignment: A Frustratingly Easy Method for Model Merging

Aug 22, 2024

Yichu Xu, Xin-Chun Li, Le Gan, De-Chuan Zhan

Abstract: Model merging has become a fundamental procedure in applications that prioritize model efficiency and robustness. Training randomness and non-I.I.D. data pose a major challenge for averaging-based model fusion. Previous research focuses on element-wise regularization or neural permutations to enhance model averaging while overlooking weight scope variations among models, which can significantly affect merging effectiveness. In this paper, we reveal variations in weight scope under different training conditions, shedding light on their influence on model merging. Fortunately, the parameters in each layer approximately follow a Gaussian distribution, which inspires a novel and simple regularization approach named Weight Scope Alignment (WSA). It contains two key components: 1) leveraging a target weight scope to guide model training so that weight scopes match in the subsequent model merging; and 2) fusing the weight scopes of two or more models into a unified one for multi-stage model fusion. We extend the WSA regularization to two different scenarios, Mode Connectivity and Federated Learning. Extensive experiments validate the effectiveness of our approach.
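
The two components named in the abstract can be illustrated with a small sketch. It assumes that a layer's "weight scope" is summarized by the mean and standard deviation of its weights (in line with the Gaussian observation above). The squared-gap penalty and the averaging fusion rule are hypothetical stand-ins, not the paper's actual formulas, and all function names are invented for illustration.

```python
import math

def weight_scope(weights):
    """Return (mean, std) of a flat list of layer weights."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    return mean, math.sqrt(var)

def scope_penalty(weights, target_mean, target_std):
    """Hypothetical regularizer: squared gap between the layer scope and a target scope."""
    mean, std = weight_scope(weights)
    return (mean - target_mean) ** 2 + (std - target_std) ** 2

def fuse_scopes(scopes):
    """Hypothetical fusion rule: average the per-model means and stds."""
    k = len(scopes)
    return (sum(m for m, _ in scopes) / k, sum(s for _, s in scopes) / k)

# Toy weights for the same layer in two separately trained models.
layer_a = [0.1, -0.2, 0.3, -0.4]
layer_b = [1.0, -2.0, 3.0, -4.0]

# Fuse the two scopes into one target, then measure how far each layer is from it.
target = fuse_scopes([weight_scope(layer_a), weight_scope(layer_b)])
penalty_a = scope_penalty(layer_a, *target)
penalty_b = scope_penalty(layer_b, *target)
```

In a training loop, a penalty of this kind would be added to the task loss for each layer, so that models trained apart end up with comparable scopes before they are averaged.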

CS3: Cascade SAM for Sperm Segmentation

Jul 04, 2024

Yi Shi, Xu-Peng Tian, Yun-Kai Wang, Tie-Yi Zhang, Bin Yao, Hui Wang, Yong Shao, Cen-Cen Wang, Rong Zeng, De-Chuan Zhan

Abstract: Automated sperm morphology analysis plays a crucial role in the assessment of male fertility, yet its efficacy is often compromised by the difficulty of accurately segmenting sperm images. Existing segmentation techniques, including the Segment Anything Model (SAM), are notably inadequate at handling sperm overlap, a frequent occurrence in clinical samples. Our exploratory studies reveal that modifying image characteristics by removing sperm heads and easily segmentable areas, alongside enhancing the visibility of overlapping regions, markedly improves SAM's ability to segment intricate sperm structures. Motivated by these findings, we present the Cascade SAM for Sperm Segmentation (CS3), an unsupervised approach specifically designed to tackle sperm overlap. This method applies SAM in a cascade to segment sperm heads, simple tails, and complex tails in stages. The segmented masks are then matched and joined to construct complete sperm masks. In collaboration with leading medical institutions, we have compiled a dataset comprising approximately 2,000 unlabeled sperm images to fine-tune our method, and secured expert annotations for an additional 240 images to facilitate comprehensive model assessment. Experimental results demonstrate superior performance of CS3 compared to existing methods.

Modern Neighborhood Components Analysis: A Deep Tabular Baseline Two Decades Later

Jul 03, 2024

Han-Jia Ye, Huai-Hong Yin, De-Chuan Zhan

Abstract: The growing success of deep learning in various domains has prompted investigations into its application to tabular data, where deep models have shown promising results compared to traditional tree-based methods. In this paper, we revisit Neighborhood Components Analysis (NCA), a classic tabular prediction method introduced in 2004, designed to learn a linear projection that captures semantic similarities between instances. We find that minor modifications, such as adjustments to the learning objectives and the integration of deep learning architectures, significantly enhance NCA's performance, enabling it to surpass most modern deep tabular models. Additionally, we introduce a stochastic neighbor sampling strategy that improves both the efficiency and predictive accuracy of our proposed ModernNCA: only a subset of neighbors is sampled during training, while the entire neighborhood is used during inference. Extensive experiments demonstrate that ModernNCA achieves state-of-the-art results in both classification and regression tasks across various tabular datasets, outperforming both tree-based and other deep tabular models, while also reducing training time and model size.
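
As a rough illustration of the neighbor-based prediction and the stochastic neighbor sampling described above, the sketch below predicts class probabilities with a softmax over negative squared distances and optionally subsamples the candidate neighbors. It works directly on raw feature vectors; the learned embedding, the exact distance function, and the training objective of ModernNCA are not reproduced here, and the function names are hypothetical.

```python
import math
import random

def nca_predict(query, neighbors, labels, num_classes):
    """Class probabilities as a softmax over negative squared distances to the neighbors."""
    dists = [sum((q - x) ** 2 for q, x in zip(query, nb)) for nb in neighbors]
    m = min(dists)  # shift by the smallest distance for numerical stability
    weights = [math.exp(-(d - m)) for d in dists]
    total = sum(weights)
    probs = [0.0] * num_classes
    for w, y in zip(weights, labels):
        probs[y] += w / total
    return probs

def sample_neighbors(neighbors, labels, fraction, rng):
    """Keep a random subset of candidate neighbors (intended for training time only)."""
    k = max(1, int(len(neighbors) * fraction))
    idx = rng.sample(range(len(neighbors)), k)
    return [neighbors[i] for i in idx], [labels[i] for i in idx]

# Toy data: two clusters with labels 0 and 1.
train_x = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
train_y = [0, 0, 1, 1]

# Training-time view: a sampled subset; inference-time view: the full neighborhood.
rng = random.Random(0)
sub_x, sub_y = sample_neighbors(train_x, train_y, 0.5, rng)
full_probs = nca_predict([0.1, 0.0], train_x, train_y, 2)
```

Sampling reduces the cost of each training step because distances are computed against fewer candidates, while inference still uses every stored instance.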

A Closer Look at Deep Learning on Tabular Data

Jul 01, 2024

Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, De-Chuan Zhan

Abstract: Tabular data is prevalent across various domains in machine learning. Although Deep Neural Network (DNN)-based methods have shown promising performance comparable to tree-based ones, in-depth evaluation of these methods is challenging due to varying performance ranks across diverse datasets. In this paper, we propose a comprehensive benchmark comprising 300 tabular datasets, covering a wide range of task types, size distributions, and domains. We perform an extensive comparison between state-of-the-art deep tabular methods and tree-based methods, revealing the average rank of all methods and highlighting the key factors that influence the success of deep tabular methods. Next, we analyze deep tabular methods based on their training dynamics, including changes in validation metrics and other statistics. For each dataset-method pair, we learn a mapping from both the meta-features of datasets and the first part of the validation curve to the final validation set performance and even the evolution of validation curves. This mapping extracts essential meta-features that influence prediction accuracy, aiding the analysis of tabular methods from novel aspects. Based on the performance of all methods on this large benchmark, we identify two subsets of 45 datasets each. The first subset contains datasets that favor either tree-based methods or DNN-based methods, serving as effective analysis tools to evaluate strategies (e.g., attribute encoding strategies) for improving deep tabular models. The second subset contains datasets where the ranks of methods are consistent with the overall benchmark, acting as a probe for tabular analysis. These "tiny tabular benchmarks" will facilitate further studies on tabular data.
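
The "average rank" comparison mentioned above can be sketched as follows. This is a generic illustration rather than the benchmark's own evaluation code: it assumes that every method has a score on every dataset, that higher scores are better, and that tied methods share the mean of the ranks they span. The method names, dataset names, and scores are toy placeholders.

```python
def ranks_for_dataset(scores):
    """Rank methods on one dataset; rank 1 is the highest score, ties share the mean rank."""
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over every method tied with the one at position i.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = (i + 1 + j + 1) / 2
        for name, _ in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks

def average_ranks(results):
    """Average each method's rank over all datasets."""
    totals = {}
    for scores in results.values():
        for name, r in ranks_for_dataset(scores).items():
            totals[name] = totals.get(name, 0.0) + r
    return {name: t / len(results) for name, t in totals.items()}

# Toy scores (placeholders, not results from the paper).
results = {
    "dataset_1": {"method_a": 0.9, "method_b": 0.8, "method_c": 0.7},
    "dataset_2": {"method_a": 0.6, "method_b": 0.8, "method_c": 0.7},
}
avg = average_ranks(results)
```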

SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets

Jun 13, 2024

Shenghua Wan, Ziyuan Chen, Le Gan, Shuai Feng, De-Chuan Zhan

Abstract: Model-based offline reinforcement learning (RL) is a promising approach that leverages existing data effectively in many real-world applications, especially those involving high-dimensional inputs like images and videos. To alleviate the distribution shift issue in offline RL, existing model-based methods rely heavily on the uncertainty of learned dynamics. However, the model uncertainty estimation becomes significantly biased when observations contain complex distractors with non-trivial dynamics. To address this challenge, we propose a new approach, Separated Model-based Offline Policy Optimization (SeMOPO), which decomposes latent states into endogenous and exogenous parts via conservative sampling and estimates model uncertainty on the endogenous states only. We provide a theoretical guarantee of model uncertainty and a performance bound for SeMOPO. To assess its efficacy, we construct the Low-Quality Vision Deep Data-Driven Datasets for RL (LQV-D4RL), where the data are collected by a non-expert policy and the observations include moving distractors. Experimental results show that our method substantially outperforms all baseline methods, and further analytical experiments validate the critical designs in our method. The project website is https://sites.google.com/view/semopo.

* 23 pages, 10 figures

Improving LLMs for Recommendation with Out-Of-Vocabulary Tokens

Jun 12, 2024

Ting-Ji Huang, Jia-Qi Yang, Chunxu Shen, Kai-Qi Liu, De-Chuan Zhan, Han-Jia Ye

Abstract: Characterizing users and items through vector representations is crucial for various tasks in recommender systems. Recent approaches attempt to apply Large Language Models (LLMs) to recommendation through a question-and-answer format, where real users and items (e.g., Item No.2024) are represented with in-vocabulary tokens (e.g., "item", "20", "24"). However, since LLMs are typically pretrained on natural language tasks, these in-vocabulary tokens lack the expressive power to distinguish users and items, thereby weakening the recommendation ability even after fine-tuning on recommendation tasks. In this paper, we explore how to effectively tokenize users and items in LLM-based recommender systems. We emphasize the role of out-of-vocabulary (OOV) tokens in addition to the in-vocabulary ones, and argue for both the memorization of OOV tokens, which captures correlations among users/items, and the diversity of OOV tokens. By clustering the learned representations from historical user-item interactions, we make the representations of user/item combinations share the same OOV tokens if they have similar properties. Furthermore, integrating these OOV tokens into the LLM's vocabulary allows for better distinction between users and items and enhanced capture of user-item relationships during fine-tuning on downstream tasks. Our proposed framework outperforms existing state-of-the-art methods across various downstream recommendation tasks.
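
The clustering step described above, where users or items with similar learned representations share the same OOV token, can be pictured with a small sketch. It uses a plain k-means loop with fixed starting centroids as a stand-in for whatever clustering the authors use. The token format `<oov_k>`, the item names, and the two-dimensional embeddings are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for pt in points:
            groups[nearest(pt, centroids)].append(pt)
        # Recompute each centroid as its group mean; keep empty groups unchanged.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else cen
            for g, cen in zip(groups, centroids)
        ]
    return centroids

def assign_oov_tokens(embeddings, centroids):
    """Map each item to a shared token named after its cluster (hypothetical format)."""
    return {name: f"<oov_{nearest(vec, centroids)}>" for name, vec in embeddings.items()}

# Toy item embeddings standing in for representations learned from interactions.
embeddings = {
    "item_1": [0.0, 0.1],
    "item_2": [0.1, 0.0],
    "item_3": [5.0, 5.1],
    "item_4": [5.1, 5.0],
}
centroids = kmeans(list(embeddings.values()), [[0.0, 0.0], [5.0, 5.0]])
tokens = assign_oov_tokens(embeddings, centroids)
```

The resulting token strings would then be added to the model's vocabulary before fine-tuning; that step depends on the tokenizer in use and is not shown.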

Wings: Learning Multimodal LLMs without Text-only Forgetting

Jun 05, 2024

Yi-Kai Zhang, Shiyin Lu, Yang Li, Yanqing Ma, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye

Abstract: Multimodal large language models (MLLMs), initialized from a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets text-only instructions, which do not include images and can be addressed by the initial LLM. In this paper, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention in multimodal instructions reveals that text-only forgetting is related to attention shifts from pre-image to post-image text. Based on this, we construct extra modules that act as boosted learners to compensate for the attention shift. The complementary visual and textual learners, like "wings" on either side, are connected in parallel within each layer's attention block. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Textual learners are later collaboratively integrated with attention-based routing to blend the outputs of the visual and textual learners. We design the Low-Rank Residual Attention (LoRRA) to guarantee high efficiency for learners. Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings exhibits superior performance from text-only-rich to multimodal-rich question-answering tasks.

Parrot: Multilingual Visual Instruction Tuning

Jun 04, 2024

Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan (+1 more)

Abstract: The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities, which causes MLLMs' inherent ability to respond in multiple languages to deteriorate progressively as training proceeds. We empirically find that imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is due to the failure to align the vision encoder and LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level. Parrot conditions the visual tokens on diverse language inputs and uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens. Specifically, to enhance the alignment of non-English visual tokens, we compute the cross-attention between the initial visual features and textual embeddings, the result of which is then fed into the MoE router to select the most relevant experts. The selected experts subsequently convert the initial visual tokens into language-specific visual tokens. Moreover, considering the current lack of benchmarks for evaluating multilingual capabilities within the field, we collect and make available a Massive Multilingual Multimodal Benchmark, named MMMB, which includes 6 languages, 15 categories, and 12,000 questions. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available.

Visualizing, Rethinking, and Mining the Loss Landscape of Deep Neural Networks

May 21, 2024

Xin-Chun Li, Lan Li, De-Chuan Zhan

Abstract: The loss landscape of deep neural networks (DNNs) is commonly considered complex and wildly fluctuating. However, an interesting observation is that the loss surfaces plotted along Gaussian noise directions are almost v-basin ones with the perturbed model lying on the basin. This motivates us to rethink whether a 1D or 2D subspace could cover more complex local geometry structures, and how to mine the corresponding perturbation directions. This paper systematically categorizes the 1D curves from simple to complex, including v-basin, v-side, w-basin, w-peak, and vvv-basin curves. Notably, the latter two types are already hard to obtain via the intuitive construction of specific perturbation directions, and we need to propose proper mining algorithms to plot the corresponding 1D curves. Combining these 1D directions, various types of 2D surfaces are visualized, such as saddle surfaces and the bottom of a wine bottle, which are only shown by demo functions in previous works. Finally, we propose theoretical insights through the lens of the Hessian matrix to explain several of the observed phenomena.
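
The 1D curves discussed above come from evaluating the loss at points `theta + alpha * direction` for a range of step sizes `alpha`. The sketch below shows that procedure along a Gaussian noise direction. The quadratic `toy_loss` is only a placeholder for a network's loss, so it yields a smooth bowl rather than the shapes reported in the paper, and the function names are hypothetical.

```python
import random

def loss_curve_1d(loss_fn, theta, direction, alphas):
    """Evaluate the loss at theta + alpha * direction for each alpha."""
    return [
        loss_fn([t + a * d for t, d in zip(theta, direction)])
        for a in alphas
    ]

def gaussian_direction(dim, rng):
    """Sample a perturbation direction with independent standard normal entries."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def toy_loss(params):
    """Placeholder loss: a quadratic bowl with its minimum at the origin."""
    return sum(p * p for p in params)

rng = random.Random(0)
theta = [0.0, 0.0, 0.0]            # stands in for the trained parameters
direction = gaussian_direction(3, rng)
alphas = [-1.0, -0.5, 0.0, 0.5, 1.0]
curve = loss_curve_1d(toy_loss, theta, direction, alphas)
```

A 2D surface follows the same pattern with two directions and a grid of step-size pairs.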

Exploring Dark Knowledge under Various Teacher Capacities and Addressing Capacity Mismatch

May 21, 2024

Xin-Chun Li, Wen-Shu Fan, Bowen Tao, Le Gan, De-Chuan Zhan

Abstract: Knowledge Distillation (KD) can transfer the "dark knowledge" of a well-performing yet large neural network to a weaker but lightweight one. From the view of output logits and softened probabilities, this paper goes deeper into the dark knowledge provided by teachers with different capacities. Two fundamental observations are: (1) a larger teacher tends to produce probability vectors that are less distinct between non-ground-truth classes; (2) teachers with different capacities are basically consistent in their cognition of relative class affinity. Extensive experiments verify these observations, and in-depth empirical explanations are provided. The difference in dark knowledge leads to the peculiar phenomenon named "capacity mismatch": a more accurate teacher does not necessarily perform as well as a smaller teacher when teaching the same student network. Enlarging the distinctness between non-ground-truth class probabilities for larger teachers could address the capacity mismatch problem. This paper explores multiple simple yet effective ways to achieve this goal and verifies their success by comparing them with popular KD methods that address capacity mismatch.
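
To make "softened probabilities" and "distinctness between non-ground-truth classes" concrete, the sketch below computes a temperature-scaled softmax and one possible distinctness measure. The measure used here, the variance of the renormalized non-ground-truth probabilities, is an assumption chosen for illustration and may differ from the paper's metric. The logits are toy values.

```python
import math

def softened_probs(logits, temperature):
    """Temperature-scaled softmax; higher temperatures give flatter distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def non_target_distinctness(probs, target):
    """Hypothetical measure: variance of the renormalized non-ground-truth probabilities."""
    rest = [p for i, p in enumerate(probs) if i != target]
    total = sum(rest)
    rest = [p / total for p in rest]
    mean = sum(rest) / len(rest)
    return sum((p - mean) ** 2 for p in rest) / len(rest)

# Toy teacher logits; class 0 is the ground truth.
logits = [5.0, 2.0, 1.0, 0.0]
sharp = non_target_distinctness(softened_probs(logits, 1.0), target=0)
flat = non_target_distinctness(softened_probs(logits, 4.0), target=0)
```

With these toy logits, the lower temperature keeps the non-ground-truth classes more distinct (`sharp` is larger than `flat`), which is the direction the abstract suggests for larger teachers.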
