Univ. California, Santa Barbara




Abstract:In this paper, we propose a simple yet unified single object tracking (SOT) framework, dubbed SUTrack. It consolidates five SOT tasks (RGB-based, RGB-Depth, RGB-Thermal, RGB-Event, RGB-Language Tracking) into a unified model trained in a single session. Due to the distinct nature of the data, current methods typically design individual architectures and train separate models for each task. This fragmentation results in redundant training processes, repetitive technological innovations, and limited cross-modal knowledge sharing. In contrast, SUTrack demonstrates that a single model with a unified input representation can effectively handle various common SOT tasks, eliminating the need for task-specific designs and separate training sessions. Additionally, we introduce a task-recognition auxiliary training strategy and a soft token type embedding to further enhance SUTrack's performance with minimal overhead. Experiments show that SUTrack outperforms previous task-specific counterparts across 11 datasets spanning five SOT tasks. Moreover, we provide a range of models catering edge devices as well as high-performance GPUs, striking a good trade-off between speed and accuracy. We hope SUTrack could serve as a strong foundation for further compelling research into unified tracking models. Code and models are available at github.com/chenxin-dlut/SUTrack.




Abstract:Deep recommender systems rely heavily on large embedding tables to handle high-cardinality categorical features such as user/item identifiers, and face significant memory constraints at scale. To tackle this challenge, hashing techniques are often employed to map multiple entities to the same embedding and thus reduce the size of the embedding tables. Concurrently, graph-based collaborative signals have emerged as powerful tools in recommender systems, yet their potential for optimizing embedding table reduction remains unexplored. This paper introduces GraphHash, the first graph-based approach that leverages modularity-based bipartite graph clustering on user-item interaction graphs to reduce embedding table sizes. We demonstrate that the modularity objective has a theoretical connection to message-passing, which provides a foundation for our method. By employing fast clustering algorithms, GraphHash serves as a computationally efficient proxy for message-passing during preprocessing and a plug-and-play graph-based alternative to traditional ID hashing. Extensive experiments show that GraphHash substantially outperforms diverse hashing baselines on both retrieval and click-through-rate prediction tasks. In particular, GraphHash achieves on average a 101.52% improvement in recall when reducing the embedding table size by more than 75%, highlighting the value of graph-based collaborative information for model reduction.




Abstract:Binarized Neural Networks (BNN) offer efficient implementations for machine learning tasks and facilitate Privacy-Preserving Machine Learning (PPML) by simplifying operations with binary values. Nevertheless, challenges persist in terms of communication and accuracy in their application scenarios. In this work, we introduce CBNN, a three-party secure computation framework tailored for efficient BNN inference. Leveraging knowledge distillation and separable convolutions, CBNN transforms standard BNNs into MPC-friendly customized BNNs, maintaining high utility. It performs secure inference using optimized protocols for basic operations. Specifically, CBNN enhances linear operations with replicated secret sharing and MPC-friendly convolutions, while introducing a novel secure activation function to optimize non-linear operations. We demonstrate the effectiveness of CBNN by transforming and securely implementing several typical BNN models. Experimental results indicate that CBNN maintains impressive performance even after customized binarization and security measures




Abstract:In the realm of Sign Language Translation (SLT), reliance on costly gloss-annotated datasets has posed a significant barrier. Recent advancements in gloss-free SLT methods have shown promise, yet they often largely lag behind gloss-based approaches in terms of translation accuracy. To narrow this performance gap, we introduce LLaVA-SLT, a pioneering Large Multimodal Model (LMM) framework designed to leverage the power of Large Language Models (LLMs) through effectively learned visual language embeddings. Our model is trained through a trilogy. First, we propose linguistic continued pretraining. We scale up the LLM and adapt it to the sign language domain using an extensive corpus dataset, effectively enhancing its textual linguistic knowledge about sign language. Then, we adopt visual contrastive pretraining to align the visual encoder with a large-scale pretrained text encoder. We propose hierarchical visual encoder that learns a robust word-level intermediate representation that is compatible with LLM token embeddings. Finally, we propose visual language tuning. We freeze pretrained models and employ a lightweight trainable MLP connector. It efficiently maps the pretrained visual language embeddings into the LLM token embedding space, enabling downstream SLT task. Our comprehensive experiments demonstrate that LLaVA-SLT outperforms the state-of-the-art methods. By using extra annotation-free data, it even closes to the gloss-based accuracy.




Abstract:As data volumes expand rapidly, distributed machine learning has become essential for addressing the growing computational demands of modern AI systems. However, training models in distributed environments is challenging with participants hold skew, Non-Independent-Identically distributed (Non-IID) data. Low-Rank Adaptation (LoRA) offers a promising solution to this problem by personalizing low-rank updates rather than optimizing the entire model, LoRA-enabled distributed learning minimizes computational and maximize personalization for each participant. Enabling more robust and efficient training in distributed learning settings, especially in large-scale, heterogeneous systems. Despite the strengths of current state-of-the-art methods, they often require manual configuration of the initial rank, which is increasingly impractical as the number of participants grows. This manual tuning is not only time-consuming but also prone to suboptimal configurations. To address this limitation, we propose AutoRank, an adaptive rank-setting algorithm inspired by the bias-variance trade-off. AutoRank leverages the MCDA method TOPSIS to dynamically assign local ranks based on the complexity of each participant's data. By evaluating data distribution and complexity through our proposed data complexity metrics, AutoRank provides fine-grained adjustments to the rank of each participant's local LoRA model. This adaptive approach effectively mitigates the challenges of double-imbalanced, non-IID data. Experimental results demonstrate that AutoRank significantly reduces computational overhead, enhances model performance, and accelerates convergence in highly heterogeneous federated learning environments. Through its strong adaptability, AutoRank offers a scalable and flexible solution for distributed machine learning.




Abstract:Contextual information at the video level has become increasingly crucial for visual object tracking. However, existing methods typically use only a few tokens to convey this information, which can lead to information loss and limit their ability to fully capture the context. To address this issue, we propose a new video-level visual object tracking framework called MCITrack. It leverages Mamba's hidden states to continuously record and transmit extensive contextual information throughout the video stream, resulting in more robust object tracking. The core component of MCITrack is the Contextual Information Fusion module, which consists of the mamba layer and the cross-attention layer. The mamba layer stores historical contextual information, while the cross-attention layer integrates this information into the current visual features of each backbone block. This module enhances the model's ability to capture and utilize contextual information at multiple levels through deep integration with the backbone. Experiments demonstrate that MCITrack achieves competitive performance across numerous benchmarks. For instance, it gets 76.6% AUC on LaSOT and 80.0% AO on GOT-10k, establishing a new state-of-the-art performance. Code and models are available at https://github.com/kangben258/MCITrack.




Abstract:Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal large language models (MLLMs) have made significant strides in open-world understanding. However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which limits their ability to process inputs and generate responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient. Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of the Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive (IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module: Processes multimodal information in real-time, storing key details in memory and triggering reasoning in response to user queries. (2) Multi-modal Long Memory Module: Integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy. (3) Reasoning Module: Responds to queries and executes reasoning tasks, coordinating with the perception and memory modules. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
Abstract:Due to the scarcity and unpredictable nature of defect samples, industrial anomaly detection (IAD) predominantly employs unsupervised learning. However, all unsupervised IAD methods face a common challenge: the inherent bias in normal samples, which causes models to focus on variable regions while overlooking potential defects in invariant areas. To effectively overcome this, it is essential to decompose and recalibrate attention, guiding the model to suppress irrelevant variations and concentrate on subtle, defect-susceptible areas. In this paper, we propose Recalibrating Attention of Industrial Anomaly Detection (RAAD), a framework that systematically decomposes and recalibrates attention maps. RAAD employs a two-stage process: first, it reduces attention bias through quantization, and second, it fine-tunes defect-prone regions for improved sensitivity. Central to this framework is Hierarchical Quantization Scoring (HQS), which dynamically allocates bit-widths across layers based on their anomaly detection contributions. HQS dynamically adjusts bit-widths based on the hierarchical nature of attention maps, compressing lower layers that produce coarse and noisy attention while preserving deeper layers with sharper, defect-focused attention. This approach optimizes both computational efficiency and the model' s sensitivity to anomalies. We validate the effectiveness of RAAD on 32 datasets using a single 3090ti. Experiments demonstrate that RAAD, balances the complexity and expressive power of the model, enhancing its anomaly detection capability.




Abstract:Understanding travelers' route choices can help policymakers devise optimal operational and planning strategies for both normal and abnormal circumstances. However, existing choice modeling methods often rely on predefined assumptions and struggle to capture the dynamic and adaptive nature of travel behavior. Recently, Large Language Models (LLMs) have emerged as a promising alternative, demonstrating remarkable ability to replicate human-like behaviors across various fields. Despite this potential, their capacity to accurately simulate human route choice behavior in transportation contexts remains doubtful. To satisfy this curiosity, this paper investigates the potential of LLMs for route choice modeling by introducing an LLM-empowered agent, "LLMTraveler." This agent integrates an LLM as its core, equipped with a memory system that learns from past experiences and makes decisions by balancing retrieved data and personality traits. The study systematically evaluates the LLMTraveler's ability to replicate human-like decision-making through two stages: (1) analyzing its route-switching behavior in single origin-destination (OD) pair congestion game scenarios, where it demonstrates patterns align with laboratory data but are not fully explained by traditional models, and (2) testing its capacity to model day-to-day (DTD) adaptive learning behaviors on the Ortuzar and Willumsen (OW) network, producing results comparable to Multinomial Logit (MNL) and Reinforcement Learning (RL) models. These experiments demonstrate that the framework can partially replicate human-like decision-making in route choice while providing natural language explanations for its decisions. This capability offers valuable insights for transportation policymaking, such as simulating traveler responses to new policies or changes in the network.




Abstract:Hosting diverse large language model workloads in a unified resource pool through co-location is cost-effective. For example, long-running chat services generally follow diurnal traffic patterns, which inspire co-location of batch jobs to fulfill resource valleys between successive peaks, and thus to saturate resource allocation in cluster-wide scope. These heterogeneous workloads often have different business priorities, and therefore preemption can be leveraged for resource elasticity. However, workloads often have distinct topology preferences as well. The resources released by lower-priority instances may fail to meet the requirements of high-priority online services which are usually latency-sensitive. The root cause behind such mis-match is a lack of topology awareness of resource scheduler, especially during preemption. To bridge this gap, we develop a fine-grained topology-aware method for preemptive scheduling of hybrid workloads. The method ensures that the resources freed by preempted tasks adhere to the topological affinity needs of high-priority preemptors in a guaranteed or best-effort manner. This dynamic alignment significantly increases the efficiency of preemption and improves overall scheduled performance for LLM workloads by $55\%$.