We present Prompt Cache, an approach for accelerating inference for large language models (LLM) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompt. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduce latency in time-to-first-token, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.
Temporal graph representation learning aims to generate low-dimensional dynamic node embeddings to capture temporal information as well as structural and property information. Current representation learning methods for temporal networks often focus on capturing fine-grained information, which may lead to the model capturing random noise instead of essential semantic information. While graph contrastive learning has shown promise in dealing with noise, it only applies to static graphs or snapshots and may not be suitable for handling time-dependent noise. To alleviate the above challenge, we propose a novel Temporal Graph representation learning with Adaptive augmentation Contrastive (TGAC) model. The adaptive augmentation on the temporal graph is made by combining prior knowledge with temporal information, and the contrastive objective function is constructed by defining the augmented inter-view contrast and intra-view contrast. To complement TGAC, we propose three adaptive augmentation strategies that modify topological features to reduce noise from the network. Our extensive experiments on various real networks demonstrate that the proposed model outperforms other temporal graph representation learning methods.
Large Language Models (LLMs) are smart but forgetful. Recent studies, (e.g., (Bubeck et al., 2023)) on modern LLMs have shown that they are capable of performing amazing tasks typically necessitating human-level intelligence. However, unlike humans, frozen LLMs do not improve over time; they neither acquire new knowledge nor learn from their successes or failures. Some approaches to improving the intelligence of LLMs include fine-tuning models based on problem-solving performance (Zelikman et al., 2022), and building bigger and more sophisticated models (Bubeck et al., 2023). However, these methods have the drawback of requiring substantial data and computational resources to retrain existing models. In this paper, we explore the use of Retrieval Augmented Generation, also known as RAG (Lewis et al., 2021) to improve problem-solving performance. We propose ARM-RAG (Auxiliary Rationale Memory for Retrieval Augmented Generation), a system that learns from its successes without incurring high training costs. We demonstrate that the storage and subsequent retrieval of reasoning chains have a positive influence on performance in grade-school math problems.
The paper proposes a framework to identify and avoid the coverage hole in an indoor industry environment. We assume an edge cloud co-located controller that followers the Automated Guided Vehicle (AGV) movement on a factory floor over a wireless channel. The coverage holes are caused due to blockage, path-loss, and fading effects. An AGV in the coverage hole may lose connectivity to the edge-cloud and become unstable. To avoid connectivity loss, we proposed a framework that identifies the position of coverage hole using a Support- Vector Machine (SVM) classifier model and constructs a binary coverage hole map incorporating the AGV trajectory re-planning to avoid the identified coverage hole. The AGV's re-planned trajectory is optimized and selected to avoid coverage hole the shortest coverage-hole-free trajectory. We further investigated the look-ahead time's impact on the AGV's re-planned trajectory performance. The results reveal that an AGV's re-planned trajectory can be shorter and further optimized if the coverage hole position is known ahead of time
Federated learning (FL) is a classic paradigm of 6G edge intelligence (EI), which alleviates privacy leaks and high communication pressure caused by traditional centralized data processing in the artificial intelligence of things (AIoT). The implementation of multimodal federated perception (MFP) services involves three sub-processes, including sensing-based multimodal data generation, communication-based model transmission, and computing-based model training, ultimately relying on available underlying multi-domain physical resources such as time, frequency, and computing power. How to reasonably coordinate the multi-domain resources scheduling among sensing, communication, and computing, therefore, is crucial to the MFP networks. To address the above issues, this paper investigates service-oriented resource management with integrated sensing, communication, and computing (ISCC). With the incentive mechanism of the MFP service market, the resources management problem is redefined as a social welfare maximization problem, where the idea of "expanding resources" and "reducing costs" is used to improve learning performance gain and reduce resource costs. Experimental results demonstrate the effectiveness and robustness of the proposed resource scheduling mechanisms.
Thin-cap fibroatheroma (TCFA) is a prominent risk factor for plaque rupture. Intravascular optical coherence tomography (IVOCT) enables identification of fibrous cap (FC), measurement of FC thicknesses, and assessment of plaque vulnerability. We developed a fully-automated deep learning method for FC segmentation. This study included 32,531 images across 227 pullbacks from two registries. Images were semi-automatically labeled using our OCTOPUS with expert editing using established guidelines. We employed preprocessing including guidewire shadow detection, lumen segmentation, pixel-shifting, and Gaussian filtering on raw IVOCT (r,theta) images. Data were augmented in a natural way by changing theta in spiral acquisitions and by changing intensity and noise values. We used a modified SegResNet and comparison networks to segment FCs. We employed transfer learning from our existing much larger, fully-labeled calcification IVOCT dataset to reduce deep-learning training. Overall, our method consistently delivered better FC segmentation results (Dice: 0.837+/-0.012) than other deep-learning methods. Transfer learning reduced training time by 84% and reduced the need for more training samples. Our method showed a high level of generalizability, evidenced by highly-consistent segmentations across five-fold cross-validation (sensitivity: 85.0+/-0.3%, Dice: 0.846+/-0.011) and the held-out test (sensitivity: 84.9%, Dice: 0.816) sets. In addition, we found excellent agreement of FC thickness with ground truth (2.95+/-20.73 um), giving clinically insignificant bias. There was excellent reproducibility in pre- and post-stenting pullbacks (average FC angle: 200.9+/-128.0 deg / 202.0+/-121.1 deg). Our method will be useful for multiple research purposes and potentially for planning stent deployments that avoid placing a stent edge over an FC.
We present a video decomposition method that facilitates layer-based editing of videos with spatiotemporally varying lighting and motion effects. Our neural model decomposes an input video into multiple layered representations, each comprising a 2D texture map, a mask for the original video, and a multiplicative residual characterizing the spatiotemporal variations in lighting conditions. A single edit on the texture maps can be propagated to the corresponding locations in the entire video frames while preserving other contents' consistencies. Our method efficiently learns the layer-based neural representations of a 1080p video in 25s per frame via coordinate hashing and allows real-time rendering of the edited result at 71 fps on a single GPU. Qualitatively, we run our method on various videos to show its effectiveness in generating high-quality editing effects. Quantitatively, we propose to adopt feature-tracking evaluation metrics for objectively assessing the consistency of video editing. Project page: https://lightbulb12294.github.io/hashing-nvd/
We present a highly compact run-time monitoring approach for deep computer vision networks that extracts selected knowledge from only a few (down to merely two) hidden layers, yet can efficiently detect silent data corruption originating from both hardware memory and input faults. Building on the insight that critical faults typically manifest as peak or bulk shifts in the activation distribution of the affected network layers, we use strategically placed quantile markers to make accurate estimates about the anomaly of the current inference as a whole. Importantly, the detector component itself is kept algorithmically transparent to render the categorization of regular and abnormal behavior interpretable to a human. Our technique achieves up to ~96% precision and ~98% recall of detection. Compared to state-of-the-art anomaly detection techniques, this approach requires minimal compute overhead (as little as 0.3% with respect to non-supervised inference time) and contributes to the explainability of the model.
Orthogonal frequency division multiplexing (OFDM) is one of the representative integrated sensing and communication (ISAC) waveforms, where sensing and communications tend to be assigned with different resource elements (REs) due to their diverse design requirements. This motivates optimization of resource allocation/waveform design across time, frequency, power and delay-Doppler domains. Therefore, this article proposes two cross-domain waveform optimization strategies for OFDM-based ISAC systems, following communication-centric and sensing-centric criteria, respectively. For the communication-centric design, to maximize the achievable data rate, a fraction of REs are optimally allocated for communications according to prior knowledge of the communication channel. The remaining REs are then employed for sensing, where the sidelobe level and peak to average power ratio are suppressed by optimizing its power-frequency and phase-frequency characteristics. For the sensing-centric design, a `locally' perfect auto-correlation property is ensured by adjusting the unit cells of the ambiguity function within its region of interest (RoI). Afterwards, the irrelevant cells beyond RoI, which can readily determine the sensing power allocation, are optimized with the communication power allocation to enhance the achievable data rate. Numerical results demonstrate the superiority of the proposed communication-centric and sensing-centric waveform designs for ISAC applications.
Conventional video matting outputs one alpha matte for all instances appearing in a video frame so that individual instances are not distinguished. While video instance segmentation provides time-consistent instance masks, results are unsatisfactory for matting applications, especially due to applied binarization. To remedy this deficiency, we propose Video Instance Matting~(VIM), that is, estimating alpha mattes of each instance at each frame of a video sequence. To tackle this challenging problem, we present MSG-VIM, a Mask Sequence Guided Video Instance Matting neural network, as a novel baseline model for VIM. MSG-VIM leverages a mixture of mask augmentations to make predictions robust to inaccurate and inconsistent mask guidance. It incorporates temporal mask and temporal feature guidance to improve the temporal consistency of alpha matte predictions. Furthermore, we build a new benchmark for VIM, called VIM50, which comprises 50 video clips with multiple human instances as foreground objects. To evaluate performances on the VIM task, we introduce a suitable metric called Video Instance-aware Matting Quality~(VIMQ). Our proposed model MSG-VIM sets a strong baseline on the VIM50 benchmark and outperforms existing methods by a large margin. The project is open-sourced at https://github.com/SHI-Labs/VIM.