The unprecedented surge in video data production in recent years necessitates efficient tools to extract meaningful frames from videos for downstream tasks. Long-term temporal reasoning is a key desideratum for frame retrieval systems. While state-of-the-art foundation models, like VideoLLaMA and ViCLIP, are proficient in short-term semantic understanding, they surprisingly fail at long-term reasoning across frames. A key reason for this failure is that they intertwine per-frame perception and temporal reasoning into a single deep network. Hence, decoupling but co-designing semantic understanding and temporal reasoning is essential for efficient scene identification. We propose a system that leverages vision-language models for semantic understanding of individual frames but effectively reasons about the long-term evolution of events using state machines and temporal logic (TL) formulae that inherently capture memory. Our TL-based reasoning improves the F1 score of complex event identification by 9-15% compared to benchmarks that use GPT4 for reasoning on state-of-the-art self-driving datasets such as Waymo and NuScenes.
Although pre-trained language models encode generic knowledge beneficial for planning and control, they may fail to generate appropriate control policies for domain-specific tasks. Existing fine-tuning methods use human feedback to address this limitation, however, sourcing human feedback is labor intensive and costly. We present a fully automated approach to fine-tune pre-trained language models for applications in autonomous systems, bridging the gap between generic knowledge and domain-specific requirements while reducing cost. The method synthesizes automaton-based controllers from pre-trained models guided by natural language task descriptions. These controllers are verifiable against independently provided specifications within a world model, which can be abstract or obtained from a high-fidelity simulator. Controllers with high compliance with the desired specifications receive higher ranks, guiding the iterative fine-tuning process. We provide quantitative evidences, primarily in autonomous driving, to demonstrate the method's effectiveness across multiple tasks. The results indicate an improvement in percentage of specifications satisfied by the controller from 60% to 90%.
The increasing abundance of video data enables users to search for events of interest, e.g., emergency incidents. Meanwhile, it raises new concerns, such as the need for preserving privacy. Existing approaches to video search require either manual inspection or a deep learning model with massive training. We develop a method that uses recent advances in vision and language models, as well as formal methods, to search for events of interest in video clips automatically and efficiently. The method consists of an algorithm to map text-based event descriptions into linear temporal logic over finite traces (LTL$_f$) and an algorithm to construct an automaton encoding the video information. Then, the method formally verifies the automaton representing the video against the LTL$_f$ specifications and adds the pertinent video clips to the search result if the automaton satisfies the specifications. We provide qualitative and quantitative analysis to demonstrate the video-searching capability of the proposed method. It achieves over 90 percent precision in searching over privacy-sensitive videos and a state-of-the-art autonomous driving dataset.
The rapid advancement of large language models, such as the Generative Pre-trained Transformer (GPT) series, has had significant implications across various disciplines. In this study, we investigate the potential of the state-of-the-art large language model (GPT-4) for planning tasks. We explore its effectiveness in multiple planning subfields, highlighting both its strengths and limitations. Through a comprehensive examination, we identify areas where large language models excel in solving planning problems and reveal the constraints that limit their applicability. Our empirical analysis focuses on GPT-4's performance in planning domain extraction, graph search path planning, and adversarial planning. We then propose a way of fine-tuning a domain-specific large language model to improve its Chain of Thought (CoT) capabilities for the above-mentioned tasks. The results provide valuable insights into the potential applications of large language models in the planning domain and pave the way for future research to overcome their limitations and expand their capabilities.
Recently developed pretrained models can encode rich world knowledge expressed in multiple modalities, such as text and images. However, the outputs of these models cannot be integrated into algorithms to solve sequential decision-making tasks. We develop an algorithm that utilizes the knowledge from pretrained models to construct and verify controllers for sequential decision-making tasks, and to ground these controllers to task environments through visual observations. In particular, the algorithm queries a pretrained model with a user-provided, text-based task description and uses the model's output to construct an automaton-based controller that encodes the model's task-relevant knowledge. It then verifies whether the knowledge encoded in the controller is consistent with other independently available knowledge, which may include abstract information on the environment or user-provided specifications. If this verification step discovers any inconsistency, the algorithm automatically refines the controller to resolve the inconsistency. Next, the algorithm leverages the vision and language capabilities of pretrained models to ground the controller to the task environment. It collects image-based observations from the task environment and uses the pretrained model to link these observations to the text-based control logic encoded in the controller (e.g., actions and conditions that trigger the actions). We propose a mechanism to ensure the controller satisfies the user-provided specification even when perceptual uncertainties are present. We demonstrate the algorithm's ability to construct, verify, and ground automaton-based controllers through a suite of real-world tasks, including daily life and robot manipulation tasks.
Automata-based representations play an important role in control and planning in sequential decision-making, but obtaining high-level task knowledge for building automata is often difficult. Although large-scale generative language models (GLMs) can help automatically distill task knowledge, the textual outputs from GLMs are not directly utilizable in sequential decision-making. We resolve this problem by proposing a novel algorithm named GLM2FSA, which obtains high-level task knowledge, represented in a finite state automaton (FSA), from a given brief description of the task goal. GLM2FSA sends queries to a GLM for task knowledge in textual form and then builds a FSA to represent the textual knowledge. This algorithm fills the gap between text and automata-based representations, and the constructed FSA can be directly utilized in sequential decision-making. We provide examples to demonstrate how GLM2FSA constructs FSAs to represent knowledge encoded in the texts generated by the large-scale GLMs.
We study the privacy risks that are associated with training a neural network's weights with self-supervised learning algorithms. Through empirical evidence, we show that the fine-tuning stage, in which the network weights are updated with an informative and often private dataset, is vulnerable to privacy attacks. To address the vulnerabilities, we design a post-training privacy-protection algorithm that adds noise to the fine-tuned weights and propose a novel differential privacy mechanism that samples noise from the logistic distribution. Compared to the two conventional additive noise mechanisms, namely the Laplace and the Gaussian mechanisms, the proposed mechanism uses a bell-shaped distribution that resembles the distribution of the Gaussian mechanism, and it satisfies pure $\epsilon$-differential privacy similar to the Laplace mechanism. We apply membership inference attacks on both unprotected and protected models to quantify the trade-off between the models' privacy and performance. We show that the proposed protection algorithm can effectively reduce the attack accuracy to roughly 50\%-equivalent to random guessing-while maintaining a performance loss below 5\%.
Current camera image and signal processing pipelines (ISPs), including deep trained versions, tend to apply a single filter that is uniformly applied to the entire image. This despite the fact that most acquired camera images have spatially heterogeneous artifacts. This spatial heterogeneity manifests itself across the image space as varied Moire ringing, motion-blur, color-bleaching or lens based projection distortions. Moreover, combinations of these image artifacts can be present in small or large pixel neighborhoods, within an acquired image. Here, we present a deep reinforcement learning model that works in learned latent subspaces, recursively improves camera image quality through a patch-based spatially adaptive artifact filtering and image enhancement. Our RSE-RL model views the identification and correction of artifacts as a recursive self-learning and self-improvement exercise and consists of two major sub-modules: (i) The latent feature sub-space clustering/grouping obtained through an equivariant variational auto-encoder enabling rapid identification of the correspondence and discrepancy between noisy and clean image patches. (ii) The adaptive learned transformation controlled by a trust-region soft actor-critic agent that progressively filters and enhances the noisy patches using its closest feature distance neighbors of clean patches. Artificial artifacts that may be introduced in a patch-based ISP, are also removed through a reward based de-blocking recovery and image enhancement. We demonstrate the self-improvement feature of our model by recursively training and testing on images, wherein the enhanced images resulting from each epoch provide a natural data augmentation and robustness to the RSE-RL training-filtering pipeline.
We study the privacy implications of deploying recurrent neural networks in machine learning. We consider membership inference attacks (MIAs) in which an attacker aims to infer whether a given data record has been used in the training of a learning agent. Using existing MIAs that target feed-forward neural networks, we empirically demonstrate that the attack accuracy wanes for data records used earlier in the training history. Alternatively, recurrent networks are specifically designed to better remember their past experience; hence, they are likely to be more vulnerable to MIAs than their feed-forward counterparts. We develop a pair of MIA layouts for two primary applications of recurrent networks, namely, deep reinforcement learning and sequence-to-sequence tasks. We use the first attack to provide empirical evidence that recurrent networks are indeed more vulnerable to MIAs than feed-forward networks with the same performance level. We use the second attack to showcase the differences between the effects of overtraining recurrent and feed-forward networks on the accuracy of their respective MIAs. Finally, we deploy a differential privacy mechanism to resolve the privacy vulnerability that the MIAs exploit. For both attack layouts, the privacy mechanism degrades the attack accuracy from above 80% to 50%, which is equal to guessing the data membership uniformly at random, while trading off less than 10% utility.
In sequence-to-sequence language processing tasks, sentences with heterogeneous semantics or grammatical structures may increase the difficulty of convergence while training the network. To resolve this problem, we introduce a model that concentrates the each of the heterogeneous features in the input-output sequences. Build upon the encoder-decoder architecture, we design a latent-enhanced multi-filter seq2seq model (LMS2S) that analyzes the latent space representations using a clustering algorithm. The representations are generated from an encoder and a latent space enhancer. A cluster classifier is applied to group the representations into clusters. A soft actor-critic reinforcement learning algorithm is applied to the cluster classifier to enhance the clustering quality by maximizing the Silhouette score. Then, multiple filters are trained by the features only from their corresponding clusters, the heterogeneity of the training data can be resolved accordingly. Our experiments on semantic parsing and machine translation demonstrate the positive correlation between the clustering quality and the model's performance, as well as show the enhancement our model has made with respect to the ordinary encoder-decoder model.