In this paper, we address the challenges posed by the substantial training time and memory consumption associated with video transformers, focusing on the ViViT (Video Vision Transformer) model, in particular the Factorised Encoder version, as our baseline for action recognition tasks. The factorised encoder variant follows the late-fusion approach that is adopted by many state of the art approaches. Despite standing out for its favorable speed/accuracy tradeoffs among the different variants of ViViT, its considerable training time and memory requirements still pose a significant barrier to entry. Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training. This leads to a low accuracy model if naively done. But we show that by (1) appropriately initializing the temporal transformer (a module responsible for processing temporal information) (2) introducing a compact adapter model connecting frozen spatial representations ((a module that selectively focuses on regions of the input image) to the temporal transformer, we can enjoy the benefits of freezing the spatial transformer without sacrificing accuracy. Through extensive experimentation over 6 benchmarks, we demonstrate that our proposed training strategy significantly reduces training costs (by $\sim 50\%$) and memory consumption while maintaining or slightly improving performance by up to 1.79\% compared to the baseline model. Our approach additionally unlocks the capability to utilize larger image transformer models as our spatial transformer and access more frames with the same memory consumption.
LIDAR-based 3D object detection and classification is crucial for autonomous driving. However, inference in real-time from extremely sparse 3D data poses a formidable challenge. To address this issue, a common approach is to project point clouds onto a bird's-eye or perspective view, effectively converting them into an image-like data format. However, this excessive compression of point cloud data often leads to the loss of information. This paper proposes a 3D object detector based on voxel and projection double branch feature extraction (PV-SSD) to address the problem of information loss. We add voxel features input containing rich local semantic information, which is fully fused with the projected features in the feature extraction stage to reduce the local information loss caused by projection. A good performance is achieved compared to the previous work. In addition, this paper makes the following contributions: 1) a voxel feature extraction method with variable receptive fields is proposed; 2) a feature point sampling method by weight sampling is used to filter out the feature points that are more conducive to the detection task; 3) the MSSFA module is proposed based on the SSFA module. To verify the effectiveness of our method, we designed comparison experiments.
Predicting pedestrians' trajectories is a crucial capability for autonomous vehicles' safe navigation, especially in spaces shared with pedestrians. Pedestrian motion in shared spaces is influenced by both the presence of vehicles and other pedestrians. Therefore, effectively modelling both pedestrian-pedestrian and pedestrian-vehicle interactions can increase the accuracy of the pedestrian trajectory prediction models. Despite the huge literature on ways to encode the effect of interacting agents on a pedestrian's predicted trajectory using deep-learning models, limited effort has been put into the effective selection of interacting agents. In the majority of cases, the interaction features used are mainly based on relative distances while paying less attention to the effect of the velocity and approaching direction in the interaction formulation. In this paper, we propose a heuristic-based process of selecting the interacting agents based on collision risk calculation. Focusing on interactions of potentially colliding agents with a target pedestrian, we propose the use of time-to-collision and the approach direction angle of two agents for encoding the interaction effect. This is done by introducing a novel polar collision grid map. Our results have shown predicted trajectories closer to the ground truth compared to existing methods (used as a baseline) on the HBS dataset.
Modern deep neural networks, particularly recent large language models, come with massive model sizes that require significant computational and storage resources. To enable the deployment of modern models on resource-constrained environments and accelerate inference time, researchers have increasingly explored pruning techniques as a popular research direction in neural network compression. However, there is a dearth of up-to-date comprehensive review papers on pruning. To address this issue, in this survey, we provide a comprehensive review of existing research works on deep neural network pruning in a taxonomy of 1) universal/specific speedup, 2) when to prune, 3) how to prune, and 4) fusion of pruning and other compression techniques. We then provide a thorough comparative analysis of seven pairs of contrast settings for pruning (e.g., unstructured/structured) and explore emerging topics, including post-training pruning, different levels of supervision for pruning, and broader applications (e.g., adversarial robustness) to shed light on the commonalities and differences of existing methods and lay the foundation for further method development. To facilitate future research, we build a curated collection of datasets, networks, and evaluations on different applications. Finally, we provide some valuable recommendations on selecting pruning methods and prospect promising research directions. We build a repository at https://github.com/hrcheng1066/awesome-pruning.
Social media posts contain an abundant amount of information about public opinion on major events, especially natural disasters such as hurricanes. Posts related to an event, are usually published by the users who live near the place of the event at the time of the event. Special correlation between the social media data and the events can be obtained using data mining approaches. This paper presents research work to find the mappings between social media data and the severity level of a disaster. Specifically, we have investigated the Twitter data posted during hurricanes Harvey and Irma, and attempted to find the correlation between the Twitter data of a specific area and the hurricane level in that area. Our experimental results indicate a positive correlation between them. We also present a method to predict the hurricane category for a specific area using relevant Twitter data.
With the increasing number of IoT devices, there is a growing demand for energy-free sensors. Human activity recognition holds immense value in numerous daily healthcare applications. However, the majority of current sensing modalities consume energy, thus limiting their sustainable adoption. In this paper, we present a novel activity recognition system that not only operates without requiring energy for sensing but also harvests energy. Our proposed system utilizes photovoltaic cells, attached to the wrist and shoes, as eco-friendly sensing devices for activity recognition. By capturing photovoltaic readings and employing a deep transformer model with powerful learning capabilities, the system effectively recognizes user activities. To ensure robust performance across various subjects, time periods, and lighting conditions, the system incorporates feature extraction and different processing modules. The evaluation of the proposed system on realistic indoor and outdoor environments demonstrated its ability to recognize activities with an accuracy of 91.7%.
While tumor dynamic modeling has been widely applied to support the development of oncology drugs, there remains a need to increase predictivity, enable personalized therapy, and improve decision-making. We propose the use of Tumor Dynamic Neural-ODE (TDNODE) as a pharmacology-informed neural network to enable model discovery from longitudinal tumor size data. We show that TDNODE overcomes a key limitation of existing models in its ability to make unbiased predictions from truncated data. The encoder-decoder architecture is designed to express an underlying dynamical law which possesses the fundamental property of generalized homogeneity with respect to time. Thus, the modeling formalism enables the encoder output to be interpreted as kinetic rate metrics, with inverse time as the physical unit. We show that the generated metrics can be used to predict patients' overall survival (OS) with high accuracy. The proposed modeling formalism provides a principled way to integrate multimodal dynamical datasets in oncology disease modeling.
The rise of Alzheimers Disease worldwide has prompted a search for efficient tools which can be used to predict deterioration in cognitive decline leading to dementia. In this paper, we explore the potential of survival machine learning as such a tool for building models capable of predicting not only deterioration but also the likely time to deterioration. We demonstrate good predictive ability (0.86 C-Index), lending support to its use in clinical investigation and prediction of Alzheimers Disease risk.
Fluorescence lifetime imaging (FLI) has been receiving increased attention in recent years as a powerful imaging technique in biological and medical research. However, existing FLI systems often suffer from a tradeoff between processing speed, accuracy, and robustness. In this paper, we propose a SPAD TCSPC system coupled to a recurrent neural network (RNN) for FLI that accurately estimates on the fly fluorescence lifetime directly from raw timestamps instead of histograms, which drastically reduces the data transfer rate and hardware resource utilization. We train two variants of the RNN on a synthetic dataset and compare the results to those obtained using the center-of-mass method (CMM) and least squares fitting (LS fitting) methods. The results demonstrate that two RNN variants, gated recurrent unit (GRU) and long short-term memory (LSTM), are comparable to CMM and LS fitting in terms of accuracy and outperform CMM and LS fitting by a large margin in the presence of background noise. We also look at the Cramer-Rao lower bound and detailed analysis showed that the RNN models are close to the theoretical optima. The analysis of experimental data shows that our model, which is purely trained on synthetic datasets, works well on real-world data. We build a FLI microscope setup for evaluation based on Piccolo, a 32$\times$32 SPAD sensor developed in our lab. Four quantized GRU cores, capable of processing up to 4 million photons per second, are deployed on a Xilinx Kintex-7 FPGA. Powered by the GRU, the FLI setup can retrieve real-time fluorescence lifetime images at up to 10 frames per second. The proposed FLI system is promising for many important biomedical applications, ranging from biological imaging of fast-moving cells to fluorescence-assisted diagnosis and surgery.
Machine learning (ML) holds great potential for accurately forecasting treatment outcomes over time, which could ultimately enable the adoption of more individualized treatment strategies in many practical applications. However, a significant challenge that has been largely overlooked by the ML literature on this topic is the presence of informative sampling in observational data. When instances are observed irregularly over time, sampling times are typically not random, but rather informative -- depending on the instance's characteristics, past outcomes, and administered treatments. In this work, we formalize informative sampling as a covariate shift problem and show that it can prohibit accurate estimation of treatment outcomes if not properly accounted for. To overcome this challenge, we present a general framework for learning treatment outcomes in the presence of informative sampling using inverse intensity-weighting, and propose a novel method, TESAR-CDE, that instantiates this framework using Neural CDEs. Using a simulation environment based on a clinical use case, we demonstrate the effectiveness of our approach in learning under informative sampling.