Video moment retrieval and highlight detection are two highly valuable tasks in video understanding, but only recently have they been studied jointly. Although existing studies have made impressive advances, they predominantly follow a data-driven bottom-up paradigm. Such a paradigm overlooks task-specific and inter-task effects, resulting in poor model performance. In this paper, we propose a novel task-driven top-down framework, TaskWeave, for joint moment retrieval and highlight detection. The framework introduces a task-decoupled unit to capture task-specific and common representations. To investigate the interplay between the two tasks, we propose an inter-task feedback mechanism, which transforms the results of one task into guiding masks that assist the other task. Unlike existing methods, we present a task-dependent joint loss function to optimize the model. Comprehensive experiments and in-depth ablation studies on the QVHighlights, TVSum, and Charades-STA datasets corroborate the effectiveness and flexibility of the proposed framework. Code is available at https://github.com/EdenGabriel/TaskWeave.
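The inter-task feedback idea described above (results of one task turned into guiding masks for the other) can be illustrated with a minimal sketch. This is not the TaskWeave implementation: the function names, the 0.5 threshold, the soft `floor` weighting, and the use of scalar per-clip features are all assumptions made for illustration only.

```python
from typing import List, Tuple


def span_to_mask(span: Tuple[int, int], num_clips: int) -> List[float]:
    """Turn a predicted moment (start, end clip indices, inclusive) into a 0/1 mask."""
    start, end = span
    return [1.0 if start <= i <= end else 0.0 for i in range(num_clips)]


def saliency_to_mask(scores: List[float], threshold: float = 0.5) -> List[float]:
    """Turn per-clip highlight scores into a 0/1 mask (threshold is hypothetical)."""
    return [1.0 if s >= threshold else 0.0 for s in scores]


def apply_mask(features: List[float], mask: List[float], floor: float = 0.1) -> List[float]:
    """Down-weight clips outside the mask instead of zeroing them (assumed design)."""
    return [f * (floor + (1.0 - floor) * m) for f, m in zip(features, mask)]


# Feedback in both directions on a toy 4-clip video with scalar "features".
clip_features = [1.0, 2.0, 3.0, 4.0]
moment_mask = span_to_mask((1, 2), num_clips=4)          # retrieval -> highlight branch
highlight_mask = saliency_to_mask([0.2, 0.7, 0.9, 0.1])  # highlight -> retrieval branch
guided_for_highlight = apply_mask(clip_features, moment_mask)
guided_for_retrieval = apply_mask(clip_features, highlight_mask)
```

Using a soft floor rather than a hard zero keeps some signal from clips outside the mask, so an error in one task does not completely blind the other; whether the original method does this is not stated in the abstract.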
In the past decades, deep neural networks, particularly convolutional neural networks, have achieved state-of-the-art performance in a variety of medical image segmentation tasks. Recently, the introduction of the vision transformer (ViT) has significantly altered the landscape of deep segmentation models. There has been a growing focus on ViTs, driven by their excellent performance and scalability. However, we argue that the current design of the vision transformer-based UNet (ViT-UNet) segmentation models may not effectively handle the heterogeneous appearance (e.g., varying shapes and sizes) of objects of interest in medical image segmentation tasks. To tackle this challenge, we present a structured approach to introduce spatially dynamic components to the ViT-UNet. This adaptation enables the model to effectively capture features of target objects with diverse appearances. This is achieved by three main components: \textbf{(i)} deformable patch embedding; \textbf{(ii)} spatially dynamic multi-head attention; \textbf{(iii)} deformable positional encoding. These components were integrated into a novel architecture, termed AgileFormer. AgileFormer is a spatially agile ViT-UNet designed for medical image segmentation. Experiments in three segmentation tasks using publicly available datasets demonstrated the effectiveness of the proposed method. The code is available at \href{https://github.com/sotiraslab/AgileFormer}{https://github.com/sotiraslab/AgileFormer}.
Hierarchical transformers have achieved significant success in medical image segmentation due to their large receptive field and capabilities of effectively leveraging global long-range contextual information. Convolutional neural networks (CNNs) can also deliver a large receptive field by using large kernels, enabling them to achieve competitive performance with fewer model parameters. However, CNNs incorporated with large convolutional kernels remain constrained in adaptively capturing multi-scale features from organs with large variations in shape and size due to the employment of fixed-sized kernels. Additionally, they are unable to utilize global contextual information efficiently. To address these limitations, we propose Dynamic Large Kernel (DLK) and Dynamic Feature Fusion (DFF) modules. The DLK module employs multiple large kernels with varying kernel sizes and dilation rates to capture multi-scale features. Subsequently, a dynamic selection mechanism is utilized to adaptively highlight the most important spatial features based on global information. Additionally, the DFF module is proposed to adaptively fuse multi-scale local feature maps based on their global information. We integrate DLK and DFF in a hierarchical transformer architecture to develop a novel architecture, termed D-Net. D-Net is able to effectively utilize a multi-scale large receptive field and adaptively harness global contextual information. Extensive experimental results demonstrate that D-Net outperforms other state-of-the-art models in the two volumetric segmentation tasks, including abdominal multi-organ segmentation and multi-modality brain tumor segmentation. Our code is available at https://github.com/sotiraslab/DLK.
U-Net has been widely used for segmenting abdominal organs, achieving promising performance. However, when it is used for multi-organ segmentation, first, it may be limited in exploiting global long-range contextual information due to the implementation of standard convolutions. Second, the use of spatial-wise downsampling (e.g., max pooling or strided convolutions) in the encoding path may lead to the loss of deformable or discriminative details. Third, features upsampled from the higher level are concatenated with those that are preserved via skip connections. However, repeated downsampling and upsampling operations lead to misalignments between them, and their concatenation degrades segmentation performance. To address these limitations, we propose Dynamically Calibrated Convolution (DCC), Dynamically Calibrated Downsampling (DCD), and Dynamically Calibrated Upsampling (DCU) modules, respectively. The DCC module can utilize global inter-dependencies between spatial and channel features to calibrate these features adaptively. The DCD module enables networks to adaptively preserve deformable or discriminative features during downsampling. The DCU module can dynamically align and calibrate upsampled features to eliminate misalignments before concatenation. We integrated the proposed modules into a standard U-Net, resulting in a new architecture, termed Dynamic U-Net. This architectural design enables U-Net to dynamically adjust features for different organs. We evaluated Dynamic U-Net in two abdominal multi-organ segmentation benchmarks. Dynamic U-Net achieved statistically improved segmentation accuracy compared with standard U-Net. Our code is available at https://github.com/sotiraslab/DynamicUNet.
Recent years have witnessed exponential growth in the development of deep learning (DL) models for time-series electricity forecasting in power systems. However, most of the proposed models are designed based on the designers' inherent knowledge and experience, without elaborating on the suitability of the proposed neural architectures. Moreover, these models cannot adjust themselves to dynamically changing data patterns because of the inflexible design of their structures. Although several recent studies have applied the neural architecture search (NAS) technique to obtain a network with an optimized structure in the electricity forecasting sector, their training process is time-consuming, computationally expensive, and not intelligent, indicating that the application of NAS to electricity forecasting is still in its infancy. In this research study, we propose an intelligent automated architecture search (IAAS) framework for the development of time-series electricity forecasting models. The proposed framework contains two primary components, i.e., a network function-preserving transformation operation and reinforcement learning (RL)-based network transformation control. In the first component, we introduce a theoretical function-preserving transformation of recurrent neural networks (RNN) to the literature for capturing the hidden temporal patterns within the time-series data. In the second component, we develop three RL-based transformation actors and a net pool to intelligently and effectively search for a high-quality neural architecture. After conducting comprehensive experiments on two publicly available electricity load datasets and two wind power datasets, we demonstrate that the proposed IAAS framework significantly outperforms ten existing models or methods in terms of forecasting accuracy and stability.
We present a new particle tracking algorithm to accurately resolve large deformation and rotational motion fields, which takes advantage of both local and global particle tracking algorithms. We call this method the ScalE and Rotation Invariant Augmented Lagrangian Particle Tracking (SerialTrack). This method builds an iterative scale and rotation invariant topology-based feature for each particle within a multi-scale tracking algorithm. The global kinematic compatibility condition is applied as a global augmented Lagrangian constraint to enhance the tracking accuracy. An open source software package implementing this numerical approach to track both 2D and 3D, incremental and cumulative deformation fields is provided.
This work focuses on an efficient Agile design methodology for domain-specific accelerators. We employ feature-by-feature enhancement of a vertical development stack and apply it to the TVM/VTA inference accelerator. We have enhanced the VTA design space and enabled end-to-end support for additional workloads. This has been accomplished by augmenting the VTA micro-architecture and instruction set architecture (ISA), as well as by enhancing the TVM compilation stack to support a wide range of VTA configs. The VTA tsim implementation (CHISEL-based) has been enhanced with fully pipelined versions of the ALU/GEMM execution units. In tsim, memory width can now range from 8 to 64 bytes. Field widths have been made more flexible to support larger scratchpads. New instructions have been added: element-wise 8-bit multiplication to support depthwise convolution, and load with a choice of pad values to support max pooling. Support for more layers and better double buffering has also been added. Fully pipelining ALU/GEMM helps significantly: 4.9x fewer cycles with minimal area change to run ResNet-18 under the default config. Configs featuring a further 11.5x decrease in cycle count at a cost of 12x greater area can be instantiated. Many points on the area-performance Pareto curve are shown, showcasing the balance of execution unit sizing, memory interface width, and scratchpad sizing. Finally, VTA is now able to run Mobilenet 1.0 and all layers for ResNets, including the previously disabled pooling and fully connected layers. The TVM/VTA architecture has always featured end-to-end workload evaluation on RTL in minutes. With our modifications, it now offers a much greater number of feasible configurations with a wide range of cost vs. performance. All capabilities mentioned are available in open-source forks, while a subset of these capabilities has already been upstreamed.
Concise and abstract models of system-level behaviors are invaluable in design analysis, testing, and validation. In this paper, we consider the problem of inferring models from communication traces of system-on-chip (SoC) designs. The traces capture communications among different blocks of an SoC design in terms of messages exchanged. The extracted models characterize the system-level communication protocols governing how blocks exchange messages and coordinate with each other to realize various system functions. The problem is formulated as a constraint satisfaction problem, which is then fed to an SMT solver. The solutions returned by the SMT solver are used to extract the models that accept the input traces. In the experiments, we demonstrate the proposed approach with traces collected from a transaction-level simulation model of a multicore SoC design and traces of a more detailed multicore SoC design developed in the GEM5 environment.
This paper aims to improve the efficiency of deep reinforcement learning (DRL) based methods for robot manipulator trajectory planning in random working environments. In contrast to the traditional sparse reward function, we present three dense reward functions. First, a posture reward function is proposed to accelerate learning and yield more reasonable trajectories by modeling distance and direction constraints, which reduces the blindness of exploration. Second, to improve stability, a stride reward function is proposed that models constraints on the distance and on the movement distance of the joints. Third, to further improve learning efficiency, we draw on the cognitive process of human behavior and propose a stage incentive mechanism, comprising a hard stage incentive reward function and a soft stage incentive reward function. Extensive experiments show that the proposed soft stage incentive reward function improves the convergence rate by up to 46.9% with state-of-the-art DRL methods. The converged mean reward increases by 4.4% to 15.5%, and the standard deviation decreases by 21.9% to 63.2%. In the evaluation, the success rate of trajectory planning for the robot manipulator reaches up to 99.6%.
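To make the contrast between sparse and dense rewards concrete, the following is a rough sketch of what a posture-style and a stride-style reward could look like. The abstract does not give the formulas, so everything here is an assumption: the function names, the weights `w_dist`, `w_dir`, `w_move`, the tolerance `tol`, and the interpretation of "distance" as end-effector distance to the target and "direction" as the cosine between the last step and the direction to the target.

```python
import math
from typing import Sequence


def sparse_reward(ee_pos: Sequence[float], target: Sequence[float], tol: float = 0.01) -> float:
    """Traditional sparse reward: signal only when the target is reached."""
    return 1.0 if math.dist(ee_pos, target) <= tol else 0.0


def posture_reward(ee_pos, prev_ee_pos, target, w_dist: float = 1.0, w_dir: float = 0.5) -> float:
    """Dense reward from distance to target plus alignment of the last step (assumed form)."""
    to_target = [t - p for t, p in zip(target, prev_ee_pos)]
    step = [c - p for c, p in zip(ee_pos, prev_ee_pos)]
    norm = math.hypot(*to_target) * math.hypot(*step)
    # Cosine between the step taken and the direction to the target; 0 if undefined.
    cos = sum(a * b for a, b in zip(to_target, step)) / norm if norm > 0 else 0.0
    return -w_dist * math.dist(ee_pos, target) + w_dir * cos


def stride_reward(ee_pos, target, joints, prev_joints,
                  w_dist: float = 1.0, w_move: float = 0.1) -> float:
    """Dense reward penalising distance to target and large joint movements (assumed form)."""
    movement = sum(abs(a - b) for a, b in zip(joints, prev_joints))
    return -w_dist * math.dist(ee_pos, target) - w_move * movement


# One step straight toward the target: the dense rewards are informative, the sparse one is 0.
r_sparse = sparse_reward((1.0, 0.0, 0.0), (2.0, 0.0, 0.0))
r_posture = posture_reward((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 0.0, 0.0))
r_stride = stride_reward((1.0, 0.0, 0.0), (2.0, 0.0, 0.0), [0.1, 0.2], [0.0, 0.0])
```

The point of the sketch is only that a dense reward changes at every step, so the agent receives feedback long before it reaches the goal, whereas the sparse reward stays at zero throughout most of an episode.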
With the growing dependence on wind power generation, improving the accuracy of short-term forecasting has become increasingly important for ensuring continued economical and reliable system operations. In the wind power forecasting field, ensemble-based forecasting models have been studied extensively; however, few of them have considered learning features from both historical wind data and NWP data. In addition, the exploration of multiple-input and multiple-output learning structures is lacking in the wind power forecasting literature. Therefore, this study uses NWP and historical wind data as input and proposes a two-stage forecasting framework built on a moving window algorithm. Specifically, in the first stage, four forecasting models are constructed with deep neural networks considering the multiple-input and multiple-output structures; in the second stage, an ensemble model is developed using the ridge regression method to reduce the extrapolation error. The experiments are conducted on three existing wind farms to examine the 2-h ahead forecasting point. The results demonstrate that 1) the single-input-multiple-output (SIMO) structure leads to better forecasting accuracy than the other three structures; 2) the ridge regression method yields a better ensemble model, which further improves the forecasting accuracy, than the other machine learning methods; and 3) the proposed two-stage forecasting framework is likely to generate more accurate and stable results than the other existing algorithms.
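The second stage described above, combining first-stage forecasts with ridge regression, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: it omits the intercept, the moving window, and the neural first stage, and the names `fit_ridge_ensemble`, `ensemble_predict`, and the penalty `lam` are hypothetical. It solves the standard ridge normal equations (XᵀX + λI)w = Xᵀy directly.

```python
from typing import List


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ridge_ensemble(base_forecasts: List[List[float]], targets: List[float],
                       lam: float = 1.0) -> List[float]:
    """Ridge weights for combining base-model forecasts (one row per time step)."""
    k = len(base_forecasts[0])
    gram = [[sum(row[i] * row[j] for row in base_forecasts) + (lam if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * y for row, y in zip(base_forecasts, targets)) for i in range(k)]
    return solve_linear(gram, rhs)


def ensemble_predict(weights: List[float], forecasts: List[float]) -> float:
    """Weighted combination of the base forecasts for one time step."""
    return sum(w * f for w, f in zip(weights, forecasts))


# Toy data: two base models, three time steps.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
observed = [1.0, 2.0, 3.0]
weights = fit_ridge_ensemble(history, observed, lam=0.0)     # unregularised fit
shrunk = fit_ridge_ensemble(history, observed, lam=10.0)     # penalty shrinks the weights
```

Increasing `lam` pulls the combination weights toward zero, which is the mechanism by which ridge regression limits the influence of any single, possibly overconfident, base model.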