Large Language Models (LLMs) have seen great advance in both academia and industry, and their popularity results in numerous open-source frameworks and techniques in accelerating LLM pre-training, fine-tuning, and inference. Training and deploying LLMs are expensive as it requires considerable computing resources and memory, hence many efficient approaches have been developed for improving system pipelines as well as operators. However, the runtime performance can vary significantly across hardware and software stacks, which makes it difficult to choose the best configuration. In this work, we aim to benchmark the performance from both macro and micro perspectives. First, we benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs in different sizes , i.e., 7, 13, and 70 billion parameters (7B, 13B, and 70B) on three 8-GPU platforms with and without individual optimization techniques, including ZeRO, quantization, recomputation, FlashAttention. Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs. For end users, our benchmark and findings help better understand different optimization techniques, training and inference frameworks, together with hardware platforms in choosing configurations for deploying LLMs. For researchers, our in-depth module-wise analyses discover potential opportunities for future work to further optimize the runtime performance of LLMs.
In this work, a parameter-efficient attention module is presented for emotion classification using a limited, or relatively small, number of electroencephalogram (EEG) signals. This module is called the Monotonicity Constrained Attention Module (MCAM) due to its capability of incorporating priors on the monotonicity when converting features' Gram matrices into attention matrices for better feature refinement. Our experiments have shown that MCAM's effectiveness is comparable to state-of-the-art attention modules in boosting the backbone network's performance in prediction while requiring less parameters. Several accompanying sensitivity analyses on trained models' prediction concerning different attacks are also performed. These attacks include various frequency domain filtering levels and gradually morphing between samples associated with multiple labels. Our results can help better understand different modules' behaviour in prediction and can provide guidance in applications where data is limited and are with noises.
Electromagnetic (EM) imaging is widely applied in sensing for security, biomedicine, geophysics, and various industries. It is an ill-posed inverse problem whose solution is usually computationally expensive. Machine learning (ML) techniques and especially deep learning (DL) show potential in fast and accurate imaging. However, the high performance of purely data-driven approaches relies on constructing a training set that is statistically consistent with practical scenarios, which is often not possible in EM imaging tasks. Consequently, generalizability becomes a major concern. On the other hand, physical principles underlie EM phenomena and provide baselines for current imaging techniques. To benefit from prior knowledge in big data and the theoretical constraint of physical laws, physics embedded ML methods for EM imaging have become the focus of a large body of recent work. This article surveys various schemes to incorporate physics in learning-based EM imaging. We first introduce background on EM imaging and basic formulations of the inverse problem. We then focus on three types of strategies combining physics and ML for linear and nonlinear imaging and discuss their advantages and limitations. Finally, we conclude with open challenges and possible ways forward in this fast-developing field. Our aim is to facilitate the study of intelligent EM imaging methods that will be efficient, interpretable and controllable.
Depth estimation is one of the key technologies in some fields such as autonomous driving and robot navigation. However, the traditional method of using a single sensor is inevitably limited by the performance of the sensor. Therefore, a precision and robust method for fusing the LiDAR and stereo cameras is proposed. This method fully combines the advantages of the LiDAR and stereo camera, which can retain the advantages of the high precision of the LiDAR and the high resolution of images respectively. Compared with the traditional stereo matching method, the texture of the object and lighting conditions have less influence on the algorithm. Firstly, the depth of the LiDAR data is converted to the disparity of the stereo camera. Because the density of the LiDAR data is relatively sparse on the y-axis, the converted disparity map is up-sampled using the interpolation method. Secondly, in order to make full use of the precise disparity map, the disparity map and stereo matching are fused to propagate the accurate disparity. Finally, the disparity map is converted to the depth map. Moreover, the converted disparity map can also increase the speed of the algorithm. We evaluate the proposed pipeline on the KITTI benchmark. The experiment demonstrates that our algorithm has higher accuracy than several classic methods.
Artificial intelligence (AI) has been widely applied to music generation topics such as continuation, melody/harmony generation, genre transfer and music infilling application. Although with the burst interest to apply AI to music, there are still few interfaces for the musicians to take advantage of the latest progress of the AI technology. This makes those tools less valuable in practice and harder to find its advantage/drawbacks without utilizing them in the real scenario. This work builds a max patch for interactive music infilling application with different levels of control, including track density/polyphony/occupation rate and bar tonal tension control. The user can select the melody/bass/harmony track as the infilling content up to 16 bars. The infilling algorithm is based on the author's previous work, and the interface sends/receives messages to the AI system hosted in the cloud. This interface lowers the barrier of AI technology and can generate different variations of the selected content. Those results can give several alternatives to the musicians' composition, and the interactive process realizes the value of the AI infilling system.
We present a novel music generation framework for music infilling, with a user friendly interface. Infilling refers to the task of generating musical sections given the surrounding multi-track music. The proposed transformer-based framework is extensible for new control tokens as the added music control tokens such as tonal tension per bar and track polyphony level in this work. We explore the effects of including several musically meaningful control tokens, and evaluate the results using objective metrics related to pitch and rhythm. Our results demonstrate that adding additional control tokens helps to generate music with stronger stylistic similarities to the original music. It also provides the user with more control to change properties like the music texture and tonal tension in each bar compared to previous research which only provided control for track density. We present the model in a Google Colab notebook to enable interactive generation.
We propose a method that learns to camouflage 3D objects within scenes. Given an object's shape and a distribution of viewpoints from which it will be seen, we estimate a texture that will make it difficult to detect. Successfully solving this task requires a model that can accurately reproduce textures from the scene, while simultaneously dealing with the highly conflicting constraints imposed by each viewpoint. We address these challenges with a model based on texture fields and adversarial learning. Our model learns to camouflage a variety of object shapes from randomly sampled locations and viewpoints within the input scene, and is the first to address the problem of hiding complex object shapes. Using a human visual search study, we find that our estimated textures conceal objects significantly better than previous methods. Project site: https://rrrrrguo.github.io/ganmouflage/
Collaborative localization is an essential capability for a team of robots such as connected vehicles to collaboratively estimate object locations from multiple perspectives with reliant cooperation. To enable collaborative localization, four key challenges must be addressed, including modeling complex relationships between observed objects, fusing observations from an arbitrary number of collaborating robots, quantifying localization uncertainty, and addressing latency of robot communications. In this paper, we introduce a novel approach that integrates uncertainty-aware spatiotemporal graph learning and model-based state estimation for a team of robots to collaboratively localize objects. Specifically, we introduce a new uncertainty-aware graph learning model that learns spatiotemporal graphs to represent historical motions of the objects observed by each robot over time and provides uncertainties in object localization. Moreover, we propose a novel method for integrated learning and model-based state estimation, which fuses asynchronous observations obtained from an arbitrary number of robots for collaborative localization. We evaluate our approach in two collaborative object localization scenarios in simulations and on real robots. Experimental results show that our approach outperforms previous methods and achieves state-of-the-art performance on asynchronous collaborative localization.
Purpose: To develop and evaluate MyoMapNet, a rapid myocardial T1 mapping approach that uses neural networks (NN) to estimate voxel-wise myocardial T1 and extracellular (ECV) from T1-weighted images collected after a single inversion pulse over 4-5 heartbeats. Method: MyoMapNet utilizes a simple fully-connected NN to estimate T1 values from 5 (native) or 4 (post-contrast) T1-weighted images. Native MOLLI-5(3)3 T1 was collected in 717 subjects (386 males, 55$\pm$16.5 years) and post-contrast MOLLI-4(1)3(1)2 in 535 subjects (232 male, 56.5$\pm$15 years). The dataset was divided into training (80%) and testing (20%), where 20% of the training set was used to optimize MyoMapNet architecture (size and loss functions). We used MyoMapNet to estimate T1 and ECV maps with the first 5 (native) or 4 (post-contrast) T1-weighted images from the corresponding MOLLI sequence compared to the conventional and an abbreviated MOLLI using similar number of T1-weighted images with 3-parameter curve-fitting. Results: In our preliminary optimizaiton step, we determined that a 5-layers NN trained using mean-absolute-error loss yields lower estimation errors and was used subsequently in independent testing study. The myocardial T1 by MyoMapNet was similar to MOLLI (1200$\pm$45ms vs. 1199$\pm$46ms; P=0.3 for native T1, and 27.3$\pm$3.5% vs. 27.1$\pm$4%; P=0.4 for ECV). MyoMapNet had significantly smaller errors in T1 estimations compared to abbreviated-MOLLI (1$\pm$17ms vs. 31$\pm$34ms, P<0.01 for in native T1, and 0.1$\pm$1.3% vs. 1.9$\pm$2.5%, P<0.01 for ECV). The duration of T1 estimation was approximately 2 ms per slice using MyoMapNet. Conclusion: MyoMapNet T1 mapping enables myocardial T1 quantification in 4-5 heartbeats with near-instantaneous map estimation time with similar accuracy and precision as MOLLI. Keywords: Myocardial T1 mapping, MOLLI, T1 reconstruction, Neural network, Deep Learning.
The recent advancement in computational and communication systems has led to the introduction of high-performing neural networks and high-speed wireless vehicular communication networks. As a result, new technologies such as cooperative perception and cognition have emerged, addressing the inherent limitations of sensory devices by providing solutions for the detection of partially occluded targets and expanding the sensing range. However, designing a reliable cooperative cognition or perception system requires addressing the challenges caused by limited network resources and discrepancies between the data shared by different sources. In this paper, we examine the requirements, limitations, and performance of different cooperative perception techniques, and present an in-depth analysis of the notion of Deep Feature Sharing (DFS). We explore different cooperative object detection designs and evaluate their performance in terms of average precision. We use the Volony dataset for our experimental study. The results confirm that the DFS methods are significantly less sensitive to the localization error caused by GPS noise. Furthermore, the results attest that detection gain of DFS methods caused by adding more cooperative participants in the scenes is comparable to raw information sharing technique while DFS enables flexibility in design toward satisfying communication requirements.