Muhammad Usama

Action Segmentation Using 2D Skeleton Heatmaps

Sep 19, 2023
Syed Waleed Hyder, Muhammad Usama, Anas Zafar, Muhammad Naufil, Fawad Javed Fateh, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran

This paper presents a 2D skeleton-based action segmentation method with applications in fine-grained human activity recognition. In contrast with state-of-the-art methods, which directly take sequences of 3D skeleton coordinates as inputs and apply Graph Convolutional Networks (GCNs) for spatiotemporal feature learning, our main idea is to use sequences of 2D skeleton heatmaps as inputs and employ Temporal Convolutional Networks (TCNs) to extract spatiotemporal features. Despite lacking 3D information, our approach yields comparable or superior performance and better robustness against missing keypoints than previous methods on action segmentation datasets. Moreover, we further improve performance by using both 2D skeleton heatmaps and RGB videos as inputs. To the best of our knowledge, this is the first work to utilize 2D skeleton heatmap inputs and the first to explore 2D skeleton+RGB fusion for action segmentation.
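
As a rough illustration of the idea described above, the sketch below flattens each frame's stacked 2D keypoint heatmaps and passes the sequence through a dilated temporal convolutional network that predicts a per-frame action label. The layer sizes, heatmap resolution, and joint count are assumptions for illustration, not the authors' actual architecture.

import torch
import torch.nn as nn

class HeatmapTCN(nn.Module):
    """Per-frame action labels from a sequence of 2D skeleton heatmaps.
    Input: (batch, frames, joints, H, W) heatmaps; output: (batch, frames, classes)."""
    def __init__(self, joints=17, hw=(64, 48), num_classes=10, channels=128, layers=4):
        super().__init__()
        self.proj = nn.Linear(joints * hw[0] * hw[1], channels)  # frame-wise embedding
        self.tcn = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=2 ** i, dilation=2 ** i)           # dilated temporal conv
            for i in range(layers)
        ])
        self.head = nn.Linear(channels, num_classes)

    def forward(self, heatmaps):
        x = self.proj(heatmaps.flatten(2))                       # (b, t, channels)
        x = x.transpose(1, 2)                                    # (b, channels, t)
        for conv in self.tcn:
            x = torch.relu(conv(x)) + x                          # residual temporal block
        return self.head(x.transpose(1, 2))                      # (b, t, classes)

# Example: a 2-clip batch of 100 frames, 17 joints, 64x48 heatmaps
logits = HeatmapTCN()(torch.randn(2, 100, 17, 64, 48))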

Sparks of Large Audio Models: A Survey and Outlook

Sep 03, 2023
Siddique Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Yi Ren, Heriberto Cuayáhuitl, Wenwu Wang, Xulong Zhang, Roberto Togneri, Björn W. Schuller

This survey provides a comprehensive overview of recent advancements and challenges in applying large language models to audio signal processing. Audio processing, with its diverse signal representations and wide range of sources -- from human voices to musical instruments and environmental sounds -- poses challenges distinct from those found in traditional Natural Language Processing scenarios. Nevertheless, Large Audio Models, epitomized by transformer-based architectures, have shown marked efficacy in this sphere. By leveraging massive amounts of data, these models have demonstrated prowess across a variety of audio tasks, spanning Automatic Speech Recognition, Text-To-Speech, and Music Generation, among others. Notably, foundational audio models such as SeamlessM4T have recently started to act as universal translators, supporting multiple speech tasks for up to 100 languages without relying on separate task-specific systems. This paper presents an in-depth analysis of state-of-the-art methodologies for Foundational Large Audio Models, their performance benchmarks, and their applicability to real-world scenarios. We also highlight current limitations and provide insights into potential future research directions in the realm of Large Audio Models, with the intent to spark further discussion and thereby foster innovation in the next generation of audio-processing systems. Furthermore, to keep pace with the rapid development in this area, we will continually update the repository with relevant recent articles and their open-source implementations at https://github.com/EmulationAI/awesome-large-audio-models.
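
As a small, concrete example of the kind of model the survey covers, the snippet below runs automatic speech recognition with a pretrained transformer-based audio model through the Hugging Face pipeline API; the checkpoint name and audio path are placeholders chosen for illustration, not models evaluated in the paper.

from transformers import pipeline

# Whisper is one example of a transformer-based large audio model for ASR.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("example_utterance.wav")   # placeholder audio file path
print(result["text"])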

* work in progress, Repo URL: https://github.com/EmulationAI/awesome-large-audio-models 

Can Large Language Models Aid in Annotating Speech Emotional Data? Uncovering New Frontiers

Jul 12, 2023
Siddique Latif, Muhammad Usama, Mohammad Ibrahim Malik, Björn W. Schuller

Despite recent advancements in speech emotion recognition (SER) models, state-of-the-art deep learning (DL) approaches face the challenge of limited availability of annotated data. Large language models (LLMs) have revolutionised our understanding of natural language, introducing emergent properties that broaden comprehension in language, speech, and vision. This paper examines the potential of LLMs to annotate abundant speech data, aiming to enhance the state of the art in SER. We evaluate this capability across various settings using publicly available speech emotion classification datasets. Leveraging ChatGPT, we experimentally demonstrate the promising role of LLMs in speech emotion data annotation. Our evaluation encompasses single-shot and few-shot scenarios, revealing performance variability in SER. Notably, we achieve improved results through data augmentation, incorporating ChatGPT-annotated samples into existing datasets. Our work uncovers new frontiers in speech emotion classification, highlighting the increasing significance of LLMs in this field moving forward.
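
A hedged sketch of what LLM-based annotation could look like in practice is given below; the prompt, label set, and model name are illustrative assumptions, not the authors' exact setup. Transcripts of unlabelled utterances are sent to a chat model, which returns one emotion label each, and the resulting pseudo-labelled samples can then be merged into the training set for augmentation.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
LABELS = ["angry", "happy", "neutral", "sad"]  # illustrative label set

def annotate_emotion(transcript: str) -> str:
    """Ask a chat model for a single emotion label for one utterance transcript."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": f"Label the emotion of the utterance with one of: {', '.join(LABELS)}."},
            {"role": "user", "content": transcript},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

pseudo_label = annotate_emotion("I can't believe you did that again!")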

* Under review 

Emotions Beyond Words: Non-Speech Audio Emotion Recognition With Edge Computing

May 01, 2023
Ibrahim Malik, Siddique Latif, Sanaullah Manzoor, Muhammad Usama, Junaid Qadir, Raja Jurdak

Non-speech emotion recognition has a wide range of applications including healthcare, crime control and rescue, and entertainment, to name a few. Providing these applications via edge computing has great potential; however, recent studies have focused on speech emotion recognition using complex architectures. In this paper, a non-speech-based emotion recognition system is proposed that can rely on edge computing to analyse emotions conveyed through non-speech expressions such as screaming and crying. In particular, we explore knowledge distillation to design a computationally efficient system that can be deployed on edge devices with limited resources without degrading performance significantly. We comprehensively evaluate our proposed framework using two publicly available datasets and highlight its effectiveness by comparing the results with the well-known MobileNet model. Our results demonstrate the feasibility and effectiveness of using edge computing for non-speech emotion detection, which can potentially improve applications that rely on emotion detection in communication networks. To the best of our knowledge, this is the first work on an edge-computing-based framework for detecting emotions in non-speech audio, offering promising directions for future research.
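
The key ingredient named in the abstract is knowledge distillation. Below is a generic PyTorch sketch of the standard distillation loss; the temperature, weighting, and toy logits are assumptions for illustration, not the paper's exact configuration.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend the soft-target KL loss (teacher -> student) with the usual cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # rescale to keep gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example: 8 non-speech audio clips, 5 emotion classes
s = torch.randn(8, 5)                 # logits of the small student (edge) model
t = torch.randn(8, 5)                 # logits of the large teacher model
y = torch.randint(0, 5, (8,))
loss = distillation_loss(s, t, y)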

* Under review 

Privacy Enhancement for Cloud-Based Few-Shot Learning

May 10, 2022
Archit Parnami, Muhammad Usama, Liyue Fan, Minwoo Lee

By requiring less data for accurate models, few-shot learning has shown robustness and generality in many application domains. However, deploying few-shot models in untrusted environments may raise privacy concerns, e.g., attacks or adversaries that may breach the privacy of user-supplied data. This paper studies privacy enhancement for few-shot learning in an untrusted environment, e.g., the cloud, by establishing a novel privacy-preserved embedding space that protects the privacy of data while maintaining the accuracy of the model. We examine the impact of various image privacy methods such as blurring, pixelization, Gaussian noise, and differentially private pixelization (DP-Pix) on few-shot image classification and propose a method that learns a privacy-preserved representation through a joint loss. The empirical results show how the privacy-performance trade-off can be negotiated for privacy-enhanced few-shot learning.
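
One of the image privacy methods named above, differentially private pixelization (DP-Pix), can be sketched as pixelization followed by Laplace noise on each cell mean. The cell size, epsilon, and sensitivity parameter m below are illustrative defaults rather than the settings used in the paper.

import numpy as np

def dp_pixelize(img, cell=8, eps=1.0, m=16):
    """Pixelize a grayscale image and add Laplace noise to each cell mean (DP-Pix-style).
    The noise scale follows the usual 255*m / cell^2 sensitivity bound for m differing pixels."""
    h, w = img.shape
    out = np.empty_like(img, dtype=np.float64)
    scale = 255.0 * m / (cell * cell * eps)          # Laplace scale = sensitivity / eps
    for i in range(0, h, cell):
        for j in range(0, w, cell):
            block = img[i:i + cell, j:j + cell]
            noisy_mean = block.mean() + np.random.laplace(0.0, scale)
            out[i:i + cell, j:j + cell] = noisy_mean
    return np.clip(out, 0, 255).astype(np.uint8)

# Example on a random 64x64 "image"
private = dp_pixelize(np.random.randint(0, 256, (64, 64), dtype=np.uint8))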

* 14 pages, 13 figures, 3 tables. Preprint. Accepted in IEEE WCCI 2022 International Joint Conference on Neural Networks (IJCNN) 

Vehicle and License Plate Recognition with Novel Dataset for Toll Collection

Feb 11, 2022
Muhammad Usama, Hafeez Anwar, Muhammad Muaz Shahid, Abbas Anwar, Saeed Anwar, Helmuth Hlavacs

We propose an automatic framework for toll collection consisting of three steps: vehicle type recognition, license plate localization, and license plate reading. Each of these steps becomes non-trivial due to image variations caused by several factors. Traditional vehicle decorations on the front cause variations among vehicles of the same type, and these decorations make license plate localization and recognition difficult due to severe background clutter and partial occlusions. Likewise, on most vehicles, specifically trucks, the position of the license plate is not consistent. Lastly, for license plate reading, variations are induced by non-uniform font styles, sizes, and partially occluded letters and numbers. Our proposed framework takes advantage of both data availability and the performance evaluation of backbone deep learning architectures. We gather a novel dataset, the Diverse Vehicle and License Plates Dataset (DVLPD), consisting of 10k images belonging to six vehicle types. Each image is then manually annotated for vehicle type, license plate, and the plate's characters and digits. For each of the three tasks, we evaluate You Only Look Once (YOLO)v2, YOLOv3, YOLOv4, and Faster R-CNN. For real-time implementation on a Raspberry Pi, we evaluate the lighter versions of YOLO, namely Tiny YOLOv3 and Tiny YOLOv4. The best Mean Average Precision (mAP@0.5) of 98.8% for vehicle type recognition, 98.5% for license plate detection, and 98.3% for license plate reading is achieved by YOLOv4, while its lighter version, Tiny YOLOv4, obtains mAPs of 97.1%, 97.4%, and 93.7% on vehicle type recognition, license plate detection, and license plate reading, respectively. The dataset and training code are available at https://github.com/usama-x930/VT-LPR
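
The three-stage pipeline (vehicle type, then plate localization, then plate reading) could be wired together roughly as in the sketch below. The run_yolo helper and the model arguments are hypothetical stand-ins for whichever detector weights (e.g., YOLOv4 or Tiny YOLOv4) are actually deployed, not code from the released repository.

def run_yolo(model, image):
    """Hypothetical detector wrapper: returns [(label, confidence, (x, y, w, h)), ...]."""
    raise NotImplementedError  # backed by YOLOv4 / Tiny YOLOv4 inference in practice

def toll_pipeline(image, vehicle_model, plate_model, char_model):
    results = []
    # Stage 1: vehicle type recognition on the full frame
    for vehicle_type, _, (x, y, w, h) in run_yolo(vehicle_model, image):
        crop = image[y:y + h, x:x + w]
        # Stage 2: license plate localization inside the vehicle crop
        for _, _, (px, py, pw, ph) in run_yolo(plate_model, crop):
            plate = crop[py:py + ph, px:px + pw]
            # Stage 3: character/digit detection, read left to right
            chars = sorted(run_yolo(char_model, plate), key=lambda d: d[2][0])
            results.append((vehicle_type, "".join(c[0] for c in chars)))
    return results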

Fake Visual Content Detection Using Two-Stream Convolutional Neural Networks

Jan 03, 2021
Bilal Yousaf, Muhammad Usama, Waqas Sultani, Arif Mahmood, Junaid Qadir

Rapid progress in adversarial learning has enabled the generation of realistic-looking fake visual content. To distinguish between fake and real visual content, several detection techniques have been proposed. However, the performance of most of these techniques drops off significantly if the test and training data are sampled from different distributions, which motivates efforts towards improving the generalization of fake detectors. Since current fake content generation techniques do not accurately model the frequency spectrum of natural images, we observe that the frequency spectrum of fake visual data contains discriminative characteristics that can be used to detect fake content. We also observe that the information captured in the frequency spectrum differs from that of the spatial domain. Using these insights, we propose to complement frequency and spatial domain features using a two-stream convolutional neural network architecture called TwoStreamNet. We demonstrate the improved generalization of the proposed two-stream network to several unseen generation architectures, datasets, and techniques. The proposed detector shows significant performance improvement over current state-of-the-art fake content detectors, and fusing the frequency and spatial domain streams also improves the generalization of the detector.
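
A minimal sketch of the two-stream idea follows; the backbone sizes and fusion strategy are illustrative assumptions, not the paper's TwoStreamNet specification. One CNN stream sees the RGB image, a second sees its log-magnitude frequency spectrum, and their features are concatenated before classification.

import torch
import torch.nn as nn

class TwoStreamSketch(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        def stream(in_ch):  # tiny CNN stand-in for a real backbone
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.spatial = stream(3)       # RGB pixels
        self.spectral = stream(3)      # per-channel frequency spectrum
        self.classifier = nn.Linear(64 * 2, num_classes)

    def forward(self, x):
        spec = torch.log1p(torch.abs(torch.fft.fft2(x)))   # log-magnitude spectrum
        feats = torch.cat([self.spatial(x), self.spectral(spec)], dim=1)
        return self.classifier(feats)                      # real vs. fake logits

logits = TwoStreamSketch()(torch.randn(4, 3, 128, 128))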

Intelligent Resource Allocation in Dense LoRa Networks using Deep Reinforcement Learning

Dec 22, 2020
Inaam Ilahi, Muhammad Usama, Muhammad Omer Farooq, Muhammad Umar Janjua, Junaid Qadir

The anticipated increase in the number of IoT devices in the coming years motivates the development of efficient algorithms that can help manage them effectively while keeping power consumption low. In this paper, we propose LoRaDRL, a deep-reinforcement-learning-based resource allocation scheme for dense LoRa networks, together with a multi-channel scheme, and provide a detailed performance evaluation. We perform extensive experiments, and our results demonstrate that the proposed algorithm not only significantly improves the long-range wide area network (LoRaWAN) packet delivery ratio (PDR) but also supports mobile end-devices (EDs) while ensuring lower power consumption. Whereas most previous works focus on proposing different MAC protocols for improving network capacity, we show that through the use of LoRaDRL we can achieve the same efficiency with simple ALOHA while moving the complexity from the EDs to the gateway, thus making the EDs simpler and cheaper. Furthermore, we test the performance of LoRaDRL under large-scale frequency jamming attacks and show its adaptiveness to changes in the environment. We show that LoRaDRL improves the performance over state-of-the-art techniques, resulting in some cases in an improvement of more than 500% in terms of PDR compared to learning-based techniques.
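
A hedged sketch of the reinforcement-learning component is shown below; the state and action definitions and the network sizes are assumptions for illustration, and the paper's LoRaDRL agent and simulator are not reproduced here. A DQN-style value network maps a network-state observation to a choice of channel and spreading factor for the next transmission.

import torch
import torch.nn as nn

N_CHANNELS, N_SF = 8, 6            # assumed action space: channel x spreading factor
STATE_DIM = 16                     # assumed observation size (e.g., per-channel load stats)

q_net = nn.Sequential(             # DQN-style value network
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, N_CHANNELS * N_SF),
)

def select_action(state, epsilon=0.1):
    """Epsilon-greedy choice of (channel, spreading factor) for the next transmission."""
    if torch.rand(()) < epsilon:
        idx = torch.randint(N_CHANNELS * N_SF, ()).item()
    else:
        idx = q_net(state).argmax().item()
    return divmod(idx, N_SF)       # (channel index, spreading-factor index)

channel, sf = select_action(torch.randn(STATE_DIM))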

* 11 pages 