Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hongyuan Zhu

Self-Supervised Learning for WiFi CSI-Based Human Activity Recognition: A Systematic Study

Jul 19, 2023

Ke Xu, Jiangtao Wang, Hongyuan Zhu, Dingchang Zheng

Abstract:Recently, with the advancement of the Internet of Things (IoT), WiFi CSI-based HAR has gained increasing attention from academic and industry communities. By integrating the deep learning technology with CSI-based HAR, researchers achieve state-of-the-art performance without the need of expert knowledge. However, the scarcity of labeled CSI data remains the most prominent challenge when applying deep learning models in the context of CSI-based HAR due to the privacy and incomprehensibility of CSI-based HAR data. On the other hand, SSL has emerged as a promising approach for learning meaningful representations from data without heavy reliance on labeled examples. Therefore, considerable efforts have been made to address the challenge of insufficient data in deep learning by leveraging SSL algorithms. In this paper, we undertake a comprehensive inventory and analysis of the potential held by different categories of SSL algorithms, including those that have been previously studied and those that have not yet been explored, within the field. We provide an in-depth investigation of SSL algorithms in the context of WiFi CSI-based HAR. We evaluate four categories of SSL algorithms using three publicly available CSI HAR datasets, each encompassing different tasks and environmental settings. To ensure relevance to real-world applications, we design performance metrics that align with specific requirements. Furthermore, our experimental findings uncover several limitations and blind spots in existing work, highlighting the barriers that need to be addressed before SSL can be effectively deployed in real-world WiFi-based HAR applications. Our results also serve as a practical guideline for industry practitioners and provide valuable insights for future research endeavors in this field.

Via

Access Paper or Ask Questions

An Overview of Challenges in Egocentric Text-Video Retrieval

Jun 07, 2023

Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

Abstract:Text-video retrieval contains various challenges, including biases coming from diverse sources. We highlight some of them supported by illustrations to open a discussion. Besides, we address one of the biases, frame length bias, with a simple method which brings a very incremental but promising increase. We conclude with future directions.

* 4 pages, CVPR 2023 Joint Ego4D&EPIC Workshop, Extended Abstract

Via

Access Paper or Ask Questions

Multi-view Vision-Prompt Fusion Network: Can 2D Pre-trained Model Boost 3D Point Cloud Data-scarce Learning?

Apr 20, 2023

Haoyang Peng, Baopu Li, Bo Zhang, Xin Chen, Tao Chen, Hongyuan Zhu

Abstract:Point cloud based 3D deep model has wide applications in many applications such as autonomous driving, house robot, and so on. Inspired by the recent prompt learning in natural language processing, this work proposes a novel Multi-view Vision-Prompt Fusion Network (MvNet) for few-shot 3D point cloud classification. MvNet investigates the possibility of leveraging the off-the-shelf 2D pre-trained models to achieve the few-shot classification, which can alleviate the over-dependence issue of the existing baseline models towards the large-scale annotated 3D point cloud data. Specifically, MvNet first encodes a 3D point cloud into multi-view image features for a number of different views. Then, a novel multi-view prompt fusion module is developed to effectively fuse information from different views to bridge the gap between 3D point cloud data and 2D pre-trained models. A set of 2D image prompts can then be derived to better describe the suitable prior knowledge for a large-scale pre-trained image model for few-shot 3D point cloud classification. Extensive experiments on ModelNet, ScanObjectNN, and ShapeNet datasets demonstrate that MvNet achieves new state-of-the-art performance for 3D few-shot point cloud image classification. The source code of this work will be available soon.

* 10 pages,5 figures

Via

Access Paper or Ask Questions

A Closer Look at Few-Shot 3D Point Cloud Classification

Mar 31, 2023

Chuangguan Ye, Hongyuan Zhu, Bo Zhang, Tao Chen

Abstract:In recent years, research on few-shot learning (FSL) has been fast-growing in the 2D image domain due to the less requirement for labeled training data and greater generalization for novel classes. However, its application in 3D point cloud data is relatively under-explored. Not only need to distinguish unseen classes as in the 2D domain, 3D FSL is more challenging in terms of irregular structures, subtle inter-class differences, and high intra-class variances {when trained on a low number of data.} Moreover, different architectures and learning algorithms make it difficult to study the effectiveness of existing 2D FSL algorithms when migrating to the 3D domain. In this work, for the first time, we perform systematic and extensive investigations of directly applying recent 2D FSL works to 3D point cloud related backbone networks and thus suggest a strong learning baseline for few-shot 3D point cloud classification. Furthermore, we propose a new network, Point-cloud Correlation Interaction (PCIA), with three novel plug-and-play components called Salient-Part Fusion (SPF) module, Self-Channel Interaction Plus (SCI+) module, and Cross-Instance Fusion Plus (CIF+) module to obtain more representative embeddings and improve the feature distinction. These modules can be inserted into most FSL algorithms with minor changes and significantly improve the performance. Experimental results on three benchmark datasets, ModelNet40-FS, ShapeNet70-FS, and ScanObjectNN-FS, demonstrate that our method achieves state-of-the-art performance for the 3D FSL task. Code and datasets are available at https://github.com/cgye96/A_Closer_Look_At_3DFSL.

* Accepted by IJCV 2023

Via

Access Paper or Ask Questions

What Makes for Effective Few-shot Point Cloud Classification?

Mar 31, 2023

Chuangguan Ye, Hongyuan Zhu, Yongbin Liao, Yanggang Zhang, Tao Chen, Jiayuan Fan

Figure 1 for What Makes for Effective Few-shot Point Cloud Classification?

Figure 2 for What Makes for Effective Few-shot Point Cloud Classification?

Figure 3 for What Makes for Effective Few-shot Point Cloud Classification?

Figure 4 for What Makes for Effective Few-shot Point Cloud Classification?

Abstract:Due to the emergence of powerful computing resources and large-scale annotated datasets, deep learning has seen wide applications in our daily life. However, most current methods require extensive data collection and retraining when dealing with novel classes never seen before. On the other hand, we humans can quickly recognize new classes by looking at a few samples, which motivates the recent popularity of few-shot learning (FSL) in machine learning communities. Most current FSL approaches work on 2D image domain, however, its implication in 3D perception is relatively under-explored. Not only needs to recognize the unseen examples as in 2D domain, 3D few-shot learning is more challenging with unordered structures, high intra-class variances, and subtle inter-class differences. Moreover, different architectures and learning algorithms make it difficult to study the effectiveness of existing 2D methods when migrating to the 3D domain. In this work, for the first time, we perform systematic and extensive studies of recent 2D FSL and 3D backbone networks for benchmarking few-shot point cloud classification, and we suggest a strong baseline and learning architectures for 3D FSL. Then, we propose a novel plug-and-play component called Cross-Instance Adaptation (CIA) module, to address the high intra-class variances and subtle inter-class differences issues, which can be easily inserted into current baselines with significant performance improvement. Extensive experiments on two newly introduced benchmark datasets, ModelNet40-FS and ShapeNet70-FS, demonstrate the superiority of our proposed network for 3D FSL.

* Accepted by WACV 2022. arXiv admin note: text overlap with arXiv:2303.18210

Via

Access Paper or Ask Questions

End-to-End 3D Dense Captioning with Vote2Cap-DETR

Jan 06, 2023

Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang YU

Figure 1 for End-to-End 3D Dense Captioning with Vote2Cap-DETR

Figure 2 for End-to-End 3D Dense Captioning with Vote2Cap-DETR

Figure 3 for End-to-End 3D Dense Captioning with Vote2Cap-DETR

Figure 4 for End-to-End 3D Dense Captioning with Vote2Cap-DETR

Abstract:3D dense captioning aims to generate multiple captions localized with their associated object regions. Existing methods follow a sophisticated ``detect-then-describe'' pipeline equipped with numerous hand-crafted components. However, these hand-crafted components would yield suboptimal performance given cluttered object spatial and class distributions among different scenes. In this paper, we propose a simple-yet-effective transformer framework Vote2Cap-DETR based on recent popular \textbf{DE}tection \textbf{TR}ansformer (DETR). Compared with prior arts, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote query driven object decoder, and a caption decoder that produces the dense captions in a set-prediction manner. 2) In contrast to the two-stage scheme, our method can perform detection and captioning in one-stage. 3) Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that our Vote2Cap-DETR surpasses current state-of-the-arts by 11.13\% and 7.11\% in CIDEr@0.5IoU, respectively. Codes will be released soon.

Via

Access Paper or Ask Questions

Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022

Jun 29, 2022

Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

Figure 1 for Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022

Figure 2 for Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022

Abstract:In this report, we present our approach for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022. We first parse sentences into semantic roles corresponding to verbs and nouns; then utilize self-attentions to exploit semantic role contextualized video features along with textual features via triplet losses in multiple embedding spaces. Our method overpasses the strong baseline in normalized Discounted Cumulative Gain (nDCG), which is more valuable for semantic similarity. Our submission is ranked 3rd for nDCG and ranked 4th for mAP.

* Ranked joint 3rd place in the Multi-Instance Retrieval Challenge at EPIC@CVPR2022

Via

Access Paper or Ask Questions

Semantic Role Aware Correlation Transformer for Text to Video Retrieval

Jun 26, 2022

Burak Satar, Hongyuan Zhu, Xavier Bresson, Joo Hwee Lim

Figure 1 for Semantic Role Aware Correlation Transformer for Text to Video Retrieval

Figure 2 for Semantic Role Aware Correlation Transformer for Text to Video Retrieval

Abstract:With the emergence of social media, voluminous video clips are uploaded every day, and retrieving the most relevant visual content with a language query becomes critical. Most approaches aim to learn a joint embedding space for plain textual and visual contents without adequately exploiting their intra-modality structures and inter-modality correlations. This paper proposes a novel transformer that explicitly disentangles the text and video into semantic roles of objects, spatial contexts and temporal contexts with an attention scheme to learn the intra- and inter-role correlations among the three roles to discover discriminative features for matching at different levels. The preliminary results on popular YouCook2 indicate that our approach surpasses a current state-of-the-art method, with a high margin in all metrics. It also overpasses two SOTA methods in terms of two metrics.

* IEEE International Conference on Image Processing (ICIP), 2021, pp. 1334-1338
* Camera-ready for ICIP 2021

Via

Access Paper or Ask Questions

RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval

Jun 26, 2022

Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

Figure 1 for RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval

Figure 2 for RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval

Figure 3 for RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval

Figure 4 for RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval

Abstract:Seas of videos are uploaded daily with the popularity of social channels; thus, retrieving the most related video contents with user textual queries plays a more crucial role. Most methods consider only one joint embedding space between global visual and textual features without considering the local structures of each modality. Some other approaches consider multiple embedding spaces consisting of global and local features separately, ignoring rich inter-modality correlations. We propose a novel mixture-of-expert transformer RoME that disentangles the text and the video into three levels; the roles of spatial contexts, temporal contexts, and object contexts. We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels with mixture-of-experts for considering inter-modalities and structures' correlations. The results indicate that our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets, given the same visual backbone without pre-training. Finally, we conducted extensive ablation studies to elucidate our design choices.

* Preprint, under review in TCSVT Journal

Via

Access Paper or Ask Questions

OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

May 23, 2022

Peng Hu, Xi Peng, Hongyuan Zhu, Mohamed M. Sabry Aly, Jie Lin

Figure 1 for OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

Figure 2 for OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

Figure 3 for OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

Figure 4 for OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization

Abstract:As Deep Neural Networks (DNNs) usually are overparameterized and have millions of weight parameters, it is challenging to deploy these large DNN models on resource-constrained hardware platforms, e.g., smartphones. Numerous network compression methods such as pruning and quantization are proposed to reduce the model size significantly, of which the key is to find suitable compression allocation (e.g., pruning sparsity and quantization codebook) of each layer. Existing solutions obtain the compression allocation in an iterative/manual fashion while finetuning the compressed model, thus suffering from the efficiency issue. Different from the prior art, we propose a novel One-shot Pruning-Quantization (OPQ) in this paper, which analytically solves the compression allocation with pre-trained weight parameters only. During finetuning, the compression module is fixed and only weight parameters are updated. To our knowledge, OPQ is the first work that reveals pre-trained model is sufficient for solving pruning and quantization simultaneously, without any complex iterative/manual optimization at the finetuning stage. Furthermore, we propose a unified channel-wise quantization method that enforces all channels of each layer to share a common codebook, which leads to low bit-rate allocation without introducing extra overhead brought by traditional channel-wise quantization. Comprehensive experiments on ImageNet with AlexNet/MobileNet-V1/ResNet-50 show that our method improves accuracy and training efficiency while obtains significantly higher compression rates compared to the state-of-the-art.

* Accepted in AAAI2021 and Just upload to retrieve Arxiv DOI for Project Record

Via

Access Paper or Ask Questions