Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yu Liu

Peking University

A 10.60 $μ$W 150 GOPS Mixed-Bit-Width Sparse CNN Accelerator for Life-Threatening Ventricular Arrhythmia Detection

Oct 22, 2024

Yifan Qin, Zhenge Jia, Zheyu Yan, Jay Mok, Manto Yung, Yu Liu, Xuejiao Liu, Wujie Wen, Luhong Liang, Kwang-Ting Tim Cheng(+2 more)

Figure 1 for A 10.60 $μ$W 150 GOPS Mixed-Bit-Width Sparse CNN Accelerator for Life-Threatening Ventricular Arrhythmia Detection

Figure 2 for A 10.60 $μ$W 150 GOPS Mixed-Bit-Width Sparse CNN Accelerator for Life-Threatening Ventricular Arrhythmia Detection

Figure 3 for A 10.60 $μ$W 150 GOPS Mixed-Bit-Width Sparse CNN Accelerator for Life-Threatening Ventricular Arrhythmia Detection

Figure 4 for A 10.60 $μ$W 150 GOPS Mixed-Bit-Width Sparse CNN Accelerator for Life-Threatening Ventricular Arrhythmia Detection

Abstract:This paper proposes an ultra-low power, mixed-bit-width sparse convolutional neural network (CNN) accelerator to accelerate ventricular arrhythmia (VA) detection. The chip achieves 50% sparsity in a quantized 1D CNN using a sparse processing element (SPE) architecture. Measurement on the prototype chip TSMC 40nm CMOS low-power (LP) process for the VA classification task demonstrates that it consumes 10.60 $\mu$W of power while achieving a performance of 150 GOPS and a diagnostic accuracy of 99.95%. The computation power density is only 0.57 $\mu$W/mm$^2$, which is 14.23X smaller than state-of-the-art works, making it highly suitable for implantable and wearable medical devices.

* 2 pages, accepted to The 30th Asia and South Pacific Design Automation Conference (ASP-DAC 2025)

Via

Access Paper or Ask Questions

Group Diffusion Transformers are Unsupervised Multitask Learners

Oct 19, 2024

Lianghua Huang, Wei Wang, Zhi-Fan Wu, Huanzhang Dou, Yupeng Shi, Yutong Feng, Chen Liang, Yu Liu, Jingren Zhou

Figure 1 for Group Diffusion Transformers are Unsupervised Multitask Learners

Abstract:While large language models (LLMs) have revolutionized natural language processing with their task-agnostic capabilities, visual generation tasks such as image translation, style transfer, and character customization still rely heavily on supervised, task-specific datasets. In this work, we introduce Group Diffusion Transformers (GDTs), a novel framework that unifies diverse visual generation tasks by redefining them as a group generation problem. In this approach, a set of related images is generated simultaneously, optionally conditioned on a subset of the group. GDTs build upon diffusion transformers with minimal architectural modifications by concatenating self-attention tokens across images. This allows the model to implicitly capture cross-image relationships (e.g., identities, styles, layouts, surroundings, and color schemes) through caption-based correlations. Our design enables scalable, unsupervised, and task-agnostic pretraining using extensive collections of image groups sourced from multimodal internet articles, image galleries, and video frames. We evaluate GDTs on a comprehensive benchmark featuring over 200 instructions across 30 distinct visual generation tasks, including picture book creation, font design, style transfer, sketching, colorization, drawing sequence generation, and character customization. Our models achieve competitive zero-shot performance without any additional fine-tuning or gradient updates. Furthermore, ablation studies confirm the effectiveness of key components such as data scaling, group size, and model design. These results demonstrate the potential of GDTs as scalable, general-purpose visual generation systems.

Via

Access Paper or Ask Questions

LoD-Loc: Aerial Visual Localization using LoD 3D Map with Neural Wireframe Alignment

Oct 16, 2024

Juelin Zhu, Shen Yan, Long Wang, Shengyue Zhang, Yu Liu, Maojun Zhang

Abstract:We propose a new method named LoD-Loc for visual localization in the air. Unlike existing localization algorithms, LoD-Loc does not rely on complex 3D representations and can estimate the pose of an Unmanned Aerial Vehicle (UAV) using a Level-of-Detail (LoD) 3D map. LoD-Loc mainly achieves this goal by aligning the wireframe derived from the LoD projected model with that predicted by the neural network. Specifically, given a coarse pose provided by the UAV sensor, LoD-Loc hierarchically builds a cost volume for uniformly sampled pose hypotheses to describe pose probability distribution and select a pose with maximum probability. Each cost within this volume measures the degree of line alignment between projected and predicted wireframes. LoD-Loc also devises a 6-DoF pose optimization algorithm to refine the previous result with a differentiable Gaussian-Newton method. As no public dataset exists for the studied problem, we collect two datasets with map levels of LoD3.0 and LoD2.0, along with real RGB queries and ground-truth pose annotations. We benchmark our method and demonstrate that LoD-Loc achieves excellent performance, even surpassing current state-of-the-art methods that use textured 3D models for localization. The code and dataset are available at https://victorzoo.github.io/LoD-Loc.github.io/.

* Accepted by NeurIPS 2024; for Project page, see https://victorzoo.github.io/LoD-Loc.github.io/

Via

Access Paper or Ask Questions

Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

Oct 15, 2024

Lijie Tao, Haokui Zhang, Haizhao Jing, Yu Liu, Kelu Yao, Chao Li, Xizhe Xue

Figure 1 for Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

Figure 2 for Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

Figure 3 for Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

Figure 4 for Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

Abstract:Recently, the remarkable success of ChatGPT has sparked a renewed wave of interest in artificial intelligence (AI), and the advancements in visual language models (VLMs) have pushed this enthusiasm to new heights. Differring from previous AI approaches that generally formulated different tasks as discriminative models, VLMs frame tasks as generative models and align language with visual information, enabling the handling of more challenging problems. The remote sensing (RS) field, a highly practical domain, has also embraced this new trend and introduced several VLM-based RS methods that have demonstrated promising performance and enormous potential. In this paper, we first review the fundamental theories related to VLM, then summarize the datasets constructed for VLMs in remote sensing and the various tasks they addressed. Finally, we categorize the improvement methods into three main parts according to the core components of VLMs and provide a detailed introduction and comparison of these methods.

Via

Access Paper or Ask Questions

SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction

Oct 11, 2024

Yang Zhou, Hao Shao, Letian Wang, Steven L. Waslander, Hongsheng Li, Yu Liu

Figure 1 for SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction

Figure 2 for SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction

Figure 3 for SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction

Figure 4 for SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction

Abstract:Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. However, the scarcity of large-scale driving datasets has hindered the development of robust and generalizable motion prediction models, limiting their ability to capture complex interactions and road geometries. Inspired by recent advances in natural language processing (NLP) and computer vision (CV), self-supervised learning (SSL) has gained significant attention in the motion prediction community for learning rich and transferable scene representations. Nonetheless, existing pre-training methods for motion prediction have largely focused on specific model architectures and single dataset, limiting their scalability and generalizability. To address these challenges, we propose SmartPretrain, a general and scalable SSL framework for motion prediction that is both model-agnostic and dataset-agnostic. Our approach integrates contrastive and reconstructive SSL, leveraging the strengths of both generative and discriminative paradigms to effectively represent spatiotemporal evolution and interactions without imposing architectural constraints. Additionally, SmartPretrain employs a dataset-agnostic scenario sampling strategy that integrates multiple datasets, enhancing data volume, diversity, and robustness. Extensive experiments on multiple datasets demonstrate that SmartPretrain consistently improves the performance of state-of-the-art prediction models across datasets, data splits and main metrics. For instance, SmartPretrain significantly reduces the MissRate of Forecast-MAE by 10.6%. These results highlight SmartPretrain's effectiveness as a unified, scalable solution for motion prediction, breaking free from the limitations of the small-data regime. Codes are available at https://github.com/youngzhou1999/SmartPretrain

* 11 pages, 5 figures

Via

Access Paper or Ask Questions

GLRT-Based Metric Learning for Remote Sensing Object Retrieval

Oct 08, 2024

Linping Zhang, Yu Liu, Xueqian Wang, Gang Li, You He

Figure 1 for GLRT-Based Metric Learning for Remote Sensing Object Retrieval

Figure 2 for GLRT-Based Metric Learning for Remote Sensing Object Retrieval

Figure 3 for GLRT-Based Metric Learning for Remote Sensing Object Retrieval

Figure 4 for GLRT-Based Metric Learning for Remote Sensing Object Retrieval

Abstract:With the improvement in the quantity and quality of remote sensing images, content-based remote sensing object retrieval (CBRSOR) has become an increasingly important topic. However, existing CBRSOR methods neglect the utilization of global statistical information during both training and test stages, which leads to the overfitting of neural networks to simple sample pairs of samples during training and suboptimal metric performance. Inspired by the Neyman-Pearson theorem, we propose a generalized likelihood ratio test-based metric learning (GLRTML) approach, which can estimate the relative difficulty of sample pairs by incorporating global data distribution information during training and test phases. This guides the network to focus more on difficult samples during the training process, thereby encourages the network to learn more discriminative feature embeddings. In addition, GLRT is a more effective than traditional metric space due to the utilization of global data distribution information. Accurately estimating the distribution of embeddings is critical for GLRTML. However, in real-world applications, there is often a distribution shift between the training and target domains, which diminishes the effectiveness of directly using the distribution estimated on training data. To address this issue, we propose the clustering pseudo-labels-based fast parameter adaptation (CPLFPA) method. CPLFPA efficiently estimates the distribution of embeddings in the target domain by clustering target domain instances and re-estimating the distribution parameters for GLRTML. We reorganize datasets for CBRSOR tasks based on fine-grained ship remote sensing image slices (FGSRSI-23) and military aircraft recognition (MAR20) datasets. Extensive experiments on these datasets demonstrate the effectiveness of our proposed GLRTML and CPLFPA.

Via

Access Paper or Ask Questions

Measurements with Noise: Bayesian Optimization for Co-optimizing Noise and Property Discovery in Automated Experiments

Oct 03, 2024

Boris N. Slautin, Yu Liu, Jan Dec, Vladimir V. Shvartsman, Doru C. Lupascu, Maxim Ziatdinov, Sergei V. Kalinin

Abstract:We have developed a Bayesian optimization (BO) workflow that integrates intra-step noise optimization into automated experimental cycles. Traditional BO approaches in automated experiments focus on optimizing experimental trajectories but often overlook the impact of measurement noise on data quality and cost. Our proposed framework simultaneously optimizes both the target property and the associated measurement noise by introducing time as an additional input parameter, thereby balancing the signal-to-noise ratio and experimental duration. Two approaches are explored: a reward-driven noise optimization and a double-optimization acquisition function, both enhancing the efficiency of automated workflows by considering noise and cost within the optimization process. We validate our method through simulations and real-world experiments using Piezoresponse Force Microscopy (PFM), demonstrating the successful optimization of measurement duration and property exploration. Our approach offers a scalable solution for optimizing multiple variables in automated experimental workflows, improving data quality, and reducing resource expenditure in materials science and beyond.

* 22 pages, 9 figures

Via

Access Paper or Ask Questions

Unsupervised Meta-Learning via Dynamic Head and Heterogeneous Task Construction for Few-Shot Classification

Oct 03, 2024

Yunchuan Guan, Yu Liu, Ketong Liu, Ke Zhou, Zhiqi Shen

Figure 1 for Unsupervised Meta-Learning via Dynamic Head and Heterogeneous Task Construction for Few-Shot Classification

Figure 2 for Unsupervised Meta-Learning via Dynamic Head and Heterogeneous Task Construction for Few-Shot Classification

Figure 3 for Unsupervised Meta-Learning via Dynamic Head and Heterogeneous Task Construction for Few-Shot Classification

Figure 4 for Unsupervised Meta-Learning via Dynamic Head and Heterogeneous Task Construction for Few-Shot Classification

Abstract:Meta-learning has been widely used in recent years in areas such as few-shot learning and reinforcement learning. However, the questions of why and when it is better than other algorithms in few-shot classification remain to be explored. In this paper, we perform pre-experiments by adjusting the proportion of label noise and the degree of task heterogeneity in the dataset. We use the metric of Singular Vector Canonical Correlation Analysis to quantify the representation stability of the neural network and thus to compare the behavior of meta-learning and classical learning algorithms. We find that benefiting from the bi-level optimization strategy, the meta-learning algorithm has better robustness to label noise and heterogeneous tasks. Based on the above conclusion, we argue a promising future for meta-learning in the unsupervised area, and thus propose DHM-UHT, a dynamic head meta-learning algorithm with unsupervised heterogeneous task construction. The core idea of DHM-UHT is to use DBSCAN and dynamic head to achieve heterogeneous task construction and meta-learn the whole process of unsupervised heterogeneous task construction. On several unsupervised zero-shot and few-shot datasets, DHM-UHT obtains state-of-the-art performance. The code is released at https://github.com/tuantuange/DHM-UHT.

Via

Access Paper or Ask Questions

Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning

Oct 02, 2024

Jianxiong Li, Zhihao Wang, Jinliang Zheng, Xiaoai Zhou, Guanming Wang, Guanglu Song, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Junzhi Yu(+1 more)

Figure 1 for Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning

Figure 2 for Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning

Figure 3 for Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning

Figure 4 for Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning

Abstract:Multimodal task specification is essential for enhanced robotic performance, where \textit{Cross-modality Alignment} enables the robot to holistically understand complex task instructions. Directly annotating multimodal instructions for model training proves impractical, due to the sparsity of paired multimodal data. In this study, we demonstrate that by leveraging unimodal instructions abundant in real data, we can effectively teach robots to learn multimodal task specifications. First, we endow the robot with strong \textit{Cross-modality Alignment} capabilities, by pretraining a robotic multimodal encoder using extensive out-of-domain data. Then, we employ two Collapse and Corrupt operations to further bridge the remaining modality gap in the learned multimodal representation. This approach projects different modalities of identical task goal as interchangeable representations, thus enabling accurate robotic operations within a well-aligned multimodal latent space. Evaluation across more than 130 tasks and 4000 evaluations on both simulated LIBERO benchmark and real robot platforms showcases the superior capabilities of our proposed framework, demonstrating significant advantage in overcoming data constraints in robotic learning. Website: zh1hao.wang/Robo_MUTUAL

* preprint

Via

Access Paper or Ask Questions

CLSP: High-Fidelity Contrastive Language-State Pre-training for Agent State Representation

Sep 24, 2024

Fuxian Huang, Qi Zhang, Shaopeng Zhai, Jie Wang, Tianyi Zhang, Haoran Zhang, Ming Zhou, Yu Liu, Yu Qiao

Abstract:With the rapid development of artificial intelligence, multimodal learning has become an important research area. For intelligent agents, the state is a crucial modality to convey precise information alongside common modalities like images, videos, and language. This becomes especially clear with the broad adoption of reinforcement learning and multimodal large language models. Nevertheless, the representation of state modality still lags in development. To this end, we propose a High-Fidelity Contrastive Language-State Pre-training (CLSP) method, which can accurately encode state information into general representations for both reinforcement learning and multimodal large language models. Specifically, we first design a pre-training task based on the classification to train an encoder with coarse-grained information. Next, we construct data pairs of states and language descriptions, utilizing the pre-trained encoder to initialize the CLSP encoder. Then, we deploy contrastive learning to train the CLSP encoder to effectively represent precise state information. Additionally, we enhance the representation of numerical information using the Random Fourier Features (RFF) method for high-fidelity mapping. Extensive experiments demonstrate the superior precision and generalization capabilities of our representation, achieving outstanding results in text-state retrieval, reinforcement learning navigation tasks, and multimodal large language model understanding.

Via

Access Paper or Ask Questions