Yi Ma

Improved YOLOv5s model for key components detection of power transmission lines

Feb 10, 2025

Chen Chen, Guowu Yuan, Hao Zhou, Yi Ma

Abstract:High-voltage transmission lines are located far from the road, resulting in inconvenient inspection work and rising maintenance costs. Intelligent inspection of power transmission lines has become increasingly important. However, subsequent intelligent inspection relies on accurately detecting various key components. Due to the low detection accuracy of key components in transmission line image inspection, this paper proposes an improved object detection model based on the YOLOv5s (You Only Look Once Version 5 Small) model to improve the detection accuracy of key components of transmission lines. According to the characteristics of the power grid inspection image, we first modify the distance measurement in the k-means clustering to improve the anchor matching of the YOLOv5s model. Then, we add the convolutional block attention module (CBAM) attention mechanism to the backbone network to improve accuracy. Finally, we apply the focal loss function to reduce the impact of class imbalance. Our improved method's mAP (mean average precision) reached 98.1%, the precision reached 97.5%, the recall reached 94.4%, and the detection rate reached 84.8 FPS (frames per second). The experimental results show that our improved model improves detection accuracy and has performance advantages over other models.

* 23 pages, 14 figures
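
The abstract above names focal loss as its remedy for class imbalance. As a rough illustration only, the binary form of focal loss can be sketched as below. The `alpha` and `gamma` defaults are commonly used values assumed here, not settings reported in this paper, and the function names are hypothetical.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive class, in (0, 1)
    y: ground-truth label, 0 or 1
    alpha, gamma: assumed defaults, not taken from the paper
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, eps), 1.0)               # guard against log(0)
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p, y, eps=1e-12):
    """Plain binary cross-entropy, for comparison."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(min(max(p_t, eps), 1.0))

easy = focal_loss(0.95, 1)  # confidently correct positive
hard = focal_loss(0.10, 1)  # badly misclassified positive
```

Relative to cross-entropy, the easy example is down-weighted far more strongly than the hard one, which is how the loss keeps abundant easy samples from dominating training.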

Direct Uplink Connectivity in Space MIMO Systems with THz and FSO Inter-Satellite Links

Feb 02, 2025

Zohre Mashayekh Bakhsh, Yasaman Omid, Gaojie Chen, Farbod Kayhan, Yi Ma, Rahim Tafazolli

Abstract:This paper investigates uplink transmission from a single-antenna mobile phone to a cluster of satellites, emphasizing the role of inter-satellite links (ISLs) in facilitating cooperative signal detection. The study focuses on non-ideal ISLs, examining both terahertz (THz) and free-space optical (FSO) ISLs concerning their ergodic capacity. We present a practical scenario derived from the recent 3GPP standard, specifying the frequency band, bandwidth, user and satellite antenna gains, power levels, and channel characteristics in alignment with the latest 3GPP for non-terrestrial networks (NTN). Additionally, we propose a satellite selection method to identify the optimal satellite as the master node (MN), responsible for signal processing. This method takes into account both the user-satellite link and ISL channels. For the THz ISL analysis, we derive a closed-form approximation for ergodic capacity under two scenarios: one with instantaneous channel state information (CSI) and another with only statistical CSI shared between satellites. For the FSO ISL analysis, we present a closed-form approximate upper bound for ergodic capacity, accounting for the impact of pointing error loss. Furthermore, we evaluate the effects of different ISL frequencies and pointing errors on spectral efficiency. Simulation results demonstrate that multi-satellite multiple-input multiple-output (MIMO) satellite communication (SatCom) significantly outperforms single-satellite SatCom in terms of spectral efficiency. Additionally, our approximated upper bound for ergodic capacity closely aligns with results obtained from Monte Carlo simulations.
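
The abstract checks a closed-form upper bound on ergodic capacity against Monte Carlo simulation. The sketch below shows that kind of check in its simplest form. It assumes a single link with unit-power Rayleigh fading, which is a stand-in and not the THz or FSO channel model of the paper, and it uses the Jensen bound rather than the paper's expression. All function names are hypothetical.

```python
import math
import random

def ergodic_capacity_mc(snr_linear, trials=100_000, seed=0):
    """Monte Carlo estimate of E[log2(1 + snr * |h|^2)] in bit/s/Hz.

    Assumes unit-power Rayleigh fading, so |h|^2 ~ Exponential(1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        gain = rng.expovariate(1.0)  # one channel power-gain sample
        total += math.log2(1.0 + snr_linear * gain)
    return total / trials

def jensen_upper_bound(snr_linear):
    """Jensen: E[log2(1 + snr * g)] <= log2(1 + snr * E[g]), with E[g] = 1."""
    return math.log2(1.0 + snr_linear)

snr = 10 ** (10 / 10)  # 10 dB expressed as a linear ratio
estimate = ergodic_capacity_mc(snr)
bound = jensen_upper_bound(snr)
```

The simulated average stays below the bound because the logarithm is concave; how tight the bound is depends on the fading distribution.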

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Jan 28, 2025

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma

Abstract:Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.

* Website at https://tianzhechu.com/SFTvsRL
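
To make "outcome-based reward" concrete for an arithmetic card game, here is a minimal sketch that scores only the final answer and ignores intermediate reasoning. The abstract does not specify the reward design, so the target value of 24, the 0/1 reward scale, the rule that each card is used exactly once, and the function names are all assumptions for illustration.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _evaluate(node, used):
    """Evaluate a parsed arithmetic expression exactly, recording the integers it uses."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")

def outcome_reward(expression, cards, target=24):
    """Return 1.0 if the expression uses each card exactly once and equals target, else 0.0."""
    try:
        used = []
        value = _evaluate(ast.parse(expression, mode="eval").body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0  # malformed or invalid answers earn no reward
    return 1.0 if Counter(used) == Counter(cards) and value == target else 0.0
```

For example, `outcome_reward("(10 - 4) * (2 + 2)", [2, 2, 4, 10])` gives full reward, while an expression that reaches the target with the wrong cards gives none. Exact rational arithmetic avoids floating-point mismatches in answers that involve division.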

ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification

Jan 14, 2025

Yi Ma, Shuai Wang, Tianchi Liu, Haizhou Li

Abstract:In speaker verification, we use computational methods to verify if an utterance matches the identity of an enrolled speaker. This task is similar to the manual task of forensic voice comparison, where linguistic analysis is combined with auditory measurements to compare and evaluate voice samples. Despite much success, we have yet to develop a speaker verification system that offers explainable results comparable to those from manual forensic voice comparison. A novel approach, Explainable Phonetic Trait-Oriented (ExPO) network, is proposed in this paper to introduce the speaker's phonetic trait which describes the speaker's characteristics at the phonetic level, resembling what forensic comparison does. ExPO not only generates utterance-level speaker embeddings but also allows for fine-grained analysis and visualization of phonetic traits, offering an explainable speaker verification process. Furthermore, we investigate phonetic traits from within-speaker and between-speaker variation perspectives to determine which trait is most effective for speaker verification, marking an important step towards explainable speaker verification. Our code is available at https://github.com/mmmmayi/ExPO.

* Accepted by IEEE Signal Processing Letters

From Simple to Complex Skills: The Case of In-Hand Object Reorientation

Jan 09, 2025

Haozhi Qi, Brent Yi, Mike Lambeta, Yi Ma, Roberto Calandra, Jitendra Malik

Abstract:Learning policies in simulation and transferring them to the real world has become a promising approach in dexterous manipulation. However, bridging the sim-to-real gap for each new task requires substantial human effort, such as careful reward engineering, hyperparameter tuning, and system identification. In this work, we present a system that leverages low-level skills to address these challenges for more complex tasks. Specifically, we introduce a hierarchical policy for in-hand object reorientation based on previously acquired rotation skills. This hierarchical policy learns to select which low-level skill to execute based on feedback from both the environment and the low-level skill policies themselves. Compared to learning from scratch, the hierarchical policy is more robust to out-of-distribution changes and transfers easily from simulation to real-world environments. Additionally, we propose a generalizable object pose estimator that uses proprioceptive information, low-level skill predictions, and control errors as inputs to estimate the object pose over time. We demonstrate that our system can reorient objects, including symmetrical and textureless ones, to a desired pose.

* website: https://dexhier.github.io

Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction

Dec 23, 2024

Ziyang Wu, Tianjiao Ding, Yifu Lu, Druv Pai, Jingyuan Zhang, Weida Wang, Yaodong Yu, Yi Ma, Benjamin D. Haeffele

Abstract:The attention operator is arguably the key distinguishing factor of transformer architectures, which have demonstrated state-of-the-art performance on a variety of tasks. However, transformer attention operators often impose a significant computational burden, with the computational complexity scaling quadratically with the number of tokens. In this work, we propose a novel transformer attention operator whose computational complexity scales linearly with the number of tokens. We derive our network architecture by extending prior work which has shown that a transformer style architecture naturally arises by "white-box" architecture design, where each layer of the network is designed to implement an incremental optimization step of a maximal coding rate reduction objective (MCR$^2$). Specifically, we derive a novel variational form of the MCR$^2$ objective and show that the architecture that results from unrolled gradient descent of this variational objective leads to a new attention module called Token Statistics Self-Attention (TSSA). TSSA has linear computational and memory complexity and radically departs from the typical attention architecture that computes pairwise similarities between tokens. Experiments on vision, language, and long sequence tasks show that simply swapping TSSA for standard self-attention, which we refer to as the Token Statistics Transformer (ToST), achieves competitive performance with conventional transformers while being significantly more computationally efficient and interpretable. Our results also somewhat call into question the conventional wisdom that pairwise similarity style attention mechanisms are critical to the success of transformer architectures. Code will be available at https://github.com/RobinWu218/ToST.

* 24 pages, 11 figures
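
For readers unfamiliar with the MCR$^2$ objective mentioned in the abstract, its usual statement in the earlier rate-reduction literature is given below. The notation is assumed here rather than taken from this paper: $Z \in \mathbb{R}^{d \times n}$ holds $n$ token features of dimension $d$, $\Pi_k$ are diagonal membership matrices for $K$ groups, and $\epsilon$ is a quantization precision. The variational form that the paper derives is not reproduced.

```latex
\begin{aligned}
\Delta R(Z;\Pi) &= R(Z) - R_c(Z;\Pi), \\
R(Z) &= \frac{1}{2}\log\det\!\Big(I + \frac{d}{n\epsilon^{2}}\, Z Z^{\top}\Big), \\
R_c(Z;\Pi) &= \sum_{k=1}^{K} \frac{\operatorname{tr}(\Pi_k)}{2n}
  \log\det\!\Big(I + \frac{d}{\operatorname{tr}(\Pi_k)\,\epsilon^{2}}\, Z \Pi_k Z^{\top}\Big).
\end{aligned}
```

Maximizing $\Delta R$ expands the coding rate of the features as a whole while compressing the rate of each group.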

CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM

Nov 07, 2024

Jingwei Xu, Chenyu Wang, Zibo Zhao, Wen Liu, Yi Ma, Shenghua Gao

Abstract:This paper aims to design a unified Computer-Aided Design (CAD) generation system that can easily generate CAD models based on the user's inputs in the form of textual description, images, point clouds, or even a combination of them. Towards this goal, we introduce CAD-MLLM, the first system capable of generating parametric CAD models conditioned on multimodal input. Specifically, within the CAD-MLLM framework, we leverage the command sequences of CAD models and then employ advanced large language models (LLMs) to align the feature space across these diverse multimodal data and CAD models' vectorized representations. To facilitate the model training, we design a comprehensive data construction and annotation pipeline that equips each CAD model with corresponding multimodal data. Our resulting dataset, named Omni-CAD, is the first multimodal CAD dataset that contains textual description, multi-view images, points, and command sequence for each CAD model. It contains approximately 450K instances and their CAD construction sequences. To thoroughly evaluate the quality of our generated CAD models, we go beyond current evaluation metrics that focus on reconstruction quality by introducing additional metrics that assess topology quality and surface enclosure extent. Extensive experimental results demonstrate that CAD-MLLM significantly outperforms existing conditional generative methods and remains highly robust to noise and missing points. The project page and more visualizations can be found at: https://cad-mllm.github.io/

* Project page: https://cad-mllm.github.io/

Generative Semantic Communications with Foundation Models: Perception-Error Analysis and Semantic-Aware Power Allocation

Nov 07, 2024

Chunmei Xu, Mahdi Boloursaz Mashhadi, Yi Ma, Rahim Tafazolli, Jiangzhou Wang

Abstract:Generative foundation models can revolutionize the design of semantic communication (SemCom) systems, allowing high fidelity exchange of semantic information at ultra low rates. In this work, a generative SemCom framework with pretrained foundation models is proposed, where both uncoded forward-with-error and coded discard-with-error schemes are developed for the semantic decoder. To characterize the impact of transmission reliability on the perceptual quality of the regenerated signal, their mathematical relationship is analyzed from a rate-distortion-perception perspective, which is proved to be non-decreasing. The semantic values are defined to measure the semantic information of multimodal semantic features accordingly. We also investigate semantic-aware power allocation problems aiming at power consumption minimization for ultra low rate and high fidelity SemComs. To solve these problems, two semantic-aware power allocation methods are proposed by leveraging the non-decreasing property of the perception-error relationship. Numerically, perception-error functions and semantic values of semantic data streams under both schemes for image tasks are obtained based on the Kodak dataset. Simulation results show that our proposed semantic-aware method significantly outperforms conventional approaches, particularly in the channel-coded case (up to 90% power saving).

Diffusion-based Generative Multicasting with Intent-aware Semantic Decomposition

Nov 04, 2024

Xinkai Liu, Mahdi Boloursaz Mashhadi, Li Qiao, Yi Ma, Rahim Tafazolli, Mehdi Bennis

Abstract:Generative diffusion models (GDMs) have recently shown great success in synthesizing multimedia signals with high perceptual quality, enabling highly efficient semantic communications in future wireless networks. In this paper, we develop an intent-aware generative semantic multicasting framework utilizing pre-trained diffusion models. In the proposed framework, the transmitter decomposes the source signal into multiple semantic classes based on the multi-user intent, i.e., each user is assumed to be interested in details of only a subset of the semantic classes. The transmitter then sends to each user only its intended classes, and multicasts a highly compressed semantic map to all users over shared wireless resources that allows them to locally synthesize the other classes, i.e., non-intended classes, utilizing pre-trained diffusion models. The signal retrieved at each user is thereby partially reconstructed and partially synthesized utilizing the received semantic map. This improves utilization of the wireless resources while better preserving the privacy of the non-intended classes. We design a communication/computation-aware scheme for per-class adaptation of the communication parameters, such as the transmission power and compression rate, to minimize the total latency of retrieving signals at multiple receivers, tailored to the prevailing channel conditions as well as the users' reconstruction/synthesis distortion/perception requirements. The simulation results demonstrate significantly reduced per-user latency compared with non-generative and intent-unaware multicasting benchmarks while maintaining high perceptual quality of the signals retrieved at the users.

Improving Neuron-level Interpretability with White-box Language Models

Oct 21, 2024

Hao Bai, Yi Ma

Abstract:Neurons in auto-regressive language models like GPT-2 can be interpreted by analyzing their activation patterns. Recent studies have shown that techniques such as dictionary learning, a form of post-hoc sparse coding, enhance this neuron-level interpretability. In our research, we are driven by the goal to fundamentally improve neural network interpretability by embedding sparse coding directly within the model architecture, rather than applying it as an afterthought. In our study, we introduce a white-box transformer-like architecture named Coding RAte TransformEr (CRATE), explicitly engineered to capture sparse, low-dimensional structures within data distributions. Our comprehensive experiments showcase significant improvements (up to 103% relative improvement) in neuron-level interpretability across a variety of evaluation metrics. Detailed investigations confirm that this enhanced interpretability is steady across different layers irrespective of the model size, underlining CRATE's robust performance in enhancing neural network interpretability. Further analysis shows that CRATE's increased interpretability comes from its enhanced ability to consistently and distinctively activate on relevant tokens. These findings point towards a promising direction for creating white-box foundation models that excel in neuron-level interpretation.
