Haoyu Zhang

Learning a Stable Dynamic System with a Lyapunov Energy Function for Demonstratives Using Neural Networks

Sep 16, 2023
Yu Zhang, Yongxiang Zou, Haoyu Zhang, Xiuze Xia, Long Cheng

Autonomous Dynamic System (DS)-based algorithms play a foundational role in Learning from Demonstration (LfD). However, they face the challenge of balancing learning precision against overall system stability. To address this challenge, this paper introduces a novel DS algorithm rooted in neural network technology. The algorithm not only extracts critical insights from demonstration data but also learns a candidate Lyapunov energy function consistent with the provided data. The model employs a straightforward neural network architecture that fulfils a dual objective: optimizing accuracy while preserving global stability. To comprehensively evaluate the effectiveness of the proposed algorithm, rigorous assessments are conducted on the LASA dataset, further reinforced by empirical validation through a robotic experiment.
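The stability idea described above can be illustrated with a minimal NumPy sketch (not the paper's architecture; the quadratic Lyapunov candidate V(x) = x^T P x, the correction rule, and the toy dynamics are all illustrative assumptions): a nominal velocity field is corrected along the gradient of the candidate energy whenever it would violate the decrease condition, which forces V to shrink along trajectories.

```python
import numpy as np

def lyapunov_v(x, P):
    """Candidate Lyapunov energy V(x) = x^T P x (P symmetric positive definite)."""
    return x @ P @ x

def stabilize(f_hat, x, P, alpha=0.5):
    """Correct a nominal velocity f_hat(x) so that V decreases along trajectories.

    If the nominal dynamics would violate grad_V . x_dot <= -alpha * V(x),
    subtract the smallest correction along grad V that restores the condition.
    """
    v = f_hat(x)
    grad = 2.0 * P @ x                       # gradient of x^T P x
    slack = grad @ v + alpha * lyapunov_v(x, P)  # violation of the decrease condition
    if slack > 0 and grad @ grad > 1e-12:
        v = v - slack * grad / (grad @ grad)
    return v

if __name__ == "__main__":
    P = np.eye(2)
    f_hat = lambda x: np.array([-x[1], x[0]])    # pure rotation: never converges alone
    x = np.array([1.0, 0.0])
    for _ in range(200):                         # Euler integration of corrected field
        x = x + 0.05 * stabilize(f_hat, x, P)
    print(np.linalg.norm(x))                     # contracts toward the origin
```

Even though the uncorrected field only rotates, the corrected trajectory spirals into the equilibrium, which is the qualitative guarantee a learned Lyapunov candidate provides.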

A Hierarchical Destroy and Repair Approach for Solving Very Large-Scale Travelling Salesman Problem

Aug 09, 2023
Zhang-Hua Fu, Sipeng Sun, Jintong Ren, Tianshu Yu, Haoyu Zhang, Yuanyuan Liu, Lingxiao Huang, Xiang Yan, Pinyan Lu

For prohibitively large-scale Travelling Salesman Problems (TSPs), existing algorithms face significant challenges in terms of both computational efficiency and solution quality. To address this issue, we propose a hierarchical destroy-and-repair (HDR) approach, which attempts to improve an initial solution by applying a series of carefully designed destroy-and-repair operations. A key innovation is the hierarchical search framework, which recursively fixes partial edges and compresses the input instance into a small-scale TSP under an equivalence guarantee. This neat search framework delivers highly competitive solutions within a reasonable time. Fair comparisons on nineteen well-known large-scale instances (with 10,000 to 10,000,000 cities) show that HDR is highly competitive with existing state-of-the-art TSP algorithms in terms of both efficiency and solution quality. Notably, on two large instances with 3,162,278 and 10,000,000 cities, HDR breaks the world records (i.e., the best-known results regardless of computation time), previously held by LKH and its variants, while being completely independent of LKH. Finally, ablation studies certify the importance and validity of the hierarchical search framework.
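The destroy-and-repair principle (though not HDR's hierarchical framework itself) can be sketched in a few lines; the segment-removal size, cheapest-insertion repair, and improve-only acceptance below are illustrative choices, not the paper's operators:

```python
import math
import random

def tour_length(tour, pts):
    """Total length of a closed tour over 2-D points."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def destroy_and_repair(tour, pts, k=3, rng=random):
    """One move: destroy k consecutive cities, repair by cheapest insertion."""
    n = len(tour)
    start = rng.randrange(n)
    removed = [tour[(start + i) % n] for i in range(k)]
    partial = [c for c in tour if c not in removed]
    for c in removed:
        best_pos, best_cost = 0, float("inf")
        for i in range(len(partial)):          # cheapest place to reinsert c
            a, b = partial[i], partial[(i + 1) % len(partial)]
            delta = (math.dist(pts[a], pts[c]) + math.dist(pts[c], pts[b])
                     - math.dist(pts[a], pts[b]))
            if delta < best_cost:
                best_pos, best_cost = i + 1, delta
        partial.insert(best_pos, c)
    return partial

if __name__ == "__main__":
    rng = random.Random(0)
    pts = [(rng.random(), rng.random()) for _ in range(30)]
    tour = list(range(30))
    best = tour_length(tour, pts)
    for _ in range(300):                       # accept only improving moves
        cand = destroy_and_repair(tour, pts, rng=rng)
        if tour_length(cand, pts) < best:
            tour, best = cand, tour_length(cand, pts)
```

HDR's contribution lies in recursively compressing the instance so such moves stay tractable at millions of cities; this sketch only shows the basic move on a toy instance.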

DiffuseGAE: Controllable and High-fidelity Image Manipulation from Disentangled Representation

Jul 12, 2023
Yipeng Leng, Qiangjuan Huang, Zhiyuan Wang, Yangyang Liu, Haoyu Zhang

Diffusion probabilistic models (DPMs) have shown remarkable results on various image synthesis tasks such as text-to-image generation and image inpainting. However, compared to other generative methods like VAEs and GANs, DPMs lack a low-dimensional, interpretable, and well-decoupled latent code. Recently, diffusion autoencoders (Diff-AE) were proposed to explore the potential of DPMs for representation learning via autoencoding. Diff-AE provides an accessible latent space with remarkable interpretability, allowing image attributes to be manipulated through latent codes from that space. However, previous works are not generic, as they operate on only a few limited attributes. To further explore the latent space of Diff-AE and achieve a generic editing pipeline, we propose a module called Group-supervised AutoEncoder (dubbed GAE) for Diff-AE to achieve better disentanglement of the latent code. The proposed GAE is trained via an attribute-swap strategy to acquire latent codes for example-based multi-attribute image manipulation. We empirically demonstrate that our method enables multi-attribute manipulation and achieves convincing sample quality and attribute alignment, while significantly reducing computational requirements compared to pixel-based approaches for representational decoupling. Code will be released soon.
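The attribute-swap idea can be sketched in a few lines, assuming (hypothetically) that a disentangled latent code is partitioned into contiguous attribute groups; the group layout and names below are illustrative, not Diff-AE's actual code structure:

```python
import numpy as np

def swap_attribute(z_a, z_b, groups, attr):
    """Swap one attribute group between two disentangled latent codes.

    groups maps an attribute name to its (start, end) slice in the code.
    Returns modified copies; the inputs are left untouched.
    """
    lo, hi = groups[attr]
    z_a2, z_b2 = z_a.copy(), z_b.copy()
    z_a2[lo:hi], z_b2[lo:hi] = z_b[lo:hi], z_a[lo:hi]
    return z_a2, z_b2

if __name__ == "__main__":
    groups = {"smile": (0, 2), "age": (2, 4)}       # hypothetical layout
    z_a = np.array([1.0, 1.0, 1.0, 1.0])
    z_b = np.array([2.0, 2.0, 2.0, 2.0])
    z_a2, z_b2 = swap_attribute(z_a, z_b, groups, "smile")
```

During training, decoding such swapped codes and supervising the result by group membership is what pressures each group to capture exactly one attribute.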

GenerTTS: Pronunciation Disentanglement for Timbre and Style Generalization in Cross-Lingual Text-to-Speech

Jun 27, 2023
Yahuan Cong, Haoyu Zhang, Haopeng Lin, Shichao Liu, Chunfeng Wang, Yi Ren, Xiang Yin, Zejun Ma

Cross-lingual timbre- and style-generalizable text-to-speech (TTS) aims to synthesize speech with a specific reference timbre or style that was never seen during training in the target language. It faces the following challenges: 1) timbre and pronunciation are correlated, since multilingual speech from a single speaker is usually hard to obtain; 2) style and pronunciation are mixed, because speech style contains both language-agnostic and language-specific parts. To address these challenges, we propose GenerTTS, which mainly consists of the following designs: 1) an elaborately designed HuBERT-based information bottleneck to disentangle timbre from pronunciation and style; 2) minimization of the mutual information between style and language to discard language-specific information from the style embedding. Experiments indicate that GenerTTS outperforms baseline systems in terms of style similarity and pronunciation accuracy, and enables cross-lingual timbre and style generalization.
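The information-bottleneck intuition (not the actual HuBERT pipeline) can be sketched with a toy vector quantizer: replacing each frame feature with its nearest codebook entry discards fine-grained detail such as timbre while retaining coarse content, which is the same intuition behind using discrete HuBERT-style units to separate pronunciation from timbre. The codebook and features below are illustrative assumptions:

```python
import numpy as np

def vq_bottleneck(features, codebook):
    """Quantize frame features to nearest codebook entries (bottleneck sketch).

    features: (T, D) array of frame vectors; codebook: (K, D) array of codes.
    Returns the quantized frames and their code indices.
    """
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(1)            # nearest code per frame
    return codebook[idx], idx
```

Anything not representable by the K codes (small within-cluster variation, e.g. speaker detail) is destroyed by the quantization, which is precisely why a discrete bottleneck disentangles.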

* Accepted by INTERSPEECH 2023 
Learning Variable Impedance Skills from Demonstrations with Passivity Guarantee

Jun 20, 2023
Yu Zhang, Long Cheng, Xiuze Xia, Haoyu Zhang

Robots are increasingly being deployed not only in workplaces but also in households. Effective execution of manipulation tasks by robots relies on variable impedance control with contact forces. Furthermore, robots should possess adaptive capabilities to handle the considerable variation across robotic tasks in dynamic environments, which can be obtained through human demonstrations. This paper presents a learning-from-demonstration framework that integrates force sensing and motion information to facilitate variable impedance control. The proposed approach estimates full stiffness matrices from human demonstrations, which are then combined with sensed forces and motion information to build a model using a non-parametric method. This model allows the robot to replicate the demonstrated task while also responding appropriately to new task conditions through a state-dependent stiffness profile. Additionally, a novel tank-based variable impedance control approach is proposed to ensure passivity when using the learned stiffness. The proposed approach was evaluated on two virtual variable-stiffness systems. The first evaluation demonstrates that the stiffness estimation approach is more robust than traditional methods on manual datasets, and the second illustrates that the novel tank-based approach is more easily implementable than traditional variable impedance control approaches.
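The energy-tank idea behind such passivity guarantees can be sketched in one dimension (an illustrative simplification, not the paper's controller): the variable part of the stiffness may inject energy only while a tank holds charge; once the tank would be depleted, the controller falls back to a passive baseline stiffness. Gains and tank limits below are arbitrary:

```python
def simulate_tank_impedance(k_profile, k0=1.0, T0=1.0, T_min=0.05,
                            m=1.0, d=2.0, dt=0.01, steps=500):
    """1-D mass under control u = -(k0 + kv) x - d xdot with an energy tank.

    The variable stiffness part kv(t) = k_profile(t) - k0 draws from a tank:
    the tank level T integrates the power kv injects into the mass, and kv is
    switched off whenever T would drop below T_min (passive fallback to k0).
    Returns the final position and tank level.
    """
    x, xd, T = 1.0, 0.0, T0
    for i in range(steps):
        kv = k_profile(i * dt) - k0
        p = -kv * x * xd              # power the variable term injects into the mass
        if T - p * dt < T_min:        # tank nearly empty: keep only passive k0
            kv, p = 0.0, 0.0
        T -= p * dt
        u = -(k0 + kv) * x - d * xd
        xd += dt * u / m              # semi-implicit Euler step
        x += dt * xd
    return x, T
```

Because the variable term can never extract more energy than the tank stores, the closed loop remains passive regardless of how the learned stiffness profile varies.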

Noise-Resistant Multimodal Transformer for Emotion Recognition

May 04, 2023
Yuanyuan Liu, Haoyu Zhang, Yibing Zhan, Zijing Chen, Guanghao Yin, Lin Wei, Zhe Chen

Multimodal emotion recognition identifies human emotions from various data modalities like video, text, and audio. However, we found that this task can be easily affected by noisy information that does not contain useful semantics. To this end, we present a novel paradigm that attempts to extract noise-resistant features in its pipeline and introduces a noise-aware learning scheme to effectively improve the robustness of multimodal emotion understanding. Our new pipeline, namely the Noise-Resistant Multimodal Transformer (NORM-TR), mainly introduces a Noise-Resistant Generic Feature (NRGF) extractor and a Transformer for the multimodal emotion recognition task. In particular, we make the NRGF extractor learn a generic and disturbance-insensitive representation so that consistent and meaningful semantics can be obtained. Furthermore, we apply a Transformer to incorporate Multimodal Features (MFs) of multimodal inputs based on their relations to the NRGF. Therefore, the potentially insensitive but useful information in the NRGF can be complemented by MFs that contain more details. To train the NORM-TR properly, our proposed noise-aware learning scheme complements normal emotion recognition losses by enhancing the learning against noises. Our learning scheme explicitly adds noises to either all the modalities or a specific modality at random locations of a multimodal input sequence. We correspondingly introduce two adversarial losses to encourage the NRGF extractor to learn to extract NRGFs invariant to the added noises, thus facilitating the NORM-TR to achieve more favorable multimodal emotion recognition performance. In practice, on several popular multimodal datasets, our NORM-TR achieves state-of-the-art performance and outperforms existing methods by a large margin, which demonstrates that the ability to resist noisy information is important for effective emotion recognition.
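The noise-injection step of the learning scheme, adding noise at random locations of either all modalities or one target modality, can be sketched as a standalone augmentation function (shapes, noise type, and parameters are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def inject_noise(modalities, rng, frac=0.3, sigma=1.0, target=None):
    """Add Gaussian noise at random temporal locations of a multimodal input.

    modalities: dict of name -> (T, D) array.
    target: a single modality name to corrupt, or None to corrupt all.
    Returns a corrupted copy; the originals are left untouched.
    """
    out = {}
    for name, seq in modalities.items():
        seq = np.array(seq, dtype=float, copy=True)
        if target is None or name == target:
            T = seq.shape[0]
            n = max(1, int(frac * T))
            idx = rng.choice(T, size=n, replace=False)   # random locations
            seq[idx] += rng.normal(0.0, sigma, size=seq[idx].shape)
        out[name] = seq
    return out
```

Training the feature extractor to produce the same NRGF for the clean and corrupted versions (via adversarial losses, per the abstract) is what enforces noise invariance.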

LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion

Mar 02, 2023
Chunfeng Wang, Peisong Huang, Yuxiang Zou, Haoyu Zhang, Shichao Liu, Xiang Yin, Zejun Ma

As a key component of automatic speech recognition (ASR) and of the text-to-speech (TTS) front-end, grapheme-to-phoneme (G2P) conversion maps letters to their corresponding pronunciations. Existing methods are either slow or poor in performance, and are limited in application scenarios, particularly for on-device inference. In this paper, we integrate the advantages of both expert knowledge and a connectionist temporal classification (CTC)-based neural network, and propose a novel method named LiteG2P that is fast, light, and theoretically parallel. Thanks to its careful design, LiteG2P can be applied both in the cloud and on device. Experimental results on the CMU dataset show that the proposed method outperforms the state-of-the-art CTC-based method with 10 times fewer parameters, and is even comparable to the state-of-the-art Transformer-based sequence-to-sequence model with fewer parameters and 33 times less computation.
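At inference time, a CTC-trained model like the one described is typically decoded greedily: take the argmax label per frame, merge consecutive repeats, and drop blanks. A minimal sketch of this standard CTC best-path decoding (not LiteG2P's exact decoder):

```python
def ctc_greedy_decode(frame_logits, blank=0):
    """Greedy CTC best-path decoding.

    frame_logits: list of per-frame score lists over the phoneme vocabulary.
    Takes the argmax label per frame, collapses consecutive repeats, and
    drops the blank symbol. Returns the decoded label sequence.
    """
    path = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    out, prev = [], None
    for lab in path:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out
```

Because each frame is decoded independently before collapsing, this step is trivially parallel across frames, which matches the abstract's emphasis on parallel, on-device-friendly inference.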

* Accepted by ICASSP2023 
Coexistence of Pulsed Radar and Communications: Interference Suppression and Multi-path Combining

Sep 06, 2022
Haoyu Zhang, Li Chen, Yunfei Chen, Huarui Yin, Guo Wei

The focus of this study is on the spectrum sharing between multiple-input multiple-output (MIMO) communications and co-located pulsed MIMO radar systems in multi-path environments. The major challenge is to suppress the mutual interference between the two systems while combining the useful multi-path components received at each system. We tackle this challenge by jointly designing the communication precoder, radar transmit waveform and receive filter. Specifically, the signal-to-interference-plus-noise ratio (SINR) at the radar receiver is maximized subject to constraints on the radar waveform, communication rate and transmit power. The multi-path propagation complicates the expressions of the radar SINR and communication rate, leading to a non-convex problem. To solve it, a sub-optimal algorithm based on the alternating maximization is used to optimize the precoder, radar transmit waveform and receive filter iteratively. The radar receive filter can be updated by a closed-form solution. The communication precoder and radar transmit waveform can be obtained by the successive convex approximation and alternating direction method of multipliers. Simulation results are provided to demonstrate the effectiveness of the proposed design.
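The closed-form receive-filter update mentioned above is, in its standard max-SINR form, w proportional to R^{-1} h, where R is the interference-plus-noise covariance and h the target signal signature; the sketch below shows this generic solution, not the paper's full alternating algorithm with its waveform and precoder constraints:

```python
import numpy as np

def max_sinr_filter(R, h):
    """Closed-form receive filter maximizing SINR = |w^H h|^2 / (w^H R w).

    R: interference-plus-noise covariance (Hermitian positive definite),
    h: target signal signature. The maximizer is w = R^{-1} h, up to scale;
    the returned filter is normalized to unit norm.
    """
    w = np.linalg.solve(R, h)
    return w / np.linalg.norm(w)

def sinr(w, R, h):
    """SINR achieved by filter w against covariance R and signature h."""
    return abs(np.vdot(w, h)) ** 2 / np.real(np.vdot(w, R @ w))
```

The attained optimum equals h^H R^{-1} h, so in the alternating scheme this step can be done exactly, leaving the harder non-convex precoder and waveform updates to successive convex approximation and ADMM.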

Time flies by: Analyzing the Impact of Face Ageing on the Recognition Performance with Synthetic Data

Aug 17, 2022
Marcel Grimmer, Haoyu Zhang, Raghavendra Ramachandra, Kiran Raja, Christoph Busch

Rapid progress in image synthesis enables the generation of synthetic facial images with high resolution and photorealism. In biometric applications, the main motivation for using synthetic data is to alleviate the shortage of publicly available biometric data while reducing the privacy risks of processing such sensitive information. This work exploits these advantages by simulating human face ageing with recent face age modification algorithms to generate mated samples, thereby studying the impact of ageing on the performance of an open-source biometric recognition system. Further, a real dataset is used to evaluate the effects of short-term ageing, comparing the biometric performance to the synthetic domain. The main findings indicate that short-term ageing in the range of 1-5 years has only minor effects on general recognition performance. However, correct verification of mated faces with long-term age differences beyond 20 years still poses a significant challenge and requires further investigation.

A Survey on Surrogate-assisted Efficient Neural Architecture Search

Jun 03, 2022
Shiqing Liu, Haoyu Zhang, Yaochu Jin

Neural architecture search (NAS) has recently become increasingly popular in the deep learning community, mainly because it allows users without rich expertise to benefit from the success of deep neural networks (DNNs). However, NAS remains laborious and time-consuming: a large number of performance estimations are required during the search, and training DNNs is computationally intensive. Improving the efficiency of NAS is therefore essential. This paper begins with a brief introduction to the general framework of NAS. Methods for evaluating network candidates under proxy metrics are then systematically discussed. This is followed by a description of surrogate-assisted NAS, divided into three categories: Bayesian optimization for NAS, surrogate-assisted evolutionary algorithms for NAS, and multi-objective optimization (MOP) for NAS. Finally, remaining challenges and open research questions are discussed, and promising research topics are suggested for this emerging field.
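The surrogate-assisted loop common to these categories can be sketched generically (a toy 1-nearest-neighbour surrogate stands in for the Bayesian or evolutionary machinery; all function names are illustrative): train a cheap surrogate on a small archive of truly evaluated architectures, let it rank a large candidate pool, and spend the expensive evaluation only on the predicted best.

```python
import random

def surrogate_nas(evaluate, sample_arch, encode, n_init=8, n_iter=10, pool=50,
                  rng=random):
    """Minimal surrogate-assisted search loop.

    evaluate(arch): expensive true score; sample_arch(rng): random candidate;
    encode(arch): numeric feature vector for the surrogate.
    Each round, a 1-NN surrogate over the archive ranks a cheap candidate
    pool, and only the predicted-best candidate is truly evaluated.
    """
    archive = [(a, evaluate(a)) for a in (sample_arch(rng) for _ in range(n_init))]

    def predict(arch):                       # 1-NN surrogate over the archive
        x = encode(arch)
        return min(archive, key=lambda t: sum((u - v) ** 2
                   for u, v in zip(encode(t[0]), x)))[1]

    for _ in range(n_iter):
        cands = [sample_arch(rng) for _ in range(pool)]
        best_cand = max(cands, key=predict)            # cheap surrogate ranking
        archive.append((best_cand, evaluate(best_cand)))  # one true evaluation
    return max(archive, key=lambda t: t[1])
```

The budget is n_init + n_iter true evaluations regardless of pool size, which is the efficiency argument the survey makes; real systems replace the 1-NN predictor with Gaussian processes, neural predictors, or ranking models.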

* 17 pages, 7 figures 