Meng Yang

G2-MonoDepth: A General Framework of Generalized Depth Inference from Monocular RGB+X Data

Oct 24, 2023
Haotian Wang, Meng Yang, Nanning Zheng

Monocular depth inference is a fundamental problem for scene perception of robots. A specific robot may be equipped with a camera plus an optional depth sensor of any type and be located in scenes of various scales, yet recent advances address this setting as multiple individual sub-tasks. This imposes an additional burden of fine-tuning models for specific robots and thereby high-cost customization in large-scale industrialization. This paper investigates a unified task of monocular depth inference, which infers high-quality depth maps from all kinds of raw input data from various robots in unseen scenes. A basic benchmark, G2-MonoDepth, is developed for this task and comprises four components: (a) a unified data representation, RGB+X, to accommodate RGB plus raw depth with diverse scene scales/semantics, depth sparsity ([0%, 100%]), and errors (holes/noises/blurs); (b) a novel unified loss to adapt to the diverse depth sparsity/errors of input raw data and the diverse scales of output scenes; (c) an improved network to effectively propagate diverse scene scales from input to output; and (d) a data augmentation pipeline to simulate all types of real artifacts in raw depth maps for training. G2-MonoDepth is applied to three sub-tasks, namely depth estimation, depth completion with different sparsity, and depth enhancement in unseen scenes, and it consistently outperforms SOTA baselines on both real-world and synthetic data.

* 18 pages, 16 figures 
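
As a rough illustration of the RGB+X representation described above, the sketch below (our own assumption, not the authors' released code) packs RGB, a raw depth channel, and a validity mask into one input tensor, so that depth estimation (0% valid depth), completion (sparse depth), and enhancement (dense but noisy depth) share the same input format:

    import torch

    def pack_rgb_x(rgb, raw_depth=None):
        """rgb: (B, 3, H, W) in [0, 1]; raw_depth: (B, 1, H, W) or None, 0 = missing."""
        b, _, h, w = rgb.shape
        if raw_depth is None:                      # pure depth estimation: X is empty
            raw_depth = torch.zeros(b, 1, h, w, device=rgb.device)
        mask = (raw_depth > 0).float()             # 1 where a raw measurement exists
        return torch.cat([rgb, raw_depth, mask], dim=1)   # (B, 5, H, W) RGB+X tensor

    rgb = torch.rand(2, 3, 240, 320)
    keep = (torch.rand(2, 1, 240, 320) < 0.05).float()    # ~5% sparsity, for illustration
    sparse_depth = torch.rand(2, 1, 240, 320) * keep
    x_completion = pack_rgb_x(rgb, sparse_depth)          # depth completion input
    x_estimation = pack_rgb_x(rgb)                        # depth estimation input

Whether the validity mask is an explicit channel in the real model is not stated in the abstract; it is included here only to make the [0%, 100%] sparsity range concrete.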

Asymmetric Co-Training with Explainable Cell Graph Ensembling for Histopathological Image Classification

Aug 24, 2023
Ziqi Yang, Zhongyu Li, Chen Liu, Xiangde Luo, Xingguang Wang, Dou Xu, Chaoqun Li, Xiaoying Qin, Meng Yang, Long Jin

Convolutional neural networks excel in histopathological image classification, yet their pixel-level focus hampers explainability. Conversely, emerging graph convolutional networks (GCNs) spotlight cell-level features and medical implications. However, limited by their shallowness and suboptimal use of high-dimensional pixel data, GCNs underperform in multi-class histopathological image classification. To make full and dynamic use of pixel-level and cell-level features, we propose an asymmetric co-training framework combining a deep graph convolutional network and a convolutional neural network for multi-class histopathological image classification. To improve the explainability of the entire framework by embedding the morphological and topological distribution of cells, we build a 14-layer deep graph convolutional network to handle cell graph data. To further exploit the dynamic interactions between pixel-level and cell-level information, we also design a co-training strategy to integrate the two asymmetric branches. Notably, we collect a private, clinically acquired dataset termed LUAD7C, which includes seven subtypes of lung adenocarcinoma and is rare and more challenging. We evaluate our approach on the private LUAD7C and public colorectal cancer datasets, showcasing its superior performance, explainability, and generalizability in multi-class histopathological image classification.
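
The abstract does not spell out the co-training objective, so the sketch below shows only one plausible formulation: each branch is trained on its own cross-entropy plus a distillation term toward the other branch's soft predictions. The loss weight and temperature are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def co_training_losses(logits_cnn, logits_gcn, labels, alpha=0.5, T=2.0):
        """One co-training step: each branch also learns from the other's soft output."""
        ce_cnn = F.cross_entropy(logits_cnn, labels)
        ce_gcn = F.cross_entropy(logits_gcn, labels)
        kl_cnn = F.kl_div(F.log_softmax(logits_cnn / T, dim=1),
                          F.softmax(logits_gcn.detach() / T, dim=1),
                          reduction="batchmean") * T * T
        kl_gcn = F.kl_div(F.log_softmax(logits_gcn / T, dim=1),
                          F.softmax(logits_cnn.detach() / T, dim=1),
                          reduction="batchmean") * T * T
        return ce_cnn + alpha * kl_cnn, ce_gcn + alpha * kl_gcn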

FS-Depth: Focal-and-Scale Depth Estimation from a Single Image in Unseen Indoor Scene

Jul 27, 2023
Chengrui Wei, Meng Yang, Lei He, Nanning Zheng

It has long been an ill-posed problem to predict absolute depth maps from single images in real (unseen) indoor scenes. We observe that this is essentially due not only to the scale-ambiguity problem but also to the focal-ambiguity problem, which decreases the generalization ability of monocular depth estimation: images may be captured by cameras of different focal lengths in scenes of different scales. In this paper, we develop a focal-and-scale depth estimation model to learn absolute depth maps from single images in unseen indoor scenes. First, a relative depth estimation network is adopted to learn relative depths from single images with diverse scales/semantics. Second, multi-scale features are generated by mapping a single focal-length value to focal-length features and concatenating them with intermediate features of different scales in relative depth estimation. Finally, relative depths and multi-scale features are jointly fed into an absolute depth estimation network. In addition, a new pipeline is developed to augment the diversity of focal lengths in public datasets, which are often captured with cameras of the same or similar focal lengths. Our model is trained on augmented NYUDv2 and tested on three unseen datasets. It considerably improves the generalization ability of depth estimation, by 41%/13% (RMSE) with/without data augmentation compared with five recent SOTAs, and substantially alleviates the deformation problem in 3D reconstruction. Notably, our model maintains the accuracy of depth estimation on the original NYUDv2.
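
A minimal sketch of the focal-length branch described above, assuming a small MLP maps the scalar focal length to a feature vector that is broadcast spatially and concatenated with each intermediate feature map (the embedding width and fusion points are our assumptions, not the paper's):

    import torch
    import torch.nn as nn

    class FocalEmbedding(nn.Module):
        def __init__(self, dim=32):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(1, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, focal, feature_maps):
            # focal: (B, 1) focal length in pixels; feature_maps: list of (B, C_i, H_i, W_i)
            f = self.mlp(focal)                               # (B, dim)
            fused = []
            for feat in feature_maps:
                b, _, h, w = feat.shape
                f_map = f.view(b, -1, 1, 1).expand(b, f.size(1), h, w)
                fused.append(torch.cat([feat, f_map], dim=1))  # (B, C_i + dim, H_i, W_i)
            return fused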

Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator

May 24, 2023
Ziwei He, Meng Yang, Minwei Feng, Jingcheng Yin, Xinbing Wang, Jingwen Leng, Zhouhan Lin

The transformer model is known to be computationally demanding, and prohibitively costly for long sequences, as the self-attention module has quadratic time and space complexity with respect to sequence length. Many researchers have focused on designing new forms of self-attention or introducing new parameters to overcome this limitation; however, a large portion of these approaches prevent the model from inheriting weights from large pretrained models. In this work, we address the transformer's inefficiency from another perspective. We propose Fourier Transformer, a simple yet effective approach that progressively removes redundancies in the hidden sequence using the ready-made Fast Fourier Transform (FFT) operator to perform the Discrete Cosine Transform (DCT). Fourier Transformer significantly reduces computational cost while retaining the ability to inherit from various large pretrained models. Experiments show that our model achieves state-of-the-art performance among transformer-based models on the long-range modeling benchmark LRA, with significant improvements in both speed and space. For generative seq-to-seq tasks, including CNN/DailyMail and ELI5, our model outperforms the standard BART and other efficient models by inheriting the BART weights. Our code is publicly available at https://github.com/LUMIA-Group/FourierTransformer
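
A minimal sketch of the core operation: a DCT-II computed via the FFT operator (Makhoul's reordering), keeping only the low-frequency coefficients to shorten the hidden sequence. Consult the linked repository for the actual implementation; where in the network this is applied and the keep ratio below are assumptions.

    import torch

    def dct_ii(x):
        """DCT-II along the last dimension, computed with the FFT (Makhoul's method)."""
        n = x.size(-1)
        v = torch.cat([x[..., ::2], x[..., 1::2].flip(-1)], dim=-1)  # even/odd reordering
        V = torch.fft.fft(v, dim=-1)
        k = torch.arange(n, device=x.device)
        phase = torch.exp(-1j * torch.pi * k / (2 * n))
        return 2 * (V * phase).real

    def dct_downsample(hidden, keep_ratio=0.5):
        """hidden: (B, L, D). Keep low-frequency DCT coefficients along the length axis."""
        h = hidden.transpose(1, 2)                  # (B, D, L): transform over sequence length
        coeffs = dct_ii(h)
        k = max(1, int(h.size(-1) * keep_ratio))
        return coeffs[..., :k].transpose(1, 2)      # (B, k, D): shorter hidden sequence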

Is Writing Prompts Really Making Art?

Feb 02, 2023
Jon McCormack, Camilo Cruz Gambardella, Nina Rajcic, Stephen James Krol, Maria Teresa Llano, Meng Yang

In recent years, generative machine learning systems have advanced significantly. A current wave of generative systems uses text prompts to create complex imagery, video, and even 3D datasets. The creators of these systems claim a revolution in bringing creativity and art to anyone who can type a prompt. In this position paper, we question the basis for these claims, dividing our analysis into three areas: the limitations of linguistic descriptions, the implications of the dataset, and, lastly, matters of materiality and embodiment. We conclude with an analysis of the creative possibilities enabled by prompt-based systems, asking whether they can be considered a new artistic medium.

* Paper accepted for EvoMUSART Conference, Brno, Czech Republic, 12-14 April 2023 

Key-frame Guided Network for Thyroid Nodule Recognition using Ultrasound Videos

Jun 30, 2022
Yuchen Wang, Zhongyu Li, Xiangxiang Cui, Liangliang Zhang, Xiang Luo, Meng Yang, Shi Chang

Ultrasound examination is widely used in the clinical diagnosis of thyroid nodules (benign/malignant), but the accuracy relies heavily on radiologist experience. Although deep learning techniques have been investigated for thyroid nodule recognition, current solutions are mainly based on static ultrasound images, use limited temporal information, and are inconsistent with clinical diagnosis. This paper proposes a novel method for the automated recognition of thyroid nodules through an exhaustive exploration of ultrasound videos and key-frames. We first propose a detection-localization framework to automatically identify the clinical key-frame containing a typical nodule in each ultrasound video. Based on the localized key-frame, we develop a key-frame guided video classification model for thyroid nodule recognition. We further introduce a motion attention module to help the network focus on significant frames in an ultrasound video, which is consistent with clinical diagnosis. The proposed thyroid nodule recognition framework is validated on clinically collected ultrasound videos, demonstrating superior performance compared with other state-of-the-art methods.
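
The abstract does not specify the motion attention module; the sketch below shows one simple way frame-difference energy could be turned into per-frame weights. Whether high- or low-motion frames count as significant is a modeling choice not stated in the abstract, so it is left as a flag here.

    import torch
    import torch.nn.functional as F

    def motion_attention(frames, frame_feats, tau=1.0, low_motion=False):
        """frames: (B, T, 1, H, W) grayscale video; frame_feats: (B, T, D) per-frame features."""
        diff = (frames[:, 1:] - frames[:, :-1]).abs().mean(dim=(2, 3, 4))  # (B, T-1) motion energy
        motion = F.pad(diff, (1, 0))                     # align to T frames
        score = -motion if low_motion else motion        # which frames count as "significant"
        weights = F.softmax(score / tau, dim=1)          # per-frame attention weights
        return (weights.unsqueeze(-1) * frame_feats).sum(dim=1)   # (B, D) video descriptor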

Learning Compact and Representative Features for Cross-Modality Person Re-Identification

Mar 26, 2021
Guangwei Gao, Hao Shao, Yi Yu, Fei Wu, Meng Yang

This paper addresses the cross-modality visible-infrared person re-identification (VI Re-ID) task, which aims to match person samples between the visible and infrared modalities. To reduce the discrepancy between features of different modalities, most existing works use constraints based on the Euclidean metric. Since a Euclidean distance metric cannot effectively measure the angles between embedded vectors, such methods cannot learn an angularly discriminative feature embedding. Because the most important factor in classification based on embedding vectors is whether the feature space is angularly discriminative, we propose a new loss function called the Enumerate Angular Triplet (EAT) loss. In addition, motivated by knowledge distillation, we present a new Cross-Modality Knowledge Distillation (CMKD) loss to narrow the gap between features of different modalities before feature embedding. Experimental results on the RegDB and SYSU-MM01 datasets show that the proposed method outperforms other state-of-the-art methods.

* 9 pages, 6 figures 
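
The EAT loss itself is defined in the paper, not in the abstract; the sketch below only illustrates the general idea of a triplet constraint expressed on angles between cross-modality embeddings rather than on Euclidean distances.

    import torch
    import torch.nn.functional as F

    def angular_triplet(anchor, positive, negative, margin=0.3):
        """Each input: (B, D) embeddings; anchors and positives/negatives may come from different modalities."""
        a, p, n = (F.normalize(x, dim=1) for x in (anchor, positive, negative))
        ang_ap = torch.acos((a * p).sum(1).clamp(-1 + 1e-7, 1 - 1e-7))  # angle to positive
        ang_an = torch.acos((a * n).sum(1).clamp(-1 + 1e-7, 1 - 1e-7))  # angle to negative
        return F.relu(ang_ap - ang_an + margin).mean()   # pull positives angularly closer than negatives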

Hierarchical Deep CNN Feature Set-Based Representation Learning for Robust Cross-Resolution Face Recognition

Mar 25, 2021
Guangwei Gao, Yi Yu, Jian Yang, Guo-Jun Qi, Meng Yang

Cross-resolution face recognition (CRFR), which is important in intelligent surveillance and biometric forensics, refers to the problem of matching a low-resolution (LR) probe face image against high-resolution (HR) gallery face images. Existing shallow learning-based and deep learning-based methods focus on mapping HR-LR face pairs into a joint feature space where the resolution discrepancy is mitigated. However, few works consider how to extract and utilize the intermediate discriminative features of noisy LR query faces to further mitigate the resolution discrepancy caused by resolution limitations. In this study, we aim to fully exploit the multi-level deep convolutional neural network (CNN) feature set for robust CRFR. Our contributions are threefold. (i) To learn more robust and discriminative features, we adaptively fuse the contextual features from different layers. (ii) To fully exploit these contextual features, we design a feature set-based representation learning (FSRL) scheme that collaboratively represents the hierarchical features for more accurate recognition. Moreover, FSRL utilizes the primitive form of feature maps to keep the latent structural information, especially in noisy cases. (iii) To further improve recognition performance, we fuse the hierarchical recognition outputs from different stages, so that the discriminability at different scales is fully integrated. By exploiting these advantages, the proposed method achieves high efficiency. Experimental results on several face datasets verify the superiority of the presented algorithm over competitive CRFR approaches.

* IEEE Transactions on Circuits and Systems for Video Technology, 11 pages, 9 figures 
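
As a hedged illustration of "collaboratively representing" features, the sketch below applies classical collaborative representation (ridge-regularized least squares over a gallery dictionary, classification by class-wise residual) to a single feature stage. FSRL as described above applies this kind of representation to hierarchical CNN feature sets and fuses the per-stage decisions; those details go beyond this sketch.

    import torch

    def collaborative_residuals(gallery, labels, probe, lam=1e-2):
        """gallery: (D, N) columns are gallery features; labels: (N,); probe: (D,)."""
        G = gallery
        I = torch.eye(G.size(1), dtype=G.dtype, device=G.device)
        alpha = torch.linalg.solve(G.T @ G + lam * I, G.T @ probe)   # collaborative coding
        residuals = {}
        for c in labels.unique():
            mask = (labels == c).to(G.dtype)
            residuals[int(c)] = torch.norm(probe - G @ (alpha * mask)).item()
        return residuals   # smaller residual => more likely class; scores can be fused across stages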

Integrating Pre-trained Model into Rule-based Dialogue Management

Feb 17, 2021
Jun Quan, Meng Yang, Qiang Gan, Deyi Xiong, Yiming Liu, Yuchen Dong, Fangxin Ouyang, Jun Tian, Ruiling Deng, Yongzhi Li, Yang Yang, Daxin Jiang

Rule-based dialogue management is still the most popular solution for industrial task-oriented dialogue systems because of its interpretability. However, it is hard for developers to maintain the dialogue logic as scenarios become more and more complex. On the other hand, data-driven dialogue systems, usually with end-to-end structures, are popular in academic research and handle complex conversations more easily, but such methods require plenty of training data and their behaviors are less interpretable. In this paper, we propose a method that leverages the strengths of both rule-based and data-driven dialogue managers (DM). We first introduce the DM of the Carina Dialog System (CDS, an advanced industrial dialogue system built by Microsoft). We then propose the "model-trigger" design to make the DM trainable and thus scalable to scenario changes. Furthermore, we integrate pre-trained models and empower the DM with few-shot capability. The experimental results demonstrate the effectiveness and strong few-shot capability of our method.

* AAAI 2021 Demo Paper 
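
The "model-trigger" design is only named in the abstract. As a loose illustration (not the CDS implementation), the sketch below lets a pre-trained sentence encoder score the user utterance against example utterances attached to each rule and trigger the best match above a threshold; the encoder, the rule format, and the threshold are all assumptions.

    import torch
    import torch.nn.functional as F

    def pick_rule(utterance, rules, encode, threshold=0.6):
        """rules: {rule_name: [example utterances]}; encode: text -> (D,) embedding tensor."""
        u = F.normalize(encode(utterance), dim=0)
        best, best_score = None, threshold
        for name, examples in rules.items():
            ex = torch.stack([F.normalize(encode(e), dim=0) for e in examples])
            score = (ex @ u).max().item()          # best cosine match for this rule
            if score > best_score:
                best, best_score = name, score
        return best    # None => fall back to default rule-based handling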

HpGAN: Sequence Search with Generative Adversarial Networks

Dec 10, 2020
Mingxing Zhang, Zhengchun Zhou, Lanping Li, Zilong Liu, Meng Yang, Yanghe Feng

Sequences play an important role in many engineering applications and systems. Searching for sequences with desired properties has long been an interesting but challenging research topic. This article proposes a novel method, called HpGAN, to search for desired sequences algorithmically using generative adversarial networks (GANs). HpGAN is based on the idea of a zero-sum game to train a generative model that can generate sequences with characteristics similar to the training sequences. In HpGAN, we design a Hopfield network as an encoder to avoid the limitations of GANs in generating discrete data. Compared with traditional sequence construction by algebraic tools, HpGAN is particularly suitable for intractable problems with complex objectives that prevent mathematical analysis. We demonstrate the search capabilities of HpGAN in two applications: 1) HpGAN successfully found many different mutually orthogonal complementary code sets (MOCCSs) and optimal odd-length Z-complementary pairs (OB-ZCPs) that are not part of the training set; in the literature, both MOCCSs and OB-ZCPs have found wide applications in wireless communications. 2) HpGAN found new sequences that achieve a four-fold increase in the signal-to-interference ratio of a mismatched filter (MMF) estimator in pulse compression radar systems, benchmarked against the well-known Legendre sequence. These sequences outperform those found by AlphaSeq.

* 12 pages, 16 figures 
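
To make the Hopfield ingredient concrete, the sketch below shows a classical Hopfield network that stores ±1 training sequences and maps a continuous (e.g., generator) output to a nearby binary sequence, one way to sidestep the discreteness problem mentioned above. The paper's actual encoder design may differ, so treat this purely as background.

    import torch

    def hopfield_weights(patterns):
        """patterns: (P, N) stored sequences with entries in {-1, +1}."""
        W = patterns.T.float() @ patterns.float() / patterns.size(0)   # Hebbian weights
        W.fill_diagonal_(0.0)
        return W

    def hopfield_recall(W, x, steps=20):
        """Map a continuous vector x (N,) to a binary sequence near a stored pattern."""
        s = torch.sign(x)
        s[s == 0] = 1.0
        for _ in range(steps):                 # synchronous updates toward a fixed point
            s = torch.sign(W @ s)
            s[s == 0] = 1.0
        return s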