Yifang Yin

SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection

Aug 26, 2023
Qiu Zhou, Jinming Cao, Hanchao Leng, Yifang Yin, Yu Kun, Roger Zimmermann

In the field of autonomous driving, accurate and comprehensive perception of the 3D environment is crucial. Bird's Eye View (BEV) based methods have emerged as a promising solution for 3D object detection using multi-view images as input. However, existing 3D object detection methods often ignore the physical context in the environment, such as sidewalks and vegetation, resulting in sub-optimal performance. In this paper, we propose a novel approach called SOGDet (Semantic-Occupancy Guided Multi-view 3D Object Detection), which leverages a 3D semantic-occupancy branch to improve the accuracy of 3D object detection. In particular, the physical context modeled by semantic occupancy helps the detector perceive scenes in a more holistic way. SOGDet is flexible and can be seamlessly integrated with most existing BEV-based methods. To evaluate its effectiveness, we apply this approach to several state-of-the-art baselines and conduct extensive experiments on the nuScenes dataset. Our results show that SOGDet consistently enhances the performance of three baseline methods in terms of nuScenes Detection Score (NDS) and mean Average Precision (mAP). This indicates that combining 3D object detection with 3D semantic occupancy leads to a more comprehensive perception of the 3D environment and thereby helps build more robust autonomous driving systems. The code is available at: https://github.com/zhouqiu/SOGDet.

* 10 pages 
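
The sketch below illustrates the general idea of attaching an auxiliary semantic-occupancy branch to a BEV-based detector and training both heads jointly; it is not the SOGDet implementation, and all module names, channel sizes, and the occupancy-loss weight (e.g. ToyBEVDetectorWithOccupancy, the 0.5 factor) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a semantic-occupancy branch attached
# to a BEV-based detector and trained jointly; shapes are illustrative only.
import torch
import torch.nn as nn

class ToyBEVDetectorWithOccupancy(nn.Module):
    def __init__(self, bev_channels=64, num_classes=10, occ_classes=17, height_bins=16):
        super().__init__()
        # Shared BEV encoder (stand-in for the image-to-BEV backbone).
        self.bev_encoder = nn.Sequential(
            nn.Conv2d(bev_channels, bev_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(bev_channels, bev_channels, 3, padding=1), nn.ReLU(),
        )
        # Simplified detection head: class logits per BEV cell.
        self.det_head = nn.Conv2d(bev_channels, num_classes, 1)
        # Semantic-occupancy head: a semantic label per voxel, obtained by
        # expanding the BEV feature along a height dimension.
        self.occ_head = nn.Conv2d(bev_channels, occ_classes * height_bins, 1)
        self.occ_classes, self.height_bins = occ_classes, height_bins

    def forward(self, bev_feat):
        shared = self.bev_encoder(bev_feat)
        det_logits = self.det_head(shared)            # (B, num_classes, H, W)
        occ = self.occ_head(shared)                   # (B, C*Z, H, W)
        b, _, h, w = occ.shape
        occ_logits = occ.view(b, self.occ_classes, self.height_bins, h, w)
        return det_logits, occ_logits

# Joint training step: the occupancy loss acts as an auxiliary signal that injects
# physical context (sidewalk, vegetation, ...) into the shared BEV features.
model = ToyBEVDetectorWithOccupancy()
bev = torch.randn(2, 64, 128, 128)
det_target = torch.randint(0, 10, (2, 128, 128))
occ_target = torch.randint(0, 17, (2, 16, 128, 128))
det_logits, occ_logits = model(bev)
loss = nn.CrossEntropyLoss()(det_logits, det_target) + \
       0.5 * nn.CrossEntropyLoss()(occ_logits, occ_target)
loss.backward()
```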

Prototypical Cross-domain Knowledge Transfer for Cervical Dysplasia Visual Inspection

Aug 19, 2023
Yichen Zhang, Yifang Yin, Ying Zhang, Zhenguang Liu, Zheng Wang, Roger Zimmermann

Early detection of dysplasia of the cervix is critical for cervical cancer treatment. However, automatic cervical dysplasia diagnosis via visual inspection, which is more appropriate in low-resource settings, remains a challenging problem. Though promising results have been obtained by recent deep learning models, their performance is significantly hindered by the limited scale of the available cervix datasets. Distinct from previous methods that learn from a single dataset, we propose to leverage cross-domain cervical images collected in different but related clinical studies to improve the model's performance on the targeted cervix dataset. To robustly learn the transferable information across datasets, we propose a novel prototype-based knowledge filtering method that estimates the transferability of cross-domain samples. We further optimize the shared feature space by aligning the cross-domain image representations simultaneously at the domain level with early alignment and at the class level with supervised contrastive learning, which makes model training and knowledge transfer more robust. Empirical results on three real-world benchmark cervical image datasets show that our proposed method outperforms state-of-the-art cervical dysplasia visual inspection methods by an absolute improvement of 4.7% in top-1 accuracy, 7.0% in precision, 1.4% in recall, 4.6% in F1 score, and 0.05 in ROC-AUC.

* accepted in ACM Multimedia 2023 
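
As a rough illustration of the prototype-based filtering idea, the snippet below weights cross-domain samples by their cosine similarity to target-domain class prototypes; the helper names (class_prototypes, transferability_weights) and the sigmoid weighting are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch: cross-domain samples are weighted by their similarity to
# target-domain class prototypes, so only transferable samples contribute strongly.
import torch
import torch.nn.functional as F

def class_prototypes(features, labels, num_classes):
    """Mean feature per class on the target dataset, L2-normalized."""
    protos = torch.zeros(num_classes, features.size(1))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(dim=0)
    return F.normalize(protos, dim=1)

def transferability_weights(src_features, src_labels, protos, temperature=0.1):
    """Cosine similarity of each source sample to the prototype of its own class,
    mapped to a soft weight in (0, 1)."""
    src = F.normalize(src_features, dim=1)
    sim = (src * protos[src_labels]).sum(dim=1)   # per-sample cosine similarity
    return torch.sigmoid(sim / temperature)

# Usage with random stand-in embeddings.
tgt_feat, tgt_lab = torch.randn(100, 128), torch.randint(0, 2, (100,))
src_feat, src_lab = torch.randn(200, 128), torch.randint(0, 2, (200,))
protos = class_prototypes(tgt_feat, tgt_lab, num_classes=2)
weights = transferability_weights(src_feat, src_lab, protos)
# `weights` can then scale each cross-domain sample's classification loss.
```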

Beyond Geo-localization: Fine-grained Orientation of Street-view Images by Cross-view Matching with Satellite Imagery with Supplementary Materials

Jul 13, 2023
Wenmiao Hu, Yichen Zhang, Yuxuan Liang, Yifang Yin, Andrei Georgescu, An Tran, Hannes Kruppa, See-Kiong Ng, Roger Zimmermann

Street-view imagery provides us with novel experiences to explore places remotely. Carefully calibrated street-view images (e.g., Google Street View) can be used for different downstream tasks, e.g., navigation and map feature extraction. As personal high-quality cameras have become much more affordable and portable, an enormous amount of crowdsourced street-view images is uploaded to the internet, but commonly with missing or noisy sensor information. To prepare this hidden treasure for "ready-to-use" status, determining missing location information and camera orientation angles are two equally important tasks. Recent methods have achieved high performance on geo-localization of street-view images by cross-view matching with a pool of geo-referenced satellite imagery. However, most existing works focus more on geo-localization than on estimating the image orientation. In this work, we restate the importance of finding fine-grained orientation for street-view images, formally define the problem, and provide a set of evaluation metrics to assess the quality of the orientation estimation. We propose two methods to improve the granularity of the orientation estimation, achieving 82.4% and 72.3% accuracy for images with estimated angle errors below 2 degrees on the CVUSA and CVACT datasets, corresponding to absolute improvements of 34.9% and 28.2% over previous works. Integrating fine-grained orientation estimation into training also improves the performance on geo-localization, giving top-1 recall of 95.5%/85.5% and 86.8%/80.4% for orientation-known/unknown tests on the two datasets.

* Proceedings of the 30th ACM International Conference on Multimedia (2022) 6155-6164  
* This paper has been accepted by ACM Multimedia 2022. This version contains additional supplementary materials 
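
The following toy sketch shows one generic way to estimate a street-view heading by cross-view matching: correlating a ground-view descriptor against circular azimuth shifts of a satellite-view descriptor. It is only an illustration under strong simplifying assumptions (pre-extracted azimuth features, brute-force shifts), not either of the proposed methods.

```python
# Hedged sketch of orientation estimation by cross-view matching: slide the
# ground-view feature over azimuth shifts of the satellite feature and take the
# best-matching shift as the heading.
import torch
import torch.nn.functional as F

def estimate_orientation(ground_feat, sat_feat):
    """ground_feat, sat_feat: (C, W) descriptors over W azimuth bins.
    Returns the estimated heading in degrees and the correlation profile."""
    g = F.normalize(ground_feat.flatten().unsqueeze(0), dim=1)
    scores = []
    width = sat_feat.size(1)
    for shift in range(width):                    # brute-force circular shifts
        rolled = torch.roll(sat_feat, shifts=shift, dims=1)
        s = F.normalize(rolled.flatten().unsqueeze(0), dim=1)
        scores.append((g * s).sum())
    scores = torch.stack(scores)
    best = int(torch.argmax(scores))
    return best * 360.0 / width, scores

ground = torch.randn(32, 64)                       # e.g. 64 azimuth bins
satellite = torch.roll(ground, shifts=-10, dims=1)  # same scene rotated by 10 bins
heading, profile = estimate_orientation(ground, satellite)
print(f"estimated heading: {heading:.1f} degrees")  # expect about 10 * 360/64 = 56.2
```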

Emotionally Enhanced Talking Face Generation

Mar 26, 2023
Sahil Goyal, Shagun Uppal, Sarthak Bhagat, Yi Yu, Yifang Yin, Rajiv Ratn Shah

Several works have developed end-to-end pipelines for generating lip-synced talking faces with various real-world applications, such as teaching and language translation in videos. However, these prior works fail to create realistic-looking videos since they pay little attention to people's expressions and emotions. Moreover, these methods' effectiveness largely depends on the faces in the training dataset, which means they may not perform well on unseen faces. To mitigate this, we build a talking face generation framework conditioned on a categorical emotion to generate videos with appropriate expressions, making them more realistic and convincing. With a broad range of six emotions, i.e., happiness, sadness, fear, anger, disgust, and neutral, we show that our model can adapt to arbitrary identities, emotions, and languages. Our proposed framework is equipped with a user-friendly web interface that offers a real-time talking face generation experience with emotions. We also conduct a user study for subjective evaluation of our interface's usability, design, and functionality. Project page: https://midas.iiitd.edu.in/emo/
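
A minimal sketch of emotion conditioning is given below: the categorical emotion is embedded and concatenated with audio and identity features before decoding a frame. The architecture, shapes, and names (EmotionConditionedGenerator, the 96x96 output) are hypothetical placeholders rather than the framework described in the paper.

```python
# Illustrative sketch (not the authors' architecture) of conditioning a talking-face
# generator on a categorical emotion; all shapes are assumptions.
import torch
import torch.nn as nn

EMOTIONS = ["happiness", "sadness", "fear", "anger", "disgust", "neutral"]

class EmotionConditionedGenerator(nn.Module):
    def __init__(self, audio_dim=256, id_dim=256, emo_dim=32, out_pixels=96 * 96 * 3):
        super().__init__()
        self.emotion_embed = nn.Embedding(len(EMOTIONS), emo_dim)
        self.decoder = nn.Sequential(
            nn.Linear(audio_dim + id_dim + emo_dim, 512), nn.ReLU(),
            nn.Linear(512, out_pixels), nn.Tanh(),
        )

    def forward(self, audio_feat, identity_feat, emotion_idx):
        emo = self.emotion_embed(emotion_idx)
        z = torch.cat([audio_feat, identity_feat, emo], dim=-1)
        return self.decoder(z).view(-1, 3, 96, 96)   # one generated frame per item

gen = EmotionConditionedGenerator()
frame = gen(torch.randn(1, 256), torch.randn(1, 256),
            torch.tensor([EMOTIONS.index("happiness")]))
print(frame.shape)   # torch.Size([1, 3, 96, 96])
```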

Mix-up Self-Supervised Learning for Contrast-agnostic Applications

Apr 02, 2022
Yichen Zhang, Yifang Yin, Ying Zhang, Roger Zimmermann

Contrastive self-supervised learning has attracted significant research attention recently. It learns effective visual representations from unlabeled data by embedding augmented views of the same image close to each other while pushing away embeddings of different images. Despite its great success on ImageNet classification, COCO object detection, etc., its performance degrades on contrast-agnostic applications, e.g., medical image classification, where all images are visually similar to each other. This creates difficulties in optimizing the embedding space as the distances between images are rather small. To solve this issue, we present the first mix-up self-supervised learning framework for contrast-agnostic applications. We address the low variance across images via cross-domain mix-up and build the pretext task on two synergistic objectives: image reconstruction and transparency prediction. Experimental results on two benchmark datasets validate the effectiveness of our method, with an improvement of 2.5%~7.4% in top-1 accuracy over existing self-supervised learning methods.

* Accepted by ICME 2021 
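
The snippet below sketches the two pretext objectives under simple assumptions: an in-domain image is mixed with a cross-domain image at a random transparency, and the network is asked to reconstruct the original image and to predict the mixing ratio. Module names and the tiny architecture are illustrative only, not the paper's code.

```python
# Rough sketch of the mix-up pretext task: mix in-domain and cross-domain images,
# then reconstruct the original image and predict the transparency (mixing ratio).
import torch
import torch.nn as nn

class MixupSSLModel(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
        self.reconstruct = nn.Conv2d(ch, 3, 3, padding=1)     # image reconstruction head
        self.alpha_head = nn.Sequential(                       # transparency prediction head
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        return self.reconstruct(h), self.alpha_head(h).squeeze(1)

model = MixupSSLModel()
medical = torch.rand(4, 3, 64, 64)        # contrast-agnostic target images
natural = torch.rand(4, 3, 64, 64)        # cross-domain (e.g. natural) images
alpha = torch.rand(4, 1, 1, 1)            # random mixing transparency per sample
mixed = alpha * medical + (1 - alpha) * natural
recon, alpha_pred = model(mixed)
loss = nn.MSELoss()(recon, medical) + nn.MSELoss()(alpha_pred, alpha.view(-1))
loss.backward()
```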

Motion Prediction via Joint Dependency Modeling in Phase Space

Jan 07, 2022
Pengxiang Su, Zhenguang Liu, Shuang Wu, Lei Zhu, Yifang Yin, Xuanjing Shen

Motion prediction is a classic problem in computer vision, which aims at forecasting future motion given an observed pose sequence. Various deep learning models have been proposed, achieving state-of-the-art performance on motion prediction. However, existing methods typically focus on modeling temporal dynamics in the pose space. Unfortunately, the complicated and high-dimensional nature of human motion poses inherent challenges for capturing dynamic context. Therefore, we move away from the conventional pose-based representation and present a novel approach employing a phase-space trajectory representation of individual joints. Moreover, current methods tend to consider only the dependencies between physically connected joints. In this paper, we introduce a novel convolutional neural model that effectively leverages explicit prior knowledge of motion anatomy and simultaneously captures both the spatial and temporal information of joint trajectory dynamics. We then propose a global optimization module that learns the implicit relationships between individual joint features. Empirically, our method is evaluated on large-scale 3D human motion benchmark datasets (i.e., Human3.6M, CMU MoCap). The results demonstrate that our method sets a new state of the art on these benchmarks. Our code will be available at https://github.com/Pose-Group/TEID.
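
As a hedged illustration of what a phase-space trajectory representation can look like, the helper below stacks each joint's position with its finite-difference velocity; this is an assumption about the general idea, not necessarily the exact representation used in the paper.

```python
# Sketch of a phase-space representation: each joint trajectory is described by
# its position together with its velocity (finite difference over time).
import torch

def to_phase_space(poses):
    """poses: (T, J, 3) sequence of 3D joint positions over T frames.
    Returns (T, J, 6): position and velocity stacked per joint."""
    velocity = torch.zeros_like(poses)
    velocity[1:] = poses[1:] - poses[:-1]        # first-order finite difference
    return torch.cat([poses, velocity], dim=-1)

# Example: 50 observed frames of a 22-joint skeleton (Human3.6M-style layout assumed).
observed = torch.randn(50, 22, 3)
phase = to_phase_space(observed)
print(phase.shape)   # torch.Size([50, 22, 6]); this tensor would feed the convolutional model
```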

Decoupling Long- and Short-Term Patterns in Spatiotemporal Inference

Sep 16, 2021
Junfeng Hu, Yuxuan Liang, Zhencheng Fan, Yifang Yin, Ying Zhang, Roger Zimmermann

Sensors are the key to sensing the environment and imparting benefits to smart cities in many aspects, such as providing real-time air quality information throughout an urban area. However, a prerequisite is to obtain fine-grained knowledge of the environment, and there is a limit to how many sensors can be installed in the physical world due to non-negligible expenses. In this paper, we propose to infer real-time information of any given location in a city based on historical and current observations from the available sensors (termed spatiotemporal inference). Our approach decouples the modeling of short-term and long-term patterns and relies on two major components. Firstly, unlike previous studies that separated spatial and temporal relation learning, we introduce a joint spatiotemporal graph attention network that learns short-term dependencies across both the spatial and temporal dimensions. Secondly, we propose an adaptive graph recurrent network with a time skip for capturing long-term patterns. The adaptive adjacency matrices are first learned inductively and then used as inputs to a recurrent network to learn dynamic dependencies. Experimental results on four public real-world datasets show that our method reduces the mean absolute errors of state-of-the-art baselines by 5%~12%.
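
The sketch below shows one common way to realize an adaptive adjacency matrix, derived from learnable node embeddings and used for a graph-convolution step; the layer name and dimensions are assumptions, and this is a minimal illustration rather than the paper's architecture.

```python
# Minimal sketch of an adaptive adjacency matrix: pairwise affinities come from
# learnable node embeddings and drive a graph-convolution step that could sit
# inside a recurrent network for long-term patterns.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphLayer(nn.Module):
    def __init__(self, num_nodes, embed_dim=16, feat_dim=32):
        super().__init__()
        self.node_embed = nn.Parameter(torch.randn(num_nodes, embed_dim))
        self.linear = nn.Linear(feat_dim, feat_dim)

    def adjacency(self):
        # Learned, data-independent ("inductive") adjacency: softmax over node affinities.
        scores = self.node_embed @ self.node_embed.t()
        return F.softmax(F.relu(scores), dim=1)

    def forward(self, x):
        # x: (batch, num_nodes, feat_dim); one graph-convolution step A @ X @ W.
        return F.relu(self.adjacency() @ self.linear(x))

layer = AdaptiveGraphLayer(num_nodes=20)
sensors = torch.randn(8, 20, 32)       # 8 batch items, 20 sensor nodes, 32 features
out = layer(sensors)
print(out.shape)                        # torch.Size([8, 20, 32])
```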

Zero-Shot Multi-View Indoor Localization via Graph Location Networks

Aug 06, 2020
Meng-Jiun Chiou, Zhenguang Liu, Yifang Yin, Anan Liu, Roger Zimmermann

Indoor localization is a fundamental problem in location-based applications. Current approaches to this problem typically rely on Radio Frequency technology, which requires not only supporting infrastructure but also human effort to measure and calibrate the signal. Moreover, data collection for all locations is indispensable in existing methods, which in turn hinders their large-scale deployment. In this paper, we propose a novel neural-network-based architecture, Graph Location Networks (GLN), to perform infrastructure-free, multi-view image-based indoor localization. GLN makes location predictions based on robust location representations extracted from images through message-passing networks. Furthermore, we introduce a novel zero-shot indoor localization setting and tackle it by extending the proposed GLN to a dedicated zero-shot version, which exploits a novel mechanism, Map2Vec, to train location-aware embeddings and make predictions on novel unseen locations. Our extensive experiments show that the proposed approach outperforms state-of-the-art methods in the standard setting, and achieves promising accuracy even in the zero-shot setting where data for half of the locations are not available. The source code and datasets are publicly available at https://github.com/coldmanck/zero-shot-indoor-localization-release.

* Proceedings of the 28th ACM International Conference on Multimedia, 2020  
* Accepted at ACM MM 2020. 10 pages, 7 figures. Code and datasets available at https://github.com/coldmanck/zero-shot-indoor-localization-release 
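
To make the zero-shot mechanism concrete, the toy snippet below predicts an unseen location by nearest-neighbour search between an image embedding and location-aware embeddings; the random placeholder embeddings stand in for what a Map2Vec-style mechanism would learn, and predict_location is a hypothetical helper, not part of the GLN code.

```python
# Sketch of zero-shot location prediction: images are embedded into the same space
# as location embeddings, and an unseen location is predicted by nearest neighbour.
import torch
import torch.nn.functional as F

def predict_location(image_embedding, location_embeddings):
    """image_embedding: (D,); location_embeddings: (L, D) covering seen AND unseen locations.
    Returns the index of the most similar location."""
    sims = F.cosine_similarity(image_embedding.unsqueeze(0), location_embeddings, dim=1)
    return int(torch.argmax(sims))

# Stand-in embeddings: in the paper, location-aware embeddings come from training on
# the floor-plan topology; here they are random placeholders.
num_locations, dim = 100, 256
location_emb = F.normalize(torch.randn(num_locations, dim), dim=1)
query = F.normalize(torch.randn(dim), dim=0)
print("predicted location id:", predict_location(query, location_emb))
```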

"Notic My Speech" -- Blending Speech Patterns With Multimedia

Jun 12, 2020
Dhruva Sahrawat, Yaman Kumar, Shashwat Aggarwal, Yifang Yin, Rajiv Ratn Shah, Roger Zimmermann

Speech as a natural signal is composed of three parts: visemes (the visual part of speech), phonemes (the spoken part of speech), and language (the imposed structure). However, video, as a medium for delivering speech and as a multimedia construct, has mostly ignored the cognitive aspects of speech delivery. For example, video applications like transcoding and compression have so far ignored how speech is delivered and heard. To close the gap between speech understanding and multimedia video applications, in this paper we present initial experiments that model the perception of visual speech and show its use case in video compression. On the other hand, in the visual speech recognition domain, existing studies have mostly modeled it as a classification problem while ignoring the correlations between views, phonemes, visemes, and speech perception. This results in solutions that are further removed from how human perception works. To bridge this gap, we propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding. We conduct experiments on three public visual speech recognition datasets. The experimental results show that our proposed method outperforms existing work by 4.99% in terms of viseme error rate. Moreover, we show that there is a strong correlation between our model's understanding of multi-view speech and human perception. This characteristic benefits downstream applications such as video compression and streaming, where a significant number of less important frames can be compressed or eliminated while maximally preserving human speech understanding and a good user experience.

* Under Review 
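
A minimal sketch of a view-temporal attention block is shown below: per-frame features from multiple camera views are first weighted across views and then across time. This illustrates the general mechanism under assumed shapes and layer names; it is not the proposed model.

```python
# Hedged sketch of view-temporal attention: weight multi-view, per-frame features
# across views, then across time, before classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewTemporalAttention(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.view_score = nn.Linear(feat_dim, 1)   # scores each view at each time step
        self.time_score = nn.Linear(feat_dim, 1)   # scores each (view-pooled) time step

    def forward(self, x):
        # x: (batch, views, time, feat_dim)
        view_w = F.softmax(self.view_score(x), dim=1)             # attention over views
        pooled_views = (view_w * x).sum(dim=1)                    # (batch, time, feat_dim)
        time_w = F.softmax(self.time_score(pooled_views), dim=1)  # attention over time
        return (time_w * pooled_views).sum(dim=1)                 # (batch, feat_dim)

attn = ViewTemporalAttention()
features = torch.randn(2, 3, 25, 128)      # 3 views, 25 frames of lip-region features
clip_repr = attn(features)
print(clip_repr.shape)                      # torch.Size([2, 128]); fed to a viseme classifier
```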

COBRA: Contrastive Bi-Modal Representation Algorithm

May 24, 2020
Vishaal Udandarao, Abhishek Maiti, Deepak Srivatsav, Suryatej Reddy Vyalla, Yifang Yin, Rajiv Ratn Shah

A wide range of applications involve multi-modal data, such as cross-modal retrieval, visual question answering, and image captioning. Such applications primarily depend on aligned distributions of the different constituent modalities. Existing approaches generate latent embeddings for each modality in a joint fashion by representing them in a common manifold. However, these joint embedding spaces fail to sufficiently reduce the modality gap, which affects performance in downstream tasks. We hypothesize that these embeddings retain the intra-class relationships but are unable to preserve the inter-class dynamics. In this paper, we present a novel framework, COBRA, that aims to train two modalities (image and text) in a joint fashion, inspired by the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE) paradigms, which preserve both inter- and intra-class relationships. We empirically show that this framework reduces the modality gap significantly and generates a robust and task-agnostic joint embedding space. We outperform existing work on four diverse downstream tasks spanning seven benchmark cross-modal datasets.

* 13 Pages, 6 Figures and 10 Tables 
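
The snippet below sketches a symmetric NCE-style bi-modal contrastive loss in the spirit described above, where matching image-text pairs are pulled together and other pairs in the batch serve as negatives; the temperature value and function name are assumptions, not the COBRA objective itself.

```python
# Minimal sketch of a symmetric NCE-style bi-modal contrastive objective.
import torch
import torch.nn.functional as F

def bimodal_info_nce(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (N, D) embeddings of N aligned image-text pairs."""
    img = F.normalize(image_emb, dim=1)
    txt = F.normalize(text_emb, dim=1)
    logits = img @ txt.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(img.size(0))         # the i-th image matches the i-th text
    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

img = torch.randn(16, 256, requires_grad=True)
txt = torch.randn(16, 256, requires_grad=True)
loss = bimodal_info_nce(img, txt)
loss.backward()
```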