Ming-Ching Chang

MixNet: Toward Accurate Detection of Challenging Scene Text in the Wild

Aug 28, 2023
Yu-Xiang Zeng, Jun-Wei Hsieh, Xin Li, Ming-Ching Chang

Detecting small scene text instances in the wild is particularly challenging, where irregular positions and non-ideal lighting often lead to detection errors. We present MixNet, a hybrid architecture that combines the strengths of CNNs and Transformers, capable of accurately detecting small text in challenging natural scenes regardless of orientation, style, and lighting conditions. MixNet incorporates two key modules: (1) the Feature Shuffle Network (FSNet), which serves as the backbone, and (2) the Central Transformer Block (CTBlock), which exploits the 1D manifold constraint of scene text. We first introduce a novel feature shuffling strategy in FSNet to facilitate the exchange of features across multiple scales, generating high-resolution features superior to those of the popular ResNet and HRNet backbones. The FSNet backbone achieves significant improvements over many existing text detection methods, including PAN, DB, and FAST. We then design a complementary CTBlock that leverages center-line-based features, similar to the medial axis of text regions, and show that it can outperform contour-based approaches in challenging cases where small scene texts appear close together. Extensive experimental results show that MixNet, which mixes FSNet with CTBlock, achieves state-of-the-art results on multiple scene text detection datasets.
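
The cross-scale feature exchange described above can be pictured with a small sketch. The snippet below is an illustrative example only, not the authors' FSNet implementation: it splits each pyramid level into channel groups, resizes the groups across levels, and reassembles them so every level mixes information from all scales. Channel counts, the interpolation mode, and the function name are assumptions.

```python
# Illustrative sketch of a cross-scale feature shuffle, in the spirit of FSNet.
# Not the authors' code; channel counts and interpolation are assumptions.
import torch
import torch.nn.functional as F

def shuffle_features(features):
    """features: list of tensors [B, C, H_i, W_i] at different resolutions.

    Each tensor is split into len(features) channel groups; group j from every
    level is resized to level j's resolution and concatenated, so each output
    level mixes information from all scales.
    """
    n = len(features)
    groups = [torch.chunk(f, n, dim=1) for f in features]  # n groups per level
    shuffled = []
    for j, ref in enumerate(features):
        h, w = ref.shape[-2:]
        parts = [F.interpolate(groups[i][j], size=(h, w), mode="bilinear",
                               align_corners=False) for i in range(n)]
        shuffled.append(torch.cat(parts, dim=1))
    return shuffled

if __name__ == "__main__":
    feats = [torch.randn(1, 48, s, s) for s in (64, 32, 16)]
    out = shuffle_features(feats)
    print([o.shape for o in out])  # each level keeps its resolution, channels mixed
```

In practice such a shuffle is typically followed by convolutions that fuse the mixed channels back into the backbone's channel width.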

FishEye8K: A Benchmark and Dataset for Fisheye Camera Object Detection

Jun 06, 2023
Munkhjargal Gochoo, Munkh-Erdene Otgonbold, Erkhembayar Ganbold, Jun-Wei Hsieh, Ming-Ching Chang, Ping-Yang Chen, Byambaa Dorj, Hamad Al Jassmi, Ganzorig Batnasan, Fady Alnajjar, Mohammed Abduljabbar, Fang-Pang Lin

With the advance of AI, road object detection has become a prominent topic in computer vision, mostly using perspective cameras. Fisheye lenses provide omnidirectional wide coverage, allowing fewer cameras to monitor road intersections, albeit with view distortions. To our knowledge, there is no existing open dataset prepared for traffic surveillance with fisheye cameras. This paper introduces the open FishEye8K benchmark dataset for road object detection, which comprises 157K bounding boxes across five classes (Pedestrian, Bike, Car, Bus, and Truck). In addition, we present benchmark results for State-of-The-Art (SoTA) models, including variations of YOLOv5, YOLOR, YOLOv7, and YOLOv8. The dataset comprises 8,000 images recorded in 22 videos using 18 fisheye cameras for traffic monitoring in Hsinchu, Taiwan, at resolutions of 1080$\times$1080 and 1280$\times$1280. The data annotation and validation process was arduous and time-consuming due to the ultra-wide panoramic and hemispherical fisheye images with large distortion and numerous road participants, particularly people riding scooters. To avoid bias, frames from a particular camera were assigned to either the training or the test set, maintaining a ratio of about 70:30 for both the number of images and the bounding boxes in each class. Experimental results show that YOLOv8 and YOLOR perform best at input sizes 640$\times$640 and 1280$\times$1280, respectively. The dataset will be available on GitHub with PASCAL VOC, MS COCO, and YOLO annotation formats. The FishEye8K benchmark will provide significant contributions to fisheye video analytics and smart city applications.
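
The camera-level split mentioned above is a simple protocol to reproduce. The sketch below is a minimal example, assuming a list of frame records with a camera_id field and a greedy assignment of whole cameras until the training share approaches 70%; it is not the authors' split script.

```python
# Illustrative camera-level train/test split (whole cameras go to one split,
# aiming for roughly 70:30 by image count). Field names and the greedy
# strategy are assumptions, not the FishEye8K tooling.
from collections import defaultdict

def split_by_camera(frames, train_ratio=0.7):
    """frames: list of dicts like {"camera_id": str, "image": str}."""
    per_camera = defaultdict(list)
    for f in frames:
        per_camera[f["camera_id"]].append(f)

    total = len(frames)
    train, test, n_train = [], [], 0
    # Assign the largest cameras first, keeping the running ratio near the target.
    for cam, items in sorted(per_camera.items(), key=lambda kv: -len(kv[1])):
        if n_train + len(items) <= train_ratio * total or not train:
            train.extend(items)
            n_train += len(items)
        else:
            test.extend(items)
    return train, test

if __name__ == "__main__":
    demo = [{"camera_id": f"cam{c}", "image": f"cam{c}_{i}.png"}
            for c in range(6) for i in range(10 * (c + 1))]
    tr, te = split_by_camera(demo)
    print(len(tr), len(te), round(len(tr) / len(demo), 2))
```

A per-class check of bounding-box counts in each split would be added on top of this to verify the 70:30 ratio also holds per class.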

* CVPR Workshops 2023 

The 7th AI City Challenge

Apr 15, 2023
Milind Naphade, Shuo Wang, David C. Anastasiu, Zheng Tang, Ming-Ching Chang, Yue Yao, Liang Zheng, Mohammed Shaiqur Rahman, Meenakshi S. Arya, Anuj Sharma, Qi Feng, Vitaly Ablavsky, Stan Sclaroff, Pranamesh Chakraborty, Sanjita Prajapati, Alice Li, Shangru Li, Krishna Kunadharaju, Shenxin Jiang, Rama Chellappa

The AI City Challenge's seventh edition emphasizes two domains at the intersection of computer vision and artificial intelligence with considerable untapped potential: retail business and Intelligent Traffic Systems (ITS). The 2023 challenge had five tracks, which drew a record-breaking number of participation requests from 508 teams across 46 countries. Track 1 was a brand-new track focused on multi-target multi-camera (MTMC) people tracking, where teams trained and evaluated on both real and highly realistic synthetic data. Track 2 centered on natural-language-based vehicle track retrieval. Track 3 required teams to classify driver actions in naturalistic driving analysis. Track 4 aimed to develop an automated checkout system for retail stores using a single-view camera. Track 5, another new addition, tasked teams with detecting helmet-rule violations by motorcyclists. Two leaderboards were released for submissions based on different methods: a public leaderboard for the contest, where external private data was not allowed, and a general leaderboard for all submitted results. The participating teams' top performances established strong baselines and even outperformed the state-of-the-art in the proposed challenge tracks.

* Summary of the 7th AI City Challenge Workshop in conjunction with CVPR 2023 

SMILEtrack: SiMIlarity LEarning for Multiple Object Tracking

Nov 17, 2022
Yu-Hsiang Wang, Jun-Wei Hsieh, Ping-Yang Chen, Ming-Ching Chang

Multiple Object Tracking (MOT) is widely investigated in computer vision and has many applications. Tracking-By-Detection (TBD) is a popular multiple-object tracking paradigm that consists of an object detection step followed by data association, tracklet generation, and update. We propose a Similarity Learning Module (SLM), motivated by the Siamese network, to extract important object appearance features, together with a procedure to combine object motion and appearance features effectively. This design strengthens the modeling of object motion and appearance features for data association. We also design a Similarity Matching Cascade (SMC) for the data association of our SMILEtrack tracker. SMILEtrack achieves 81.06 MOTA and 80.5 IDF1 on the MOT17 test set of the MOTChallenge benchmark.
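
As a concrete illustration of the tracking-by-detection association step that the abstract builds on, the sketch below fuses a motion cue (IoU) with an appearance cue (cosine similarity of embeddings) into one cost matrix and solves it with the Hungarian algorithm. It is a generic baseline under an assumed fusion weight alpha, not the SLM/SMC modules proposed in the paper.

```python
# Generic tracking-by-detection association sketch (not SMILEtrack's SLM/SMC).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """a, b: boxes as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, track_feats, det_boxes, det_feats, alpha=0.5):
    """Returns (track_idx, detection_idx) pairs from a fused motion+appearance cost."""
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i, (tb, tf) in enumerate(zip(track_boxes, track_feats)):
        for j, (db, df) in enumerate(zip(det_boxes, det_feats)):
            app = np.dot(tf, df) / (np.linalg.norm(tf) * np.linalg.norm(df) + 1e-9)
            cost[i, j] = 1.0 - (alpha * iou(tb, db) + (1 - alpha) * app)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```

In practice, matches whose fused similarity falls below a threshold are rejected, and unmatched detections start new tracklets.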

* 9 pages, 6 figures 

Scale-Aware Crowd Counting Using a Joint Likelihood Density Map and Synthetic Fusion Pyramid Network

Nov 13, 2022
Yi-Kuan Hsieh, Jun-Wei Hsieh, Yu-Chee Tseng, Ming-Ching Chang, Bor-Shiun Wang

We develop a Synthetic Fusion Pyramid Network (SPF-Net) with a scale-aware loss function for accurate crowd counting. Existing crowd-counting methods assume that the training annotation points are accurate and thus ignore the fact that noisy annotations can lead to large model-learning bias and counting error, especially for highly dense crowds that appear far away. To the best of our knowledge, this work is the first to properly handle such noise at multiple scales in an end-to-end loss design and thus push the crowd-counting state-of-the-art. We model the noise of crowd annotation points as a Gaussian and derive the crowd probability density map from the input image. We then approximate the joint distribution of crowd density maps with the full covariance of multiple scales and derive a low-rank approximation for tractability and efficient implementation. The derived scale-aware loss function is used to train SPF-Net. We show that it outperforms various loss functions on four public datasets: UCF-QNRF, UCF_CC_50, NWPU, and ShanghaiTech A-B. The proposed SPF-Net can accurately predict the locations of people in the crowd, despite being trained on noisy annotations.
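
For readers unfamiliar with density-map-based counting, the sketch below shows the standard construction the abstract starts from: each annotation point contributes a normalized Gaussian, so the map integrates to the person count. The fixed sigma and grid loop are assumptions for illustration; the paper's multi-scale full-covariance model and its low-rank approximation are not reproduced here.

```python
# Minimal Gaussian density-map construction from point annotations.
import numpy as np

def density_map(points, height, width, sigma=4.0):
    """points: iterable of (x, y) head annotations; returns an HxW map
    whose sum approximates the person count."""
    ys, xs = np.mgrid[0:height, 0:width]
    dmap = np.zeros((height, width), dtype=np.float64)
    for (px, py) in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        g /= g.sum() + 1e-12          # each person contributes ~1 to the count
        dmap += g
    return dmap

if __name__ == "__main__":
    d = density_map([(10.0, 12.0), (40.0, 30.0)], height=64, width=64)
    print(round(d.sum(), 3))  # ~2.0, one unit of mass per annotated person
```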

* 8 pages, 8 figures, 4 tables 

FedDig: Robust Federated Learning Using Data Digest to Represent Absent Clients

Oct 05, 2022
Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

Federated Learning (FL) effectively protects client data privacy. However, clients that are absent or leave during training can seriously degrade model performance, particularly for unbalanced and non-IID client data. We address this issue by generating data digests from the raw data and using them to guide training at the FL moderator. The proposed FL framework, called FedDig, can tolerate unexpected client absence in cross-silo scenarios while preserving client data privacy, because the digests de-identify the raw data by mixing encoded features in the feature space. We evaluate FedDig on EMNIST, CIFAR-10, and CIFAR-100; it consistently outperforms three baseline algorithms (FedAvg, FedProx, and FedNova) by large margins in various client-absence scenarios.
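
A rough picture of the digest idea is sketched below: raw samples are encoded, and the encoded features of small groups are mixed (here simply averaged) together with soft labels, so the moderator keeps something useful for training without any raw example being directly recoverable. The encoder, group size, and averaging scheme are assumptions for illustration; this is not the FedDig algorithm itself.

```python
# Conceptual sketch of mixing encoded features into a "digest" (not FedDig itself).
import torch
import torch.nn as nn

def make_digest(encoder: nn.Module, x: torch.Tensor, y: torch.Tensor,
                num_classes: int, group_size: int = 4):
    """x: [N, ...] raw samples, y: [N] integer labels.
    Returns mixed feature vectors and averaged one-hot label targets."""
    with torch.no_grad():
        feats = encoder(x)                      # [N, D] encoded features
    onehot = torch.nn.functional.one_hot(y, num_classes).float()
    digests, targets = [], []
    for start in range(0, feats.size(0) - group_size + 1, group_size):
        sl = slice(start, start + group_size)
        digests.append(feats[sl].mean(dim=0))   # mix features within the group
        targets.append(onehot[sl].mean(dim=0))  # matching soft label
    return torch.stack(digests), torch.stack(targets)

if __name__ == "__main__":
    enc = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 32))
    x, y = torch.randn(16, 1, 28, 28), torch.randint(0, 10, (16,))
    d, t = make_digest(enc, x, y, num_classes=10)
    print(d.shape, t.shape)  # e.g. [4, 32] digests and [4, 10] soft labels
```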

NAS-based Recursive Stage Partial Network (RSPNet) for Light-Weight Semantic Segmentation

Oct 03, 2022
Yi-Chun Wang, Jun-Wei Hsieh, Ming-Ching Chang

Current NAS-based semantic segmentation methods focus on accuracy improvements rather than light-weight design. In this paper, we propose a two-stage framework to design our NAS-based RSPNet model for light-weight semantic segmentation. The first architecture search determines the inner cell structure, and the second considers exponentially growing paths to finalize the outer structure of the network. It has been shown in the literature that fusing high- and low-resolution feature maps produces stronger representations. To find the expected macro structure without manual design, we adopt a new path-attention mechanism to efficiently search for suitable paths that fuse useful information for better segmentation. Our search for repeatable micro-structures from cells leads to a superior network architecture for semantic segmentation. In addition, we propose an RSP (Recursive Stage Partial) architecture to search for a light-weight design for NAS-based semantic segmentation. The proposed architecture is efficient, simple, and effective: both the macro- and micro-structure searches can be completed in five days of computation on two V100 GPUs. The resulting light-weight NAS architecture, with only 1/4 the parameter size of SoTA architectures, achieves SoTA performance on semantic segmentation on the Cityscapes dataset without using any backbone.
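
To make the path-attention idea concrete, the toy sketch below fuses several candidate paths (feature maps at different resolutions) with learned softmax weights, so that training can favour the most useful paths. It only illustrates the weighting mechanism; the actual two-stage architecture search, cell structure, and RSP design are not shown, and all layer sizes are assumptions.

```python
# Toy path-attention fusion: learned softmax weights over candidate paths.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PathAttentionFusion(nn.Module):
    def __init__(self, num_paths: int):
        super().__init__()
        # One learnable logit per candidate path.
        self.logits = nn.Parameter(torch.zeros(num_paths))

    def forward(self, paths):
        """paths: list of [B, C, H_i, W_i] tensors with equal channel count."""
        h, w = paths[0].shape[-2:]
        weights = torch.softmax(self.logits, dim=0)
        fused = 0.0
        for w_i, p in zip(weights, paths):
            p = F.interpolate(p, size=(h, w), mode="bilinear", align_corners=False)
            fused = fused + w_i * p
        return fused

if __name__ == "__main__":
    fuse = PathAttentionFusion(num_paths=3)
    feats = [torch.randn(1, 32, s, s) for s in (64, 32, 16)]
    print(fuse(feats).shape)  # torch.Size([1, 32, 64, 64])
```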

Class-Specific Channel Attention for Few-Shot Learning

Sep 03, 2022
Ying-Yu Chen, Jun-Wei Hsieh, Ming-Ching Chang

Few-Shot Learning (FSL) has attracted growing attention in computer vision due to its capability to train models without the need for excessive data. FSL is challenging because the training and testing categories (the base vs. novel sets) can differ widely. Conventional transfer-based solutions, which aim to transfer knowledge learned from large labeled training sets to target testing sets, are limited because the critical adverse impact of the shift in task distribution is not adequately addressed. In this paper, we extend transfer-based methods by incorporating metric learning and channel attention. To better exploit the feature representations extracted by the feature backbone, we propose the Class-Specific Channel Attention (CSCA) module, which learns to highlight the discriminative channels of each class by assigning each class its own CSCA weight vector. Unlike general attention modules designed to learn global class features, the CSCA module learns local, class-specific features with very efficient computation. We evaluate the CSCA module on standard benchmarks including miniImageNet, Tiered-ImageNet, CIFAR-FS, and CUB-200-2011. Experiments are performed in inductive and in/cross-domain settings. We achieve new state-of-the-art results.
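
The per-class weight vector can be illustrated with a minimal attention head: one learnable channel-gate vector per class re-weights the pooled backbone features before scoring against a class prototype. The prototype-based scoring and sigmoid gating below are assumptions for the example, not the paper's exact head.

```python
# Minimal class-specific channel attention head (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSCAHead(nn.Module):
    def __init__(self, num_classes: int, channels: int):
        super().__init__()
        self.attn = nn.Parameter(torch.zeros(num_classes, channels))      # per-class gates
        self.prototypes = nn.Parameter(torch.randn(num_classes, channels))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: [B, C] pooled backbone features -> [B, num_classes] scores."""
        gate = torch.sigmoid(self.attn)                     # [K, C] channel gates
        # Apply class k's gate to every sample, then compare with prototype k.
        weighted = feats.unsqueeze(1) * gate.unsqueeze(0)   # [B, K, C]
        return F.cosine_similarity(weighted, self.prototypes.unsqueeze(0), dim=-1)

if __name__ == "__main__":
    head = CSCAHead(num_classes=5, channels=64)
    x = torch.randn(8, 64)
    print(head(x).shape)  # torch.Size([8, 5])
```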

Open-Eye: An Open Platform to Study Human Performance on Identifying AI-Synthesized Faces

May 13, 2022
Hui Guo, Shu Hu, Xin Wang, Ming-Ching Chang, Siwei Lyu

AI-synthesized faces are visually challenging to discern from real ones. They have been used as profile images for fake social media accounts, which leads to serious negative social impact. Although progress has been made in developing automatic methods to detect AI-synthesized faces, there is no open platform to study human performance in detecting them. In this work, we develop an online platform called Open-eye to study human performance in AI-synthesized face detection. We describe the design and workflow of Open-eye in this paper.

* Accepted by IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR), 2022 