Hao Feng

UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

Sep 02, 2023
Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, Can Huang

In the era of Large Language Models (LLMs), tremendous strides have been made in multimodal understanding. However, existing advanced algorithms fall short of fully exploiting the immense representation capabilities and rich world knowledge inherent in these large pre-trained models, and the beneficial connections among tasks in text-rich scenarios remain underexplored. In this work, we introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities that existing approaches lack. Moreover, UniDoc capitalizes on the beneficial interactions among tasks to enhance the performance of each individual task. To implement UniDoc, we perform unified multimodal instruction tuning on our contributed large-scale instruction-following datasets. Quantitative and qualitative experimental results show that UniDoc achieves state-of-the-art scores across multiple challenging benchmarks. To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.
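To make the unified instruction-tuning setup concrete, here is a minimal sketch of how the four text-centric tasks might be cast into a single prompt/response schema; all field names and prompt wordings are illustrative assumptions, not taken from the paper or its dataset.

```python
# Hypothetical unified instruction schema (illustrative only).
def make_sample(image_path, task, text=None, boxes=None):
    """Build one instruction-tuning record for a given task."""
    prompts = {
        "detection":     "Locate all text regions in the image.",
        "recognition":   "Read the text shown in the image.",
        "spotting":      "Detect and read all text in the image.",
        "understanding": "Answer the question about the document image.",
    }
    responses = {
        "detection":     str(boxes),  # e.g. [[x1, y1, x2, y2], ...]
        "recognition":   text,
        "spotting":      str(list(zip(boxes or [], (text or "").split()))),
        "understanding": text,
    }
    return {"image": image_path, "instruction": prompts[task],
            "response": responses[task]}

# One record per task; a single model tuned on the mixture lets the
# tasks reinforce one another.
print(make_sample("doc.png", "detection", boxes=[[10, 20, 120, 48]]))
```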


Sign Language Translation with Iterative Prototype

Aug 23, 2023
Huijie Yao, Wengang Zhou, Hao Feng, Hezhen Hu, Hao Zhou, Houqiang Li

This paper presents IP-SLT, a simple yet effective framework for sign language translation (SLT). IP-SLT adopts a recurrent structure and enhances the semantic representation (prototype) of the input sign language video through iterative refinement. The idea mimics human reading, where a sentence can be digested repeatedly until it is accurately understood. Technically, IP-SLT consists of feature extraction, prototype initialization, and iterative prototype refinement. The initialization module generates the initial prototype from the visual features produced by the feature extraction module. The iterative refinement module then leverages a cross-attention mechanism to polish the previous prototype by aggregating it with the original video features. Through repeated refinement, the prototype converges to a stable and accurate state, leading to fluent and appropriate translation. In addition, to exploit the sequential dependence among prototypes, we propose an iterative distillation loss that compresses the knowledge of the final iteration into earlier ones. Since the autoregressive decoding process is executed only once at inference, IP-SLT can improve various SLT systems with acceptable overhead. Extensive experiments on public benchmarks demonstrate the effectiveness of IP-SLT.

* Accepted by ICCV 2023 
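As a concrete illustration of the refinement loop, the following is a minimal PyTorch sketch of iterative prototype refinement via cross-attention, with a simple MSE stand-in for the iterative distillation loss; the dimensions, iteration count, and exact loss form are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IterativeRefiner(nn.Module):
    """Polish the prototype by cross-attending to the original video features."""
    def __init__(self, dim=512, heads=8, iters=3):
        super().__init__()
        self.iters = iters
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_feats, proto):
        # video_feats: (B, T, D) frame features; proto: (B, L, D) prototype
        protos = []
        for _ in range(self.iters):
            upd, _ = self.attn(query=proto, key=video_feats, value=video_feats)
            proto = self.norm(proto + upd)   # refined prototype for this iteration
            protos.append(proto)
        return protos

refiner = IterativeRefiner()
feats = torch.randn(2, 64, 512)   # extracted video features
proto = torch.randn(2, 16, 512)   # prototype from the initialization module
protos = refiner(feats, proto)
# Iterative distillation (MSE here as a placeholder): pull earlier
# prototypes toward the final, most-refined one.
distill_loss = sum(F.mse_loss(p, protos[-1].detach()) for p in protos[:-1])
print(protos[-1].shape, float(distill_loss))
```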

SimFIR: A Simple Framework for Fisheye Image Rectification with Self-supervised Representation Learning

Aug 17, 2023
Hao Feng, Wendi Wang, Jiajun Deng, Wengang Zhou, Li Li, Houqiang Li

In fisheye images, distinct distortion patterns are regularly distributed across the image plane. These patterns are independent of the visual content and provide informative cues for rectification. To make the most of such cues, we introduce SimFIR, a simple framework for fisheye image rectification based on self-supervised representation learning. Technically, we first split a fisheye image into multiple patches and extract their representations with a Vision Transformer (ViT). To learn fine-grained distortion representations, we then associate different image patches with their specific distortion patterns based on the fisheye model, and design a unified distortion-aware pretext task for learning them. The transfer performance on the downstream rectification task is remarkably boosted, verifying the effectiveness of the learned representations. Extensive experiments show, both quantitatively and qualitatively, that our method outperforms state-of-the-art algorithms and generalizes strongly to real-world fisheye images.

* Accepted to ICCV 2023 
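One way to picture the pretext labeling: under a radially symmetric fisheye model, a patch's distortion pattern depends chiefly on its distance from the image center, so each ViT patch can be assigned a "distortion ring" class for self-supervised training. The sketch below encodes this assumption; the actual task design in the paper may differ.

```python
import numpy as np

def ring_labels(img_size=224, patch=16, n_rings=8):
    """Assign each ViT patch a class id by its radial distance from center."""
    n = img_size // patch
    ys, xs = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    # patch-center coordinates, normalized to [-1, 1]
    cy = (ys + 0.5) / n * 2 - 1
    cx = (xs + 0.5) / n * 2 - 1
    r = np.sqrt(cx ** 2 + cy ** 2) / np.sqrt(2)   # radius scaled to [0, 1)
    return np.minimum((r * n_rings).astype(int), n_rings - 1).reshape(-1)

labels = ring_labels()
print(labels.shape)  # (196,): one distortion class per 16x16 patch
# A ViT would then be pre-trained to predict these labels from patch
# representations, yielding distortion-aware features for rectification.
```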

Mobile Supply: The Last Piece of Jigsaw of Recommender System

Aug 09, 2023
Zhenhao Jiang, Biao Zeng, Hao Feng, Jin Liu, Jie Zhang, Jia Jia, Ning Hu

Recommender systems are a fundamental component of online platforms. With the growing computing power of mobile phones, some researchers have deployed recommendation algorithms on users' devices to address the problems of data transmission delay and the pagination trigger mechanism. However, existing edge-side mobile ranking methods cannot completely solve the pagination problem: they can only reorder the items on the current page, and this fixed candidate set limits their performance. Moreover, after viewing the items of interest on the current page, a user must refresh to obtain a new page, which harms the immersive experience when the user is unsatisfied with the remaining items. To address the pagination trigger problem, we propose a completely new module in the recommender pipeline, named Mobile Supply, extending the pipeline to "retrieval->pre-ranking->ranking->re-ranking->Mobile Supply->mobile ranking". Specifically, we introduce the concept of list value and use a point-wise paradigm to approximate list-wise estimation, computing the maximum revenue that mobile ranking can achieve on the current page. We also design a new mobile ranking approach, named device-aware mobile ranking, that accounts for differences among mobile devices and is tailored to the new pipeline. Extensive offline and online experiments show the superiority of our method and prove that Mobile Supply can further improve the performance of edge-side recommender systems and the user experience. Mobile Supply has been deployed on the homepage of a large-scale online food platform and has yielded considerable profit.
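A hedged sketch of the list-value idea follows: the list-wise value of a page is approximated point-wise by what an optimal mobile re-rank could at most recover, and a supply of new candidates is triggered when that value falls below a threshold. The page size, scoring, and threshold are illustrative assumptions.

```python
def list_value(scores, page_size=10):
    """Point-wise approximation of the list-wise page value: the maximum
    revenue mobile ranking could achieve is bounded by the sum of the
    top page_size item scores."""
    return sum(sorted(scores, reverse=True)[:page_size])

def needs_supply(remaining_scores, threshold=3.0):
    # If even an optimal mobile re-rank of the remaining items cannot
    # reach the threshold, ask Mobile Supply for fresh candidates.
    return list_value(remaining_scores) < threshold

print(needs_supply([0.9, 0.2, 0.4, 0.1]))  # True -> fetch new items
```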


ESMC: Entire Space Multi-Task Model for Post-Click Conversion Rate via Parameter Constraint

Jul 29, 2023
Zhenhao Jiang, Biao Zeng, Hao Feng, Jin Liu, Jicong Fan, Jie Zhang, Jia Jia, Ning Hu, Xingyu Chen, Xuguang Lan

Large-scale online recommender systems, spread all over the Internet, are in charge of two basic tasks: Click-Through Rate (CTR) and Post-Click Conversion Rate (CVR) estimation. However, traditional CVR estimators suffer from the well-known Sample Selection Bias and Data Sparsity issues. Entire-space models were proposed to address these two issues by tracing the decision-making path "exposure -> click -> purchase". Further, some researchers observed that there are purchase-related behaviors between click and purchase that better capture the user's decision-making intention and improve recommendation performance. Thus, the decision-making path has been extended to "exposure -> click -> in-shop action -> purchase" and can be modeled with a conditional probability approach. Nevertheless, we observe that the chain rule of conditional probability does not always hold. We report the Probability Space Confusion (PSC) issue and give a mathematical derivation of the difference between the ground truth and the estimate. To address the PSC issue, we propose a novel Entire Space Multi-Task Model for Post-Click Conversion Rate via Parameter Constraint (ESMC) and two alternatives: Entire Space Multi-Task Model with Siamese Network (ESMS) and Entire Space Multi-Task Model in Global Domain (ESMG). Specifically, we handle "exposure -> click -> in-shop action" and "in-shop action -> purchase" separately, in light of the characteristics of in-shop actions. The first path is still treated with conditional probability, while the second is treated with a parameter constraint strategy. Experiments in both offline and online environments on a large-scale recommendation system illustrate the superiority of our proposed methods over state-of-the-art models. The real-world datasets will be released.
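To show the shape of the idea (not the released model), here is a minimal PyTorch sketch: the "exposure -> click -> in-shop action" path is chained with conditional probabilities over the entire space, while "in-shop action -> purchase" is estimated directly, with a parameter constraint tying two purchase towers together instead of a further chain-rule product. Tower sizes and the squared-difference constraint are assumptions.

```python
import torch
import torch.nn as nn

def tower(dim):
    return nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                         nn.Linear(64, 1), nn.Sigmoid())

class ESMCSketch(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.ctr = tower(dim)      # P(click | exposure)
        self.ctavr = tower(dim)    # P(in-shop action | click)
        self.cvr = tower(dim)      # P(purchase | in-shop action), direct
        self.cvr_ref = tower(dim)  # reference tower for the constraint

    def forward(self, x):
        p_click = self.ctr(x)
        p_action = p_click * self.ctavr(x)   # entire-space chain for path 1
        p_buy = self.cvr(x)                  # path 2: no chain-rule product
        # Parameter constraint: keep the two purchase towers close,
        # avoiding the PSC issue a naive probability product would cause.
        reg = sum((a - b).pow(2).sum() for a, b in
                  zip(self.cvr.parameters(), self.cvr_ref.parameters()))
        return p_click, p_action, p_buy, reg

p_click, p_action, p_buy, reg = ESMCSketch()(torch.randn(4, 32))
print(p_click.shape, p_action.shape, p_buy.shape, float(reg))
```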


Multi-Task Cross-Modality Attention-Fusion for 2D Object Detection

Jul 17, 2023
Huawei Sun, Hao Feng, Georg Stettinger, Lorenzo Servadei, Robert Wille

Accurate and robust object detection is critical for autonomous driving. Image-based detectors struggle with the low visibility caused by adverse weather, so radar-camera fusion is of particular interest, but it poses the challenge of optimally fusing heterogeneous data sources. To address this issue, we propose two new radar preprocessing techniques that better align radar and camera data. In addition, we introduce a Multi-Task Cross-Modality Attention-Fusion Network (MCAF-Net) for object detection, which includes two new fusion blocks that exploit information from the feature maps more comprehensively. The proposed algorithm jointly detects objects and segments free space, which guides the model to focus on the more relevant part of the scene, namely the occupied space. Our approach outperforms current state-of-the-art radar-camera fusion-based object detectors on the nuScenes dataset and achieves more robust results in adverse weather conditions and nighttime scenarios.

* Accepted by ITSC 2023 
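For intuition, here is a hedged PyTorch sketch of a cross-modality attention-fusion block in the spirit described above: camera features are gated by attention computed from the concatenated radar and camera maps. The paper's two fusion blocks are not reproduced here; channel counts and the gating design are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalityFusion(nn.Module):
    def __init__(self, cam_ch=64, rad_ch=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(cam_ch + rad_ch, cam_ch, kernel_size=1),
            nn.Sigmoid(),  # per-pixel, per-channel attention weights
        )
        self.proj = nn.Conv2d(rad_ch, cam_ch, kernel_size=1)

    def forward(self, cam, rad):
        # cam: (B, cam_ch, H, W) image features; rad: (B, rad_ch, H, W)
        w = self.gate(torch.cat([cam, rad], dim=1))
        return w * cam + self.proj(rad)   # fused feature map

fuse = CrossModalityFusion()
out = fuse(torch.randn(1, 64, 80, 80), torch.randn(1, 16, 80, 80))
print(out.shape)  # torch.Size([1, 64, 80, 80])
```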

Parameter-efficient is not sufficient: Exploring Parameter, Memory, and Time Efficient Adapter Tuning for Dense Predictions

Jun 16, 2023
Dongshuo Yin, Xueting Han, Bin Li, Hao Feng, Jing Bai

Pre-training & fine-tuning is a prevalent paradigm in computer vision (CV). Recently, parameter-efficient transfer learning (PETL) methods have shown promising performance in transferring knowledge from pre-trained models with only a few trainable parameters. Despite their success, existing PETL methods in CV can still be computationally expensive, incurring large memory and time costs during training, which prevents low-resource users from conducting research and applications on large models. In this work, we propose Parameter, Memory, and Time Efficient Visual Adapter ($\mathrm{E^3VA}$) tuning to address this issue. We provide a gradient backpropagation highway for low-rank adapters that removes the large gradient computations for the frozen pre-trained parameters, yielding substantial savings in training memory and time. Furthermore, we optimise the $\mathrm{E^3VA}$ structure for dense prediction tasks to promote model performance. Extensive experiments on the COCO, ADE20K, and Pascal VOC benchmarks show that $\mathrm{E^3VA}$ saves up to 62.2% training memory and 26.2% training time on average, while achieving performance comparable to full fine-tuning and better than most PETL methods. Notably, we can even train a Swin-Large-based Cascade Mask R-CNN on GTX 1080Ti GPUs with less than 1.5% trainable parameters.

* 14 pages, 4 figures, 5 tables, Submitted to NeurIPS 2023 
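The memory saving comes from never materializing gradients for the frozen weights. The sketch below shows only the basic frozen-backbone/low-rank-adapter mechanism this relies on; the paper's gradient "highway" additionally restructures where adapters attach, which is not reproduced here.

```python
import torch
import torch.nn as nn

class LowRankAdapterLinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank bypass."""
    def __init__(self, linear: nn.Linear, rank=4):
        super().__init__()
        self.base = linear
        self.base.weight.requires_grad_(False)   # no grad buffers allocated
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.zeros(rank, linear.in_features))
        self.B = nn.Parameter(torch.randn(linear.out_features, rank) * 0.01)

    def forward(self, x):
        # frozen path + trainable low-rank bypass
        return self.base(x) + x @ self.A.t() @ self.B.t()

layer = LowRankAdapterLinear(nn.Linear(128, 128))
layer(torch.randn(2, 128)).sum().backward()
# Only the small adapter matrices received gradients:
print(layer.base.weight.grad is None, layer.A.grad is not None)  # True True
```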

Active RIS-Assisted mmWave Indoor Signal Enhancement Based on Transparent RIS

May 16, 2023
Hao Feng, Yuping Zhao

Due to the severe path loss of millimeter-wave (mmWave) signals, the signal sent by the base station is heavily attenuated by the time it reaches indoor areas. Recent studies have proposed a glass-based metasurface that can enhance indoor mmWave signals: a transparent reconfigurable intelligent surface (RIS) that focuses the mmWave signal on a specific indoor location. In this paper, a novel RIS-assisted mmWave indoor-enhancement scheme is proposed, in which a transparent RIS is deployed on the glass, and three assisted transmission scenarios are considered: passive RIS (PRIS), active RIS (ARIS), and a novel hybrid RIS (HRIS). We aim to maximize the signal-to-noise ratio (SNR) of the received signal in each of the three scenarios. Closed-form solutions for the maximum SNR are presented for the PRIS- and ARIS-assisted scenarios, and for the HRIS-assisted scenario given the set of active unit cells. In addition, the performance of the proposed scheme is analyzed under the three scenarios. The results indicate that under a given RIS power budget, the ARIS-assisted scenario achieves the highest data rate and energy efficiency while requiring very few unit cells, thus dramatically reducing the size of the metasurface.
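To illustrate the PRIS-versus-ARIS trade-off numerically, here is a hedged sketch under a drastically simplified model (coherent combining, flat channels, fixed per-cell amplification): the active RIS amplifies the signal but also forwards its own noise. All channel gains and noise powers are made-up values, and the paper's closed-form solutions are not reproduced here.

```python
import numpy as np

N = 64                  # number of unit cells
g, h = 1e-3, 1e-3       # BS->RIS and RIS->user amplitude gains (assumed)
P_tx = 1.0              # transmit power (W)
sigma2 = 1e-10          # receiver noise power (W)
sigma_v2 = 1e-11        # noise injected per active cell (W)
amp = 10.0              # per-cell amplitude gain of the active RIS

# Passive RIS: coherent combining over N co-phased cells, no added noise.
snr_pris = P_tx * (N * g * h) ** 2 / sigma2

# Active RIS: amplified combining, plus amplified RIS noise forwarded
# through the RIS->user channel.
snr_aris = P_tx * (N * g * amp * h) ** 2 / (sigma2 + (amp * h) ** 2 * N * sigma_v2)

print(f"PRIS SNR: {10 * np.log10(snr_pris):.1f} dB")
print(f"ARIS SNR: {10 * np.log10(snr_aris):.1f} dB")  # higher, with fewer cells needed
```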
