Rui Su

FengWu: Pushing the Skillful Global Medium-range Weather Forecast beyond 10 Days Lead

Apr 06, 2023
Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tianning Zhang, Rui Su, Yuanzheng Ci, Bin Li, Xiaokang Yang, Wanli Ouyang

We present FengWu, an advanced data-driven global medium-range weather forecast system based on Artificial Intelligence (AI). Unlike existing data-driven weather forecast methods, FengWu approaches medium-range forecasting from a multi-modal and multi-task perspective. Specifically, a deep learning architecture equipped with model-specific encoder-decoders and a cross-modal fusion Transformer is carefully designed and trained under the supervision of an uncertainty loss that balances the optimization of different predictors in a region-adaptive manner. In addition, a replay buffer mechanism is introduced to improve medium-range forecast performance. Trained on 39 years of ERA5 reanalysis data, FengWu accurately reproduces atmospheric dynamics and predicts future land and atmosphere states at 37 vertical levels on a 0.25° latitude-longitude grid. Hindcasts of 6-hourly weather in 2018 based on ERA5 show that FengWu outperforms GraphCast on 80% of the 880 reported predictands, e.g., reducing the root mean square error (RMSE) of the 10-day lead global z500 prediction from 733 to 651 m²/s². In addition, the inference cost of each iteration is merely 600 ms on NVIDIA Tesla A100 hardware. The results suggest that FengWu significantly improves forecast skill and, for the first time, extends the skillful global medium-range weather forecast lead time to 10.75 days (with ACC of z500 > 0.6).
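
The abstract does not spell out the region-adaptive uncertainty loss; as a rough illustration of the general idea, the sketch below uses a standard learned-log-variance (homoscedastic uncertainty) weighting over per-variable errors. The names, shapes, and the per-variable rather than per-region weighting are assumptions, not FengWu's exact formulation.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Balance per-variable regression losses with learned log-variances.
    A generic sketch of uncertainty weighting, not FengWu's actual loss."""

    def __init__(self, num_variables: int):
        super().__init__()
        # One learnable log-variance per predicted variable (e.g. z500, t850, ...).
        self.log_var = nn.Parameter(torch.zeros(num_variables))

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # pred, target: (batch, num_variables, lat, lon)
        per_var_mse = ((pred - target) ** 2).mean(dim=(0, 2, 3))   # (num_variables,)
        precision = torch.exp(-self.log_var)
        # Precision-weighted error plus a regularizer that keeps variances finite.
        return (precision * per_var_mse + self.log_var).sum()

# Hypothetical usage: the log-variances are optimized jointly with the model.
criterion = UncertaintyWeightedLoss(num_variables=5)
loss = criterion(torch.randn(2, 5, 32, 64), torch.randn(2, 5, 32, 64))
loss.backward()
```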

* 12 pages 

Slow Motion Matters: A Slow Motion Enhanced Network for Weakly Supervised Temporal Action Localization

Nov 21, 2022
Weiqi Sun, Rui Su, Qian Yu, Dong Xu

Weakly supervised temporal action localization (WTAL) aims to localize actions in untrimmed videos using only weak supervision (e.g., video-level labels). Most existing models handle all input videos at a fixed temporal scale. Such models, however, are not sensitive to actions whose movements unfold at a pace different from the "normal" speed, especially slow-motion instances, which complete their movements much more slowly than their normal-speed counterparts. This gives rise to the slow-motion blurring issue: salient slow-motion information is hard to mine from videos sampled at "normal" speed. In this paper, we propose a novel framework, the Slow Motion Enhanced Network (SMEN), that improves a WTAL network by compensating for its limited sensitivity to slow-motion action segments. SMEN comprises a Mining module and a Localization module: the Mining module generates masks to mine slow-motion-related features by exploiting the relationship between normal motion and slow motion, while the Localization module leverages the mined slow-motion features as complementary information to improve temporal action localization. Our framework can easily be combined with existing WTAL networks, making them more sensitive to slow-motion actions. Extensive experiments on three benchmarks demonstrate the strong performance of the proposed framework.
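
As a hedged illustration of the mining idea, the sketch below compares snippet features extracted at normal speed with features from a slowed-down copy of the same video and masks the snippets where the two disagree most. The cosine-similarity criterion, the threshold, and the tensor shapes are assumptions for illustration, not SMEN's actual Mining module.

```python
import torch
import torch.nn.functional as F

def slow_motion_mask(normal_feat: torch.Tensor, slow_feat: torch.Tensor,
                     threshold: float = 0.5) -> torch.Tensor:
    """Compare per-snippet features from the normal-speed video with features
    from a slowed-down (temporally interpolated) copy; keep snippets where the
    two representations differ, as candidate slow-motion segments.
    Shapes: (T, C) snippet features."""
    sim = F.cosine_similarity(normal_feat, slow_feat, dim=-1)   # (T,)
    # Low similarity -> the motion looks different when slowed down.
    return (sim < threshold).float()

# Hypothetical usage with snippet-level features from a WTAL backbone.
T, C = 100, 2048
normal_feat, slow_feat = torch.randn(T, C), torch.randn(T, C)
mask = slow_motion_mask(normal_feat, slow_feat)
mined_feat = normal_feat * mask.unsqueeze(-1)   # complementary slow-motion features
```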

* IEEE Transactions on Circuits and Systems for Video Technology, 2022  

3D-QueryIS: A Query-based Framework for 3D Instance Segmentation

Nov 17, 2022
Jiaheng Liu, Tong He, Honghui Yang, Rui Su, Jiayi Tian, Junran Wu, Hongcheng Guo, Ke Xu, Wanli Ouyang

Previous top-performing methods for 3D instance segmentation often rely on inter-task dependencies and tend to lack robustness. Moreover, the inevitable variations across datasets make these methods particularly sensitive to hyper-parameter values and limit their generalization capability. In this paper, we address these challenges with a novel query-based method, termed 3D-QueryIS, which is detector-free, semantic-segmentation-free, and cluster-free. Specifically, we generate representative points in an implicit manner and combine them with the initial queries to produce informative instance queries. The class and binary instance mask predictions are then obtained by simply applying MLP layers on top of the instance queries and the extracted point cloud embeddings. As a result, 3D-QueryIS is free from the accumulated errors caused by inter-task dependencies. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness and efficiency of the proposed 3D-QueryIS.
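
The prediction heads described above can be sketched roughly as follows: class logits come from a linear layer on the instance queries, and binary masks from a dot product between embedded queries and per-point embeddings. The dimensions and exact head design are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class QueryHeads(nn.Module):
    """Sketch of query-based heads: class logits and binary instance masks
    from instance queries and per-point embeddings (illustrative sizes)."""

    def __init__(self, d_model: int = 256, num_classes: int = 18):
        super().__init__()
        self.cls_head = nn.Linear(d_model, num_classes + 1)      # +1 for "no object"
        self.mask_embed = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, queries: torch.Tensor, point_embed: torch.Tensor):
        # queries: (num_queries, d_model), point_embed: (num_points, d_model)
        class_logits = self.cls_head(queries)                     # (num_queries, num_classes + 1)
        mask_logits = self.mask_embed(queries) @ point_embed.t()  # (num_queries, num_points)
        return class_logits, mask_logits.sigmoid()

heads = QueryHeads()
class_logits, masks = heads(torch.randn(100, 256), torch.randn(50000, 256))
```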

NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition

Jul 21, 2022
Boyang Xia, Wenhao Wu, Haoran Wang, Rui Su, Dongliang He, Haosen Yang, Xiaoran Fan, Wanli Ouyang

Achieving accurate video recognition at low computation cost is challenging for artificial intelligence systems. Adaptive-inference methods for efficient video recognition typically preview a video and focus on its salient parts to reduce computation. Most existing works concentrate on complex network designs trained with video-classification objectives; by treating all frames as positive samples, few of them explicitly supervise the discrimination between positive samples (salient frames) and negative samples (non-salient frames). To fill this gap, we propose the Non-saliency Suppression Network (NSNet), which effectively suppresses the responses of non-salient frames. Specifically, at the frame level, effective pseudo labels that distinguish salient from non-salient frames are generated to guide frame-saliency learning. At the video level, a temporal attention module is learned under dual video-level supervision on both the salient and the non-salient representations. Saliency measurements from both levels are combined to exploit multi-granularity complementary information. Extensive experiments on four well-known benchmarks verify that NSNet not only achieves a state-of-the-art accuracy-efficiency trade-off but also delivers significantly faster (2.4x-4.3x) practical inference than state-of-the-art methods. Our project page is at https://lawrencexia2008.github.io/projects/nsnet .
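
A minimal sketch of the video-level dual-supervision idea, assuming a simple attention module and illustrative losses (cross-entropy on the salient representation, a uniform "background" target for the non-salient one); NSNet's actual losses and pseudo-label generation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualSupervisionHead(nn.Module):
    """Temporal attention yields per-frame saliency; the salient aggregate is
    trained on the true label while the non-salient aggregate is pushed toward
    an uninformative prediction. Illustrative sketch only."""

    def __init__(self, feat_dim: int = 2048, num_classes: int = 200):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D), labels: (B,)
        saliency = torch.softmax(self.attn(frame_feats).squeeze(-1), dim=1)      # (B, T)
        salient_rep = (saliency.unsqueeze(-1) * frame_feats).sum(dim=1)
        nonsalient_rep = ((1.0 - saliency).unsqueeze(-1) * frame_feats).mean(dim=1)

        loss_salient = F.cross_entropy(self.classifier(salient_rep), labels)
        nonsalient_logits = self.classifier(nonsalient_rep)
        uniform = torch.full_like(nonsalient_logits, 1.0 / nonsalient_logits.size(1))
        loss_nonsalient = F.kl_div(
            F.log_softmax(nonsalient_logits, dim=1), uniform, reduction="batchmean")
        return loss_salient + loss_nonsalient

head = DualSupervisionHead()
loss = head(torch.randn(4, 16, 2048), torch.randint(0, 200, (4,)))
```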

* Accepted by ECCV 2022 

SGE net: Video object detection with squeezed GRU and information entropy map

Jun 14, 2021
Rui Su, Wenjing Huang, Haoyu Ma, Xiaowei Song, Jinglu Hu

Recently, deep learning based video object detection has attracted increasing attention. Compared with object detection on static images, video object detection is more challenging due to object motion, but it also provides rich temporal information. RNN-based algorithms are an effective way to exploit this temporal information and enhance detection performance in videos. However, most studies in this area focus only on accuracy, ignoring computational cost and the number of parameters. In this paper, we propose an efficient method, SGE-Net, that combines a channel-reduced convolutional GRU (Squeezed GRU) with an information entropy map for video object detection. The experimental results validate the accuracy improvement and computational savings of the Squeezed GRU, as well as the benefit of the information-entropy attention mechanism for classification performance. The mAP increases by 3.7 over the baseline, and the number of parameters decreases from 6.33 million to 0.67 million compared with the standard GRU.
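
A minimal sketch of what a channel-reduced ("squeezed") convolutional GRU cell might look like: a 1x1 convolution shrinks the channel dimension before the recurrent gates, cutting parameters relative to a standard ConvGRU. The squeeze ratio, gate layout, and shapes are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SqueezedConvGRUCell(nn.Module):
    """Convolutional GRU cell operating on squeezed (channel-reduced) features."""

    def __init__(self, channels: int = 256, squeeze_ratio: int = 4):
        super().__init__()
        hidden = channels // squeeze_ratio
        self.squeeze = nn.Conv2d(channels, hidden, kernel_size=1)          # reduce channels
        self.gates = nn.Conv2d(2 * hidden, 2 * hidden, kernel_size=3, padding=1)
        self.candidate = nn.Conv2d(2 * hidden, hidden, kernel_size=3, padding=1)
        self.expand = nn.Conv2d(hidden, channels, kernel_size=1)           # restore channels

    def forward(self, x: torch.Tensor, h: torch.Tensor):
        # x: (B, C, H, W) current feature map; h: (B, C // ratio, H, W) hidden state
        xs = self.squeeze(x)
        z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([xs, h], dim=1))), 2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([xs, r * h], dim=1)))
        h_new = (1 - z) * h + z * h_tilde
        return self.expand(h_new), h_new   # features for the detector head, new hidden state

cell = SqueezedConvGRUCell(channels=256)
out, h = cell(torch.randn(1, 256, 38, 50), torch.zeros(1, 64, 38, 50))
```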

* ICIP 2021 

Deep Learning for Depression Recognition with Audiovisual Cues: A Review

May 27, 2021
Lang He, Mingyue Niu, Prayag Tiwari, Pekka Marttinen, Rui Su, Jiewei Jiang, Chenguang Guo, Hongyu Wang, Songtao Ding, Zhongmin Wang, Wei Dang, Xiaoying Pan

With the accelerating pace of work and life, people face increasing pressure, which raises the risk of depression. However, many patients fail to receive a timely diagnosis due to the severe imbalance in the doctor-patient ratio worldwide. Promisingly, physiological and psychological studies have identified differences in speech and facial expression between patients with depression and healthy individuals. Consequently, to improve current medical care, many researchers have used deep learning to extract representations of depression cues from audio and video for automatic depression detection. To organize and summarize these works, this review introduces the relevant databases and describes objective markers for automatic depression estimation (ADE). We then review deep learning methods for automatic depression detection that extract depression representations from audio and video. Finally, we discuss challenges and promising directions for the automatic diagnosis of depression with deep learning technologies.

Improving Action Localization by Progressive Cross-stream Cooperation

May 28, 2019
Rui Su, Wanli Ouyang, Luping Zhou, Dong Xu

Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification, and temporal segmentation. In this work, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that uses both region proposals and features from one stream (i.e., Flow/RGB) to help the other stream (i.e., RGB/Flow) iteratively improve action localization results and generate better bounding boxes. Specifically, we first build a larger set of region proposals by combining the latest proposals from both streams, from which we readily obtain a larger set of labelled training samples for learning better action detection models. Second, we propose a new message-passing approach that transfers information from one stream to the other in order to learn better representations, which also leads to better detection models. As a result, our iterative framework progressively improves action localization at the frame level. To improve localization at the video level, we additionally propose a strategy to train class-specific actionness detectors for better temporal segmentation, which can be readily learnt by focusing on "confusing" samples from the same action class. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of the proposed approaches for spatio-temporal action localization in realistic scenarios.
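
The iterative cooperation loop can be summarized roughly as below; the `propose`/`refine` detector interface is a hypothetical stand-in used only to show how proposals flow between the two streams, not the paper's actual API.

```python
import torch

def cross_stream_cooperation(rgb_detector, flow_detector, rgb_frames, flow_frames,
                             num_iterations: int = 3):
    """Sketch of progressive cross-stream cooperation: at each iteration the
    proposals from one stream are pooled with the other stream's proposals,
    and both detectors refine their results on the combined set."""
    rgb_props = rgb_detector.propose(rgb_frames)
    flow_props = flow_detector.propose(flow_frames)
    for _ in range(num_iterations):
        combined = torch.cat([rgb_props, flow_props], dim=0)   # union of boxes from both streams
        # Each stream refines its detections on the combined proposals; features
        # may also be exchanged between streams (message passing) at this step.
        rgb_props = rgb_detector.refine(rgb_frames, combined)
        flow_props = flow_detector.refine(flow_frames, combined)
    return rgb_props, flow_props
```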

* CVPR 2019 

Hybrid Actor-Critic Reinforcement Learning in Parameterized Action Space

Mar 04, 2019
Zhou Fan, Rui Su, Weinan Zhang, Yong Yu

In this paper we propose a hybrid actor-critic architecture for reinforcement learning in parameterized action spaces, which consists of multiple parallel sub-actor networks that decompose the structured action space into simpler action spaces, together with a critic network that guides the training of all sub-actors. While this paper focuses mainly on parameterized action spaces, the proposed architecture, which we call hybrid actor-critic, can be extended to more general action spaces with a hierarchical structure. We present an instance of the hybrid actor-critic architecture based on proximal policy optimization (PPO), which we refer to as hybrid proximal policy optimization (H-PPO). Our experiments evaluate H-PPO on a collection of tasks with parameterized action spaces, where it demonstrates superior performance over previous methods for parameterized-action reinforcement learning.
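
As a hedged sketch of the idea, a hybrid actor-critic for a parameterized action space can pair a discrete sub-actor (which action type to take) with a continuous sub-actor (that action's parameters) and a shared critic; the shared trunk and layer sizes below are assumptions, not the paper's exact networks.

```python
import torch
import torch.nn as nn

class HybridActorCritic(nn.Module):
    """Discrete sub-actor + continuous-parameter sub-actor + shared critic
    for a parameterized action space (illustrative PPO-style policy)."""

    def __init__(self, obs_dim: int, num_discrete: int, param_dim: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.discrete_actor = nn.Linear(hidden, num_discrete)   # logits over action types
        self.param_actor = nn.Linear(hidden, param_dim)         # means of continuous parameters
        self.log_std = nn.Parameter(torch.zeros(param_dim))
        self.critic = nn.Linear(hidden, 1)                      # state value used by PPO

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        dist_discrete = torch.distributions.Categorical(logits=self.discrete_actor(h))
        dist_params = torch.distributions.Normal(self.param_actor(h), self.log_std.exp())
        return dist_discrete, dist_params, self.critic(h)

# Hypothetical usage: sample a hybrid action (type index + parameter vector).
model = HybridActorCritic(obs_dim=10, num_discrete=3, param_dim=4)
dist_d, dist_p, value = model(torch.randn(1, 10))
action_type = dist_d.sample()       # which discrete action
action_params = dist_p.sample()     # its continuous parameters
```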
