Jingwen Wang


MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video

Dec 01, 2023
Hengyi Wang, Jingwen Wang, Lourdes Agapito

Neural rendering has demonstrated remarkable success in dynamic scene reconstruction. Thanks to the expressiveness of neural representations, prior works can accurately capture the motion and achieve high-fidelity reconstruction of the target object. Despite this, real-world video scenarios often feature large unobserved regions where neural representations struggle to achieve realistic completion. To tackle this challenge, we introduce MorpheuS, a framework for dynamic 360° surface reconstruction from a casually captured RGB-D video. Our approach models the target scene as a canonical field that encodes its geometry and appearance, in conjunction with a deformation field that warps points from the current frame to the canonical space. We leverage a view-dependent diffusion prior and distill knowledge from it to achieve realistic completion of unobserved regions. Experimental results on various real-world and synthetic datasets show that our method can achieve high-fidelity 360° surface reconstruction of a deformable object from a monocular RGB-D video.

* Project page: https://hengyiwang.github.io/projects/morpheus 
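
As a concrete illustration of the canonical-field/deformation-field split described in the abstract, the following PyTorch sketch shows how a deformation MLP can warp frame-time points into a shared canonical space, where a second MLP predicts signed distance and color. This is an assumption-laden sketch, not the MorpheuS implementation; all layer sizes and module names are illustrative.

```python
# Minimal sketch (not the authors' code) of the canonical/deformation split:
# a deformation MLP warps observed points into a canonical space, where a
# second MLP predicts SDF and color. All sizes and names are illustrative.
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    """Maps a point observed at time t to its location in canonical space."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),          # predicted offset
        )

    def forward(self, x, t):
        offset = self.mlp(torch.cat([x, t], dim=-1))
        return x + offset                  # warped (canonical) coordinates

class CanonicalField(nn.Module):
    """Predicts signed distance and RGB color for canonical-space points."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 3),      # SDF + RGB
        )

    def forward(self, x_canonical):
        out = self.mlp(x_canonical)
        return out[..., :1], torch.sigmoid(out[..., 1:])

# Query geometry/appearance of frame-t points through the two fields.
deform, canonical = DeformationField(), CanonicalField()
pts = torch.rand(1024, 3)                  # sampled points along camera rays
t = torch.full((1024, 1), 0.3)             # normalized frame time
sdf, rgb = canonical(deform(pts, t))
```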

SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation

Jun 28, 2023
Jingwen Wang, Juan Tarrio, Lourdes Agapito, Pablo F. Alcantarilla, Alexander Vakhitov

The availability of real-time semantics greatly improves the core geometric functionality of SLAM systems, enabling numerous robotic and AR/VR applications. We present a new methodology for real-time semantic mapping from RGB-D sequences that combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping. When segmenting a new frame, we perform latent feature re-projection from previous frames based on differentiable rendering. Fusing re-projected feature maps from previous frames with current-frame features greatly improves image segmentation quality, compared to a baseline that processes images independently. For 3D map processing, we propose a novel geometric quasi-planar over-segmentation method that groups 3D map elements likely to belong to the same semantic classes, relying on surface normals. We also describe a novel neural network design for lightweight semantic map post-processing. Our system achieves state-of-the-art semantic mapping quality among 2D-3D network-based systems and matches the performance of 3D convolutional networks on three real indoor datasets, while running in real time. Moreover, it shows better cross-sensor generalization abilities compared to 3D CNNs, enabling training and inference with different depth sensors. Code and data will be released on the project page: http://jingwenwang95.github.io/SeMLaPS

* 8 pages, 7 figures, submitted to RA-L. Project page: http://jingwenwang95.github.io/SeMLaPS 
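
The latent feature re-projection step can be pictured as follows: back-project current-frame pixels with their depth, transform them into a previous camera, and bilinearly sample that frame's latent feature map before fusing with the current features. The sketch below is a hedged approximation under assumed pinhole intrinsics and pose conventions, not the SeMLaPS code.

```python
# Hedged sketch of latent feature re-projection (not the SeMLaPS implementation):
# current-frame pixels are back-projected with their depth, transformed into the
# previous camera, and used to sample the previous frame's feature map.
import torch
import torch.nn.functional as F

def reproject_features(feat_prev, depth_cur, K, T_prev_cur):
    """feat_prev: (1,C,H,W), depth_cur: (1,1,H,W), K: (3,3), T_prev_cur: (4,4) cur->prev."""
    _, _, H, W = depth_cur.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], 0).float().reshape(3, -1)   # (3,HW)
    cam = torch.linalg.inv(K) @ pix * depth_cur.reshape(1, -1)                # back-project
    cam_h = torch.cat([cam, torch.ones(1, cam.shape[1])], 0)                  # homogeneous
    prev = (T_prev_cur @ cam_h)[:3]                                           # into prev camera
    proj = K @ prev
    uv = proj[:2] / proj[2:].clamp(min=1e-6)                                  # pixel coords
    grid = torch.stack([uv[0] / (W - 1) * 2 - 1, uv[1] / (H - 1) * 2 - 1], -1)
    return F.grid_sample(feat_prev, grid.reshape(1, H, W, 2), align_corners=True)

# Fuse warped previous features with current ones (simple concatenation here).
feat_prev = torch.rand(1, 64, 120, 160)
feat_cur = torch.rand(1, 64, 120, 160)
depth_cur = torch.rand(1, 1, 120, 160) + 0.5
K = torch.tensor([[100.0, 0, 80], [0, 100.0, 60], [0, 0, 1]])
fused = torch.cat([feat_cur, reproject_features(feat_prev, depth_cur, K, torch.eye(4))], dim=1)
```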

First Place Solution to the CVPR'2023 AQTC Challenge: A Function-Interaction Centric Approach with Spatiotemporal Visual-Language Alignment

Jun 23, 2023
Tom Tongjia Chen, Hongshan Yu, Zhengeng Yang, Ming Li, Zechuan Li, Jingwen Wang, Wei Miao, Wei Sun, Chen Chen

Affordance-Centric Question-driven Task Completion (AQTC) has been proposed to acquire knowledge from videos to furnish users with comprehensive and systematic instructions. However, existing methods have hitherto neglected the necessity of aligning spatiotemporal visual and linguistic signals, as well as the crucial interactional information between humans and objects. To tackle these limitations, we propose to combine large-scale pre-trained vision-language and video-language models, which provide stable and reliable multimodal features and facilitate effective spatiotemporal visual-textual alignment. Additionally, we propose a novel hand-object-interaction (HOI) aggregation module that captures human-object interaction information, further augmenting the model's ability to understand the presented scenario. Our method achieved first place in the CVPR'2023 AQTC Challenge, with a Recall@1 score of 78.7%. The code is available at https://github.com/tomchen-ctj/CVPR23-LOVEU-AQTC.

* Winner of CVPR2023 Long-form Video Understanding and Generation Challenge (Track 3) 
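
The hand-object-interaction (HOI) aggregation idea can be sketched as question-conditioned attention pooling over per-frame HOI features. The module below is purely illustrative; the paper's actual aggregation design may differ, and all dimensions are assumptions.

```python
# Illustrative sketch of an attention-style HOI aggregation module in the spirit
# described above; the architecture in the paper may differ.
import torch
import torch.nn as nn

class HOIAggregator(nn.Module):
    """Pools per-frame HOI features with attention conditioned on the question."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, question_emb, hoi_feats):
        # question_emb: (B,1,D) query; hoi_feats: (B,T,D) per-frame hand/object features.
        pooled, _ = self.attn(question_emb, hoi_feats, hoi_feats)
        return pooled.squeeze(1)           # (B,D) interaction-aware video summary

agg = HOIAggregator()
summary = agg(torch.rand(2, 1, 512), torch.rand(2, 16, 512))
```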

Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM

Apr 27, 2023
Hengyi Wang, Jingwen Wang, Lourdes Agapito

We present Co-SLAM, a neural RGB-D SLAM system based on a hybrid representation that performs robust camera tracking and high-fidelity surface reconstruction in real time. Co-SLAM represents the scene as a multi-resolution hash-grid to exploit its high convergence speed and ability to represent high-frequency local features. In addition, Co-SLAM incorporates one-blob encoding to encourage surface coherence and completion in unobserved areas. This joint parametric-coordinate encoding enables real-time and robust performance by bringing the best of both worlds: fast convergence and surface hole filling. Moreover, our ray sampling strategy allows Co-SLAM to perform global bundle adjustment over all keyframes instead of requiring keyframe selection to maintain a small number of active keyframes as competing neural SLAM approaches do. Experimental results show that Co-SLAM runs at 10-17Hz and achieves state-of-the-art scene reconstruction results and competitive tracking performance on various datasets and benchmarks (ScanNet, TUM, Replica, Synthetic RGBD). Project page: https://hengyiwang.github.io/projects/CoSLAM

* CVPR2023. First two authors contributed equally. Project page: https://hengyiwang.github.io/projects/CoSLAM 
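
A rough sketch of the joint parametric-coordinate encoding idea: a smooth one-blob coordinate encoding is concatenated with features trilinearly interpolated from a multi-resolution grid (a small dense grid stands in here for Co-SLAM's hash grid). Resolutions, feature sizes, and bin counts are illustrative assumptions, not the actual implementation.

```python
# Sketch under assumptions (not Co-SLAM's code): one-blob coordinate encoding
# concatenated with multi-resolution grid features; a dense grid stands in for
# the hash grid.
import torch
import torch.nn.functional as F

def one_blob(x, bins=16, sigma=0.05):
    """x in [0,1]^(N,3) -> (N, 3*bins) Gaussian activations over fixed bins per axis."""
    centers = torch.linspace(0.0, 1.0, bins)
    enc = torch.exp(-((x.unsqueeze(-1) - centers) ** 2) / (2 * sigma ** 2))
    return enc.reshape(x.shape[0], -1)

def grid_features(x, grids):
    """Trilinearly interpolate features from each resolution level and concatenate."""
    pts = x.reshape(1, -1, 1, 1, 3) * 2 - 1               # to [-1,1] for grid_sample
    feats = [F.grid_sample(g, pts, align_corners=True).reshape(g.shape[1], -1).T
             for g in grids]
    return torch.cat(feats, dim=-1)

x = torch.rand(1024, 3)                                   # normalized scene coordinates
grids = [torch.rand(1, 2, r, r, r) for r in (16, 32, 64)] # per-level learnable features
encoding = torch.cat([one_blob(x), grid_features(x, grids)], dim=-1)
# `encoding` would then be decoded by shallow MLPs into SDF and color.
```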

MalIoT: Scalable and Real-time Malware Traffic Detection for IoT Networks

Apr 02, 2023
Ethan Weitkamp, Yusuke Satani, Adam Omundsen, Jingwen Wang, Peilong Li

Machine learning is vital for Internet of Things (IoT) malware traffic detection due to its ability to keep pace with the ever-evolving nature of malware. Machine learning algorithms can quickly and accurately analyze the vast amount of data produced by IoT devices, allowing for the real-time identification of malicious network traffic. Our system can handle the exponential growth of IoT devices thanks to distributed systems such as Apache Kafka and Apache Spark, while Intel's oneAPI software stack accelerates model inference, making it suitable for real-time malware traffic detection. Together, these technologies yield a system with scalable performance and high accuracy, making it a crucial tool for defending against cyber threats in smart communities and medical institutions.
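
To illustrate the kind of streaming pipeline the abstract describes, the sketch below reads flow records from Kafka and scores each micro-batch inside Spark Structured Streaming. The topic name, feature schema, and scoring stub are hypothetical; this is not the paper's implementation, and running it requires the Spark-Kafka connector package.

```python
# Rough sketch (illustrative topic/column names) of a Kafka + Spark Structured
# Streaming pipeline for scoring IoT network-flow records.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.appName("iot-malware-detection").getOrCreate()

# Flow features produced by an upstream collector (schema is an assumption).
schema = StructType([StructField("pkt_rate", DoubleType()),
                     StructField("byte_rate", DoubleType()),
                     StructField("duration", DoubleType())])

flows = (spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "iot-flows")                    # hypothetical topic name
         .load()
         .select(F.from_json(F.col("value").cast("string"), schema).alias("f"))
         .select("f.*"))

def score_batch(batch_df, batch_id):
    # A pre-trained model would score the rows here; this stub only logs batch size.
    rows = batch_df.collect()                                 # small micro-batches assumed
    print(f"batch {batch_id}: scored {len(rows)} flows")

query = flows.writeStream.foreachBatch(score_batch).start()
```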


CTT-Net: A Multi-view Cross-token Transformer for Cataract Postoperative Visual Acuity Prediction

Dec 12, 2022
Jinhong Wang, Jingwen Wang, Tingting Chen, Wenhao Zheng, Zhe Xu, Xingdi Wu, Wen Xu, Haochao Ying, Danny Chen, Jian Wu

Surgery is the only viable treatment for cataract patients with visual acuity (VA) impairment. Clinically, to assess the necessity of cataract surgery, it is crucial to accurately predict postoperative VA before surgery by analyzing multi-view optical coherence tomography (OCT) images. Unfortunately, due to complicated fundus conditions, determining postoperative VA remains difficult for medical experts. Deep learning methods for this problem have been developed in recent years. Although effective, these methods still face several issues, such as not efficiently exploring potential relations between multi-view OCT images, neglecting the key role of clinical prior knowledge (e.g., the preoperative VA value), and using only regression-based metrics, which lack reference. In this paper, we propose a novel Cross-token Transformer Network (CTT-Net) for postoperative VA prediction that analyzes both multi-view OCT images and preoperative VA. To effectively fuse multi-view features of OCT images, we develop cross-token attention that restricts redundant/unnecessary attention flow. Further, we utilize the preoperative VA value to provide more information for postoperative VA prediction and to facilitate fusion between views. Moreover, we design an auxiliary classification loss to improve model performance and assess VA recovery more thoroughly, avoiding the limitation of using only regression metrics. To evaluate CTT-Net, we build a multi-view OCT image dataset collected from our collaborating hospital. Extensive experiments validate the effectiveness of our model compared to existing methods on various metrics. Code is available at: https://github.com/wjh892521292/Cataract OCT.

* 5 pages, 3 figures, accepted for publication in BIBM 
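
A hedged sketch of the fusion idea: tokens from two OCT views attend to each other, a preoperative-VA token is appended, and dual regression/classification heads read out the fused representation. This is not the released CTT-Net code; the attention layout, pooling, and head sizes are assumptions.

```python
# Illustrative multi-view fusion sketch (not the CTT-Net implementation).
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.va_proj = nn.Linear(1, dim)       # embeds the scalar preoperative VA
        self.reg_head = nn.Linear(dim, 1)      # postoperative VA regression
        self.cls_head = nn.Linear(dim, 2)      # auxiliary recovery classification

    def forward(self, view_a, view_b, preop_va):
        # view_a/view_b: (B,N,D) tokens from the two OCT views; preop_va: (B,1).
        a2b, _ = self.cross(view_a, view_b, view_b)          # view A attends to view B
        tokens = torch.cat([a2b, self.va_proj(preop_va).unsqueeze(1)], dim=1)
        fused = tokens.mean(dim=1)                           # simple pooling in this sketch
        return self.reg_head(fused), self.cls_head(fused)

model = CrossViewFusion()
va_pred, recovery_logits = model(torch.rand(2, 49, 256), torch.rand(2, 49, 256), torch.rand(2, 1))
```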

Visual Subtitle Feature Enhanced Video Outline Generation

Sep 01, 2022
Qi Lv, Ziqiang Cao, Wenrui Xie, Derui Wang, Jingwen Wang, Zhiwei Hu, Tangkun Zhang, Ba Yuan, Yuanhang Li, Min Cao, Wenjie Li, Sujian Li, Guohong Fu

With the tremendously increasing number of videos, there is a great demand for techniques that help people quickly navigate to the video segments they are interested in. However, current work on video understanding mainly focuses on video content summarization, while little effort has been made to explore the structure of a video. Inspired by textual outline generation, we introduce a novel video understanding task, namely video outline generation (VOG). This task is defined to contain two sub-tasks: (1) first segmenting the video according to its content structure and then (2) generating a heading for each segment. To learn and evaluate VOG, we annotate a 10k+ dataset, called DuVOG. Specifically, we use OCR tools to recognize the subtitles of videos. Then annotators are asked to divide the subtitles into chapters and title each chapter. In videos, highlighted text tends to be the headline since it is more likely to attract attention. Therefore, we propose a Visual Subtitle feature Enhanced video outline generation model (VSENet), which takes as input the textual subtitles together with their visual font sizes and positions. We treat the VOG task as a sequence tagging problem that extracts the spans where the headings are located and then rewrites them to form the final outline. Furthermore, based on the similarity between video outlines and textual outlines, we use a large number of articles with chapter headings to pretrain our model. Experiments on DuVOG show that our model largely outperforms other baseline methods, achieving an F1-score of 77.1 at the video segmentation level and a ROUGE-L_F0.5 of 85.0 at the headline generation level.
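
The core modeling idea, enriching subtitle tokens with their visual font size and position before span tagging, can be sketched as below. The encoder choice, tag set, and layer sizes are illustrative assumptions rather than the actual VSENet architecture.

```python
# Illustrative sketch of fusing textual subtitles with visual font-size/position
# features for heading-span tagging (not the VSENet implementation).
import torch
import torch.nn as nn

class VisualSubtitleTagger(nn.Module):
    """Tags subtitle tokens (e.g., BIO over heading spans) using text + visual cues."""
    def __init__(self, vocab=30000, dim=256, num_tags=3):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.vis_proj = nn.Linear(3, dim)          # font size + (x, y) position per token
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.tagger = nn.Linear(2 * dim, num_tags)

    def forward(self, token_ids, visual_feats):
        h = self.tok_emb(token_ids) + self.vis_proj(visual_feats)
        out, _ = self.encoder(h)
        return self.tagger(out)                    # per-token tag logits

tagger = VisualSubtitleTagger()
logits = tagger(torch.randint(0, 30000, (2, 128)), torch.rand(2, 128, 3))
```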


GO-Surf: Neural Feature Grid Optimization for Fast, High-Fidelity RGB-D Surface Reconstruction

Jun 29, 2022
Jingwen Wang, Tymoteusz Bleja, Lourdes Agapito

We present GO-Surf, a direct feature grid optimization method for accurate and fast surface reconstruction from RGB-D sequences. We model the underlying scene with a learned hierarchical feature voxel grid that encapsulates multi-level geometric and appearance local information. Feature vectors are directly optimized such that, after being tri-linearly interpolated, decoded by two shallow MLPs into signed distance and radiance values, and rendered via surface volume rendering, the discrepancy between synthesized and observed RGB/depth values is minimized. Our supervision signals -- RGB, depth and approximate SDF -- can be obtained directly from input images without any need for fusion or post-processing. We formulate a novel SDF gradient regularization term that encourages surface smoothness and hole filling while maintaining high-frequency details. GO-Surf can optimize sequences of 1-2K frames in 15-45 minutes, a speedup of 60× over NeuralRGB-D, the most related approach based on an MLP representation, while maintaining on-par performance on standard benchmarks. Project page: https://jingwenwang95.github.io/go_surf/

* First two authors contributed equally. Project page: https://jingwenwang95.github.io/go_surf/ 
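
As a minimal, hedged illustration of an SDF gradient regularizer in the spirit of the term described above (GO-Surf's exact formulation may differ), the sketch below computes the SDF gradient via autograd and penalizes deviation of its norm from one.

```python
# Minimal eikonal-style SDF gradient regularizer sketch; the paper's exact
# regularization term may differ.
import torch

def sdf_gradient(sdf_fn, pts):
    """Compute d(SDF)/d(x) at query points via autograd."""
    pts = pts.clone().requires_grad_(True)
    sdf = sdf_fn(pts)
    grad, = torch.autograd.grad(sdf.sum(), pts, create_graph=True)
    return grad

def gradient_regularizer(sdf_fn, pts):
    """Penalize deviation of the SDF gradient norm from 1 (unit-speed property)."""
    grad = sdf_gradient(sdf_fn, pts)
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()

# Example with a toy SDF of a unit sphere (regularizer should be ~0 for this SDF).
toy_sdf = lambda x: x.norm(dim=-1) - 1.0
loss = gradient_regularizer(toy_sdf, torch.rand(4096, 3) * 2 - 1)
```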

Weighted Concordance Index Loss-based Multimodal Survival Modeling for Radiation Encephalopathy Assessment in Nasopharyngeal Carcinoma Radiotherapy

Jun 23, 2022
Jiansheng Fang, Anwei Li, Pu-Yun OuYang, Jiajian Li, Jingwen Wang, Hongbo Liu, Fang-Yun Xie, Jiang Liu

Radiation encephalopathy (REP) is the most common complication of nasopharyngeal carcinoma (NPC) radiotherapy. It is highly desirable to assist clinicians in optimizing the NPC radiotherapy regimen to reduce radiotherapy-induced temporal lobe injury (RTLI) according to the probability of REP onset. To the best of our knowledge, this is the first exploration of predicting radiotherapy-induced REP by jointly exploiting image and non-image data of the NPC radiotherapy regimen. We cast REP prediction as a survival analysis task and evaluate predictive accuracy in terms of the concordance index (CI). We design a deep multimodal survival network (MSN) with two feature extractors to learn discriminative features from multimodal data: one feature extractor imposes feature selection on non-image data, and the other learns visual features from images. Because the prior balanced CI (BCI) loss function, which directly maximizes the CI, is sensitive to uneven sampling per batch, we propose a novel weighted CI (WCI) loss function that leverages all REP samples effectively by assigning them different weights with a dual average operation. We further introduce a temperature hyper-parameter for our WCI to sharpen the risk difference between sample pairs and help model convergence. We extensively evaluate our WCI on a private dataset to demonstrate its advantages over its counterparts. The experimental results also show that multimodal data from NPC radiotherapy can bring further gains for REP risk prediction.

* 11 pages, 3 figures, MICCAI2022 
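
To make the CI-maximization idea with a temperature concrete, here is a generic temperature-scaled pairwise ranking loss over comparable survival pairs. This is not the paper's exact WCI formula (its dual-average weighting is only summarized in the abstract); it only illustrates the underlying concordance objective.

```python
# Generic pairwise concordance surrogate with a temperature; NOT the paper's WCI.
import torch
import torch.nn.functional as F

def pairwise_ci_loss(risk, time, event, temperature=0.1):
    """risk: predicted risk scores (N,); time: follow-up times (N,); event: 1 if REP observed."""
    # A pair (i, j) is comparable if subject i had the event and t_i < t_j.
    comparable = (time.unsqueeze(1) < time.unsqueeze(0)) & event.unsqueeze(1).bool()
    # Encourage risk_i > risk_j on comparable pairs; temperature sharpens the difference.
    diff = (risk.unsqueeze(1) - risk.unsqueeze(0)) / temperature
    losses = F.softplus(-diff)             # smooth surrogate for the indicator risk_i <= risk_j
    pair_losses = losses[comparable]
    return pair_losses.mean() if pair_losses.numel() > 0 else risk.new_zeros(())

loss = pairwise_ci_loss(torch.randn(8), torch.rand(8) * 5, torch.randint(0, 2, (8,)))
```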