Jiuniu Wang

Jointly Optimized Global-Local Visual Localization of UAVs

Oct 12, 2023
Haoling Li, Jiuniu Wang, Zhiwei Wei, Wenjia Xu

Navigation and localization of UAVs present a challenge when global navigation satellite systems (GNSS) are disrupted or unreliable. Traditional techniques, such as simultaneous localization and mapping (SLAM) and visual odometry (VO), have limitations in providing absolute coordinates and mitigating error accumulation. Existing visual localization methods achieve autonomous localization without error accumulation by matching against ortho satellite images; however, they cannot guarantee real-time performance due to the complex matching process. To address these challenges, we propose a novel Global-Local Visual Localization (GLVL) network. GLVL is a two-stage approach that combines a large-scale retrieval module, which finds regions similar to the UAV flight scene, with a fine-grained matching module, which localizes the precise UAV coordinates, enabling real-time and precise localization. The two modules are jointly optimized in an end-to-end manner to further enhance the model's capability. Experiments on six UAV flight scenes encompassing both texture-rich and texture-sparse regions demonstrate that our model meets the real-time, precise localization requirements of UAVs. In particular, our method achieves a localization error of only 2.39 meters in 0.48 seconds in a village scene with sparse texture.
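The two-stage retrieve-then-match idea can be pictured with a minimal sketch; the function names, feature shapes, and median-offset matcher below are illustrative assumptions, not the GLVL implementation.

# Hypothetical sketch of a retrieve-then-match localizer; not the GLVL code.
import numpy as np

def retrieve_best_tile(query_desc, tile_descs):
    """Stage 1: rank pre-embedded satellite tiles by cosine similarity to the UAV image."""
    q = query_desc / np.linalg.norm(query_desc)
    t = tile_descs / np.linalg.norm(tile_descs, axis=1, keepdims=True)
    return int(np.argmax(t @ q))

def estimate_offset(query_kpts, tile_kpts):
    """Stage 2 (toy): translation from nearest-neighbor keypoint descriptor matches."""
    # query_kpts / tile_kpts: dicts with 'xy' (N, 2) positions and 'desc' (N, D) descriptors
    dists = np.linalg.norm(query_kpts['desc'][:, None] - tile_kpts['desc'][None], axis=-1)
    nn = dists.argmin(axis=1)
    return np.median(tile_kpts['xy'][nn] - query_kpts['xy'], axis=0)  # robust translation

def localize(query_desc, query_kpts, tiles):
    """tiles: list of dicts with 'desc', 'kpts', and 'origin' (geo-coordinates of the tile corner)."""
    best = retrieve_best_tile(query_desc, np.stack([t['desc'] for t in tiles]))
    # toy: assumes tile pixel offsets are already expressed in the same units as 'origin'
    return tiles[best]['origin'] + estimate_offset(query_kpts, tiles[best]['kpts'])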


ModelScope Text-to-Video Technical Report

Aug 12, 2023
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang

This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth motion transitions. The model can adapt to varying numbers of frames during training and inference, making it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet) comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at \url{https://modelscope.cn/models/damo/text-to-video-synthesis/summary}.

* Technical report. Project page: \url{https://modelscope.cn/models/damo/text-to-video-synthesis/summary} 
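As a rough illustration of the spatio-temporal block idea (spatial processing per frame followed by mixing along the time axis), here is a minimal PyTorch sketch; the layer choices and shapes are assumptions for illustration, not the ModelScopeT2V architecture.

# Minimal spatio-temporal block sketch (illustrative, not ModelScopeT2V's code).
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Spatial conv over each frame, then a temporal conv mixing features across frames."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        # x: (batch, frames, channels, height, width); the frame count may vary.
        b, f, c, h, w = x.shape
        y = self.spatial(x.reshape(b * f, c, h, w)).reshape(b, f, c, h, w)
        # fold spatial dims into the batch so the 1D conv mixes only along time
        t = y.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        t = self.temporal(t).reshape(b, h, w, c, f).permute(0, 4, 3, 1, 2)
        return x + t  # residual connection

video = torch.randn(2, 8, 64, 32, 32)   # toy batch of 8-frame latents
print(SpatioTemporalBlock(64)(video).shape)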

VideoComposer: Compositional Video Synthesis with Motion Controllability

Jun 06, 2023
Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, Jingren Zhou

The pursuit of controllability as a higher standard of visual content creation has yielded remarkable progress in customizable image synthesis. However, achieving controllable video synthesis remains challenging due to the large variation of temporal dynamics and the requirement of cross-frame temporal consistency. Based on the paradigm of compositional generation, this work presents VideoComposer, which allows users to flexibly compose a video with textual conditions, spatial conditions, and, more importantly, temporal conditions. Specifically, considering the characteristics of video data, we introduce motion vectors from compressed videos as an explicit control signal to provide guidance on temporal dynamics. In addition, we develop a Spatio-Temporal Condition encoder (STC-encoder) that serves as a unified interface to effectively incorporate the spatial and temporal relations of sequential inputs, with which the model can make better use of temporal conditions and hence achieve higher inter-frame consistency. Extensive experimental results suggest that VideoComposer is able to control the spatial and temporal patterns of a synthesized video simultaneously, using conditions in various forms such as a text description, a sketch sequence, a reference video, or even simple hand-crafted motions. The code and models will be publicly available at https://videocomposer.github.io.

* The first four authors contributed equally. Project page: https://videocomposer.github.io 
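A minimal sketch of how per-frame condition maps (such as motion vectors) might be embedded and fused along the time axis; the patchify-then-attend design, layer sizes, and names here are assumptions for illustration, not the released STC-encoder.

# Illustrative condition-encoder sketch; not VideoComposer's code.
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Embed per-frame condition maps (e.g., motion vectors, sketches) and fuse them over time."""
    def __init__(self, in_channels, dim=128, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(in_channels, dim, kernel_size=4, stride=4)  # patchify each frame
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cond):
        # cond: (batch, frames, channels, height, width), e.g. 2-channel motion vectors
        b, f, c, h, w = cond.shape
        tokens = self.embed(cond.reshape(b * f, c, h, w))          # (b*f, dim, h/4, w/4)
        d, gh, gw = tokens.shape[1:]
        tokens = tokens.reshape(b, f, d, gh * gw).permute(0, 3, 1, 2).reshape(b * gh * gw, f, d)
        fused, _ = self.temporal_attn(tokens, tokens, tokens)      # attend across frames per location
        return fused.reshape(b, gh * gw, f, d)                     # condition tokens for the generator

motion = torch.randn(1, 8, 2, 64, 64)      # toy motion-vector sequence
print(ConditionEncoder(2)(motion).shape)   # (1, 256, 8, 128)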

Distinctive Image Captioning via CLIP Guided Group Optimization

Aug 14, 2022
Youyuan Zhang, Jiuniu Wang, Hao Wu, Wenjia Xu

Image captioning models are usually trained on human-annotated ground-truth captions, which can yield accurate but generic captions. In this paper, we focus on generating distinctive captions that distinguish the target image from other similar images. To evaluate the distinctiveness of captions, we introduce a series of metrics that use the large-scale vision-language pre-training model CLIP to quantify distinctiveness. To further improve the distinctiveness of captioning models, we propose a simple and effective training strategy that trains the model by comparing the target image with a group of similar images and optimizing the group embedding gap. Extensive experiments are conducted on various baseline models to demonstrate the wide applicability of our strategy and the consistency of the metric results with human evaluation. By comparing the performance of our best model with existing state-of-the-art models, we show that it achieves a new state of the art on the distinctiveness objective.
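The group embedding gap can be pictured with a small sketch over precomputed CLIP embeddings; the exact scoring below (mean similarity to the distractor group) is an assumption for illustration, not the paper's objective.

# Hedged sketch: a caption's distinctiveness from precomputed CLIP embeddings (toy loss).
import torch
import torch.nn.functional as F

def group_embedding_gap(caption_emb, target_emb, similar_embs):
    """Reward similarity to the target image and penalize similarity to its look-alikes."""
    cap = F.normalize(caption_emb, dim=-1)
    tgt = F.normalize(target_emb, dim=-1)
    grp = F.normalize(similar_embs, dim=-1)       # (K, D) embeddings of similar images
    sim_target = cap @ tgt                        # how well the caption fits the target
    sim_group = (cap @ grp.T).mean()              # how well it fits the distractors
    return sim_target - sim_group                 # larger gap = more distinctive

cap = torch.randn(512)                            # toy caption embedding
tgt = torch.randn(512)                            # toy target-image embedding
grp = torch.randn(5, 512)                         # toy similar-image embeddings
loss = -group_embedding_gap(cap, tgt, grp)        # maximize the gap during training
print(loss.item())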


Distinctive Image Captioning via CLIP Guided Group Optimization

Aug 08, 2022
Youyuan Zhang, Jiuniu Wang, Hao Wu, Wenjia Xu

Image captioning models are usually trained on human-annotated ground-truth captions, which can yield accurate but generic captions. To improve the distinctiveness of captioning models, we first propose a series of metrics that use the large-scale vision-language pre-training model CLIP to evaluate the distinctiveness of captions. We then propose a simple and effective training strategy that trains the model by comparison within groups of similar images. We conduct extensive experiments on various existing models to demonstrate the wide applicability of our strategy and the consistency of the metric-based results with human evaluation. By comparing the performance of our best model with existing state-of-the-art models, we show that it achieves a new state of the art on the distinctiveness objective.


Multi-dimension Geospatial Feature Learning for Urban Region Function Recognition

Jul 18, 2022
Wenjia Xu, Jiuniu Wang, Yirong Wu

Urban region function recognition plays a vital role in monitoring and managing limited urban areas. Since urban functions are complex and rich in socio-economic properties, remote sensing (RS) images, which carry only physical and optical information, cannot completely solve the classification task. On the other hand, with the development of mobile communication and the internet, the acquisition of geospatial big data (GBD) has become possible. In this paper, we propose a Multi-dimension Feature Learning Model (MDFL) that uses high-dimensional GBD in conjunction with RS images for urban region function recognition. When extracting multi-dimensional features, our model considers user-related information modeled by user activity, as well as region-based information abstracted from the region graph. Furthermore, we propose a decision fusion network that integrates the decisions of several neural networks and machine learning classifiers; the final decision is made considering both the visual cue from the RS images and the social information from the GBD. Through quantitative evaluation, we demonstrate that our model achieves an overall accuracy of 92.75, outperforming the state of the art by 10 percent.
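Decision fusion of this kind can be sketched as a weighted combination of per-classifier class probabilities; the classifiers, weights, and numbers below are made up for illustration and do not reflect the MDFL design.

# Toy decision-fusion sketch; not the MDFL network.
import numpy as np

def fuse_decisions(prob_list, weights=None):
    """Combine per-classifier class-probability vectors into one decision.

    prob_list: list of (num_classes,) arrays, e.g. from an RS-image CNN,
               a user-activity model, and a region-graph model.
    """
    probs = np.stack(prob_list)                          # (num_models, num_classes)
    if weights is None:
        weights = np.full(len(prob_list), 1.0 / len(prob_list))
    fused = weights @ probs                              # weighted average of the decisions
    return fused.argmax(), fused

rs_cnn   = np.array([0.7, 0.2, 0.1])                     # visual cue from RS imagery
activity = np.array([0.4, 0.5, 0.1])                     # social cue from geospatial big data
graph    = np.array([0.6, 0.3, 0.1])                     # region-graph cue
label, fused = fuse_decisions([rs_cnn, activity, graph])
print(label, fused)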


On Distinctive Image Captioning via Comparing and Reweighting

Apr 08, 2022
Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan

Recent image captioning models achieve impressive results on popular metrics, i.e., BLEU, CIDEr, and SPICE. However, focusing only on these metrics, which consider no more than the overlap between the generated captions and human annotations, can lead to common words and phrases and hence a lack of distinctiveness, i.e., many similar images end up with the same caption. In this paper, we aim to improve the distinctiveness of image captions via comparing and reweighting with a set of similar images. First, we propose a distinctiveness metric, between-set CIDEr (CIDErBtw), to evaluate the distinctiveness of a caption with respect to those of similar images. Our metric reveals that the human annotations of each image in the MSCOCO dataset are not equivalent in terms of distinctiveness; however, previous works normally treat the human annotations equally during training, which could be one reason for generating less distinctive captions. In contrast, we reweight each ground-truth caption according to its distinctiveness during training. We further integrate a long-tailed weighting strategy to highlight the rare words that carry more information, and we sample captions from the similar image set as negative examples to encourage the generated sentence to be unique. Finally, extensive experiments show that our proposed approach significantly improves both distinctiveness (as measured by CIDErBtw and retrieval metrics) and accuracy (e.g., as measured by CIDEr) for a wide variety of image captioning baselines. These results are further confirmed through a user study.

* IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI 2022)  
* 20 pages. arXiv admin note: substantial text overlap with arXiv:2007.06877 
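The reweighting step can be sketched as follows, assuming per-caption distinctiveness scores are already available (e.g., from a CIDErBtw-style metric); the softmax weighting is an illustrative choice, not the paper's exact formula.

# Minimal sketch of distinctiveness-based reweighting of ground-truth captions.
import torch

def caption_weights(distinctiveness, temperature=1.0):
    """Turn per-caption distinctiveness scores into training weights (assumed softmax scheme)."""
    return torch.softmax(torch.as_tensor(distinctiveness, dtype=torch.float) / temperature, dim=0)

def weighted_caption_loss(per_caption_nll, distinctiveness):
    """Weight each ground-truth caption's negative log-likelihood by its distinctiveness."""
    w = caption_weights(distinctiveness)
    return (w * per_caption_nll).sum()

nll = torch.tensor([2.1, 1.8, 2.5, 1.9, 2.2])     # toy NLL of 5 reference captions under the model
scores = [0.9, 0.2, 0.7, 0.1, 0.5]                # toy scores; higher = more distinctive caption
print(weighted_caption_loss(nll, scores))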

Attribute Prototype Network for Any-Shot Learning

Apr 04, 2022
Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, Zeynep Akata

Any-shot image classification makes it possible to recognize novel classes with only a few or even zero samples. For the task of zero-shot learning, visual attributes have been shown to play an important role, while in the few-shot regime the effect of attributes is under-explored. To better transfer attribute-based knowledge from seen to unseen classes, we argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e., zero-shot and few-shot, image classification tasks. To this end, we propose a novel representation learning framework that jointly learns discriminative global and local features using only class-level attributes. While a visual-semantic embedding layer learns global features, local features are learned through an attribute prototype network that simultaneously regresses and decorrelates attributes from intermediate features. Furthermore, we introduce a zoom-in module that localizes and crops the informative regions to encourage the network to learn informative features explicitly. We show that our locality-augmented image representations achieve a new state of the art on challenging benchmarks, i.e., CUB, AWA2, and SUN. As an additional benefit, our model points to the visual evidence of the attributes in an image, confirming the improved attribute localization ability of our image representation. The attribute localization is evaluated quantitatively with ground-truth part annotations, qualitatively with visualizations, and through well-designed user studies.

* arXiv admin note: text overlap with arXiv:2008.08290 
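One way to picture the attribute prototype idea is a set of learnable prototypes whose similarity maps over the feature grid are max-pooled into attribute scores, so the peak location hints at where the attribute is; the sketch below is a simplification with assumed sizes, not the paper's network.

# Simplified attribute-prototype sketch; not the paper's implementation.
import torch
import torch.nn as nn

class AttributePrototypes(nn.Module):
    """Each attribute gets a learnable prototype; its similarity map over the feature grid
    is max-pooled into an attribute score, so the peak location localizes the attribute."""
    def __init__(self, feat_dim, num_attributes):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_attributes, feat_dim))

    def forward(self, feat_map):
        # feat_map: (batch, feat_dim, H, W) intermediate CNN features
        sim = torch.einsum('bchw,ac->bahw', feat_map, self.prototypes)  # per-attribute similarity maps
        scores = sim.flatten(2).max(dim=-1).values                      # (batch, num_attributes)
        return scores, sim

feats = torch.randn(2, 256, 7, 7)
scores, maps = AttributePrototypes(256, 85)(feats)   # e.g. 85 attributes, as in AWA2
print(scores.shape, maps.shape)                      # (2, 85), (2, 85, 7, 7)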

VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning

Mar 20, 2022
Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, Zeynep Akata

Human-annotated attributes serve as powerful semantic embeddings in zero-shot learning. However, their annotation process is labor-intensive and needs expert supervision. Current unsupervised semantic embeddings, i.e., word embeddings, enable knowledge transfer between classes. However, word embeddings do not always reflect visual similarities and therefore yield inferior zero-shot performance. We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning, without requiring any human annotation. Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity, and further imposes class discrimination and semantic relatedness on the clusters. To associate these clusters with previously unseen classes, we use external knowledge, e.g., word embeddings, and propose a novel class relation discovery module. Through quantitative and qualitative evaluation, we demonstrate that our model discovers semantic embeddings that capture the visual properties of both seen and unseen classes. Furthermore, we demonstrate on three benchmarks that our visually-grounded semantic embeddings improve performance over word embeddings across various ZSL models by a large margin.
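A rough sketch of the cluster-then-transfer idea, assuming precomputed region features and word vectors; the k-means step, the frequency-based class embedding, and the softmax transfer are simplifications for illustration, not the VGSE pipeline.

# Hedged sketch of visually-grounded class embeddings via region clustering.
import numpy as np
from sklearn.cluster import KMeans

def cluster_semantic_embeddings(region_feats, region_class_ids, num_classes, num_clusters=64):
    """Cluster local region features, then describe each seen class by how often
    its regions fall into each cluster (a visual 'attribute' vector)."""
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(region_feats)
    emb = np.zeros((num_classes, num_clusters))
    for cluster_id, class_id in zip(km.labels_, region_class_ids):
        emb[class_id, cluster_id] += 1
    emb /= emb.sum(axis=1, keepdims=True) + 1e-8      # normalize per class
    return emb, km

def transfer_to_unseen(seen_emb, seen_word_vecs, unseen_word_vecs):
    """Predict unseen-class embeddings as a word-similarity-weighted mix of seen classes."""
    sim = unseen_word_vecs @ seen_word_vecs.T          # (num_unseen, num_seen)
    w = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    return w @ seen_emb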


Group-based Distinctive Image Captioning with Memory Attention

Aug 20, 2021
Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan

Describing images using natural language is widely known as image captioning, which has made consistent progress thanks to the development of computer vision and natural language generation techniques. Though conventional captioning models achieve high accuracy on popular metrics, i.e., BLEU, CIDEr, and SPICE, the ability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneering works employ contrastive learning or re-weight the ground-truth captions, focusing on a single input image. However, the relationships between objects in a group of similar images (e.g., items or properties within the same album or fine-grained events) are neglected. In this paper, we improve the distinctiveness of image captions using a Group-based Distinctive Captioning Model (GdisCap), which compares each image with the other images in a similar group and highlights the uniqueness of each image. In particular, we propose a group-based memory attention (GMA) module, which stores object features that are unique within the image group (i.e., with low similarity to objects in other images). These unique object features are highlighted when generating captions, resulting in more distinctive captions. Furthermore, distinctive words in the ground-truth captions are selected to supervise the language decoder and the GMA. Finally, we propose a new evaluation metric, the distinctive word rate (DisWordRate), to measure the distinctiveness of captions. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models and achieves state-of-the-art performance on both accuracy and distinctiveness. Results of a user study agree with the quantitative evaluation and demonstrate the soundness of the new metric DisWordRate.

* Accepted at ACM MM 2021 (oral) 
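One reading of the group-based uniqueness idea is to down-weight object features that closely resemble objects elsewhere in the group; the sketch below is that interpretation with assumed shapes, not the released GdisCap code.

# Toy sketch of group-based uniqueness weighting; an interpretation, not GdisCap's GMA module.
import torch
import torch.nn.functional as F

def uniqueness_weights(obj_feats, group_feats):
    """Down-weight object features that closely resemble objects in the other images of the group.

    obj_feats:   (N, D) region features of the target image
    group_feats: (M, D) region features pooled from the similar images
    """
    a = F.normalize(obj_feats, dim=-1)
    b = F.normalize(group_feats, dim=-1)
    max_sim = (a @ b.T).max(dim=-1).values          # how 'common' each object is within the group
    return torch.softmax(1.0 - max_sim, dim=0)      # unique objects get larger attention weights

target = torch.randn(6, 512)                        # toy: 6 detected regions in the target image
group = torch.randn(40, 512)                        # toy: regions pooled from 4 similar images
print(uniqueness_weights(target, group))            # weights to emphasize when decoding the caption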