The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector that categorises each box proposal by a classifier generated from the text encoder of a pre-trained visual-language model; (ii) To pair the visual latent space (from RPN box proposal) with that of the pre-trained text encoder, we propose the idea of regional prompt learning to optimise a couple of learnable prompt vectors, converting the textual embedding space to fit those visually object-centric images; (iii) To scale up the learning procedure towards detecting a wider spectrum of objects, we exploit the available online resource, iteratively updating the prompts, and later self-training the proposed detector with pseudo labels generated on a large corpus of noisy, uncurated web images. The self-trained detector, termed as PromptDet, significantly improves the detection performance on categories for which manual annotations are unavailable or hard to obtain, e.g. rare categories. Finally, (iv) to validate the necessity of our proposed components, we conduct extensive experiments on the challenging LVIS and MS-COCO dataset, showing superior performance over existing approaches with fewer additional training images and zero manual annotations whatsoever. Project page with code: https://fcjian.github.io/promptdet.
Temporal Sentence Grounding in Videos (TSGV), which aims to ground a natural language sentence in an untrimmed video, has drawn widespread attention over the past few years. However, recent studies have found that current benchmark datasets may have obvious moment annotation biases, enabling several simple baselines even without training to achieve SOTA performance. In this paper, we take a closer look at existing evaluation protocols, and find both the prevailing dataset and evaluation metrics are the devils that lead to untrustworthy benchmarking. Therefore, we propose to re-organize the two widely-used datasets, making the ground-truth moment distributions different in the training and test splits, i.e., out-of-distribution (OOD) test. Meanwhile, we introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets. New benchmarking results indicate that our proposed evaluation protocols can better monitor the research progress. Furthermore, we propose a novel causality-based Multi-branch Deconfounding Debiasing (MDD) framework for unbiased moment prediction. Specifically, we design a multi-branch deconfounder to eliminate the effects caused by multiple confounders with causal intervention. In order to help the model better align the semantics between sentence queries and video moments, we enhance the representations during feature encoding. Specifically, for textual information, the query is parsed into several verb-centered phrases to obtain a more fine-grained textual feature. For visual information, the positional information has been decomposed from moment features to enhance representations of moments with diverse locations. Extensive experiments demonstrate that our proposed approach can achieve competitive results among existing SOTA approaches and outperform the base model with great gains.
3D dense captioning is a recently-proposed novel task, where point clouds contain more geometric information than the 2D counterpart. However, it is also more challenging due to the higher complexity and wider variety of inter-object relations. Existing methods only treat such relations as by-products of object feature learning in graphs without specifically encoding them, which leads to sub-optimal results. In this paper, aiming at improving 3D dense captioning via capturing and utilizing the complex relations in the 3D scene, we propose MORE, a Multi-Order RElation mining model, to support generating more descriptive and comprehensive captions. Technically, our MORE encodes object relations in a progressive manner since complex relations can be deduced from a limited number of basic ones. We first devise a novel Spatial Layout Graph Convolution (SLGC), which semantically encodes several first-order relations as edges of a graph constructed over 3D object proposals. Next, from the resulting graph, we further extract multiple triplets which encapsulate basic first-order relations as the basic unit and construct several Object-centric Triplet Attention Graphs (OTAG) to infer multi-order relations for every target object. The updated node features from OTAG are aggregated and fed into the caption decoder to provide abundant relational cues so that captions including diverse relations with context objects can be generated. Extensive experiments on the Scan2Cap dataset prove the effectiveness of our proposed MORE and its components, and we also outperform the current state-of-the-art method.
Recently, one-stage visual grounders attract high attention due to the comparable accuracy but significantly higher efficiency than two-stage grounders. However, inter-object relation modeling has not been well studied for one-stage grounders. Inter-object relationship modeling, though important, is not necessarily performed among all the objects within the image, as only a part of them are related to the text query and may confuse the model. We call these objects "suspected objects". However, exploring relationships among these suspected objects in the one-stage visual grounding paradigm is non-trivial due to two core problems: (1) no object proposals are available as the basis on which to select suspected objects and perform relationship modeling; (2) compared with those irrelevant to the text query, suspected objects are more confusing, as they may share similar semantics, be entangled with certain relationships, etc, and thereby more easily mislead the model's prediction. To address the above issues, this paper proposes a Suspected Object Graph (SOG) approach to encourage the correct referred object selection among the suspected ones in the one-stage visual grounding. Suspected objects are dynamically selected from a learned activation map as nodes to adapt to the current discrimination ability of the model during training. Afterward, on top of the suspected objects, a Keyword-aware Node Representation module (KNR) and an Exploration by Random Connection strategy (ERC) are concurrently proposed within the SOG to help the model rethink its initial prediction. Extensive ablation studies and comparison with state-of-the-art approaches on prevalent visual grounding benchmarks demonstrate the effectiveness of our proposed method.
In this paper, we investigate a novel and challenging task, namely controllable video captioning with an exemplar sentence. Formally, given a video and a syntactically valid exemplar sentence, the task aims to generate one caption which not only describes the semantic contents of the video, but also follows the syntactic form of the given exemplar sentence. In order to tackle such an exemplar-based video captioning task, we propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture. The proposed SMCG takes video semantic representation as an input, and conditionally modulates the gates and cells of long short-term memory network with respect to the encoded syntactic information of the given exemplar sentence. Therefore, SMCG is able to control the states for word prediction and achieve the syntax customized caption generation. We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets. Extensive experimental results demonstrate the effectiveness of our approach on generating syntax controllable and semantic preserved video captions. By providing different exemplar sentences, our approach is capable of producing different captions with various syntactic structures, thus indicating a promising way to strengthen the diversity of video captioning.
Enhancing the diversity of sentences to describe video contents is an important problem arising in recent video captioning research. In this paper, we explore this problem from a novel perspective of customizing video captions by imitating exemplar sentence syntaxes. Specifically, given a video and any syntax-valid exemplar sentence, we introduce a new task of Syntax Customized Video Captioning (SCVC) aiming to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence. To tackle the SCVC task, we propose a novel video captioning model, where a hierarchical sentence syntax encoder is firstly designed to extract the syntactic structure of the exemplar sentence, then a syntax conditioned caption decoder is devised to generate the syntactically structured caption expressing video semantics. As there is no available syntax customized groundtruth video captions, we tackle such a challenge by proposing a new training strategy, which leverages the traditional pairwise video captioning data and our collected exemplar sentences to accomplish the model learning. Extensive experiments, in terms of semantic, syntactic, fluency, and diversity evaluations, clearly demonstrate our model capability to generate syntax-varied and semantics-coherent video captions that well imitate different exemplar sentences with enriched diversities.
The accurate and reliable detection or prediction of freezing of gaits (FOG) is important for fall prevention in Parkinson's Disease (PD) and studying the physiological transitions during the occurrence of FOG. Integrating both commercial and self-designed sensors, a protocal has been designed to acquire multimodal physical and physiological information during FOG, including gait acceleration (ACC), electroencephalogram (EEG), electromyogram (EMG), and skin conductance (SC). Two tasks were designed to trigger FOG, including gait initiation failure and FOG during walking. A total number of 12 PD patients completed the experiments and produced a total length of 3 hours and 42 minutes of valid data. The FOG episodes were labeled by two qualified physicians. Each unimodal data and combinations have been used to detect FOG. Results showed that multimodal data benefit the detection of FOG. Among unimodal data, EEG had better discriminative ability than ACC and EMG. However, the acquisition of EEG are more complicated. Multimodal motional and electrophysiological data can also be used to study the physiological transition process during the occurrence of FOG and provide personalised interventions.
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred by one given natural language expression. The diverse and flexible expressions as well as complex visual contents in the images raise the RIS model with higher demands for investigating fine-grained matching behaviors between words in expressions and objects presented in images. However, such matching behaviors are hard to be learned and captured when the visual cues of referents (i.e. referred objects) are insufficient, as the referents with weak visual cues tend to be easily confused by cluttered background at boundary or even overwhelmed by salient objects in the image. And the insufficient visual cues issue can not be handled by the cross-modal fusion mechanisms as done in previous work. In this paper, we tackle this problem from a novel perspective of enhancing the visual information for the referents by devising a Two-stage Visual cues enhancement Network (TV-Net), where a novel Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module are proposed. Through the two-stage enhancement, our proposed TV-Net enjoys better performances in learning fine-grained matching behaviors between the natural language expression and image, especially when the visual information of the referent is inadequate, thus produces better segmentation results. Extensive experiments are conducted to validate the effectiveness of the proposed method on the RIS task, with our proposed TV-Net surpassing the state-of-the-art approaches on four benchmark datasets.
This paper is a technical report to our submission to the ICCV 2021 VIPriors Re-identification Challenge. In order to make full use of the visual inductive priors of the data, we treat the query and gallery images of the same identity as continuous frames in a video sequence. And we propose one novel post-processing strategy for video temporal relationship mining, which not only calculates the distance matrix between query and gallery images, but also the matrix between gallery images. The initial query image is used to retrieve the most similar image from the gallery, then the retrieved image is treated as a new query to retrieve its most similar image from the gallery. By iteratively searching for the closest image, we can achieve accurate image retrieval and finally obtain a robust retrieval sequence.
Adaptive radiotherapy (ART), especially online ART, effectively accounts for positioning errors and anatomical changes. One key component of online ART is accurately and efficiently delineating organs at risk (OARs) and targets on online images, such as CBCT, to meet the online demands of plan evaluation and adaptation. Deep learning (DL)-based automatic segmentation has gained great success in segmenting planning CT, but its applications to CBCT yielded inferior results due to the low image quality and limited available contour labels for training. To overcome these obstacles to online CBCT segmentation, we propose a registration-guided DL (RgDL) segmentation framework that integrates image registration algorithms and DL segmentation models. The registration algorithm generates initial contours, which were used as guidance by DL model to obtain accurate final segmentations. We had two implementations the proposed framework--Rig-RgDL (Rig for rigid body) and Def-RgDL (Def for deformable)--with rigid body (RB) registration or deformable image registration (DIR) as the registration algorithm respectively and U-Net as DL model architecture. The two implementations of RgDL framework were trained and evaluated on seven OARs in an institutional clinical Head and Neck (HN) dataset. Compared to the baseline approaches using the registration or the DL alone, RgDL achieved more accurate segmentation, as measured by higher mean Dice similarity coefficients (DSC) and other distance-based metrics. Rig-RgDL achieved a DSC of 84.5% on seven OARs on average, higher than RB or DL alone by 4.5% and 4.7%. The DSC of Def-RgDL is 86.5%, higher than DIR or DL alone by 2.4% and 6.7%. The inference time took by the DL model to generate final segmentations of seven OARs is less than one second in RgDL. The resulting segmentation accuracy and efficiency show the promise of applying RgDL framework for online ART.