Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jianzhuang Liu

Noah's Ark Lab, Huawei Technologies, Shenzhen, China

Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation

Feb 13, 2023

Bingqian Lin, Yi Zhu, Xiaodan Liang, Liang Lin, Jianzhuang Liu

Abstract:Vision-Language Navigation (VLN) is a challenging task which requires an agent to align complex visual observations to language instructions to reach the goal position. Most existing VLN agents directly learn to align the raw directional features and visual features trained using one-hot labels to linguistic instruction features. However, the big semantic gap among these multi-modal inputs makes the alignment difficult and therefore limits the navigation performance. In this paper, we propose Actional Atomic-Concept Learning (AACL), which maps visual observations to actional atomic concepts for facilitating the alignment. Specifically, an actional atomic concept is a natural language phrase containing an atomic action and an object, e.g., ``go up stairs''. These actional atomic concepts, which serve as the bridge between observations and instructions, can effectively mitigate the semantic gap and simplify the alignment. AACL contains three core components: 1) a concept mapping module to map the observations to the actional atomic concept representations through the VLN environment and the recently proposed Contrastive Language-Image Pretraining (CLIP) model, 2) a concept refining adapter to encourage more instruction-oriented object concept extraction by re-ranking the predicted object concepts by CLIP, and 3) an observation co-embedding module which utilizes concept representations to regularize the observation representations. Our AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks. Moreover, the visualization shows that AACL significantly improves the interpretability in action decision.

* Accepted by AAAI 2023

Via

Access Paper or Ask Questions

Feature Calibration Network for Occluded Pedestrian Detection

Dec 12, 2022

Tianliang Zhang, Qixiang Ye, Baochang Zhang, Jianzhuang Liu, Xiaopeng Zhang, Qi Tian

Abstract:Pedestrian detection in the wild remains a challenging problem especially for scenes containing serious occlusion. In this paper, we propose a novel feature learning method in the deep learning framework, referred to as Feature Calibration Network (FC-Net), to adaptively detect pedestrians under various occlusions. FC-Net is based on the observation that the visible parts of pedestrians are selective and decisive for detection, and is implemented as a self-paced feature learning framework with a self-activation (SA) module and a feature calibration (FC) module. In a new self-activated manner, FC-Net learns features which highlight the visible parts and suppress the occluded parts of pedestrians. The SA module estimates pedestrian activation maps by reusing classifier weights, without any additional parameter involved, therefore resulting in an extremely parsimony model to reinforce the semantics of features, while the FC module calibrates the convolutional features for adaptive pedestrian representation in both pixel-wise and region-based ways. Experiments on CityPersons and Caltech datasets demonstrate that FC-Net improves detection performance on occluded pedestrians up to 10% while maintaining excellent performance on non-occluded instances.

* Accepted by IEEE Transactions on Intelligent Transportation Systems (TITS)

Via

Access Paper or Ask Questions

CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation

Dec 04, 2022

Zicheng Zhang, Yi Zhu, Jianzhuang Liu, Xiaodan Liang, Wei Ke

Figure 1 for CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation

Figure 2 for CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation

Figure 3 for CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation

Figure 4 for CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation

Abstract:Referring image segmentation aims at localizing all pixels of the visual objects described by a natural language sentence. Previous works learn to straightforwardly align the sentence embedding and pixel-level embedding for highlighting the referred objects, but ignore the semantic consistency of pixels within the same object, leading to incomplete masks and localization errors in predictions. To tackle this problem, we propose CoupAlign, a simple yet effective multi-level visual-semantic alignment method, to couple sentence-mask alignment with word-pixel alignment to enforce object mask constraint for achieving more accurate localization and segmentation. Specifically, the Word-Pixel Alignment (WPA) module performs early fusion of linguistic and pixel-level features in intermediate layers of the vision and language encoders. Based on the word-pixel aligned embedding, a set of mask proposals are generated to hypothesize possible objects. Then in the Sentence-Mask Alignment (SMA) module, the masks are weighted by the sentence embedding to localize the referred object, and finally projected back to aggregate the pixels for the target. To further enhance the learning of the two alignment modules, an auxiliary loss is designed to contrast the foreground and background pixels. By hierarchically aligning pixels and masks with linguistic features, our CoupAlign captures the pixel coherence at both visual and semantic levels, thus generating more accurate predictions. Extensive experiments on popular datasets (e.g., RefCOCO and G-Ref) show that our method achieves consistent improvements over state-of-the-art methods, e.g., about 2% oIoU increase on the validation and testing set of RefCOCO. Especially, CoupAlign has remarkable ability in distinguishing the target from multiple objects of the same class.

* accept to NeurIPS 2022

Via

Access Paper or Ask Questions

FNeVR: Neural Volume Rendering for Face Animation

Sep 21, 2022

Bohan Zeng, Boyu Liu, Hong Li, Xuhui Liu, Jianzhuang Liu, Dapeng Chen, Wei Peng, Baochang Zhang

Figure 1 for FNeVR: Neural Volume Rendering for Face Animation

Figure 2 for FNeVR: Neural Volume Rendering for Face Animation

Figure 3 for FNeVR: Neural Volume Rendering for Face Animation

Figure 4 for FNeVR: Neural Volume Rendering for Face Animation

Abstract:Face animation, one of the hottest topics in computer vision, has achieved a promising performance with the help of generative models. However, it remains a critical challenge to generate identity preserving and photo-realistic images due to the sophisticated motion deformation and complex facial detail modeling. To address these problems, we propose a Face Neural Volume Rendering (FNeVR) network to fully explore the potential of 2D motion warping and 3D volume rendering in a unified framework. In FNeVR, we design a 3D Face Volume Rendering (FVR) module to enhance the facial details for image rendering. Specifically, we first extract 3D information with a well-designed architecture, and then introduce an orthogonal adaptive ray-sampling module for efficient rendering. We also design a lightweight pose editor, enabling FNeVR to edit the facial pose in a simple yet effective way. Extensive experiments show that our FNeVR obtains the best overall quality and performance on widely used talking-head benchmarks.

Via

Access Paper or Ask Questions

Structure-Preserving Graph Representation Learning

Sep 02, 2022

Ruiyi Fang, Liangjian Wen, Zhao Kang, Jianzhuang Liu

Figure 1 for Structure-Preserving Graph Representation Learning

Figure 2 for Structure-Preserving Graph Representation Learning

Figure 3 for Structure-Preserving Graph Representation Learning

Figure 4 for Structure-Preserving Graph Representation Learning

Abstract:Though graph representation learning (GRL) has made significant progress, it is still a challenge to extract and embed the rich topological structure and feature information in an adequate way. Most existing methods focus on local structure and fail to fully incorporate the global topological structure. To this end, we propose a novel Structure-Preserving Graph Representation Learning (SPGRL) method, to fully capture the structure information of graphs. Specifically, to reduce the uncertainty and misinformation of the original graph, we construct a feature graph as a complementary view via k-Nearest Neighbor method. The feature graph can be used to contrast at node-level to capture the local relation. Besides, we retain the global topological structure information by maximizing the mutual information (MI) of the whole graph and feature embeddings, which is theoretically reduced to exchanging the feature embeddings of the feature and the original graphs to reconstruct themselves. Extensive experiments show that our method has quite superior performance on semi-supervised node classification task and excellent robustness under noise perturbation on graph structure or node features.

* Accepted by the IEEE International Conference on Data Mining (ICDM) 2022. arXiv admin note: text overlap with arXiv:2108.04822

Via

Access Paper or Ask Questions

Removing Rain Streaks via Task Transfer Learning

Aug 28, 2022

Yinglong Wang, Chao Ma, Jianzhuang Liu

Figure 1 for Removing Rain Streaks via Task Transfer Learning

Figure 2 for Removing Rain Streaks via Task Transfer Learning

Figure 3 for Removing Rain Streaks via Task Transfer Learning

Figure 4 for Removing Rain Streaks via Task Transfer Learning

Abstract:Due to the difficulty in collecting paired real-world training data, image deraining is currently dominated by supervised learning with synthesized data generated by e.g., Photoshop rendering. However, the generalization to real rainy scenes is usually limited due to the gap between synthetic and real-world data. In this paper, we first statistically explore why the supervised deraining models cannot generalize well to real rainy cases, and find the substantial difference of synthetic and real rainy data. Inspired by our studies, we propose to remove rain by learning favorable deraining representations from other connected tasks. In connected tasks, the label for real data can be easily obtained. Hence, our core idea is to learn representations from real data through task transfer to improve deraining generalization. We thus term our learning strategy as \textit{task transfer learning}. If there are more than one connected tasks, we propose to reduce model size by knowledge distillation. The pretrained models for the connected tasks are treated as teachers, all their knowledge is distilled to a student network, so that we reduce the model size, meanwhile preserve effective prior representations from all the connected tasks. At last, the student network is fine-tuned with minority of paired synthetic rainy data to guide the pretrained prior representations to remove rain. Extensive experiments demonstrate that proposed task transfer learning strategy is surprisingly successful and compares favorably with state-of-the-art supervised learning methods and apparently surpass other semi-supervised deraining methods on synthetic data. Particularly, it shows superior generalization over them to real-world scenes.

* 9 pages, 9 figures

Via

Access Paper or Ask Questions

Anti-Retroactive Interference for Lifelong Learning

Aug 27, 2022

Runqi Wang, Yuxiang Bao, Baochang Zhang, Jianzhuang Liu, Wentao Zhu, Guodong Guo

Figure 1 for Anti-Retroactive Interference for Lifelong Learning

Figure 2 for Anti-Retroactive Interference for Lifelong Learning

Figure 3 for Anti-Retroactive Interference for Lifelong Learning

Figure 4 for Anti-Retroactive Interference for Lifelong Learning

Abstract:Humans can continuously learn new knowledge. However, machine learning models suffer from drastic dropping in performance on previous tasks after learning new tasks. Cognitive science points out that the competition of similar knowledge is an important cause of forgetting. In this paper, we design a paradigm for lifelong learning based on meta-learning and associative mechanism of the brain. It tackles the problem from two aspects: extracting knowledge and memorizing knowledge. First, we disrupt the sample's background distribution through a background attack, which strengthens the model to extract the key features of each task. Second, according to the similarity between incremental knowledge and base knowledge, we design an adaptive fusion of incremental knowledge, which helps the model allocate capacity to the knowledge of different difficulties. It is theoretically analyzed that the proposed learning paradigm can make the models of different tasks converge to the same optimum. The proposed method is validated on the MNIST, CIFAR100, CUB200 and ImageNet100 datasets.

Via

Access Paper or Ask Questions

Low-Light Video Enhancement with Synthetic Event Guidance

Aug 23, 2022

Lin Liu, Junfeng An, Jianzhuang Liu, Shanxin Yuan, Xiangyu Chen, Wengang Zhou, Houqiang Li, Yanfeng Wang, Qi Tian

Figure 1 for Low-Light Video Enhancement with Synthetic Event Guidance

Figure 2 for Low-Light Video Enhancement with Synthetic Event Guidance

Figure 3 for Low-Light Video Enhancement with Synthetic Event Guidance

Figure 4 for Low-Light Video Enhancement with Synthetic Event Guidance

Abstract:Low-light video enhancement (LLVE) is an important yet challenging task with many applications such as photographing and autonomous driving. Unlike single image low-light enhancement, most LLVE methods utilize temporal information from adjacent frames to restore the color and remove the noise of the target frame. However, these algorithms, based on the framework of multi-frame alignment and enhancement, may produce multi-frame fusion artifacts when encountering extreme low light or fast motion. In this paper, inspired by the low latency and high dynamic range of events, we use synthetic events from multiple frames to guide the enhancement and restoration of low-light videos. Our method contains three stages: 1) event synthesis and enhancement, 2) event and image fusion, and 3) low-light enhancement. In this framework, we design two novel modules (event-image fusion transform and event-guided dual branch) for the second and third stages, respectively. Extensive experiments show that our method outperforms existing low-light video or single image enhancement approaches on both synthetic and real LLVE datasets.

Via

Access Paper or Ask Questions

CLIFF: Carrying Location Information in Full Frames into Human Pose and Shape Estimation

Aug 01, 2022

Zhihao Li, Jianzhuang Liu, Zhensong Zhang, Songcen Xu, Youliang Yan

Figure 1 for CLIFF: Carrying Location Information in Full Frames into Human Pose and Shape Estimation

Figure 2 for CLIFF: Carrying Location Information in Full Frames into Human Pose and Shape Estimation

Figure 3 for CLIFF: Carrying Location Information in Full Frames into Human Pose and Shape Estimation

Figure 4 for CLIFF: Carrying Location Information in Full Frames into Human Pose and Shape Estimation

Abstract:Top-down methods dominate the field of 3D human pose and shape estimation, because they are decoupled from human detection and allow researchers to focus on the core problem. However, cropping, their first step, discards the location information from the very beginning, which makes themselves unable to accurately predict the global rotation in the original camera coordinate system. To address this problem, we propose to Carry Location Information in Full Frames (CLIFF) into this task. Specifically, we feed more holistic features to CLIFF by concatenating the cropped-image feature with its bounding box information. We calculate the 2D reprojection loss with a broader view of the full frame, taking a projection process similar to that of the person projected in the image. Fed and supervised by global-location-aware information, CLIFF directly predicts the global rotation along with more accurate articulated poses. Besides, we propose a pseudo-ground-truth annotator based on CLIFF, which provides high-quality 3D annotations for in-the-wild 2D datasets and offers crucial full supervision for regression-based methods. Extensive experiments on popular benchmarks show that CLIFF outperforms prior arts by a significant margin, and reaches the first place on the AGORA leaderboard (the SMPL-Algorithms track). The code and data are available at https://github.com/huawei-noah/noah-research/tree/master/CLIFF.

* to be published in ECCV 2022 for an oral presentation

Via

Access Paper or Ask Questions

Self-Supervision Can Be a Good Few-Shot Learner

Jul 19, 2022

Yuning Lu, Liangjian Wen, Jianzhuang Liu, Yajing Liu, Xinmei Tian

Figure 1 for Self-Supervision Can Be a Good Few-Shot Learner

Figure 2 for Self-Supervision Can Be a Good Few-Shot Learner

Figure 3 for Self-Supervision Can Be a Good Few-Shot Learner

Figure 4 for Self-Supervision Can Be a Good Few-Shot Learner

Abstract:Existing few-shot learning (FSL) methods rely on training with a large labeled dataset, which prevents them from leveraging abundant unlabeled data. From an information-theoretic perspective, we propose an effective unsupervised FSL method, learning representations with self-supervision. Following the InfoMax principle, our method learns comprehensive representations by capturing the intrinsic structure of the data. Specifically, we maximize the mutual information (MI) of instances and their representations with a low-bias MI estimator to perform self-supervised pre-training. Rather than supervised pre-training focusing on the discriminable features of the seen classes, our self-supervised model has less bias toward the seen classes, resulting in better generalization for unseen classes. We explain that supervised pre-training and self-supervised pre-training are actually maximizing different MI objectives. Extensive experiments are further conducted to analyze their FSL performance with various training settings. Surprisingly, the results show that self-supervised pre-training can outperform supervised pre-training under the appropriate conditions. Compared with state-of-the-art FSL methods, our approach achieves comparable performance on widely used FSL benchmarks without any labels of the base classes.

* ECCV 2022, code: https://github.com/bbbdylan/unisiam

Via

Access Paper or Ask Questions