Xiang Xu

Hierarchical Neural Coding for Controllable CAD Model Generation

Jun 30, 2023
Xiang Xu, Pradeep Kumar Jayaraman, Joseph G. Lambourne, Karl D. D. Willis, Yasutaka Furukawa


This paper presents a novel generative model for Computer Aided Design (CAD) that 1) represents high-level design concepts of a CAD model as a three-level hierarchical tree of neural codes, from global part arrangement down to local curve geometry; and 2) controls the generation or completion of CAD models by specifying the target design using a code tree. Concretely, a novel variant of a vector quantized VAE with "masked skip connection" extracts design variations as neural codebooks at three levels. Two-stage cascaded auto-regressive transformers learn to generate code trees from incomplete CAD models and then complete CAD models following the intended design. Extensive experiments demonstrate superior performance on conventional tasks such as random generation while enabling novel interaction capabilities on conditional generation tasks. The code is available at https://github.com/samxuxiang/hnc-cad.
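
For readers less familiar with vector-quantized autoencoders, the sketch below illustrates the basic codebook-lookup step such a model builds on: continuous encoder features are snapped to their nearest codebook entries, and a straight-through estimator passes gradients back to the encoder. This is a minimal, hypothetical illustration; module names and sizes are placeholders rather than the paper's actual "masked skip connection" variant (see the released code for that).

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=512, code_dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z_e):  # z_e: (batch, seq, code_dim) encoder features
        # Squared L2 distance from each feature to every codebook entry.
        dists = ((z_e.unsqueeze(-2) - self.codebook.weight) ** 2).sum(-1)
        codes = dists.argmin(dim=-1)      # discrete code indices
        z_q = self.codebook(codes)        # quantized vectors
        # Codebook + commitment losses (standard VQ-VAE objective).
        loss = ((z_q - z_e.detach()) ** 2).mean() + self.beta * ((z_e - z_q.detach()) ** 2).mean()
        # Straight-through estimator: copy gradients from z_q to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes, loss

# Usage: quantize a batch of 8 sequences of 16 encoder features.
vq = VectorQuantizer()
z_q, codes, vq_loss = vq(torch.randn(8, 16, 256))
```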

* Accepted to ICML 2023. Project website at https://hnc-cad.github.io/ 

M3PT: A Multi-Modal Model for POI Tagging

Jun 16, 2023
Jingsong Yang, Guanzhou Han, Deqing Yang, Jingping Liu, Yanghua Xiao, Xiang Xu, Baohua Wu, Shenghua Ni


POI tagging aims to annotate a point of interest (POI) with informative tags, which facilitates many POI-related services such as search and recommendation. Most existing solutions neglect the significance of POI images and seldom fuse the textual and visual features of POIs, resulting in suboptimal tagging performance. In this paper, we propose a novel Multi-Modal Model for POI Tagging, namely M3PT, which achieves enhanced POI tagging by fusing the target POI's textual and visual features and precisely matching the resulting multi-modal representations. Specifically, we first devise a domain-adaptive image encoder (DIE) to obtain image embeddings aligned with the semantics of their gold tags. Then, in M3PT's text-image fusion module (TIF), the textual and visual representations are fully fused into the POIs' content embeddings for subsequent matching. In addition, we adopt a contrastive learning strategy to further bridge the gap between the representations of the different modalities. To evaluate tagging models' performance, we constructed two high-quality POI tagging datasets from the real-world business scenario of Ali Fliggy. On these datasets, we conducted extensive experiments that demonstrate our model's advantage over uni-modal and multi-modal baselines and verify the effectiveness of M3PT's key components, including DIE, TIF, and the contrastive learning strategy.
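
To make the modality-bridging idea concrete, here is a minimal sketch of a symmetric InfoNCE-style contrastive objective that pulls matched text/image pairs together against in-batch negatives. The encoders, dimensions, and temperature are placeholders, not M3PT's actual modules.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of row-aligned text/image embeddings.

    text_emb, image_emb: (batch, dim) embeddings of the same POIs.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Matched pairs lie on the diagonal; contrast them against in-batch negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with random placeholder embeddings.
loss = contrastive_alignment_loss(torch.randn(32, 256), torch.randn(32, 256))
```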

* Accepted by KDD 2023 

SkexGen: Autoregressive Generation of CAD Construction Sequences with Disentangled Codebooks

Jul 11, 2022
Xiang Xu, Karl D. D. Willis, Joseph G. Lambourne, Chin-Yi Cheng, Pradeep Kumar Jayaraman, Yasutaka Furukawa


We present SkexGen, a novel autoregressive generative model for computer-aided design (CAD) construction sequences containing sketch-and-extrude modeling operations. Our model utilizes distinct Transformer architectures to encode topological, geometric, and extrusion variations of construction sequences into disentangled codebooks. Autoregressive Transformer decoders generate CAD construction sequences sharing certain properties specified by the codebook vectors. Extensive experiments demonstrate that our disentangled codebook representation generates diverse and high-quality CAD models, enhances user control, and enables efficient exploration of the design space. The code is available at https://samxuxiang.github.io/skexgen.
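
The control enabled by disentangled codebooks can be pictured as follows: code vectors drawn from separate topology, geometry, and extrusion codebooks are concatenated into a conditioning prefix for an autoregressive decoder, so properties can be mixed and matched across models. The toy module below is a hypothetical sketch under that assumption, not the released SkexGen code; names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class CodeConditionedDecoder(nn.Module):
    """Toy autoregressive decoder conditioned on a prefix of codebook vectors."""
    def __init__(self, vocab_size=100, dim=128, n_layers=2, n_heads=4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.body = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, codes, tokens):
        # codes: (B, n_codes, dim) prefix from the disentangled codebooks.
        # tokens: (B, T) partial construction sequence generated so far.
        x = torch.cat([codes, self.tok(tokens)], dim=1)
        # Causal mask so each position only attends to the prefix and earlier tokens.
        mask = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        h = self.body(x, mask=mask)
        # Logits for the sequence positions (position t predicts token t+1 in training).
        return self.head(h[:, codes.size(1):])

# Mix codes from different sources (e.g., topology from model A, extrusion from model B).
topo, geom, extr = (torch.randn(1, 4, 128) for _ in range(3))
decoder = CodeConditionedDecoder()
logits = decoder(torch.cat([topo, geom, extr], dim=1), torch.randint(0, 100, (1, 10)))
```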

* Accepted to ICML 2022 

D3D-HOI: Dynamic 3D Human-Object Interactions from Videos

Aug 19, 2021
Xiang Xu, Hanbyul Joo, Greg Mori, Manolis Savva


We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions. Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints. Each manipulated object (e.g., microwave oven) is represented with a matching 3D parametric model. This data allows us to evaluate the reconstruction quality of articulated objects and establish a benchmark for this challenging task. In particular, we leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics. We evaluate this approach on our dataset, demonstrating that human-object relations can significantly reduce the ambiguity of articulated object reconstructions from challenging real-world videos. Code and dataset are available at https://github.com/facebookresearch/d3d-hoi.


Structured Outdoor Architecture Reconstruction by Exploration and Classification

Aug 18, 2021
Fuyang Zhang, Xiang Xu, Nelson Nauata, Yasutaka Furukawa


This paper presents an explore-and-classify framework for structured architectural reconstruction from an aerial image. Starting from a potentially imperfect building reconstruction produced by an existing algorithm, our approach 1) explores the space of building models by modifying the reconstruction via heuristic actions; 2) learns to classify the correctness of building models, generating classification labels from the ground truth; and 3) repeats. At test time, we iterate exploration and classification, seeking the result with the best classification score. We evaluate the approach using initial reconstructions from two baselines and two state-of-the-art reconstruction algorithms. Qualitative and quantitative evaluations demonstrate that our approach consistently improves the reconstruction quality from every initial reconstruction.
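
The test-time procedure reads like a simple greedy search: perturb the current reconstruction with heuristic edits, score each candidate with the learned correctness classifier, and keep the best. Below is a schematic sketch under assumed interfaces; the actual action set and classifier are the paper's, and `explore_and_classify` is a hypothetical name.

```python
def explore_and_classify(initial_model, actions, classifier, n_iters=100):
    """Greedy exploration: apply heuristic edits, keep the best-scoring building model.

    `actions` are callables mapping a building model to a modified copy;
    `classifier` returns a scalar correctness score. Both are assumed interfaces.
    """
    best_model, best_score = initial_model, classifier(initial_model)
    for _ in range(n_iters):
        improved = False
        for action in actions:
            candidate = action(best_model)
            score = classifier(candidate)
            if score > best_score:
                best_model, best_score, improved = candidate, score, True
        if not improved:  # local optimum under the current action set
            break
    return best_model, best_score

# Toy usage: "models" are numbers, actions nudge them, the classifier prefers values near 3.
actions = [lambda m: m + 0.5, lambda m: m - 0.5]
model, score = explore_and_classify(0.0, actions, lambda m: -abs(m - 3.0))
```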

* 2021 International Conference on Computer Vision (ICCV 2021) 

Learning to Recognize Patch-Wise Consistency for Deepfake Detection

Dec 16, 2020
Tianchen Zhao, Xiang Xu, Mingze Xu, Hui Ding, Yuanjun Xiong, Wei Xia


We propose to detect Deepfakes generated by face manipulation based on one of their fundamental features: the images are blended from patches drawn from multiple sources, each carrying distinct and persistent source features. In particular, we propose a novel representation learning approach for this task, called patch-wise consistency learning (PCL). It learns by measuring the consistency of image source features, resulting in representations with good interpretability and robustness to multiple forgery methods. We develop an inconsistency image generator (I2G) to generate training data for PCL and boost its robustness. We evaluate our approach on seven popular Deepfake detection datasets. Our model achieves superior detection accuracy and generalizes well to unseen generation methods. On average, our model outperforms the state of the art in terms of AUC by 2% and 8% in the in-dataset and cross-dataset evaluations, respectively.
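
The core signal can be sketched as a pairwise similarity map over patch features: patches from the same source should be mutually consistent, while blended-in patches stand out. A minimal, hypothetical version of computing such a consistency map is shown below; the actual PCL head and its supervision differ in detail.

```python
import torch
import torch.nn.functional as F

def patch_consistency_map(feat):
    """feat: (B, C, H, W) patch-level source features from a backbone.

    Returns a (B, H*W, H*W) map of cosine similarities between all patch pairs;
    rows/columns with low similarity indicate patches inconsistent with the rest.
    """
    f = F.normalize(feat.flatten(2), dim=1)     # (B, C, H*W), unit norm per patch
    return torch.einsum("bci,bcj->bij", f, f)   # pairwise cosine similarity

# Usage on random features standing in for a backbone's output.
consistency = patch_consistency_map(torch.randn(2, 64, 16, 16))  # (2, 256, 256)
```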

* 13 pages, 7 figures 

MCMI: Multi-Cycle Image Translation with Mutual Information Constraints

Jul 06, 2020
Xiang Xu, Megha Nawhal, Greg Mori, Manolis Savva


We present a mutual information-based framework for unsupervised image-to-image translation. Our MCMI approach treats single-cycle image translation models as modules that can be used recurrently in a multi-cycle translation setting, where the translation process is bounded by mutual information constraints between the input and output images. The proposed mutual information constraints can improve cross-domain mappings by optimizing out translation functions that fail to satisfy the Markov property during image translation. We show that models trained with MCMI produce higher-quality images and learn more semantically relevant mappings than state-of-the-art image translation methods. The MCMI framework can be applied to existing unpaired image-to-image translation models with minimal modifications. Qualitative experiments and a perceptual study demonstrate the image quality improvements and the generality of our approach across several backbone models and a variety of image datasets.
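
The multi-cycle idea can be sketched as a loop that reuses a single-cycle translator and accumulates a mutual-information term at each hop. In the sketch below, `t_ab`, `t_ba`, and `mi_lower_bound` are assumed callables (e.g., a critic-based MI lower bound); the paper's exact estimator and training setup may differ.

```python
def multi_cycle_translate(x_a, t_ab, t_ba, mi_lower_bound, n_cycles=2):
    """Run a single-cycle translator recurrently and accumulate MI constraints.

    t_ab, t_ba: translators between domains A and B (assumed callables).
    mi_lower_bound: estimator of mutual information between two image batches
    (an assumption here; any differentiable lower bound could stand in).
    """
    mi_terms = []
    x = x_a
    for _ in range(n_cycles):
        y = t_ab(x)                       # A -> B
        mi_terms.append(mi_lower_bound(x, y))
        x = t_ba(y)                       # B -> A, input to the next cycle
        mi_terms.append(mi_lower_bound(y, x))
    # Maximizing these bounds keeps each translation informative about its input.
    return x, sum(mi_terms) / len(mi_terms)
```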


On Improving the Generalization of Face Recognition in the Presence of Occlusions

Jun 11, 2020
Xiang Xu, Nikolaos Sarafianos, Ioannis A. Kakadiaris


In this paper, we address a key limitation of existing 2D face recognition methods: robustness to occlusions. To accomplish this task, we systematically analyzed the impact of facial attributes on the performance of a state-of-the-art face recognition method and, through extensive experimentation, quantitatively analyzed the performance degradation under different types of occlusion. Our proposed Occlusion-aware face REcOgnition (OREO) approach learned discriminative facial templates despite the presence of such occlusions. First, an attention mechanism was proposed that extracted local identity-related regions. The local features were then aggregated with the global representations to form a single template. Second, a simple yet effective training strategy was introduced to balance the non-occluded and occluded facial images. Extensive experiments demonstrated that OREO improved the generalization ability of face recognition under occlusions by 10.17% in a single-image-based setting and outperformed the baseline by approximately 2% in rank-1 accuracy in an image-set-based scenario.
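
As an illustration of the local-plus-global aggregation described above, the sketch below pools spatial features with a learned attention map and concatenates the result with a global descriptor to form a single template. Layer names and dimensions are placeholders, not OREO's actual architecture.

```python
import torch
import torch.nn as nn

class AttentionAggregator(nn.Module):
    """Pool local features with learned attention and fuse with a global descriptor."""
    def __init__(self, local_dim=512, global_dim=512, out_dim=512):
        super().__init__()
        self.attn = nn.Conv2d(local_dim, 1, kernel_size=1)  # per-location attention logit
        self.fc = nn.Linear(local_dim + global_dim, out_dim)

    def forward(self, local_feat, global_feat):
        # local_feat: (B, C, H, W) spatial features; global_feat: (B, D) pooled descriptor.
        w = torch.softmax(self.attn(local_feat).flatten(2), dim=-1)  # (B, 1, H*W)
        pooled = (local_feat.flatten(2) * w).sum(dim=-1)             # (B, C) attended local feature
        return self.fc(torch.cat([pooled, global_feat], dim=1))      # single face template

template = AttentionAggregator()(torch.randn(4, 512, 7, 7), torch.randn(4, 512))
```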

* Technical Report 

On Improving Temporal Consistency for Online Face Liveness Detection

Jun 11, 2020
Xiang Xu, Yuanjun Xiong, Wei Xia


In this paper, we focus on improving an online face liveness detection system to enhance the security of the downstream face recognition system. Most existing frame-based methods suffer from prediction inconsistency across time. To address this issue, a simple yet effective solution based on temporal consistency is proposed. Specifically, in the training stage, a temporal self-supervision loss and a class consistency loss are added to the softmax cross-entropy loss to integrate the temporal consistency constraint. In the deployment stage, a training-free, non-parametric uncertainty estimation module is developed to smooth the predictions adaptively. Beyond the common evaluation approach, a video-segment-based evaluation is proposed to accommodate more practical scenarios. Extensive experiments demonstrate that our solution is more robust against several presentation attacks in various scenarios and significantly outperforms the state of the art on multiple public datasets by at least 40% in terms of ACER. Moreover, with much lower computational complexity (33% fewer FLOPs), it is well suited to low-latency online applications.
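
As a simple illustration of penalizing prediction drift within a clip, the sketch below measures how far each frame's prediction deviates from the clip average. It is a minimal, hypothetical loss; the paper's temporal self-supervision and class consistency losses are defined differently in detail.

```python
import torch

def temporal_consistency_loss(frame_logits):
    """Penalize liveness predictions that drift across frames of one clip.

    frame_logits: (B, T, num_classes) per-frame logits for B clips of T frames.
    """
    probs = frame_logits.softmax(dim=-1)
    clip_mean = probs.mean(dim=1, keepdim=True)  # (B, 1, num_classes) clip-level average
    # Mean squared deviation of each frame's prediction from the clip average.
    return ((probs - clip_mean) ** 2).mean()

loss = temporal_consistency_loss(torch.randn(8, 16, 2))
```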

* Technical Report 

Adversarial Representation Learning for Text-to-Image Matching

Aug 28, 2019
Nikolaos Sarafianos, Xiang Xu, Ioannis A. Kakadiaris


For many computer vision applications, such as image captioning, visual question answering, and person search, learning discriminative feature representations at both the image and text levels is an essential yet challenging problem. Its challenges originate from the large word variance in the text domain as well as the difficulty of accurately measuring the distance between the features of the two modalities. Most prior work focuses on the latter challenge by introducing loss functions that help the network learn better feature representations, but fails to account for the complexity of the textual input. With that in mind, we introduce TIMAM: a Text-Image Modality Adversarial Matching approach that learns modality-invariant feature representations using adversarial and cross-modal matching objectives. In addition, we demonstrate that BERT, a publicly available language model that extracts word embeddings, can be successfully applied to the text-to-image matching domain. The proposed approach achieves state-of-the-art cross-modal matching performance on four widely used, publicly available datasets, resulting in absolute improvements ranging from 2% to 5% in rank-1 accuracy.
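
To illustrate the adversarial ingredient, the sketch below has a small discriminator try to tell which modality an embedding came from, while a gradient-reversal layer pushes the (not shown) encoders toward modality-invariant features. This is an illustrative module, not TIMAM's exact architecture; names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad

class ModalityDiscriminator(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, feats):
        # Reversed gradients make the upstream encoders work to fool this classifier.
        return self.net(GradReverse.apply(feats))  # logits: image vs. text modality

# Usage: encoders (not shown) feed image and text embeddings through the discriminator.
disc = ModalityDiscriminator()
img_f, txt_f = torch.randn(16, 512), torch.randn(16, 512)
logits = disc(torch.cat([img_f, txt_f], dim=0))
labels = torch.cat([torch.zeros(16, dtype=torch.long), torch.ones(16, dtype=torch.long)])
adv_loss = nn.functional.cross_entropy(logits, labels)
```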

* To appear in ICCV 2019 