Xiaodan Du

Teaching an Agent to Sketch One Part at a Time

Mar 19, 2026

Xiaodan Du, Ruize Xu, David Yunis, Yael Vinker, Greg Shakhnarovich

Abstract: We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning procedure following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, containing rich part-level annotations for sketches, obtained using a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts with a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing the agent with visual feedback throughout the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.

SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction

Nov 25, 2024

Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu, Alexander H. Liu

Abstract: Sign language processing has traditionally relied on task-specific models, limiting the potential for transfer learning across tasks. We introduce SHuBERT (Sign Hidden-Unit BERT), a self-supervised transformer encoder that learns strong representations from approximately 1,000 hours of American Sign Language (ASL) video content. Inspired by the success of the HuBERT speech representation model, SHuBERT adapts masked prediction for multi-stream visual sign language input, learning to predict multiple targets corresponding to clustered hand, face, and body pose streams. SHuBERT achieves state-of-the-art performance across multiple benchmarks. On sign language translation, it outperforms prior methods trained on publicly available data on the How2Sign (+0.7 BLEU), OpenASL (+10.0 BLEU), and FLEURS-ASL (+0.3 BLEU) benchmarks. Similarly, for isolated sign language recognition, SHuBERT's accuracy surpasses that of specialized models on ASL-Citizen (+5%) and SEM-LEX (+20.6%), while coming close to them on WLASL2000 (-3%). Ablation studies confirm the contribution of each component of the approach.

* 17 pages

SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale

Jun 11, 2024

Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu

Abstract: A persistent challenge in sign language video processing, including the task of sign language to written language translation, is how to learn representations of sign language effectively and efficiently, preserving the important attributes of these languages while remaining invariant to irrelevant visual differences. Informed by the nature and linguistics of signed languages, our proposed method focuses on just the most relevant parts in a signing video: the face, hands, and body posture of the signer. However, instead of using pose estimation coordinates from off-the-shelf pose tracking models, which have inconsistent performance for hands and faces, we propose to learn the complex handshapes and rich facial expressions of sign languages in a self-supervised fashion. Our approach is based on learning from individual frames (rather than video sequences) and is therefore much more efficient than prior work on sign language pre-training. Compared to a recent model that established a new state of the art in sign language translation on the How2Sign dataset, our approach yields similar translation performance, using less than 3% of the compute.

Generative Models: What do they know? Do they know things? Let's find out!

Nov 28, 2023

Xiaodan Du, Nicholas Kolkin, Greg Shakhnarovich, Anand Bhattad

Abstract: Generative models have been shown to be capable of synthesizing highly detailed and realistic images. It is natural to suspect that they implicitly learn to model some image intrinsics such as surface normals, depth, or shadows. In this paper, we present compelling evidence that generative models indeed internally produce high-quality scene intrinsic maps. We introduce Intrinsic LoRA (I LoRA), a universal, plug-and-play approach that transforms any generative model into a scene intrinsic predictor, capable of extracting intrinsic scene maps directly from the original generator network without needing additional decoders or fully fine-tuning the original network. Our method employs a Low-Rank Adaptation (LoRA) of key feature maps, with newly learned parameters that make up less than 0.6% of the total parameters in the generative model. Optimized with a small set of labeled images, our model-agnostic approach adapts to various generative architectures, including Diffusion models, GANs, and Autoregressive models. We show that the scene intrinsic maps produced by our method compare well with, and in some cases surpass, those generated by leading supervised techniques.

* https://intrinsic-lora.github.io/
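
To give a rough sense of why a low-rank adaptation adds so few parameters, the sketch below counts the extra weights introduced when a frozen `d_out x d_in` matrix receives a rank-`r` update `B @ A`. The layer shapes, the rank, and the function names are hypothetical and chosen only for illustration; they are not the configuration used in the paper, and the fraction here is measured against the adapted layers only, not a full generative model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by a rank-`rank` update B @ A,
    where A is (rank x d_in) and B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_fraction(layers, rank: int) -> float:
    """Added parameters divided by the frozen parameters of the adapted layers."""
    base = sum(d_in * d_out for d_in, d_out in layers)
    added = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in layers)
    return added / base


# Hypothetical example: eight square 1024 x 1024 projection layers, rank 2.
layers = [(1024, 1024)] * 8
fraction = lora_fraction(layers, rank=2)
print(f"added parameters: {fraction:.4%} of the adapted weights")
```

Because the added count grows linearly in the layer width while the frozen count grows quadratically, the fraction shrinks as layers get wider, which is consistent with the small parameter budget the abstract reports.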

Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation

Dec 01, 2022

Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, Greg Shakhnarovich

Abstract: A diffusion model learns to predict a vector field of gradients. We propose to apply the chain rule to the learned gradients, and back-propagate the score of a diffusion model through the Jacobian of a differentiable renderer, which we instantiate to be a voxel radiance field. This setup aggregates 2D scores at multiple camera viewpoints into a 3D score, and repurposes a pretrained 2D model for 3D data generation. We identify a technical challenge of distribution mismatch that arises in this application, and propose a novel estimation mechanism to resolve it. We run our algorithm on several off-the-shelf diffusion image generative models, including the recently released Stable Diffusion trained on the large-scale LAION dataset.

* project page https://pals.ttic.edu/p/score-jacobian-chaining
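
The chain-rule step described in the abstract can be written schematically as follows. The notation is assumed here for illustration and is not taken from the paper: $\theta$ denotes the 3D parameters (the voxel radiance field), $\pi$ a camera viewpoint, $x_{\pi} = g(\theta, \pi)$ the image rendered from that viewpoint, $J_{\pi}$ the Jacobian of the renderer, and $s(x_{\pi})$ the 2D score predicted by the pretrained diffusion model. The aggregation is shown as a plain sum over sampled viewpoints; the paper's exact estimator, including its handling of the distribution mismatch, is not reproduced.

```latex
\nabla_{\theta} \log p(\theta)
  \;\approx\; \sum_{\pi} J_{\pi}^{\top}\, s(x_{\pi}),
\qquad
x_{\pi} = g(\theta, \pi),
\quad
J_{\pi} = \frac{\partial x_{\pi}}{\partial \theta}
```

Each term is a vector-Jacobian product, so in practice it can be obtained by back-propagating the 2D score through the differentiable renderer without forming $J_{\pi}$ explicitly.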

Text-Free Learning of a Natural Language Interface for Pretrained Face Generators

Sep 08, 2022

Xiaodan Du, Raymond A. Yeh, Nicholas Kolkin, Eli Shechtman, Greg Shakhnarovich

Abstract: We propose Fast text2StyleGAN, a natural language interface that adapts pre-trained GANs for text-guided human face synthesis. By leveraging recent advances in Contrastive Language-Image Pre-training (CLIP), our method requires no text data during training. Fast text2StyleGAN is formulated as a conditional variational autoencoder (CVAE) that provides extra control and diversity to the generated images at test time. Our model does not require re-training or fine-tuning of the GANs or CLIP when encountering new text prompts. In contrast to prior work, we do not rely on optimization at test time, making our method orders of magnitude faster. Empirically, on the FFHQ dataset, our method offers faster and more accurate generation of images from natural language descriptions with varying levels of detail compared to prior work.
