Ying Tai

EasySplat: View-Adaptive Learning makes 3D Gaussian Splatting Easy

Jan 02, 2025

Ao Gao, Luosong Guo, Tao Chen, Zhao Wang, Ying Tai, Jian Yang, Zhenyu Zhang

Abstract: 3D Gaussian Splatting (3DGS) techniques have achieved satisfactory 3D scene representation. Despite their impressive performance, they face challenges due to the limitations of structure-from-motion (SfM) methods in acquiring accurate scene initialization and the inefficiency of the densification strategy. In this paper, we introduce a novel framework, EasySplat, to achieve high-quality 3DGS modeling. Instead of using SfM for scene initialization, we employ a novel method to unleash the power of large-scale pointmap approaches. Specifically, we propose an efficient grouping strategy based on view similarity, and use robust pointmap priors to obtain high-quality point clouds and camera poses for 3D scene initialization. After obtaining a reliable scene structure, we propose a novel densification approach that adaptively splits Gaussian primitives based on the average shape of neighboring Gaussian ellipsoids, using a KNN scheme. In this way, the proposed method tackles the limitations of both initialization and optimization, leading to efficient and accurate 3DGS modeling. Extensive experiments demonstrate that EasySplat outperforms the current state-of-the-art (SOTA) in novel view synthesis.

* 6 pages, 5 figures
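
A minimal sketch of what a KNN-based split test could look like, assuming each Gaussian is summarized by a center and three per-axis scales. The abstract does not give the actual criterion, so the function names, the brute-force neighbor search, and the `k` and `ratio` parameters below are illustrative assumptions, not the EasySplat implementation.

```python
import math

def knn_indices(points, i, k):
    """Return indices of the k points nearest to points[i] (brute force)."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    dists.sort()
    return [j for _, j in dists[:k]]

def select_for_split(positions, scales, k=3, ratio=1.5):
    """Flag Gaussians that are large relative to their neighborhood.

    Assumed rule (hypothetical): a Gaussian is flagged when its largest
    axis scale exceeds `ratio` times the mean largest axis scale of its
    k nearest neighbors. Requires at least two Gaussians and k >= 1.
    """
    flagged = []
    for i in range(len(positions)):
        neighbors = knn_indices(positions, i, k)
        mean_extent = sum(max(scales[j]) for j in neighbors) / len(neighbors)
        if max(scales[i]) > ratio * mean_extent:
            flagged.append(i)
    return flagged
```

The point of a neighborhood-relative test like this is that the threshold adapts to local scene density instead of relying on one global size cutoff; a practical version would replace the brute-force search with a spatial index.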

Guided Real Image Dehazing using YCbCr Color Space

Dec 24, 2024

Wenxuan Fang, Junkai Fan, Yu Zheng, Jiangwei Weng, Ying Tai, Jun Li

Abstract: Image dehazing, particularly with learning-based methods, has gained significant attention due to its importance in real-world applications. However, relying solely on the RGB color space often falls short, frequently leaving residual haze. This arises from two main issues: the difficulty in obtaining clear textural features from hazy RGB images and the complexity of acquiring real haze/clean image pairs outside controlled environments like smoke-filled scenes. To address these issues, we first propose a novel Structure Guided Dehazing Network (SGDN) that leverages the superior structural properties of YCbCr features over RGB. It comprises two key modules: Bi-Color Guidance Bridge (BGB) and Color Enhancement Module (CEM). BGB integrates a phase integration module and an interactive attention module, utilizing the rich texture features of the YCbCr space to guide the RGB space, thereby recovering clearer features in both frequency and spatial domains. To maintain tonal consistency, CEM further enhances the color perception of RGB features by aggregating YCbCr channel information. Furthermore, for effective supervised learning, we introduce a Real-World Well-Aligned Haze (RW$^2$AH) dataset, which includes a diverse range of scenes from various geographical regions and climate conditions. Experimental results demonstrate that our method surpasses existing state-of-the-art methods across multiple real-world smoke/haze datasets. Code and dataset: https://github.com/fiwy0527/AAAI25_SGDN.

* AAAI 2025
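
For readers unfamiliar with the color space, the sketch below converts an 8-bit RGB pixel to YCbCr using the full-range BT.601 coefficients used by JPEG/JFIF. Which YCbCr variant SGDN uses is not stated in the abstract, so this choice is an assumption, and the function name is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range BT.601 YCbCr (JPEG/JFIF).

    Returns unclamped floats; a real pipeline would clip to [0, 255].
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b                 # luma
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b     # blue-difference chroma
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b     # red-difference chroma
    return y, cb, cr
```

Any neutral gray pixel maps to Cb = Cr = 128 (up to floating-point rounding), so brightness structure is carried by Y alone while color is isolated in the two chroma channels.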

Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking

Dec 20, 2024

Xiantao Hu, Ying Tai, Xu Zhao, Chen Zhao, Zhenyu Zhang, Jun Li, Bineng Zhong, Jian Yang

Abstract: Multimodal tracking has garnered widespread attention as a result of its ability to effectively address the inherent limitations of traditional RGB tracking. However, existing multimodal trackers mainly focus on the fusion and enhancement of spatial features or merely leverage the sparse temporal relationships between video frames. These approaches do not fully exploit the temporal correlations in multimodal videos, making it difficult to capture the dynamic changes and motion information of targets in complex scenarios. To alleviate this problem, we propose a unified multimodal spatial-temporal tracking approach named STTrack. In contrast to previous paradigms that rely solely on updating reference information, we introduce a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target. Furthermore, at the spatial level, we introduce the Mamba fusion and background suppression interactive (BSI) modules. These modules establish a dual-stage mechanism for coordinating information interaction and fusion between modalities. Extensive comparisons on five benchmark datasets illustrate that STTrack achieves state-of-the-art performance across various multimodal tracking scenarios. Code is available at: https://github.com/NJU-PCALab/STTrack.

StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors

Dec 16, 2024

Xiaokun Sun, Zeyu Cai, Zhenyu Zhang, Ying Tai, Jian Yang

Abstract: While a haircut conveys a distinct personality, existing avatar generation methods fail to model practical hair due to their general or entangled representations. We propose StrandHead, a novel text-to-3D head avatar generation method capable of generating disentangled 3D hair with a strand representation. Without using 3D data for supervision, we demonstrate that realistic hair strands can be generated from prompts by distilling 2D generative diffusion models. To this end, we propose a series of reliable priors on shape initialization, geometric primitives, and statistical haircut features, leading to stable optimization and text-aligned performance. Extensive experiments show that StrandHead achieves state-of-the-art realism and diversity of generated 3D heads and hair. The generated 3D hair can also be easily implemented in the Unreal Engine for physical simulation and other applications. The code will be available at https://xiaokunsun.github.io/StrandHead.github.io.

* Project page: https://xiaokunsun.github.io/StrandHead.github.io

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Dec 12, 2024

Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai

Abstract: Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient details, hallucinations and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video captioning for the first time. Based on this scheme, we design an auxiliary model cluster to convert the original video into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.

Learning to Decouple the Lights for 3D Face Texture Modeling

Dec 11, 2024

Tianxin Huang, Zhenyu Zhang, Ying Tai, Gim Hee Lee

Abstract: Existing research has made impressive strides in reconstructing human facial shapes and textures from images with well-illuminated faces and minimal external occlusions. Nevertheless, it remains challenging to recover accurate facial textures from scenarios with complicated illumination affected by external occlusions, e.g. a face that is partially obscured by items such as a hat. Existing works based on the assumption of single and uniform illumination cannot correctly process these data. In this work, we introduce a novel approach to model 3D facial textures under such unnatural illumination. Instead of assuming single illumination, our framework learns to imitate the unnatural illumination as a composition of multiple separate light conditions combined with learned neural representations, named Light Decoupling. According to experiments on both single images and video sequences, we demonstrate the effectiveness of our approach in modeling facial textures under challenging illumination affected by occlusions. Please check https://tianxinhuang.github.io/projects/Deface for our videos and code.

* Accepted by NeurIPS 2024

Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement

Nov 15, 2024

Zhennan Chen, Yajie Li, Haofan Wang, Zhibo Chen, Zhengkai Jiang, Jun Li, Qian Wang, Jian Yang, Ying Tai

Abstract: Regional prompting, or compositional generation, which enables fine-grained spatial control, has gained increasing attention for its practicality in real-world applications. However, previous methods either introduce additional trainable modules, and are thus only applicable to specific models, or manipulate score maps within cross-attention layers using attention masks, resulting in limited control strength when the number of regions increases. To handle these limitations, we present RAG, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. RAG decouples multi-region generation into two sub-tasks: the construction of individual regions (Regional Hard Binding), which ensures that each regional prompt is properly executed, and the overall detail refinement over regions (Regional Soft Refinement), which dismisses the visual boundaries and enhances adjacent interactions. Furthermore, RAG makes repainting feasible, where users can modify specific unsatisfactory regions in the last generation while keeping all other regions unchanged, without relying on additional inpainting models. Our approach is tuning-free and applicable to other frameworks as an enhancement to the prompt-following property. Quantitative and qualitative experiments demonstrate that RAG achieves superior performance in attribute binding and object relationships compared with previous tuning-free methods.

* Code is available at https://github.com/NJU-PCALab/RAG-Diffusion

HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation

Oct 10, 2024

Shanyan Guan, Yanhao Ge, Ying Tai, Jian Yang, Wei Li, Mingyu You

Abstract: Recent advancements in text-to-image diffusion models have shown remarkable creative capabilities with textual prompts, but generating personalized instances based on specific subjects, known as subject-driven generation, remains challenging. To tackle this issue, we present a new hybrid framework called HybridBooth, which merges the benefits of optimization-based and direct-regression methods. HybridBooth operates in two stages: the Word Embedding Probe, which generates a robust initial word embedding using a fine-tuned encoder, and the Word Embedding Refinement, which further adapts the encoder to specific subject images by optimizing key parameters. This approach allows for effective and fast inversion of visual concepts into a textual embedding, even from a single image, while maintaining the model's generalization capabilities.

* ECCV 2024, the project page: https://sites.google.com/view/hybridbooth

Barbie: Text to Barbie-Style 3D Avatars

Aug 17, 2024

Xiaokun Sun, Zhenyu Zhang, Ying Tai, Qian Wang, Hao Tang, Zili Yi, Jian Yang

Abstract: Recent advances in text-guided 3D avatar generation have made substantial progress by distilling knowledge from diffusion models. Despite the plausible generated appearance, existing methods cannot achieve fine-grained disentanglement or high-fidelity modeling between inner body and outfit. In this paper, we propose Barbie, a novel framework for generating 3D avatars that can be dressed in diverse and high-quality Barbie-like garments and accessories. Instead of relying on a holistic model, Barbie achieves fine-grained disentanglement on avatars by semantic-aligned separated models for human body and outfits. These disentangled 3D representations are then optimized by different expert models to guarantee the domain-specific fidelity. To balance geometry diversity and reasonableness, we propose a series of losses for template-preserving and human-prior evolving. The final avatar is enhanced by unified texture refinement for superior texture consistency. Extensive experiments demonstrate that Barbie outperforms existing methods in both dressed human and outfit generation, supporting flexible apparel combination and animation. The code will be released for research purposes. Our project page is: https://2017211801.github.io/barbie.github.io/.

* 9 pages, 7 figures

From Words to Worth: Newborn Article Impact Prediction with LLM

Aug 07, 2024

Penghai Zhao, Qinghua Xing, Kairan Dou, Jinyu Tian, Ying Tai, Jian Yang, Ming-Ming Cheng, Xiang Li

Abstract: As the academic landscape expands, the challenge of efficiently identifying potentially high-impact articles among the vast number of newly published works becomes critical. This paper introduces a promising approach, leveraging the capabilities of fine-tuned LLMs to predict the future impact of newborn articles solely based on titles and abstracts. Moving beyond traditional methods heavily reliant on external information, the proposed method discerns the shared semantic features of highly impactful papers from a large collection of title-abstract and potential impact pairs. These semantic features are further utilized to regress an improved metric, TNCSI_SP, which has been endowed with value, field, and time normalization properties. Additionally, a comprehensive dataset has been constructed and released for fine-tuning the LLM, containing over 12,000 entries with corresponding titles, abstracts, and TNCSI_SP. The quantitative results, with an NDCG@20 of 0.901, demonstrate that the proposed approach achieves state-of-the-art performance in predicting the impact of newborn articles when compared to competitive counterparts. Finally, we present a real-world application for predicting the impact of newborn journal articles to demonstrate its noteworthy practical value. Overall, our findings challenge existing paradigms and propose a shift towards a more content-focused prediction of academic impact, offering new insights for assessing newborn article impact.

* 7 pages for main sections, plus 3 additional pages for appendices. Code and dataset are released at https://sway.cloud.microsoft/KOH09sPR21Ubojbc
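
As a reference for the NDCG@20 figure quoted in the abstract, here is a minimal sketch of NDCG@k using the common linear-gain, log2-discount definition. The abstract does not say which gain variant the paper uses, so that choice and the function names are assumptions.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """NDCG@k for true relevance scores listed in predicted rank order.

    Assumes non-negative relevances; returns 0.0 when the ideal DCG is zero.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` would hold the ground-truth impact scores of the articles, ordered by the model's predicted ranking; a value of 1.0 means the top k items are ordered as well as they possibly could be.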
