Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Teng Zhou

COLLAR: Cascaded Object-Level Latent Refinement for High-Fidelity Conditional Generation

May 31, 2026

Xinlong Zhang, Jia Wei, Xiaoyu Zhang, Teng Zhou, Chengyu Lin, Yongchuan Tang

Abstract:Achieving high-fidelity object-level control in Diffusion Transformers remains a significant challenge despite the introduction of structural priors like depth and Canny maps. Current object-level conditional generation methods frequently suffer from visual artifacts and struggle to maintain precise control over objects within small localized regions. To address these limitations, we propose Cascaded Object-Level Latent Refinement (COLLAR), a training-free framework that progressively optimizes object-level features via the Field-of-View (FoV) expansion. First, we propose the Cross-Scale Semantic Alignment (CSSA) module to address spatial-semantic gaps by injecting object-level features into extended-FoV branches via attention mechanisms. To further optimize these features, the Cyclic Feature Injection (CFI) module introduces a reciprocal background feedback mechanism. It leverages a frequency-based adaptive strategy to selectively update the global backbone with context-aligned local information. Finally, the extended-FoV branch serves as a hub for feature optimization, ensuring that object-level features are integrated into the global generation process without compromising final image quality. Extensive experiments on the COCO-MIG and COCO-POS benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods across semantic alignment, image quality, and spatial fidelity.

Via

Access Paper or Ask Questions

VisAlgae 2023: A Dataset and Challenge for Algae Detection in Microscopy Images

May 27, 2025

Mingxuan Sun, Juntao Jiang, Zhiqiang Yang, Shenao Kong, Jiamin Qi, Jianru Shang, Shuangling Luo, Wanfa Sun, Tianyi Wang, Yanqi Wang(+14 more)

Abstract:Microalgae, vital for ecological balance and economic sectors, present challenges in detection due to their diverse sizes and conditions. This paper summarizes the second "Vision Meets Algae" (VisAlgae 2023) Challenge, aiming to enhance high-throughput microalgae cell detection. The challenge, which attracted 369 participating teams, includes a dataset of 1000 images across six classes, featuring microalgae of varying sizes and distinct features. Participants faced tasks such as detecting small targets, handling motion blur, and complex backgrounds. The top 10 methods, outlined here, offer insights into overcoming these challenges and maximizing detection accuracy. This intersection of algae research and computer vision offers promise for ecological understanding and technological advancement. The dataset can be accessed at: https://github.com/juntaoJianggavin/Visalgae2023/.

Via

Access Paper or Ask Questions

Path and Bone-Contour Regularized Unpaired MRI-to-CT Translation

May 06, 2025

Teng Zhou, Jax Luo, Yuping Sun, Yiheng Tan, Shun Yao, Nazim Haouchine, Scott Raymond

Figure 1 for Path and Bone-Contour Regularized Unpaired MRI-to-CT Translation

Figure 2 for Path and Bone-Contour Regularized Unpaired MRI-to-CT Translation

Figure 3 for Path and Bone-Contour Regularized Unpaired MRI-to-CT Translation

Figure 4 for Path and Bone-Contour Regularized Unpaired MRI-to-CT Translation

Abstract:Accurate MRI-to-CT translation promises the integration of complementary imaging information without the need for additional imaging sessions. Given the practical challenges associated with acquiring paired MRI and CT scans, the development of robust methods capable of leveraging unpaired datasets is essential for advancing the MRI-to-CT translation. Current unpaired MRI-to-CT translation methods, which predominantly rely on cycle consistency and contrastive learning frameworks, frequently encounter challenges in accurately translating anatomical features that are highly discernible on CT but less distinguishable on MRI, such as bone structures. This limitation renders these approaches less suitable for applications in radiation therapy, where precise bone representation is essential for accurate treatment planning. To address this challenge, we propose a path- and bone-contour regularized approach for unpaired MRI-to-CT translation. In our method, MRI and CT images are projected to a shared latent space, where the MRI-to-CT mapping is modeled as a continuous flow governed by neural ordinary differential equations. The optimal mapping is obtained by minimizing the transition path length of the flow. To enhance the accuracy of translated bone structures, we introduce a trainable neural network to generate bone contours from MRI and implement mechanisms to directly and indirectly encourage the model to focus on bone contours and their adjacent regions. Evaluations conducted on three datasets demonstrate that our method outperforms existing unpaired MRI-to-CT translation approaches, achieving lower overall error rates. Moreover, in a downstream bone segmentation task, our approach exhibits superior performance in preserving the fidelity of bone structures. Our code is available at: https://github.com/kennysyp/PaBoT.

Via

Access Paper or Ask Questions

PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Nov 24, 2024

Teng Zhou, Xiaoyu Zhang, Yongchuan Tang

Figure 1 for PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Figure 2 for PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Figure 3 for PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Figure 4 for PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

Abstract:Panoramic Image Generation has emerged as an important task in image generation, driven by growing demands for large-scale visuals in creative and technical applications. While diffusion models have dominated this field, they face inherent limitations, including the multilevel-coherence challenge and implementation complexity, leading to suboptimal outcomes. In this paper, we introduce PanoLlama, a novel framework that redefines panoramic image generation as a next-token prediction task. Building on the pre-trained LlamaGen architecture, we generate images in an autoregressive manner and develop an expansion strategy to handle size limitations. This method aligns with the image token structure in a crop-wise and training-free manner, resulting in high-quality panoramas with minimal seams and maximum scalability. PanoLlama demonstrates its effectiveness and versatility in our experiments, achieving the best overall performance while offering flexibility for multi-scale, multi-layout, and multi-guidance generation. It overcomes the challenges that diffusion-based methods fail to address, setting a new paradigm for panoramic image generation tasks. Code is available at https://github.com/0606zt/PanoLlama.

Via

Access Paper or Ask Questions

Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation

Oct 24, 2024

Xiaoyu Zhang, Teng Zhou, Xinlong Zhang, Jia Wei, Yongchuan Tang

Figure 1 for Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation

Figure 2 for Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation

Figure 3 for Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation

Figure 4 for Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation

Abstract:Diffusion models have recently gained recognition for generating diverse and high-quality content, especially in the domain of image synthesis. These models excel not only in creating fixed-size images but also in producing panoramic images. However, existing methods often struggle with spatial layout consistency when producing high-resolution panoramas, due to the lack of guidance of the global image layout. In this paper, we introduce the Multi-Scale Diffusion (MSD) framework, a plug-and-play module that extends the existing panoramic image generation framework to multiple resolution levels. By utilizing gradient descent techniques, our method effectively incorporates structural information from low-resolution images into high-resolution outputs. A comprehensive evaluation of the proposed method was conducted, comparing it with the prior works in qualitative and quantitative dimensions. The evaluation results demonstrate that our method significantly outperforms others in generating coherent high-resolution panoramas.

Via

Access Paper or Ask Questions

TwinDiffusion: Enhancing Coherence and Efficiency in Panoramic Image Generation with Diffusion Models

Apr 30, 2024

Teng Zhou, Yongchuan Tang

Abstract:Diffusion models have emerged as effective tools for generating diverse and high-quality content. However, their capability in high-resolution image generation, particularly for panoramic images, still faces challenges such as visible seams and incoherent transitions. In this paper, we propose TwinDiffusion, an optimized framework designed to address these challenges through two key innovations: Crop Fusion for quality enhancement and Cross Sampling for efficiency optimization. We introduce a training-free optimizing stage to refine the similarity of the adjacent image areas, as well as an interleaving sampling strategy to yield dynamic patches during the cropping process. A comprehensive evaluation is conducted to compare TwinDiffusion with the existing methods, considering factors including coherence, fidelity, compatibility, and efficiency. The results demonstrate the superior performance of our approach in generating seamless and coherent panoramas, setting a new standard in quality and efficiency for panoramic image generation.

Via

Access Paper or Ask Questions

SketchBodyNet: A Sketch-Driven Multi-faceted Decoder Network for 3D Human Reconstruction

Oct 10, 2023

Fei Wang, Kongzhang Tang, Hefeng Wu, Baoquan Zhao, Hao Cai, Teng Zhou

Figure 1 for SketchBodyNet: A Sketch-Driven Multi-faceted Decoder Network for 3D Human Reconstruction

Figure 2 for SketchBodyNet: A Sketch-Driven Multi-faceted Decoder Network for 3D Human Reconstruction

Figure 3 for SketchBodyNet: A Sketch-Driven Multi-faceted Decoder Network for 3D Human Reconstruction

Figure 4 for SketchBodyNet: A Sketch-Driven Multi-faceted Decoder Network for 3D Human Reconstruction

Abstract:Reconstructing 3D human shapes from 2D images has received increasing attention recently due to its fundamental support for many high-level 3D applications. Compared with natural images, freehand sketches are much more flexible to depict various shapes, providing a high potential and valuable way for 3D human reconstruction. However, such a task is highly challenging. The sparse abstract characteristics of sketches add severe difficulties, such as arbitrariness, inaccuracy, and lacking image details, to the already badly ill-posed problem of 2D-to-3D reconstruction. Although current methods have achieved great success in reconstructing 3D human bodies from a single-view image, they do not work well on freehand sketches. In this paper, we propose a novel sketch-driven multi-faceted decoder network termed SketchBodyNet to address this task. Specifically, the network consists of a backbone and three separate attention decoder branches, where a multi-head self-attention module is exploited in each decoder to obtain enhanced features, followed by a multi-layer perceptron. The multi-faceted decoders aim to predict the camera, shape, and pose parameters, respectively, which are then associated with the SMPL model to reconstruct the corresponding 3D human mesh. In learning, existing 3D meshes are projected via the camera parameters into 2D synthetic sketches with joints, which are combined with the freehand sketches to optimize the model. To verify our method, we collect a large-scale dataset of about 26k freehand sketches and their corresponding 3D meshes containing various poses of human bodies from 14 different angles. Extensive experimental results demonstrate our SketchBodyNet achieves superior performance in reconstructing 3D human meshes from freehand sketches.

* 9 pages, to appear in Pacific Graphics 2023

Via

Access Paper or Ask Questions

CNN in CT Image Segmentation: Beyound Loss Function for Expoliting Ground Truth Images

Apr 08, 2020

Youyi Song, Zhen Yu, Teng Zhou, Jeremy Yuen-Chun Teoh, Baiying Lei, Kup-Sze Choi, Jing Qin

Figure 1 for CNN in CT Image Segmentation: Beyound Loss Function for Expoliting Ground Truth Images

Figure 2 for CNN in CT Image Segmentation: Beyound Loss Function for Expoliting Ground Truth Images

Figure 3 for CNN in CT Image Segmentation: Beyound Loss Function for Expoliting Ground Truth Images

Figure 4 for CNN in CT Image Segmentation: Beyound Loss Function for Expoliting Ground Truth Images

Abstract:Exploiting more information from ground truth (GT) images now is a new research direction for further improving CNN's performance in CT image segmentation. Previous methods focus on devising the loss function for fulfilling such a purpose. However, it is rather difficult to devise a general and optimization-friendly loss function. We here present a novel and practical method that exploits GT images beyond the loss function. Our insight is that feature maps of two CNNs trained respectively on GT and CT images should be similar on some metric space, because they both are used to describe the same objects for the same purpose. We hence exploit GT images by enforcing such two CNNs' feature maps to be consistent. We assess the proposed method on two data sets, and compare its performance to several competitive methods. Extensive experimental results show that the proposed method is effective, outperforming all the compared methods.

* 4 pages, 3 figures, and having been accepted by ISBI 2020

Via

Access Paper or Ask Questions