Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jing Tan

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

Apr 10, 2025

Mengchen Zhang, Tong Wu, Jing Tan, Ziwei Liu, Gordon Wetzstein, Dahua Lin

Abstract:Camera trajectory design plays a crucial role in video production, serving as a fundamental tool for conveying directorial intent and enhancing visual storytelling. In cinematography, Directors of Photography meticulously craft camera movements to achieve expressive and intentional framing. However, existing methods for camera trajectory generation remain limited: Traditional approaches rely on geometric optimization or handcrafted procedural systems, while recent learning-based methods often inherit structural biases or lack textual alignment, constraining creative synthesis. In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. We first introduce DataDoP, a large-scale multi-modal dataset containing 29K real-world shots with free-moving camera trajectories, depth maps, and detailed captions in specific movements, interaction with the scene, and directorial intent. Thanks to the comprehensive and diverse database, we further train an auto-regressive, decoder-only Transformer for high-quality, context-aware camera movement generation based on text guidance and RGBD inputs, named GenDoP. Extensive experiments demonstrate that compared to existing methods, GenDoP offers better controllability, finer-grained trajectory adjustments, and higher motion stability. We believe our approach establishes a new standard for learning-based cinematography, paving the way for future advancements in camera control and filmmaking. Our project website: https://kszpxxzmc.github.io/GenDoP/.

Via

Access Paper or Ask Questions

IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

Dec 16, 2024

Zhibing Li, Tong Wu, Jing Tan, Mengchen Zhang, Jiaqi Wang, Dahua Lin

Figure 1 for IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

Figure 2 for IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

Figure 3 for IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

Figure 4 for IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

Abstract:Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics. Traditional optimization-based methods often require hours of computational time to reconstruct geometry, material properties, and environmental lighting from dense multi-view inputs, while still struggling with inherent ambiguities between lighting and material. On the other hand, learning-based approaches leverage rich material priors from existing 3D object datasets but face challenges with maintaining multi-view consistency. In this paper, we introduce IDArb, a diffusion-based model designed to perform intrinsic decomposition on an arbitrary number of images under varying illuminations. Our method achieves accurate and multi-view consistent estimation on surface normals and material properties. This is made possible through a novel cross-view, cross-domain attention module and an illumination-augmented, view-adaptive training strategy. Additionally, we introduce ARB-Objaverse, a new dataset that provides large-scale multi-view intrinsic data and renderings under diverse lighting conditions, supporting robust training. Extensive experiments demonstrate that IDArb outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, our approach facilitates a range of downstream tasks, including single-image relighting, photometric stereo, and 3D reconstruction, highlighting its broad applications in realistic 3D content creation.

Via

Access Paper or Ask Questions

Imagine360: Immersive 360 Video Generation from Perspective Anchor

Dec 04, 2024

Jing Tan, Shuai Yang, Tong Wu, Jingwen He, Yuwei Guo, Ziwei Liu, Dahua Lin

Figure 1 for Imagine360: Immersive 360 Video Generation from Perspective Anchor

Figure 2 for Imagine360: Immersive 360 Video Generation from Perspective Anchor

Figure 3 for Imagine360: Immersive 360 Video Generation from Perspective Anchor

Figure 4 for Imagine360: Immersive 360 Video Generation from Perspective Anchor

Abstract:$360^\circ$ videos offer a hyper-immersive experience that allows the viewers to explore a dynamic scene from full 360 degrees. To achieve more user-friendly and personalized content creation in $360^\circ$ video format, we seek to lift standard perspective videos into $360^\circ$ equirectangular videos. To this end, we introduce Imagine360, the first perspective-to-$360^\circ$ video generation framework that creates high-quality $360^\circ$ videos with rich and diverse motion patterns from video anchors. Imagine360 learns fine-grained spherical visual and motion patterns from limited $360^\circ$ video data with several key designs. 1) Firstly we adopt the dual-branch design, including a perspective and a panorama video denoising branch to provide local and global constraints for $360^\circ$ video generation, with motion module and spatial LoRA layers fine-tuned on extended web $360^\circ$ videos. 2) Additionally, an antipodal mask is devised to capture long-range motion dependencies, enhancing the reversed camera motion between antipodal pixels across hemispheres. 3) To handle diverse perspective video inputs, we propose elevation-aware designs that adapt to varying video masking due to changing elevations across frames. Extensive experiments show Imagine360 achieves superior graphics quality and motion coherence among state-of-the-art $360^\circ$ video generation methods. We believe Imagine360 holds promise for advancing personalized, immersive $360^\circ$ video creation.

* Project page: https://ys-imtech.github.io/projects/Imagine360

Via

Access Paper or Ask Questions

LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

Aug 23, 2024

Shuai Yang, Jing Tan, Mengchen Zhang, Tong Wu, Yixuan Li, Gordon Wetzstein, Ziwei Liu, Dahua Lin

Figure 1 for LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

Figure 2 for LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

Figure 3 for LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

Figure 4 for LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

Abstract:3D immersive scene generation is a challenging yet critical task in computer vision and graphics. A desired virtual 3D scene should 1) exhibit omnidirectional view consistency, and 2) allow for free exploration in complex scene hierarchies. Existing methods either rely on successive scene expansion via inpainting or employ panorama representation to represent large FOV scene environments. However, the generated scene suffers from semantic drift during expansion and is unable to handle occlusion among scene hierarchies. To tackle these challenges, we introduce LayerPano3D, a novel framework for full-view, explorable panoramic 3D scene generation from a single text prompt. Our key insight is to decompose a reference 2D panorama into multiple layers at different depth levels, where each layer reveals the unseen space from the reference views via diffusion prior. LayerPano3D comprises multiple dedicated designs: 1) we introduce a novel text-guided anchor view synthesis pipeline for high-quality, consistent panorama generation. 2) We pioneer the Layered 3D Panorama as underlying representation to manage complex scene hierarchies and lift it into 3D Gaussians to splat detailed 360-degree omnidirectional scenes with unconstrained viewing paths. Extensive experiments demonstrate that our framework generates state-of-the-art 3D panoramic scene in both full view consistency and immersive exploratory experience. We believe that LayerPano3D holds promise for advancing 3D panoramic scene creation with numerous applications.

* Project page: https://ys-imtech.github.io/projects/LayerPano3D/

Via

Access Paper or Ask Questions

HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Jul 28, 2024

Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai(+1 more)

Figure 1 for HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Figure 2 for HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Figure 3 for HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Figure 4 for HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Abstract:Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet. Through a carefully designed rule-based filtering strategy, we ensure the inclusion of high-quality videos, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that such simple baseline training on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Code and data will be publicly available at https://github.com/zhenzhiwang/HumanVid/.

* camera controllable human image animation, a dataset and a baseline

Via

Access Paper or Ask Questions

Dual DETRs for Multi-Label Temporal Action Detection

Mar 31, 2024

Yuhan Zhu, Guozhen Zhang, Jing Tan, Gangshan Wu, Limin Wang

Abstract:Temporal Action Detection (TAD) aims to identify the action boundaries and the corresponding category within untrimmed videos. Inspired by the success of DETR in object detection, several methods have adapted the query-based framework to the TAD task. However, these approaches primarily followed DETR to predict actions at the instance level (i.e., identify each action by its center point), leading to sub-optimal boundary localization. To address this issue, we propose a new Dual-level query-based TAD framework, namely DualDETR, to detect actions from both instance-level and boundary-level. Decoding at different levels requires semantics of different granularity, therefore we introduce a two-branch decoding structure. This structure builds distinctive decoding processes for different levels, facilitating explicit capture of temporal cues and semantics at each level. On top of the two-branch design, we present a joint query initialization strategy to align queries from both levels. Specifically, we leverage encoder proposals to match queries from each level in a one-to-one manner. Then, the matched queries are initialized using position and content prior from the matched action proposal. The aligned dual-level queries can refine the matched proposal with complementary cues during subsequent decoding. We evaluate DualDETR on three challenging multi-label TAD benchmarks. The experimental results demonstrate the superior performance of DualDETR to the existing state-of-the-art methods, achieving a substantial improvement under det-mAP and delivering impressive results under seg-mAP.

* Accepted by CVPR 2024

Via

Access Paper or Ask Questions

Multi-Objective Optimization Using Adaptive Distributed Reinforcement Learning

Mar 13, 2024

Jing Tan, Ramin Khalili, Holger Karl

Abstract:The Intelligent Transportation System (ITS) environment is known to be dynamic and distributed, where participants (vehicle users, operators, etc.) have multiple, changing and possibly conflicting objectives. Although Reinforcement Learning (RL) algorithms are commonly applied to optimize ITS applications such as resource management and offloading, most RL algorithms focus on single objectives. In many situations, converting a multi-objective problem into a single-objective one is impossible, intractable or insufficient, making such RL algorithms inapplicable. We propose a multi-objective, multi-agent reinforcement learning (MARL) algorithm with high learning efficiency and low computational requirements, which automatically triggers adaptive few-shot learning in a dynamic, distributed and noisy environment with sparse and delayed reward. We test our algorithm in an ITS environment with edge cloud computing. Empirical results show that the algorithm is quick to adapt to new environments and performs better in all individual and system metrics compared to the state-of-the-art benchmark. Our algorithm also addresses various practical concerns with its modularized and asynchronous online training method. In addition to the cloud simulation, we test our algorithm on a single-board computer and show that it can make inference in 6 milliseconds.

Via

Access Paper or Ask Questions

Multi-Agent Reinforcement Learning for Long-Term Network Resource Allocation through Auction: a V2X Application

Jul 29, 2022

Jing Tan, Ramin Khalili, Holger Karl, Artur Hecker

Figure 1 for Multi-Agent Reinforcement Learning for Long-Term Network Resource Allocation through Auction: a V2X Application

Figure 2 for Multi-Agent Reinforcement Learning for Long-Term Network Resource Allocation through Auction: a V2X Application

Figure 3 for Multi-Agent Reinforcement Learning for Long-Term Network Resource Allocation through Auction: a V2X Application

Figure 4 for Multi-Agent Reinforcement Learning for Long-Term Network Resource Allocation through Auction: a V2X Application

Abstract:We formulate offloading of computational tasks from a dynamic group of mobile agents (e.g., cars) as decentralized decision making among autonomous agents. We design an interaction mechanism that incentivizes such agents to align private and system goals by balancing between competition and cooperation. In the static case, the mechanism provably has Nash equilibria with optimal resource allocation. In a dynamic environment, this mechanism's requirement of complete information is impossible to achieve. For such environments, we propose a novel multi-agent online learning algorithm that learns with partial, delayed and noisy state information, thus greatly reducing information need. Our algorithm is also capable of learning from long-term and sparse reward signals with varying delay. Empirical results from the simulation of a V2X application confirm that through learning, agents with the learning algorithm significantly improve both system and individual performance, reducing up to 30% of offloading failure rate, communication overhead and load variation, increasing computation resource utilization and fairness. Results also confirm the algorithm's good convergence and generalization property in different environments.

* arXiv admin note: substantial text overlap with arXiv:2204.02267

Via

Access Paper or Ask Questions

Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach

Jun 30, 2022

Jiaqi Tang, Zhaoyang Liu, Jing Tan, Chen Qian, Wayne Wu, Limin Wang

Figure 1 for Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach

Figure 2 for Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach

Figure 3 for Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach

Abstract:Generic event boundary detection (GEBD) is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries. In this paper, we present a local context modeling and global boundary decoding approach for GEBD task. Local context modeling sub-network is proposed to perceive diverse patterns of generic event boundaries, and it generates powerful video representations and reliable boundary confidence. Based on them, global boundary decoding sub-network is exploited to decode event boundaries from a global view. Our proposed method achieves 85.13% F1-score on Kinetics-GEBD testing set, which achieves a more than 22% F1-score boost compared to the baseline method. The code is available at https://github.com/JackyTown/GEBD_Challenge_CVPR2022.

* arXiv admin note: text overlap with arXiv:2112.04771

Via

Access Paper or Ask Questions

A Novel Multi-Agent Scheduling Mechanism for Adaptation of Production Plans in Case of Supply Chain Disruptions

Jun 23, 2022

Jing Tan, Lars Braubach, Kai Jander, Rongjun Xu, Kai Chen

Figure 1 for A Novel Multi-Agent Scheduling Mechanism for Adaptation of Production Plans in Case of Supply Chain Disruptions

Figure 2 for A Novel Multi-Agent Scheduling Mechanism for Adaptation of Production Plans in Case of Supply Chain Disruptions

Figure 3 for A Novel Multi-Agent Scheduling Mechanism for Adaptation of Production Plans in Case of Supply Chain Disruptions

Figure 4 for A Novel Multi-Agent Scheduling Mechanism for Adaptation of Production Plans in Case of Supply Chain Disruptions

Abstract:Manufacturing companies typically use sophisticated production planning systems optimizing production steps, often delivering near-optimal solutions. As a downside for delivering a near-optimal schedule, planning systems have high computational demands resulting in hours of computation. Under normal circumstances this is not issue if there is enough buffer time before implementation of the schedule (e.g. at night for the next day). However, in case of unexpected disruptions such as delayed part deliveries or defectively manufactured goods, the planned schedule may become invalid and swift replanning becomes necessary. Such immediate replanning is unsuited for existing optimal planners due to the computational requirements. This paper proposes a novel solution that can effectively and efficiently perform replanning in case of different types of disruptions using an existing plan. The approach is based on the idea to adhere to the existing schedule as much as possible, adapting it based on limited local changes. For that purpose an agent-based scheduling mechanism has been devised, in which agents represent materials and production sites and use local optimization techniques and negotiations to generate an adapted (sufficient, but non-optimal) schedule. The approach has been evaluated using real production data from Huawei, showing that efficient schedules are produced in short time. The system has been implemented as proof of concept and is currently reimplemented and transferred to a production system based on the Jadex agent platform.

Via

Access Paper or Ask Questions