Qingwen Liu

Can video generation replace cinematographers? Research on the cinematic language of generated video

Dec 16, 2024

Xiaozhe Li, Kai Wu, Siyi Yang, YiZhan Qu, Guohua Zhang, Zhiyu Chen, Jiayao Li, Jiangchuan Mu, Xiaobin Hu, Wen Fang(+5 more)

Abstract:Recent advancements in text-to-video (T2V) generation have leveraged diffusion models to enhance the visual coherence of videos generated from textual descriptions. However, most research has primarily focused on object motion, with limited attention given to cinematic language in videos, which is crucial for cinematographers to convey emotion and narrative pacing. To address this limitation, we propose a threefold approach to enhance the ability of T2V models to generate controllable cinematic language. Specifically, we introduce a cinematic language dataset that encompasses shot framing, angle, and camera movement, enabling models to learn diverse cinematic styles. Building on this, to facilitate robust cinematic alignment evaluation, we present CameraCLIP, a model fine-tuned on the proposed dataset that excels in understanding complex cinematic language in generated videos and can further provide valuable guidance in the multi-shot composition process. Finally, we propose CLIPLoRA, a cost-guided dynamic LoRA composition method that facilitates smooth transitions and realistic blending of cinematic language by dynamically fusing multiple pre-trained cinematic LoRAs within a single video. Our experiments demonstrate that CameraCLIP outperforms existing models in assessing the alignment between cinematic language and video, achieving an R@1 score of 0.81. Additionally, CLIPLoRA improves the ability for multi-shot composition, potentially bridging the gap between automatically generated videos and those shot by professional cinematographers.
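
The R@1 score quoted in the abstract is a retrieval metric: the fraction of queries whose correct match is ranked first. The following is a minimal sketch of that computation, not the authors' evaluation code. It assumes a square similarity matrix in which row `i` is a text query, column `j` is a video, and the true pair sits on the diagonal; the function name `recall_at_k` and the toy scores are hypothetical.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose ground-truth item (same index) ranks in the top k."""
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for query_index, row in enumerate(similarity):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores (assumed for illustration): entry [i][i] is the true text-video pair.
toy_scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.4],
    [0.7, 0.2, 0.6],
]
print(recall_at_k(toy_scores, k=1))  # two of three queries rank their true video first
print(recall_at_k(toy_scores, k=2))  # all three true videos appear in the top two
```

Under this definition, an R@1 of 0.81 would mean that the correct video is the top-ranked candidate for 81% of the queries.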

* 13 pages

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

Nov 30, 2024

Haicheng Wang, Chen Ju, Weixiong Lin, Shuai Xiao, Mengting Chen, Yixuan Huang, Chang Liu, Mingshuai Yao, Jinsong Lan, Ying Chen(+2 more)

Abstract:In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-training (CLIP) has made significant strides, becoming the foundation for various downstream tasks. However, by relying on a one-to-one (image, text) contrastive paradigm to learn alignment from large-scale messy web data, CLIP faces a serious myopic dilemma, resulting in biases towards monotonous short texts and shallow visual expressivity. To overcome these issues, this paper advances CLIP into a novel holistic paradigm by updating both the data diversity and the alignment optimization. To obtain colorful data at low cost, we use image-to-text captioning to generate multiple texts for each image, from multiple perspectives, granularities, and hierarchies. Two gadgets are proposed to encourage textual diversity. To match such (image, multi-texts) pairs, we modify the CLIP image encoder into a multi-branch design, and propose multi-to-multi contrastive optimization for image-text part-to-part matching. As a result, diverse visual embeddings are learned for each image, bringing good interpretability and generalization. Extensive experiments and ablations across over ten benchmarks indicate that our holistic CLIP significantly outperforms existing myopic CLIP on tasks including image-text retrieval, open-vocabulary classification, and dense visual tasks.

Branches, Assemble! Multi-Branch Cooperation Network for Large-Scale Click-Through Rate Prediction at Taobao

Nov 20, 2024

Xu Chen, Zida Cheng, Yuangang Pan, Shuai Xiao, Xiaoming Liu, Jinsong Lan, Qingwen Liu, Ivor W. Tsang

Abstract:Existing click-through rate (CTR) prediction works have studied the role of feature interaction through a variety of techniques. Each interaction technique exhibits its own strength, and solely using one type could constrain the model's capability to capture the complex feature relationships, especially for industrial large-scale data with enormous numbers of users and items. Recent research shows that effective CTR models often combine an MLP network with a dedicated feature interaction network in a two-parallel structure. However, the interplay and cooperative dynamics between different streams or branches remain under-researched. In this work, we introduce a novel Multi-Branch Cooperation Network (MBCnet) which enables multiple branch networks to collaborate with each other for better complex feature interaction modeling. Specifically, MBCnet consists of three branches: the Expert-based Feature Grouping and Crossing (EFGC) branch, which promotes the model's memorization ability of specific feature fields, and the low-rank Cross Net branch and Deep branch, which enhance both explicit and implicit feature crossing for improved generalization. Among branches, a novel cooperation scheme is proposed based on two principles: branch co-teaching and moderate differentiation. Branch co-teaching encourages well-learned branches to support poorly-learned ones on specific training samples. Moderate differentiation encourages branches to maintain a reasonable level of difference in their feature representations. The cooperation strategy improves learning through mutual knowledge sharing via co-teaching and boosts the discovery of diverse feature interactions across branches. Extensive experiments on large-scale industrial datasets and an online A/B test demonstrate MBCnet's superior performance, delivering a 0.09 point increase in CTR, 1.49% growth in deals, and 1.62% rise in GMV. Core codes will be released soon.

* 10 pages

Field of View Expansion for Resonant Beam Information and Power Transfer

Aug 08, 2024

Shun Han, Wen Fang, Mingqing Liu, Mengyuan Xu, Shuaifan Xia, Qingwen Liu

Abstract:Simultaneous wireless information and power transfer (SWIPT) leverages lightwave as the wireless transmission medium, emerging as a promising technology for future Internet of Things (IoT) scenarios. The use of retroreflectors in constructing spatially separated laser resonators (SSLR) enables a self-aligning wireless transmission system with the self-reproducing resonant beam, i.e., the resonant beam system (RBS). However, its effective Field of View (FoV) is physically limited by the size of the retroreflectors and still requires significant improvement. This restricts the transmitter from providing seamless wireless connectivity and power supply to receivers within a large dynamic movement range. In this paper, we propose an FoV-enlarged resonant beam system operating at a meter distance by incorporating a telescope. The telescope plays a crucial role in minimizing the extra loss inflicted on the gain medium, which typically arises from the deviation of the resonant beam within the cavity. Further, we construct the proposed telescope-based RBS and experimentally demonstrate that the design can expand the FoV to 28$^\circ$ over a 1 m transmission distance, which is about triple that of the ordinary RBS design.

Resonant Beam Enabled DoA Estimation in Passive Positioning System

Aug 08, 2024

Yixuan Guo, Qingwei Jiang, Mengyuan Xu, Wen Fang, Qingwen Liu, Gang Yan, Qunhui Yang, Hai Lu

Abstract:The rapid advancement of the next generation of communications and internet of things (IoT) technologies has made the provision of location-based services for diverse devices an increasingly pressing necessity. Localizing devices with or without intelligent computing abilities, including both active and passive devices, is essential, especially in indoor scenarios. For traditional RF positioning systems, aligning transmission signals and dealing with signal interference in complex environments are inevitable challenges. Therefore, this paper proposes a new passive positioning system, the RF-band resonant beam positioning system (RF-RBPS), which achieves energy concentration and beam alignment by amplifying echoes between the base station (BS) and the passive target (PT), without the need for complex channel estimation and time-consuming beamforming, and provides high-precision direction of arrival (DoA) estimation for battery-free targets using the resonant mechanism. The direction information of the PT is estimated at the BS end using the multiple signal classification (MUSIC) algorithm. The feasibility of the proposed system is validated through theoretical analysis and simulations. Results indicate that the proposed RF-RBPS surpasses the RF-band active positioning system (RF-APS) in precision, achieving millimeter-level precision at 2m within an elevation angle of 35$^\circ$, and an error of less than 3cm at 2.5m within an elevation angle of 35$^\circ$.

NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models

May 31, 2024

Kai Wu, Boyuan Jiang, Zhengkai Jiang, Qingdong He, Donghao Luo, Shengzhi Wang, Qingwen Liu, Chengjie Wang

Abstract:Multimodal large language models (MLLMs) provide a powerful mechanism for understanding visual information by building on large language models. However, MLLMs are notorious for suffering from hallucinations, especially when generating lengthy, detailed descriptions for images. Our analysis reveals that hallucinations stem from the inherent summarization mechanism of large language models, leading to excessive dependence on linguistic tokens while neglecting vision information. In this paper, we propose NoiseBoost, a broadly applicable and simple method for alleviating hallucinations for MLLMs through the integration of noise feature perturbations. Noise perturbation acts as a regularizer, facilitating a balanced distribution of attention weights among visual and linguistic tokens. Despite its simplicity, NoiseBoost consistently enhances the performance of MLLMs across common training strategies, including supervised fine-tuning and reinforcement learning. Further, NoiseBoost pioneers semi-supervised learning for MLLMs, unleashing the power of unlabeled data. Comprehensive experiments demonstrate that NoiseBoost improves dense caption accuracy by 8.1% with human evaluation and achieves comparable results with 50% of the data by mining unlabeled data. Code and models are available at https://kaiwu5.github.io/noiseboost.
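
The idea of noise feature perturbation can be illustrated with a small sketch. This is not the NoiseBoost implementation: the abstract does not say where the noise is injected or how it is scaled, so the code below simply assumes additive zero-mean Gaussian noise on a list of visual token features, with a hypothetical `noise_scale` parameter and a hypothetical function name `perturb_features`.

```python
import random
from typing import List


def perturb_features(
    features: List[List[float]],
    noise_scale: float,
    rng: random.Random,
) -> List[List[float]]:
    """Return a copy of `features` with zero-mean Gaussian noise added to every value."""
    return [
        [value + noise_scale * rng.gauss(0.0, 1.0) for value in token]
        for token in features
    ]


# Two hypothetical visual tokens with three feature dimensions each.
visual_tokens = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
noisy_tokens = perturb_features(visual_tokens, noise_scale=0.1, rng=rng)
print(noisy_tokens)
```

In a training loop, a perturbation of this kind would typically be applied only during training and skipped at inference, which is how it can act as a regularizer; that usage is an assumption here rather than something the abstract states.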

* 14 pages, 5 figures with supplementary material

Resonant Beam Communications: A New Design Paradigm and Challenges

Mar 25, 2024

Yuanming Tian, Dongxu Li, Chuan Huang, Qingwen Liu, Shengli Zhou

Abstract:Resonant beam communications (RBCom), which adopt oscillating photons between two separate retroreflectors for information transmission, exhibit potential advantages over other types of wireless optical communications (WOC). However, echo interference generated by the modulated beam reflected from the receiver affects the transmission of the desired information. To tackle this challenge, a synchronization-based point-to-point RBCom system is proposed to eliminate the echo interference, and the design for the transmitter and receiver is discussed. Subsequently, the performance of the proposed RBCom is evaluated and compared with that of visible light communications (VLC) and free space optical communications (FOC). Finally, future research directions are outlined and several implementation challenges of RBCom systems are highlighted.

Design and Performance of Resonant Beam Communications -- Part II: Mobile Scenario

Mar 25, 2024

Dongxu Li, Yuanming Tian, Chuan Huang, Qingwen Liu, Shengli Zhou

Abstract:This two-part paper focuses on the system design and performance analysis for a point-to-point resonant beam communication (RBCom) system under both the quasi-static and mobile scenarios. Part I of this paper proposes a synchronization-based information transmission scheme and derives the capacity upper and lower bounds for the quasi-static channel case. In Part II, we address the mobile scenario, where the receiver is in relative motion to the transmitter, and derive a mobile RBCom channel model that jointly considers the Doppler effect, channel variation, and echo interference. With the obtained channel model, we prove that the channel gain of the mobile RBCom decreases as the number of transmitted frames increases, and thus show that the considered mobile RBCom terminates after the transmitter sends a certain number of frames without frequency compensation. By deriving an upper bound on the number of successfully transmitted frames, we formulate the throughput maximization problem for the considered mobile RBCom system, and solve it via a sequential parametric convex approximation (SPCA) method. Finally, simulation results validate the analysis of our proposed method in some typical scenarios.

Design and Performance of Resonant Beam Communications -- Part I: Quasi-Static Scenario

Mar 25, 2024

Dongxu Li, Yuanming Tian, Chuan Huang, Qingwen Liu, Shengli Zhou

Abstract:This two-part paper studies a point-to-point resonant beam communication (RBCom) system, where two separately deployed retroreflectors are adopted to generate the resonant beam between the transmitter and the receiver, and analyzes the transmission rate of the considered system under both the quasi-static and mobile scenarios. Part I of this paper focuses on the quasi-static scenario where the locations of the transmitter and the receiver are relatively fixed. Specifically, we propose a new information-bearing scheme which adopts a synchronization-based amplitude modulation method to mitigate the echo interference caused by the reflected resonant beam. With this scheme, we show that the quasi-static RBCom channel is equivalent to a Markov channel and can be further simplified as an amplitude-constrained additive white Gaussian noise channel. Moreover, we develop an algorithm that jointly employs the bisection and exhaustive search to maximize its capacity upper and lower bounds. Finally, numerical results validate our analysis. Part II of this paper discusses the performance of the RBCom system under the mobile scenario.

ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks

Oct 17, 2023

Zejun Li, Ye Wang, Mengfei Du, Qingwen Liu, Binhao Wu, Jiwen Zhang, Chengxing Zhou, Zhihao Fan, Jie Fu, Jingjing Chen(+2 more)

Abstract:Recent years have witnessed remarkable progress in the development of large vision-language models (LVLMs). Benefiting from strong language backbones and efficient cross-modal alignment strategies, LVLMs exhibit surprising capabilities to perceive visual signals and perform visually grounded reasoning. However, the capabilities of LVLMs have not been comprehensively and quantitatively evaluated. Most existing multi-modal benchmarks require task-oriented input-output formats, posing great challenges for automatically assessing the free-form text output of LVLMs. To effectively leverage the annotations available in existing benchmarks and reduce the manual effort required for constructing new benchmarks, we propose to re-formulate existing benchmarks into unified LVLM-compatible formats. Through systematic data collection and reformulation, we present the ReForm-Eval benchmark, offering substantial data for evaluating various capabilities of LVLMs. Based on ReForm-Eval, we conduct extensive experiments, thoroughly analyze the strengths and weaknesses of existing LVLMs, and identify the underlying factors. Our benchmark and evaluation framework will be open-sourced as a cornerstone for advancing the development of LVLMs.

* 38 pages, 11 figures, 24 tables
