Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhu Li

Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation

Dec 15, 2024

Yujie Zhang, Bingyang Cui, Qi Yang, Zhu Li, Yiling Xu

Figure 1 for Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation

Figure 2 for Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation

Figure 3 for Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation

Figure 4 for Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation

Abstract:Text-to-3D generation has achieved remarkable progress in recent years, yet evaluating these methods remains challenging for two reasons: i) Existing benchmarks lack fine-grained evaluation on different prompt categories and evaluation dimensions. ii) Previous evaluation metrics only focus on a single aspect (e.g., text-3D alignment) and fail to perform multi-dimensional quality assessment. To address these problems, we first propose a comprehensive benchmark named MATE-3D. The benchmark contains eight well-designed prompt categories that cover single and multiple object generation, resulting in 1,280 generated textured meshes. We have conducted a large-scale subjective experiment from four different evaluation dimensions and collected 107,520 annotations, followed by detailed analyses of the results. Based on MATE-3D, we propose a novel quality evaluator named HyperScore. Utilizing hypernetwork to generate specified mapping functions for each evaluation dimension, our metric can effectively perform multi-dimensional quality assessment. HyperScore presents superior performance over existing metrics on MATE-3D, making it a promising metric for assessing and improving text-to-3D generation. The project is available at https://mate-3d.github.io/.

Via

Access Paper or Ask Questions

AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation

Dec 13, 2024

Xiyuan Gao, Shubhi Bansal, Kushaan Gowda, Zhu Li, Shekhar Nayak, Nagendra Kumar, Matt Coler

Figure 1 for AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation

Figure 2 for AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation

Figure 3 for AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation

Figure 4 for AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation

Abstract:Detecting sarcasm effectively requires a nuanced understanding of context, including vocal tones and facial expressions. The progression towards multimodal computational methods in sarcasm detection, however, faces challenges due to the scarcity of data. To address this, we present AMuSeD (Attentive deep neural network for MUltimodal Sarcasm dEtection incorporating bi-modal Data augmentation). This approach utilizes the Multimodal Sarcasm Detection Dataset (MUStARD) and introduces a two-phase bimodal data augmentation strategy. The first phase involves generating varied text samples through Back Translation from several secondary languages. The second phase involves the refinement of a FastSpeech 2-based speech synthesis system, tailored specifically for sarcasm to retain sarcastic intonations. Alongside a cloud-based Text-to-Speech (TTS) service, this Fine-tuned FastSpeech 2 system produces corresponding audio for the text augmentations. We also investigate various attention mechanisms for effectively merging text and audio data, finding self-attention to be the most efficient for bimodal integration. Our experiments reveal that this combined augmentation and attention approach achieves a significant F1-score of 81.0% in text-audio modalities, surpassing even models that use three modalities from the MUStARD dataset.

* This is a preprint version of the paper, submitted and under review at the IEEE Transactions on Affective Computing

Via

Access Paper or Ask Questions

Nonparametric Instrumental Regression via Kernel Methods is Minimax Optimal

Nov 29, 2024

Dimitri Meunier, Zhu Li, Tim Christensen, Arthur Gretton

Abstract:We study the kernel instrumental variable algorithm of \citet{singh2019kernel}, a nonparametric two-stage least squares (2SLS) procedure which has demonstrated strong empirical performance. We provide a convergence analysis that covers both the identified and unidentified settings: when the structural function cannot be identified, we show that the kernel NPIV estimator converges to the IV solution with minimum norm. Crucially, our convergence is with respect to the strong $L_2$-norm, rather than a pseudo-norm. Additionally, we characterize the smoothness of the target function without relying on the instrument, instead leveraging a new description of the projected subspace size (this being closely related to the link condition in inverse learning literature). With the subspace size description and under standard kernel learning assumptions, we derive, for the first time, the minimax optimal learning rate for kernel NPIV in the strong $L_2$-norm. Our result demonstrates that the strength of the instrument is essential to achieve efficient learning. We also improve the original kernel NPIV algorithm by adopting a general spectral regularization in stage 1 regression. The modified regularization can overcome the saturation effect of Tikhonov regularization.

Via

Access Paper or Ask Questions

Face De-identification: State-of-the-art Methods and Comparative Studies

Nov 15, 2024

Jingyi Cao, Xiangyi Chen, Bo Liu, Ming Ding, Rong Xie, Li Song, Zhu Li, Wenjun Zhang

Figure 1 for Face De-identification: State-of-the-art Methods and Comparative Studies

Figure 2 for Face De-identification: State-of-the-art Methods and Comparative Studies

Figure 3 for Face De-identification: State-of-the-art Methods and Comparative Studies

Figure 4 for Face De-identification: State-of-the-art Methods and Comparative Studies

Abstract:The widespread use of image acquisition technologies, along with advances in facial recognition, has raised serious privacy concerns. Face de-identification usually refers to the process of concealing or replacing personal identifiers, which is regarded as an effective means to protect the privacy of facial images. A significant number of methods for face de-identification have been proposed in recent years. In this survey, we provide a comprehensive review of state-of-the-art face de-identification methods, categorized into three levels: pixel-level, representation-level, and semantic-level techniques. We systematically evaluate these methods based on two key criteria, the effectiveness of privacy protection and preservation of image utility, highlighting their advantages and limitations. Our analysis includes qualitative and quantitative comparisons of the main algorithms, demonstrating that deep learning-based approaches, particularly those using Generative Adversarial Networks (GANs) and diffusion models, have achieved significant advancements in balancing privacy and utility. Experimental results reveal that while recent methods demonstrate strong privacy protection, trade-offs remain in visual fidelity and computational complexity. This survey not only summarizes the current landscape but also identifies key challenges and future research directions in face de-identification.

Via

Access Paper or Ask Questions

Seeking the Sufficiency and Necessity Causal Features in Multimodal Representation Learning

Aug 29, 2024

Boyu Chen, Junjie Liu, Zhu Li, Mengyue yang

Abstract:Learning representations with a high Probability of Necessary and Sufficient Causes (PNS) has been shown to enhance deep learning models' ability. This task involves identifying causal features that are both sufficient (guaranteeing the outcome) and necessary (without which the outcome cannot occur). However, current research predominantly focuses on unimodal data, and extending PNS learning to multimodal settings presents significant challenges. The challenges arise as the conditions for PNS identifiability, Exogeneity and Monotonicity, need to be reconsidered in a multimodal context, where sufficient and necessary causal features are distributed across different modalities. To address this, we first propose conceptualizing multimodal representations as comprising modality-invariant and modality-specific components. We then analyze PNS identifiability for each component, while ensuring non-trivial PNS estimation. Finally, we formulate tractable optimization objectives that enable multimodal models to learn high-PNS representations, thereby enhancing their predictive performance. Experiments demonstrate the effectiveness of our method on both synthetic and real-world data.

Via

Access Paper or Ask Questions

A Functional Trade-off between Prosodic and Semantic Cues in Conveying Sarcasm

Aug 27, 2024

Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler

Abstract:This study investigates the acoustic features of sarcasm and disentangles the interplay between the propensity of an utterance being used sarcastically and the presence of prosodic cues signaling sarcasm. Using a dataset of sarcastic utterances compiled from television shows, we analyze the prosodic features within utterances and key phrases belonging to three distinct sarcasm categories (embedded, propositional, and illocutionary), which vary in the degree of semantic cues present, and compare them to neutral expressions. Results show that in phrases where the sarcastic meaning is salient from the semantics, the prosodic cues are less relevant than when the sarcastic meaning is not evident from the semantics, suggesting a trade-off between prosodic and semantic cues of sarcasm at the phrase level. These findings highlight a lessened reliance on prosodic modulation in semantically dense sarcastic expressions and a nuanced interaction that shapes the communication of sarcastic intent.

* accepted at Interspeech 2024

Via

Access Paper or Ask Questions

A Benchmark for Gaussian Splatting Compression and Quality Assessment Study

Jul 19, 2024

Qi Yang, Kaifa Yang, Yuke Xing, Yiling Xu, Zhu Li

Figure 1 for A Benchmark for Gaussian Splatting Compression and Quality Assessment Study

Figure 2 for A Benchmark for Gaussian Splatting Compression and Quality Assessment Study

Figure 3 for A Benchmark for Gaussian Splatting Compression and Quality Assessment Study

Figure 4 for A Benchmark for Gaussian Splatting Compression and Quality Assessment Study

Abstract:To fill the gap of traditional GS compression method, in this paper, we first propose a simple and effective GS data compression anchor called Graph-based GS Compression (GGSC). GGSC is inspired by graph signal processing theory and uses two branches to compress the primitive center and attributes. We split the whole GS sample via KDTree and clip the high-frequency components after the graph Fourier transform. Followed by quantization, G-PCC and adaptive arithmetic coding are used to compress the primitive center and attribute residual matrix to generate the bitrate file. GGSS is the first work to explore traditional GS compression, with advantages that can reveal the GS distortion characteristics corresponding to typical compression operation, such as high-frequency clipping and quantization. Second, based on GGSC, we create a GS Quality Assessment dataset (GSQA) with 120 samples. A subjective experiment is conducted in a laboratory environment to collect subjective scores after rendering GS into Processed Video Sequences (PVS). We analyze the characteristics of different GS distortions based on Mean Opinion Scores (MOS), demonstrating the sensitivity of different attributes distortion to visual quality. The GGSC code and the dataset, including GS samples, MOS, and PVS, are made publicly available at https://github.com/Qi-Yangsjtu/GGSC.

Via

Access Paper or Ask Questions

Explicit_NeRF_QA: A Quality Assessment Database for Explicit NeRF Model Compression

Jul 11, 2024

Yuke Xing, Qi Yang, Kaifa Yang, Yilin Xu, Zhu Li

Figure 1 for Explicit_NeRF_QA: A Quality Assessment Database for Explicit NeRF Model Compression

Figure 2 for Explicit_NeRF_QA: A Quality Assessment Database for Explicit NeRF Model Compression

Figure 3 for Explicit_NeRF_QA: A Quality Assessment Database for Explicit NeRF Model Compression

Figure 4 for Explicit_NeRF_QA: A Quality Assessment Database for Explicit NeRF Model Compression

Abstract:In recent years, Neural Radiance Fields (NeRF) have demonstrated significant advantages in representing and synthesizing 3D scenes. Explicit NeRF models facilitate the practical NeRF applications with faster rendering speed, and also attract considerable attention in NeRF compression due to its huge storage cost. To address the challenge of the NeRF compression study, in this paper, we construct a new dataset, called Explicit_NeRF_QA. We use 22 3D objects with diverse geometries, textures, and material complexities to train four typical explicit NeRF models across five parameter levels. Lossy compression is introduced during the model generation, pivoting the selection of key parameters such as hash table size for InstantNGP and voxel grid resolution for Plenoxels. By rendering NeRF samples to processed video sequences (PVS), a large scale subjective experiment with lab environment is conducted to collect subjective scores from 21 viewers. The diversity of content, accuracy of mean opinion scores (MOS), and characteristics of NeRF distortion are comprehensively presented, establishing the heterogeneity of the proposed dataset. The state-of-the-art objective metrics are tested in the new dataset. Best Person correlation, which is around 0.85, is collected from the full-reference objective metric. All tested no-reference metrics report very poor results with 0.4 to 0.6 correlations, demonstrating the need for further development of more robust no-reference metrics. The dataset, including NeRF samples, source 3D objects, multiview images for NeRF generation, PVSs, MOS, is made publicly available at the following location: https://github.com/LittlericeChloe/Explicit_NeRF_QA.

* 5 pages, 4 figures, 2 tables, conference

Via

Access Paper or Ask Questions

Optimal Rates for Vector-Valued Spectral Regularization Learning Algorithms

May 23, 2024

Dimitri Meunier, Zikai Shen, Mattes Mollenhauer, Arthur Gretton, Zhu Li

Abstract:We study theoretical properties of a broad class of regularized algorithms with vector-valued output. These spectral algorithms include kernel ridge regression, kernel principal component regression, various implementations of gradient descent and many more. Our contributions are twofold. First, we rigorously confirm the so-called saturation effect for ridge regression with vector-valued output by deriving a novel lower bound on learning rates; this bound is shown to be suboptimal when the smoothness of the regression function exceeds a certain level. Second, we present the upper bound for the finite sample risk general vector-valued spectral algorithms, applicable to both well-specified and misspecified scenarios (where the true regression function lies outside of the hypothesis space) which is minimax optimal in various regimes. All of our results explicitly allow the case of infinite-dimensional output variables, proving consistency of recent practical applications.

Via

Access Paper or Ask Questions

CompGS: Efficient 3D Scene Representation via Compressed Gaussian Splatting

Apr 15, 2024

Xiangrui Liu, Xinju Wu, Pingping Zhang, Shiqi Wang, Zhu Li, Sam Kwong

Abstract:Gaussian splatting, renowned for its exceptional rendering quality and efficiency, has emerged as a prominent technique in 3D scene representation. However, the substantial data volume of Gaussian splatting impedes its practical utility in real-world applications. Herein, we propose an efficient 3D scene representation, named Compressed Gaussian Splatting (CompGS), which harnesses compact Gaussian primitives for faithful 3D scene modeling with a remarkably reduced data size. To ensure the compactness of Gaussian primitives, we devise a hybrid primitive structure that captures predictive relationships between each other. Then, we exploit a small set of anchor primitives for prediction, allowing the majority of primitives to be encapsulated into highly compact residual forms. Moreover, we develop a rate-constrained optimization scheme to eliminate redundancies within such hybrid primitives, steering our CompGS towards an optimal trade-off between bitrate consumption and representation efficacy. Experimental results show that the proposed CompGS significantly outperforms existing methods, achieving superior compactness in 3D scene representation without compromising model accuracy and rendering quality. Our code will be released on GitHub for further research.

* Submitted to a conference

Via

Access Paper or Ask Questions