Publishing open-source academic video recordings is an emerging and increasingly prevalent approach to sharing knowledge online. Such videos carry rich multimodal information, including speech, the facial and body movements of the speakers, and the text and pictures in the slides and possibly even the papers. Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, partly because of the lack of high-quality human annotations. In this paper, we propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (M$^3$AV), which has almost 367 hours of videos from five sources covering computer science, mathematics, and medical and biology topics. With high-quality human annotations of the spoken and written words, in particular high-value named entities, the dataset can be used for multiple audio-visual recognition and understanding tasks. Evaluations performed on contextual speech recognition, speech synthesis, and slide and script generation tasks demonstrate that the diversity of M$^3$AV makes it a challenging dataset.
Large Language Models (LLMs) have demonstrated remarkable proficiency in human interactions, yet their application within the medical field remains insufficiently explored. Previous works mainly focus on testing medical knowledge with examinations, which is far from realistic scenarios and falls short in assessing the abilities of LLMs on clinical tasks. To enhance the application of LLMs in healthcare, this paper introduces the Automated Interactive Evaluation (AIE) framework and the State-Aware Patient Simulator (SAPS), targeting the gap between traditional LLM evaluations and the nuanced demands of clinical practice. Unlike prior methods that rely on static medical knowledge assessments, AIE and SAPS provide a dynamic, realistic platform for assessing LLMs through multi-turn doctor-patient simulations. This approach offers a closer approximation to real clinical scenarios and allows for a detailed analysis of LLM behaviors in response to complex patient interactions. Our extensive experimental validation demonstrates the effectiveness of the AIE framework, with outcomes that align well with human evaluations, underscoring its potential to revolutionize medical LLM testing for improved healthcare delivery.
Video-grounded dialogue generation (VDG) requires the system to generate a fluent and accurate answer based on multimodal knowledge. However, the difficulty of utilizing multimodal knowledge causes serious hallucinations in VDG models in practice. Although previous works mitigate hallucination in a variety of ways, they pay little attention to the importance of multimodal knowledge anchor answer tokens. In this paper, we reveal via perplexity that different VDG models experience varying hallucinations and exhibit diverse anchor tokens. Based on this observation, we propose M2K-VDG, a model-adaptive multimodal knowledge anchor enhancement framework for hallucination reduction. Furthermore, we introduce the counterfactual effect for more accurate anchor token detection. Experimental results on three popular benchmarks show the superiority of our approach over state-of-the-art methods, demonstrating its effectiveness in reducing hallucinations.
Multimodal Large Language Models (MLLMs) have recently shown remarkable abilities in visual perception and understanding. However, how to comprehensively evaluate the capabilities of MLLMs remains a challenge. Most existing benchmarks focus on assessing perception, cognition, and reasoning, neglecting self-awareness, that is, the model's recognition of its own capability boundary. In our study, we focus on self-awareness in image perception and introduce the knowledge quadrant for MLLMs, which clearly defines the knowns and unknowns in perception. Based on this, we propose a novel benchmark specifically designed to evaluate the Self-Aware capabilities in Perception for MLLMs (MM-SAP). MM-SAP encompasses three distinct sub-datasets, each focusing on different aspects of self-awareness. We evaluated eight well-known MLLMs using MM-SAP, analyzing their self-awareness and providing detailed insights. Code and data are available at https://github.com/YHWmz/MM-SAP
We study the sample complexity of sample average approximation (SAA) and its simple variations, referred to as the regularized SAA (RSAA), in solving convex and strongly convex stochastic programming (SP) problems under heavy-tailed-ness, non-Lipschitz-ness, and/or high dimensionality. The presence of such irregularities underscores critical vacua in the literature. In response, this paper presents three sets of results. First, we show that the (R)SAA is effective even if the objective function is not necessarily Lipschitz and the underlying distribution admits some bounded central moments only at (near-)optimal solutions. Second, when the SP's objective function is the sum of a smooth term and a Lipschitz term, we prove that the (R)SAA's sample complexity is completely independent of any complexity measures (e.g., the covering number) of the feasible region. Third, we explicate the (R)SAA's sample complexities with regard to the dependence on dimensionality $d$: when some $p$th ($p\geq 2$) central moment of the underlying distribution is bounded, we show that the required sample size grows at a rate no worse than $\mathcal O\left(p d^{2/p}\right)$ under any one of three structural assumptions: (i) strong convexity w.r.t. the $q$-norm ($q\geq 1$); (ii) the combination of restricted strong convexity and sparsity; and (iii) a dimension-insensitive $q$-norm of an optimal solution. In cases (i) and (iii), it is further required that $p\leq q/(q-1)$. As a direct implication, the (R)SAA's complexity becomes (poly-)logarithmic in $d$ whenever $p\geq c\cdot \ln d$ is admissible for some constant $c>0$. These new results deviate from the SAA's typical sample complexities, which grow polynomially with $d$. Part of our proof is based on the average-replace-one (RO) stability, which appears to be novel for the (R)SAA's analyses.
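The logarithmic dependence on $d$ can be checked with a one-line calculation. The sketch below takes only the rate $\mathcal O\left(p d^{2/p}\right)$ stated above as given and assumes $d\geq 2$ and that the choice $p = c\cdot\ln d$ is admissible; hidden constants and the other problem parameters are ignored.

$$
d^{2/p} = \exp\left(\frac{2\ln d}{p}\right) \leq \exp\left(\frac{2}{c}\right) \quad \text{whenever } p \geq c\cdot\ln d,
\qquad\text{so with } p = c\cdot\ln d:\quad p\, d^{2/p} = c\, e^{2/c}\, \ln d .
$$

Under these assumptions the polynomial factor $d^{2/p}$ collapses to the constant $e^{2/c}$, and only the factor $p$, which is logarithmic in $d$, remains.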
Large language models (LLMs) have achieved significant success in interacting with humans. However, recent studies have revealed that these models often suffer from hallucinations, leading to overly confident but incorrect judgments. This limits their application in the medical domain, where tasks require the utmost accuracy. This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations. Consultation tasks are designed to require LLMs to be aware of what they do not know, to inquire about missing medical information from patients, and ultimately to make diagnoses. To evaluate the performance of LLMs on these tasks, a benchmark is proposed by reformulating medical multiple-choice questions from the United States Medical Licensing Examinations (USMLE), and comprehensive evaluation metrics are developed and evaluated on three constructed test sets. A medical consultation training set is further constructed to improve the consultation ability of LLMs. The experimental results show that fine-tuning with the training set can alleviate hallucinations and improve LLMs' performance on the proposed benchmark. Extensive experiments and ablation studies are conducted to validate the effectiveness and robustness of the proposed framework.
This paper concerns a convex, stochastic zeroth-order optimization (S-ZOO) problem, where the objective is to minimize the expectation of a cost function whose gradient is not directly accessible. To solve this problem, traditional optimization techniques mostly yield query complexities that grow polynomially with dimensionality, i.e., the number of function evaluations is a polynomial function of the number of decision variables. Consequently, these methods may not perform well in solving massive-dimensional problems arising in many modern applications. Although more recent methods can be provably dimension-insensitive, almost all of them work under arguably more stringent conditions, such as everywhere sparse or compressible gradients. Thus, prior to this research, it was unknown whether dimension-insensitive S-ZOO is possible without such conditions. In this paper, we give an affirmative answer to this question by proposing a sparsity-inducing stochastic gradient-free (SI-SGF) algorithm. It is proved to achieve dimension-insensitive query complexity in both convex and strongly convex cases when neither gradient sparsity nor gradient compressibility is satisfied. Our numerical results demonstrate the strong potential of the proposed SI-SGF compared with existing alternatives.
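To make the zeroth-order setting concrete, here is a minimal sketch of a standard two-point Gaussian-smoothing gradient estimator used inside plain gradient descent. It is not the SI-SGF algorithm, whose details the abstract does not give. The function names, the step size, the smoothing radius `mu`, and the noise-free quadratic objective are all illustrative assumptions; in the stochastic setting each call to `f` would return a noisy sample of the cost.

```python
import random


def two_point_gradient_estimate(f, x, mu, rng):
    """Estimate the gradient of f at x using only two function evaluations."""
    # Random Gaussian search direction (hypothetical choice of distribution).
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_shift = [xi + mu * ui for xi, ui in zip(x, u)]
    # Finite difference along u, then scaled back onto u.
    scale = (f(x_shift) - f(x)) / mu
    return [scale * ui for ui in u]


def zeroth_order_descent(f, x0, step=0.01, mu=1e-4, iters=2000, seed=0):
    """Gradient descent driven by the gradient-free estimate above."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        g = two_point_gradient_estimate(f, x, mu, rng)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def quadratic(x):
    """Illustrative convex objective: squared Euclidean norm."""
    return sum(xi * xi for xi in x)


x_start = [1.0] * 5
x_final = zeroth_order_descent(quadratic, x_start)
```

Each iteration costs two function evaluations. The variance of this generic estimator grows with the number of decision variables, which is one way to see the polynomial dimension dependence of traditional methods described above.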
We propose a general learning-based framework for solving nonsmooth and nonconvex image reconstruction problems. We model the regularization function as the composition of the $l_{2,1}$ norm and a smooth but nonconvex feature mapping parametrized as a deep convolutional neural network. We develop a provably convergent descent-type algorithm to solve the nonsmooth nonconvex minimization problem by leveraging Nesterov's smoothing technique and the idea of residual learning, and learn the network parameters such that the outputs of the algorithm match the references in the training data. Our method is versatile, as one can incorporate various modern network structures into the regularization, and the resulting network inherits the guaranteed convergence of the algorithm. We also show that the proposed network is parameter-efficient and that its performance compares favorably to state-of-the-art methods in a variety of image reconstruction problems in practice.
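As an illustration of the smoothing step, the sketch below applies the standard Nesterov (Huber-type) smoothing to an $l_{2,1}$ norm, treating the input as a list of feature vectors. The abstract does not state the exact smoothed form used in the paper, so the formula, the function names, and the parameter `eps` are assumptions.

```python
import math


def l21_norm(groups):
    """Plain l_{2,1} norm: sum of the Euclidean norms of the groups."""
    return sum(math.sqrt(sum(v * v for v in g)) for g in groups)


def smoothed_l21_norm(groups, eps):
    """Huber-type smoothing of the l_{2,1} norm with parameter eps > 0.

    Each group norm r is replaced by r - eps/2 when r > eps and by
    r^2 / (2 * eps) otherwise, which is differentiable at r = 0.
    """
    total = 0.0
    for g in groups:
        r = math.sqrt(sum(v * v for v in g))
        if r > eps:
            total += r - eps / 2.0
        else:
            total += r * r / (2.0 * eps)
    return total


# Hypothetical feature vectors: one large, one zero, one below the threshold.
features = [[3.0, 4.0], [0.0, 0.0], [0.0, 0.05]]
eps = 0.1
exact = l21_norm(features)
smooth = smoothed_l21_norm(features, eps)
```

The smoothed value never exceeds the exact norm and falls short of it by at most `eps / 2` per group, so shrinking `eps` tightens the approximation at the cost of a larger gradient Lipschitz constant.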
Optimization algorithms for solving nonconvex inverse problems have attracted significant interest recently. However, existing methods require the nonconvex regularization to be smooth or simple to ensure convergence. In this paper, we propose a novel gradient descent type algorithm, leveraging the idea of residual learning and Nesterov's smoothing technique, to solve inverse problems with general nonconvex and nonsmooth regularization with provable convergence. Moreover, we develop a neural network architecture imitating this algorithm to learn the nonlinear sparsity transformation adaptively from training data; the network also inherits the convergence guarantee, which accommodates the general nonconvex structure of the learned transformation. Numerical results demonstrate that the proposed network outperforms state-of-the-art methods on a variety of image reconstruction problems in terms of efficiency and accuracy.