Abstract:Multi-modal sequential recommendation (SR) leverages multi-modal data to learn more comprehensive item features and user preferences than traditional SR methods, which has become a critical topic in both academia and industry. Existing methods typically focus on enhancing multi-modal information utility through adaptive modality fusion to capture the evolving of user preference from user-item interaction sequences. However, most of them overlook the interference caused by redundant interest-irrelevant information contained in rich multi-modal data. Additionally, they primarily rely on implicit temporal information based solely on chronological ordering, neglecting explicit temporal signals that could more effectively represent dynamic user interest over time. To address these limitations, we propose a Hierarchical time-aware Mixture of experts for multi-modal Sequential Recommendation (HM4SR) with a two-level Mixture of Experts (MoE) and a multi-task learning strategy. Specifically, the first MoE, named Interactive MoE, extracts essential user interest-related information from the multi-modal data of each item. Then, the second MoE, termed Temporal MoE, captures user dynamic interests by introducing explicit temporal embeddings from timestamps in modality encoding. To further address data sparsity, we propose three auxiliary supervision tasks: sequence-level category prediction (CP) for item feature understanding, contrastive learning on ID (IDCL) to align sequence context with user interests, and placeholder contrastive learning (PCL) to integrate temporal information with modalities for dynamic interest modeling. Extensive experiments on four public datasets verify the effectiveness of HM4SR compared to several state-of-the-art approaches.
Abstract:We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a single human image using facial expressions and body movements derived from a driving video, that generates realistic, context-aware dynamics for both the subject and the surrounding environment. Building on prior approaches centered on human pose control, X-Dyna addresses key shortcomings causing the loss of dynamic details, enhancing the lifelike qualities of human video animations. At the core of our approach is the Dynamics-Adapter, a lightweight module that effectively integrates reference appearance context into the spatial attentions of the diffusion backbone while preserving the capacity of motion modules in synthesizing fluid and intricate dynamic details. Beyond body pose control, we connect a local control module with our model to capture identity-disentangled facial expressions, facilitating accurate expression transfer for enhanced realism in animated scenes. Together, these components form a unified framework capable of learning physical human motion and natural scene dynamics from a diverse blend of human and scene videos. Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna outperforms state-of-the-art methods, creating highly lifelike and expressive animations. The code is available at https://github.com/bytedance/X-Dyna.




Abstract:We propose a clinical decision support system (CDSS) for mental health diagnosis that combines the strengths of large language models (LLMs) and constraint logic programming (CLP). Having a CDSS is important because of the high complexity of diagnostic manuals used by mental health professionals and the danger of diagnostic errors. Our CDSS is a software tool that uses an LLM to translate diagnostic manuals to a logic program and solves the program using an off-the-shelf CLP engine to query a patient's diagnosis based on the encoded rules and provided data. By giving domain experts the opportunity to inspect the LLM-generated logic program, and making modifications when needed, our CDSS ensures that the diagnosis is not only accurate but also interpretable. We experimentally compare it with two baseline approaches of using LLMs: diagnosing patients using the LLM-only approach, and using the LLM-generated logic program but without expert inspection. The results show that, while LLMs are extremely useful in generating candidate logic programs, these programs still require expert inspection and modification to guarantee faithfulness to the official diagnostic manuals. Additionally, ethical concerns arise from the direct use of patient data in LLMs, underscoring the need for a safer hybrid approach like our proposed method.




Abstract:Evolutionary algorithms (EAs) maintain populations through evolutionary operators to discover diverse solutions for complex tasks while gathering valuable knowledge, such as historical population data and fitness evaluations. However, traditional EAs face challenges in dynamically adapting to expanding knowledge bases, hindering the efficient exploitation of accumulated information and limiting adaptability to new situations. To address these issues, we introduce an Optimization Knowledge Adaptation Evolutionary Model (OKAEM), which features dynamic parameter adjustment using accumulated knowledge to enhance its optimization capabilities. OKAEM employs attention mechanisms to model the interactions among individuals, fitness landscapes, and genetic components separately, thereby parameterizing the evolutionary operators of selection, crossover, and mutation. These powerful learnable operators enable OKAEM to benefit from pre-learned extensive prior knowledge and self-tune with real-time evolutionary insights. Experimental results demonstrate that OKAEM: 1) exploits prior knowledge for significant performance gains across various knowledge transfer settings; 2) achieves competitive performance through self-tuning alone, even without prior knowledge; 3) outperforms state-of-the-art black-box baselines in a vision-language model tuning case; 4) can improve its optimization capabilities with growing knowledge; 5) is capable of emulating principles of natural selection and genetic recombination.




Abstract:Tensor Robust Principal Component Analysis (TRPCA) holds a crucial position in machine learning and computer vision. It aims to recover underlying low-rank structures and characterizing the sparse structures of noise. Current approaches often encounter difficulties in accurately capturing the low-rank properties of tensors and balancing the trade-off between low-rank and sparse components, especially in a mixed-noise scenario. To address these challenges, we introduce a Bayesian framework for TRPCA, which integrates a low-rank tensor nuclear norm prior and a generalized sparsity-inducing prior. By embedding the proposed priors within the Bayesian framework, our method can automatically determine the optimal tensor nuclear norm and achieve a balance between the nuclear norm and sparse components. Furthermore, our method can be efficiently extended to the weighted tensor nuclear norm model. Experiments conducted on synthetic and real-world datasets demonstrate the effectiveness and superiority of our method compared to state-of-the-art approaches.
Abstract:While consumer displays increasingly support more than 10 stops of dynamic range, most image assets such as internet photographs and generative AI content remain limited to 8-bit low dynamic range (LDR), constraining their utility across high dynamic range (HDR) applications. Currently, no generative model can produce high-bit, high-dynamic range content in a generalizable way. Existing LDR-to-HDR conversion methods often struggle to produce photorealistic details and physically-plausible dynamic range in the clipped areas. We introduce LEDiff, a method that enables a generative model with HDR content generation through latent space fusion inspired by image-space exposure fusion techniques. It also functions as an LDR-to-HDR converter, expanding the dynamic range of existing low-dynamic range images. Our approach uses a small HDR dataset to enable a pretrained diffusion model to recover detail and dynamic range in clipped highlights and shadows. LEDiff brings HDR capabilities to existing generative models and converts any LDR image to HDR, creating photorealistic HDR outputs for image generation, image-based lighting (HDR environment map generation), and photographic effects such as depth of field simulation, where linear HDR data is essential for realistic quality.




Abstract:Graph neural architecture search (GNAS) can customize high-performance graph neural network architectures for specific graph tasks or datasets. However, existing GNAS methods begin searching for architectures from a zero-knowledge state, ignoring the prior knowledge that may improve the search efficiency. The available knowledge base (e.g. NAS-Bench-Graph) contains many rich architectures and their multiple performance metrics, such as the accuracy (#Acc) and number of parameters (#Params). This study proposes exploiting such prior knowledge to accelerate the multi-objective evolutionary search on a new graph dataset, named knowledge-aware evolutionary GNAS (KEGNAS). KEGNAS employs the knowledge base to train a knowledge model and a deep multi-output Gaussian process (DMOGP) in one go, which generates and evaluates transfer architectures in only a few GPU seconds. The knowledge model first establishes a dataset-to-architecture mapping, which can quickly generate candidate transfer architectures for a new dataset. Subsequently, the DMOGP with architecture and dataset encodings is designed to predict multiple performance metrics for candidate transfer architectures on the new dataset. According to the predicted metrics, non-dominated candidate transfer architectures are selected to warm-start the multi-objective evolutionary algorithm for optimizing the #Acc and #Params on a new dataset. Empirical studies on NAS-Bench-Graph and five real-world datasets show that KEGNAS swiftly generates top-performance architectures, achieving 4.27% higher accuracy than advanced evolutionary baselines and 11.54% higher accuracy than advanced differentiable baselines. In addition, ablation studies demonstrate that the use of prior knowledge significantly improves the search performance.




Abstract:Realised volatility has become increasingly prominent in volatility forecasting due to its ability to capture intraday price fluctuations. With a growing variety of realised volatility estimators, each with unique advantages and limitations, selecting an optimal estimator may introduce challenges. In this thesis, aiming to synthesise the impact of various realised volatility measures on volatility forecasting, we propose an extension of the Realised GARCH model that incorporates an autoencoder-generated synthetic realised measure, combining the information from multiple realised measures in a nonlinear manner. Our proposed model extends existing linear methods, such as Principal Component Analysis and Independent Component Analysis, to reduce the dimensionality of realised measures. The empirical evaluation, conducted across four major stock markets from January 2000 to June 2022 and including the period of COVID-19, demonstrates both the feasibility of applying an autoencoder to synthesise volatility measures and the superior effectiveness of the proposed model in one-step-ahead rolling volatility forecasting. The model exhibits enhanced flexibility in parameter estimations across each rolling window, outperforming traditional linear approaches. These findings indicate that nonlinear dimension reduction offers further adaptability and flexibility in improving the synthetic realised measure, with promising implications for future volatility forecasting applications.
Abstract:Recently, implicit neural representations (INRs) have attracted increasing attention for multi-dimensional data recovery. However, INRs simply map coordinates via a multi-layer perception (MLP) to corresponding values, ignoring the inherent semantic information of the data. To leverage semantic priors from the data, we propose a novel Superpixel-informed INR (S-INR). Specifically, we suggest utilizing generalized superpixel instead of pixel as an alternative basic unit of INR for multi-dimensional data (e.g., images and weather data). The coordinates of generalized superpixels are first fed into exclusive attention-based MLPs, and then the intermediate results interact with a shared dictionary matrix. The elaborately designed modules in S-INR allow us to ingenuously exploit the semantic information within and across generalized superpixels. Extensive experiments on various applications validate the effectiveness and efficacy of our S-INR compared to state-of-the-art INR methods.




Abstract:With the development of large language models (LLMs), the ability to handle longer contexts has become a key capability for Web applications such as cross-document understanding and LLM-powered search systems. However, this progress faces two major challenges: performance degradation due to sequence lengths out-of-distribution, and excessively long inference times caused by the quadratic computational complexity of attention. These issues hinder the application of LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a model-agnostic, training-free method for efficient and accurate long-context inference. TokenSelect builds upon the observation of non-contiguous attention sparsity, using Query-Key dot products to measure per-head KV Cache criticality at token-level. By per-head soft voting mechanism, TokenSelect selectively involves a small number of critical KV cache tokens in the attention calculation without sacrificing accuracy. To further accelerate TokenSelect, we designed the Selection Cache based on observations of consecutive Query similarity and implemented efficient dot product kernel, significantly reducing the overhead of token selection. A comprehensive evaluation of TokenSelect demonstrates up to 23.84x speedup in attention computation and up to 2.28x acceleration in end-to-end latency, while providing superior performance compared to state-of-the-art long-context inference methods.