Abstract:We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with less assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we also show that our method can generalize to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.
Abstract:We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Traditional navigation systems are ineffective indoors due to the lack of precise location data. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence. We fine-tune the BLIP-2 model with Low Rank Adaptation (LoRA) on a manually annotated indoor navigation dataset. We propose an evaluation metric that refines the BERT F1 score by emphasizing directional and sequential variables, providing a more comprehensive measure of navigational performance. After applying LoRA, the model significantly improved in generating directional instructions, overcoming limitations in the original BLIP-2 model.
Abstract:Cognitive Science has profoundly shaped disciplines such as Artificial Intelligence (AI), Philosophy, Psychology, Neuroscience, Linguistics, and Culture. Many breakthroughs in AI trace their roots to cognitive theories, while AI itself has become an indispensable tool for advancing cognitive research. This reciprocal relationship motivates a comprehensive review of the intersections between AI and Cognitive Science. By synthesizing key contributions from both perspectives, we observe that AI progress has largely emphasized practical task performance, whereas its cognitive foundations remain conceptually fragmented. We argue that the future of AI within Cognitive Science lies not only in improving performance but also in constructing systems that deepen our understanding of the human mind. Promising directions include aligning AI behaviors with cognitive frameworks, situating AI in embodiment and culture, developing personalized cognitive models, and rethinking AI ethics through cognitive co-evaluation.
Abstract:Session-based recommendation (SBR) is mainly based on anonymous user interaction sequences to recommend the items that the next user is most likely to click. Currently, the most popular and high-performing SBR methods primarily leverage graph neural networks (GNNs), which model session sequences as graph-structured data to effectively capture user intent. However, most GNNs-based SBR methods primarily focus on modeling the ID sequence information of session sequences, while neglecting the rich semantic information embedded within them. This limitation significantly hampers model's ability to accurately infer users' true intention. To address above challenge, this paper proposes a novel SBR approach called Integrating LLM-Derived Multi-Semantic Intent into Graph Model for Session-based Recommendation (LLM-DMsRec). The method utilizes a pre-trained GNN model to select the top-k items as candidate item sets and designs prompts along with a large language model (LLM) to infer multi-semantic intents from these candidate items. Specifically, we propose an alignment mechanism that effectively integrates the semantic intent inferred by the LLM with the structural intent captured by GNNs. Extensive experiments conducted on the Beauty and ML-1M datasets demonstrate that the proposed method can be seamlessly integrated into GNNs framework, significantly enhancing its recommendation performance.
Abstract:To reduce channel acquisition overhead, spatial, time, and frequency-domain channel extrapolation techniques have been widely studied. In this paper, we propose a novel deep learning-based Position-domain Channel Extrapolation framework (named PCEnet) for cell-free massive multiple-input multiple-output (MIMO) systems. The user's position, which contains significant channel characteristic information, can greatly enhance the efficiency of channel acquisition. In cell-free massive MIMO, while the propagation environments between different base stations and a specific user vary and their respective channels are uncorrelated, the user's position remains constant and unique across all channels. Building on this, the proposed PCEnet framework leverages the position as a bridge between channels to establish a mapping between the characteristics of different channels, thereby using one acquired channel to assist in the estimation and feedback of others. Specifically, this approach first utilizes neural networks (NNs) to infer the user's position from the obtained channel. {The estimated position, shared among BSs through a central processing unit (CPU)}, is then fed into an NN to design pilot symbols and concatenated with the feedback information to the channel reconstruction NN to reconstruct other channels, thereby significantly enhancing channel acquisition performance. Additionally, we propose a simplified strategy where only the estimated position is used in the reconstruction process without modifying the pilot design, thereby reducing latency. Furthermore, we introduce a position label-free approach that infers the relative user position instead of the absolute position, eliminating the need for ground truth position labels during the localization NN training. Simulation results demonstrate that the proposed PCEnet framework reduces pilot and feedback overheads by up to 50%.
Abstract:Reconfigurable intelligent surfaces (RISs) offer the unique capability to reshape the radio environment, thereby simplifying transmission schemes traditionally contingent on channel conditions. Joint spatial division and multiplexing (JSDM) emerges as a low-overhead transmission scheme for multi-user equipment (UE) scenarios, typically requiring complex matrix decomposition to achieve block-diagonalization of the effective channel matrix. In this study, we introduce an innovative JSDM design that leverages RISs to customize channels, thereby streamlining the overall procedures. By strategically positioning RISs at the discrete Fourier transform (DFT) directions of the base station (BS), we establish orthogonal line-of-sight links within the BS-RIS channel, enabling a straightforward pre-beamforming design. Based on UE grouping, we devise reflected beams of the RIS with optimized directions to mitigate inter-group interference in the RISs-UEs channel. An approximation of the channel cross-correlation coefficient is derived and serves as a foundation for the RISs-UEs association, further diminishing inter-group interference. Numerical results substantiate the efficacy of our RIS-customized JSDM in not only achieving effective channel block-diagonalization but also in significantly enhancing the sum spectral efficiency for multi-UE transmissions.
Abstract:The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.
Abstract:Integrated sensing and communication (ISAC) is a key feature of future cellular systems, enabling applications such as intruder detection, monitoring, and tracking using the same infrastructure. However, its potential for structural health monitoring (SHM), which requires the detection of slow and subtle structural changes, remains largely unexplored due to challenges such as multipath interference and the need for ultra-high sensing precision. This study introduces a novel theoretical framework for SHM via ISAC by leveraging reconfigurable intelligent surfaces (RIS) as reference points in collaboration with base stations and users. By dynamically adjusting RIS phases to generate distinct radio signals that suppress background multipath interference, measurement accuracy at these reference points is enhanced. We theoretically analyze RIS-aided collaborative sensing in three-dimensional cellular networks using Fisher information theory, demonstrating how increasing observation time, incorporating additional receivers (even with self-positioning errors), optimizing RIS phases, and refining collaborative node selection can reduce the position error bound to meet SHM's stringent accuracy requirements. Furthermore, we develop a Bayesian inference model to identify structural states and validate damage detection probabilities. Both theoretical and numerical analyses confirm ISAC's capability for millimeter-level deformation detection, highlighting its potential for high-precision SHM applications.
Abstract:Generative models have demonstrated strong performance in conditional settings and can be viewed as a form of data compression, where the condition serves as a compact representation. However, their limited controllability and reconstruction accuracy restrict their practical application to data compression. In this work, we propose an efficient latent diffusion framework that bridges this gap by combining a variational autoencoder with a conditional diffusion model. Our method compresses only a small number of keyframes into latent space and uses them as conditioning inputs to reconstruct the remaining frames via generative interpolation, eliminating the need to store latent representations for every frame. This approach enables accurate spatiotemporal reconstruction while significantly reducing storage costs. Experimental results across multiple datasets show that our method achieves up to 10 times higher compression ratios than rule-based state-of-the-art compressors such as SZ3, and up to 63 percent better performance than leading learning-based methods under the same reconstruction error.
Abstract:We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent, a phenomenon we call prompt variance. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings. Applied to ten constrained tasks, PBSS reveals consistent, model-specific response shifts, suggesting statistical regularities linked to tokenization and decoding. These results highlight an overlooked dimension of model evaluation stability under rephrasing and suggest that tokenization strategies and decoding dynamics may contribute to post-training quality of service instability.