Topic: Information Extraction
What is Information Extraction? Information extraction is the process of automatically extracting structured information from unstructured text data.
Papers and Code
Apr 29, 2025
Abstract: Deep neural networks have been applied to electromagnetic inverse scattering problems (ISPs) and have shown superior imaging performance, which depends on the training dataset, the network architecture, and the loss function. Here, the quality of data samples is assessed and quantified by a defined quality factor, and the composition of the training dataset is optimized on that basis. The network architecture integrates residual connections and a channel attention mechanism to improve feature extraction. A loss function that incorporates the data-fitting error, physical-information constraints, and desired features of the solution is designed and analyzed to suppress background artifacts and improve reconstruction accuracy. Various numerical analyses demonstrate the superiority of the proposed quality-factor-inspired deep neural network (QuaDNN) solver, and the imaging performance is finally verified by an experimental imaging test.
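The composite objective described above can be illustrated with a small sketch. Below is a minimal, hypothetical PyTorch version of a loss combining a data-fitting term, a physics-consistency term, and a background-suppression term; the term weights, the L1 background penalty, and the user-supplied `forward_op` scattering model are illustrative assumptions, not the paper's actual formulation.

```python
# Hedged sketch of a QuaDNN-style composite loss: data fitting + physics
# consistency + background suppression. Names and weights are assumptions.
import torch

def quadnn_style_loss(pred, target, forward_op, measured_field,
                      w_data=1.0, w_phys=0.1, w_bg=0.01):
    # Data-fitting error between the reconstruction and the reference map.
    data_term = torch.mean((pred - target) ** 2)
    # Physical-information constraint: the reconstruction, pushed through a
    # user-supplied scattering forward model, should reproduce the measurements.
    phys_term = torch.mean((forward_op(pred) - measured_field) ** 2)
    # Desired-feature term: an L1 penalty that discourages background artifacts.
    bg_term = torch.mean(torch.abs(pred))
    return w_data * data_term + w_phys * phys_term + w_bg * bg_term
```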

Apr 30, 2025
Abstract: In action recognition tasks, feature diversity is essential for enhancing model generalization and performance. Existing methods typically promote feature diversity by expanding the training data in the sample space, which often leads to inefficiencies and semantic inconsistencies. To overcome these problems, we propose a novel Coarse-fine text co-guidance Diffusion model (CoCoDiff). CoCoDiff generates diverse yet semantically consistent features in the latent space by leveraging diffusion and multi-granularity textual guidance. Specifically, our approach feeds spatio-temporal features extracted from skeleton sequences into a latent diffusion model to generate diverse action representations. Meanwhile, we introduce a coarse-fine text co-guided strategy that leverages textual information from large language models (LLMs) to ensure semantic consistency between the generated features and the original inputs. Notably, CoCoDiff operates as a plug-and-play auxiliary module during training and incurs no additional inference cost. Extensive experiments demonstrate that CoCoDiff achieves SOTA performance on skeleton-based action recognition benchmarks, including NTU RGB+D, NTU RGB+D 120, and Kinetics-Skeleton.
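To make the plug-and-play training objective concrete, here is a minimal sketch, assuming a PyTorch setting, of an auxiliary denoising loss conditioned on coarse and fine text embeddings. The module names, dimensions, simplified forward (noising) process, and conditioning scheme are assumptions for illustration only.

```python
# Sketch of a text-co-guided auxiliary diffusion loss used only at training time.
import torch
import torch.nn as nn

class TextGuidedDenoiser(nn.Module):
    def __init__(self, feat_dim=256, text_dim=512):
        super().__init__()
        self.cond = nn.Linear(text_dim * 2, feat_dim)      # fuse coarse + fine text
        self.net = nn.Sequential(nn.Linear(feat_dim * 2, feat_dim),
                                 nn.GELU(),
                                 nn.Linear(feat_dim, feat_dim))

    def forward(self, noisy_feat, coarse_txt, fine_txt):
        c = self.cond(torch.cat([coarse_txt, fine_txt], dim=-1))
        return self.net(torch.cat([noisy_feat, c], dim=-1))  # predicted noise

def auxiliary_diffusion_loss(denoiser, feat, coarse_txt, fine_txt, t_scale=0.5):
    noise = torch.randn_like(feat)
    noisy = (1 - t_scale) * feat + t_scale * noise            # simplified noising step
    pred_noise = denoiser(noisy, coarse_txt, fine_txt)
    return torch.mean((pred_noise - noise) ** 2)              # added to the recognition loss
```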

Apr 30, 2025
Abstract: Vision-and-Language Navigation (VLN) is a challenging task in which an agent must understand language instructions and navigate unfamiliar environments using visual cues. The agent must accurately locate the target based on visual information from the environment and complete tasks through interaction with its surroundings. Despite significant advancements in this field, two major limitations persist: (1) many existing methods feed complete language instructions directly into multi-layer Transformer networks without fully exploiting the detailed information within the instructions, limiting the agent's language understanding during task execution; (2) current approaches often overlook the modeling of object relationships across modalities and fail to effectively exploit latent clues between objects, which affects the accuracy and robustness of navigation decisions. We propose a Dual Object Perception-Enhancement Network (DOPE) to address these issues and improve navigation performance. First, we design a Text Semantic Extraction (TSE) module to extract the essential phrases from the text and feed them into the Text Object Perception-Augmentation (TOPA) module to fully leverage details such as objects and actions within the instructions. Second, we introduce an Image Object Perception-Augmentation (IOPA) module, which performs additional modeling of object information across modalities, enabling the model to exploit latent clues between objects in images and text and enhancing decision-making accuracy. Extensive experiments on the R2R and REVERIE datasets validate the efficacy of the proposed approach. A small sketch of the cross-modal object-perception idea follows the publication note below.
* Main paper (10 pages). Accepted for publication at ICMR (International Conference on Multimedia Retrieval) 2025
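The sketch below illustrates, under stated assumptions, the kind of cross-modal object-perception augmentation the abstract describes: visual object tokens attend to object/action phrase embeddings extracted from the instruction. The class name, dimensions, and residual fusion are illustrative, not the paper's actual IOPA/TOPA design.

```python
# Hypothetical cross-modal object-perception module: image object tokens attend
# to instruction phrase embeddings and are fused back with a residual connection.
import torch
import torch.nn as nn

class ObjectPerceptionAugment(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, object_tokens, phrase_tokens):
        # object_tokens: (B, N_obj, dim) visual object features
        # phrase_tokens: (B, N_phr, dim) object/action phrases from the instruction
        attended, _ = self.cross_attn(object_tokens, phrase_tokens, phrase_tokens)
        return self.norm(object_tokens + attended)   # residual fusion
```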

Apr 29, 2025
Abstract: A renaissance in radar-based sensing for mobile robotic applications is underway. Compared to cameras or lidars, millimetre-wave radars have the ability to 'see' through thin walls, vegetation, and adverse weather conditions such as heavy rain, fog, snow, and dust. In this paper, we propose a novel SE(2) odometry approach for spinning frequency-modulated continuous-wave radars. Our method performs scan-to-local-map registration of the incoming radar data in a direct manner, using all the radar intensity information without the need for feature or point-cloud extraction. The method performs locally continuous trajectory estimation and accounts for both motion and Doppler distortion of the radar scans. If the radar possesses a specific frequency modulation pattern that makes radial Doppler velocities observable, an additional Doppler-based constraint is formulated to improve the velocity estimate and enable odometry in geometrically feature-deprived scenarios (e.g., featureless tunnels). Our method has been validated on over 250 km of on-road data sourced from public datasets (Boreas and MulRan) and collected using our automotive platform. With the aid of a gyroscope, it outperforms state-of-the-art methods and achieves an average relative translation error of 0.26% on the Boreas leaderboard. When using data with the appropriate Doppler-enabling frequency modulation pattern, the translation error is reduced to 0.18% in similar environments. We also benchmarked our algorithm on 1.5 hours of data collected with a mobile robot in off-road environments with various levels of structure to demonstrate its versatility. Our real-time implementation is publicly available: https://github.com/utiasASRL/dro.
* Accepted for presentation at RSS 2025
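As a hedged illustration of the Doppler-based constraint mentioned in the abstract: a return at azimuth theta measures a radial velocity of roughly v_x*cos(theta) + v_y*sin(theta), so the 2D sensor velocity can be recovered by least squares over one scan. The sign convention and the standalone least-squares formulation are simplifying assumptions; the paper integrates this constraint into a continuous-trajectory estimator.

```python
# Toy Doppler-velocity constraint: recover planar velocity from per-beam
# radial velocities via least squares (conventions are assumptions).
import numpy as np

def estimate_velocity_from_doppler(azimuths_rad, radial_velocities):
    A = np.stack([np.cos(azimuths_rad), np.sin(azimuths_rad)], axis=1)  # (N, 2)
    v, *_ = np.linalg.lstsq(A, radial_velocities, rcond=None)           # [v_x, v_y]
    return v

# Synthetic example: sensor moving at 2 m/s forward, noisy Doppler readings.
theta = np.linspace(0, 2 * np.pi, 100)
v_r = 2.0 * np.cos(theta) + 0.05 * np.random.randn(100)
print(estimate_velocity_from_doppler(theta, v_r))   # approximately [2.0, 0.0]
```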

Apr 29, 2025
Abstract: To support the Low Altitude Economy (LAE), unmanned aerial vehicles (UAVs) require precise localization in urban areas where global positioning system (GPS) signals are unavailable. Vision-based methods offer a viable alternative but face severe bandwidth, memory, and processing constraints on lightweight UAVs. Inspired by mammalian spatial cognition, we propose a task-oriented communication framework in which UAVs equipped with multi-camera systems extract compact multi-view features and offload localization tasks to edge servers. We introduce the Orthogonally-constrained Variational Information Bottleneck encoder (O-VIB), which incorporates automatic relevance determination (ARD) to prune non-informative features while enforcing orthogonality to minimize redundancy. This enables efficient and accurate localization with minimal transmission cost. Extensive evaluation on a dedicated LAE UAV dataset shows that O-VIB achieves high-precision localization under stringent bandwidth budgets. Code and dataset will be made publicly available: github.com/fangzr/TOC-Edge-Aerial.
* Code and dataset will be made publicly available: https://github.com/fangzr/TOC-Edge-Aerial
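The two regularizers named in the abstract, ARD-style pruning and an orthogonality constraint, can be sketched as below. This is a hedged toy encoder, assuming PyTorch; the shapes, the Gaussian ARD prior with learnable per-dimension variances, and the Frobenius-norm orthogonality penalty are assumptions, not the published O-VIB design.

```python
# Hypothetical O-VIB-style encoder sketch: reparameterized bottleneck code,
# ARD KL term (dimensions with large prior variance are effectively pruned),
# and an orthogonality penalty on the projection rows to reduce redundancy.
import torch
import torch.nn as nn

class OVIBEncoderSketch(nn.Module):
    def __init__(self, in_dim=1024, code_dim=64):
        super().__init__()
        self.proj = nn.Linear(in_dim, code_dim, bias=False)
        self.logvar = nn.Linear(in_dim, code_dim)
        self.prior_log_alpha = nn.Parameter(torch.zeros(code_dim))  # ARD prior log-variances

    def forward(self, x):
        mu, logvar = self.proj(x), self.logvar(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)     # reparameterization
        prior_var = torch.exp(self.prior_log_alpha)
        kl = 0.5 * torch.mean(self.prior_log_alpha - logvar
                              + (torch.exp(logvar) + mu ** 2) / prior_var - 1.0)
        W = self.proj.weight
        I = torch.eye(W.shape[0], device=W.device)
        ortho = ((W @ W.t() - I) ** 2).sum()                        # Frobenius-norm penalty
        return z, kl, ortho
```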

May 01, 2025
Abstract: Traditional Chinese medicine (TCM), as an essential component of traditional medicine, contains active ingredients that serve as a crucial source for modern drug development, holding immense therapeutic potential and development value. A multi-layered, complex network linking Chinese medicine ingredients to diseases is constructed and used to predict potential ingredient-disease associations. This study proposes an ingredient-disease association prediction model (Node2Vec-DGI-EL) based on hierarchical graph representation learning. First, the model uses the Node2Vec algorithm to extract node embedding vectors from the network as the initial node features. Next, the network nodes are further represented and learned using the DGI algorithm to enhance the model's expressive power. To improve prediction accuracy and robustness, an ensemble learning method is incorporated to achieve more accurate ingredient-disease association predictions. The effectiveness of the model is then evaluated through a series of theoretical verifications. The results show that the proposed model significantly outperforms existing methods, achieving an AUC of 0.9987 and an AUPR of 0.9545, indicating superior predictive capability. Ablation experiments further reveal the contribution and importance of each module. Additionally, case studies explored potential associations, such as triptonide with hypertensive retinopathy and methyl ursolate with colorectal cancer, and molecular docking experiments validated these findings, showing that the triptonide-PGR and methyl ursolate-NFE2L2 interactions form stable bindings. In conclusion, the Node2Vec-DGI-EL model focuses on TCM datasets and effectively predicts ingredient-disease associations, overcoming the reliance on node semantic information.
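The final ensemble-learning stage can be sketched as follows. This minimal example uses random placeholder embeddings standing in for the Node2Vec + DGI representations and scores an ingredient-disease pair with a soft-voting ensemble; the classifier choices, concatenated pair features, and synthetic labels are assumptions for illustration.

```python
# Hedged sketch of ensemble scoring of ingredient-disease pairs from node embeddings.
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))                      # placeholder node embeddings
pairs = rng.integers(0, 200, size=(500, 2))           # (ingredient, disease) index pairs
X = np.hstack([emb[pairs[:, 0]], emb[pairs[:, 1]]])   # pair feature = concatenation
y = rng.integers(0, 2, size=500)                      # placeholder association labels

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("gb", GradientBoostingClassifier()),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft")
ensemble.fit(X, y)
print(ensemble.predict_proba(X[:3])[:, 1])            # predicted association scores
```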

Apr 30, 2025
Abstract: With the development of distributed systems, microservices and cloud-native technologies have become central to modern enterprise software development. Despite bringing significant advantages, these technologies also increase system complexity and operational challenges. Traditional root cause analysis (RCA) struggles to achieve automated fault response and relies heavily on manual intervention. In recent years, large language models (LLMs) have made breakthroughs in contextual inference and domain knowledge integration, providing new solutions for Artificial Intelligence for IT Operations (AIOps). However, existing LLM-based approaches face three key challenges: text input constraints, dynamic service-dependency hallucinations, and context window limitations. To address these issues, we propose TAMO, a tool-assisted LLM agent with multi-modality observation data for fine-grained RCA. It unifies multi-modal observational data into time-aligned representations to extract consistent features, and employs specialized root cause localization and fault classification tools to perceive the contextual environment. This approach overcomes the limitations of LLMs in handling real-time changing service dependencies and raw observational data, and guides the LLM to generate repair strategies aligned with system contexts by structuring key information into a prompt. Experimental results show that TAMO performs well in root cause analysis on public datasets characterized by heterogeneity and common fault types, demonstrating its effectiveness.
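As a hedged illustration of the "structure key information into a prompt" step described above, the sketch below serializes outputs of hypothetical root-cause-localization and fault-classification tools into a compact context for the LLM. The field names and template are assumptions, not the paper's actual prompt format.

```python
# Toy prompt builder: tool outputs -> structured context for repair-strategy generation.
def build_rca_prompt(suspect_services, fault_type, recent_alerts):
    lines = [
        "You are an operations assistant. Propose a repair strategy.",
        f"Suspected root-cause services (ranked): {', '.join(suspect_services)}",
        f"Fault classification from tooling: {fault_type}",
        "Recent alerts:",
    ]
    lines += [f"- {a}" for a in recent_alerts]
    return "\n".join(lines)

print(build_rca_prompt(["checkout-svc", "payment-db"],
                       "resource exhaustion (memory)",
                       ["OOMKilled on payment-db pod",
                        "latency spike on checkout-svc"]))
```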

Apr 24, 2025
Abstract: Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection of information extraction and model interpretability. MRE aims to leverage the mutual understanding between tasks of different granularities, enhancing the performance of both coarse-grained and fine-grained tasks through joint modeling. While MRE has been explored and validated in the textual domain, its applicability to visual and multimodal domains remains unexplored. In this work, we extend MRE to the multimodal information extraction domain for the first time. Specifically, we introduce a new task: Multimodal Mutual Reinforcement Effect (M-MRE), and construct a corresponding dataset to support this task. To address the challenges posed by M-MRE, we further propose a Prompt Format Adapter (PFA) that is fully compatible with various Large Vision-Language Models (LVLMs). Experimental results demonstrate that MRE can also be observed in the M-MRE task, a multimodal text-image understanding scenario. This provides strong evidence that MRE facilitates mutual gains across three interrelated tasks, confirming its generalizability beyond the textual domain.
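A hypothetical sketch of a prompt-format adapter in the spirit of the PFA: a single prompt asks an LVLM to answer the coarse-grained and fine-grained tasks jointly so that the two predictions can reinforce each other. The template, task wording, and output schema are assumptions; the paper's actual format is not reproduced here.

```python
# Toy prompt-format adapter pairing a coarse-grained and a fine-grained task.
def format_mmre_prompt(coarse_task, fine_task):
    return (
        "Answer both tasks and keep them mutually consistent.\n"
        f"Task A (coarse-grained): {coarse_task}\n"
        f"Task B (fine-grained): {fine_task}\n"
        "Output JSON with keys 'task_a' and 'task_b'."
    )

print(format_mmre_prompt(
    "Classify the overall topic of the image and its accompanying text.",
    "List the named entities visible in the image or mentioned in the text, with types."))
```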

Apr 28, 2025
Abstract: Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on the LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and k-values, (iii) a full-context approach that processes the entire conversation history, (iv) an open-source memory solution, (v) a proprietary model system, and (vi) a dedicated memory management platform. Empirical results show that our methods consistently outperform all existing memory systems across four question categories: single-hop, temporal, multi-hop, and open-domain. Notably, Mem0 achieves a 26% relative improvement in the LLM-as-a-Judge metric over OpenAI, while Mem0 with graph memory achieves an overall score around 2% higher than the base configuration. Beyond accuracy gains, we also markedly reduce computational overhead compared to the full-context method. In particular, Mem0 attains a 91% lower p95 latency and saves more than 90% of token cost, offering a compelling balance between advanced reasoning capabilities and practical deployment constraints. Our findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient LLM-driven AI agents.
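The extract/consolidate/retrieve loop described above can be illustrated with a toy sketch. The real system uses an LLM for salience extraction and consolidation; here a keyword-overlap heuristic stands in so the control flow is runnable. Class and method names are illustrative and are not the Mem0 API.

```python
# Toy memory store: extract candidate facts, consolidate near-duplicates,
# retrieve the most relevant facts for a query by keyword overlap.
class MemoryStoreSketch:
    def __init__(self):
        self.facts = []

    def extract_and_consolidate(self, turn):
        # Placeholder "extraction": keep the turn as a candidate fact and skip
        # near-duplicates (a stand-in for LLM-based consolidation).
        fact = turn.strip()
        if fact and all(self._overlap(fact, f) < 0.8 for f in self.facts):
            self.facts.append(fact)

    def retrieve(self, query, k=3):
        ranked = sorted(self.facts, key=lambda f: self._overlap(query, f), reverse=True)
        return ranked[:k]

    @staticmethod
    def _overlap(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(1, len(wa | wb))

mem = MemoryStoreSketch()
mem.extract_and_consolidate("My sister lives in Toronto.")
mem.extract_and_consolidate("I am allergic to peanuts.")
print(mem.retrieve("Where does the user's sister live?"))
```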

Apr 30, 2025
Abstract: To efficiently compress the sign information of images, we address a sign retrieval problem for the block-wise discrete cosine transform (DCT): reconstruction of the signs of DCT coefficients from their amplitudes. To this end, we propose a fast sign retrieval method based on binary classification machine learning. We first introduce 3D representations of the amplitudes and signs, where we pack amplitudes/signs belonging to the same frequency band into a 2D slice, referred to as a sub-band block. We then retrieve the signs from the 3D amplitudes via binary classification, where each sign is regarded as a binary label. We implement the binary classification algorithm using convolutional neural networks, which are well suited to efficiently extracting features from the 3D amplitudes. Experimental results demonstrate that our method achieves accurate sign retrieval at a remarkably low computational cost.
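The 3D sub-band packing described above can be sketched as follows: for an image cut into 8x8 blocks, the DCT coefficient at a given frequency is gathered from every block into one 2D slice, yielding a 64-channel volume of amplitudes (classifier input) and a matching volume of binary sign labels. The 8x8 block size and JPEG-style tiling are assumptions consistent with the description, not necessarily the paper's exact configuration.

```python
# Hedged sketch of packing DCT amplitudes and sign labels into sub-band volumes.
import numpy as np
from scipy.fft import dctn

def pack_subbands(image, bs=8):
    h, w = image.shape
    gh, gw = h // bs, w // bs
    amps = np.zeros((bs * bs, gh, gw), dtype=np.float32)
    signs = np.zeros((bs * bs, gh, gw), dtype=np.float32)
    for i in range(gh):
        for j in range(gw):
            block = image[i * bs:(i + 1) * bs, j * bs:(j + 1) * bs]
            coeffs = dctn(block, norm="ortho").reshape(-1)
            amps[:, i, j] = np.abs(coeffs)
            signs[:, i, j] = (coeffs >= 0).astype(np.float32)  # binary sign labels
    return amps, signs

amps, signs = pack_subbands(np.random.rand(64, 64))
print(amps.shape, signs.shape)   # (64, 8, 8) each: one slice per frequency band
```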
