Abstract: The Visual Question Answering (VQA) task requires simultaneous understanding of image content and question semantics. However, existing methods often struggle with complex reasoning scenarios because their cross-modal interaction is insufficient and they fail to capture the spatial relationships between entities in the image \cite{huang2023adaptive}\cite{liu2021comparing}\cite{guibas2021adaptive}\cite{zhang2022vsa}. We study a new approach that replaces the attention mechanism in order to strengthen the model's reasoning ability and its understanding of spatial relationships. Specifically, we propose a dynamic bidirectional spatial tower that observes the image in four layers, following the principles of human Gestalt perception. This provides a strong structural prior for the spatial organization of entities, so that the model no longer searches blindly for relationships between pixels but instead makes judgments based on more meaningful perceptual units. The model thus shifts from "seeing images" to "perceiving and organizing image content". Extensive experiments show that our module can be integrated into other multimodal models and achieves strong results, demonstrating its potential for spatial relationship processing. Moreover, July, the multimodal visual question-answering model trained with our method, achieves state-of-the-art results with only 3B parameters, particularly on spatial-relation question answering.
Abstract: Inverse-designed nanophotonic devices offer promising solutions for analog optical computation. High-density photonic integration is critical for scaling such architectures toward more complex computational tasks and large-scale applications. Here, we present an inverse-designed photonic neural network (PNN) accelerator on a high-index-contrast material platform, enabling ultra-compact and energy-efficient optical computing. Our approach introduces a wave-based inverse-design method built on three-dimensional finite-difference time-domain (3D-FDTD) simulations, exploiting the linearity of Maxwell's equations to reconstruct arbitrary spatial fields through optical coherence. Because the forward pass is decoupled into linearly separable simulations, the method is highly amenable to computational parallelism and well suited to acceleration on graphics processing units (GPUs) and other parallel computing platforms, which improves scalability across large problem domains. We fabricate and experimentally validate two inverse-designed PNN accelerators on the silicon-on-insulator platform, achieving on-chip MNIST and MedNIST classification accuracies of 89% and 90%, respectively, within ultra-compact footprints of just 20 $\times$ 20 $\mu$m$^{2}$ and 30 $\times$ 20 $\mu$m$^{2}$. Our results establish a scalable and energy-efficient platform for analog photonic computing, bridging inverse nanophotonic design with high-performance optical information processing.
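The linear-separability idea in the abstract can be illustrated with a minimal sketch. It is not the authors' 3D-FDTD pipeline: the random complex matrix `T` and the function `simulate` are hypothetical stand-ins for a linear device and a single full-wave simulation, and the channel counts are arbitrary. The sketch only shows that, for a linear system, the field produced by an arbitrary coherent input equals the weighted sum of fields from independent single-channel runs, which is what makes the forward pass easy to parallelize.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3  # assumed numbers of input channels and output monitors

# Hypothetical stand-in for a linear photonic device: a fixed complex transfer matrix.
T = rng.normal(size=(n_out, n_in)) + 1j * rng.normal(size=(n_out, n_in))


def simulate(input_amplitudes):
    """Stand-in for one full-wave simulation: complex output field for a given input."""
    return T @ input_amplitudes


# One independent run per input channel; these runs do not depend on each other,
# so they could be dispatched in parallel.
basis_fields = np.stack(
    [simulate(np.eye(n_in)[k]) for k in range(n_in)], axis=1
)

# Arbitrary coherent input with complex amplitudes (magnitude and phase per channel).
x = rng.normal(size=n_in) + 1j * rng.normal(size=n_in)

# Superposition: weight each precomputed basis field by its input amplitude and sum.
superposed = basis_fields @ x

# Reference: simulate the full input directly.
direct = simulate(x)

max_error = np.max(np.abs(superposed - direct))
intensity = np.abs(superposed) ** 2  # detected power is nonlinear in the field

print("max |superposed - direct| =", max_error)
```

The superposition must be applied to complex fields, not intensities: the detected power `|E|^2` includes interference terms between channels, so summing per-channel intensities would give a different result.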