The explosive growth of computation and energy cost of artificial intelligence has spurred strong interests in new computing modalities as potential alternatives to conventional electronic processors. Photonic processors that execute operations using photons instead of electrons, have promised to enable optical neural networks with ultra-low latency and power consumption. However, existing optical neural networks, limited by the underlying network designs, have achieved image recognition accuracy much lower than state-of-the-art electronic neural networks. In this work, we close this gap by introducing a large-kernel spatially-varying convolutional neural network learned via low-dimensional reparameterization techniques. We experimentally instantiate the network with a flat meta-optical system that encompasses an array of nanophotonic structures designed to induce angle-dependent responses. Combined with an extremely lightweight electronic backend with approximately 2K parameters we demonstrate a nanophotonic neural network reaches 73.80\% blind test classification accuracy on CIFAR-10 dataset, and, as such, the first time, an optical neural network outperforms the first modern digital neural network -- AlexNet (72.64\%) with 57M parameters, bringing optical neural network into modern deep learning era.
Today's commodity camera systems rely on compound optics to map light originating from the scene to positions on the sensor where it gets recorded as an image. To record images without optical aberrations, i.e., deviations from Gauss' linear model of optics, typical lens systems introduce increasingly complex stacks of optical elements which are responsible for the height of existing commodity cameras. In this work, we investigate \emph{flat nanophotonic computational cameras} as an alternative that employs an array of skewed lenslets and a learned reconstruction approach. The optical array is embedded on a metasurface that, at 700~nm height, is flat and sits on the sensor cover glass at 2.5~mm focal distance from the sensor. To tackle the highly chromatic response of a metasurface and design the array over the entire sensor, we propose a differentiable optimization method that continuously samples over the visible spectrum and factorizes the optical modulation for different incident fields into individual lenses. We reconstruct a megapixel image from our flat imager with a \emph{learned probabilistic reconstruction} method that employs a generative diffusion model to sample an implicit prior. To tackle \emph{scene-dependent aberrations in broadband}, we propose a method for acquiring paired captured training data in varying illumination conditions. We assess the proposed flat camera design in simulation and with an experimental prototype, validating that the method is capable of recovering images from diverse scenes in broadband with a single nanophotonic layer.
The Visual Turing Test is the ultimate goal to evaluate the realism of holographic displays. Previous studies have focused on addressing challenges such as limited \'etendue and image quality over a large focal volume, but they have not investigated the effect of pupil sampling on the viewing experience in full 3D holograms. In this work, we tackle this problem with a novel hologram generation algorithm motivated by matching the projection operators of incoherent Light Field and coherent Wigner Function light transport. To this end, we supervise hologram computation using synthesized photographs, which are rendered on-the-fly using Light Field refocusing from stochastically sampled pupil states during optimization. The proposed method produces holograms with correct parallax and focus cues, which are important for passing the Visual Turing Test. We validate that our approach compares favorably to state-of-the-art CGH algorithms that use Light Field and Focal Stack supervision. Our experiments demonstrate that our algorithm significantly improves the realism of the viewing experience for a variety of different pupil states.
Understanding the ambient scene is imperative for several applications such as autonomous driving and navigation. While obtaining real-world image data with per-pixel labels is challenging, existing accurate synthetic image datasets primarily focus on indoor spaces with fixed lighting and scene participants, thereby severely limiting their application to outdoor scenarios. In this work we introduce OmniHorizon, a synthetic dataset with 24,335 omnidirectional views comprising of a broad range of indoor and outdoor spaces consisting of buildings, streets, and diverse vegetation. Our dataset also accounts for dynamic scene components including lighting, different times of a day settings, pedestrians, and vehicles. Furthermore, we also demonstrate a learned synthetic-to-real cross-domain inference method for in-the-wild 3D scene depth and normal estimation method using our dataset. To this end, we propose UBotNet, an architecture based on a UNet and a Bottleneck Transformer, to estimate scene-consistent normals. We show that UBotNet achieves significantly improved depth accuracy (4.6%) and normal estimation (5.75%) compared to several existing networks such as U-Net with skip-connections. Finally, we demonstrate in-the-wild depth and normal estimation on real-world images with UBotNet trained purely on our OmniHorizon dataset, showing the promise of proposed dataset and network for scene understanding.
A large portion of today's world population suffer from vision impairments and wear prescription eyeglasses. However, eyeglasses causes additional bulk and discomfort when used with augmented and virtual reality headsets, thereby negatively impacting the viewer's visual experience. In this work, we remedy the usage of prescription eyeglasses in Virtual Reality (VR) headsets by shifting the optical complexity completely into software and propose a prescription-aware rendering approach for providing sharper and immersive VR imagery. To this end, we develop a differentiable display and visual perception model encapsulating display-specific parameters, color and visual acuity of human visual system and the user-specific refractive errors. Using this differentiable visual perception model, we optimize the rendered imagery in the display using stochastic gradient-descent solvers. This way, we provide prescription glasses-free sharper images for a person with vision impairments. We evaluate our approach on various displays, including desktops and VR headsets, and show significant quality and contrast improvements for users with vision impairments.
Computer-generated holographic (CGH) displays show great potential and are emerging as the next-generation displays for augmented and virtual reality, and automotive heads-up displays. One of the critical problems harming the wide adoption of such displays is the presence of speckle noise inherent to holography, that compromises its quality by introducing perceptible artifacts. Although speckle noise suppression has been an active research area, the previous works have not considered the perceptual characteristics of the Human Visual System (HVS), which receives the final displayed imagery. However, it is well studied that the sensitivity of the HVS is not uniform across the visual field, which has led to gaze-contingent rendering schemes for maximizing the perceptual quality in various computer-generated imagery. Inspired by this, we present the first method that reduces the "perceived speckle noise" by integrating foveal and peripheral vision characteristics of the HVS, along with the retinal point spread function, into the phase hologram computation. Specifically, we introduce the anatomical and statistical retinal receptor distribution into our computational hologram optimization, which places a higher priority on reducing the perceived foveal speckle noise while being adaptable to any individual's optical aberration on the retina. Our method demonstrates superior perceptual quality on our emulated holographic display. Our evaluations with objective measurements and subjective studies demonstrate a significant reduction of the human perceived noise.
Traditional high-quality 3D graphics requires large volumes of fine-detailed scene data for rendering. This demand compromises computational efficiency and local storage resources. Specifically, it becomes more concerning for future wearable and portable virtual and augmented reality (VR/AR) displays. Recent approaches to combat this problem include remote rendering/streaming and neural representations of 3D assets. These approaches have redefined the traditional local storage-rendering pipeline by distributed computing or compression of large data. However, these methods typically suffer from high latency or low quality for practical visualization of large immersive virtual scenes, notably with extra high resolution and refresh rate requirements for VR applications such as gaming and design. Tailored for the future portable, low-storage, and energy-efficient VR platforms, we present the first gaze-contingent 3D neural representation and view synthesis method. We incorporate the human psychophysics of visual- and stereo-acuity into an egocentric neural representation of 3D scenery. Furthermore, we jointly optimize the latency/performance and visual quality, while mutually bridging human perception and neural scene synthesis, to achieve perceptually high-quality immersive interaction. Both objective analysis and subjective study demonstrate the effectiveness of our approach in significantly reducing local storage volume and synthesis latency (up to 99% reduction in both data size and computational time), while simultaneously presenting high-fidelity rendering, with perceptual quality identical to that of fully locally stored and rendered high-quality imagery.