Brain signal visualization has emerged as an active research area, serving as a critical interface between the human visual system and computer vision models. Although diffusion models have shown promise in analyzing functional magnetic resonance imaging (fMRI) data, including reconstructing high-quality images consistent with the original visual stimuli, their accuracy in extracting semantic and silhouette information from brain signals remains limited. To address this limitation, we propose a novel approach, the Controllable Mind Visual Diffusion Model (CMVDM). CMVDM extracts semantic and silhouette information from fMRI data using attribute alignment and assistant networks. Additionally, a residual block is incorporated to capture information beyond semantic and silhouette features. We then leverage a control model to fully exploit the extracted information for image synthesis, resulting in generated images that closely resemble the visual stimuli in both semantics and silhouette. Through extensive experiments, we demonstrate that CMVDM outperforms existing state-of-the-art methods both qualitatively and quantitatively.
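To make the described conditioning pipeline concrete, here is a minimal sketch of mapping an fMRI vector to a semantic condition, a silhouette condition, and a residual term that could feed a control branch of a diffusion model; all module names and dimensions are hypothetical illustrations, not CMVDM's actual architecture.

```python
import torch
import torch.nn as nn

class FMRIConditioner(nn.Module):
    """Hedged sketch: map an fMRI vector to semantic and silhouette
    conditions plus a residual term (names and dims are assumptions)."""
    def __init__(self, fmri_dim=4096, sem_dim=768, sil_hw=64):
        super().__init__()
        self.semantic_head = nn.Linear(fmri_dim, sem_dim)   # aligned to a text/image embedding space
        self.silhouette_head = nn.Sequential(               # coarse spatial layout map
            nn.Linear(fmri_dim, sil_hw * sil_hw),
            nn.Unflatten(1, (1, sil_hw, sil_hw)))
        self.residual_head = nn.Linear(fmri_dim, sem_dim)   # information beyond the two attributes

    def forward(self, fmri):
        sem = self.semantic_head(fmri)
        sil = self.silhouette_head(fmri)
        res = self.residual_head(fmri)
        return sem + res, sil                               # conditions for a control branch

fmri = torch.randn(2, 4096)
semantic_cond, silhouette_cond = FMRIConditioner()(fmri)
print(semantic_cond.shape, silhouette_cond.shape)  # (2, 768), (2, 1, 64, 64)
```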
Click-based interactive segmentation enables productive pixel-level annotation and image editing with simple user clicks, yet target ambiguity remains an obstacle to precise segmentation. That is, in scenes with rich context, one click may refer to multiple potential targets with corresponding masks, while most interactive segmentors can only generate a single mask and fail to capture the rich context. To resolve target ambiguity, we propose PiClick, which produces semantically diversified masks. PiClick leverages a transformer network design in which mutually interactive mask queries are integrated to infuse target priors. Moreover, a Target Reasoning Module in PiClick automatically infers the best-matched mask from all proposals, significantly relieving target ambiguity and reducing extra human intervention. Extensive experiments on 9 interactive segmentation datasets demonstrate not only the state-of-the-art segmentation performance of PiClick but also its ability to reduce human intervention through multiple-proposal generation and target reasoning. To promote direct usage and future endeavors, we release the source code of PiClick together with a plug-and-play annotation tool at https://github.com/cilinyan/PiClick.
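The proposal-scoring idea can be illustrated with mutually interactive mask queries followed by a scoring head that selects the best-matched proposal; the module below and its dimensions are illustrative assumptions, not PiClick's actual Target Reasoning Module.

```python
import torch
import torch.nn as nn

class TargetReasoner(nn.Module):
    """Hedged sketch: given per-query mask embeddings, let the queries
    interact, then score which of K candidate masks best matches the
    clicked target (dimensions are assumptions)."""
    def __init__(self, embed_dim=256, num_queries=6):
        super().__init__()
        # queries attend to one another so proposals stay diverse yet mutually aware
        self.interaction = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.score_head = nn.Linear(embed_dim, 1)

    def forward(self, query_embeds):                 # (B, K, C)
        q, _ = self.interaction(query_embeds, query_embeds, query_embeds)
        scores = self.score_head(q).squeeze(-1)      # (B, K) matching scores
        return scores.argmax(dim=-1), scores         # index of best-matched proposal

best_idx, scores = TargetReasoner()(torch.randn(1, 6, 256))
```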
Fourier ptychographic microscopy (FPM) is a computational technique that achieves imaging with a large space-bandwidth product. It addresses the challenge of balancing a large field of view against high resolution by fusing information from multiple images taken under varying illumination angles. Nevertheless, the conventional FPM framework suffers from long acquisition times and a heavy computational burden. In this paper, we propose a novel physical neural network that generates an adaptive illumination mode by incorporating temporally-encoded illumination modes as a distinct layer, aiming to improve acquisition and computation efficiency. Both simulations and experiments have been conducted to validate the feasibility and effectiveness of the proposed method. It is worth mentioning that, unlike previous works that obtain the intensity of a multiplexed illumination by post-combining sequentially illuminated low-resolution images, our experimental data is captured directly by turning on multiple LEDs with a coded illumination pattern. Our method exhibits state-of-the-art performance in terms of both detail fidelity and imaging speed when assessed across a multitude of evaluative aspects.
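The idea of treating the illumination code as a trainable network layer can be sketched as follows; the sigmoid relaxation of binary LED states, the incoherent-sum forward model, and all dimensions are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CodedIlluminationLayer(nn.Module):
    """Hedged sketch: treat the LED on/off pattern as trainable weights so
    the illumination code can be optimized jointly with reconstruction.
    single_led_intensities holds the low-res image each LED alone would
    produce; a real system captures the coded sum directly."""
    def __init__(self, n_led=25, n_patterns=4):
        super().__init__()
        # logits -> (0, 1) weights approximating binary LED states
        self.pattern_logits = nn.Parameter(torch.randn(n_patterns, n_led))

    def forward(self, single_led_intensities):        # (N_led, H, W)
        weights = torch.sigmoid(self.pattern_logits)  # (P, N_led)
        # incoherent sum over LEDs for each coded pattern
        return torch.einsum('pn,nhw->phw', weights, single_led_intensities)

coded = CodedIlluminationLayer()(torch.rand(25, 64, 64))
print(coded.shape)  # torch.Size([4, 64, 64])
```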
CLIP (Contrastive Language-Image Pretraining) is well developed for open-vocabulary zero-shot image-level recognition, while its application to pixel-level tasks is less investigated: most efforts directly adopt CLIP features without deliberate adaptation. In this work, we first demonstrate the necessity of image-to-pixel CLIP feature adaptation, then propose Multi-View Prompt learning (MVP-SEG) as an effective solution that achieves this adaptation and solves open-vocabulary semantic segmentation. Concretely, MVP-SEG learns multiple prompts trained with our Orthogonal Constraint Loss (OCLoss), by which each prompt is supervised to exploit CLIP features on a different object part, and the segmentation masks generated collaboratively by all prompts promote better segmentation. Moreover, MVP-SEG introduces Global Prompt Refining (GPR) to further eliminate class-wise segmentation noise. Experiments show that the multi-view prompts learned on seen categories generalize strongly to unseen categories, and that MVP-SEG+, which adds a knowledge transfer stage, significantly outperforms previous methods on several benchmarks. Qualitative results further confirm that MVP-SEG does lead to better focus on different local parts.
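The orthogonality idea behind OCLoss can be illustrated with a simple penalty on pairwise cosine similarities between learned prompt embeddings, pushing each prompt toward a different direction in feature space; the exact form of the published loss may differ, so treat this as a hedged sketch.

```python
import torch

def orthogonal_constraint_loss(prompts):
    """Hedged sketch of an orthogonality penalty over K learned prompt
    embeddings of shape (K, D): drive pairwise cosine similarities toward
    zero so each prompt attends to a different object part."""
    p = torch.nn.functional.normalize(prompts, dim=-1)
    gram = p @ p.t()                                  # (K, K) cosine similarities
    off_diag = gram - torch.eye(gram.size(0))         # ignore self-similarity
    return off_diag.pow(2).sum() / (gram.size(0) * (gram.size(0) - 1))

loss = orthogonal_constraint_loss(torch.randn(4, 512))
```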
Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories, and thus lacks the generalization ability to handle novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1,212 diverse categories, surpassing the category size of existing datasets by more than an order of magnitude. Third, we propose an efficient Memory-Induced Vision-Language Transformer, MindVLT, which is the first to achieve Open-Vocabulary VIS in an end-to-end manner with near-real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of MindVLT on novel categories. We will release the dataset and code to facilitate future endeavors.
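One way a memory-induced design can help open-vocabulary classification is by accumulating per-track embeddings across frames before matching against the text embeddings of a large vocabulary; the momentum update below is purely an illustrative assumption, not MindVLT's actual mechanism.

```python
import torch

class CategoryMemory:
    """Hedged sketch: a memory that accumulates per-track embeddings over
    frames, then classifies each track against text embeddings of an open
    vocabulary (the momentum update is an assumption)."""
    def __init__(self, momentum=0.9):
        self.momentum, self.bank = momentum, {}

    def update(self, track_id, embed):
        old = self.bank.get(track_id)
        self.bank[track_id] = embed if old is None else \
            self.momentum * old + (1 - self.momentum) * embed

    def classify(self, track_id, text_embeds):        # text_embeds: (C, D)
        v = torch.nn.functional.normalize(self.bank[track_id], dim=-1)
        t = torch.nn.functional.normalize(text_embeds, dim=-1)
        return (t @ v).argmax().item()                # best-matching category index

mem = CategoryMemory()
mem.update(0, torch.randn(512))
print(mem.classify(0, torch.randn(1212, 512)))        # 1,212 categories as in LV-VIS
```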
Online camera-to-ground calibration generates a non-rigid transformation between the camera and the road surface in real time. Existing solutions rely on static calibration and therefore suffer from environmental variations such as tire pressure changes, vehicle loading variations, and road surface diversity. Other online solutions exploit road elements or the photometric consistency between overlapping views across images, which requires continuous detection of specific targets on the road or assistance from multiple cameras to facilitate calibration. In our work, we propose an online monocular camera-to-ground calibration solution that does not rely on any specific targets while driving. We perform a coarse-to-fine approach to ground feature extraction through wheel odometry and estimate the camera-to-ground calibration parameters through sliding-window-based factor graph optimization. Considering the non-rigid camera-to-ground transformation while driving, we provide metrics to quantify calibration performance and stopping criteria to report/broadcast satisfactory calibration results. Extensive experiments using real-world data demonstrate that our algorithm is effective and outperforms state-of-the-art techniques.
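As a flavor of one factor such an optimization might include, the sketch below fits ground-plane parameters (pitch, roll, camera height) to 3D ground features by nonlinear least squares; the parameterization and small-angle normal construction are assumptions, and the paper's sliding-window factor graph involves more than this single factor type.

```python
import numpy as np
from scipy.optimize import least_squares

def plane_residuals(params, points_cam):
    """Hedged sketch of one factor: signed point-to-plane distances of 3D
    ground points (camera frame) against a plane parameterized by pitch,
    roll, and camera height (small-angle construction is an assumption)."""
    pitch, roll, height = params
    n = np.array([np.sin(roll), np.cos(pitch) * np.cos(roll), np.sin(pitch)])
    n /= np.linalg.norm(n)                   # unit ground-plane normal
    return points_cam @ n - height           # residuals to minimize

# synthetic ground points roughly 1.5 m below a nearly level camera
pts = np.random.rand(100, 3) * [10, 0.01, 10] + [0, 1.5, 0]
sol = least_squares(plane_residuals, x0=[0.0, 0.0, 1.0], args=(pts,))
print(sol.x)   # estimated pitch, roll, camera-to-ground height
```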
In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for objects with no manual annotations provided at the training stage, resembling an open-vocabulary scenario. To achieve this goal, we make the following contributions: (i) we start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr, in which candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes; (ii) we combine all available datasets and train with a federated strategy to finetune the CLIP model, aligning the visual representation with attributes; additionally, we investigate the efficacy of leveraging freely available online image-caption pairs under weakly supervised learning; (iii) in pursuit of efficiency, we train a Faster-RCNN-type model end-to-end with knowledge distillation, which performs class-agnostic object proposal and classifies semantic categories and attributes with classifiers generated from a text encoder; finally, (iv) we conduct extensive experiments on the VAW, MS-COCO, LSA, and OVAD datasets and show that recognition of semantic categories and attributes is complementary for visual scene understanding, i.e., jointly training object detection and attribute prediction largely outperforms existing approaches that treat the two tasks independently, demonstrating strong generalization ability to novel attributes and categories.
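The text-encoder-generated classifiers in step (iii) follow the standard CLIP zero-shot pattern, sketched below with OpenAI's CLIP package; the prompt template, attribute vocabulary, and stand-in region feature are placeholders, not the paper's exact setup.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

# Hedged sketch: turn attribute names into classifier weights via the text
# encoder, then score region features by cosine similarity.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

attributes = ["red", "striped", "wooden", "furry"]   # example vocabulary
tokens = clip.tokenize([f"a photo of a {a} object" for a in attributes]).to(device)
with torch.no_grad():
    weights = model.encode_text(tokens).float()
    weights /= weights.norm(dim=-1, keepdim=True)    # (A, 512) classifier weights

region_feat = torch.randn(1, 512, device=device)     # stand-in for an RoI feature
region_feat /= region_feat.norm(dim=-1, keepdim=True)
logits = 100.0 * region_feat @ weights.t()           # attribute scores per region
print(logits.softmax(dim=-1))
```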
Fourier ptychographic microscopy (FPM) can achieve quantitative phase imaging with a large space-bandwidth product by synthesizing a set of low-resolution intensity images captured under angularly varying illuminations. Determining accurate illumination angles is critical because consistency between the actual systematic parameters and those used in the recovery algorithm is essential for high-quality imaging. This paper presents a full-pose-parameter and physics-based method for calibrating illumination angles. Using a physics-based model constructed from general knowledge of the employed microscope and the brightfield-to-darkfield boundaries inside captured images, we can solve for the full-pose parameters of a misplaced LED array, consisting of the distance between the sample and the LED array, two orthogonal lateral shifts, one in-plane rotation angle, and two tilt angles, to correct illumination angles precisely. The feasibility and effectiveness of the proposed method in recovering random or remarkable pose parameters have been demonstrated by both qualitative and quantitative experiments. Owing to the completeness of the pose parameters, the clarity of the physical model, and the high robustness to arbitrary misalignments, our method can significantly facilitate the design, implementation, and application of concise and robust FPM platforms.
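To show how the six pose parameters determine the illumination angles, here is a minimal geometric sketch mapping LED grid indices to angles; the small-angle handling of the two tilts and the direction-cosine angle convention are simplifying assumptions.

```python
import numpy as np

def led_illumination_angles(i, j, pitch=4e-3, height=80e-3,
                            shift=(0.0, 0.0), theta=0.0, tilt=(0.0, 0.0)):
    """Hedged sketch: LED grid indices -> illumination angles under the six
    pose parameters (sample-to-array distance, two lateral shifts, one
    in-plane rotation, two tilts). Small-angle tilts are an assumption."""
    x, y = i * pitch, j * pitch                      # ideal LED position on the array
    c, s = np.cos(theta), np.sin(theta)              # in-plane rotation ...
    x, y = c * x - s * y + shift[0], s * x + c * y + shift[1]  # ... then lateral shift
    z = height + x * np.tan(tilt[0]) + y * np.tan(tilt[1])     # tilts move the LED off-plane
    r = np.sqrt(x**2 + y**2 + z**2)
    return np.arcsin(x / r), np.arcsin(y / r)        # angles setting the k-space shift

print(led_illumination_angles(3, -2))
```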
Fourier ptychography (FP), as a computational imaging method, is a powerful tool for improving imaging resolution. Camera-scanning Fourier ptychography creatively extends the application of FP from microscopic to macroscopic imaging. Due to the non-ideal scanning of the camera driven by the mechanical translation stage, camera pose errors occur and greatly degrade the reconstruction quality, while a precise translation stage is expensive and unsuitable for wide-range imaging. Here, to improve the imaging performance of camera-scanning Fourier ptychography, we propose a pose correction scheme based on camera calibration and homography transform approaches. The scheme realizes accurate alignment of the data set and location error correction in the frequency domain. Simulation and experimental results demonstrate that this method can optimize the reconstruction results and effectively realize high-quality imaging. Combined with a feature recognition algorithm, the scheme opens the possibility of applying FP to remote sensing and space imaging.
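A homography-based alignment step of the kind the scheme describes might look like the following OpenCV sketch, warping each captured frame onto a reference view before the frequency-domain update; the ORB detector, match count, and RANSAC threshold are illustrative choices, not the paper's settings.

```python
import cv2
import numpy as np

def align_to_reference(ref, img):
    """Hedged sketch: estimate a homography from matched features and warp
    the frame onto the reference view to compensate camera pose error."""
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(ref, None)
    k2, d2 = orb.detectAndCompute(img, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    matches = sorted(matches, key=lambda m: m.distance)[:200]  # keep best matches
    src = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)       # robust to outliers
    return cv2.warpPerspective(img, H, (ref.shape[1], ref.shape[0]))
```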
Fourier ptychography has attracted wide attention for its large space-bandwidth product and quantitative phase measurement capability. It is a typical computational imaging technique that optimizes the imaging hardware and reconstruction algorithms simultaneously. Data redundancy and inverse-problem algorithms are the sources of FPM's excellent performance, but this large amount of data processing and these complex algorithms also greatly reduce the imaging speed. In this article, we propose a parallel Fourier ptychography reconstruction framework consisting of three levels of parallel computing parts and implement it on both central processing unit (CPU) and compute unified device architecture (CUDA) platforms. In the conventional FPM reconstruction framework, the sample image is divided into multiple sub-regions that are processed separately, because the illumination angle differs across sub-regions even for the same LED, and different sub-regions have different defocus distances due to the non-planar distribution or non-ideal posture of the biological sample. We first build a parallel computing sub-framework in the spatial domain based on these characteristics. Then, by exploiting the sequential order in which different spectrum regions are updated, a parallel computing sub-framework in the spectrum domain is introduced into our scheme. The feasibility of the proposed parallel FPM reconstruction framework is verified with experimental results acquired on the system we built.
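The spatial-domain sub-framework can be illustrated with tile-level parallelism: each sub-region is recovered independently because it carries its own illumination angles and defocus distance. In the sketch below, a trivial placeholder stands in for the full per-tile FPM recovery, and process-based parallelism stands in for the paper's CPU/CUDA implementations.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def reconstruct_tile(args):
    """Stand-in for one sub-region's FPM recovery; tiles are independent
    because each has its own illumination angles and defocus distance."""
    tile, tile_id = args
    return tile_id, np.abs(np.fft.ifft2(np.fft.fft2(tile)))  # placeholder update

def parallel_fpm(image, tile=256):
    # split the full field of view into independent sub-regions
    tiles = [(image[r:r+tile, c:c+tile], (r, c))
             for r in range(0, image.shape[0], tile)
             for c in range(0, image.shape[1], tile)]
    out = np.empty_like(image)
    with ProcessPoolExecutor() as pool:               # spatial-domain parallelism
        for (r, c), rec in pool.map(reconstruct_tile, tiles):
            out[r:r+rec.shape[0], c:c+rec.shape[1]] = rec
    return out

if __name__ == "__main__":
    print(parallel_fpm(np.random.rand(1024, 1024)).shape)
```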