Abstract: Small Vision-Language Models (SVLMs) are efficient task controllers but often suffer from visual brittleness and poor tool orchestration, and they typically require expensive supervised trajectory tuning to mitigate these deficits. In this work, we propose Self-supervised Perception Enabled by Cascaded Tool Rollout Alignment (SPECTRA), a supervision-free framework that bootstraps agentic capabilities in SVLMs via Coldstart Reinforcement Learning. SPECTRA enforces Soft Structured Multi-turn Rollouts, a topological constraint that directs agents to explicitly sequence tool-derived evidence before synthesis, thereby grounding reasoning in visual observations. We employ a multi-objective reward signal that simultaneously maximizes task correctness, rollout structure, and tool utility, enabling the agent to self-discover robust behaviors without human preference labels. We further introduce Tool Instrumental Utility (TIU), a novel metric that quantifies tool efficacy in the absence of ground truth. Extensive evaluations across composite and out-of-distribution (MMMU-Pro) benchmarks demonstrate that SPECTRA improves agentic trajectories, raising task accuracy by up to 5% and tool efficiency by 9%, and enabling more efficient multimodal agents that learn effectively from environmental interaction alone.
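The multi-objective reward described above can be illustrated with a minimal sketch. The abstract does not say how the three terms are combined, so the weighted sum, the weight values, and the names `RolloutScores` and `combined_reward` below are assumptions made purely for illustration, not SPECTRA's actual formulation.

```python
from dataclasses import dataclass


@dataclass
class RolloutScores:
    """Hypothetical per-rollout scores, each assumed to lie in [0, 1]."""
    correctness: float   # did the final answer solve the task
    structure: float     # how well the rollout followed the expected turn order
    tool_utility: float  # how useful the tool calls were to the final answer


def combined_reward(scores: RolloutScores,
                    weights: tuple[float, float, float] = (1.0, 0.5, 0.5)) -> float:
    """Scalarize the three objectives with an assumed weighted sum."""
    w_correct, w_struct, w_tool = weights
    return (w_correct * scores.correctness
            + w_struct * scores.structure
            + w_tool * scores.tool_utility)


# A correct answer with a half-compliant structure and no useful tool calls.
reward = combined_reward(RolloutScores(correctness=1.0, structure=0.5, tool_utility=0.0))
```

A weighted sum is only one way to combine several objectives; it is used here because it makes the trade-off between the terms explicit through the weights.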
Abstract: Inverse problems in imaging are typically ill-posed and are usually solved with regularized optimization techniques. Appropriate constraints restrict the solution space, making it feasible for a reconstruction algorithm to find a meaningful solution. In recent years, deep-network-based approaches that learn the end-to-end mapping between the raw measurements and the target image have gained popularity. In the learning approach, the functional relationship between the measured raw data and the solution image is learned by training a deep network on prior examples. While this approach significantly increases real-time operational speed, it does not change the nature of the underlying ill-posed inverse problem. It is well known that diverse, non-redundant data obtained via additional measurements can generically improve the robustness of reconstruction algorithms. Multiple measurements, however, typically demand additional hardware and complex system setups, which are undesirable. In this work, we note that in both incoherent and coherent optical imaging, the irradiance patterns corresponding to two phase-diverse measurements of the same test object have an implicit local correlation that may be learned. We then describe a physics-informed data augmentation scheme in which a trained network generates phase-diverse pseudo-data from a ground-truth data frame. The true data together with the augmented pseudo-data are observed to provide high-quality inverse solutions with simpler reconstruction algorithms. We validate this approach for both incoherent and coherent optical imaging (or phase retrieval) configurations, with vortex phase as the diversity mechanism. Our results may open new avenues for leaner, high-fidelity computational imaging systems across a broad range of applications.
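To make the idea of two phase-diverse measurements concrete, the following sketch simulates a pair of irradiance frames for the coherent case, one with a clear pupil and one with a vortex phase. The abstract does not specify the optical layout, so the Fourier-plane mask, the unit topological charge, the random test object, and the function names are all assumptions. The sketch only produces the kind of frame pair a network could be trained on; it does not model the trained network itself.

```python
import numpy as np


def vortex_phase(n: int, charge: int = 1) -> np.ndarray:
    """Spiral phase mask exp(i * charge * phi) on an n x n grid (assumed mask)."""
    coords = np.arange(n) - n // 2
    x, y = np.meshgrid(coords, coords)
    return np.exp(1j * charge * np.arctan2(y, x))


def coherent_irradiance(field: np.ndarray, pupil: np.ndarray) -> np.ndarray:
    """Irradiance recorded after filtering the field's spectrum with a pupil."""
    spectrum = np.fft.fftshift(np.fft.fft2(field))
    image_field = np.fft.ifft2(np.fft.ifftshift(spectrum * pupil))
    return np.abs(image_field) ** 2


n = 64
rng = np.random.default_rng(0)
# Hypothetical complex test object with random amplitude and phase.
amplitude = rng.uniform(0.5, 1.0, size=(n, n))
phase = rng.uniform(0.0, np.pi, size=(n, n))
obj = amplitude * np.exp(1j * phase)

plain_frame = coherent_irradiance(obj, np.ones((n, n)))  # no diversity
vortex_frame = coherent_irradiance(obj, vortex_phase(n))  # vortex diversity
```

Because the assumed mask has unit modulus everywhere, both frames carry the same total energy while distributing it differently, which is what makes the second frame a non-redundant measurement of the same object.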