Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook ($2^{128}$). For the training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. In terms of model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. The SigLu activation not only bounds the encoder output and stabilizes the semantic distillation process but also effectively resolves the optimization conflict between the token entropy loss and the commitment loss. We further propose a three-stage training framework designed to enhance UniWeTok's adaptability across various image resolutions and perception-sensitive scenarios, such as those involving human faces and textual content. On ImageNet, UniWeTok achieves state-of-the-art image generation performance (FID: UniWeTok 1.38 vs. REPA 1.42) while requiring remarkably little training compute (Training Tokens: UniWeTok 33B vs. REPA 262B). On general-domain benchmarks, UniWeTok demonstrates highly competitive capabilities across a broad range of tasks, including multimodal understanding, image generation (DPG Score: UniWeTok 86.63 vs. FLUX.1 [Dev] 83.84), and editing (GEdit Overall Score: UniWeTok 5.09 vs. OmniGen 5.06). We release code and models to facilitate community exploration of unified tokenizers and MLLMs.
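For readers unfamiliar with implicit binary codebooks, the sketch below illustrates one common way a $2^{128}$ codebook can be realized without ever materializing it, via sign-based (lookup-free) quantization of a 128-channel bounded encoder output; the function and tensor names are illustrative assumptions, not UniWeTok's released code.

```python
import torch

def binary_quantize(z: torch.Tensor):
    """Illustrative sign-based quantization for an implicit 2^128 binary codebook.

    z: encoder output of shape (batch, tokens, 128), assumed bounded (e.g., in (-1, 1)).
    Each channel contributes one bit, so the codebook is never materialized:
    the 128-bit sign pattern itself serves as the codeword index.
    """
    codes = torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))  # hard codes in {-1, +1}
    codes = z + (codes - z).detach()            # straight-through estimator keeps encoder gradients
    bits = (codes > 0).to(torch.uint8)          # (B, T, 128) bit pattern acts as the token index
    return codes, bits
```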
Spatial transcriptomics (ST) provides spatially resolved measurements of gene expression, enabling characterization of the molecular landscape of human tissue beyond histological assessment, as well as localized readouts that can be aligned with morphology. Concurrently, the success of multimodal foundation models that integrate vision with complementary modalities suggests that morphomolecular coupling between local expression and morphology can be systematically used to improve histological representations themselves. We introduce Spatial Expression-Aligned Learning (SEAL), a vision-omics self-supervised learning framework that infuses localized molecular information into pathology vision encoders. Rather than training new encoders from scratch, SEAL is designed as a parameter-efficient vision-omics finetuning method that can be flexibly applied to widely used pathology foundation models. We instantiate SEAL by training on over 700,000 paired gene-expression spot and tissue-region examples spanning tumor and normal samples from 14 organs. Tested across 38 slide-level and 15 patch-level downstream tasks, SEAL provides a drop-in replacement for pathology foundation models that consistently improves performance over widely used vision-only and ST prediction baselines on slide-level molecular status, pathway activity, and treatment response prediction, as well as patch-level gene expression prediction tasks. Additionally, SEAL encoders exhibit robust domain generalization on out-of-distribution evaluations and enable new cross-modal capabilities such as gene-to-image retrieval. Our work proposes a general framework for ST-guided finetuning of pathology foundation models, showing that augmenting existing models with localized molecular supervision is an effective and practical step for improving visual representations and expanding their cross-modal utility.
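As a rough illustration of what aligning tissue morphology with localized expression can look like, the sketch below shows a CLIP-style symmetric contrastive objective over matched spot-tissue pairs; the loss form and all names are assumptions for illustration, not SEAL's released implementation.

```python
import torch
import torch.nn.functional as F

def expression_alignment_loss(img_emb, expr_emb, temperature=0.07):
    """Illustrative contrastive alignment of tissue-patch embeddings with paired
    spot-level expression embeddings (one possible reading of vision-omics alignment).

    img_emb, expr_emb: (batch, dim) outputs of an adapter-tuned pathology vision
    encoder and an expression encoder for matched spot-tissue pairs.
    """
    img = F.normalize(img_emb, dim=-1)
    expr = F.normalize(expr_emb, dim=-1)
    logits = img @ expr.t() / temperature                 # pairwise similarities
    targets = torch.arange(len(img), device=img.device)   # matched pairs are positives
    # Symmetric InfoNCE: image-to-expression and expression-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```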
This paper presents a comprehensive cryogenic analog signal processing architecture designed for superconducting qubit control and quantum state readout operating at 4 Kelvin. The proposed system implements a complete bidirectional signal path bridging room-temperature digital controllers with quantum processors at millikelvin stages. The control path incorporates a Phase-Locked Loop (PLL) for stable local oscillator generation, In-phase/Quadrature (I/Q) modulation for precise qubit gate operations, and a cryogenic power amplifier for signal conditioning. The readout path features a Low Noise Amplifier (LNA) with 14 dB gain and 8-Phase Shift Keying (8-PSK) demodulation for quantum state discrimination. All circuit blocks are designed and validated through SPICE simulations employing 180 nm cryogenic MOSFET models that account for carrier freeze-out, threshold voltage elevation, and enhanced mobility at 4 K. Simulation results demonstrate successful end-to-end signal integrity with I/Q phase error below 2°, image rejection ratio exceeding 35 dB, and symbol error rate below $10^{-6}$. This work provides a modular, simulation-validated framework for scalable cryogenic quantum control systems.
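As a behavioral complement to the circuit-level results, the following Monte Carlo sketch estimates the symbol error rate of an ideal baseband 8-PSK link; it is a sanity-check model with assumed parameters, not the paper's SPICE-level simulation.

```python
import numpy as np

def psk8_ser(snr_db: float, n_symbols: int = 200_000, seed: int = 0) -> float:
    """Illustrative baseband 8-PSK Monte Carlo symbol-error-rate estimate."""
    rng = np.random.default_rng(seed)
    symbols = rng.integers(0, 8, n_symbols)
    tx = np.exp(1j * (2 * np.pi * symbols / 8))          # unit-energy 8-PSK constellation
    noise_var = 10 ** (-snr_db / 10)                     # Es/N0 in linear scale
    noise = np.sqrt(noise_var / 2) * (rng.standard_normal(n_symbols)
                                      + 1j * rng.standard_normal(n_symbols))
    rx = tx + noise
    # Nearest-phase decision: quantize the received phase to the closest constellation point.
    detected = np.mod(np.round(np.angle(rx) * 8 / (2 * np.pi)), 8).astype(int)
    return float(np.mean(detected != symbols))
```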
Text-driven image and video editing can be naturally cast as inpainting problems, where masked regions are reconstructed to remain consistent with both the observed content and the editing prompt. Recent advances in test-time guidance for diffusion and flow models provide a principled framework for this task; however, existing methods rely on costly vector-Jacobian product (VJP) computations to approximate the intractable guidance term, limiting their practical applicability. Building upon the recent work of Moufad et al. (2025), we provide theoretical insights into their VJP-free approximation and substantially extend their empirical evaluation to large-scale image and video editing benchmarks. Our results demonstrate that test-time guidance alone can match, and in some cases surpass, the performance of training-based methods.
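To make the notion of a VJP-free approximation concrete, the sketch below shows one common surrogate in which the denoiser's Jacobian is treated as a scaled identity, reducing the guidance term to a masked data-consistency residual; this is a schematic reading of such approximations, not necessarily the exact estimator of Moufad et al. (2025).

```python
import torch

def vjp_free_inpainting_guidance(x0_hat, y, mask, scale=1.0):
    """Illustrative VJP-free guidance term for mask-constrained inpainting.

    The exact guidance requires differentiating through the denoiser (a
    vector-Jacobian product). A common surrogate treats that Jacobian as a
    scaled identity, so the correction collapses to a masked residual.

    x0_hat: current estimate of the clean sample from the diffusion/flow model.
    y:      observed image or video frame.
    mask:   1 where content is observed, 0 where it is being edited.
    """
    residual = mask * (y - x0_hat)      # data-consistency error on observed pixels
    return scale * residual             # added to the sampler update in place of the VJP term
```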
A key question for most applications involving reconfigurable linear wave systems is how accurately a desired linear operator can be realized by configuring the system's tunable elements. The relevance of this question spans from hybrid-MIMO analog combiners via computational meta-imagers to programmable wave-domain signal processing. Yet, no electromagnetically consistent bounds have been derived for the fidelity with which a desired operator can be realized in a real-world reconfigurable wave system. Here, we derive such bounds based on an electromagnetically consistent multiport-network model (capturing mutual coupling between tunable elements) and accounting for real-world hardware constraints (lossy, 1-bit-programmable elements). Specifically, we formulate the operator-synthesis task as a quadratically constrained fractional-quadratic problem and compute rigorous fidelity upper bounds based on semidefinite relaxation. We apply our technique to three distinct experimental setups. The first two setups are, respectively, a free-space and a rich-scattering $4\times 4$ MIMO channel at 2.45 GHz parameterized by a reconfigurable intelligent surface (RIS) comprising 100 1-bit-programmable elements. The third setup is a $4\times 4$ MIMO channel at 19 GHz from four feeds of a dynamic metasurface antenna (DMA) to four users. We systematically study how the achievable fidelity scales with the number of tunable elements, and we probe the tightness of our bounds by searching for optimized configurations that approach them using standard discrete-optimization techniques. We observe a strong influence of the coupling strength between tunable elements on our fidelity bound. For the two RIS-based setups, our bound reveals insufficient wave-domain flexibility for the considered operator-synthesis task.
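The core lifting step behind such semidefinite-relaxation bounds can be sketched as follows for a plain 1-bit quadratic program; the paper's fractional-quadratic fidelity objective can be reduced to problems of this form (e.g., via bisection), but the snippet is a generic illustration with placeholder data, not the authors' formulation.

```python
import cvxpy as cp
import numpy as np

def sdr_upper_bound(A: np.ndarray) -> float:
    """Semidefinite relaxation of max_{x in {-1,+1}^N} x^T A x.

    Lifting X ~ x x^T and dropping the rank-1 constraint yields a convex SDP
    whose optimum rigorously upper-bounds the 1-bit-configurable optimum.
    """
    N = A.shape[0]
    X = cp.Variable((N, N), symmetric=True)      # lifted variable
    constraints = [X >> 0, cp.diag(X) == 1]      # PSD + relaxed binary constraint
    prob = cp.Problem(cp.Maximize(cp.trace(A @ X)), constraints)
    prob.solve()
    return prob.value

# Example with a random symmetric coupling matrix for 10 tunable elements.
rng = np.random.default_rng(0)
M = rng.standard_normal((10, 10))
print(sdr_upper_bound(0.5 * (M + M.T)))
```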
Traditional mechanized chestnut harvesting is too costly for small producers, non-selective, and prone to damaging nuts. Accurate, reliable detection of chestnuts on the orchard floor is crucial for developing low-cost, vision-guided automated harvesting technology. However, developing a reliable chestnut detection system faces challenges in complex environments with shading, varying natural light conditions, and interference from weeds, fallen leaves, stones, and other foreign on-ground objects, challenges that have remained largely unaddressed. This study collected 319 images of chestnuts on the orchard floor, containing 6524 annotated chestnuts. A comprehensive set of 29 state-of-the-art real-time object detectors, including 14 from the YOLO family (v11-v13) and 15 from the RT-DETR family (v1-v4) at varied model scales, was systematically evaluated through replicated modeling experiments for chestnut detection. Experimental results show that the YOLOv12m model achieves the best mAP@0.5 of 95.1% among all the evaluated models, while the RT-DETRv2-R101 was the most accurate RT-DETR variant, with an mAP@0.5 of 91.1%. In terms of mAP@[0.5:0.95], the YOLOv11x model achieved the best accuracy of 80.1%. All models demonstrate significant potential for real-time chestnut detection, and YOLO models outperformed RT-DETR models in both detection accuracy and inference speed, making them better suited for on-board deployment. Both the dataset and software programs in this study have been made publicly available at https://github.com/AgFood-Sensing-and-Intelligence-Lab/ChestnutDetection.
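A minimal training and evaluation sketch using the Ultralytics API is shown below; the dataset YAML, image path, and checkpoint name are placeholders, and the authors' actual data and scripts are available at the linked repository.

```python
from ultralytics import YOLO

# Sketch of training and evaluating a YOLO detector on orchard-floor chestnut images.
model = YOLO("yolo11x.pt")                                  # pretrained checkpoint as a starting point
model.train(data="chestnut.yaml", imgsz=640, epochs=100)    # dataset YAML path is an assumption
metrics = model.val()                                       # reports mAP@0.5 and mAP@[0.5:0.95]
print(metrics.box.map50, metrics.box.map)
results = model.predict("orchard_floor.jpg", conf=0.25)     # single-image inference example
```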
Enhancing the generalization capability of robotic learning to enable robots to operate effectively in diverse, unseen scenes is a fundamental and challenging problem. Existing approaches often depend on pretraining with large-scale data collection, which is labor-intensive and time-consuming, or on semantic data augmentation techniques that necessitate an impractical assumption of flawless upstream object detection in real-world scenarios. In this work, we propose RoboAug, a novel generative data augmentation framework that substantially reduces the reliance on large-scale pretraining and relaxes the assumption of perfect visual recognition by requiring only the bounding-box annotation of a single image during training. Leveraging this minimal information, RoboAug employs pre-trained generative models for precise semantic data augmentation and integrates a plug-and-play region-contrastive loss that helps models focus on task-relevant regions, thereby improving generalization and boosting task success rates. We conduct extensive real-world experiments on three robots, namely UR-5e, AgileX, and Tien Kung 2.0, spanning over 35k rollouts. Empirical results demonstrate that RoboAug significantly outperforms state-of-the-art data augmentation baselines. Specifically, when evaluating generalization in unseen scenes featuring diverse combinations of backgrounds, distractors, and lighting conditions, our method achieves substantial gains over the baseline without augmentation: success rates increase from 0.09 to 0.47 on UR-5e, from 0.16 to 0.60 on AgileX, and from 0.19 to 0.67 on Tien Kung 2.0. These results highlight the superior generalization and effectiveness of RoboAug in real-world manipulation tasks. Our project is available at https://x-roboaug.github.io/.
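The sketch below gives one plausible form of a plug-and-play region-contrastive loss driven by a single bounding-box mask; the exact formulation used by RoboAug may differ, and all names are placeholders.

```python
import torch
import torch.nn.functional as F

def region_contrastive_loss(feat_orig, feat_aug, region_mask, temperature=0.1):
    """Illustrative region-contrastive objective over task-relevant regions.

    feat_orig, feat_aug: (B, C, H, W) feature maps from an original image and its
    generatively augmented counterpart. region_mask: (B, 1, H, W) binary mask of the
    task-relevant region derived from the annotated bounding box. The pooled features
    of the two views are pulled together, with other samples in the batch as negatives.
    """
    def pool(feat, mask):
        masked = (feat * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)
        return F.normalize(masked, dim=-1)

    z1, z2 = pool(feat_orig, region_mask), pool(feat_aug, region_mask)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(len(z1), device=z1.device)
    return F.cross_entropy(logits, targets)
```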
Recent advances in deep learning (DL)-based joint source-channel coding (JSCC) have enabled efficient semantic communication in dynamic wireless environments. Among these approaches, vector quantization (VQ)-based JSCC effectively maps high-dimensional semantic feature vectors into compact codeword indices for digital modulation. However, existing methods, including universal JSCC (uJSCC), rely on fixed, modulation-specific encoders, decoders, and codebooks, limiting adaptability to fine-grained signal-to-noise ratio (SNR) variations. We propose an extended universal JSCC (euJSCC) framework that achieves SNR- and modulation-adaptive transmission within a single model. euJSCC employs a hypernetwork-based normalization layer for fine-grained feature vector normalization and a dynamic codebook generation (DCG) network that refines modulation-specific base codebooks according to block-wise SNR. To handle block fading channels, which consist of multiple coherence blocks, an inner-outer encoder-decoder architecture is adopted, where the outer encoder and decoder capture long-term channel statistics, and the inner encoder and decoder refine feature vectors to align with block-wise codebooks. A two-phase training strategy, i.e., pretraining on AWGN channels followed by finetuning on block fading channels, ensures stable convergence. Experiments on image transmission demonstrate that euJSCC consistently outperforms state-of-the-art channel-adaptive digital JSCC schemes under both block fading and AWGN channels.
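For intuition, the following sketch shows the basic VQ step that maps semantic feature vectors to codeword indices for digital modulation; it illustrates the general mechanism only and is not the euJSCC implementation, whose codebooks are additionally refined per block-wise SNR.

```python
import torch

def vq_map_to_indices(features, codebook):
    """Illustrative vector-quantization step in a VQ-based JSCC pipeline.

    features: (B, N, D) semantic feature vectors from the encoder.
    codebook: (K, D) codewords, with K matched to the modulation order
    (e.g., K = 2^b for b bits carried per transmitted index).
    Each feature vector is mapped to its nearest codeword; the index is what
    the digital modulation carries, and the receiver looks the codeword back up.
    """
    expanded = codebook.unsqueeze(0).expand(features.size(0), -1, -1)
    dists = torch.cdist(features, expanded)          # (B, N, K) pairwise distances
    indices = dists.argmin(dim=-1)                   # (B, N) codeword indices to transmit
    quantized = codebook[indices]                    # receiver-side reconstruction lookup
    # Straight-through estimator so the encoder still receives gradients during training.
    quantized = features + (quantized - features).detach()
    return indices, quantized
```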
Quasi-static human activities such as lying, standing or sitting produce very low Doppler shifts and highly spread radar signatures, making them difficult to detect with conventional constant false-alarm rate (CFAR) detectors tuned for point targets. Moreover, privacy concerns and low lighting conditions limit the use of cameras in long-term care (LTC) facilities. This paper proposes a lightweight, non-visual image-based method for robust quasi-static human presence detection using a low-cost 60 GHz FMCW radar. On a dataset covering five quasi-static activities, the proposed method improves average detection accuracy over Cell-Averaging CFAR (CA-CFAR) and Order-Statistics CFAR (OS-CFAR), respectively, from 68.3% and 78.8% to 93.24% for Subject 1, from 51.3% and 68.3% to 92.3% for Subject 2, and from 57.72% and 69.94% to 94.82% for Subject 3. Finally, we benchmarked all three detectors across all activities on a Raspberry Pi 4B using a shared Range-Angle (RA) preprocessing pipeline. The proposed algorithm runs in 8.2 ms per frame on average, corresponding to over 120 frames per second (FPS) and a 74-fold speed-up over OS-CFAR. These results demonstrate that simple image-based processing can provide robust and deployable quasi-static human sensing in cluttered indoor environments.
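For reference, a minimal 2-D cell-averaging CFAR baseline over a range-angle power map can be sketched as follows; window sizes and the false-alarm rate are example values rather than the paper's settings.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ca_cfar_2d(power_map, guard=2, train=8, pfa=1e-3):
    """Illustrative 2-D cell-averaging CFAR on a range-angle power map.

    For each cell, the noise level is estimated from surrounding training cells
    (excluding a guard band), and a threshold scaled for the target false-alarm
    probability is applied.
    """
    win = 2 * (guard + train) + 1
    inner = 2 * guard + 1
    # Window sums via mean filters: full window minus the guard region gives training cells only.
    total = uniform_filter(power_map, size=win, mode="reflect") * win**2
    guard_sum = uniform_filter(power_map, size=inner, mode="reflect") * inner**2
    n_train = win**2 - inner**2
    noise = (total - guard_sum) / n_train
    alpha = n_train * (pfa ** (-1.0 / n_train) - 1.0)   # CA-CFAR threshold factor
    return power_map > alpha * noise                    # boolean detection map
```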
Personalized text-to-image generation aims to integrate specific identities into arbitrary contexts. However, existing tuning-free methods typically employ Spatially Uniform Visual Injection, causing identity features to contaminate non-facial regions (e.g., backgrounds and lighting) and degrading text adherence. To address this without expensive fine-tuning, we propose SpatialID, a training-free, spatially adaptive identity modulation framework. SpatialID fundamentally decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor derived from cross-attention responses. Furthermore, we introduce a Temporal-Spatial Scheduling strategy that dynamically adjusts the spatial constraints, transitioning from Gaussian priors to attention-based masks and then to adaptive relaxation, so that they align with the dynamics of diffusion generation. Extensive experiments on IBench demonstrate that SpatialID achieves state-of-the-art performance in text adherence (CLIP-T: 0.281), visual consistency (CLIP-I: 0.827), and image quality (IQ: 0.523), largely eliminating background contamination while maintaining robust identity preservation.
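One plausible way to obtain such a face-relevant mask from cross-attention is sketched below; the token selection, thresholding, and smoothing choices are assumptions rather than the authors' exact Spatial Mask Extractor.

```python
import torch
import torch.nn.functional as F

def face_mask_from_cross_attention(attn_maps, face_token_ids, latent_hw, threshold=0.5):
    """Illustrative extraction of a face-relevant spatial mask from cross-attention.

    attn_maps: (heads, query_len, text_len) cross-attention probabilities from a
    diffusion backbone block, with query_len equal to H * W of the latent grid.
    face_token_ids: indices of identity-related text tokens.
    Returns an (H, W) mask used to gate identity injection to face-relevant regions.
    """
    h, w = latent_hw
    # Average attention over heads and over the identity-related text tokens.
    face_attn = attn_maps[:, :, face_token_ids].mean(dim=(0, 2))      # (query_len,)
    mask = face_attn.reshape(h, w)
    mask = (mask - mask.min()) / (mask.max() - mask.min() + 1e-6)     # normalize to [0, 1]
    mask = (mask > threshold).float()
    # Light smoothing so the binary mask does not create hard seams at injection boundaries.
    mask = F.avg_pool2d(mask[None, None], kernel_size=3, stride=1, padding=1)[0, 0]
    return mask
```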