Abstract:Side-scan sonar (SSS) mine classification is a challenging maritime vision problem characterized by extreme data scarcity and a large domain gap from natural images. While self-supervised learning (SSL) and general-purpose vision foundation models have shown strong performance in general vision and several specialized domains, their use in SSS remains largely unexplored. We present Mine-JEPA, the first in-domain SSL pipeline for SSS mine classification, using SIGReg, a regularization-based SSL loss, to pretrain on only 1,170 unlabeled sonar images. In the binary mine vs. non-mine setting, Mine-JEPA achieves an F1 score of 0.935, outperforming fine-tuned DINOv3 (0.922), a foundation model pretrained on 1.7B images. For 3-class mine-like object classification, Mine-JEPA reaches 0.820 with synthetic data augmentation, again outperforming fine-tuned DINOv3 (0.810). We further observe that applying in-domain SSL to foundation models degrades performance by 10--13 percentage points, suggesting that stronger pretrained models do not always benefit from additional domain adaptation. In addition, Mine-JEPA with a compact ViT-Tiny backbone achieves competitive performance while using 4x fewer parameters than DINOv3. These results suggest that carefully designed in-domain self-supervised learning is a viable alternative to much larger foundation models in data-scarce maritime sonar imagery.
Abstract:As large language models are deployed in high-stakes enterprise applications, from healthcare to finance, ensuring adherence to organization-specific policies has become essential. Yet existing safety evaluations focus exclusively on universal harms. We present COMPASS (Company/Organization Policy Alignment Assessment), the first systematic framework for evaluating whether LLMs comply with organizational allowlist and denylist policies. We apply COMPASS to eight diverse industry scenarios, generating and validating 5,920 queries that test both routine compliance and adversarial robustness through strategically designed edge cases. Evaluating seven state-of-the-art models, we uncover a fundamental asymmetry: models reliably handle legitimate requests (>95% accuracy) but catastrophically fail at enforcing prohibitions, refusing only 13-40% of adversarial denylist violations. These results demonstrate that current LLMs lack the robustness required for policy-critical deployments, establishing COMPASS as an essential evaluation framework for organizational AI safety.
Abstract:Fast Automatic Speech Recognition (ASR) is critical for latency-sensitive applications such as real-time captioning and meeting transcription. However, truly parallel ASR decoding remains challenging due to the sequential nature of autoregressive (AR) decoders and the context limitations of non-autoregressive (NAR) methods. While modern ASR encoders can process up to 30 seconds of audio at once, AR decoders still generate tokens sequentially, creating a latency bottleneck. We propose Whisfusion, the first framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. This NAR architecture resolves the AR latency bottleneck by processing the entire acoustic context in parallel at every decoding step. A lightweight cross-attention adapter trained via parameter-efficient fine-tuning (PEFT) bridges the two modalities. We also introduce a batch-parallel, multi-step decoding strategy that improves accuracy by increasing the number of candidates with minimal impact on speed. Fine-tuned solely on LibriSpeech (960h), Whisfusion achieves a lower WER than Whisper-tiny (8.3% vs. 9.7%), and offers comparable latency on short audio. For longer utterances (>20s), it is up to 2.6x faster than the AR baseline, establishing a new, efficient operating point for long-form ASR. The implementation and training scripts are available at https://github.com/taeyoun811/Whisfusion.