Abstract:Point cloud surface reconstruction has improved in accuracy with advances in deep learning, enabling applications such as infrastructure inspection. Recent approaches that reconstruct from small local regions rather than entire point clouds have attracted attention for their strong generalization capability. However, prior work typically places local regions uniformly and keeps their size fixed, limiting adaptability to variations in geometric complexity. In this study, we propose a method that improves reconstruction accuracy and efficiency by adaptively modulating the spacing and size of local regions based on the curvature of the input point cloud.
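A minimal sketch of one way such curvature-adaptive sizing could be realized (not the paper's implementation): per-point curvature is estimated from the PCA of k-nearest-neighbor covariances, and the local-region radius is mapped from it. The parameters `k`, `r_min`, and `r_max` are assumptions introduced for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def adaptive_region_radii(points, k=16, r_min=0.02, r_max=0.10):
    """points: (N, 3) array. Returns per-point local-region radii in [r_min, r_max]."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)                # k nearest neighbors per point
    neigh = points[idx]                             # (N, k, 3)
    centered = neigh - neigh.mean(axis=1, keepdims=True)
    # Neighborhood covariance; smallest eigenvalue ratio is a standard curvature proxy.
    cov = np.einsum('nki,nkj->nij', centered, centered) / k
    eigvals = np.linalg.eigvalsh(cov)               # ascending eigenvalues
    curvature = eigvals[:, 0] / (eigvals.sum(axis=1) + 1e-12)
    # High curvature -> smaller, denser local regions; flat areas -> larger regions.
    c = (curvature - curvature.min()) / (curvature.max() - curvature.min() + 1e-12)
    return r_max - c * (r_max - r_min)
```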
Abstract:Image Coding for Machines (ICM) has become increasingly important with the rapid integration of computer vision into real-world applications. However, most ICM frameworks utilize learned image compression (LIC) models that operate at a fixed rate and require separate training for each target bitrate, which limits their practical applicability. Existing variable rate LIC approaches mitigate this limitation but typically require additional training, increasing computational cost and deployment complexity. Moreover, variable rate control has not been thoroughly explored for ICM. To address these challenges, we propose a training-free, adaptive quantization step size control scheme that enables flexible bitrate adjustment. By leveraging both channel-wise entropy dependencies and the spatial scale parameters predicted by the hyperprior network, the proposed method preserves semantically important regions while coarsely quantizing less critical areas. The bitrate can be continuously controlled through a single parameter. Experimental results demonstrate the effectiveness of our proposed method, achieving up to 11.07% BD-rate savings over the non-adaptive variable rate method.
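A minimal sketch of the adaptive-step idea, under assumptions rather than the paper's exact rule: the hyperprior's scale map `sigma` is used to modulate a global quantization step so that high-scale (informative) regions are quantized more finely. `base_step` plays the role of the single control parameter; `alpha` and the mapping are illustrative choices.

```python
import torch

def adaptive_quantize(y, sigma, base_step=1.0, alpha=0.5):
    """y: latent (B, C, H, W); sigma: hyperprior-predicted scales of the same shape."""
    # Regions with larger predicted scale carry more information, so they
    # receive a finer step; low-scale regions are quantized more coarsely.
    importance = sigma / (sigma.amax(dim=(2, 3), keepdim=True) + 1e-9)
    step = base_step * (1.0 + alpha * (1.0 - importance))
    y_hat = torch.round(y / step) * step
    return y_hat, step
```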
Abstract:Recently, software has been developed that uses machine learning to mimic the style of a particular composer, such as J. S. Bach. However, because such software often adopts machine learning models with complex structures, it is difficult to analyze how the software captures the characteristics of the composer's music. In this study, we trained a restricted Boltzmann machine (RBM) on J. S. Bach's music. Because the structure of an RBM is simple, its internal states can be inspected after training. We found that the trained RBM is able to compose music.
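For reference, a minimal sketch of a binary RBM update with one-step contrastive divergence (CD-1); the data encoding (e.g., piano-roll slices) and hyperparameters are assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_v, b_h, lr=0.01):
    """v0: (B, n_visible) binary batch; W: (n_visible, n_hidden) weights."""
    p_h0 = sigmoid(v0 @ W + b_h)                     # hidden activation probabilities
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + b_v)                   # one-step reconstruction
    p_h1 = sigmoid(p_v1 @ W + b_h)
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h
```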
Abstract:Bounding box regression (BBR) is fundamental to object detection, where the regression loss is crucial for accurate localization. Existing IoU-based losses often incorporate handcrafted geometric penalties to address IoU's non-differentiability in non-overlapping cases and enhance BBR performance. However, these penalties are sensitive to box shape, size, and distribution, often leading to suboptimal optimization for small objects and undesired behaviors such as bounding box enlargement due to misalignment with the IoU objective. To address these limitations, we propose InterpIoU, a novel loss function that replaces handcrafted geometric penalties with a term based on the IoU between interpolated boxes and the target. By using interpolated boxes to bridge the gap between predictions and ground truth, InterpIoU provides meaningful gradients in non-overlapping cases and inherently avoids the box enlargement issue caused by misaligned penalties. Simulation results further show that IoU itself serves as an ideal regression target, while existing geometric penalties are both unnecessary and suboptimal. Building on InterpIoU, we introduce Dynamic InterpIoU, which dynamically adjusts interpolation coefficients based on IoU values, enhancing adaptability to scenarios with diverse object distributions. Experiments on COCO, VisDrone, and PASCAL VOC show that our methods consistently outperform state-of-the-art IoU-based losses across various detection frameworks, with particularly notable improvements in small object detection, confirming their effectiveness.
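A minimal sketch of the interpolation idea described above (box format and the fixed coefficient are assumptions for illustration): the prediction is linearly interpolated toward the target, and the IoU between the interpolated box and the target provides the loss signal even when the original boxes do not overlap.

```python
import torch

def box_iou(a, b, eps=1e-7):
    """a, b: (N, 4) boxes in (x1, y1, x2, y2) format."""
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + eps)

def interp_iou_loss(pred, target, t=0.5):
    """Interpolate pred toward target with coefficient t, then penalize 1 - IoU."""
    interp = pred + t * (target - pred)   # interpolated box bridges the gap
    return (1.0 - box_iou(interp, target)).mean()
```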
Abstract:Scalable image compression is a technique that progressively reconstructs multiple versions of an image for different requirements. In recent years, images have increasingly been consumed not only by humans but also by image recognition models. This shift has drawn growing attention to scalable image compression methods that serve both machine and human vision (ICMH). Many existing models employ neural network-based codecs, known as learned image compression, and have made significant strides in this field by carefully designing the loss functions. In some cases, however, these models rely too heavily on their learning capacity, while their architectural design receives insufficient consideration. In this paper, we enhance the coding efficiency and interpretability of the ICMH framework by integrating an explicit residual compression mechanism, which is commonly employed in resolution-scalable coding methods such as JPEG2000. Specifically, we propose two complementary methods: Feature Residual-based Scalable Coding (FR-ICMH) and Pixel Residual-based Scalable Coding (PR-ICMH). The proposed methods are applicable to various machine vision tasks. Moreover, they provide the flexibility to choose between encoder complexity and compression performance, making them adaptable to diverse application requirements. Experimental results demonstrate the effectiveness of our proposed methods, with PR-ICMH achieving up to 29.57% BD-rate savings over the previous work.
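A minimal sketch of pixel-residual scalable coding under assumed interfaces: `base_codec` and `enh_codec` are placeholders, not the paper's modules. The base layer targets machine vision, and the enhancement layer explicitly codes the pixel residual so that the human-oriented image is the base reconstruction plus the decoded residual.

```python
import torch

def pr_scalable_encode(x, base_codec, enh_codec):
    base_bits, x_base = base_codec(x)          # base layer: machine-vision reconstruction
    residual = x - x_base                      # explicit pixel residual
    enh_bits, res_hat = enh_codec(residual)    # enhancement layer codes the residual
    x_human = x_base + res_hat                 # human-vision reconstruction
    return base_bits, enh_bits, x_base, x_human
```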
Abstract:Implicit Neural Representations (INRs) have attracted significant interest for their ability to model complex signals by mapping spatial and temporal coordinates to signal values. In the context of neural video representation, several decoding strategies have been explored to balance compactness and reconstruction quality, including pixel-wise, frame-wise, and patch-wise methods. Patch-wise decoding aims to combine the flexibility of pixel-based models with the efficiency of frame-based approaches. However, conventional uniform patch division often leads to discontinuities at patch boundaries, as independently reconstructed regions may fail to form a coherent global structure. To address this limitation, we propose a neural video representation method based on Structure-Preserving Patches (SPPs). Our approach rearranges each frame into a set of spatially structured patch frames using a PixelUnshuffle-like operation. This rearrangement maintains the spatial coherence of the original frame while enabling patch-level decoding. The network learns to predict these rearranged patch frames, which supports a global-to-local fitting strategy and mitigates degradation caused by upsampling. Experiments on standard video datasets show that the proposed method improves reconstruction quality and compression performance compared to existing INR-based video representation methods.
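A minimal sketch of the patch rearrangement (the factor `r` is an assumption): PixelUnshuffle turns each frame into r*r spatially coherent sub-frames, which can be predicted patch-wise, and PixelShuffle inverts the rearrangement at reconstruction time.

```python
import torch
import torch.nn as nn

r = 2
unshuffle = nn.PixelUnshuffle(r)   # (B, C, H, W) -> (B, C*r*r, H/r, W/r)
shuffle = nn.PixelShuffle(r)       # inverse mapping

frame = torch.rand(1, 3, 256, 256)
patch_frames = unshuffle(frame)            # 12 channels of 128x128 structured sub-frames
reconstructed = shuffle(patch_frames)      # recovers the original spatial layout
assert torch.allclose(frame, reconstructed)
```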
Abstract:Implicit Neural Representations (INRs) have garnered significant attention for their ability to model complex signals across a variety of domains. Recently, INR-based approaches have emerged as promising frameworks for neural video compression. While conventional methods primarily focus on embedding video content into compact neural networks for efficient representation, they often struggle to reconstruct high-frequency details under stringent model size constraints, which are critical in practical compression scenarios. To address this limitation, we propose an INR-based video representation method that integrates a general-purpose super-resolution (SR) network. Motivated by the observation that high-frequency components exhibit low temporal redundancy across frames, our method entrusts the reconstruction of fine details to the SR network. Experimental results demonstrate that the proposed method outperforms conventional INR-based baselines in terms of reconstruction quality, while maintaining comparable model sizes.
Abstract:Video generation models have advanced significantly in recent years. For video frame interpolation, we adopt a pre-trained large-scale image-to-video diffusion model. To enable this adaptation, we propose a conditional encoder, a simple yet effective trainable module. We extract spatial and temporal features from the first and last frames and feed them into the conditional encoder, whose computed features guide the video diffusion model in generating keyframe-guided video sequences. Our method outperforms previous deterministic approaches on the Fréchet Video Distance (FVD) metric in large-motion cases, highlighting the advances of generative-based methodologies.
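A minimal sketch of such a conditional encoder; the architecture, channel widths, and downsampling factors are assumptions introduced for illustration, not the paper's design. The first and last frames are concatenated and encoded into a guidance feature that conditions the (frozen) image-to-video diffusion model.

```python
import torch
import torch.nn as nn

class ConditionalEncoder(nn.Module):
    def __init__(self, in_ch=3, feat_ch=320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, feat_ch, 3, stride=2, padding=1),
        )

    def forward(self, first_frame, last_frame):
        # Concatenate the keyframes so spatial and temporal cues are encoded jointly.
        x = torch.cat([first_frame, last_frame], dim=1)
        return self.net(x)   # guidance features for the diffusion model
```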
Abstract:Currently, high-fidelity text-to-image models are being developed at an accelerating pace. Among them, diffusion models have led to a remarkable improvement in the quality of image generation, making it very challenging to distinguish between real and synthesized images. This simultaneously raises serious concerns regarding privacy and security. Some methods have been proposed to detect diffusion-generated images through reconstruction. However, the inversion and denoising processes are time-consuming and heavily reliant on the pre-trained generative model. Consequently, when the pre-trained generative model encounters out-of-domain data, detection performance declines. To address this issue, we propose Time Step Generating (TSG), a universal synthetic image detector that does not rely on a pre-trained model's reconstruction ability, specific datasets, or sampling algorithms. Our method uses a pre-trained diffusion model's network as a feature extractor to capture fine-grained details, focusing on the subtle differences between real and synthetic images. By controlling the time step t of the network input, we can effectively extract these distinguishing detail features. These features are then passed to a classifier (e.g., a ResNet), which efficiently detects whether an image is synthetic or real. We evaluate the proposed TSG on the large-scale GenImage benchmark, where it achieves significant improvements in both accuracy and generalizability.
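A minimal sketch of the time-step feature idea under assumed interfaces: `eps_net(x, t)` stands in for a pre-trained diffusion noise-prediction network (a placeholder, not a specific library call), and its output at a chosen time step t is used as a detail-sensitive feature map for a standard classifier.

```python
import torch
import torchvision

def tsg_features(image, eps_net, t=50):
    """image: (B, 3, H, W), normalized to the diffusion model's input range."""
    timestep = torch.full((image.shape[0],), t, dtype=torch.long, device=image.device)
    with torch.no_grad():
        feat = eps_net(image, timestep)   # predicted noise serves as the feature map
    return feat

# A standard classifier (e.g., a ResNet) then decides real vs. synthetic.
classifier = torchvision.models.resnet50(num_classes=2)
```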
Abstract:Sign language is a visual language expressed through hand movements and non-manual markers. Non-manual markers include facial expressions and head movements, and these expressions vary across nations; therefore, analysis methods specialized for each sign language are necessary. However, research on Japanese Sign Language (JSL) recognition is limited due to a lack of datasets. The development of recognition models that consider both manual and non-manual features of JSL is crucial for precise and smooth communication with deaf individuals. In JSL, sentence types such as affirmative statements and questions are distinguished by facial expressions. In this paper, we propose a JSL recognition method that focuses on facial expressions. Our method utilizes a neural network to analyze facial features and classify sentence types. In experiments, our method achieves a classification accuracy of 96.05%, confirming its effectiveness.
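A minimal sketch of one possible classifier for this setting; the landmark count, frame count, class count, and architecture are assumptions introduced for illustration, not the paper's model. Per-frame facial landmark coordinates are flattened and mapped to a sentence-type label (e.g., affirmative vs. question).

```python
import torch
import torch.nn as nn

class SentenceTypeClassifier(nn.Module):
    def __init__(self, n_landmarks=68, n_frames=30, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_landmarks * 2 * n_frames, 256), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, n_classes),
        )

    def forward(self, landmarks):
        # landmarks: (B, n_frames, n_landmarks, 2) facial landmark coordinates
        return self.net(landmarks)
```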