Abstract:Feedforward models for novel view synthesis (NVS) have recently advanced by transformer-based methods like LVSM, using attention among all input and target views. In this work, we argue that its full self-attention design is suboptimal, suffering from quadratic complexity with respect to the number of input views and rigid parameter sharing among heterogeneous tokens. We propose Efficient-LVSM, a dual-stream architecture that avoids these issues with a decoupled co-refinement mechanism. It applies intra-view self-attention for input views and self-then-cross attention for target views, eliminating unnecessary computation. Efficient-LVSM achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference speed. Efficient-LVSM achieves state-of-the-art performance on multiple benchmarks, exhibits strong zero-shot generalization to unseen view counts, and enables incremental inference with KV-cache, thanks to its decoupled designs.
Abstract:SSF3D modified the semi-supervised 3D object detection (SS3DOD) framework, which designed specifically for point cloud data. Leveraging the characteristics of non-coincidence and weak correlation of target objects in point cloud, we adopt a strategy of retaining only the truth-determining pseudo labels and trimming the other fuzzy labels with points, instead of pursuing a balance between the quantity and quality of pseudo labels. Besides, we notice that changing the filter will make the model meet different distributed targets, which is beneficial to break the training bottleneck. Two mechanism are introduced to achieve above ideas: strict threshold and filter switching. The experiments are conducted to analyze the effectiveness of above approaches and their impact on the overall performance of the system. Evaluating on the KITTI dataset, SSF3D exhibits superior performance compared to the current state-of-the-art methods. The code will be released here.