Abstract:Compositional generalization has achieved substantial progress in computer vision on pre-collected training data. Nonetheless, real-world data continually emerges, with possible compositions being nearly infinite, long-tailed, and not entirely visible. Thus, an ideal model is supposed to gradually improve the capability of compositional generalization in an incremental manner. In this paper, we explore Composition-Incremental Learning for Compositional Generalization (CompIL) in the context of the compositional zero-shot learning (CZSL) task, where models need to continually learn new compositions, intending to improve their compositional generalization capability progressively. To quantitatively evaluate CompIL, we develop a benchmark construction pipeline leveraging existing datasets, yielding MIT-States-CompIL and C-GQA-CompIL. Furthermore, we propose a pseudo-replay framework utilizing a visual synthesizer to synthesize visual representations of learned compositions and a linguistic primitive distillation mechanism to maintain aligned primitive representations across the learning process. Extensive experiments demonstrate the effectiveness of the proposed framework.
Abstract:Neural 3D scene reconstruction methods have achieved impressive performance when reconstructing complex geometry and low-textured regions in indoor scenes. However, these methods heavily rely on 3D data which is costly and time-consuming to obtain in real world. In this paper, we propose a novel neural reconstruction method that reconstructs scenes using sparse depth under the plane constraints without 3D supervision. We introduce a signed distance function field, a color field, and a probability field to represent a scene. We optimize these fields to reconstruct the scene by using differentiable ray marching with accessible 2D images as supervision. We improve the reconstruction quality of complex geometry scene regions with sparse depth obtained by using the geometric constraints. The geometric constraints project 3D points on the surface to similar-looking regions with similar features in different 2D images. We impose the plane constraints to make large planes parallel or vertical to the indoor floor. Both two constraints help reconstruct accurate and smooth geometry structures of the scene. Without 3D supervision, our method achieves competitive performance compared with existing methods that use 3D supervision on the ScanNet dataset.