Abstract:The ambiguity at the boundaries of different semantic classes in point cloud semantic segmentation often leads to incorrect decisions in intelligent perception systems, such as autonomous driving. Hence, accurate delineation of the boundaries is crucial for improving safety in autonomous driving. A novel spatial inter-correlation enhancement and spatially-embedded feature fusion network (SIESEF-FusionNet) is proposed in this paper, enhancing spatial inter-correlation by combining inverse distance weighting and angular compensation to extract more beneficial spatial information without causing redundancy. Meanwhile, a new spatial adaptive pooling module is also designed, embedding enhanced spatial information into semantic features for strengthening the context-awareness of semantic features. Experimental results demonstrate that 83.7% mIoU and 97.8% OA are achieved by SIESEF-FusionNet on the Toronto3D dataset, with performance superior to other baseline methods. A value of 61.1% mIoU is reached on the semanticKITTI dataset, where a marked improvement in segmentation performance is observed. In addition, the effectiveness and plug-and-play capability of the proposed modules are further verified through ablation studies.
Abstract:We propose Scalable Mechanistic Neural Network (S-MNN), an enhanced neural network framework designed for scientific machine learning applications involving long temporal sequences. By reformulating the original Mechanistic Neural Network (MNN) (Pervez et al., 2024), we reduce the computational time and space complexities from cubic and quadratic with respect to the sequence length, respectively, to linear. This significant improvement enables efficient modeling of long-term dynamics without sacrificing accuracy or interpretability. Extensive experiments demonstrate that S-MNN matches the original MNN in precision while substantially reducing computational resources. Consequently, S-MNN can drop-in replace the original MNN in applications, providing a practical and efficient tool for integrating mechanistic bottlenecks into neural network models of complex dynamical systems.
Abstract:The historical interaction sequences of users plays a crucial role in training recommender systems that can accurately predict user preferences. However, due to the arbitrariness of user behavior, the presence of noise in these sequences poses a challenge to predicting their next actions in recommender systems. To address this issue, our motivation is based on the observation that training noisy sequences and clean sequences (sequences without noise) with equal weights can impact the performance of the model. We propose a novel self-supervised Auxiliary Task Joint Training (ATJT) method aimed at more accurately reweighting noisy sequences in recommender systems. Specifically, we strategically select subsets from users' original sequences and perform random replacements to generate artificially replaced noisy sequences. Subsequently, we perform joint training on these artificially replaced noisy sequences and the original sequences. Through effective reweighting, we incorporate the training results of the noise recognition model into the recommender model. We evaluate our method on three datasets using a consistent base model. Experimental results demonstrate the effectiveness of introducing self-supervised auxiliary task to enhance the base model's performance.
Abstract:As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains open whether speedups are achievable also in \emph{batched} settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound, while supporting the substantially increased compute requirements of batched workloads. This paper resolves this question positively by describing the design of Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batchsizes up to 16-32 can be supported with close to maximum ($4\times$) quantization speedup, and larger batchsizes up to 64-128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN's near-optimal performance on individual LLM layers across different scenarios can also lead to end-to-end LLM inference speedups (of up to $2.8\times$) when integrated with the popular vLLM serving engine. Finally, MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.
Abstract:We introduce a novel method for acquiring boundary representations (B-Reps) of 3D CAD models which involves a two-step process: it first applies a spatial partitioning, referred to as the ``split``, followed by a ``fit`` operation to derive a single primitive within each partition. Specifically, our partitioning aims to produce the classical Voronoi diagram of the set of ground-truth (GT) B-Rep primitives. In contrast to prior B-Rep constructions which were bottom-up, either via direct primitive fitting or point clustering, our Split-and-Fit approach is top-down and structure-aware, since a Voronoi partition explicitly reveals both the number of and the connections between the primitives. We design a neural network to predict the Voronoi diagram from an input point cloud or distance field via a binary classification. We show that our network, coined NVD-Net for neural Voronoi diagrams, can effectively learn Voronoi partitions for CAD models from training data and exhibits superior generalization capabilities. Extensive experiments and evaluation demonstrate that the resulting B-Reps, consisting of parametric surfaces, curves, and vertices, are more plausible than those obtained by existing alternatives, with significant improvements in reconstruction quality. Code will be released on https://github.com/yilinliu77/NVDNet.
Abstract:The cold-start problem is a common challenge for most recommender systems. With extremely limited interactions of cold-start users, conventional recommender models often struggle to generate embeddings with sufficient expressivity. Moreover, the absence of auxiliary content information of users exacerbates the presence of challenges, rendering most cold-start methods difficult to apply. To address this issue, our motivation is based on the observation that if a model can generate expressive embeddings for existing users with relatively more interactions, who were also initially cold-start users, then we can establish a mapping from few initial interactions to expressive embeddings, simulating the process of generating embeddings for cold-start users. Based on this motivation, we propose a Variational Mapping approach for cold-start user Recommendation (VM-Rec). Firstly, we generate a personalized mapping function for cold-start users based on their initial interactions, and parameters of the function are generated from a variational distribution. For the sake of interpretability and computational efficiency, we model the personalized mapping function as a sparse linear model, where each parameter indicates the association to a specific existing user. Consequently, we use this mapping function to map the embeddings of existing users to an embedding of the cold-start user in the same space. The resulting embedding has similar expressivity to that of existing users and can be directly integrated into a pre-trained recommender model to predict click through rates or ranking scores. We evaluate our method based on three widely used recommender models as pre-trained base recommender models, outperforming four popular cold-start methods on two datasets under the same base model.
Abstract:The retrieval phase is a vital component in recommendation systems, requiring the model to be effective and efficient. Recently, generative retrieval has become an emerging paradigm for document retrieval, showing notable performance. These methods enjoy merits like being end-to-end differentiable, suggesting their viability in recommendation. However, these methods fall short in efficiency and effectiveness for large-scale recommendations. To obtain efficiency and effectiveness, this paper introduces a generative retrieval framework, namely SEATER, which learns SEmAntic Tree-structured item identifiERs via contrastive learning. Specifically, we employ an encoder-decoder model to extract user interests from historical behaviors and retrieve candidates via tree-structured item identifiers. SEATER devises a balanced k-ary tree structure of item identifiers, allocating semantic space to each token individually. This strategy maintains semantic consistency within the same level, while distinct levels correlate to varying semantic granularities. This structure also maintains consistent and fast inference speed for all items. Considering the tree structure, SEATER learns identifier tokens' semantics, hierarchical relationships, and inter-token dependencies. To achieve this, we incorporate two contrastive learning tasks with the generation task to optimize both the model and identifiers. The infoNCE loss aligns the token embeddings based on their hierarchical positions. The triplet loss ranks similar identifiers in desired orders. In this way, SEATER achieves both efficiency and effectiveness. Extensive experiments on three public datasets and an industrial dataset have demonstrated that SEATER outperforms state-of-the-art models significantly.
Abstract:Accurate depth estimation under out-of-distribution (OoD) scenarios, such as adverse weather conditions, sensor failure, and noise contamination, is desirable for safety-critical applications. Existing depth estimation systems, however, suffer inevitably from real-world corruptions and perturbations and are struggled to provide reliable depth predictions under such cases. In this paper, we summarize the winning solutions from the RoboDepth Challenge -- an academic competition designed to facilitate and advance robust OoD depth estimation. This challenge was developed based on the newly established KITTI-C and NYUDepth2-C benchmarks. We hosted two stand-alone tracks, with an emphasis on robust self-supervised and robust fully-supervised depth estimation, respectively. Out of more than two hundred participants, nine unique and top-performing solutions have appeared, with novel designs ranging from the following aspects: spatial- and frequency-domain augmentations, masked image modeling, image restoration and super-resolution, adversarial training, diffusion-based noise suppression, vision-language pre-training, learned model ensembling, and hierarchical feature enhancement. Extensive experimental analyses along with insightful observations are drawn to better understand the rationale behind each design. We hope this challenge could lay a solid foundation for future research on robust and reliable depth estimation and beyond. The datasets, competition toolkit, workshop recordings, and source code from the winning teams are publicly available on the challenge website.
Abstract:This paper studies grading algorithms for randomized exams. In a randomized exam, each student is asked a small number of random questions from a large question bank. The predominant grading rule is simple averaging, i.e., calculating grades by averaging scores on the questions each student is asked, which is fair ex-ante, over the randomized questions, but not fair ex-post, on the realized questions. The fair grading problem is to estimate the average grade of each student on the full question bank. The maximum-likelihood estimator for the Bradley-Terry-Luce model on the bipartite student-question graph is shown to be consistent with high probability when the number of questions asked to each student is at least the cubed-logarithm of the number of students. In an empirical study on exam data and in simulations, our algorithm based on the maximum-likelihood estimator significantly outperforms simple averaging in prediction accuracy and ex-post fairness even with a small class and exam size.
Abstract:This paper introduces VESR-Net, a method for video enhancement and super-resolution (VESR). We design a separate non-local module to explore the relations among video frames and fuse video frames efficiently, and a channel attention residual block to capture the relations among feature maps for video frame reconstruction in VESR-Net. We conduct experiments to analyze the effectiveness of these designs in VESR-Net, which demonstrates the advantages of VESR-Net over previous state-of-the-art VESR methods. It is worth to mention that among more than thousands of participants for Youku video enhancement and super-resolution (Youku-VESR) challenge, our proposed VESR-Net beat other competitive methods and ranked the first place.