We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and performs competitively with the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly-sized models, even outperforming models over four times larger and those explicitly specialized for multilingual tasks.
In this paper, we aim to improve the reasoning ability of large language models (LLMs) over knowledge graphs (KGs) to answer complex questions. Inspired by existing methods that design the interaction strategy between LLMs and KGs, we propose an autonomous LLM-based agent framework, called KG-Agent, which enables a small LLM to actively make decisions until it finishes the reasoning process over KGs. In KG-Agent, we integrate the LLM, a multifunctional toolbox, a KG-based executor, and a knowledge memory, and develop an iteration mechanism that autonomously selects a tool and then updates the memory for reasoning over the KG. To guarantee effectiveness, we use a programming language to formulate the multi-hop reasoning process over the KG, and synthesize a code-based instruction dataset to fine-tune the base LLM. Extensive experiments demonstrate that tuning LLaMA-7B on only 10K samples can outperform state-of-the-art methods that use larger LLMs or more data, on both in-domain and out-of-domain datasets. Our code and data will be publicly released.
In this work, we study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on LLMs. A well-formatted, verbose but less helpful response from the LLMs can often deceive LLMs or even human evaluators into awarding high scores. The same issue also holds for some reward models in RL. To address the challenges in both training and evaluation, we establish a more reliable evaluation protocol for comparing different training configurations, which inspects the trade-off between LLM evaluation score and response length obtained by varying training hyperparameters. Based on this evaluation, we conduct large-scale studies, whose results offer insight into how effectively the hyperparameters and tricks used in RL mitigate length bias. We further propose to improve the reward model by jointly training two linear heads on shared feature representations to predict the rewards, one trained to correlate with length, and the other trained to be decorrelated from length and therefore focus more on the actual content. We then discard the length head in RL to prevent reward hacking on length. Experiments demonstrate that our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
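The two-head reward model described above can be illustrated with a minimal sketch. The abstract does not specify the loss, so everything here is an assumption made for illustration: the use of Pearson correlation, the function names `pearson`, `two_head_rewards`, and `length_correlation_loss`, and the toy features. The preference (ranking) loss that a reward model would also be trained with is omitted.

```python
import numpy as np


def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12  # avoid divide-by-zero
    return float((a * b).sum() / denom)


def two_head_rewards(features, w_length, w_quality):
    """Apply two linear heads to the same shared feature matrix."""
    return features @ w_length, features @ w_quality


def length_correlation_loss(r_length, r_quality, lengths):
    """Assumed auxiliary loss (not taken from the paper): push the length head
    to track response length and the quality head to be uncorrelated with it."""
    return abs(pearson(r_quality, lengths)) - pearson(r_length, lengths)


# Toy batch: feature column 0 tracks response length, column 1 does not.
lengths = np.array([1.0, 2.0, 3.0, 4.0])
features = np.stack([lengths, np.array([1.0, -1.0, -1.0, 1.0])], axis=1)
w_length = np.array([1.0, 0.0])   # hypothetical length-head weights
w_quality = np.array([0.0, 1.0])  # hypothetical quality-head weights

r_length, r_quality = two_head_rewards(features, w_length, w_quality)
aux_loss = length_correlation_loss(r_length, r_quality, lengths)

# During RL only the quality head would be used; the length head is discarded.
rl_reward = r_quality
```

In this toy batch the quality head is exactly uncorrelated with length, so the auxiliary loss sits at its minimum; in practice both heads and the shared features would be learned jointly.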
We present LiRank, a large-scale ranking framework at LinkedIn that brings to production state-of-the-art modeling architectures and optimization methods. We unveil several modeling improvements, including Residual DCN, which adds attention and residual connections to the famous DCNv2 architecture. We share insights into combining and tuning SOTA architectures to create a unified model, including Dense Gating, Transformers and Residual DCN. We also propose novel techniques for calibration and describe how we productionalized deep learning based explore/exploit methods. To enable effective, production-grade serving of large ranking models, we detail how to train and compress models using quantization and vocabulary compression. We provide details about the deployment setup for large-scale use cases of Feed ranking, Jobs Recommendations, and Ads click-through rate (CTR) prediction. We summarize our learnings from various A/B tests by elucidating the most effective technical approaches. These ideas have contributed to relative metrics improvements across the board at LinkedIn: +0.5% member sessions in the Feed, +1.76% qualified job applications for Jobs search and recommendations, and +4.3% for Ads CTR. We hope this work can provide practical insights and solutions for practitioners interested in leveraging large-scale deep ranking systems.
Target detection is pivotal for modern urban computing applications. While image-based techniques are widely adopted, they falter under challenging environmental conditions such as adverse weather, poor lighting, and occlusion. To improve the target detection performance under complex real-world scenarios, this paper proposes an intelligent integrated optical camera and millimeter-wave (mmWave) radar system. Utilizing both physical knowledge and data-driven methods, a long-term robust radar-camera fusion algorithm is proposed to solve the heterogeneous data fusion problem for detection improvement. For the occlusion scenarios, the proposed algorithm can effectively detect occluded targets with the help of memory through performing long-term detection. For dark scenarios with low-light conditions, the proposed algorithm can effectively mark the target in the dark picture as well as provide rough stickman imaging. The above two innovative functions of the hybrid optical camera and mmWave radar system are tested in real-world scenarios. The results demonstrate the robustness and significant enhancement in the target detection performance of our integrated system.
With the increasing consumption of 3D displays and virtual reality, multi-view video has become a promising format. However, its high resolution and multi-camera shooting result in a substantial increase in data volume, making storage and transmission a challenging task. To tackle these difficulties, we propose an implicit-explicit integrated representation for multi-view video compression. Specifically, we first use the explicit representation-based 2D video codec to encode one of the source views. Subsequently, we propose employing the implicit neural representation (INR)-based codec to encode the remaining views. The implicit codec takes the time and view index of multi-view video as coordinate inputs and generates the corresponding implicit reconstruction frames. To enhance the compressibility, we introduce a multi-level feature grid embedding and a fully convolutional architecture into the implicit codec. These components facilitate coordinate-feature and feature-RGB mapping, respectively. To further enhance the reconstruction quality from the INR codec, we leverage the high-quality reconstructed frames from the explicit codec to achieve inter-view compensation. Finally, the compensated results are fused with the implicit reconstructions from the INR to obtain the final reconstructed frames. Our proposed framework combines the strengths of both implicit neural representation and explicit 2D codec. Extensive experiments conducted on public datasets demonstrate that the proposed framework can achieve comparable or even superior performance to the latest multi-view video compression standard MIV and other INR-based schemes in terms of view compression and scene modeling.
A swarm of robots has advantages over a single robot, since it can explore larger areas much faster and is more robust to single-point failures. Accurate relative positioning is necessary to successfully carry out a collaborative mission without collisions. When Visual Simultaneous Localization and Mapping (VSLAM) is used to estimate the poses of each robot, inter-agent loop closing is widely applied to reduce the relative positioning errors. This technique can mitigate errors using the feature points commonly observed by different robots. However, it requires significant computing and communication capabilities to detect inter-agent loops, and to process the data transmitted by multiple agents. In this paper, we propose Collaborative SLAM using Visual Odometry and Range measurements (CoVOR-SLAM) to overcome this challenge. In the framework of CoVOR-SLAM, robots only need to exchange pose estimates, covariances (uncertainty) of the estimates, and range measurements between robots. Since CoVOR-SLAM does not require associating visual features and map points observed by different agents, the computational and communication loads are significantly reduced. The required range measurements can be obtained using pilot signals of the communication system, without requiring complex additional infrastructure. We tested CoVOR-SLAM using real images as well as real ultra-wideband-based ranges obtained with two rovers. In addition, CoVOR-SLAM is evaluated with a larger scale multi-agent setup exploiting public image datasets and ranges generated using a realistic simulation. The results show that CoVOR-SLAM can accurately estimate the robots' poses, requiring much less computational power and communication capabilities than the inter-agent loop closing technique.
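To make the exchanged quantities concrete, the sketch below shows how a single inter-robot range measurement could be compared against two position estimates and their covariances. It is not CoVOR-SLAM's estimator, which the abstract does not describe. It uses standard first-order uncertainty propagation, assumes the two estimates are statistically independent, and the function name `range_residual` and all numbers are hypothetical.

```python
import numpy as np


def range_residual(p_i, p_j, cov_i, cov_j, measured_range, sigma_range):
    """Return (residual, variance) for one inter-robot range measurement.

    p_i, p_j     : position estimates of the two robots
    cov_i, cov_j : covariances of those estimates (assumed independent)
    sigma_range  : standard deviation of the range sensor noise
    """
    diff = p_i - p_j
    predicted = float(np.linalg.norm(diff))
    u = diff / predicted  # unit line-of-sight vector; assumes robots are not co-located
    # First-order propagation of position uncertainty onto the line of sight,
    # plus the range sensor noise.
    variance = float(u @ (cov_i + cov_j) @ u) + sigma_range ** 2
    return measured_range - predicted, variance


# Toy 2-D example with made-up values.
p_i = np.array([0.0, 0.0])
p_j = np.array([3.0, 4.0])
cov = 0.01 * np.eye(2)

residual, variance = range_residual(p_i, p_j, cov, cov, measured_range=5.5, sigma_range=0.1)
```

A residual that is large relative to the square root of `variance` would indicate that the relative position estimate disagrees with the measured range, which is the kind of information a range-based correction could exploit.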
* Submitted to the IEEE Transactions on Intelligent Transportation
Communication, Navigation and Surveillance (CNS) technologies are key enablers for future safe operation of drones in urban environments. However, the design of navigation technologies for these new applications is more challenging compared to e.g., civil aviation. On the one hand, the use cases and operations in urban environments are expected to have stringent requirements in terms of accuracy, integrity, continuity and availability. On the other hand, airborne sensors may not be based on high-quality equipment as in civil aviation and solutions need to rely on tighter multisensor solutions, whose safety is difficult to assess. In this work, we first provide some initial navigation requirements related to precision approach operations based on recently proposed vertiport designs. Then, we provide an overview of a possible multisensor navigation architecture solution able to support these types of operations and we comment on the challenges of each of the subsystems. Finally, initial proof of concept for some navigation sensor subsystems is presented based on flight trials performed during the German Aerospace Center (DLR) project HorizonUAM.
Generative large language models (LLMs) are proficient in solving general problems but often struggle to handle domain-specific tasks. This is because most domain-specific tasks, such as personalized recommendation, rely on task-related information for optimal performance. Current methods attempt to supply task-related information to LLMs by designing appropriate prompts or employing supervised fine-tuning techniques. Nevertheless, these methods face the problem that information such as community behavior patterns in the recommender system (RS) domain is challenging to express in natural language, which limits the capability of LLMs to surpass state-of-the-art domain-specific models. On the other hand, domain-specific models for personalized recommendation, which mainly rely on user interactions, are susceptible to data sparsity due to their limited common knowledge capabilities. To address these issues, we propose a method to bridge the information gap between the domain-specific models and the general large language models. Specifically, we propose an information sharing module which serves as an information storage mechanism and also acts as a bridge for collaborative training between the LLMs and domain-specific models. By doing so, we can improve the performance of LLM-based recommendation with the help of user behavior pattern information mined by domain-specific models. Conversely, the recommendation performance of domain-specific models can also be improved with the help of common knowledge learned by LLMs. Experimental results on three real-world datasets demonstrate the effectiveness of the proposed method.
By classifying infinite-width neural networks and identifying the *optimal* limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, for *widthwise hyperparameter transfer*, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones. Here we investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets). We classify depthwise parametrizations of block multiplier and learning rate by their infinite-width-then-depth limits. In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-$\mu$P, that extends $\mu$P, and show empirically that it admits depthwise hyperparameter transfer. We identify *feature diversity* as a crucial factor in deep networks, and Depth-$\mu$P can be characterized as maximizing both feature learning and feature diversity. Exploiting this, we find that absolute value, among all homogeneous nonlinearities, maximizes feature diversity and indeed empirically leads to significantly better performance. However, if each block is deeper (as in modern transformers), then we find fundamental limitations in all possible infinite-depth limits of such parametrizations, which we illustrate both theoretically and empirically on simple networks as well as a Megatron transformer trained on Common Crawl.
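As a reading aid, a depthwise parametrization of block multiplier and learning rate can be pictured as a family of the following form. The notation is an assumption introduced here and is not taken from the abstract: $L$ is the depth, $g^{l}$ is the $l$-th residual block with weights $W^{l}$, $a$ and $\eta$ are depth-independent constants, and $\alpha$, $\gamma$ are the exponents being classified.

$$
x^{l} = x^{l-1} + a\,L^{-\alpha}\, g^{l}\!\left(x^{l-1}; W^{l}\right), \quad l = 1, \dots, L, \qquad \eta_{L} = \eta\, L^{-\gamma}.
$$

Under this reading, each choice of $(\alpha, \gamma)$ is one parametrization, and hyperparameter transfer means that constants such as $a$ and $\eta$ tuned at small $L$ remain near-optimal at large $L$. The abstract does not state the exponent values that define Depth-$\mu$P, so none are given here.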