Abstract:Many pre-trained language models (PLMs) exhibit suboptimal performance on mid- and low-resource languages, largely due to limited exposure to these languages during pre-training. A common strategy to address this is to introduce new tokens specific to the target languages, initialize their embeddings, and apply continual pre-training on target-language data. Among such methods, OFA (Liu et al., 2024a) proposes a similarity-based subword embedding initialization heuristic that is both effective and efficient. However, OFA restricts target-language token embeddings to be convex combinations of a fixed number of source-language embeddings, which may limit expressiveness. To overcome this limitation, we propose HYPEROFA, a hypernetwork-based approach for more adaptive token embedding initialization. The hypernetwork is trained to map from an external multilingual word vector space to the PLMs token embedding space using source-language tokens. Once trained, it can generate flexible embeddings for target-language tokens, serving as a good starting point for continual pretraining. Experiments demonstrate that HYPEROFA consistently outperforms random initialization baseline and matches or exceeds the performance of OFA in both continual pre-training convergence and downstream task performance. We make the code publicly available.
Abstract:The increasing applications of autonomous driving systems necessitates large-scale, high-quality datasets to ensure robust performance across diverse scenarios. Synthetic data has emerged as a viable solution to augment real-world datasets due to its cost-effectiveness, availability of precise ground-truth labels, and the ability to model specific edge cases. However, synthetic data may introduce distributional differences and biases that could impact model performance in real-world settings. To evaluate the utility and limitations of synthetic data, we conducted controlled experiments using multiple real-world datasets and a synthetic dataset generated by BIT Technology Solutions GmbH. Our study spans two sensor modalities, camera and LiDAR, and investigates both 2D and 3D object detection tasks. We compare models trained on real, synthetic, and mixed datasets, analyzing their robustness and generalization capabilities. Our findings demonstrate that the use of a combination of real and synthetic data improves the robustness and generalization of object detection models, underscoring the potential of synthetic data in advancing autonomous driving technologies.