Large Language Models (LLMs) have shown remarkable reasoning capabilities with advanced prompting techniques, but they fall short on tasks that require exploration, strategic foresight, and sequential decision-making. Recent works propose using external programs to define the search logic, so that LLMs can perform passive tree search to solve more challenging reasoning tasks. Though impressive results have been achieved, these approaches have several fundamental limitations. First, passive tree searches are inefficient, as they usually require multiple rounds of LLM API calls to solve a single problem. Moreover, passive search methods are inflexible, since they need task-specific program designs. A natural question then arises: can we maintain the tree-search capability of LLMs without the aid of external programs, while still generating responses that clearly demonstrate the process of a tree-structured search? To this end, we propose a new concept called the autonomous tree-search ability of LLMs, which automatically generates a response containing search trajectories for the correct answer. Concretely, we generate search trajectories using capable LLM APIs via a fixed system prompt, allowing them to perform autonomous tree-search (ATS) right out of the box. Experiments on 4 puzzle games demonstrate that our method achieves large improvements. The ATS-BFS method outperforms the Chain of Thought (CoT) approach with an average accuracy improvement of 33%. Compared to Tree of Thoughts, it requires 65.6% or 47.7% less GPT API cost to attain a comparable level of accuracy. Moreover, we have collected data using the ATS prompt method and fine-tuned LLaMA. This approach yields a greater improvement than fine-tuning on CoT data. Specifically, it outperforms CoT-tuned LLaMAs by an average of 40.6% and 38.5% for LLaMA2-7B and LLaMA2-13B, respectively.
* Due to the limitation "The abstract field cannot be longer than 1,920
characters", the abstract here is shorter than that in the PDF file
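The "passive" tree search that the abstract above contrasts with ATS can be illustrated with a minimal sketch: an external program owns the breadth-first loop and asks a proposer for next states at every expansion. Everything here is an assumption made for illustration: the toy number puzzle, the `propose` stub standing in for an LLM call, and the `bfs_search` helper are hypothetical and are not taken from the paper.

```python
from collections import deque


def propose(state):
    """Stand-in for an LLM call that proposes next states (hypothetical toy rules)."""
    return [state + 3, state * 2]


def bfs_search(start, target, max_depth=6):
    """Externally driven breadth-first search.

    Returns (path, calls), where calls counts how often `propose` was invoked.
    """
    queue = deque([(start, [start])])
    seen = {start}
    calls = 0
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path, calls
        if len(path) > max_depth:
            continue
        calls += 1  # in a passive search, each expansion would be one LLM API call
        for nxt in propose(state):
            # Prune states that overshoot the target or were already visited.
            if nxt not in seen and nxt <= target:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None, calls


path, calls = bfs_search(1, 14)
print("path:", path, "proposer calls:", calls)
```

The proposer is invoked several times before a single answer is found, which is the per-problem call overhead the abstract attributes to passive search. In ATS, by the abstract's description, the whole trajectory would instead be produced inside one model response.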
Large Language Models (LLMs) demonstrate remarkable performance on a variety of Natural Language Understanding (NLU) tasks, primarily due to their in-context learning ability. This ability is utilized in our proposed "CoThought" pipeline, which efficiently trains smaller "baby" language models (BabyLMs) by leveraging the Chain of Thought (CoT) prompting of LLMs. Our pipeline restructures a dataset of less than 100M in size using GPT-3.5-turbo, transforming it into task-oriented, human-readable texts that are comparable to the school texts for language learners. The BabyLM is then pretrained on this restructured dataset in a RoBERTa (Liu et al., 2019) fashion. In evaluations across 4 benchmarks, our BabyLM outperforms the RoBERTa-base in 10 linguistic, NLU, and question answering tasks by more than 3 points, showing superior ability to extract contextual information. These results suggest that compact LMs pretrained on small, LLM-restructured data can better understand tasks and achieve improved performance. The code for data processing and model training is available at: https://github.com/oooranz/Baby-CoThought.
Large Language Models (LLMs) have achieved remarkable results. However, existing models are expensive to train and deploy, and it is difficult to expand their knowledge beyond the pre-training data without forgetting previous knowledge. This paper proposes a new neural network architecture, ModuleFormer, that leverages modularity to improve the efficiency and flexibility of large language models. ModuleFormer is based on the Sparse Mixture of Experts (SMoE). Unlike the previous SMoE-based modular language model [Gururangan et al., 2021], which requires domain-labeled data to learn domain-specific experts, ModuleFormer can induce modularity from uncurated data with its new load balancing and load concentration losses. ModuleFormer is a modular architecture that includes two different types of modules: new stick-breaking attention heads and feedforward experts. Different modules are sparsely activated, conditioned on the input token, during training and inference. In our experiment, we found that the modular architecture enables three important abilities for large pre-trained language models: 1) Efficiency: since ModuleFormer only activates a subset of its modules for each input token, it can achieve the same performance as dense LLMs with more than twice the throughput; 2) Extendability: ModuleFormer is more immune to catastrophic forgetting than dense LLMs and can be easily extended with new modules to learn new knowledge that is not included in the training data; 3) Specialization: finetuning ModuleFormer can specialize a subset of modules to the finetuning task, and the task-unrelated modules can be easily pruned for lightweight deployment.
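The sparse activation described in the ModuleFormer abstract above can be sketched as generic top-k expert routing. This is a minimal illustration under stated assumptions, not ModuleFormer's implementation: the toy scalar experts, the fixed router logits, and the renormalization over the selected experts are all hypothetical, and the load balancing and load concentration losses and the stick-breaking attention are not shown.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts (assumption)
    return {i: probs[i] / norm for i in top}


def sparse_layer(x, experts, router_logits, k=2):
    """Weighted sum over the selected experts only; the others are never evaluated."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy "experts" acting on a scalar input (hypothetical).
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]  # in a real model these would depend on the input token

weights = route_top_k(logits, k=2)
y = sparse_layer(3.0, experts, logits, k=2)
print(weights, y)
```

Only two of the four experts run for this input, which is the source of the throughput gain the abstract claims over dense models.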
Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a new language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.
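The mPLM-Sim abstract above does not spell out how similarity is computed, so the following is only one plausible sketch: average the cosine similarity between embeddings of aligned sentences in two languages, then rank candidate source languages by that score. The function names and the two-dimensional toy vectors are hypothetical; real inputs would be sentence representations taken from a layer of an mPLM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def language_similarity(emb_a, emb_b):
    """Mean cosine similarity over aligned (parallel) sentence embeddings."""
    sims = [cosine(u, v) for u, v in zip(emb_a, emb_b)]
    return sum(sims) / len(sims)


def rank_source_languages(target, embeddings):
    """Sort all other languages by similarity to `target`, most similar first."""
    scores = {
        lang: language_similarity(embeddings[target], emb)
        for lang, emb in embeddings.items()
        if lang != target
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings for two parallel sentences per language (made-up numbers).
embeddings = {
    "tgt": [[1.0, 0.0], [0.0, 1.0]],
    "src_a": [[0.9, 0.1], [0.1, 0.9]],
    "src_b": [[0.0, 1.0], [1.0, 0.0]],
}

ranking = rank_source_languages("tgt", embeddings)
print(ranking)
```

The top-ranked language would then be the candidate source for zero-shot transfer to the target.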
Gradient Boosting Decision Tree (GBDT) has achieved remarkable success in a wide variety of applications. The split finding algorithm, which determines the tree construction process, is one of the most crucial components of GBDT. However, the split finding algorithm has long been criticized for its bias towards features with a large number of potential splits. This bias introduces severe interpretability and overfitting issues in GBDT. To this end, we provide a fine-grained analysis of bias in GBDT and demonstrate that the bias originates from 1) the systematic bias in the gain estimation of each split and 2) the bias in the split finding algorithm resulting from the use of the same data to evaluate the split improvement and determine the best split. Based on the analysis, we propose unbiased gain, a new unbiased measurement of gain importance using out-of-bag samples. Moreover, we incorporate the unbiased property into the split finding algorithm and develop UnbiasedGBM to solve the overfitting issue of GBDT. We assess the performance of UnbiasedGBM and unbiased gain in a large-scale empirical study comprising 60 datasets and show that: 1) UnbiasedGBM exhibits better performance than popular GBDT implementations such as LightGBM, XGBoost, and CatBoost on average on the 60 datasets and 2) unbiased gain achieves better average performance in feature selection than popular feature importance methods. The code is available at https://github.com/ZheyuAqaZhang/UnbiasedGBM.
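The second source of bias named in the abstract above, using the same data both to pick the best split and to score it, can be illustrated with a small regression example. This sketch is not the paper's unbiased gain estimator: it simply chooses a threshold on one set of rows and re-scores that fixed split on held-out rows, using squared-error reduction as the gain. The helper names and the toy numbers are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def sse(ys, pred):
    """Sum of squared errors of a constant prediction."""
    return sum((y - pred) ** 2 for y in ys)


def best_split(xs, ys):
    """Pick the threshold with the largest in-sample squared-error reduction."""
    parent = sse(ys, mean(ys))
    best_gain, best_thr = float("-inf"), None
    for thr in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        gain = parent - sse(left, mean(left)) - sse(right, mean(right))
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain


def held_out_gain(thr, fit_x, fit_y, eval_x, eval_y):
    """Score a fixed split on held-out rows, with leaf values fitted on the fit rows."""
    left_val = mean([y for x, y in zip(fit_x, fit_y) if x <= thr])
    right_val = mean([y for x, y in zip(fit_x, fit_y) if x > thr])
    before = sse(eval_y, mean(fit_y))
    after = sum(
        (y - (left_val if x <= thr else right_val)) ** 2
        for x, y in zip(eval_x, eval_y)
    )
    return before - after


# Toy data: the feature carries no real signal; the first fit row is an outlier.
fit_x, fit_y = [1, 2, 3, 4, 5, 6], [5, 1, 2, 2, 1, 3]
eval_x, eval_y = [1, 2, 3, 4, 5, 6], [1, 3, 2, 2, 3, 1]

thr, in_sample = best_split(fit_x, fit_y)
held_out = held_out_gain(thr, fit_x, fit_y, eval_x, eval_y)
print(thr, in_sample, held_out)
```

The in-sample gain of the chosen split is positive because the search fits the outlier, while the same split makes predictions worse on the held-out rows. A feature with many candidate thresholds gives the search more chances to find such spurious splits.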
Task-oriented dialog (TOD) aims to assist users in achieving specific goals through multi-turn conversation. Recently, good results have been obtained with large pre-trained models. However, the scarcity of labeled data hinders the efficient development of TOD systems at scale. In this work, we constructed a weakly supervised dataset based on a teacher/student paradigm that leverages a large collection of unlabelled dialogues. Furthermore, we built a modular dialogue system and integrated coarse-to-fine grained classification for user intent detection. Experiments show that our method can reach the dialog goal with a higher success rate and generate more coherent responses.
* Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems
Co-located with EMNLP 2022, System Description Paper, 5 pages
This paper estimates classroom occupancy using a CO2 sensor and a deep learning technique, Long Short-Term Memory (LSTM). As a case study connecting IoT and machine learning, I build a model that estimates the number of people in a classroom from the environmental data exported by the CO2 sensor. I also evaluate the model's performance to show the feasibility of applying the module in a real environment.
The goal of automated feature generation is to liberate machine learning experts from the laborious task of manual feature generation, which is crucial for improving the learning performance of tabular data. The major challenge in automated feature generation is to efficiently and accurately identify useful features from a vast pool of candidate features. In this paper, we present OpenFE, an automated feature generation tool that provides competitive results against machine learning experts. OpenFE achieves efficiency and accuracy with two components: 1) a novel feature boosting method for accurately estimating the incremental performance of candidate features, and 2) a feature-scoring framework for retrieving effective features from a large number of candidates through successive featurewise halving and feature importance attribution. Extensive experiments on seven benchmark datasets show that OpenFE outperforms existing baseline methods. We further evaluate OpenFE in two famous Kaggle competitions with thousands of data science teams participating. In one of the competitions, features generated by OpenFE with a simple baseline model can beat 99.3% of data science teams. In addition to the empirical results, we provide a theoretical perspective to show that feature generation is beneficial in a simple yet representative setting. The code is available at https://github.com/ZhangTP1996/OpenFE.
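The "successive featurewise halving" mentioned in the OpenFE abstract above can be sketched generically: score every candidate on a small slice of the data, keep the better half, double the slice, and repeat. This is an illustration under assumptions, not OpenFE's code or API. The absolute Pearson correlation used as the score is a stand-in for the paper's feature boosting estimate, and the function names and candidate columns are hypothetical.

```python
def abs_corr(xs, ys):
    """Absolute Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def successive_halving(candidates, target, start_rows, keep=1):
    """Halve the candidate pool each round while doubling the rows used for scoring.

    candidates maps a feature name to its column of values.
    """
    alive = dict(candidates)
    rows = start_rows
    while len(alive) > keep:
        rows = min(rows, len(target))
        scores = {
            name: abs_corr(col[:rows], target[:rows]) for name, col in alive.items()
        }
        ranked = sorted(alive, key=lambda name: scores[name], reverse=True)
        survivors = ranked[: max(keep, len(ranked) // 2)]
        alive = {name: alive[name] for name in survivors}
        rows *= 2
    return list(alive)


# Toy target and four hypothetical candidate features over 16 rows.
target = list(range(16))
candidates = {
    "doubled": [2 * t for t in target],
    "shifted_square": [(t - 8) ** 2 for t in target],
    "alternating": [1 if t % 2 == 0 else -1 for t in target],
    "constant": [7] * 16,
}

winner = successive_halving(candidates, target, start_rows=4)
print(winner)
```

Weak candidates are discarded after seeing only a few rows, so most of the scoring budget is spent on the candidates that survive to the later, larger rounds.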
This paper presents a driver-specific risk recognition framework for autonomous vehicles that can extract inter-vehicle interactions. This extraction is carried out for urban driving scenarios in a driver-cognitive manner to improve the recognition accuracy of risky scenes. First, clustering analysis is applied to the operation data of drivers for learning the subjective assessment of risky scenes of different drivers and generating the corresponding risk label for each scene. Second, the graph representation model (GRM) is adopted to unify and construct the features of dynamic vehicles, inter-vehicle interactions and static traffic markings in real driving scenes into graphs. The driver-specific risk label provides ground truth to capture the risk evaluation criteria of different drivers. In addition, the graph model represents multiple features of the driving scenes. Therefore, the proposed framework can learn the risk-evaluating pattern of driving scenes of different drivers and establish driver-specific risk identifiers. Last, the performance of the proposed framework is evaluated via experiments conducted using real-world urban driving datasets collected by multiple drivers. The results show that the risks and their levels in real driving environments can be accurately recognized by the proposed framework.
* Submitted to IEEE Transactions on Vehicular Technology
In recent years, road safety has attracted significant attention from researchers and practitioners in the intelligent transport systems domain. As one of the most common and vulnerable groups of road users, pedestrians raise great concern due to their unpredictable behavior and movement, as subtle misunderstandings in vehicle-pedestrian interaction can easily lead to risky situations or collisions. Existing methods use either predefined collision-based models or human-labeling approaches to estimate pedestrians' risk. These approaches are usually limited by their poor generalization ability and lack of consideration of interactions between the ego vehicle and a pedestrian. This work tackles these problems by proposing a Pedestrian Risk Level Prediction (PRLP) system. The system consists of three modules. Firstly, vehicle-perspective pedestrian data are collected. Since the data contain information regarding the movement of both the ego vehicle and the pedestrian, they can simplify the prediction of spatiotemporal features in an interaction-aware fashion. Using the long short-term memory model, the pedestrian trajectory prediction module predicts pedestrians' spatiotemporal features in the subsequent five frames. As the predicted trajectory follows certain interaction and risk patterns, a hybrid clustering and classification method is adopted to explore the risk patterns in the spatiotemporal features and train a risk level classifier using the learned patterns. Upon predicting the spatiotemporal features of pedestrians and identifying the corresponding risk level, the risk patterns between the ego vehicle and pedestrians are determined. Experimental results verified the capability of the PRLP system to predict the risk level of pedestrians, thus supporting the collision risk assessment of intelligent vehicles and providing safety warnings to both vehicles and pedestrians.