This paper introduces ConceptMath, a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with an average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularity with concept-wise accuracies. Based on our ConcepthMath, we evaluate a broad range of LLMs, and we observe existing LLMs, though achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones. Besides, we also introduce an efficient fine-tuning strategy to enhance the weaknesses of existing LLMs. Finally, we hope ConceptMath could guide the developers to understand the fine-grained mathematical abilities of their models and facilitate the growth of foundation models.
This study employs a uniform rectangular array (URA) sub-connected hybrid beamforming (SC-HBF) architecture to provide a novel self-interference (SI) suppression scheme in a full-duplex (FD) massive multiple-input multiple-output (mMIMO) system. Our primary objective is to mitigate the strong SI through the design of RF beamforming stages for uplink and downlink transmissions that utilize the spatial degrees of freedom provided due to the use of large array structures. We propose a non-constant modulus RF beamforming (NCM-BF-SIS) scheme that incorporates the gain controllers for both transmit (Tx) and receive (Rx) RF beamforming stages and optimizes the uplink and downlink beam directions jointly with gain controller coefficients. To solve this challenging non-convex optimization problem, we propose a swarm intelligence-based algorithmic solution that finds the optimal beam perturbations while also adjusting the Tx/Rx gain controllers to alleviate SI subject to the directivity degradation constraints for the beams. The data-driven analysis based on the measured SI channel in an anechoic chamber shows that the proposed NCM-BF-SIS scheme can suppress SI by around 80 dB in FD mMIMO systems.
Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. Existing long-context extension methods usually need additional training procedures to support corresponding long-context windows, where the long-context training data (e.g., 32k) is needed, and high GPU training costs are assumed. To address the aforementioned issues, we propose an Efficient and Extreme length extension method for Large Language Models, called E 2 -LLM, with only one training procedure and dramatically reduced computation cost, which also removes the need to collect long-context data. Concretely, first, the training data of our E 2 -LLM only requires a short length (e.g., 4k), which reduces the tuning cost greatly. Second, the training procedure on the short training context window is performed only once time, and we can support different evaluation context windows at inference. Third, in E 2 - LLM, based on RoPE position embeddings, we introduce two different augmentation methods on the scale and position index parameters for different samples in training. It aims to make the model more robust to the different relative differences when directly interpolating the arbitrary context length at inference. Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our E 2 -LLM on challenging long-context tasks.
High-concurrency asynchronous training upon parameter server (PS) architecture and high-performance synchronous training upon all-reduce (AR) architecture are the most commonly deployed distributed training modes for recommender systems. Although the synchronous AR training is designed to have higher training efficiency, the asynchronous PS training would be a better choice on training speed when there are stragglers (slow workers) in the shared cluster, especially under limited computing resources. To take full advantages of these two training modes, an ideal way is to switch between them upon the cluster status. We find two obstacles to a tuning-free approach: the different distribution of the gradient values and the stale gradients from the stragglers. In this paper, we propose Global Batch gradients Aggregation (GBA) over PS, which aggregates and applies gradients with the same global batch size as the synchronous training. A token-control process is implemented to assemble the gradients and decay the gradients with severe staleness. We provide the convergence analysis to demonstrate the robustness of GBA over the recommendation models against the gradient staleness. Experiments on three industrial-scale recommendation tasks show that GBA is an effective tuning-free approach for switching. Compared to the state-of-the-art derived asynchronous training, GBA achieves up to 0.2% improvement on the AUC metric, which is significant for the recommendation models. Meanwhile, under the strained hardware resource, GBA speeds up at least 2.4x compared to the synchronous training.
The development of personalized recommendation has significantly improved the accuracy of information matching and the revenue of e-commerce platforms. Recently, it has 2 trends: 1) recommender systems must be trained timely to cope with ever-growing new products and ever-changing user interests from online marketing and social network; 2) SOTA recommendation models introduce DNN modules to improve prediction accuracy. Traditional CPU-based recommender systems cannot meet these two trends, and GPU- centric training has become a trending approach. However, we observe that GPU devices in training recommender systems are underutilized, and they cannot attain an expected throughput improvement as what it has achieved in CV and NLP areas. This issue can be explained by two characteristics of these recommendation models: First, they contain up to a thousand input feature fields, introducing fragmentary and memory-intensive operations; Second, the multiple constituent feature interaction submodules introduce substantial small-sized compute kernels. To remove this roadblock to the development of recommender systems, we propose a novel framework named PICASSO to accelerate the training of recommendation models on commodity hardware. Specifically, we conduct a systematic analysis to reveal the bottlenecks encountered in training recommendation models. We leverage the model structure and data distribution to unleash the potential of hardware through our packing, interleaving, and caching optimization. Experiments show that PICASSO increases the hardware utilization by an order of magnitude on the basis of SOTA baselines and brings up to 6x throughput improvement for a variety of industrial recommendation models. Using the same hardware budget in production, PICASSO on average shortens the walltime of daily training tasks by 7 hours, significantly reducing the delay of continuous delivery.
Traffic flow forecasting is of great significance for improving the efficiency of transportation systems and preventing emergencies. Due to the highly non-linearity and intricate evolutionary patterns of short-term and long-term traffic flow, existing methods often fail to take full advantage of spatial-temporal information, especially the various temporal patterns with different period shifting and the characteristics of road segments. Besides, the globality representing the absolute value of traffic status indicators and the locality representing the relative value have not been considered simultaneously. This paper proposes a neural network model that focuses on the globality and locality of traffic networks as well as the temporal patterns of traffic data. The cycle-based dilated deformable convolution block is designed to capture different time-varying trends on each node accurately. Our model can extract both global and local spatial information since we combine two graph convolutional network methods to learn the representations of nodes and edges. Experiments on two real-world datasets show that the model can scrutinize the spatial-temporal correlation of traffic data, and its performance is better than the compared state-of-the-art methods. Further analysis indicates that the locality and globality of the traffic networks are critical to traffic flow prediction and the proposed TSSRGCN model can adapt to the various temporal traffic patterns.
Knowledge distillation has become increasingly important in model compression. It boosts the performance of a miniaturized student network with the supervision of the output distribution and feature maps from a sophisticated teacher network. Some recent works introduce multi-teacher distillation to provide more supervision to the student network. However, the effectiveness of multi-teacher distillation methods are accompanied by costly computation resources. To tackle with both the efficiency and the effectiveness of knowledge distillation, we introduce the feature aggregation to imitate the multi-teacher distillation in the single-teacher distillation framework by extracting informative supervision from multiple teacher feature maps. Specifically, we introduce DFA, a two-stage Differentiable Feature Aggregation search method that motivated by DARTS in neural architecture search, to efficiently find the aggregations. In the first stage, DFA formulates the searching problem as a bi-level optimization and leverages a novel bridge loss, which consists of a student-to-teacher path and a teacher-to-student path, to find appropriate feature aggregations. The two paths act as two players against each other, trying to optimize the unified architecture parameters to the opposite directions while guaranteeing both expressivity and learnability of the feature aggregation simultaneously. In the second stage, DFA performs knowledge distillation with the derived feature aggregation. Experimental results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10 datasets under various teacher-student settings, verifying the effectiveness and robustness of the design.
User behavior and feature interactions are crucial in deep learning-based recommender systems. There has been a diverse set of behavior modeling and interaction exploration methods in the literature. Nevertheless, the design of task-aware recommender systems still requires feature engineering and architecture engineering from domain experts. In this work, we introduce AMER, namely Automatic behavior Modeling and interaction Exploration in Recommender systems with Neural Architecture Search (NAS). The core contributions of AMER include the three-stage search space and the tailored three-step searching pipeline. In the first step, AMER searches for residual blocks that incorporate commonly used operations in the block-wise search space of stage 1 to model sequential patterns in user behavior. In the second step, it progressively investigates useful low-order and high-order feature interactions in the non-sequential interaction space of stage 2. Finally, an aggregation multi-layer perceptron (MLP) with shortcut connection is selected from flexible dimension settings of stage~3 to combine features extracted from the previous steps. For efficient and effective NAS, AMER employs the one-shot random search in all three steps. Further analysis reveals that AMER's search space could cover most of the representative behavior extraction and interaction investigation methods, which demonstrates the universality of our design. The extensive experimental results over various scenarios reveal that AMER could outperform competitive baselines with elaborate feature engineering and architecture engineering, indicating both effectiveness and robustness of the proposed method.