Abstract:Large language model (LLM) inference increasingly depends on multi-GPU execution, yet existing inference parallelization strategies require layer-wise inter-rank synchronization, making end-to-end performance sensitive to workload imbalance. We present DWDP (Distributed Weight Data Parallelism), an inference parallelization strategy that preserves data-parallel execution while offloading MoE weights across peer GPUs and fetching missing experts on demand. By removing collective inter-rank synchronization, DWDP allows each GPU to progress independently. We further address the practical overheads of this design with two optimizations for split-weight management and asynchronous remote-weight prefetch. Implemented in TensorRT-LLM and evaluated with DeepSeek-R1 on GB200 NVL72, DWDP improves end-to-end output TPS/GPU by 8.8% at comparable TPS/user in the 20-100 TPS/user serving range under 8K input sequence length and 1K output sequence length.
Abstract:This paper presents a white-box intention-aware decision-making for the handling of interactions between a pedestrian and an automated vehicle (AV) in an unsignalized street crossing scenario. Moreover, a design framework has been developed, which enables automated parameterization of the decision-making. This decision-making is designed in such a manner that it can understand pedestrians in urban traffic and can react accordingly to their intentions. That way, a human-like response to the actions of the pedestrian is ensured, leading to a higher acceptance of AVs. The core notion of this paper is that the intention prediction of the pedestrian to cross the street and decision-making are divided into two subsystems. On the one hand, the intention detection is a data-driven, black-box model. Thus, it can model the complex behavior of the pedestrians. On the other hand, the decision-making is a white-box model to ensure traceability and to enable a rapid verification and validation of AVs. This white-box decision-making provides human-like behavior and a guaranteed prevention of deadlocks. An additional benefit is that the proposed decision-making requires low computational resources only enabling real world usage. The automated parameterization uses a particle swarm optimization and compares two different models of the pedestrian: The social force model and the Markov decision process model. Consequently, a rapid design of the decision-making is possible and different pedestrian behaviors can be taken into account. The results reinforce the applicability of the proposed intention-aware decision-making.