



Abstract: We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters, with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems, including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm that further enhances RL efficiency. CISPO clips importance-sampling weights rather than token updates, and outperforms other competitive RL variants. Combining hybrid attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, at a rental cost of just $534,700. We release two versions of MiniMax-M1 with 40K and 80K thinking budgets, respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
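To make the clipping distinction concrete, below is a minimal, hedged sketch of a CISPO-style loss in PyTorch. The function name, default thresholds, and mean reduction are assumptions for illustration rather than the paper's exact objective; the point it shows is that the importance-sampling weight itself is clipped and detached, so every token still contributes a bounded policy-gradient term instead of having its update dropped as in PPO-style surrogate clipping.

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """Illustrative CISPO-style objective (assumed form, not the paper's exact loss)."""
    # Importance-sampling weight between the current policy and the behavior policy.
    ratio = torch.exp(logp_new - logp_old)
    # Clip the IS weight itself and stop its gradient; unlike PPO's clipped surrogate,
    # no token's policy-gradient contribution is removed entirely.
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
    # REINFORCE-style term: clipped weight x advantage x log-prob under the new policy.
    return -(clipped_ratio * advantages * logp_new).mean()
```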
Abstract: Federated medical relation extraction enables multiple clients to collaboratively train a deep network without sharing their raw medical data. To handle the heterogeneous label distributions across clients, most existing works only enforce regularization between the local and global models during optimization. In this paper, we fully utilize the models of all clients and propose the novel concept of major classifier vectors, a group of class vectors obtained on the server via an ensemble rather than a weighted-average method. The major classifier vectors are then distributed to all clients, and each client's local training is Contrasted with the Major Classifier vectors (FedCMC), so the local model is less prone to overfitting to the local label distribution. FedCMC requires only a small additional transfer of classifier parameters, without any leakage of raw data, extracted representations, or label distributions. Our extensive experiments show that FedCMC outperforms other state-of-the-art FL algorithms on three medical relation extraction datasets.
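As a rough illustration of contrasting local training with the major classifier vectors, here is a short, hedged sketch of one plausible local regularizer in PyTorch. The function name, cosine-similarity form, and temperature are assumptions rather than the paper's formulation; it simply pulls each sample's representation toward the major classifier vector of its class and away from the other classes' vectors.

```python
import torch
import torch.nn.functional as F

def contrast_with_major_classifiers(features, labels, major_vecs, tau=0.5):
    """Hypothetical FedCMC-style local regularizer (assumed form)."""
    feats = F.normalize(features, dim=1)      # [batch, dim] local representations
    protos = F.normalize(major_vecs, dim=1)   # [num_classes, dim] major classifier vectors from the server
    logits = feats @ protos.t() / tau         # cosine similarities scaled by a temperature
    # Cross-entropy over class prototypes: attract the own-class vector, repel the rest.
    return F.cross_entropy(logits, labels)
```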