Beyond-diagonal reconfigurable intelligent surface (BD-RIS) generalizes the conventional diagonal RIS (D-RIS) by introducing tunable inter-element connections, offering enhanced wave manipulation capabilities. However, realizing the advantages of BD-RIS requires accurate channel state information (CSI), whose acquisition becomes significantly more challenging due to the increased number of channel coefficients, leading to prohibitively large pilot training overhead in BD-RIS-aided multi-user multiple-input multiple-output (MU-MIMO) systems. Existing studies reduce pilot overhead by exploiting the channel correlations induced by the Kronecker-product or multi-linear structure of BD-RIS-aided channels, which neglect the spatial correlation among antennas and the statistical correlation across RIS-user channels. In this paper, we propose a learning-based channel estimation framework, namely the joint training scattering matrix learning and channel estimation framework (JTSMLCEF), which jointly optimizes the BD-RIS training scattering matrix and estimates the cascaded channels in an end-to-end manner to achieve accurate channel estimation and reduce the pilot overhead. The proposed JTSMLCEF follows a two-phase channel estimation protocol to enable adaptive training scattering matrix optimization with a training scattering matrix optimizer (TSMO) and cascaded channel estimation with a dual-attention channel estimator (DACE). Specifically, the DACE is designed with intra-user and inter-user attention modules to capture the multi-dimensional correlations in multi-user cascaded channels. Simulation results demonstrate the superiority of JTSMLCEF. Compared with the current state-of-the-art method, it reduces the pilot overhead by $80\%$ while further reducing the normalized mean squared error (NMSE) by $82.6\%$ and $92.5\%$ in indoor and urban micro-cell (UMi) scenarios, respectively.