Non-diagonal reconfigurable intelligent surfaces (RIS) manipulate wireless signals more flexibly than conventional RIS: an $M \times M$ switch array allows the signal incident on any of the $M$ elements to be reflected from another element. Fully exploiting this flexibility requires individual channel state information (CSI). However, because the RIS is passive and lacks signal processing capabilities, cascaded channel estimation is performed instead. Identifying the configuration that maximizes the channel gain then entails estimating the CSI for every permutation of the $M \times M$ switch array, a total of $M!$ possible configurations. The resulting long uplink training intervals degrade spectral efficiency and increase uplink energy consumption. In this paper, we propose a low-complexity channel estimation protocol that substantially reduces the need for an exhaustive search over all $M!$ permutations by using only three configurations to optimize the non-diagonal RIS switch array and beamforming for single-input single-output (SISO) and multiple-input single-output (MISO) systems. Specifically, the first two stages of our three-stage pilot-based protocol estimate scaled versions of the user-RIS and RIS-base station (BS) channels using the least squares (LS) estimator and the ON/OFF protocol commonly used for conventional RIS. In the third stage, the cascaded user-RIS-BS channels are estimated to enable efficient beamforming optimization. Complexity analysis shows that the proposed protocol significantly reduces the BS computational load from $\mathcal{O}(NM \times M!)$ to $\mathcal{O}(NM)$, where $N$ is the number of BS antennas. This complexity is similar to that of ON/OFF-based LS estimation for conventional diagonal RIS.
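
To give a feel for the scale of the $M!$ search and the stated complexity reduction, the following is a minimal Python sketch, not the protocol proposed here. The helper names (`cascaded_gain`, `exhaustive_search`, `operation_counts`) are hypothetical, and the toy SISO gain model, which assumes ideal phase alignment so that each path contributes the product of the two channel magnitudes, is an illustrative assumption rather than the signal model of the paper. The operation counts simply evaluate $NM \times M!$ and $NM$ without constant factors.

```python
import itertools
import math


def cascaded_gain(f, g, perm):
    """Toy SISO gain when the signal hitting element m leaves from element perm[m].

    Assumption: ideal phase alignment, so each path contributes |f[m]| * |g[perm[m]]|.
    """
    return sum(abs(f[m]) * abs(g[perm[m]]) for m in range(len(f)))


def exhaustive_search(f, g):
    """Evaluate all M! switch-array permutations and return the best one."""
    indices = range(len(f))
    return max(itertools.permutations(indices),
               key=lambda perm: cascaded_gain(f, g, perm))


def operation_counts(n_antennas, m_elements):
    """Evaluate the stated orders N*M*M! and N*M, ignoring constant factors."""
    exhaustive = n_antennas * m_elements * math.factorial(m_elements)
    proposed = n_antennas * m_elements
    return exhaustive, proposed


if __name__ == "__main__":
    # Hypothetical two-element channels (user-RIS f, RIS-BS g).
    f = [1 + 0j, 3j]
    g = [2 + 0j, 5 + 0j]
    print("best permutation:", exhaustive_search(f, g))

    # Growth of the exhaustive count versus the linear count for N = 4.
    for m in (4, 8, 12):
        print(m, operation_counts(4, m))
```

Even for a modest surface with $M = 8$ and $N = 4$, the exhaustive count is on the order of $10^6$, while the linear count is 32, which is why avoiding the permutation search matters for the training overhead.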