Brains learn to represent information from a large set of stimuli, typically by weak supervision. Unsupervised learning is therefore a natural approach for exploring the design of biological neural networks and their computations. Accordingly, redundancy reduction has been suggested as a prominent design principle of neural encoding, but its ``mechanistic'' biological implementation is unclear. Analogously, unsupervised training of artificial neural networks yields internal representations that allow for accurate stimulus classification or decoding, but typically rely on biologically-implausible implementations. We suggest that interactions between parallel subnetworks in the brain may underlie such learning: we present a model of representation learning by ensembles of neural networks, where each network learns to encode stimuli into an abstract representation space by cross-supervising interactions with other networks, for inputs they receive simultaneously or in close temporal proximity. Aiming for biological plausibility, each network has a small ``receptive field'', thus receiving a fixed part of the external input, and the networks do not share weights. We find that for different types of network architectures, and for both visual or neuronal stimuli, these cross-supervising networks learn semantic representations that are easily decodable and that decoding accuracy is comparable to supervised networks -- both at the level of single networks and the ensemble. We further show that performance is optimal for small receptive fields, and that sparse connectivity between networks is nearly as accurate as all-to-all interactions, with far fewer computations. We thus suggest a sparsely interacting collective of cross-supervising networks as an algorithmic framework for representational learning and collective computation in the brain.