Abstract: We introduce MOMO, the first multi-sensor foundation model for Mars remote sensing. MOMO uses model merging to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusing them via task arithmetic. This ensures that models are merged at compatible convergence stages, leading to improved stability and generalization. We train MOMO on a large-scale, high-quality corpus of $\sim 12$ million samples curated from Mars orbital data and evaluate it on 9 downstream tasks from Mars-Bench. MOMO achieves better overall performance than ImageNet pre-trained, Earth observation foundation model, sensor-specific pre-training, and fully supervised baselines. On segmentation tasks in particular, MOMO shows consistent and significant performance improvements. Our results demonstrate that model merging through an optimal checkpoint selection strategy provides an effective approach for building foundation models for multi-resolution data. The model weights, pretraining code, pretraining data, and evaluation code are available at: https://github.com/kerner-lab/MOMO.
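The two steps named in this abstract, checkpoint alignment by validation loss followed by task arithmetic, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the exact matching criterion, so the sketch assumes that "similar" means the smallest spread (maximum minus minimum) of validation losses across sensors. It also assumes that all sensor-specific models start from a shared set of base weights, which task arithmetic requires. The function names, the `scale` parameter, and the flat-list weight format are hypothetical.

```python
from itertools import product


def select_equal_val_loss(histories):
    """Pick one checkpoint per sensor so that validation losses are closest.

    histories: dict mapping sensor name -> list of (checkpoint_id, val_loss).
    Assumption: "closest" means the smallest max-minus-min spread of losses.
    Returns (dict sensor -> checkpoint_id, spread).
    """
    sensors = sorted(histories)
    best, best_spread = None, float("inf")
    # Exhaustive search over all combinations; fine for a handful of checkpoints.
    for combo in product(*(histories[s] for s in sensors)):
        losses = [loss for _, loss in combo]
        spread = max(losses) - min(losses)
        if spread < best_spread:
            best_spread = spread
            best = {s: ckpt for s, (ckpt, _) in zip(sensors, combo)}
    return best, best_spread


def task_arithmetic_merge(base, finetuned, scale=1.0):
    """Merge models as base + scale * sum(finetuned_i - base).

    base: dict mapping parameter name -> list of floats (shared starting weights).
    finetuned: list of dicts with the same keys and shapes as base.
    scale: hypothetical scaling coefficient applied to the summed task vectors.
    """
    merged = {}
    for name, base_w in base.items():
        merged[name] = [
            b + scale * sum(ft[name][i] - b for ft in finetuned)
            for i, b in enumerate(base_w)
        ]
    return merged


if __name__ == "__main__":
    # Made-up validation-loss histories, for illustration only.
    histories = {
        "hirise": [(1, 0.9), (2, 0.6), (3, 0.4)],
        "ctx": [(1, 0.8), (2, 0.5), (3, 0.45)],
        "themis": [(1, 0.7), (2, 0.42), (3, 0.3)],
    }
    chosen, spread = select_equal_val_loss(histories)
    print(chosen, spread)
```

With real checkpoints, the selected checkpoint identifiers would be used to load each sensor's weights before calling `task_arithmetic_merge`; the exhaustive search would need replacing with something cheaper if each sensor has many checkpoints.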
Abstract: In today's world, an abundance of digital content such as e-books, movies, videos and articles is available for consumption. Reviewing everything on offer and deciding what to watch next is daunting. Digital media providers therefore want to address this overload to increase user engagement and, ultimately, revenue. Content providers often utilise recommendation systems as an efficacious approach for combating such information overload. This paper concentrates on developing a synthetic approach for recommending movies. Traditionally, movie recommendation systems use either collaborative filtering, which utilises user interaction with the media, or content-based filtering, which makes use of the movie's available metadata. Technological advancements have also introduced a hybrid technique that integrates both. Our approach, however, deals solely with content-based recommendations and enhances them with a ranking algorithm based on content similarity metrics. Three metrics contribute to the ranking: similarity in metadata, in visual content, and in user reviews of the movies. We use text vectorization followed by cosine similarity for metadata, feature extraction by a pre-trained VGG19 followed by K-means clustering for visual content, and a comparison of sentiments for user reviews. Such a system allows viewers to discover movies that "feel" the same.
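The ranking idea in this abstract can be illustrated with a minimal sketch. It is not the authors' system. Metadata similarity is computed as cosine similarity over simple word-count vectors. The visual and review signals are assumed to be precomputed: a K-means cluster label per movie (hypothetically obtained from VGG19 features) and a sentiment score in [0, 1]. The abstract does not say how the three metrics are combined, so the equal weights, the cluster-match indicator, and the sentiment-difference formula are all assumptions, as are the function and field names.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between word-count vectors of two strings."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(query, candidates, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Rank candidate movies by a weighted sum of three similarity scores.

    Each movie is a dict with hypothetical fields:
      "title" (str), "metadata" (str), "cluster" (int), "sentiment" (float in [0, 1]).
    Assumed scores: metadata cosine similarity, 1.0 if the visual cluster labels
    match (else 0.0), and 1 - |difference in sentiment|.
    Returns titles sorted from most to least similar.
    """
    w_meta, w_visual, w_review = weights
    scored = []
    for movie in candidates:
        meta = cosine_similarity(query["metadata"], movie["metadata"])
        visual = 1.0 if movie["cluster"] == query["cluster"] else 0.0
        review = 1.0 - abs(query["sentiment"] - movie["sentiment"])
        score = w_meta * meta + w_visual * visual + w_review * review
        scored.append((score, movie["title"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


if __name__ == "__main__":
    # Made-up movies, for illustration only.
    query = {"title": "Q", "metadata": "space adventure robot",
             "cluster": 2, "sentiment": 0.8}
    candidates = [
        {"title": "B", "metadata": "romantic comedy paris",
         "cluster": 0, "sentiment": 0.9},
        {"title": "A", "metadata": "space adventure alien",
         "cluster": 2, "sentiment": 0.7},
    ]
    print(rank_candidates(query, candidates))
```

A fuller pipeline would likely swap the word counts for TF-IDF vectors and tune the weights on held-out data; both choices are left open here because the abstract does not specify them.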