Abstract:Deep Neural Networks (DNNs) have been proven to be exceptionally effective and have been applied across diverse domains within deep learning. However, as DNN models increase in complexity, the demand for reduced computational costs and memory overheads has become increasingly urgent. Sparsity has emerged as a leading approach in this area. The robustness of sparse Multi-layer Perceptrons (MLPs) for supervised feature selection, along with the application of Sparse Evolutionary Training (SET), illustrates the feasibility of reducing computational costs without compromising accuracy. Moreover, it is believed that the SET algorithm can still be improved through a structural optimization method called motif-based optimization, with potential efficiency gains exceeding 40% and a performance decline of under 4%. This research investigates whether the structural optimization of Sparse Evolutionary Training applied to Multi-layer Perceptrons (SET-MLP) can enhance performance and to what extent this improvement can be achieved.
Abstract:Multi-object tracking (MOT) in computer vision has made significant advancements, yet tracking small fish in underwater environments presents unique challenges due to complex 3D motions and data noise. Traditional single-view MOT models often fall short in these settings. This thesis addresses these challenges by adapting state-of-the-art single-view MOT models, FairMOT and YOLOv8, for underwater fish detecting and tracking in ecological studies. The core contribution of this research is the development of a multi-view framework that utilizes stereo video inputs to enhance tracking accuracy and fish behavior pattern recognition. By integrating and evaluating these models on underwater fish video datasets, the study aims to demonstrate significant improvements in precision and reliability compared to single-view approaches. The proposed framework detects fish entities with a relative accuracy of 47% and employs stereo-matching techniques to produce a novel 3D output, providing a more comprehensive understanding of fish movements and interactions
Abstract:Multiple Object Tracking (MOT) is a computer vision task that has been employed in a variety of sectors. Some common limitations in MOT are varying object appearances, occlusions, or crowded scenes. To address these challenges, machine learning methods have been extensively deployed, leveraging large datasets, sophisticated models, and substantial computational resources. Due to practical limitations, access to the above is not always an option. However, with the recent release of foundation models by prominent AI companies, pretrained models have been trained on vast datasets and resources using state-of-the-art methods. This work tries to leverage one such foundation model, called DINOv2, through using knowledge distillation. The proposed method uses a teacher-student architecture, where DINOv2 is the teacher and the FairMOT backbone HRNetv2 W18 is the student. The results imply that although the proposed method shows improvements in certain scenarios, it does not consistently outperform the original FairMOT model. These findings highlight the potential and limitations of applying foundation models in knowledge
Abstract:Encounters between predator and prey play an essential role in ecosystems, but their rarity makes them difficult to detect in video recordings. Although advances in action recognition (AR) and temporal action detection (AD), especially transformer-based models and vision foundation models, have achieved high performance on human action datasets, animal videos remain relatively under-researched. This thesis addresses this gap by proposing the model MARINE, which utilizes motion-based frame selection designed for fast animal actions and DINOv2 feature extraction with a trainable classification head for action recognition. MARINE outperforms VideoMAE in identifying predator attacks in videos of fish, both on a small and specific coral reef dataset (81.53\% against 52.64\% accuracy), and on a subset of the more extensive Animal Kingdom dataset (94.86\% against 83.14\% accuracy). In a multi-label setting on a representative sample of Animal Kingdom, MARINE achieves 23.79\% mAP, positioning it mid-field among existing benchmarks. Furthermore, in an AD task on the coral reef dataset, MARINE achieves 80.78\% AP (against VideoMAE's 34.89\%) although at a lowered t-IoU threshold of 25\%. Therefore, despite room for improvement, MARINE offers an effective starter framework to apply to AR and AD tasks on animal recordings and thus contribute to the study of natural ecosystems.