Object detection is a computer vision task in which the goal is to detect and locate objects of interest in an image or video. The task involves identifying the position and boundaries of objects in an image, and classifying the objects into different categories. It forms a crucial part of vision recognition, alongside image classification and retrieval.
The purpose of this project is to assess how well defenders can detect DNS-over-HTTPS (DoH) file exfiltration, and which evasion strategies can be used by attackers. While providing a reproducible toolkit to generate, intercept and analyze DoH exfiltration, and comparing Machine Learning vs threshold-based detection under adversarial scenarios. The originality of this project is the introduction of an end-to-end, containerized pipeline that generates configurable file exfiltration over DoH using several parameters (e.g., chunking, encoding, padding, resolver rotation). It allows for file reconstruction at the resolver side, while extracting flow-level features using a fork of DoHLyzer. The pipeline contains a prediction side, which allows the training of machine learning models based on public labelled datasets and then evaluates them side-by-side with threshold-based detection methods against malicious and evasive DNS-Over-HTTPS traffic. We train Random Forest, Gradient Boosting and Logistic Regression classifiers on a public DoH dataset and benchmark them against evasive DoH exfiltration scenarios. The toolkit orchestrates traffic generation, file capture, feature extraction, model training and analysis. The toolkit is then encapsulated into several Docker containers for easy setup and full reproducibility regardless of the platform it is run on. Future research regarding this project is directed at validating the results on mixed enterprise traffic, extending the protocol coverage to HTTP/3/QUIC request, adding a benign traffic generation, and working on real-time traffic evaluation. A key objective is to quantify when stealth constraints make DoH exfiltration uneconomical and unworthy for the attacker.




The source detection problem arises when an epidemic process unfolds over a contact network, and the objective is to identify its point of origin, i.e., the source node. Research on this problem began with the seminal work of Shah and Zaman in 2010, who formally defined it and introduced the notion of rumor centrality. With the emergence of Graph Neural Networks (GNNs), several studies have proposed GNN-based approaches to source detection. However, some of these works lack methodological clarity and/or are hard to reproduce. As a result, it remains unclear (to us, at least) whether GNNs truly outperform more traditional source detection methods across comparable settings. In this paper, we first review existing GNN-based methods for source detection, clearly outlining the specific settings each addresses and the models they employ. Building on this research, we propose a principled GNN architecture tailored to the source detection task. We also systematically investigate key questions surrounding this problem. Most importantly, we aim to provide a definitive assessment of how GNNs perform relative to other source detection methods. Our experiments show that GNNs substantially outperform all other methods we test across a variety of network types. Although we initially set out to challenge the notion of GNNs as a solution to source detection, our results instead demonstrate their remarkable effectiveness for this task. We discuss possible reasons for this strong performance. To ensure full reproducibility, we release all code and data on GitHub. Finally, we argue that epidemic source detection should serve as a benchmark task for evaluating GNN architectures.
While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage for 3D object grounding data, yielding over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning in vision-language model.




In this paper we propose MECAD, a novel approach for continual anomaly detection using a multi-expert architecture. Our system dynamically assigns experts to object classes based on feature similarity and employs efficient memory management to preserve the knowledge of previously seen classes. By leveraging an optimized coreset selection and a specialized replay buffer mechanism, we enable incremental learning without requiring full model retraining. Our experimental evaluation on the MVTec AD dataset demonstrates that the optimal 5-expert configuration achieves an average AUROC of 0.8259 across 15 diverse object categories while significantly reducing knowledge degradation compared to single-expert approaches. This framework balances computational efficiency, specialized knowledge retention, and adaptability, making it well-suited for industrial environments with evolving product types.
Multimodal large language models (MLLMs) are proficient in perception and instruction-following, but they still struggle with spatial reasoning: the ability to mentally track and manipulate objects across multiple views and over time. Spatial reasoning is a key component of human intelligence, but most existing benchmarks focus on static images or final outputs, failing to account for the sequential and viewpoint-dependent nature of this skill. To close this gap, we introduce GamiBench, a benchmark designed to evaluate spatial reasoning and 2D-to-3D planning in MLLMs through origami-inspired folding tasks. GamiBench includes 186 regular and 186 impossible 2D crease patterns paired with their corresponding 3D folded shapes, produced from six distinct viewpoints across three visual question-answering (VQA) tasks: predicting 3D fold configurations, distinguishing valid viewpoints, and detecting impossible patterns. Unlike previous benchmarks that assess only final predictions, GamiBench holistically evaluates the entire reasoning process--measuring cross-view consistency, physical feasibility through impossible-fold detection, and interpretation of intermediate folding steps. It further introduces new diagnostic metrics--viewpoint consistency (VC) and impossible fold selection rate (IFSR)--to measure how well models handle folds of varying complexity. Our experiments show that even leading models such as GPT-5 and Gemini-2.5-Pro struggle on single-step spatial understanding. These contributions establish a standardized framework for evaluating geometric understanding and spatial reasoning in MLLMs. Dataset and code: https://github.com/stvngo/GamiBench.




This paper studies the use of Conflict-Driven Clause Learning (CDCL) with VSIDS heuristics as a computational engine for discrete facility layout problems. The facility layout problem is modeled as a combinatorial assignment problem with dense logical structure arising from adjacency, separation, and slot-availability constraints. We develop a CNF-based formulation for layout feasibility and compare CDCL-based SAT solving against CP-SAT and MILP formulations under a unified benchmarking framework. Empirical results show that CDCL exhibits near-constant runtime behavior for feasibility detection across increasing problem sizes and constraint densities, while CP-SAT and MILP display polynomial and exponential scaling respectively. To address the limitation of CDCL in objective optimization, we introduce two hybrid architectures that combine CDCL-based feasibility search with CP-SAT optimization. The first architecture rapidly enumerates feasible layouts to trade optimality for speed, while the second uses CDCL to generate warm-start solutions that accelerate exact optimization. The results demonstrate that hybrid approaches can significantly reduce time-to-solution while preserving correctness guarantees, clarifying the algorithmic trade-offs between clause-learning search and exact optimization methods in large-scale discrete layout problems.
Context: Exhaustive fuzzing of modern JavaScript engines is infeasible due to the vast number of program states and execution paths. Coverage-guided fuzzers waste effort on low-risk inputs, often ignoring vulnerability-triggering ones that do not increase coverage. Existing heuristics proposed to mitigate this require expert effort, are brittle, and hard to adapt. Objective: We propose a data-centric, LLM-boosted alternative that learns from historical vulnerabilities to automatically identify minimal static (code) and dynamic (runtime) features for detecting high-risk inputs. Method: Guided by historical V8 bugs, iterative prompting generated 115 static and 49 dynamic features, with the latter requiring only five trace flags, minimizing instrumentation cost. After feature selection, 41 features remained to train an XGBoost model to predict high-risk inputs during fuzzing. Results: Combining static and dynamic features yields over 85% precision and under 1% false alarms. Only 25% of these features are needed for comparable performance, showing that most of the search space is irrelevant. Conclusion: This work introduces feature-guided fuzzing, an automated data-driven approach that replaces coverage with data-directed inference, guiding fuzzers toward high-risk states for faster, targeted, and reproducible vulnerability discovery. To support open science, all scripts and data are available at https://github.com/KKGanguly/DataCentricFuzzJS .




Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES superior performance among ZOC methods while generalizing effectively across natural, aerial and medical domains.
In Intelligent Transportation Systems (ITS), multi-object tracking is primarily based on frame-based cameras. However, these cameras tend to perform poorly under dim lighting and high-speed motion conditions. Event cameras, characterized by low latency, high dynamic range and high temporal resolution, have considerable potential to mitigate these issues. Compared to frame-based vision, there are far fewer studies on event-based vision. To address this research gap, we introduce an initial pilot dataset tailored for event-based ITS, covering vehicle and pedestrian detection and tracking. We establish a tracking-by-detection benchmark with a specialized feature extractor based on this dataset, achieving excellent performance.
Objective: We develop a channel-adaptive (CA) architecture that seamlessly processes multi-variate time-series with an arbitrary number of channels, and in particular intracranial electroencephalography (iEEG) recordings. Methods: Our CA architecture first processes the iEEG signal using state-of-the-art models applied to each single channel independently. The resulting features are then fused using a vector-symbolic algorithm which reconstructs the spatial relationship using a trainable scalar per channel. Finally, the fused features are accumulated in a long-term memory of up to 2 minutes to perform the classification. Each CA-model can then be pre-trained on a large corpus of iEEG recordings from multiple heterogeneous subjects. The pre-trained model is personalized to each subject via a quick fine-tuning routine, which uses equal or lower amounts of data compared to existing state-of-the-art models, but requiring only 1/5 of the time. Results: We evaluate our CA-models on a seizure detection task both on a short-term (~20 hours) and a long-term (~2500 hours) dataset. In particular, our CA-EEGWaveNet is trained on a single seizure of the tested subject, while the baseline EEGWaveNet is trained on all but one. Even in this challenging scenario, our CA-EEGWaveNet surpasses the baseline in median F1-score (0.78 vs 0.76). Similarly, CA-EEGNet based on EEGNet, also surpasses its baseline in median F1-score (0.79 vs 0.74). Conclusion and significance: Our CA-model addresses two issues: first, it is channel-adaptive and can therefore be trained across heterogeneous subjects without loss of performance; second, it increases the effective temporal context size to a clinically-relevant length. Therefore, our model is a drop-in replacement for existing models, bringing better characteristics and performance across the board.