Abstract:Big data research in Indonesia is constrained by a fundamental fragmentation: relevant data is scattered across social media, news portals, e-commerce platforms, review sites, and academic databases, each with different formats, access methods, and noise characteristics. Researchers must independently build collection pipelines, clean heterogeneous data, and assemble separate analysis tools, a process that often overshadows the research itself. We present SocialX, a modular platform for multi-source big data research that integrates heterogeneous data collection, language-aware preprocessing, and pluggable analysis into a unified, source-agnostic pipeline. The platform separates concerns into three independent layers (collection, preprocessing, and analysis) connected by a lightweight job-coordination mechanism. This modularity allows each layer to grow independently: new data sources, preprocessing methods, or analysis tools can be added without modifying the existing pipeline. We describe the design principles that enable this extensibility, detail the preprocessing methodology that addresses challenges specific to Indonesian text across registers, and demonstrate the platform's utility through a walkthrough of a typical research workflow. SocialX is publicly accessible as a web-based platform at https://www.socialx.id.
Abstract:Determining whether a piece of text is relevant to a given topic is a fundamental task in natural language processing, yet it remains largely unexplored for Bahasa Indonesia. Unlike sentiment analysis or named entity recognition, relevancy classification requires the model to reason about the relationship between two inputs simultaneously: a topical context and a candidate text. We introduce IndoBERT-Relevancy, a context-conditioned relevancy classifier built on IndoBERT Large (335M parameters) and trained on a novel dataset of 31,360 labeled pairs spanning 188 topics. Through an iterative, failure-driven data construction process, we demonstrate that no single data source is sufficient for robust relevancy classification, and that targeted synthetic data can effectively address specific model weaknesses. Our final model achieves an F1 score of 0.948 and an accuracy of 96.5%, handling both formal and informal Indonesian text. The model is publicly available at HuggingFace.