Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Topic:Mit Benchmark

EffGen: Enabling Small Language Models as Capable Autonomous Agents

Jan 31, 2026

Gaurav Srivastava, Aafiya Hussain, Chi Wang, Yingyan Celine Lin, Xuan Wang

Abstract:Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls. While powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce effGen, an open-source agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment (pip install effgen). effGen makes four major contributions: (1) Enhanced tool-calling with prompt optimization that compresses contexts by 70-80% while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity-based routing using five factors to make smart pre-execution decisions, and (4) Unified memory system combining short-term, long-term, and vector-based storage. Additionally, effGen unifies multiple agent protocols (MCP, A2A, ACP) for cross-protocol communication. Results on 13 benchmarks show effGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined. effGen (https://effgen.org/) is released under the MIT License, ensuring broad accessibility for research and commercial use. Our framework code is publicly available at https://github.com/ctrl-gaurav/effGen.

Via

Access Paper or Ask Questions

AgenticSimLaw: A Juvenile Courtroom Multi-Agent Debate Simulation for Explainable High-Stakes Tabular Decision Making

Jan 29, 2026

Jon Chun, Kathrine Elkins, Yong Suk Lee

Abstract:We introduce AgenticSimLaw, a role-structured, multi-agent debate framework that provides transparent and controllable test-time reasoning for high-stakes tabular decision-making tasks. Unlike black-box approaches, our courtroom-style orchestration explicitly defines agent roles (prosecutor, defense, judge), interaction protocols (7-turn structured debate), and private reasoning strategies, creating a fully auditable decision-making process. We benchmark this framework on young adult recidivism prediction using the NLSY97 dataset, comparing it against traditional chain-of-thought (CoT) prompting across almost 90 unique combinations of models and strategies. Our results demonstrate that structured multi-agent debate provides more stable and generalizable performance compared to single-agent reasoning, with stronger correlation between accuracy and F1-score metrics. Beyond performance improvements, AgenticSimLaw offers fine-grained control over reasoning steps, generates complete interaction transcripts for explainability, and enables systematic profiling of agent behaviors. While we instantiate this framework in the criminal justice domain to stress-test reasoning under ethical complexity, the approach generalizes to any deliberative, high-stakes decision task requiring transparency and human oversight. This work addresses key LLM-based multi-agent system challenges: organization through structured roles, observability through logged interactions, and responsibility through explicit non-deployment constraints for sensitive domains. Data, results, and code will be available on github.com under the MIT license.

* 18 pages, 5 figures

Via

Access Paper or Ask Questions

CausationEntropy: Pythonic Optimal Causation Entropy

Jan 19, 2026

Kevin Slote, Jeremie Fish, Erik Bollt

Abstract:Optimal Causation Entropy (oCSE) is a robust causal network modeling technique that reveals causal networks from dynamical systems and coupled oscillators, distinguishing direct from indirect paths. CausationEntropy is a Python package that implements oCSE and several of its significant optimizations and methodological extensions. In this paper, we introduce the version 1.1 release of CausationEntropy, which includes new synthetic data generators, plotting tools, and several advanced information-theoretical causal network discovery algorithms with criteria for estimating Gaussian, k-nearest neighbors (kNN), geometric k-nearest neighbors (geometric-kNN), kernel density (KDE) and Poisson entropic estimators. The package is easy to install from the PyPi software repository, is thoroughly documented, supplemented with extensive code examples, and is modularly structured to support future additions. The entire codebase is released under the MIT license and is available on GitHub and through PyPi Repository. We expect this package to serve as a benchmark tool for causal discovery in complex dynamical systems.

Via

Access Paper or Ask Questions

Inter-patient ECG Arrhythmia Classification with LGNs and LUTNs

Jan 16, 2026

Wout Mommen, Lars Keuninckx, Paul Detterer, Achiel Colpaert, Piet Wambacq

Abstract:Deep Differentiable Logic Gate Networks (LGNs) and Lookup Table Networks (LUTNs) are demonstrated to be suitable for the automatic classification of electrocardiograms (ECGs) using the inter-patient paradigm. The methods are benchmarked using the MIT-BIH arrhythmia data set, achieving up to 94.28% accuracy and a $jκ$ index of 0.683 on a four-class classification problem. Our models use between 2.89k and 6.17k FLOPs, including preprocessing and readout, which is three to six orders of magnitude less compared to SOTA methods. A novel preprocessing method is utilized that attains superior performance compared to existing methods for both the mixed-patient and inter-patient paradigms. In addition, a novel method for training the Lookup Tables (LUTs) in LUTNs is devised that uses the Boolean equation of a multiplexer (MUX). Additionally, rate coding was utilized for the first time in these LGNs and LUTNs, enhancing the performance of LGNs. Furthermore, it is the first time that LGNs and LUTNs have been benchmarked on the MIT-BIH arrhythmia dataset using the inter-patient paradigm. Using an Artix 7 FPGA, between 2000 and 2990 LUTs were needed, and between 5 to 7 mW (i.e. 50 pJ to 70 pJ per inference) was estimated for running these models. The performance in terms of both accuracy and $jκ$-index is significantly higher compared to previous LGN results. These positive results suggest that one can utilize LGNs and LUTNs for the detection of arrhythmias at extremely low power and high speeds in heart implants or wearable devices, even for patients not included in the training set.

Via

Access Paper or Ask Questions

HCFT: Hierarchical Convolutional Fusion Transformer for EEG Decoding

Jan 18, 2026

Haodong Zhang, Jiapeng Zhu, Yitong Chen, Hongqi Li

Abstract:Electroencephalography (EEG) decoding requires models that can effectively extract and integrate complex temporal, spectral, and spatial features from multichannel signals. To address this challenge, we propose a lightweight and generalizable decoding framework named Hierarchical Convolutional Fusion Transformer (HCFT), which combines dual-branch convolutional encoders and hierarchical Transformer blocks for multi-scale EEG representation learning. Specifically, the model first captures local temporal and spatiotemporal dynamics through time-domain and time-space convolutional branches, and then aligns these features via a cross-attention mechanism that enables interaction between branches at each stage. Subsequently, a hierarchical Transformer fusion structure is employed to encode global dependencies across all feature stages, while a customized Dynamic Tanh normalization module is introduced to replace traditional Layer Normalization in order to enhance training stability and reduce redundancy. Extensive experiments are conducted on two representative benchmark datasets, BCI Competition IV-2b and CHB-MIT, covering both event-related cross-subject classification and continuous seizure prediction tasks. Results show that HCFT achieves 80.83% average accuracy and a Cohen's kappa of 0.6165 on BCI IV-2b, as well as 99.10% sensitivity, 0.0236 false positives per hour, and 98.82% specificity on CHB-MIT, consistently outperforming over ten state-of-the-art baseline methods. Ablation studies confirm that each core component of the proposed framework contributes significantly to the overall decoding performance, demonstrating HCFT's effectiveness in capturing EEG dynamics and its potential for real-world BCI applications.

* Submitted to IEEE Journals

Via

Access Paper or Ask Questions

Mi:dm 2.0 Korea-centric Bilingual Language Models

Jan 14, 2026

Donghoon Shin, Sejung Lee, Soonmin Bae, Hwijung Ryu, Changwon Ok, Hoyoun Jung, Hyesung Ji, Jeehyun Lim, Jehoon Lee, Ji-Eun Han(+56 more)

Abstract:We introduce Mi:dm 2.0, a bilingual large language model (LLM) specifically engineered to advance Korea-centric AI. This model goes beyond Korean text processing by integrating the values, reasoning patterns, and commonsense knowledge inherent to Korean society, enabling nuanced understanding of cultural contexts, emotional subtleties, and real-world scenarios to generate reliable and culturally appropriate responses. To address limitations of existing LLMs, often caused by insufficient or low-quality Korean data and lack of cultural alignment, Mi:dm 2.0 emphasizes robust data quality through a comprehensive pipeline that includes proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and a custom Korean-optimized tokenizer to improve efficiency and coverage. To realize this vision, we offer two complementary configurations: Mi:dm 2.0 Base (11.5B parameters), built with a depth-up scaling strategy for general-purpose use, and Mi:dm 2.0 Mini (2.3B parameters), optimized for resource-constrained environments and specialized tasks. Mi:dm 2.0 achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks. The Mi:dm 2.0 lineup is released under the MIT license to support extensive research and commercial use. By offering accessible and high-performance Korea-centric LLMs, KT aims to accelerate AI adoption across Korean industries, public services, and education, strengthen the Korean AI developer community, and lay the groundwork for the broader vision of K-intelligence. Our models are available at https://huggingface.co/K-intelligence. For technical inquiries, please contact midm-llm@kt.com.

Via

Access Paper or Ask Questions

Design and Development of a Low-Cost Scalable GSM-IoT Smart Pet Feeder with a Remote Mobile Application

Jan 13, 2026

Md. Rakibul Hasan Nishat, S. M. Khalid Bin Zahid, Abdul Hasib, T. M. Mehrab Hasan, Mohammad Arman, A. S. M. Ahsanul Sarkar Akib

Abstract:Pet ownership is increasingly common in modern households, yet maintaining a consistent feeding schedule remains challenging for the owners particularly those who live in cities and have busy lifestyles. This paper presents the design, development, and validation of a low-cost, scalable GSM-IoT smart pet feeder that enables remote monitoring and control through cellular communication. The device combines with an Arduino microcontroller, a SIM800L GSM module for communication, an ultrasonic sensor for real-time food-level assessment, and a servo mechanism for accurate portion dispensing. A dedicated mobile application was developed using MIT App Inventor which allows owners to send feeding commands and receive real-time status updates. Experimental results demonstrate a 98\% SMS command success rate, consistent portion dispensing with $\pm 2.67$\% variance, and reliable autonomous operation. Its modular, energy-efficient design makes it easy to use in a wide range of households, including those with limited resources. This work pushes forward the field of accessible pet care technology by providing a practical, scalable, and completely internet-independent solution for personalized pet feeding. In doing so, it sets a new benchmark for low-cost, GSM-powered automation in smart pet products.

Via

Access Paper or Ask Questions

Self-Attention with State-Object Weighted Combination for Compositional Zero Shot Learning

Dec 22, 2025

Cheng-Hong Chang, Pei-Hsuan Tsai

Abstract:Object recognition has become prevalent across various industries. However, most existing applications are limited to identifying objects alone, without considering their associated states. The ability to recognize both the state and object simultaneously remains less common. One approach to address this is by treating state and object as a single category during training. However, this approach poses challenges in data collection and training since it requires comprehensive data for all possible combinations. Compositional Zero-shot Learning (CZSL) emerges as a viable solution by treating the state and object as distinct categories during training. CZSL facilitates the identification of novel compositions even in the absence of data for every conceivable combination. The current state-of-the-art method, KG-SP, addresses this issue by training distinct classifiers for states and objects, while leveraging a semantic model to evaluate the plausibility of composed compositions. However, KG-SP's accuracy in state and object recognition can be further improved, and it fails to consider the weighting of states and objects during composition. In this study, we propose SASOW, an enhancement of KG-SP that considers the weighting of states and objects while improving composition recognition accuracy. First, we introduce self-attention mechanisms into the classifiers for states and objects, leading to enhanced accuracy in recognizing both. Additionally, we incorporate the weighting of states and objects during composition to generate more reasonable and accurate compositions. Our validation process involves testing SASOW on three established benchmark datasets. Experimental outcomes affirm when compared against OW-CZSL approach, KG-SP, SASOW showcases improvements of 2.1%, 1.7%, and 0.4% in terms of accuracy for unseen compositions across the MIT-States, UT Zappos, and C-GQA datasets, respectively.

Via

Access Paper or Ask Questions

SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation

Dec 12, 2025

Sergio Burdisso, Séverin Baroudi, Yanis Labrak, David Grunert, Pawel Cyrta, Yiyang Chen, Srikanth Madikeri, Esaú Villatoro-Tello, Thomas Schaaf, Ricard Marxer(+1 more)

Abstract:We present SDialog, an MIT-licensed open-source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework for building and analyzing LLM-based conversational agents. Built around a standardized \texttt{Dialog} representation, SDialog provides: (1) persona-driven multi-agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM-as-a-judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed-backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.

* We have an old version of the paper at arXiv:2506.10622, we will update this old version instead (even though the authors and the title have changed since this first old version)

Via

Access Paper or Ask Questions

Interpretable temporal fusion network of multi- and multi-class arrhythmia classification

Nov 19, 2025

Yun Kwan Kim

Abstract:Clinical decision support systems (CDSSs) have been widely utilized to support the decisions made by cardiologists when detecting and classifying arrhythmia from electrocardiograms. However, forming a CDSS for the arrhythmia classification task is challenging due to the varying lengths of arrhythmias. Although the onset time of arrhythmia varies, previously developed methods have not considered such conditions. Thus, we propose a framework that consists of (i) local and global extraction and (ii) local-global information fusion with attention to enable arrhythmia detection and classification within a constrained input length. The framework's performance was evaluated in terms of 10-class and 4-class arrhythmia detection, focusing on identifying the onset and ending point of arrhythmia episodes and their duration using the MIT-BIH arrhythmia database (MITDB) and the MIT-BIH atrial fibrillation database (AFDB). Duration, episode, and Dice score performances resulted in overall F1-scores of 96.45%, 82.05%, and 96.31% on the MITDB and 97.57%, 98.31%, and 97.45% on the AFDB, respectively. The results demonstrated statistically superior performance compared to those of the benchmark models. To assess the generalization capability of the proposed method, an MITDB-trained model and MIT-BIH malignant ventricular arrhythmia database-trained model were tested AFDB and MITDB, respectively. Superior performance was attained compared with that of a state-of-the-art model. The proposed method effectively captures both local and global information and dynamics without significant information loss. Consequently, arrhythmias can be detected with greater accuracy, and their occurrence times can be precisely determined, enabling the clinical field to develop more accurate treatment plans based on the proposed method.

* [Doctoral dissertation, Korea University, 2025]

Via

Access Paper or Ask Questions

Topic:Mit Benchmark

Papers and Code