Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tianyi Yang

PosterHarness: Turning Scientific Poster Generation into an Auditable Instruction-Following Benchmark

Jul 03, 2026

Tianyi Yang, Dawei Fu, Youpeng Wu, Zixun Kou, Linrui Chen, Ruobing Jiang, Zijian Wang, Qiang Li

Abstract:Text-rich image models can now design poster-scale layouts, but we lack ways to measure whether they honor scientific communication contracts: legible labels, prescribed aspect ratios, and -- above all -- abstaining from fabricated scientific figures. We present POSTERHARNESS, an auditable harness reframing poster generation as measurable instruction-following tasks, with a pilot benchmark and failure taxonomy. POSTERHARNESS uses a placeholder-first contract to separate two jobs models otherwise conflate. The model performs visual-summary design: typography, reading path, color, and background -- but never draws data-bearing figures. Every figure region must be an empty labeled placeholder; a deterministic compositor inserts real source-paper figures at detected coordinates. This makes properties measurable: placeholder count and ID accuracy, blankness, aspect-ratio compliance, abstention from synthesized graphics, public-text hygiene, and source-figure provenance -- with failures logged as explicit rejections, not hidden in plausible-looking output. We instantiate the harness on 12 papers (6 HEP, 6 AI/ML-adjacent) and report three findings. (i) A counterfactual probe shows the placeholder contract drives VLM-counted synthesized figures from 34 to 0 across three papers. (ii) A failure taxonomy identifies blocking contracts: placeholder geometry, placeholder QA, template critic, and public text. (iii) Comparison with Paper2Poster shows a trade-off: PosterHarness yields higher-resolution artifacts, lower white-canvas fraction, and stronger VLM visual preference; the deterministic baseline retains slightly more PosterQuiz-style information and runs faster. We report this as regime characterization, not a superiority claim. All artifacts, prompts, manifests, and audit scripts are released as a reusable evaluation component.

Via

Access Paper or Ask Questions

NEUROSYMLAND: Neuro-Symbolic Landing-Site Assessment for Robust and Edge-Deployable UAV Autonomy

Jul 02, 2026

Weixian Qian, Tianyi Yang, Sebastian Schroder, Yao Deng, Jiaohong Yao, Xiao Cheng, Richard Han, Xi Zheng

Abstract:Safe landing-site assessment in unstructured environments remains a key challenge for autonomous UAV deployment, as vision-only learning approaches often degrade under terrain variability and provide limited transparency in safety decisions. We present NEUROSYMLAND, a neuro-symbolic landing-site assessment system that integrates lightweight perception with explicit safety reasoning. The framework constructs a probabilistic semantic scene graph from onboard visual input and evaluates candidate landing regions using symbolic constraints capturing terrain flatness, obstacle clearance, and spatial consistency, enabling structured reasoning under perceptual uncertainty while maintaining edge-feasible execution. Across 72 simulated landing scenarios spanning diverse terrains, NEUROSYMLAND achieves 61 successful assessments, outperforming four competitive baselines (37-57 successes). To evaluate deployability, we further conduct 100 hardware-in-the-loop trials with randomized initial poses, profiling end-to-end latency, stage-wise execution time, and system-level metrics including CPU/GPU utilization, memory footprint, and power consumption. Results demonstrate improved robustness and interpretability with bounded edge-resource usage. Profiling shows that symbolic reasoning contributes only a small fraction of end-to-end latency, while the main computational cost arises from perception and PSSG construction. These results demonstrate the feasibility of deploying the landing-site assessment stack on edge-constrained UAV hardware, and all source code, datasets, prompts, and symbolic rule refinement examples are released in an open-source repository

* roceedings of the ACM on Software Engineering, Vol. 3, FSE, Article FSE146, July 2026. Article FSE146 (July 2026)
* Accepted to the IROS 2026

Via

Access Paper or Ask Questions

Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis

Jun 09, 2026

Ruobing Jiang, Dawei Fu, Cheng Jiang, Tianyi Yang, Zijian Wang, Youpeng Wu, Yong Ban, Yajun Mao, Qiang Li

Abstract:Muon collider research spans accelerator physics, detector instrumentation, and high-energy phenomenology, with relevant evidence scattered across a rapidly expanding and heterogeneous body of scientific literature. As high-energy physics (HEP) increasingly explores agent-assisted analysis workflows, efficiently locating, integrating, and verifying scientific evidence becomes an essential capability. While retrieval-augmented generation (RAG) offers a promising framework for scientific question answering, integrating agentic reasoning without compromising retrieval precision remains a key challenge. In this work, we present agentic hybrid RAG, an evidence-grounded RAG framework for muon collider research. The framework combines a hybrid retriever, integrating sparse lexical and dense semantic retrieval, with an agentic reasoning module for query decomposition, evidence expansion, and grounded answer generation. To enable systematic evaluation, we construct the first benchmark for retrieval-augmented scientific question answering in the muon collider domain, comprising a curated literature corpus together with dedicated retrieval and answer-generation benchmarks covering major detector and physics research topics. Extensive evaluation shows that hybrid retrieval provides the strongest retrieval backbone, while agentic reasoning is most effective for controlled evidence expansion and answer synthesis. Built on this principle, agentic hybrid RAG consistently outperforms representative retrieval and RAG baselines in retrieval effectiveness, answer quality, evidence coverage, and factual grounding. Together, the benchmark and framework provide a foundation for evidence-grounded scientific question answering and future HEP analysis agents operating over large-scale scientific literature.

* 22 pages, 5 figures, and 6 tables

Via

Access Paper or Ask Questions

Beyond Isolation: A Unified Benchmark for General-Purpose Navigation

May 10, 2026

Samson Sun, Tianyi Yang, Tengyue Wang, Yikai Xue, Zhengjie Xu, Lingming Zhang, Qichen Zhang, Chao Liang, Zhipeng Zhang

Abstract:The pursuit of general-purpose embodied agents is hindered by fragmented evaluation protocols that isolate navigation skills and fixate on specific robot morphologies, failing to reflect real-world scenarios where agents must orchestrate diverse behaviors across varying embodiments. To bridge this gap, we introduce OmniNavBench, a benchmark for cross-skill coordination and cross-embodiment generalization. OmniNavBench introduces three paradigm shifts: (1) Compositional Complexity. We propose composite instructions that interleave sub-tasks from 6 categories (PointNav, VLN, ObjectNav, SocialNav, Human Following and EQA), compelling agents to transition between exploration, interaction, and social compliance within a single episode. (2) Morphological Universality and Sensor Flexibility. We present a simulation platform that breaks the reliance on single-morphology evaluation, enabling generalization tests across humanoid, quadrupedal, and wheeled robots, with a modular sensor interface and 170 environments blending synthetic assets with real-world scans. (3) Demonstrations Quality. Moving beyond shortest-path algorithms, we curate 1779 expert trajectories via human teleoperation, capturing behavioral nuances such as exploratory glance and anticipatory avoidance. Extensive evaluations demonstrate that current methods, despite their claimed unified design, struggle with the complex, interleaved nature of general-purpose navigation. This exposes a critical disparity between existing capabilities and real-world deployment demands, underscoring OmniNavBench as a testbed for the next generation of generalist navigators. Dataset, code, and leaderboard are available at http://omninavbench.cloud-ip.cc.

* Accepted at RSS 2026

Via

Access Paper or Ask Questions

Augmenting Question Answering with A Hybrid RAG Approach

Jan 19, 2026

Tianyi Yang, Nashrah Haque, Vaishnave Jonnalagadda, Yuya Jeremy Ong, Zhehui Chen, Yanzhao Wu, Lei Yu, Divyesh Jadav, Wenqi Wei

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful technique for enhancing the quality of responses in Question-Answering (QA) tasks. However, existing approaches often struggle with retrieving contextually relevant information, leading to incomplete or suboptimal answers. In this paper, we introduce Structured-Semantic RAG (SSRAG), a hybrid architecture that enhances QA quality by integrating query augmentation, agentic routing, and a structured retrieval mechanism combining vector and graph based techniques with context unification. By refining retrieval processes and improving contextual grounding, our approach improves both answer accuracy and informativeness. We conduct extensive evaluations on three popular QA datasets, TruthfulQA, SQuAD and WikiQA, across five Large Language Models (LLMs), demonstrating that our proposed approach consistently improves response quality over standard RAG implementations.

* 10 pages, 5 tables, 2 figures; presented at IEEE CogMI 2025

Via

Access Paper or Ask Questions

Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications

Jun 10, 2024

Junlin Wang, Tianyi Yang, Roy Xie, Bhuwan Dhingra

Figure 1 for Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications

Figure 2 for Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications

Figure 3 for Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications

Figure 4 for Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications

Abstract:With the proliferation of LLM-integrated applications such as GPT-s, millions are deployed, offering valuable services through proprietary instruction prompts. These systems, however, are prone to prompt extraction attacks through meticulously designed queries. To help mitigate this problem, we introduce the Raccoon benchmark which comprehensively evaluates a model's susceptibility to prompt extraction attacks. Our novel evaluation method assesses models under both defenseless and defended scenarios, employing a dual approach to evaluate the effectiveness of existing defenses and the resilience of the models. The benchmark encompasses 14 categories of prompt extraction attacks, with additional compounded attacks that closely mimic the strategies of potential attackers, alongside a diverse collection of defense templates. This array is, to our knowledge, the most extensive compilation of prompt theft attacks and defense mechanisms to date. Our findings highlight universal susceptibility to prompt theft in the absence of defenses, with OpenAI models demonstrating notable resilience when protected. This paper aims to establish a more systematic benchmark for assessing LLM robustness against prompt extraction attacks, offering insights into their causes and potential countermeasures. Resources of Raccoon are publicly available at https://github.com/M0gician/RaccoonBench.

Via

Access Paper or Ask Questions

Large-scale Outdoor Cell-free mMIMO Channel Measurement in an Urban Scenario at 3.5 GHz

May 31, 2024

Yuning Zhang, Thomas Choi, Zihang Cheng, Issei Kanno, Masaaki Ito, Jorge Gomez-Ponce, Hussein Hammoud, Bowei Wu, Ashwani Pradhan, Kelvin Arana(+7 more)

Figure 1 for Large-scale Outdoor Cell-free mMIMO Channel Measurement in an Urban Scenario at 3.5 GHz

Figure 2 for Large-scale Outdoor Cell-free mMIMO Channel Measurement in an Urban Scenario at 3.5 GHz

Figure 3 for Large-scale Outdoor Cell-free mMIMO Channel Measurement in an Urban Scenario at 3.5 GHz

Figure 4 for Large-scale Outdoor Cell-free mMIMO Channel Measurement in an Urban Scenario at 3.5 GHz

Abstract:The design of cell-free massive MIMO (CF-mMIMO) systems requires accurate, measurement-based channel models. This paper provides the first results from the by far most extensive outdoor measurement campaign for CF-mMIMO channels in an urban environment. We measured impulse responses between over 20,000 potential access point (AP) locations and 80 user equipments (UEs) at 3.5 GHz with 350 MHz bandwidth (BW). Measurements use a "virtual array" approach at the AP and a hybrid switched/virtual approach at the UE. This paper describes the sounder design, measurement environment, data processing, and sample results, particularly the evolution of the power-delay profiles (PDPs) as a function of the AP locations, and its relation to the propagation environment.

* 6 pages, 6 figures, conference: VTC 2024-Fall

Via

Access Paper or Ask Questions

ICE-SEARCH: A Language Model-Driven Feature Selection Approach

Mar 09, 2024

Tianze Yang, Tianyi Yang, Shaoshan Liu, Fuyuan Lvu, Xue Liu

Figure 1 for ICE-SEARCH: A Language Model-Driven Feature Selection Approach

Figure 2 for ICE-SEARCH: A Language Model-Driven Feature Selection Approach

Figure 3 for ICE-SEARCH: A Language Model-Driven Feature Selection Approach

Figure 4 for ICE-SEARCH: A Language Model-Driven Feature Selection Approach

Abstract:This study unveils the In-Context Evolutionary Search (ICE-SEARCH) method, the first work that melds language models (LMs) with evolutionary algorithms for feature selection (FS) tasks and demonstrates its effectiveness in Medical Predictive Analytics (MPA) applications. ICE-SEARCH harnesses the crossover and mutation capabilities inherent in LMs within an evolutionary framework, significantly improving FS through the model's comprehensive world knowledge and its adaptability to a variety of roles. Our evaluation of this methodology spans three crucial MPA tasks: stroke, cardiovascular disease, and diabetes, where ICE-SEARCH outperforms traditional FS methods in pinpointing essential features for medical applications. ICE-SEARCH achieves State-of-the-Art (SOTA) performance in stroke prediction and diabetes prediction; the Decision-Randomized ICE-SEARCH ranks as SOTA in cardiovascular disease prediction. Our results not only demonstrate the efficacy of ICE-SEARCH in medical FS but also underscore the versatility, efficiency, and scalability of integrating LMs in FS tasks. The study emphasizes the critical role of incorporating domain-specific insights, illustrating ICE-SEARCH's robustness, generalizability, and swift convergence. This opens avenues for further research into comprehensive and intricate FS landscapes, marking a significant stride in the application of artificial intelligence in medical predictive analytics.

Via

Access Paper or Ask Questions

Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services

Aug 19, 2023

Jinyang Liu, Tianyi Yang, Zhuangbin Chen, Yuxin Su, Cong Feng, Zengyin Yang, Michael R. Lyu

Figure 1 for Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services

Figure 2 for Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services

Figure 3 for Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services

Figure 4 for Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services

Abstract:As modern software systems continue to grow in terms of complexity and volume, anomaly detection on multivariate monitoring metrics, which profile systems' health status, becomes more and more critical and challenging. In particular, the dependency between different metrics and their historical patterns plays a critical role in pursuing prompt and accurate anomaly detection. Existing approaches fall short of industrial needs for being unable to capture such information efficiently. To fill this significant gap, in this paper, we propose CMAnomaly, an anomaly detection framework on multivariate monitoring metrics based on collaborative machine. The proposed collaborative machine is a mechanism to capture the pairwise interactions along with feature and temporal dimensions with linear time complexity. Cost-effective models can then be employed to leverage both the dependency between monitoring metrics and their historical patterns for anomaly detection. The proposed framework is extensively evaluated with both public data and industrial data collected from a large-scale online service system of Huawei Cloud. The experimental results demonstrate that compared with state-of-the-art baseline models, CMAnomaly achieves an average F1 score of 0.9494, outperforming baselines by 6.77% to 10.68%, and runs 10X to 20X faster. Furthermore, we also share our experience of deploying CMAnomaly in Huawei Cloud.

* This paper has been accepted by the 34th IEEE International Symposium on Software Reliability Engineering (ISSRE'2023)

Via

Access Paper or Ask Questions

Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention

Feb 14, 2023

Cheryl Lee, Tianyi Yang, Zhuangbin Chen, Yuxin Su, Yongqiang Yang, Michael R. Lyu

Figure 1 for Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention

Figure 2 for Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention

Figure 3 for Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention

Figure 4 for Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention

Abstract:Prompt and accurate detection of system anomalies is essential to ensure the reliability of software systems. Unlike manual efforts that exploit all available run-time information, existing approaches usually leverage only a single type of monitoring data (often logs or metrics) or fail to make effective use of the joint information among different types of data. Consequently, many false predictions occur. To better understand the manifestations of system anomalies, we conduct a systematical study on a large amount of heterogeneous data, i.e., logs and metrics. Our study demonstrates that logs and metrics can manifest system anomalies collaboratively and complementarily, and neither of them only is sufficient. Thus, integrating heterogeneous data can help recover the complete picture of a system's health status. In this context, we propose Hades, the first end-to-end semi-supervised approach to effectively identify system anomalies based on heterogeneous data. Our approach employs a hierarchical architecture to learn a global representation of the system status by fusing log semantics and metric patterns. It captures discriminative features and meaningful interactions from heterogeneous data via a cross-modal attention module, trained in a semi-supervised manner. We evaluate Hades extensively on large-scale simulated data and datasets from Huawei Cloud. The experimental results present the effectiveness of our model in detecting system anomalies. We also release the code and the annotated dataset for replication and future research.

* In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). arXiv admin note: substantial text overlap with arXiv:2207.02918

Via

Access Paper or Ask Questions