Yu Huang

Developing A Fair Individualized Polysocial Risk Score (iPsRS) for Identifying Increased Social Risk of Hospitalizations in Patients with Type 2 Diabetes (T2D)

Sep 05, 2023
Yu Huang, Jingchuan Guo, William T Donahoo, Zhengkang Fan, Ying Lu, Wei-Han Chen, Huilin Tang, Lori Bilello, Elizabeth A Shenkman, Jiang Bian

Background: Racial and ethnic minority groups and individuals facing social disadvantages, which often stem from their social determinants of health (SDoH), bear a disproportionate burden of type 2 diabetes (T2D) and its complications. It is therefore crucial to implement effective social risk management strategies at the point of care. Objective: To develop an electronic health record (EHR)-based machine learning (ML) analytical pipeline to identify the unmet social needs associated with hospitalization risk in patients with T2D. Methods: We identified 10,192 T2D patients from EHR data (2012 to 2022) in the University of Florida Health Integrated Data Repository, including contextual SDoH (e.g., neighborhood deprivation) and individual-level SDoH (e.g., housing stability). We developed an EHR-based ML analytic pipeline, the individualized polysocial risk score (iPsRS), to identify high social risk associated with hospitalization in T2D patients, incorporating explainable AI (XAI) techniques and fairness assessment and optimization. Results: The iPsRS achieved a C statistic of 0.72 in predicting 1-year hospitalization after fairness optimization across racial-ethnic groups. It showed excellent utility for capturing individuals at high hospitalization risk: the actual 1-year hospitalization rate in the top 5% of iPsRS was ~13 times that of the bottom decile. Conclusion: The iPsRS pipeline can fairly and accurately screen for T2D patients whose social risk puts them at increased risk of hospitalization.
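
The fairness assessment described above can be illustrated with a short sketch: train a risk model on tabular SDoH features, then compare the C statistic (AUROC) across racial-ethnic subgroups. This is a hedged illustration, not the authors' pipeline; the file name, feature columns, and gradient-boosting model are all assumptions.

```python
# Minimal sketch (not the paper's code): fit a risk model and probe
# fairness by comparing AUROC per racial-ethnic subgroup. All column
# names and the model choice are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("t2d_cohort.csv")          # hypothetical extract of the EHR cohort
features = ["neighborhood_deprivation", "housing_stability", "age", "a1c"]
X, y, group = df[features], df["hosp_1yr"], df["race_ethnicity"]

X_tr, X_te, y_tr, y_te, _, g_te = train_test_split(
    X, y, group, test_size=0.3, stratify=y, random_state=0
)

model = GradientBoostingClassifier().fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]    # an iPsRS-style risk score

print(f"overall C statistic: {roc_auc_score(y_te, scores):.3f}")
for g in g_te.unique():                     # subgroup AUROC as a simple fairness probe
    mask = (g_te == g)
    print(f"{g}: {roc_auc_score(y_te[mask], scores[mask]):.3f}")
```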

An Overview about Emerging Technologies of Autonomous Driving

Jun 28, 2023
Yu Huang, Yue Chen, Zijiang Yang

Since DARPA launched the Grand Challenges in 2004 and the Urban Challenge in 2007, autonomous driving has been the most active field of AI applications. This paper gives an overview of the technical aspects of autonomous driving technologies and their open problems. We investigate the major components of self-driving systems, such as perception, mapping and localization, prediction, planning and control, simulation, V2X, and safety. In particular, we elaborate on all of these issues within the framework of the data closed loop, a popular platform for solving the long-tailed problems of autonomous driving.

ClimSim: An open large-scale dataset for training high-resolution physics emulators in hybrid multi-scale climate simulators

Jun 16, 2023
Sungduk Yu, Walter M. Hannah, Liran Peng, Mohamed Aziz Bhouri, Ritwik Gupta, Jerry Lin, Björn Lütjens, Justus C. Will, Tom Beucler, Bryce E. Harrop, Benjamin R. Hillman, Andrea M. Jenney, Savannah L. Ferretti, Nana Liu, Anima Anandkumar, Noah D. Brenowitz, Veronika Eyring, Pierre Gentine, Stephan Mandt, Jaideep Pathak, Carl Vondrick, Rose Yu, Laure Zanna, Ryan P. Abernathey, Fiaz Ahmed, David C. Bader, Pierre Baldi, Elizabeth A. Barnes, Gunnar Behrens, Christopher S. Bretherton, Julius J. M. Busecke, Peter M. Caldwell, Wayne Chuang, Yilun Han, Yu Huang, Fernando Iglesias-Suarez, Sanket Jantre, Karthik Kashinath, Marat Khairoutdinov, Thorsten Kurth, Nicholas J. Lutsko, Po-Lun Ma, Griffin Mooers, J. David Neelin, David A. Randall, Sara Shamekh, Akshay Subramaniam, Mark A. Taylor, Nathan M. Urban, Janni Yuval, Guang J. Zhang, Tian Zheng, Michael S. Pritchard

Modern climate projections lack adequate spatial and temporal resolution due to computational constraints. A consequence is inaccurate and imprecise prediction of critical processes such as storms. Hybrid methods that combine physics with machine learning (ML) have introduced a new generation of higher fidelity climate simulators that can sidestep Moore's Law by outsourcing compute-hungry, short, high-resolution simulations to ML emulators. However, this hybrid ML-physics simulation approach requires domain-specific treatment and has been inaccessible to ML experts because of a lack of training data and relevant, easy-to-use workflows. We present ClimSim, the largest-ever dataset designed for hybrid ML-physics research. It comprises multi-scale climate simulations, developed by a consortium of climate scientists and ML researchers. It consists of 5.7 billion pairs of multivariate input and output vectors that isolate the influence of locally-nested, high-resolution, high-fidelity physics on a host climate simulator's macro-scale physical state. The dataset is global in coverage, spans multiple years at high sampling frequency, and is designed such that resulting emulators are compatible with downstream coupling into operational climate simulators. We implement a range of deterministic and stochastic regression baselines to highlight the ML challenges and how they are scored. The data (https://huggingface.co/datasets/LEAP/ClimSim_high-res) and code (https://leap-stc.github.io/ClimSim) are released openly to support the development of hybrid ML-physics and high-fidelity climate simulations for the benefit of science and society.
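
For readers who want to explore the data, a minimal sketch of pulling files from the Hugging Face repository is below. The on-disk layout (file names and formats) is an assumption here; the project documentation at https://leap-stc.github.io/ClimSim describes the supported workflow.

```python
# Hedged sketch: enumerate and download ClimSim files from the Hub, then
# load one array with NumPy. Assumes NumPy arrays are among the published
# files; shapes depend on the chosen variable set.
import numpy as np
from huggingface_hub import hf_hub_download, list_repo_files

REPO = "LEAP/ClimSim_high-res"

files = list_repo_files(REPO, repo_type="dataset")
npy_files = [f for f in files if f.endswith(".npy")]
print(f"{len(files)} files in repo; first few: {files[:5]}")

if npy_files:
    path = hf_hub_download(REPO, npy_files[0], repo_type="dataset")
    arr = np.load(path)
    # Each sample pairs an input vector with the high-resolution physics
    # output it should predict, per the dataset description above.
    print(npy_files[0], arr.shape, arr.dtype)
```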

CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset

May 25, 2023
Hanchong Zhang, Jieyu Li, Lu Chen, Ruisheng Cao, Yunyan Zhang, Yu Huang, Yefeng Zheng, Kai Yu

The cross-domain text-to-SQL task aims to build a system that can parse user questions into SQL on completely unseen databases, while the single-domain text-to-SQL task evaluates performance on databases identical to those seen in training. Both setups confront unavoidable difficulties in real-world applications. To this end, we introduce the cross-schema text-to-SQL task, where the evaluation databases differ from those in the training data but come from the same domain. Furthermore, we present CSS, a large-scale CrosS-Schema Chinese text-to-SQL dataset, to support corresponding studies. CSS originally consisted of 4,340 question/SQL pairs across 2 databases. To help models generalize to different medical systems, we extend CSS with 19 new databases and 29,280 corresponding examples. Moreover, CSS also serves as a large corpus for single-domain Chinese text-to-SQL studies. We present the data collection approach and a series of analyses of the data statistics. To show the potential and usefulness of CSS, we conduct and report benchmark baselines. Our dataset is publicly available at https://huggingface.co/datasets/zhanghanchong/css.
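
Since the dataset is hosted on the Hugging Face Hub, a first look at it might start like the sketch below. The split and field names ("train", "question", "query") are assumptions based on common text-to-SQL conventions, not taken from the paper.

```python
# Hedged sketch: inspect CSS via the Hugging Face datasets library.
from datasets import load_dataset

css = load_dataset("zhanghanchong/css")
print(css)                          # available splits and their sizes

example = css["train"][0]
print(example.get("question"))      # natural-language question (assumed field name)
print(example.get("query"))         # paired SQL (assumed field name)
```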

Conflict-driven Structural Learning Towards Higher Coverage Rate in ATPG

Mar 04, 2023
Hui-Ling Zhen, Naixing Wang, Junhua Huang, Xinyue Huang, Mingxuan Yuan, Yu Huang

Due to the increasing challenges posed by the relentless rise in the design complexity of integrated circuits, Boolean Satisfiability (SAT) has emerged as a robust alternative to structural ATPG techniques. However, the high cost of transforming a circuit testing problem into Conjunctive Normal Form (CNF) limits the application of SAT in industrial ATPG scenarios, resulting in a loss of test coverage. To address this problem, this paper first proposes a conflict-driven structural learning (CDSL) ATPG algorithm, in which the conflict-driven heuristics of modern SAT solvers are applied directly to the logic cones of fault propagation and activation. The proposed CDSL algorithm is composed of three parts: (1) conflict constraints are learned from the implication graph to prune the search space; (2) conflict-driven implication and justification are applied to increase decision accuracy and solving efficiency; (3) a conflict-based diagnosis method is further proposed for low-coverage debug, making aborted faults testable by relaxing or modifying some constraints on primary inputs. Extensive experimental results on industrial circuits demonstrate the effectiveness and efficiency of the proposed CDSL algorithm. Compared with SAT-based ATPG, the proposed CDSL reduces aborted faults by 25.6% on average with 94.51% less run time. With a two-stage computational flow, the proposed CDSL yields 46.37% fewer aborted faults than a one-stage structural algorithm, with a further 3.19% improvement in fault coverage. In addition, conflict diagnosis leads to 8.89% fewer aborted faults on average and a 0.271% improvement in fault coverage rate.
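
The core mechanism of part (1), learning conflict constraints from the implication graph, can be sketched in a few lines. This is an illustrative simplification (a decision-level blocking clause), not the paper's implementation; the toy graph and the integer literal encoding are assumptions.

```python
# Minimal sketch of conflict-driven learning: when a conflict appears,
# walk the implication graph back to the decision assignments that
# caused it and learn a clause blocking that combination.

def learn_conflict_clause(conflict_lit, antecedents, decisions):
    """Trace antecedents from a conflicting literal back to decision
    literals; return the learned clause (negation of those decisions)."""
    seen, stack, reasons = set(), [conflict_lit], set()
    while stack:
        lit = stack.pop()
        if lit in seen:
            continue
        seen.add(lit)
        if lit in decisions:
            reasons.add(lit)                        # decision literal: part of the cause
        else:
            stack.extend(antecedents.get(lit, []))  # implied literal: recurse on its reasons
    return {-l for l in reasons}                    # clause: "not all of these together"

# Toy implication graph: decisions 1 and 2 imply 3 and 4, which in turn
# imply the conflicting pair (5, -5).
antecedents = {3: [1], 4: [2], 5: [3], -5: [4]}
decisions = {1, 2}
clause = learn_conflict_clause(5, antecedents, decisions) | \
         learn_conflict_clause(-5, antecedents, decisions)
print(sorted(clause))   # -> [-2, -1]: prunes any branch assigning both 1 and 2 true
```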

Soft Sensing Regression Model: from Sensor to Wafer Metrology Forecasting

Jan 21, 2023
Angzhi Fan, Yu Huang, Fei Xu, Sthitie Bom

The semiconductor industry is one of the most technology-evolving and capital-intensive market sectors. Effective inspection and metrology are necessary to improve product yield, increase product quality, and reduce costs. In recent years, much semiconductor manufacturing equipment has been fitted with sensors to facilitate real-time monitoring of the production process. These production-state and equipment-state sensor data provide an opportunity to apply machine-learning technologies in various domains, such as anomaly/fault detection, maintenance scheduling, and quality prediction. In this work, we focus on the task of soft sensing regression, which uses sensor data to predict impending inspection measurements that would otherwise be taken by wafer inspection and metrology systems. We propose an LSTM-based regressor and design two loss functions for model training. While engineers may judge prediction errors subjectively, we propose a new piece-wise evaluation metric for assessing model accuracy mathematically. The experimental results demonstrate that the proposed model achieves accurate and early prediction of various types of inspections in complicated manufacturing processes.
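
A hedged PyTorch sketch of an LSTM-based soft-sensing regressor of the kind described follows: a sequence of sensor readings in, a vector of predicted metrology measurements out. Layer sizes, tensor shapes, and the MSE loss are illustrative assumptions, not the paper's exact architecture or its two loss functions.

```python
# Sketch of an LSTM regressor for soft sensing: map a window of
# multivariate sensor readings to a vector of metrology targets.
import torch
import torch.nn as nn

class SoftSensingLSTM(nn.Module):
    def __init__(self, n_sensors=32, hidden=64, n_measurements=8):
        super().__init__()
        self.lstm = nn.LSTM(n_sensors, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_measurements)

    def forward(self, x):               # x: (batch, time_steps, n_sensors)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])    # regress from the final hidden state

model = SoftSensingLSTM()
x = torch.randn(16, 100, 32)            # 16 wafers, 100 time steps, 32 sensors
y = torch.randn(16, 8)                  # 8 metrology targets
pred = model(x)
loss = nn.functional.mse_loss(pred, y)  # one plausible regression loss
loss.backward()
print(pred.shape, loss.item())
```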

COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning

Oct 11, 2022
Yifan Zhang, Chen Huang, Yueke Zhang, Kevin Cao, Scott Thomas Andersen, Huajie Shao, Kevin Leach, Yu Huang

Compiled software is delivered as executable binary code. Developers write source code to express the software semantics, but the compiler converts it to a binary format that the CPU can directly execute. Therefore, binary code analysis is critical to applications in reverse engineering and computer security tasks where source code is not available. However, unlike source code and natural language that contain rich semantic information, binary code is typically difficult for human engineers to understand and analyze. While existing work uses AI models to assist source code analysis, few studies have considered binary code. In this paper, we propose a COntrastive learning Model for Binary cOde Analysis, or COMBO, that incorporates source code and comment information into binary code during representation learning. Specifically, we present three components in COMBO: (1) a primary contrastive learning method for cold-start pre-training, (2) a simplex interpolation method to incorporate source code, comments, and binary code, and (3) an intermediate representation learning algorithm to provide binary code embeddings. Finally, we evaluate the effectiveness of the pre-trained representations produced by COMBO using three indicative downstream tasks relating to binary code: algorithmic functionality classification, binary code similarity, and vulnerability detection. Our experimental results show that COMBO facilitates representation learning of binary code visualized by distribution analysis, and improves the performance on all three downstream tasks by 5.45% on average compared to state-of-the-art large-scale language representation models. To the best of our knowledge, COMBO is the first language representation model that incorporates source code, binary code, and comments into contrastive code representation learning and unifies multiple tasks for binary code analysis.
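
Two ingredients named in the abstract, a contrastive objective and simplex interpolation across the three input modalities, can be sketched as below. This is an assumed InfoNCE-style formulation with Dirichlet mixing weights, not COMBO's actual code.

```python
# Sketch: contrastive loss over paired embeddings, plus simplex
# interpolation mixing source, comment, and binary views with weights
# that sum to one. Dimensions and alpha are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.07):
    """Each anchor should match its own positive against all others in the batch."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0))       # diagonal entries are the true pairs
    return F.cross_entropy(logits, targets)

def simplex_interpolate(src, cmt, bin_, alpha=1.0):
    """Convex combination of the three views with weights on the 2-simplex."""
    w = torch.distributions.Dirichlet(torch.full((3,), alpha)).sample((src.size(0),))
    return w[:, :1] * src + w[:, 1:2] * cmt + w[:, 2:] * bin_

B, D = 8, 128
src, cmt, bin_ = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
mixed = simplex_interpolate(src, cmt, bin_)   # interpolated multi-modal view
loss = info_nce(bin_, mixed)                  # pull binary embedding toward its mix
print(loss.item())
```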

Low Complexity First: Duration-Centric ISI Mitigation in Molecular Communication via Diffusion

Jul 19, 2022
Xuan Chen, Fei Ji, Miaowen Wen, Yu Huang, Yuankun Tang, Andrew W. Eckford

In this paper, we propose a novel inter-symbol interference (ISI) mitigation scheme for molecular communication via diffusion (MCvD) systems with the optimal detection interval. Its rationale is to exploit the discarded duration (i.e., the symbol duration outside this optimal interval) to relieve ISI in the target system. Following this idea, we formulate an objective function to quantify the impact of the discarded time on bit error rate (BER) performance. Besides, an optimally reusable interval within the discarded duration is derived in closed form, which applies to both the absorbing and passive receivers. Finally, numerical results validate our analysis and show that for the considered MCvD system, significant BER improvements can be achieved by using the derived reusable duration.
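
The detection-interval idea rests on the standard first-hitting-time density of a molecule diffusing to a fully absorbing spherical receiver, f(t) = (a/d) * (d - a) / sqrt(4*pi*D*t^3) * exp(-(d - a)^2 / (4*D*t)). The sketch below evaluates that density and the fraction of molecules captured within a candidate interval; parameter values are illustrative, and the paper's optimal-interval derivation is not reproduced.

```python
# Sketch: first-hitting-time density for an absorbing spherical receiver
# and the molecule fraction captured in a detection interval [t1, t2].
import numpy as np
from scipy.integrate import quad

D = 79.4e-12      # diffusion coefficient (m^2/s), a common MCvD example value
a = 5e-6          # receiver radius (m)
d = 10e-6         # transmitter-to-receiver-center distance (m)

def hit_density(t):
    return (a / d) * (d - a) / np.sqrt(4 * np.pi * D * t**3) \
           * np.exp(-((d - a) ** 2) / (4 * D * t))

# Capture fraction in a candidate interval vs. the full symbol duration.
T, t1, t2 = 0.2, 0.005, 0.05
in_interval, _ = quad(hit_density, t1, t2)
in_symbol, _ = quad(hit_density, 1e-9, T)
print(f"captured in [t1, t2]: {in_interval:.3f}, in full symbol: {in_symbol:.3f}")
```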

* 5 pages, 5 figures. arXiv admin note: text overlap with arXiv:2204.08636 