Data-driven machine learning methods promote evidence-based decision-making in multiple industries, including healthcare, where machine learning techniques have improved the accuracy of predicting and preventing health complications. This research focuses on utilising machine learning techniques to predict severe acute pancreatitis. Currently, clinical scoring systems are used by health professionals in hospitals to quantify the severity of acute pancreatitis. However, these scoring systems have limitations. This research uses routinely collected patient data to train and validate several machine learning models. The results show that the machine learning algorithms predict severity more accurately than the clinical scoring systems.
Machine Learning to Predict Severe Acute Pancreatitis
Machine Learning and Healthcare | Natalie Lau
Acute pancreatitis (AP), an inflammatory disorder of the pancreas associated with substantial morbidity and mortality, is one of the most common gastrointestinal causes of hospital admission in the USA [1]. In New Zealand, AP continues to have a high incidence rate, with Māori patients reporting the highest incidence of AP (per 100,000 people per year) worldwide [2]. While AP-related mortality has decreased over the past decade, likely due to technological advancements and improvements in timely and accurate diagnoses, morbidity and its consequences remain substantial.
To manage, evaluate, and predict the severity and mortality of AP, clinicians currently evaluate clinical data. This involves assessing organ function, conducting laboratory tests and imaging, and utilising clinical scoring systems. Examples of clinical scoring systems used to assess the severity of AP include Acute Physiology and Chronic Health Evaluation II and the Glasgow-Imrie Criteria. While existing literature provides insight into the quantification of the severity of AP based on these clinical scoring systems, studies have shown that these systems offer only partially encompassing prediction methods, leaving much room for improvement [3]. Evidence from several studies indicates significant differences in the accuracy of these scoring systems in forecasting the severity, local complications, organ failure, and associated mortality of AP [4]. The lack of uniform standards and the degree of inconsistency among the scoring systems provide an opportunity to develop new, more robust scoring systems.
Data-driven machine learning methods promote evidence-based decision-making in multiple industries, including healthcare, where machine learning techniques have improved the accuracy of predicting and preventing health complications. In particular, several studies have applied machine learning to predicting the severity of AP. For example, Andersson et al. [5] present a machine learning-based study using data from 200 patients with 23 potential risk variables. Additionally, EASY-APP [6] is a machine learning-based, user-friendly web application developed using an international cohort of nearly 5000 patients. It is the first model made available to and used by clinicians.
The severity level of AP is classified using the Revised Atlanta Classification (RAC). RAC is considered the global consensus classification of AP, which classifies the severity of AP as mild, moderate, or severe [7]. Mild AP describes no organ failure or local or systemic complications and is usually resolved within the first week. Moderate AP describes the presence of transient organ failure, local complications, or exacerbation of co-morbid disease. Severe AP describes persistent organ failure.
Method
Figure 1 provides an overview of the machine learning pipeline used in this research. The components are described in the following subsections.
Target Variables
To predict the severity of AP, this research uses the RAC definitions, which categorise the disease as mild, moderate, or severe. A binary classification task was formed by combining the RAC subgroups: moderate and severe form one class, and mild forms the second class. Using binary classes makes it possible to compare the models directly with the clinical scoring systems.
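The grouping of RAC categories into two classes can be sketched as below. This is a minimal illustration, not the code used in this research: the function name `binarise_rac`, the string labels, and the choice of 1 for the combined moderate-or-severe class are all assumptions made for the example.

```python
# Hypothetical mapping from RAC category to a binary target.
# Assumption: 1 = moderate or severe AP, 0 = mild AP.
RAC_TO_BINARY = {"mild": 0, "moderate": 1, "severe": 1}


def binarise_rac(label: str) -> int:
    """Map a RAC severity label to the binary class used for modelling."""
    return RAC_TO_BINARY[label.strip().lower()]


# Illustrative labels only; these are not values from the study dataset.
example_labels = ["mild", "severe", "Moderate", "mild"]
binary_targets = [binarise_rac(label) for label in example_labels]
print(binary_targets)  # [0, 1, 1, 0]
```

An unrecognised label raises a `KeyError`, which is a deliberate choice here so that unexpected categories are noticed rather than silently assigned to a class.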
Machine Learning Algorithms and Evaluations
This research uses three machine learning algorithms: support vector machine (SVM), random forest (RF), and XGBoost (XGB). SVM is an algorithm that learns by example to assign labels to objects. SVM uses kernel functions to transform the input data into a higher-dimensional feature space [8]. RF is made of multiple decision trees and combines their outputs into a single result. RF uses bagging (bootstrap aggregating), in which each tree is trained on a sample drawn from the training data with replacement [9]. XGBoost is a scalable boosting system that builds on decision trees, ensemble learning, and gradient boosting [10].
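To make the sampling-with-replacement idea behind RF concrete, the sketch below draws one bootstrap sample and combines several tree outputs by majority vote. It uses only the Python standard library; the helper names `bootstrap_sample` and `majority_vote` and the patient identifiers are hypothetical and are not taken from the research code.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common class label among the individual trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sample is reproducible
patients = ["p1", "p2", "p3", "p4", "p5"]  # illustrative identifiers

# The sample has the same size as the original data, but some patients
# may appear more than once and others not at all.
sample = bootstrap_sample(patients, rng)
print(sample)

# Three hypothetical trees vote on one patient's class.
print(majority_vote([1, 0, 1]))  # 1
```

In a real random forest each tree would be fitted to its own bootstrap sample, and the vote would be taken over the predictions of all fitted trees.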
Three evaluation metrics are discussed in this research: sensitivity (SEN), specificity (SPE), and accuracy (ACC). SEN is the probability that the model returns a positive result for a patient who is truly positive; a high SEN indicates a reduced likelihood of missing a positive diagnosis, that is, a lower false negative rate [11]. SPE is the probability that the model returns a negative result for a patient who is truly negative; a high SPE indicates a reduced likelihood of misclassifying a negative case as positive, that is, a lower false positive rate [11]. ACC measures the model’s ability to correctly diagnose patients, where a high ACC may indicate a useful classifier. It is the proportion of correctly classified samples among all samples [12].
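The three metrics follow directly from the counts of true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP). The sketch below assumes that the positive class is the combined moderate-or-severe group; the function name `evaluate` and the counts are invented for illustration and are not results from this study.

```python
def evaluate(tp: int, fn: int, tn: int, fp: int):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sen = tp / (tp + fn)  # share of truly positive patients that are flagged
    spe = tn / (tn + fp)  # share of truly negative patients that are cleared
    acc = (tp + tn) / (tp + fn + tn + fp)  # share of all patients classified correctly
    return sen, spe, acc


# Illustrative counts only: 50 positive and 50 negative patients.
sen, spe, acc = evaluate(tp=40, fn=10, tn=45, fp=5)
print(f"SEN={sen:.2f} SPE={spe:.2f} ACC={acc:.2f}")  # SEN=0.80 SPE=0.90 ACC=0.85
```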
Experimental Setup
This research was implemented in Python. The scikit-learn library provides the SVM and RF classifiers and the hyperparameter tuning, while the XGBoost classifier comes from the XGBoost library. All experiments employ nested stratified cross-validation and report the average performance across runs. A fixed random seed was defined to ensure that experiments and results are reproducible. The code written for this research was executed in a Jupyter Notebook.
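A minimal sketch of nested stratified cross-validation with scikit-learn is shown below. It is not the configuration used in this research: the synthetic data from `make_classification`, the seed value, the number of folds, the scaling step, and the parameter grid are all assumptions chosen to keep the example small. The inner loop tunes hyperparameters and the outer loop estimates performance on folds the tuning never saw.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 42  # assumed value; any fixed seed gives reproducible splits

# Synthetic, imbalanced stand-in for the patient data (not the study dataset).
X, y = make_classification(
    n_samples=200, n_features=10, weights=[0.7, 0.3], random_state=SEED
)

# Stratified folds preserve the class ratio in every split.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

# Scaling sits inside the pipeline so it is refitted on each training fold.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__kernel": ["linear", "rbf"]}

# Inner loop: hyperparameter search. Outer loop: performance estimate.
search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="accuracy")
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")

print(f"accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The same structure would apply to the RF and XGBoost classifiers by swapping the estimator and its parameter grid.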
Results
Table 1 presents the results of the machine learning algorithms based on RAC using binary classification. The SVM and RF models have high SPE compared to XGBoost, indicating a reduced likelihood of misclassifying a negative case as positive. On the other hand, the XGBoost model excels in SEN, suggesting that XGBoost is particularly apt for screening purposes, because it is less likely to miss patients in the positive class. Overall, all machine learning models demonstrate a high level of accuracy, indicating good discrimination.
Table 2 presents the results of severe AP patients classified using clinical scoring systems. Compared with the machine learning models in Table 1, the clinical scoring systems have poor accuracy and sensitivity, while their specificity scores are similar or slightly better. Overall, the machine learning models appear to be comparable to or better than the clinical scoring systems, indicating the potential of machine learning algorithms for predicting the severity of acute pancreatitis.
Discussion
This research provided experimental evidence of the capability of machine learning algorithms to predict severe AP cases using routinely collected hospital data. We also provide a direct comparison with clinical scoring systems, in which the machine learning models achieve better accuracy and sensitivity. This research has many future directions, including obtaining data from other hospitals for further validation of the machine learning models, discussions with experts to design a deployment plan, and incorporating clinician feedback.
Table 2: Clinical Scoring Systems for predicting severe AP. Mean values with standard deviations are presented.
Figure 1: An overview of the machine learning pipeline used in this research.
Data
This study uses a retrospective dataset comprising patient data from 2009 to 2013 in a mainland Chinese hospital. Adult patients admitted to the hospital and diagnosed with AP, as per the 2012 updated Atlanta criteria, were included in the study. The dataset contains 2,581 deidentified patient records with more than 50 variables, including diagnoses, decisions, vital signs, routine laboratory test results, and calculations using scoring systems. The dataset also includes the severity level of AP for each patient, categorised using RAC. The cohort comprises 65.2% male and 34.8% female patients, with an average age of 47.2 years and an age range of 18 to 80 years. Figure 2 provides an overview of the distribution of data based on the age and gender of patients.
Among 50+ variables, 34 features were selected through expert recommendation. These included risk factors such as patient’s gender, age, and abdominal pain onset time; triage information such as high dependency unit and intensive care unit; and vital signs including temperature, respiration, and heart rate. Information on organ failure, the site of the organ failure (lung, kidney, heart, etc.), and selected variables from the laboratory test results were also included.
Figure 2: Distribution plot of the total number of patients based on their age group and gender.
Table 1: Machine learning models for predicting severe AP based on RAC using binary classification. Mean values with standard deviations are presented. The best scores for each evaluation metric are bolded.
Acknowledgements
I am incredibly grateful for the guidance and support provided by my supervisors, Professor Gill Dobbie, Dr Vithya Yogarajan, and Professor John Windsor, throughout my summer research scholarship. Their continuous assistance and feedback made the project a whole lot less daunting, which I undoubtedly appreciated. I’d also like to thank the Master’s student I worked with, Xiaoheng Ji, for being so kind and receptive to all of my questions.
[1] P. J. Lee and G.I. Papachristou, “New insights into acute pancreatitis”, Nat Rev Gastroenterol Hepatol, vol. 16, pp. 479-496, 2019, doi: https://doi.org/10.1038/s41575-019-0158-2
[2] S. A. Pendharkar et al., “Ethnic and geographic variations in the incidence of pancreatitis and postpancreatitis diabetes mellitus in New Zealand: a nationwide population-based study,” The New Zealand Medical Journal, vol. 130, pp. 55-68, 2017. [Online]. Available: http://ezproxy.auckland.ac.nz/login?url=https://www.proquest.com/scholarly-journals/ethnic-geographic-variations-incidence/docview/1870216798/se-2
[3] R. Thapa et al., “Early prediction of severe acute pancreatitis using machine learning,” Pancreatology, vol. 22, no. 1, pp. 43-50, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1424390321006062
[4] J. X. Hu et al., “Acute pancreatitis: A review of diagnosis, severity prediction and prognosis assessment from imaging technology, scoring system and artificial intelligence,” World Journal of Gastroenterology, vol. 29, no. 37, pp. 5261-5291, 2023. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10600804/
[5] B. Andersson, R. Andersson, M. Ohlsson, and J. Nilsson, “Prediction of Severe Acute Pancreatitis at Admission to Hospital Using Artificial Neural Networks,” Pancreatology, vol. 11, no. 3, pp. 328-335, 2011. [Online]. Available: https://karger.com/pan/article-abstract/11/3/328/264321/Prediction-of-Severe-Acute-Pancreatitis-at
[6] B. Kui et al., “EASY-APP: An artificial intelligence model and application for early and easy prediction of severity in acute pancreatitis,” Clinical and translational medicine, vol. 12, no. 6, e842, 2022. [Online]. Available: https://onlinelibrary.wiley.com/doi/full/10.1002/ctm2.842
[7] P. A. Banks et al., “Classification of acute pancreatitis—2012: revision of the Atlanta classification and definitions by international consensus,” Gut, vol. 62, no. 1, pp. 102-111, 2013. [Online]. Available: https://gut.bmj.com/content/62/1/102.short
[8] W. S. Noble, “What is a support vector machine?,” Nature biotechnology, vol. 24, no. 12, pp. 1565-1567, 2006, doi: https://doi.org/10.1038/nbt1206-1565
[9] S. J. Rigatti, “Random forest,” Journal of Insurance Medicine, vol. 47, no.1, pp. 31-39, 2017, doi: https://doi.org/10.17849/insm-47-01-31-39.1
[10] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794, 2016. [Online]. Available: https://dl.acm.org/doi/abs/10.1145/2939672.2939785
[11] K. J. Van Stralen et al., “Diagnostic methods I: sensitivity, specificity, and other measures of accuracy,” Kidney international, vol. 75, no. 12, pp. 1257-1263, 2009. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0085253815536521
[12] J. Tohka and M. Van Gils, “Evaluation of machine learning algorithms for health and wellness applications: A tutorial,” Computers in Biology and Medicine, vol. 132, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0010482521001189
Natalie is a 4th year Bachelor of Engineering (Honours) and Bachelor of Arts student, majoring in Engineering Science and Asian Studies. She conducted her summer research scholarship with the Faculty of Computer Science.