Journal of Inflammation Research, Volume 18

Development and Validation of Predictive Models for Inflammatory Bowel Disease Diagnosis: A Machine Learning and Nomogram-Based Approach

Authors Dong R, Wang Y, Yao H, Chen T, Zhou Q, Zhao B, Xu J 

Received 10 December 2024

Accepted for publication 21 March 2025

Published 15 April 2025 Volume 2025:18 Pages 5115–5131

DOI https://doi.org/10.2147/JIR.S378069

Checked for plagiarism Yes

Review by Single anonymous peer review

Peer reviewer comments 3

Editor who approved publication: Dr Tara Strutt



Rongrong Dong,1,* Yiting Wang,2,* Han Yao,1 Taoran Chen,1 Qi Zhou,3 Bo Zhao,4 Jiancheng Xu1

1Department of Laboratory Medicine, First Hospital of Jilin University, Changchun, 130021, People’s Republic of China; 2Department of Laboratory Medicine, Second Hospital of Jilin University, Changchun, 130022, People’s Republic of China; 3Department of Pediatrics, First Hospital of Jilin University, Changchun, 130021, People’s Republic of China; 4Department of Laboratory Medicine, Meihekou Central Hospital, Meihekou, 135000, People’s Republic of China

*These authors contributed equally to this work

Correspondence: Jiancheng Xu, Department of Laboratory Medicine, First Hospital of Jilin University, Xinmin Street, No. 1, Changchun City, 130021, People’s Republic of China, Tel +86-431-8878-2595, Fax +86-431-8878-6169

Background: Inflammatory bowel disease (IBD) is a chronic, incurable gastrointestinal disease without a gold standard for diagnosis. This study aimed to develop predictive models for diagnosing IBD, Crohn’s disease (CD), and Ulcerative colitis (UC) by combining two approaches: machine learning (ML) and traditional nomogram models.
Methods: Cohorts 1 and 2 comprised data from the UK Biobank (UKB) and the First Hospital of Jilin University, respectively, representing the initial laboratory tests upon admission for 1135 and 237 CD patients, 2192 and 326 UC patients, and 1798 and 298 non-IBD patients. Cohorts 1 and 2 were used to create predictive models. The parameters of the machine learning models established in Cohorts 1 and 2 were merged, and nomogram models were developed using logistic regression. Cohort 3 comprised initial laboratory tests from 117 CD patients, 197 UC patients, and 241 non-IBD patients at a tertiary hospital in a different region of China and was used for external testing of the three nomogram models.
Results: For Cohort 1, ML-IBD-1, ML-CD-1 and ML-UC-1 models developed using the LightGBM algorithm demonstrated exceptional discrimination (ML-IBD-1: AUC = 0.788; ML-CD-1: AUC = 0.772; ML-UC-1: AUC = 0.841). For Cohort 2, ML-IBD-2, ML-CD-2, and ML-UC-2 models developed using XGBoost and Logistic Regression algorithms demonstrated exceptional discrimination (ML-IBD-2: AUC = 0.894; ML-CD-2: AUC = 0.932; ML-UC-2: AUC = 0.778). The nomogram models exhibited good diagnostic capability (nomogram-IBD: AUC = 0.778, 95% CI (0.688–0.868); nomogram-CD: AUC = 0.744, 95% CI (0.710–0.778); nomogram-UC: AUC = 0.702, 95% CI (0.591–0.814)). The predictive ability of the three models was validated in Cohort 3 (nomogram-IBD: AUC = 0.758, 95% CI (0.683–0.832); nomogram-CD: AUC = 0.791, 95% CI (0.717–0.865); nomogram-UC: AUC = 0.817, 95% CI (0.702–0.932)).
Conclusion: This study utilized three cohorts and developed risk prediction models for IBD, CD, and UC with good diagnostic capability, based on conventional laboratory data using ML and nomogram.

Keywords: inflammatory bowel disease, Crohn’s disease, ulcerative colitis, machine learning, nomogram

Introduction

Inflammatory bowel disease (IBD) is a chronic, incurable gastrointestinal condition that primarily includes Crohn’s disease (CD) and ulcerative colitis (UC).1 UC primarily affects the colon, with common symptoms including rectal bleeding or mucus secretion, frequent bowel movements, and lower abdominal pain.2 CD may involve any part of the gastrointestinal tract and often manifests as abdominal pain, chronic diarrhea (which may be accompanied by significant bleeding), fatigue, weight loss, and fever.3 The pathogenesis of IBD remains unclear, with some studies suggesting it is closely related to intestinal microbiota,4 immune responses,5 and genetic factors.6 IBD is now recognized as a global health concern. In 2017, an estimated 6.8 million people worldwide had IBD, making it the fourth most common digestive disease.7 The incidence of IBD has continued to rise in recent years, with negative effects on patients’ physical and mental health and daily lives, which underscores the need to address IBD as a significant public health issue.8 A six-year multicenter prospective study found that IBD is a risk factor for colon cancer, and its severity is positively associated with the risk of developing cancer.9 IBD also increases the risk of various psychiatric disorders, primarily substance misuse disorders, depressive disorder, anxiety disorder, and PTSD.10 One study found that blood-cell-based biomarkers mediate the relationship between IBD and the risk of psychiatric disorders, with six variables showing the strongest mediating effect: RDW, neutrophil count, CRP, albumin, RBC, and SII.11 Therefore, early and accurate diagnosis of IBD is essential for effective treatment of the disease.

Figure 1 The flow chart for constructing nomogram models for IBD, CD, and UC.

Currently, there is no gold standard for diagnosing IBD, which is primarily based on a comprehensive analysis of clinical symptoms, endoscopy, imaging, histopathological examination, and laboratory tests.12 Among these methods, endoscopy plays a crucial role in the assessment and diagnosis of IBD. However, its high cost and invasive nature limit patient acceptance, thereby hindering its widespread clinical application.13 Recently, positron emission tomography-computed tomography (PET-CT) and positron emission tomography/magnetic resonance imaging (PET/MRI) have proven useful in assessing the disease activity of UC, offering diagnostic performance comparable to endoscopy.14 However, their clinical application is limited by factors such as high cost and radiation exposure.15 Consequently, there is an urgent need to develop non-invasive, rapid, and straightforward diagnostic methods for IBD.

The range of diagnostic tools for IBD is expanding, with serological markers gaining attention due to their convenience, non-invasiveness, and cost-effectiveness compared to imaging and histopathological tests.16 Laboratory parameters, including high-sensitivity C-reactive protein (hsCRP), complete blood count, serum albumin, and bilirubin, have been associated with IBD and potentially reflect systemic inflammation.17,18 The most intensively evaluated marker is hsCRP. hsCRP is typically induced by acute inflammation and secreted by hepatocytes. Due to its simplicity and rapidity of detection, it has been consistently used for the assessment of IBD.19

Machine learning (ML) spans multiple disciplines, providing deep insights into data, enhancing data utilization, and supporting clinical decision-making.20 Numerous studies have historically employed ML algorithms to develop IBD prediction models focusing on diagnosis, severity, inflammation, treatment, and prognosis.21–24 These studies primarily relied on medical imaging or omics datasets, encountering limitations such as high costs and clinical implementation challenges.25–27 Explorations of the relationship between electronic health record data and IBD via ML have confirmed the feasibility of non-invasive diagnostic methods.17 ML can manage a broader array of variables, often yielding more accurate and precise results compared to traditional modeling approaches.28

The nomogram, a traditional calculation tool comprising variables and corresponding scoring lines, offers simplicity, applicability, and a graphical representation of logistic or Cox regression models. Currently, the nomogram is employed in gastrointestinal disease management for diagnosis,29 prognosis assessment,30 and recurrence prediction.31 While sophisticated ML methods provide more accurate results, the nomogram remains favored among clinicians for its simplicity and visual representation.32 Both traditional nomogram and ML models serve as valuable tools for clinicians in diagnosing diseases and assessing progression.

Consequently, this study aimed to develop beneficial and non-invasive predictive models for IBD, CD, and UC based on routine laboratory data, combining nomogram and ML methods, to distinguish between IBD and non-IBD patients. Data from the UK Biobank (UKB) and two tertiary hospitals in China were utilized to evaluate the models’ generalized predictive capacity across diverse patient subgroups and regions.

Methods

Study Subjects

Cohort 1 and Cohort 2 (model building and internal validation): Cohort 1 included patients diagnosed with IBD and non-IBD (benign colon polyps) from the UKB, a population-based cohort of 500,000 volunteers in the United Kingdom, who were selected to develop a diagnostic model.33 Diagnoses for IBD and non-IBD were identified using International Classification of Diseases-10 (ICD-10) codes (K50 for CD, K51 for UC, and K635 for benign colon polyps) from primary care records, death registers, inpatient diagnoses, and self-reports. Patients who were pregnant, had uncertain diagnoses, or had tumors were excluded.

Cohort 2: Retrospective data for patients with IBD at the time of admission were collected from the laboratory information systems of the First Hospital of Jilin University between June 2018 and August 2022. Simultaneously, data from non-IBD patients (patients with benign colon polyps) admitted during the same period served as controls. IBD diagnoses were based on clinical, biochemical, stool, endoscopic, cross-sectional imaging, and histological criteria, following European consensus guidelines.34

Cohort 3 (model for external validation): Retrospective data for patients with IBD at the time of admission were collected from the laboratory information systems of Meihekou Central Hospital between June 2018 and April 2024. Simultaneously, data from non-IBD patients admitted during the same period served as controls.

Data Collection

Clinical data collection encompassed: (1) General information: gender, age; (2) Clinical symptoms: abdominal pain, duration of abdominal pain, severity of abdominal pain, abdominal distension, severity of abdominal distension; (3) Laboratory test results upon admission: WBC, NE, NE1, LY, LY1, MO, MO1, EO, EO1, BA, BA1, RBC, HCT, HGB, MCV, MCH, MCHC, RDW, PLT, PCT, MPV, PDW, Glu, Cr, Ur, TP, ALB, AST, ALT, ALP, GGT, TBIL, DBIL, Ca. Data exclusion criteria and imputation strategies were applied: tests with missing rates exceeding 30% were excluded. Imputation techniques (median, mean, mode) were selected based on the distribution characteristics of each variable to best represent its central tendency.
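The exclusion and imputation rules described above can be sketched in a few lines. This is a minimal illustration only, assuming records held as dictionaries with None for missing values; the function names (drop_sparse_tests, impute), the per-variable strategy mapping, and the toy values are hypothetical and are not taken from the study or its platform.

```python
from statistics import mean, median, mode

def drop_sparse_tests(rows, max_missing=0.30):
    """Keep only columns whose fraction of missing (None) values is <= max_missing."""
    n = len(rows)
    kept = [c for c in rows[0]
            if sum(r[c] is None for r in rows) / n <= max_missing]
    return [{c: r[c] for c in kept} for r in rows]

def impute(rows, strategy):
    """Fill None values column by column.

    strategy maps a column name to 'median', 'mean', or 'mode'
    (an assumed, hand-chosen mapping for this sketch).
    """
    fillers = {"median": median, "mean": mean, "mode": mode}
    out = [dict(r) for r in rows]
    for col, how in strategy.items():
        observed = [r[col] for r in rows if r[col] is not None]
        fill = fillers[how](observed)
        for r in out:
            if r[col] is None:
                r[col] = fill
    return out

# Toy records (hypothetical values, not study data).
records = [
    {"WBC": 6.1, "ALB": 41.0, "GGT": None, "abdominal_pain": 1},
    {"WBC": None, "ALB": 38.5, "GGT": None, "abdominal_pain": 0},
    {"WBC": 9.4, "ALB": None, "GGT": 30.0, "abdominal_pain": 1},
    {"WBC": 7.0, "ALB": 35.0, "GGT": None, "abdominal_pain": 1},
]
filtered = drop_sparse_tests(records)   # GGT is 75% missing and is dropped
clean = impute(filtered, {"WBC": "median", "ALB": "mean", "abdominal_pain": "mode"})
```

In this toy example the skewed-looking variable is filled with its median, the roughly symmetric one with its mean, and the binary symptom with its mode; how each variable was actually assigned a strategy in the study is not specified beyond the description above.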

Research Design

Figure 1 depicts the flowchart for constructing nomogram models for IBD, CD, and UC. Variables that were statistically significant in Cohorts 1 and 2 were checked for collinearity to mitigate potential multicollinearity effects on model accuracy. A Variance Inflation Factor (VIF) greater than 5 was taken to indicate potential multicollinearity among independent variables, and such variables were excluded. The Least Absolute Shrinkage and Selection Operator (LASSO) method was employed to screen features in Cohorts 1 and 2. A 5-fold cross-validation approach was used to train the model, with four folds for training and the remaining fold for internal validation. Nine machine learning algorithms were deployed to construct prediction models: Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Light Gradient Boosting Machine (LightGBM), Random Forest (RF), Adaptive Boosting Algorithm (AdaBoost), Decision Tree, Gaussian Naive Bayes (GNB), Neural Networks (MLP), and Support Vector Machines (SVM). Their performance was compared using internal 5-fold cross-validation, and the algorithm demonstrating superior performance was chosen for further machine learning model development and validation. The parameters from the machine learning models established in Cohorts 1 and 2 were merged, and nomogram models were developed using LR. The backward elimination method was applied to fit the multivariate model, from which a nomogram was generated. Cohort 3 served as an external test set to evaluate the diagnostic effectiveness of the final model.
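To make the collinearity screen and the 5-fold split concrete, the following is a rough sketch using NumPy. It computes the VIF of each predictor as 1 / (1 - R²), where R² comes from regressing that predictor on the others, and it generates train/validation index pairs with four folds for training and one for validation. The helper names (vif, kfold_indices) and the simulated data are assumptions for illustration; the study itself used the DxAI platform, and whether its exclusion was done in one pass or iteratively is not stated.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        out.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return out

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs: k-1 folds for training, 1 for validation."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

# Simulated predictors (not study data): c is nearly a copy of a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
names = ["a", "b", "c"]
vifs = vif(np.column_stack([a, b, c]))
flagged = [name for name, v in zip(names, vifs) if v > 5]  # threshold from the text

splits = list(kfold_indices(10, k=5))
```

Here the near-duplicate pair is flagged while the independent predictor is not. In practice one member of a collinear pair is often retained, so a one-pass rule that drops every flagged variable is only one possible reading of the procedure.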

Statistical Analysis

Non-normally distributed variables were expressed as median (Q25, Q75), and the Mann–Whitney U-test was used to compare their distributions between groups. Categorical variables were represented as composition ratios, and the Chi-square test was utilized to compare distributions between groups, with a two-sided P<0.05 considered indicative of a statistically significant difference. The receiver operating characteristic (ROC) curve was used to assess the classification effectiveness of the model. The calibration curve was used to evaluate the agreement between the model’s predicted probabilities and the observed probabilities. The decision curve analysis (DCA) was employed to determine the clinical benefit of the model. LR was employed to develop the nomogram. Data were stored and managed using Excel 2016, and statistical analysis was performed using SPSS 22.0. The model construction and online presentation were supported by the Deepwise & Beckman Coulter DxAI platform (http://dxonline.deepwise.com).
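For readers unfamiliar with the two headline metrics, the sketch below shows how they are commonly computed. The AUC is taken as the probability that a randomly chosen case receives a higher score than a randomly chosen control (ties counted as one half), and the DCA net benefit at a threshold probability pt is TP/n - (FP/n) × pt/(1 - pt). The function names and the toy labels and probabilities are illustrative assumptions, not output from the study's software.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def net_benefit(y_true, probs, threshold):
    """Decision-curve net benefit of treating everyone with prob >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Reference strategy for a decision curve: classify every patient as positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Toy labels and predicted probabilities (hypothetical, not study data).
y = [1, 1, 1, 0, 0, 0]
p = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

toy_auc = auc(y, p)                       # 8 of 9 case-control pairs ranked correctly
toy_nb = net_benefit(y, p, 0.5)           # 2/6 - 1/6 * 1
toy_nb_all = net_benefit_treat_all(y, 0.5)
```

A decision curve plots the model's net benefit against the treat-all and treat-none (net benefit 0) strategies over a range of thresholds; a model is considered clinically useful where its curve lies above both.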

Results

Baseline Characterization

Cohorts 1 and 2 of the study included 3327 IBD patients (1135 CD patients and 2192 UC patients) and 1798 non-IBD patients, and 563 IBD patients (237 CD patients and 326 UC patients) and 298 non-IBD patients, respectively. The median age at diagnosis was 59 and 38 years for patients in Cohorts 1 and 2 IBD groups, respectively, and 60 and 61 years for patients in the non-IBD groups, respectively. In Cohorts 1 and 2, 49.26% and 39.43% of IBD patients were female, and 49.28% and 22.15% of patients in the non-IBD groups were female. Tables 1 and 2 display the baseline and laboratory characteristics of patients in Cohorts 1 and 2, respectively.

Table 1 Baseline and Laboratory Characteristics of Patients in Cohort 1

Table 2 Baseline and Laboratory Characteristics of Patients in Cohort 2

Establishment of ML Model

Variables with a VIF greater than 5 were excluded from those with a P value < 0.05 in Tables 1 and 2. Following the LASSO feature selection method, 15 and 5 features for IBD, 10 and 8 features for CD, and 11 and 4 features for UC were finally selected for modeling in Cohorts 1 and 2, respectively: Cohort 1: IBD: WBC, MCHC, LY1, EO1, MO, DBIL, AST, ALB, RDW, severity of abdominal pain, severity of abdominal distension, ALP, hsCRP, age, abdominal pain days; CD: abdominal pain days, severity of abdominal pain, hsCRP, RDW, NE, MO, MPV, ALB, ALP, Ca; UC: severity of abdominal distension, TP, age, RDW, AST, NE, LY, MO, EO, DBIL, abdominal pain days; Cohort 2: IBD: severity of abdominal pain, severity of abdominal distension, age, PLT, MCH; CD: severity of abdominal distension, age, PLT, RDW, AST, LY, BA, Cr; UC: severity of abdominal pain, severity of abdominal distension, PLT, ALB (Supplementary Figure 1).

Internal 5-fold cross-validation across nine algorithms revealed that LightGBM achieved the highest Area Under the Curve (AUC) values for IBD, CD, and UC in Cohort 1, with AUC values of 0.788, 0.722, and 0.841, respectively. In Cohort 2, XGBoost had the highest AUC value of 0.894 for IBD, while LR had the highest AUC values of 0.932 and 0.778 for CD and UC, respectively. The algorithms used in the subsequent modeling were LightGBM, XGBoost, and LR (Table 3). The ten variables with the highest importance in Cohort 1 according to the LightGBM algorithm (5-fold cross-validation) were selected for subsequent modeling: ML-IBD-1: DBIL, ALB, age, hsCRP, ALP, RDW, AST, EO1, abdominal pain days, severity of abdominal distension; ML-CD-1: abdominal pain days, ALB, RDW, NE, hsCRP, severity of abdominal pain, ALP, MPV, Ca, MO; and ML-UC-1: DBIL, TP, age, LY, RDW, NE, AST, MO, EO, abdominal pain days. Variables for subsequent modeling in Cohort 2 were selected based on the XGBoost and LR algorithms (5-fold cross-validation): ML-IBD-2: severity of abdominal distension, age, MCH, PLT, severity of abdominal pain; ML-CD-2: severity of abdominal distension, age, PLT, Cr, AST, RDW, LY, BA; and ML-UC-2: severity of abdominal distension, ALB, PLT, severity of abdominal pain (Figure 2). The performance of each model is shown in Supplementary Figure 2 and Supplementary Figure 3.

Table 3 Diagnostic Efficacy of Nine Classifiers in the Validation Set for 5-Fold Cross-Validation

Figure 2 Importance of the top ten features of two Cohort ML models. The top ten features of Cohort 1 and Cohort 2 ML models. (a) ML-IBD-1 model. (b) ML-CD-1 model. (c) ML-UC-1 model. (d) ML-IBD-2 model. (e) ML-CD-2 model. (f) ML-UC-2 model.

Establishment of Nomogram Models

The parameters of ML-IBD-1 and ML-IBD-2, ML-CD-1 and ML-CD-2, and ML-UC-1 and ML-UC-2 were merged, respectively. Using multifactorial LR with the backward elimination method, the nomogram model for IBD included age, DBIL, ALP, RDW, abdominal pain days, and hsCRP. For CD, the nomogram model comprised abdominal pain days, RDW, NE, hsCRP, Ca, age, and AST. For UC, the nomogram model comprised ALB, TP, DBIL, age, RDW, NE, AST, EO, and abdominal pain days (Figure 3). Internally validated ROC curve results showed that the nomogram-IBD, nomogram-CD, and nomogram-UC models had excellent classification ability in the diagnosis of IBD, CD, and UC, respectively (nomogram-IBD AUC = 0.778, 95% CI (0.688–0.868); nomogram-CD AUC = 0.744, 95% CI (0.710–0.778); nomogram-UC AUC = 0.702, 95% CI (0.591–0.814)). The calibration curves showed that the observed probabilities for the nomogram-IBD, nomogram-CD, and nomogram-UC models were in good agreement with the predicted probabilities. DCA results indicated that the above models had high clinical benefits (Figure 4).
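A nomogram of this kind is a graphical rescaling of a fitted logistic regression. One common construction, sketched below, assigns 100 points to the largest possible single-variable effect (|coefficient| × range), scores every other variable on the same scale, and maps total points back to a probability through the logistic function. The coefficients, intercept, and variable ranges in MODEL are entirely hypothetical placeholders chosen for easy arithmetic; they are not the fitted values of nomogram-IBD, and the construction may differ in detail from what the authors' platform does.

```python
import math

# Hypothetical coefficients and ranges for illustration only;
# NOT the fitted values reported for any model in this study.
MODEL = {
    "intercept": -2.0,
    "terms": {
        # name: (coefficient, axis minimum, axis maximum)
        "age": (-0.03, 20.0, 80.0),
        "RDW": (0.25, 11.0, 19.0),
        "hsCRP": (0.04, 0.0, 50.0),
    },
}

def max_effect(model):
    """Largest change in log-odds any single variable can produce over its range."""
    return max(abs(b) * (hi - lo) for b, lo, hi in model["terms"].values())

def points(model, name, x):
    """Nomogram points for one variable; 0 points at its lowest-risk end."""
    b, lo, hi = model["terms"][name]
    ref = lo if b >= 0 else hi
    return 100.0 * b * (x - ref) / max_effect(model)

def probability(model, patient):
    """Return (total points, predicted probability) for a dict of variable values."""
    total = sum(points(model, k, v) for k, v in patient.items())
    base = model["intercept"] + sum(
        b * (lo if b >= 0 else hi) for b, lo, hi in model["terms"].values()
    )
    log_odds = base + total * max_effect(model) / 100.0
    return total, 1.0 / (1.0 + math.exp(-log_odds))

patient = {"age": 40.0, "RDW": 15.0, "hsCRP": 25.0}  # hypothetical patient
total_points, risk = probability(MODEL, patient)
```

Because the point scale is only a linear change of units, the probability read from total points is identical to the one obtained by plugging the patient's values directly into the logistic regression equation.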

Figure 3 Nomogram models. (a) nomogram-IBD model. (b) nomogram-CD model. (c) nomogram-UC model. Each variable was assigned a point, and the total points of each variable together corresponded to the risk probability of disease.

Figure 4 Performance of nomogram models in 5-fold cross-validation. The calibration curve of nomogram models: (a) nomogram-IBD model. (b) nomogram-CD model. (c) nomogram-UC model. Decision curve analysis of nomogram models: (d) nomogram-IBD model. (e) nomogram-CD model. (f) nomogram-UC model.

External Validation of Nomogram Models

In Cohort 3, the study recruited 314 patients with IBD (117 with CD and 197 with UC) alongside 241 non-IBD patients. External validation (5-fold cross-validation) results for nomogram-IBD, nomogram-CD, and nomogram-UC models, as indicated by ROC curve analyses, demonstrated their stable and superior diagnostic ability for Cohort 3: nomogram-IBD AUC = 0.758, 95% CI (0.683–0.832); nomogram-CD AUC = 0.791, 95% CI (0.717–0.865); nomogram-UC AUC = 0.817, 95% CI (0.702–0.932). DCA for all three models demonstrated favorable clinical performance. ROC curves and DCA are presented in Figure 5.

Figure 5 Performance of external validation of ML models. The ROC curve of nomogram models: (a) nomogram-IBD model. (b) nomogram-CD model. (c) nomogram-UC model. DCA of nomogram models: (d) nomogram-IBD model. (e) nomogram-CD model. (f) nomogram-UC model.

Discussion

This study had the following innovative findings: (1) Predictive models for IBD and its subtypes were developed using clinical symptoms, including age, and routine laboratory parameters to enable rapid, noninvasive diagnosis of patients with IBD. (2) The model’s discrimination, calibration, and clinical utility were assessed across diverse racial populations, using the UKB database for modeling and data from a tertiary care hospital in China for external validation, thus providing a comprehensive evaluation of the model’s predictive capabilities.

Selecting Features and Optimal Algorithms for ML Models

In ML modeling, data and features set the upper limit of model performance, with the algorithm striving to approach this limit as closely as possible. To optimize model predictive performance, the study implemented rigorous data preprocessing and employed two feature selection methods. During data preprocessing, missing and outlier values were addressed, and the data were standardized. The LASSO method selected the most predictive features, eliminating redundancy and multicollinearity. Among the nine machine learning algorithms (XGBoost, LR, RF, LightGBM, AdaBoost, GNB, MLP, SVM, and Decision Tree), XGBoost, LR, and LightGBM emerged as the optimal choices. Compared with the other algorithms, they were distinguished by their ability to prevent overfitting and to be finely tuned for unbalanced datasets.

Nomogram Modeling and Variable Analysis

The study merged the parameters of the ML models in Cohorts 1 and 2 and generated nomograms using LR and backward elimination. Both developed models, ML and nomogram, demonstrated good diagnostic efficiency. The ML model performed well in discrimination and further validation tests, albeit with less clarity. Conversely, the nomogram model was characterized by its simplicity, transparency, and ease of understanding. This study combined the strengths of both models—the ML model’s accuracy and the nomogram model’s transparency—to significantly enhance the clinical diagnosis of IBD.

Age, abdominal pain days, and RDW associated with IBD pathogenicity were incorporated into the nomogram-IBD, nomogram-CD, and nomogram-UC models simultaneously. IBD predominantly affects young adults, with peak ages for UC and CD being 20–49 and 18–35 years, respectively.35 Research indicates that age is a pivotal factor in models predicting IBD, with its feature importance value reaching as high as 1.75 in prior studies.17 In the ML model established in this study, age was also an important factor in distinguishing between IBD, including its subtypes, and non-IBD patients. Some studies have shown that the activity of inflammatory bowel disease is closely related to the occurrence and duration of abdominal pain.36,37 The pathogenesis of pain in IBD patients is not yet clear, but there are several potential pathological mechanisms, including inflammation, intestinal obstruction, psychological, sociopsychological, neurobiological, and genetic factors.38 Therefore, effective disease management and treatment can help reduce the number of days of abdominal pain and improve the quality of life of patients.39 Anemia is the most common extraintestinal manifestation of IBD, with the main types being iron deficiency anemia (IDA), inflammatory anemia, and anemia of chronic disease (ACD). IDA is most commonly a result of chronic inflammation of the small and large intestinal epithelium, reduced absorption by intestinal cells, and chronic gastrointestinal bleeding.40 Among red blood cell parameters, RDW is a classic and strong biomarker for diagnosing IDA, with good sensitivity and specificity. However, it is not very useful in differential diagnosis and remains a quite common abnormal biomarker.41

In addition, studies have shown that about 30% of IBD patients exhibit elevated liver enzymes (GGT, ALT, AST, ALP) and significant symptoms of liver injury.42 Primary sclerosing cholangitis (PSC) is the most common hepatic and biliary manifestation of IBD and is more common in UC.43 A large-scale population study conducted by Bernstein et al in Canada on the extraintestinal manifestations of IBD included 4454 patients and found that the incidence of PSC in UC patients was 2%, while the incidence in CD patients was 0.4%.44 This study included DBIL, ALB, and TP, which are related to liver injury symptoms, in the nomogram UC model.

Meanwhile, this study incorporated hsCRP and Ca into the nomogram CD model. hsCRP, an acute-phase protein produced by liver cells in response to inflammation (eg, microbial invasion, tissue damage), rises within a few hours of the onset of inflammation and reaches its peak value within 48 hours.45 In seeking non-invasive diagnostic solutions for IBD, hsCRP has been employed as a marker.46 The correlation between disease activity and CRP is stronger in Crohn’s disease than in UC; however, this depends on the severity and location of the disease.47 A study of 435 patients in South Korea found that, compared to patients with isolated ileal disease, patients with ileal or colonic Crohn’s disease are more likely to experience elevated CRP.48 Vitamin D is a fat-soluble vitamin with the active form of calcitriol or 1,25-dihydroxyvitamin D3 (1,25(OH)2D3), which regulates bone, calcium, and phosphorus metabolism. Vitamin D can promote the absorption of calcium in the intestine and regulate the absorption and release of calcium in the bones, thereby maintaining normal serum calcium balance. A study has shown that 27% of CD patients and 15% of UC patients have vitamin D deficiency (25-hydroxyvitamin D3 < 30 nmol/L). In addition, compared to UC patients, the average concentration of 25-hydroxyvitamin D3 in CD patients was significantly reduced.49

IBD patients often exhibit abnormal whole blood cell count parameters at the onset of the disease.50 The etiology and pathogenesis of IBD include intestinal barrier dysfunction and dysregulation of the intestinal mucosal immune system, both of which can be influenced by eosinophils. The abundance of eosinophils is related to the severity of the disease.51 In addition, Manousou et al described an increase in the expression of the eotaxin receptor CCR3 in colon biopsy samples from UC patients, rather than CD patients.52 Given the different activation patterns of colon eosinophils described by Lampinen et al, the comparison between UC and CD becomes more pronounced, indicating that activated eosinophils persist in the lamina propria of UC patients during disease remission, but not in CD patients.53 Meanwhile, this study showed that EO was only included in the nomogram UC model. Patients with IBD have been found to exhibit significantly higher peripheral blood Neutrophil-to-Lymphocyte Ratio (NLR) values than controls.54 The NLR consists of two critical immune system components: lymphocytes, key to the inflammatory response, and neutrophils, central to the innate immune mechanism. Thus, an imbalance in the inflammatory phase may lead to a higher NLR.55

This study demonstrates that a machine learning predictive model based on simple, accessible, and widely applied blood biomarkers, along with clinical manifestations of patients, can support the diagnosis of IBD. Pei et al included 414 IBD patients and employed four machine learning models to evaluate the diagnostic and predictive value of peripheral blood routine parameters in distinguishing UC from CD. The multilayer perceptron artificial neural network model based on peripheral blood routine parameters exhibited good performance. However, a larger sample size and additional models are needed for further investigation.56 Reddy and colleagues compared traditional regression models with GBM learning models. They constructed a decision support tool using ML methods with a large electronic medical records system, Cerner EMR, for the real-time diagnosis of CD patients. However, due to missing data, only 82 patients were available for analysis and model development. This study is also considered to have a high risk of bias and concerns regarding its applicability.57

Validation of ML Models

In this study, models were constructed using a large dataset from the UKB and a tertiary hospital in China. The models demonstrated robust performance and reliability in internal validation regarding differentiation, calibration, and utility. Reliable external validation was conducted at another tertiary hospital in different regions of China, and the models demonstrated strong diagnostic performance across different ethnic groups, thereby enhancing the generalizability of the research results. The limited sample sizes of Cohorts 2 and 3, offering fewer data points, could potentially affect the fitting accuracy and stability of the calibration curves. Compared to traditional diagnostic methods, the research models can be integrated into computerized decision-aid tools, offering rapid diagnostic predictions through the automated analysis of clinical features, potentially reducing the time from initial presentation to confirmed diagnosis. In conclusion, the models’ applicability across various countries could significantly aid clinicians in diagnosing IBD and implementing effective therapeutic interventions.

This study had several limitations: it was limited to clinical presentations and laboratory data, excluding imaging findings. Furthermore, genetic analysis data were not included in the clinical data analysis. As gene sequencing technology advances in clinical applications, integrating clinical and genetic data will become increasingly valuable for disease risk prediction, diagnosis, and prognosis.

Conclusion

This study constructed IBD, CD, and UC models using the UKB database and data from IBD and non-IBD patients at a tertiary hospital in China, with external validation using data from another tertiary hospital in a different region. The optimal algorithm for each model was selected from nine machine learning algorithms. Cohort 1 used LightGBM, while Cohort 2 used XGBoost and LR algorithms to develop ML risk prediction models based on conventional laboratory parameters. Machine learning features from the two cohorts were merged to construct the nomogram models. Internal and external validation showed that the nomogram IBD, nomogram CD, and nomogram UC models have good accuracy in distinguishing between IBD and non-IBD patients. These models can be integrated into computer-aided decision-making tools to provide rapid diagnostic predictions.

Abbreviations

IBD, Inflammatory bowel disease; CD, Crohn’s disease; UC, Ulcerative colitis; ML, machine learning; hsCRP, high-sensitivity C-reactive protein; UKB, UK Biobank; WBC, White blood cell; NE, Neutrophil count; NE1, Neutrophil percentage; LY, Lymphocyte count; LY1, Lymphocyte percentage; MO, Monocyte count; MO1, Monocyte percentage; EO, Eosinophil count; EO1, Eosinophil percentage; BA, Basophil count; BA1, Basophil percentage; RBC, Red blood cell; HCT, Hematocrit; HGB, Hemoglobin; MCV, Mean corpuscular volume; MCH, Mean corpuscular hemoglobin; MCHC, Mean corpuscular hemoglobin concentration; RDW, Red blood cell distribution width; PLT, Platelet count; PCT, Plateletcrit; MPV, Mean platelet volume; PDW, Platelet distribution width; Glu, Glucose; Cr, Creatinine; Ur, Urea; TP, Total protein; ALB, Albumin; AST, Aspartate aminotransferase; ALT, Alanine aminotransferase; ALP, Alkaline phosphatase; GGT, γ-glutamyl transpeptidase; TBIL, Total bilirubin; DBIL, Direct bilirubin; Ca, Calcium; VIF, Variance Inflation Factor; LASSO, Least Absolute Shrinkage and Selection Operator; XGBoost, Extreme Gradient Boosting; LR, Logistic Regression; RF, Random Forest; LightGBM, Light Gradient Boosting Machine; AdaBoost, Adaptive Boosting Algorithm; GNB, Gaussian Naive Bayes; MLP, Neural Networks; SVM, Support Vector Machines; KNN, K Nearest Neighbor; ROC, receiver operating characteristic; DCA, decision curve analysis.

Data Sharing Statement

The datasets generated and/or analysed during the current study are not publicly available due to privacy or ethical restrictions but are available from the corresponding author on reasonable request. The UK Biobank, Application ID: 84347.

Code Availability

Code for analysis can be accessed at http://dxonline.deepwise.com. The model construction and online presentation were supported by the Deepwise & Beckman Coulter DxAI platform.

Ethics Approval and Consent to Participate

The study was conducted in accordance with the World Medical Association’s Declaration of Helsinki (1964 and its later amendments). The ethics committee of the First Hospital of Jilin University approved this study (No.2016-306) and, owing to its retrospective nature, waived the requirement for written informed consent.

Ethical approval for the UK Biobank was obtained from the North West Multicenter Research Ethics Committee (REC reference: 21/NW/0157), with participants providing informed consent upon recruitment.

Acknowledgments

This work was supported by data from Dr. Mu, the First Hospital of Jilin University, China.

Author Contributions

All authors made a significant contribution to the work reported, whether that is in the conception, study design, execution, acquisition of data, analysis and interpretation, or in all these areas; took part in drafting, revising or critically reviewing the article; gave final approval of the version to be published; have agreed on the journal to which the article has been submitted; and agree to be accountable for all aspects of the work.

Funding

This work was supported by the Jilin Science and Technology Development Program [grant numbers 20190304110YY, 20200404171YY].

Disclosure

The authors declare no competing interests.

Creative Commons License © 2025 The Author(s). This work is published and licensed by Dove Medical Press Limited. The full terms of this license are available at https://www.dovepress.com/terms.php and incorporate the Creative Commons Attribution - Non Commercial (unported, 4.0) License. By accessing the work you hereby accept the Terms. Non-commercial uses of the work are permitted without any further permission from Dove Medical Press Limited, provided the work is properly attributed. For permission for commercial use of this work, please see paragraphs 4.2 and 5 of our Terms.