Predicting Primary Biliary Cholangitis Stages Using Machine Learning with Automated Hyperparameter Optimization and Recursive Feature Elimination
Subject Areas : IT Strategy
Arman Rezasoltani 1, Ali Husseinzadeh Kashan 2, Shahram Agah 3*, Fatemeh Agah 4, Amir Mohammad Khani 5
1 - Department of Industrial Management, Faculty of Management, University of Tehran, Tehran, Iran.
2 - Department of Industrial Engineering, Faculty of Industrial and Systems Engineering, Tarbiat Modares University, Tehran, Iran.
3 - Department of Gastroenterology and Hepatology, Colorectal Research Center, Iran University of Medical Sciences, Tehran, Iran.
4 - The University of Adelaide, Discipline of Medicine, Adelaide, South Australia, Australia.
5 - Department of Industrial Management, Faculty of Management, University of Tehran, Tehran, Iran.
Keywords: Primary Biliary Cholangitis, Machine Learning, Recursive Feature Elimination, Optuna, Imbalanced Data
Abstract :
This study applies modern machine learning methods to predict the stages of primary biliary cholangitis (PBC) using data from the Mayo Clinic trial. The aim is to achieve high prediction accuracy while reporting balanced evaluation metrics. Key techniques include automated hyperparameter optimization with Optuna and Recursive Feature Elimination (RFE) to improve model performance. Pre-processing comprised handling missing values, encoding categorical features, and addressing class imbalance with SMOTE. Twelve machine learning algorithms were evaluated, with ensemble-based models such as CatBoost and Extra Trees producing substantially better results. Models were assessed with accuracy, precision, recall, F1 score, and ROC-AUC, giving a balanced and interpretable view of performance, which is critical for imbalanced datasets. The study draws on clinical and laboratory information and illustrates the potential of machine learning to advance diagnosis. Its emphasis on rigorous and robust evaluation lays the groundwork for future research on more generalizable diagnostic tools.
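As a rough illustration of two of the steps named in the abstract, the following minimal sketch runs Recursive Feature Elimination with an Extra Trees classifier and reports several of the listed metrics. It is not the authors' pipeline. The data are synthetic (generated with scikit-learn, not the Mayo Clinic data), and the number of classes, the class weights, the number of retained features, and all hyperparameters are arbitrary assumptions. Optuna tuning and SMOTE resampling are omitted to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical data (assumption): 3 imbalanced
# classes and 18 numeric features, none of which come from the study.
X, y = make_classification(
    n_samples=600, n_features=18, n_informative=8, n_redundant=2,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# RFE repeatedly fits the estimator and drops the least important
# feature until the requested number of features remains.
selector = RFE(
    estimator=ExtraTreesClassifier(n_estimators=100, random_state=0),
    n_features_to_select=10,  # arbitrary choice for illustration
    step=1,
)
selector.fit(X_train, y_train)

# Train a final model on the selected features only.
model = ExtraTreesClassifier(n_estimators=200, random_state=0)
model.fit(selector.transform(X_train), y_train)

X_test_sel = selector.transform(X_test)
y_pred = model.predict(X_test_sel)
y_prob = model.predict_proba(X_test_sel)

# Macro averaging weights every class equally, which matters when
# the classes are imbalanced.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "macro_f1": f1_score(y_test, y_pred, average="macro"),
    "roc_auc_ovr": roc_auc_score(
        y_test, y_prob, multi_class="ovr", average="macro"
    ),
}
print(metrics)
```

In a fuller version, the hyperparameters fixed above would be the natural search space for an automated optimizer, and any resampling would be applied to the training split only so that the test metrics stay unbiased.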