Infection and Drug Resistance, Volume 19
Analysis of High-Risk Factors for Tuberculosis Retreatment Based on Machine Learning and Latent Class Analysis
Authors Du X, Yimamu M, Na Y, Li X, Wang Z, Nuermaihaimaiti ZZ, Wang Y, Zhang L, Zheng Y
Received 28 January 2026
Accepted for publication 16 April 2026
Published 25 April 2026 Volume 2026:19 594300
DOI https://doi.org/10.2147/IDR.S594300
Checked for plagiarism Yes
Review by Single anonymous peer review
Peer reviewer comments 2
Editor who approved publication: Dr Sandip Patil
Xilong Du,1 Maiwulajiang Yimamu,2 Yan Na,3 Xiaoxue Li,3 Ziyu Wang,3 Zulimire Z Nuermaihaimaiti,3 Yuxin Wang,3 Liping Zhang,3 Yanling Zheng3,4
1School of Public Health, Xinjiang Medical University, Urumqi, Xinjiang, People’s Republic of China; 2Tuberculosis and Leprosy Prevention and Control Department, Kashgar Prefecture Center for Disease Control and Prevention, Kashgar, Xinjiang, People’s Republic of China; 3College of Medical Engineering and Technology, Xinjiang Medical University, Urumqi, Xinjiang, People’s Republic of China; 4Institute of Medical Engineering Interdisciplinary Research, Xinjiang Medical University, Urumqi, Xinjiang, People’s Republic of China
Correspondence: Yanling Zheng
Objective: To identify high-risk factors for tuberculosis retreatment and to provide a scientific basis for developing targeted prevention and control strategies by integrating machine learning with latent class analysis.
Methods: This study retrospectively collected baseline and treatment-related data from 6,821 tuberculosis patients, employing machine learning and latent class analysis (LCA) to investigate the key influencing factors associated with high-risk populations for retreatment.
Results: The XGBoost model achieved an overall accuracy of 84% and an area under the ROC curve (AUC) of 0.938. The analysis identified sputum examination results at month 6 or 8 of treatment, treatment regimen, and diagnostic classification as the most influential factors associated with retreatment. SHAP analysis further revealed that a sputum examination status of “not performed” was strongly linked to increased retreatment risk. Logistic regression confirmed this finding, with “not performed” (OR = 123.47, P < 0.001) and a “positive” result (OR = 14.89, P = 0.02) at month 6 or 8 identified as significant risk factors. Latent class analysis stratified patients into four distinct subgroups, among which those characterized by comorbid diabetes or prior treatment failure constituted the highest-risk populations for retreatment.
Conclusion: It is recommended to improve treatment adherence and efficacy monitoring for newly diagnosed patients, strengthen whole-course supervision, and optimize management for elderly patients and those on long-term regimens.
Keywords: tuberculosis, latent class analysis, random forest, xgboost, cramér’s v
Introduction
Tuberculosis remains a globally widespread infectious disease with high transmissibility and fatal risks, posing an ongoing threat to human health.1,2 The World Health Organization’s latest Global Tuberculosis Report (2025)3 indicates that in 2024 there were an estimated 10.7 million new tuberculosis cases worldwide, corresponding to an estimated incidence rate of 131 per 100,000 population. The global tuberculosis incidence rate declined by nearly 2% between 2023 and 2024. In 2024, an estimated 390,000 people developed multidrug-resistant or rifampicin-resistant tuberculosis, accounting for 3.6% of all tuberculosis cases. Although the estimated global number of such cases declined between 2015 and 2024, some countries and regions still report localized increases in drug-resistant tuberculosis. Regarding drug resistance risk, the 2024 report states that the global rate of multidrug-resistant/rifampicin-resistant tuberculosis among previously treated patients was 16%, compared to only 3.2% among newly treated patients. This indicates that the risk of drug resistance is considerably higher in retreatment cases than in newly treated cases.
Tuberculosis retreatment patients refer to individuals with a history of previous tuberculosis treatment who require anti-tuberculosis therapy again after treatment failure or relapse. Tuberculosis recurrence may result from exogenous reinfection or endogenous reactivation of the initial infection.4,5 Currently, the prevention and control of tuberculosis retreatment is one of the key challenges facing China’s tuberculosis control system.6 Neglecting risk factors related to “retreatment” in clinical diagnosis, treatment, and prevention efforts can easily lead to treatment failure or disease recurrence.7 Therefore, an in-depth exploration of various factors influencing the treatment outcomes of tuberculosis retreatment patients can not only provide a basis for developing personalized treatment plans for different patients,8 enhance patient treatment confidence, and facilitate recovery, but also offer practical support for the refined optimization of tuberculosis treatment strategies in high-burden regions of western China.9 This holds significant importance both academically and practically.
In the fields of machine learning and statistical analysis, the synergistic application of multiple methods provides robust support for research on tuberculosis clinical characteristics. Random Forest (RF), as a widely used algorithm in bioinformatics and related fields,10 serves as an efficient feature selection tool. It can output feature importance scores, exclude irrelevant variables, and capture nonlinear relationships and variable interactions in data, while also offering strong predictive power and intuitive interpretability. XGBoost (eXtreme Gradient Boosting), as a high-performance ensemble learning algorithm, iteratively optimizes loss functions through gradient descent and supports various complex algorithms for precise error fitting,11 making it suitable for scenarios requiring high prediction accuracy. It has previously been applied in predicting drug resistance of tuberculosis strains.12 Latent Class Analysis (LCA), on the other hand, can identify potential patient subgroups from complex clinical data through probabilistic models, revealing heterogeneity among different subgroups in terms of clinical characteristics and treatment responses.13 This effectively addresses the limitations of overall data analysis caused by significant individual differences among tuberculosis patients and provides targeted evidence for the formulation of precision intervention strategies. Additionally, the logistic regression model, as a classical statistical modeling method, offers efficient analysis of associations between binary or multi-class outcomes and clinical characteristics based on its straightforward mathematical logic and clear interpretability. By calculating odds ratios (OR values), it quantifies the impact of features on outcomes such as tuberculosis risk and treatment prognosis. 
With its low requirements for data distribution assumptions and its low computational cost, logistic regression often serves as a baseline model that complements machine learning algorithms, playing a foundational role in studies on tuberculosis risk factor screening and prognosis prediction.14–16
Based on large-sample baseline and treatment-related data of tuberculosis patients, this study first employs the Random Forest algorithm to screen key features of research value. It then uses Cramér’s V coefficient to measure the strength of associations among categorical variables, eliminating highly correlated variables to enhance the performance and interpretability of subsequent models. After completing variable selection, this study applies the XGBoost model to conduct an in-depth analysis of the selected variables. Simultaneously, SHapley Additive exPlanations (SHAP) is incorporated to obtain more precise estimates of feature contributions, exploring the impact of different factors on the risk of tuberculosis retreatment. Furthermore, LCA is introduced to reveal the heterogeneity of potential subgroups within the patient population. Ultimately, this study aims to clarify the specific effects of different factor categories on tuberculosis retreatment patients, providing theoretical reference for the precision prevention and control of tuberculosis.
While each of these methods can be individually applied in tuberculosis research, their synergistic integration—leveraging Random Forest for efficient feature screening, XGBoost for robust predictive modeling with SHAP for interpretability, and LCA for uncovering hidden patient heterogeneity—offers a more comprehensive analytical framework for identifying high-risk populations and informing precision prevention strategies.
Materials and Methods
Data Source
A retrospective analysis was conducted on data from 6,821 tuberculosis patients in the Kashgar region of China from January 1, 2022 to December 31, 2022. After screening, data from 5,826 tuberculosis patients were included, comprising 4,430 patients undergoing initial treatment and 1,396 patients undergoing retreatment. The dataset encompassed baseline information of the tuberculosis patients (including name, gender, age, patient source, history of previous anti-tuberculosis treatment, diagnostic classification, diagnostic results, comorbidities, sputum smear examination at month 0, sputum culture results, imaging findings, molecular biology results, etiological results, drug susceptibility testing results, and strain identification). Treatment-related data were also collected (including treatment outcomes, sputum examination, actual medication management method, treatment regimen, sputum smear examination at month 2, sputum smear examination at month 5, treatment protocol, and sputum smear examination at month 6 or 8).
Statistical Analysis Method
Based on the baseline and treatment-related data of tuberculosis patients, this study first calculated the feature importance scores using Random Forest. These scores represent the contribution level of each feature to the model. Features with relatively low contribution levels were excluded. The Cramér’s V coefficient was then used to measure the strength of association between two categorical variables, and among highly correlated features, those with lower importance scores were excluded. The XGBoost model was employed to analyze and calculate the impact of different factors on the treatment outcomes of tuberculosis retreatment. The performance of the XGBoost model was evaluated using the confusion matrix, Receiver Operating Characteristic Curve (ROC Curve), and Precision-Recall Curve (PR Curve). The feature contributions from the XGBoost model and the more precise feature contributions provided by SHAP were used to analyze the influence of various factors on tuberculosis treatment outcomes. Additionally, the logistic regression model offered excellent interpretability for each feature. Latent Class Analysis (LCA) revealed hidden heterogeneity within the patient population and provided well-defined target groups for precision intervention. The specific analytical workflow is illustrated in Figure 1.
Figure 1 Flowchart of the study. Bold text denotes the three main phases: data collection, model analysis, and result analysis.
Statistical analysis was conducted using R version 4.4.1 and Python version 3.10. Random Forest and Cramér’s V were implemented in R 4.4.1, while XGBoost and SHAP were implemented in Python.
For XGBoost modeling, the xgboost package (version 1.7.5) in Python 3.10 was used. Hyperparameters were optimized via grid search with three-fold stratified cross-validation, searching over n_estimators (50, 100), max_depth (3, 6), and learning_rate (0.01, 0.1). The final optimal parameters were n_estimators = 100, max_depth = 6, and learning_rate = 0.1. Random Forest was implemented using the randomForest package (version 4.7-1) in R 4.4.1 with ntree = 500 and mtry = sqrt(p), where p is the number of predictors. To address class imbalance between initial treatment (n = 4,430) and retreatment (n = 1,396) cases, sample weights were applied using class_weight = “balanced”. Model performance was evaluated using precision, recall, F1-score, and the area under the precision-recall curve (AP) in addition to accuracy and ROC-AUC, as these metrics are more robust for imbalanced data.
Latent class analysis was performed using the poLCA package (version 1.6.0) in R 4.4.1. Model selection was guided by the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and entropy (ranging from 0 to 1, with higher values indicating better class separation). A maximum of five latent classes was specified, and the final four-class model was selected based on optimal fit indices and interpretability.
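The “balanced” weighting mentioned in the modeling description can be made concrete with a short calculation. The sketch below assumes the common heuristic in which each class receives the weight n_samples / (n_classes × class_count); the text does not state the exact formula used, so this is an illustration rather than the study's implementation. The function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    ASSUMPTION: this is the usual "balanced" heuristic; the study does not
    spell out the formula it applied.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Class sizes reported in the Data Source section.
labels = ["initial"] * 4430 + ["retreatment"] * 1396
weights = balanced_class_weights(labels)

for cls, w in weights.items():
    print(f"{cls}: weight = {w:.3f}")
```

Under this heuristic each class contributes the same total weight to the loss, so the minority retreatment class is up-weighted by roughly a factor of three relative to initial treatment.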
Results
Univariate Analysis of Indicators Related to Tuberculosis Retreatment
This study employed the Chi-square test to conduct a univariate analysis of factors associated with retreatment among tuberculosis patients. The results are presented in Table 1. The following variables were significantly associated with tuberculosis retreatment (all P-values < 0.05): Sex, Patient Source, HIV, Diagnosis Typing, Diagnostic Result, Complication, Treatment Classification, Medication Management, Treatment Outcome, Treatment Mode, Sputum examination at month 0, Sputum examination at month 2, Sputum examination at month 5, Sputum Culture, Radiological Results, Molecular Biology, Treatment Scheme, Etiological Results, Sputum examination at month 6 or 8, Drug Sensitivity Test, Age, Bacteria Identification.
Table 1 Baseline Characteristics and Clinical Treatment Variables of Tuberculosis Patients, Stratified by Initial Treatment Vs. Retreatment
Feature Selection with Random Forest
Due to the large number of independent variables, feature screening was conducted based on their importance scores. Features with relatively low mean decrease in accuracy were excluded. Specifically, seven variables—gender, imaging findings, HIV test results, treatment modality, patient origin, etiological findings, and medication management—were removed.
Selecting key features helps reduce noise from irrelevant variables and lowers the computational complexity of the XGBoost model. Moreover, an excessive number of features may lead to overfitting or slower training in XGBoost. By pre-screening with Random Forest (see Figure 2), the most discriminative features were retained, with the aim of improving the generalization ability of the subsequent model.
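The idea behind ranking features by mean decrease in accuracy can be sketched as a permutation test: shuffle one feature at a time and record how much accuracy falls. The example below is a simplified, pure-Python illustration on hypothetical toy data; it is not the out-of-bag procedure of the randomForest package, and the names `permutation_importance`, `rows`, and `predict` are assumptions introduced here.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# HYPOTHETICAL toy data: feature 0 determines the label, feature 1 is noise.
rows = [(i % 2, (i // 2) % 2) for i in range(40)]
labels = [r[0] for r in rows]


def predict(row):
    """Stand-in for a fitted classifier that relies only on feature 0."""
    return row[0]


importance = permutation_importance(predict, rows, labels)
print(importance)
```

A feature the model never uses shows no accuracy loss when shuffled, which is the rationale for excluding low-scoring variables before the XGBoost step.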
Figure 2 Feature Importance from the Random Forest Model.
Cramér’s V
Highly correlated features can cause XGBoost to repeatedly learn similar information, thereby increasing the risk of overfitting. Furthermore, high correlation can dilute the importance scores among the features. Finally, highly correlated features may introduce bias in gradient updates, affecting the optimization path of the gradient boosting process.
Since the independent variables in this study are categorical, we assessed the correlations between them using Cramér’s V coefficient. For pairs of variables exhibiting high correlation, the feature with the lower importance score was excluded (see Figure 3). Consequently, five variables were removed: diagnostic result, sputum examination at month 5, molecular biology test, drug susceptibility test, and strain identification.
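For reference, Cramér’s V is defined as V = sqrt(χ² / (n · (min(r, c) − 1))), where χ² is the Pearson chi-square statistic of the r × c contingency table and n is the sample size. A minimal pure-Python sketch follows; it uses the uncorrected form (the text does not say whether a bias correction was applied), and the variable names are hypothetical.

```python
import math
from collections import Counter


def cramers_v(x, y):
    """Uncorrected Cramér's V between two categorical sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    row_totals = Counter(x)
    col_totals = Counter(y)

    # Pearson chi-square: sum over all cells of (observed - expected)^2 / expected.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


# HYPOTHETICAL toy variables illustrating the two extremes.
smear = ["neg", "pos"] * 10
duplicate_of_smear = list(smear)          # perfectly associated
unrelated = ["a", "a", "b", "b"] * 5      # independent of smear

print(cramers_v(smear, duplicate_of_smear))  # perfect association
print(cramers_v(smear, unrelated))           # no association
```

Values near 1 flag pairs that carry largely the same information, which is the situation in which the lower-importance member of the pair was dropped.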
Figure 3 Cramér’s V Correlation Matrix.
The Performance of the XGBoost Model
Based on Table 2 and Figure 4, the model demonstrated outstanding performance in identifying “initial treatment” cases, achieving a precision of 0.98, a recall of 0.81, and an F1-score of 0.89. This indicates that the model maintains a low misdiagnosis rate while successfully identifying 81% of patients requiring initial treatment.
Table 2 Performance Metrics of the XGBoost Model for Distinguishing Initial Treatment from Retreatment Tuberculosis Patients
Figure 4 Confusion Matrix.
For “retreatment” cases, the model achieved a recall of 0.95, demonstrating high sensitivity in detecting actual retreatment patients. This high recall effectively minimizes the risk of delays in initiating second-line or salvage therapy due to missed diagnoses. Although the precision was 0.61, suggesting that some initial treatment cases might be misclassified as retreatment, such instances can be further addressed in clinical practice through secondary evaluations or supplemental examinations. Therefore, after careful consideration of the trade-offs, the model maintains a high level of safety and practical utility.
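The reported precision, recall, and accuracy figures are linked arithmetically, and the link can be checked with a few lines. The sketch below assumes that the evaluation data had the same class proportions as the full sample (4,430 initial treatment and 1,396 retreatment cases); the text does not state the evaluation split, so this is a consistency check under that assumption, not a reproduction of Table 2.

```python
# Class sizes from the Data Source section and recalls reported in the text.
n_initial, n_retreat = 4430, 1396
recall_initial, recall_retreat = 0.81, 0.95

# ASSUMPTION: the evaluation set mirrors the full-sample class proportions.
p_initial = n_initial / (n_initial + n_retreat)
p_retreat = 1 - p_initial

# Fractions of all patients falling in each confusion-matrix cell.
initial_as_initial = p_initial * recall_initial
initial_as_retreat = p_initial - initial_as_initial
retreat_as_retreat = p_retreat * recall_retreat
retreat_as_initial = p_retreat - retreat_as_retreat

accuracy = initial_as_initial + retreat_as_retreat
precision_initial = initial_as_initial / (initial_as_initial + retreat_as_initial)
precision_retreat = retreat_as_retreat / (retreat_as_retreat + initial_as_retreat)
f1_initial = (2 * precision_initial * recall_initial
              / (precision_initial + recall_initial))

print(f"accuracy          = {accuracy:.2f}")
print(f"precision initial = {precision_initial:.2f}")
print(f"precision retreat = {precision_retreat:.2f}")
print(f"F1 initial        = {f1_initial:.2f}")
```

Under this assumption the two recalls alone imply the reported accuracy and both precisions, which shows that the modest retreatment precision follows from the larger initial treatment class contributing most of the false positives.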
The model achieved an area under the ROC curve (AUC) of 0.938 and an average precision (AP) of 0.741. Both metrics are substantially higher than the corresponding random-classifier baselines, supporting the model’s robustness across different classification thresholds and its strong ability to distinguish between the two classes (see Figure 5).
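The comparison against a random classifier can be made explicit. A random classifier has an expected ROC-AUC of 0.5, and its expected average precision is approximately the prevalence of the positive class. The sketch below assumes that retreatment is the positive class and that the evaluation data share the full-sample prevalence; neither is stated in the text.

```python
# Class sizes from the Data Source section.
n_initial, n_retreat = 4430, 1396

# ASSUMPTIONS: retreatment is the positive class, and its prevalence in the
# evaluation data equals its prevalence in the full sample.
roc_baseline = 0.5
pr_baseline = n_retreat / (n_initial + n_retreat)

reported_auc, reported_ap = 0.938, 0.741
ap_lift = reported_ap / pr_baseline

print(f"AUC {reported_auc} vs random baseline {roc_baseline}")
print(f"AP  {reported_ap} vs random baseline {pr_baseline:.3f} (lift {ap_lift:.2f}x)")
```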
Figure 5 (a) Receiver Operating Characteristic curve (b) Precision-Recall curve.
Feature Importance and SHAP Analysis of the XGBoost Model
Figure 6 summarizes the XGBoost feature importance and SHAP value analysis used to characterize individuals at high risk of tuberculosis retreatment.
Figure 6 Comparison of global feature importance metrics for the XGBoost model: (a) Gain-based importance versus mean |SHAP value|, and (b) Top features based on Gain-based importance.
In Figure 7 and Table 3, the SHAP values are based on the predicted probability of the “initial treatment” outcome. A positive SHAP value indicates that the feature increases the model’s prediction of “initial treatment” (i.e., a protective factor against retreatment), while a negative SHAP value indicates that the feature increases the prediction of “retreatment” (i.e., a risk factor for retreatment).
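The sign convention described above follows from how Shapley values are defined: each feature receives its average marginal contribution to the prediction, and the contributions sum to the difference between the full prediction and the baseline. The brute-force sketch below illustrates this on a hypothetical three-feature toy model; the feature names and effect sizes are invented for illustration, and the study itself would have relied on the SHAP library rather than on this enumeration.

```python
from itertools import permutations


def shapley_values(value, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        known = frozenset()
        for f in order:
            totals[f] += value(known | {f}) - value(known)
            known = known | {f}
    return {f: t / len(orderings) for f, t in totals.items()}


def toy_model(known):
    """HYPOTHETICAL predicted probability of 'initial treatment'."""
    p = 0.50                                   # baseline with no features known
    if "sputum_negative_m6" in known:
        p += 0.30                              # pushes toward initial treatment
    if "long_regimen" in known:
        p -= 0.20                              # pushes toward retreatment
    if "sputum_negative_m6" in known and "pulmonary_dx" in known:
        p += 0.10                              # interaction shared by both features
    return p


features = ["sputum_negative_m6", "long_regimen", "pulmonary_dx"]
phi = shapley_values(toy_model, features)

for name, contribution in phi.items():
    direction = "protective" if contribution > 0 else "risk"
    print(f"{name}: {contribution:+.2f} ({direction})")
```

In this toy case the negative contribution of the long regimen corresponds to a risk factor under the convention used in Figure 7 and Table 3, and the interaction term is split evenly between the two features involved.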
It was found that the intrinsic feature importance metric (Gain) of the XGBoost model showed high concordance with the ranking based on SHAP values. This indicates a consensus between the two methods in identifying critical variables. The alignment in ranking for specific features suggests a stable and reliable assessment of their importance by the XGBoost model, implying that their significance is not an artifact of the evaluation method and is thus highly credible.
Notably, three features were consistently identified as core decision factors influencing the “initial treatment” outcome: 1) sputum examination at month 6 or 8, 2) treatment regimen, and 3) diagnostic classification (see Figures 6 and 7 and Table 3).
Logistic Regression Analysis for Initial Treatment versus Retreatment in Tuberculosis
Multivariable logistic regression analysis identified several factors significantly associated with the risk of tuberculosis retreatment. Age demonstrated a dose-response relationship with increasing risk (45–54 years: OR = 2.17; 55–64 years: OR = 3.25; ≥65 years: OR = 4.09). A diagnosis of secondary tuberculosis was strongly associated with higher odds (OR = 5.16), as were various complications (OR range: 2.52–3.71). The strongest predictor was sputum smear status at treatment completion, with “not performed” (OR = 123.47) and “positive” (OR = 14.89) conferring substantially elevated risks. Specific treatment regimens (2HRZE/4HR: OR = 2.85) were also risk factors. Conversely, treatment outcome of death (OR = 0.40), positive initial sputum smear (OR = 0.63), “not performed” sputum smear at month 2 (OR = 0.32), and “not performed” sputum culture (OR = 0.73) were significantly associated with reduced odds of retreatment. (Please refer to Figure 8 and Table 4 for details).
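To help read the odds ratios above, recall that a logistic regression coefficient β corresponds to OR = exp(β), and that an odds ratio multiplies the odds, not the probability. The sketch below converts the two reported month 6 or 8 odds ratios into log-odds coefficients and shows their effect on a hypothetical baseline probability of 0.10; that baseline is an illustrative assumption and is not taken from the study.

```python
import math


def log_odds_coefficient(odds_ratio):
    """Logistic regression coefficient implied by an odds ratio."""
    return math.log(odds_ratio)


def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability obtained after multiplying the baseline odds by the OR."""
    odds = baseline_prob / (1 - baseline_prob) * odds_ratio
    return odds / (1 + odds)


# HYPOTHETICAL baseline probability of retreatment in the reference group.
baseline = 0.10

# Odds ratios reported in the text for sputum smear status at month 6 or 8.
reported = {"positive": 14.89, "not performed": 123.47}

for status, odds_ratio in reported.items():
    beta = log_odds_coefficient(odds_ratio)
    prob = apply_odds_ratio(baseline, odds_ratio)
    print(f"{status}: beta = {beta:.2f}, probability {baseline:.2f} -> {prob:.2f}")
```

Because the probability change depends on the baseline, the same odds ratio implies a smaller absolute shift when the reference-group risk is already very low or very high.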
Figure 8 Forest Plot of Risk Factors for Tuberculosis Retreatment.
Classification of Tuberculosis Patients
Latent class analysis identified the optimal subgroup classification for tuberculosis patients. As shown in Table 5 and Figure 9, both AIC and BIC values decreased continuously as the number of latent classes increased. To avoid excessive model complexity and overfitting, a maximum of five subgroups was set for this study. The latent class analysis revealed significant differences in all clinical characteristics across the subgroups of tuberculosis patients (all p-values < 0.001). Although the five-class model showed slightly better absolute fit indices, the four-class model was selected as the optimal and more pragmatic choice due to its higher classification clarity (entropy), better parsimony, and close alignment with predefined clinical subgroups.
Specifically, Class 1 (Treatment Failure Type) was characterized by poor treatment outcomes, with significantly higher mortality (43.5%) and transition to MDR-TB treatment regimens (26.6%) compared to other classes, along with an extremely high rate of unexamined sputum tests during treatment (63.4% at 2 months, 98.1% at treatment completion). Class 2 (Treatment Success Type) was primarily defined by high success rates under standard treatment regimens (97.9% received the 2HRZE/4HR regimen), achieving a 99.8% sputum culture conversion rate at 2 months and with most patients (98.6%) cured or completing treatment. Class 3 (Diabetes Comorbidity Type) was distinguished by a high proportion of diabetes comorbidity (49.3%), with most patients receiving long-term treatment regimens (81.0% on the 2HRZE/10HRE regimen) and exhibiting diverse treatment outcomes. Class 4 (High Infectivity with Rapid Control Type) presented with high pre-treatment sputum bacterial load (83.1% sputum culture positivity) but responded well to standard treatment regimens (96.2% on the 2HRZE/4HR regimen), achieving a 98.7% sputum culture conversion rate at 2 months and a high cure rate of 90.3%, demonstrating rapid and effective disease control (see Table 6).
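The fit indices used to choose the number of classes can be written down compactly: AIC = −2·logL + 2k and BIC = −2·logL + k·ln(N), where k is the number of estimated parameters and N the sample size. A common normalized entropy for latent class models is 1 − Σ(−p·ln p) / (N·ln K), computed from the posterior class probabilities. The sketch below implements these definitions on hypothetical inputs; it assumes this entropy form, since the text only states that entropy ranges from 0 to 1, and none of the numbers are taken from Table 5.

```python
import math


def aic(log_lik, n_params):
    """Akaike Information Criterion."""
    return -2 * log_lik + 2 * n_params


def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion."""
    return -2 * log_lik + n_params * math.log(n_obs)


def relative_entropy(posteriors):
    """Normalized entropy in [0, 1]; 1 means perfectly separated classes.

    `posteriors` holds one row per patient with the posterior probability of
    membership in each latent class (ASSUMED definition, see lead-in).
    """
    n_obs = len(posteriors)
    n_classes = len(posteriors[0])
    uncertainty = -sum(p * math.log(p) for row in posteriors for p in row if p > 0)
    return 1 - uncertainty / (n_obs * math.log(n_classes))


# HYPOTHETICAL posterior probabilities for three patients and two classes.
crisp = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
fuzzy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

print(relative_entropy(crisp))   # well separated classes
print(relative_entropy(fuzzy))   # completely ambiguous assignment
print(aic(-100.0, 5), bic(-100.0, 5, 50))
```

Because BIC penalizes each extra parameter by ln(N) rather than 2, it favors smaller models than AIC in large samples, which is one reason to weigh entropy and interpretability when both indices keep decreasing.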
Table 5 Model Fit Indices (AIC, BIC, Log-Likelihood) and Entropy for Latent Class Analysis with 2 to 5 Classes Among Tuberculosis Patients
Table 6 Comparison of Clinical Characteristics Across Four Latent Classes of Tuberculosis Patients
Figure 9 Trends in Model Fit Indices with Increasing Number of Latent Classes in Latent Class Analysis.
Discussion
Retreatment of tuberculosis is a focal point in tuberculosis prevention and control. Analyzing and identifying high-risk factors for retreatment provides a scientific basis for precision prevention and control strategies, which holds significant importance for tuberculosis containment. Given the retrospective observational design of this study, all findings should be interpreted as associations rather than causal relationships. The identified risk factors are predictive indicators of tuberculosis retreatment, but causality cannot be inferred from this study design.
Figures 4 and 5 and Table 2 show that the overall accuracy of the model reaches 84%, with notable performance differences between categories. The model identifies newly diagnosed tuberculosis (TB) patients with a precision of 0.98 and a recall of 0.81. For retreatment patients, recall is high (0.95) but precision is lower (0.61), meaning that about 39% of patients flagged as retreatment are in fact initial treatment cases. This discrepancy is primarily attributed to imbalanced data distribution, obscured subgroup heterogeneity, and biases in the weighting of dynamic monitoring indicators.17 The area under the ROC curve (AUC) is 0.938, and the area under the Precision-Recall curve (AP) is 0.741, both well above the values expected from a random classifier. This supports the model’s robustness across different thresholds and its discriminative ability.
Results from Figure 6 reveal that the XGBoost model, combined with SHAP (SHapley Additive exPlanations), identifies sputum smear results at the 6th or 8th month of treatment, the treatment regimen, and diagnostic type as key influencing factors. Among these, sputum smear results at the 6th or 8th month rank first in importance in both methods. This importance stems from the fact that sputum conversion in recurrent TB patients is typically slower or remains persistently positive, consistent with findings from existing studies.18 The treatment regimen, as a high-risk influencing factor, not only reflects differences in drug combinations but also implies complex clinical information such as history of prior treatment failure, potential drug resistance, patient adherence, and disease severity.19 Diagnostic type serves as a risk factor by affecting potential drug resistance risk, disease severity, and the patient’s immune baseline, thus indirectly influencing recurrence risk.20
Figure 7 and Table 3 further reveal the non-linear impact of these key features on the model’s predictions. The sputum smear result at the 6th or 8th month of treatment has a decisive influence on the model’s prediction. A “negative” result is a significant protective factor (mean SHAP value = 0.230), while “not examined” (mean SHAP value = −0.357) and “positive” (mean SHAP value = −0.433) are both associated with a very high risk of adverse outcomes. This suggests that standardized sputum examination during the mid-to-late stages of treatment is a critical step in assessing prognosis.21 In contrast, sputum smear results at the initial stage of treatment have minimal impact, indicating that the model places greater emphasis on continuous monitoring rather than a single baseline measurement.22 Regarding treatment regimens, the standard “2HRZE/4HR” regimen (mean SHAP value = 0.016) shows a contribution value close to zero and has the largest sample size, reflecting its widespread effectiveness. In contrast, longer regimens for complex cases, such as “2HRZE/7-10HRE” (mean SHAP value = −0.205), are significantly associated with higher risk, primarily because the population receiving such regimens inherently carries a higher baseline risk. For diagnostic types, “tuberculous pleurisy” (mean SHAP value = 0.2) shows a higher risk contribution than the more common “secondary pulmonary tuberculosis.” This subtype may represent a patient subgroup with unique clinical characteristics (e.g., extrapulmonary lesions, delayed diagnosis), and its independent value as a risk indicator warrants attention.23
Logistic regression analysis further quantifies the strength of association between key factors and recurrence risk, providing intuitive validation of the core role of treatment monitoring. Using sputum smear “negative” at the 6th or 8th month of treatment as the reference, the recurrence risk for patients with sputum “not examined” surges (OR=123.47), and the risk for those with sputum “positive” also increases significantly (OR=14.89). This highlights the decisive impact of obtaining bacteriological evidence during the mid-to-late stages of treatment on prognosis.24,25 Increasing age shows a clear dose-response relationship with recurrence risk. Using the 0–24 age group as reference, risk begins to rise significantly from the 45–54 age group (OR=2.17) and peaks in the 65+ age group (OR=4.09), suggesting that enhanced monitoring and individualized management are needed for elderly patients.26 Analysis of treatment regimens reveals significant clinical selection bias. Using the “2HRZE/10HRE” regimen as the reference group, the standard “2HRZE/4HR” regimen shows a stronger association with higher risk (OR=2.85). This result does not indicate inferior efficacy of the standard regimen but strongly reflects its role as a first-line therapy applied to a broader patient population. In contrast, the risk association for longer regimens like “2HRZE/7-10HRE” did not reach statistical significance, possibly due to their use for more complex cases with smaller sample sizes.27
Several findings that appear protective warrant cautious interpretation. First, the negative association between death and retreatment (OR = 0.40) reflects a competing risk phenomenon rather than a true protective effect: patients who die during treatment are no longer at risk for retreatment. Second, the lower retreatment risk associated with positive baseline sputum smear (OR = 0.63) may seem counterintuitive but likely results from enhanced clinical monitoring and management of patients with higher initial bacterial load, rather than a biological protective effect. Third, the “protective” associations for “not performed” sputum examination at month 2 (OR = 0.32) and “not performed” sputum culture (OR = 0.73) should be interpreted with caution; these may be attributable to selection bias (patients with favorable clinical response were less likely to be tested) or coding artifacts (e.g., “not performed” may include patients who completed treatment early and were thus no longer under monitoring). Given the retrospective design, these associations should not be misinterpreted as causal protective effects.
This study identified four patient subgroups through Latent Class Analysis (LCA): treatment-failure patients (Class 1), treatment-success patients (Class 2), patients with comorbid diabetes (Class 3), and highly infectious but rapidly controlled patients (Class 4). Classes 2 and 4 represent low-risk groups. Their characteristics (overall high treatment success rate, minimal comorbidities, excellent sputum conversion rate) exemplify the successful paradigm of the current TB control system and support the effectiveness of standardized management.28 Identifying these groups allows for more efficient allocation of public health resources rather than a one-size-fits-all approach. Conversely, Classes 1 and 3 constitute high-risk groups, revealing the clinical dilemma that “drug susceptibility does not guarantee treatment success” and the independent impact of comorbid diabetes on treatment outcomes. Patients with comorbid diabetes highlight the critical role of comorbidity management in TB treatment.29 Even with drug-susceptible strains, the presence of diabetes may complicate treatment independently, possibly by affecting immune response or drug metabolism. Treatment-failure patients (Class 1) reflect weaknesses in treatment management (slower sputum conversion), systematic gaps in treatment monitoring (extremely high rates of “not examined” at various time points), and a vicious cycle of treatment failure.30
This study finds that treatment adherence and efficacy monitoring are key to success or failure. Both LCA and SHAP analysis indicate that the “not examined” status is highly associated with recurrence. This strongly suggests that failure to complete sputum smear examinations at critical time points is itself an independent, more prevalent, and more alarming risk signal than a positive smear result. This directly guides us to prioritize ensuring the completion rate of sputum examinations at key time points as a core performance indicator for assessing healthcare system quality and optimizing patient management strategies. The high proportion of drug resistance in Class 4 identified by LCA corroborates the extremely high risk associated with “rifampicin resistance” and “multidrug resistance” in the logistic regression analysis. The root cause of all these issues points to failures in the management of initial treatment. Therefore, strengthening supervised management for newly diagnosed patients, ensuring they complete the full course of standardized treatment, is the most cost-effective and fundamental strategy for preventing acquired drug resistance. Advanced age and the use of longer treatment regimens indicate more complex patient conditions or the presence of initial drug resistance, alerting healthcare providers to initiate more stringent management and monitoring processes for such patients.
Several limitations of this study should be acknowledged. First, the retrospective observational design precludes causal inference; all identified associations are predictive rather than causal. Second, data were collected from a single region (Kashgar, China), which may limit generalizability to other populations with different epidemiological profiles. Third, the “not performed” category in sputum examinations may introduce information bias, as the reasons for missing tests (e.g., clinical improvement, loss to follow-up, or logistical issues) could not be determined from the available data. Fourth, despite the use of sample weights to address class imbalance, the model’s precision for retreatment cases remained moderate (0.61), suggesting that additional predictors not captured in this dataset may further improve classification. Fifth, unmeasured confounders (e.g., socioeconomic status, patient adherence beyond recorded data, HIV status with CD4 counts) may influence both treatment outcomes and retreatment risk. Future prospective studies with standardized follow-up protocols and larger sample sizes are needed to validate these findings.
Conclusions
This study utilized clinical data from tuberculosis patients and applied Random Forest and Cramér’s V for feature selection. An XGBoost predictive model was constructed, and SHAP was employed to interpret feature contributions. The research identified several factors associated with tuberculosis retreatment. The methodology adopted in this study demonstrates potential clinical value: it may help address nonlinear relationships and sample imbalance in tuberculosis data, and SHAP improves model interpretability. Given the retrospective observational design, these findings should be interpreted as associations rather than causal relationships. Based on these associations, the following considerations may inform tuberculosis control programs: optimizing treatment for patients with comorbid diabetes, enhancing treatment adherence and efficacy monitoring for newly diagnosed patients, strengthening whole-course supervised management, and implementing further optimized management for elderly patients and those on long-term treatment regimens.
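As an illustration of the Cramér’s V statistic mentioned above, the sketch below computes the uncorrected form, V = sqrt(χ² / (n · min(r − 1, c − 1))), for two categorical variables. It is a minimal sketch using only the standard library; the function name and the toy data are hypothetical, and the study may have used a different implementation or a bias-corrected variant.

```python
import math
from collections import Counter

def cramers_v(x, y):
    """Uncorrected Cramér's V between two categorical sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    row_totals = Counter(x)
    col_totals = Counter(y)
    observed = Counter(zip(x, y))  # missing cells count as 0
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))

# Hypothetical toy data, not taken from the study.
sputum = ["not performed", "negative", "negative",
          "not performed", "positive", "negative"]
retreated = [1, 0, 0, 1, 1, 0]
v = cramers_v(sputum, retreated)
```

V ranges from 0 (no association) to 1 (one variable fully determines the other), which makes it usable for ranking categorical predictors against a binary outcome regardless of how many levels each predictor has.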
Abbreviations
AIC, Akaike Information Criterion; AP, Average Precision; AUC, Area Under the ROC Curve; BIC, Bayesian Information Criterion; LCA, Latent Class Analysis; MDR-TB, Multidrug-Resistant Tuberculosis; OR, Odds Ratio; PR Curve, Precision-Recall Curve; RF, Random Forest; ROC Curve, Receiver Operating Characteristic Curve; SHAP, SHapley Additive exPlanations; TB, Tuberculosis; XGBoost, eXtreme Gradient Boosting.
Ethical Approval and Consent to Participate
This retrospective observational study was conducted in accordance with the principles of the Declaration of Helsinki. The study protocol was approved by the Ethics Committee of Xinjiang Medical University (Approval No. XJYKDXR20240724011). The requirement for informed consent was waived by the ethics committee because: (1) this study utilized only existing, fully de-identified historical medical records and did not involve any patient intervention; (2) obtaining consent from each patient was impracticable and would have rendered the retrospective study unfeasible; and (3) the research posed no more than minimal risk to participants. All data were handled with strict confidentiality, and researchers had no access to personally identifiable information.
Acknowledgments
The authors thank the Kashgar CDC for its work.
Funding
This study was funded by the Project of Top-notch Talents of Technological Youth of Xinjiang (Grant No. 2024TSYCCX0080), the National Natural Science Foundation of China (72174175, 72064036, 72163033), and the College Student Innovation and Entrepreneurship Training Program (Grant No. S202510760111).
Disclosure
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
© 2026 The Author(s). This work is published and licensed by Dove Medical Press Limited. The full terms of this license are available at https://www.dovepress.com/terms and incorporate the Creative Commons Attribution - Non Commercial (unported, 4.0) License. By accessing the work you hereby accept the Terms. Non-commercial uses of the work are permitted without any further permission from Dove Medical Press Limited, provided the work is properly attributed. For permission for commercial use of this work, please see paragraphs 4.2 and 5 of our Terms.
