Review
Int J Environ Res Public Health. 2022 Dec 1;19(23):16080.
doi: 10.3390/ijerph192316080.

Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series

Liangyuan Hu et al. Int J Environ Res Public Health. 2022.

Abstract

Tree-based machine learning methods have gained traction in the statistical and data science fields, where they have been shown to provide better solutions to various research questions than traditional analysis approaches. To encourage the uptake of tree-based methods in health research, we review the methodological fundamentals of three key tree-based machine learning methods: random forests, extreme gradient boosting (XGBoost), and Bayesian additive regression trees (BART). We then conduct a series of case studies to illustrate how these methods can be properly used to address important health research problems in four domains: variable selection, estimation of causal effects, propensity score weighting, and missing data. We explain that the central idea behind using ensemble tree methods for these research questions is accurate prediction via flexible modeling. First, we apply ensemble tree methods to select important predictors of postoperative respiratory complications among early-stage lung cancer patients with resectable tumors. Second, we demonstrate how to use these methods to estimate the causal effects of popular surgical approaches on postoperative respiratory complications among lung cancer patients. Third, using the same data, we implement the methods to accurately estimate the inverse probability weights for a propensity score analysis of the comparative effectiveness of the surgical approaches. Finally, we demonstrate how random forests can be used to impute missing data, using the Study of Women's Health Across the Nation data set. We conclude that tree-based methods are a flexible tool and should be properly used for health investigations.

Keywords: causal inference; ensemble methods; missing data; sensitivity analysis; variable selection.
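The propensity score weighting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that some fitted model (for example a random forest, XGBoost, or BART classifier) has already produced, for each patient, an estimated probability of receiving each treatment; the weight is then the inverse of the estimated probability of the treatment actually received. The function name `ipt_weights`, the `stabilize` option, and the toy probabilities are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def ipt_weights(
    probs: Sequence[Sequence[float]],
    treatment: Sequence[int],
    stabilize: bool = False,
) -> List[float]:
    """Inverse probability of treatment weights.

    probs[i][k] is the estimated probability that unit i receives
    treatment k (from any fitted propensity model); treatment[i] is the
    index of the treatment unit i actually received.
    """
    n = len(treatment)
    if len(probs) != n:
        raise ValueError("probs and treatment must have the same length")
    n_arms = len(probs[0])
    # Marginal treatment frequencies, used only for stabilized weights.
    marginal = [sum(1 for a in treatment if a == k) / n for k in range(n_arms)]

    weights = []
    for p_row, a in zip(probs, treatment):
        p = p_row[a]
        if not 0.0 < p <= 1.0:
            raise ValueError("estimated probabilities must lie in (0, 1]")
        w = 1.0 / p  # inverse of P(received treatment | covariates)
        if stabilize:
            w *= marginal[a]  # stabilized weight: P(A = a) / P(A = a | X)
        weights.append(w)
    return weights


# Toy example with two treatments and four units (made-up numbers).
probs = [[0.8, 0.2], [0.5, 0.5], [0.25, 0.75], [0.1, 0.9]]
treatment = [0, 1, 1, 0]
weights = ipt_weights(probs, treatment)
stabilized = ipt_weights(probs, treatment, stabilize=True)
```

In this toy example the last unit received a treatment that the model considered unlikely (probability 0.1), so it gets the largest weight. Extreme weights of this kind are what a comparison of weight distributions across estimation methods is meant to reveal.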

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
An illustrative classification tree diagram. Y indicates a case and N indicates a non-case.
Figure 2
Visualization of the BART variable selection algorithm. The vertical lines are the threshold levels determined from the “null” distributions of variable inclusion proportions computed from 100 permuted data sets. Variable inclusion proportions from the original (unpermuted) data that pass this threshold are displayed as solid dots. Open dots correspond to variables that are not selected.
Figure 3
Distributions of the inverse probability of treatment weights estimated by BART, random forest, and XGBoost.
Figure 4
A comparison of the distributions of values for total hip bone mineral density and total spine bone mineral density among the imputed values and among the complete cases.
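
The permutation-threshold idea described in the Figure 2 caption can be sketched in a few lines. This is a simplified illustration, not the BART procedure: the absolute Pearson correlation between each variable and the outcome stands in for the BART variable inclusion proportion, and the choice of the 95th percentile of each variable's null distribution as the threshold is an assumption. The function names `abs_corr` and `permutation_select` and the simulated data are hypothetical.

```python
import random


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_select(columns, y, n_perm=100, alpha=0.05, seed=0):
    """Select variables whose observed score exceeds a permutation-null threshold."""
    rng = random.Random(seed)
    observed = [abs_corr(col, y) for col in columns]

    # Build a null distribution per variable by permuting the outcome,
    # which breaks any real association between predictors and outcome.
    null = [[] for _ in columns]
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)
        for j, col in enumerate(columns):
            null[j].append(abs_corr(col, y_perm))

    # Threshold: approximate (1 - alpha) quantile of each null distribution.
    idx = min(n_perm - 1, round((1 - alpha) * n_perm))
    thresholds = [sorted(scores)[idx] for scores in null]

    selected = [j for j, (o, t) in enumerate(zip(observed, thresholds)) if o > t]
    return observed, thresholds, selected


# Simulated data: only the first variable is related to the outcome.
data_rng = random.Random(1)
n = 200
x0 = [data_rng.gauss(0, 1) for _ in range(n)]
x1 = [data_rng.gauss(0, 1) for _ in range(n)]
x2 = [data_rng.gauss(0, 1) for _ in range(n)]
y = [2 * a + data_rng.gauss(0, 1) for a in x0]

observed, thresholds, selected = permutation_select([x0, x1, x2], y)
```

Here `selected` holds the indices of variables whose observed score passes its threshold, which corresponds to the solid dots in the figure. Pure-noise variables can still occasionally pass by chance at roughly the rate `alpha`.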

