
PP11-82: Going Upstream in Machine Learning; The Importance of Feature Engineering With Examples From a Surgical ICU

Poster Presenter

      Andrew Wilson

      • Head of Innovative RWD Analytics
      • Parexel
        United States


To evaluate the impact of feature selection, data processing decisions, and algorithm selection on the performance of prediction models using open-source machine learning tools in Python and R.


EHR data from 5,101 surgical ICU patients were used to predict the development of hospital-acquired pressure injury (HAPI). Performance (AUC, F1-score) was evaluated across variations in imputation method, minority-class oversampling, and algorithm (e.g., neural network, AdaBoost, or random forest).
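As an illustration of one such data processing decision, the following is a minimal Python sketch of single-value imputation paired with a missingness indicator. It is not the study's code: the median fill is an assumption standing in for the random forest imputation that was actually used, and the function name `impute_with_indicator`, the `albumin` column, and the lab values are all hypothetical.

```python
from statistics import median

def impute_with_indicator(rows, column):
    """Fill missing values in `column` with the observed median and
    add a 0/1 flag recording which rows were originally missing."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)  # assumption: median stands in for model-based imputation
    out = []
    for r in rows:
        missing = r[column] is None
        new_row = dict(r)  # copy so the input rows are left untouched
        new_row[column] = fill if missing else r[column]
        new_row[column + "_missing"] = 1 if missing else 0
        out.append(new_row)
    return out

# Hypothetical lab values; None marks a test that was never recorded.
patients = [
    {"albumin": 3.1},
    {"albumin": None},
    {"albumin": 2.4},
    {"albumin": 4.0},
]
imputed = impute_with_indicator(patients, "albumin")
```

Keeping the flag as its own column lets a downstream model learn from the fact that a value was absent, separately from the value that was filled in.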


HAPI occurred in 333 (6.5%) of the 5,101 surgical ICU patients. Model performance, as measured by the area under the ROC curve (AUC), F1-score, and the confusion matrix, was similar across models using different subsets of the available variables as predictive features. Subsets ranged from the full feature set (k=33) to a parsimonious one (k=5). Models also performed similarly whether they used case-wise deletion or single-value random forest imputation. Models that included an indicator of missingness improved modestly, particularly for lab values, which suggests that missingness was itself informative and could be passed to the predictive models as a feature. The most impactful variation was the approach to imbalanced data: without adjustment for the imbalance, models tended to predict mostly zeroes, which produced misleadingly optimistic ROC performance relative to their accuracy in predicting the minority class. The synthetic minority oversampling technique (SMOTE) improved accuracy in predicting the minority class in both the full and parsimonious feature sets. After SMOTE was applied, all algorithms (neural network, random forest, logistic regression, adaptive boosting (AdaBoost), and extreme gradient boosting) performed similarly to one another, and decidedly better than models without class-balance adjustment. Overall, the best-performing algorithms were the logistic regression models.
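The core idea behind SMOTE, interpolating between a minority-class sample and one of its nearest minority-class neighbours, can be sketched in a few lines of standard-library Python. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function name `smote_like`, its parameters, and the two-feature toy points are hypothetical, and in practice a maintained library implementation would normally be preferred.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each placed on the line segment
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to partner
        synthetic.append(
            tuple(b + gap * (q - b) for b, q in zip(base, partner))
        )
    return synthetic

# Hypothetical minority-class points with two numeric features each.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_like(minority, n_new=6)
```

Because every synthetic point lies between two real minority samples, oversampling this way adds variety without simply duplicating rows. It should be applied to the training split only, so that evaluation data remain untouched.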


These results demonstrate the importance of adjusting for class imbalance in classification problems where the imbalance is strong. The similarity of performance across algorithms also supports the theory that more, and higher-quality, data matter more than the choice of algorithm, and it reemphasizes the importance of upstream data retention and inclusion. Advances in machine learning may lie upstream, in retaining data for use (e.g., through deep feature synthesis), rather than downstream, in more advanced analytics of flattened files. Furthermore, in this case study, the robust performance of the highly parsimonious set of variables suggests that it may be possible to develop more focused guidance for clinicians and caregivers, with the goal of reducing caregiver burden without compromising care.
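The class-imbalance point can be made concrete with the cohort's own counts. The sketch below is a worked arithmetic example, not a result reported in the abstract: it scores a hypothetical model that predicts "no HAPI" for every one of the 5,101 patients. Treating precision as zero when no positives are predicted is a convention assumed here.

```python
n_total = 5101  # surgical ICU patients in the cohort
n_pos = 333     # patients who developed HAPI

# Confusion matrix for a hypothetical model that always predicts "no HAPI".
tp, fn = 0, n_pos
tn, fp = n_total - n_pos, 0

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)
# Assumed convention: precision is 0 when nothing is predicted positive.
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = (
    2 * precision * recall / (precision + recall)
    if (precision + recall)
    else 0.0
)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.935 recall=0.000 f1=0.000
```

An overall accuracy near 93.5% here coexists with zero detected cases, which is why minority-class metrics such as recall and F1-score need to be read alongside any headline figure.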