
PP11-82: Going Upstream in Machine Learning: The Importance of Feature Engineering With Examples From a Surgical ICU

Poster Presenter

Andrew Wilson
• Head of Innovative RWD Analytics
• Parexel, United States

Objectives

To evaluate the impact of feature selection, data processing decisions, and algorithm selection on the performance of prediction models using open-source machine learning tools in Python and R.

Method

EHR data from 5,101 surgical ICU patients were used to predict hospital-acquired pressure injury (HAPI) development. Performance (AUC, F1 score) was evaluated across variations in imputation method, minority-class oversampling, and applied algorithm (e.g., neural network, AdaBoost, or random forest).
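The evaluation grid described above can be sketched with open-source scikit-learn tools. This is a minimal illustration, not the study's actual pipeline: a synthetic imbalanced cohort stands in for the EHR data, `SimpleImputer` stands in for the single-value random forest imputation, and a missingness indicator is retained as an extra feature, as in the missingness variation reported in the results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced cohort (~6.5% positive, mirroring HAPI prevalence).
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.935], random_state=0)

# Inject missingness into one "lab value" column, keep an indicator of
# missingness as an extra feature, then impute the original column.
missing = rng.random(len(X)) < 0.2
X[missing, 0] = np.nan
indicator = np.isnan(X[:, 0]).astype(float).reshape(-1, 1)
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
X_full = np.hstack([X_imputed, indicator])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_full, y, stratify=y, random_state=0)

# Compare a few of the algorithms named in the abstract by AUC and F1.
results = {}
for name, clf in {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
}.items():
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_te)[:, 1]
    results[name] = (roc_auc_score(y_te, proba),
                     f1_score(y_te, clf.predict(X_te)))

for name, (auc, f1) in results.items():
    print(f"{name}: AUC={auc:.3f}  F1={f1:.3f}")
```

In a full replication, each imputation and oversampling variant would be crossed with each algorithm before scoring.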

Results

HAPI occurred in 333 (6.5%) of the surgical ICU patients. Model performance, as measured by the area under the ROC curve (AUC), F1 score, and the confusion matrix, was similar between models using different subsets of the available variables as predictive features; subsets ranged from the full feature set (k=33) to a parsimonious one (k=5). Models also performed similarly whether using case-wise deletion or single-value random forest imputation. There was a modest improvement in models that included an indicator of missingness, particularly for lab values, suggesting informative missingness that could be passed to the predictive models. The most impactful variation was the approach to imbalanced data: without adjusting for the imbalance, models tended to predict mostly zeroes, resulting in misleadingly optimistic ROC performance relative to accuracy in predicting the minority class. The synthetic minority oversampling technique (SMOTE) improved accuracy in predicting the minority class in both the full and parsimonious data sets. After applying SMOTE, all algorithms (neural network, random forest, logistic regression, adaptive boosting (AdaBoost), and extreme gradient boosting) performed similarly, and decidedly better than models without class-balance adjustments. The best-performing models overall were the logistic regressions.
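SMOTE's core idea, interpolating new minority-class points between existing minority samples and their nearest minority neighbors, can be sketched by hand. This is a simplified stand-in for a production implementation such as imbalanced-learn's `SMOTE`, shown only to make the oversampling step concrete; the study itself does not specify this code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples by interpolating
    between each chosen point and one of its k nearest minority
    neighbors (the essence of SMOTE)."""
    rng = np.random.default_rng(seed)
    # k+1 neighbors because each point's nearest neighbor is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))           # random minority point
        j = idx[i][rng.integers(1, k + 1)]     # random neighbor, skip self
        gap = rng.random()                     # position along the segment
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)

# Hypothetical minority class: 30 points in 4 dimensions.
X_min = np.random.default_rng(0).normal(size=(30, 4))
new = smote(X_min, n_new=100, k=5, seed=1)
print(new.shape)
```

Because each synthetic point lies on a segment between two real minority points, the oversampled class stays within the original minority region rather than duplicating rows exactly, which is why SMOTE tends to improve minority-class recall over naive replication.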

Conclusion

The results demonstrate the importance of adjusting for class imbalance in classification problems with strongly imbalanced classes. Additionally, the similarity of performance across algorithms supports the view that more, higher-quality data are more important than the choice of algorithm, and reemphasizes the importance of upstream data retention and inclusion. Advances in machine learning may come upstream, in retaining data for use (e.g., through deep feature synthesis), rather than downstream, in more advanced analytics of flattened files. Furthermore, in this case study, the robust performance of the highly parsimonious set of variables suggests that it may be possible to develop more focused guidance for clinicians and caregivers, with the goal of reducing caregiver burden without compromising care.
