PP03-36: Classification of Electronic Medical Record Laboratory Data using Machine Learning Techniques

Poster Presenter

Sapthagirishwaran Thennal Sivaramakrishnan

R&D Information Systems Analyst II
Gilead Sciences Inc
United States

Objectives

The objective is to build a semi-supervised machine learning model that correctly and efficiently classifies real-world laboratory data, reducing the need for periodic manual intervention.

Method

Data preprocessing consisted of data wrangling tasks and the application of business rules. Feature engineering involved String Indexer, One Hot Encoder and Vector Assembler stages. Because this is a binary classification problem, Naïve Bayes was chosen as the baseline model. Random Forest was the best-performing model.
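The three feature engineering stages can be illustrated with a minimal pure-Python sketch. It is not the pipeline used in this work: the stage names resemble Apache Spark ML transformers, but the abstract does not name the toolkit, and the records, field names, and frequency-based index ordering below are assumptions made only for illustration.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each category to an integer index, most frequent first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a vector of length `size` with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def assemble(*parts):
    """Concatenate scalars and vectors into one flat feature vector."""
    out = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            out.extend(float(x) for x in part)
        else:
            out.append(float(part))
    return out


# Hypothetical albumin records: (unit string, numeric result value).
records = [("g/dL", 4.1), ("g/dL", 3.8), ("g/L", 41.0), ("g/dL", 4.4)]

# Stage 1: index the categorical column.
unit_index = fit_string_indexer(unit for unit, _ in records)

# Stages 2 and 3: one-hot encode the index, then assemble with the numeric value.
features = [
    assemble(one_hot(unit_index[unit], len(unit_index)), value)
    for unit, value in records
]
```

Each entry of `features` is a flat numeric vector of the kind a classifier such as Naïve Bayes or Random Forest can consume.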

Results

The primary data source was EMR LAB data. One laboratory test, albumin, was selected, and a random sample of approximately 200,000 records was drawn for analysis.

Benchmark Model

A basic model was trained to examine the relationship between the record type (good or not sure) and the other features. In this context, we fit a model of the form

Y (outcome) = X1 + X2 + … + Xn + error

where outcome = good or not sure, and X1, X2, …, Xn are the predictor variables. A basic relationship between the outcome and a minimum set of predictors was examined first. The final model used all the features in the dataset to predict the outcome.

Evaluation Metrics

Accuracy, weighted precision, weighted recall and F1 score were used to evaluate the performance of the model. The results of the machine learning models on both the sample and the full dataset are given below.

Sample Dataset (Albumin) = 200448 records

Model                          Accuracy   Weighted Precision   Weighted Recall   F1 Score
Naïve Bayes (baseline model)   0.5217     0.5223               0.5217            0.5145
Random Forest                  0.8932     0.8979               0.8932            0.8929

Full Dataset (Albumin) = 75199337 records

Model                          Accuracy   Weighted Precision   Weighted Recall   F1 Score
Naïve Bayes (baseline model)   0.5213     0.5218               0.5213            0.5131
Random Forest                  0.9541     0.9581               0.9541            0.9449

Because this was a binary classification problem, Naïve Bayes was chosen as the baseline model. A Random Forest classifier was then applied to assess how well it could predict whether a laboratory record was good or not sure.
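As a rough illustration of how the four evaluation metrics relate to one another, the sketch below computes accuracy and support-weighted precision, recall and F1 from true and predicted labels. The labels are invented for the example, and treating the F1 score as support-weighted is an assumption, since the abstract does not say how it was averaged.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true records per class
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    w_precision = w_recall = w_f1 = 0.0
    for label in sorted(support):
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support[label]
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = support[label] / n  # class share of the true labels
        w_precision += weight * precision
        w_recall += weight * recall
        w_f1 += weight * f1

    return {
        "accuracy": accuracy,
        "weighted_precision": w_precision,
        "weighted_recall": w_recall,
        "f1": w_f1,
    }


# Hypothetical labels for four records.
y_true = ["good", "good", "good", "not sure"]
y_pred = ["good", "good", "not sure", "not sure"]
metrics = weighted_metrics(y_true, y_pred)
```

With support weighting, weighted recall always equals accuracy, which is consistent with the identical Accuracy and Weighted Recall values reported for every model and dataset above.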

Conclusion

With a semi-supervised approach, we were able to predict the type of laboratory record (good or not sure) with high weighted precision (about 0.90 on the sample and 0.96 on the full dataset) by using a Random Forest classifier. We can extend this approach to other EMR laboratory data and readily classify any laboratory measure that may be of interest to us.

Author: Sapthagirishwaran Thennal Sivaramakrishnan, R&D Information Systems Analyst II, Pharmacovigilance & Epidemiology, Gilead Sciences Inc.

To know more about Gilead Sciences, please visit www.gilead.com