PP03-36: Classification of Electronic Medical Record Laboratory Data using Machine Learning Techniques
Sapthagirishwaran Thennal Sivaramakrishnan
R&D Information Systems Analyst II
Gilead Sciences Inc., United States
The objective is to build a semi-supervised machine learning model that classifies real-world laboratory data correctly and efficiently, reducing the need for periodic manual intervention.
Data preprocessing included data wrangling and the application of business rules. Feature engineering used String Indexer, One Hot Encoder and Vector Assembler stages. Naïve Bayes was chosen as the baseline model because this is a binary classification problem. Random Forest was the best-performing model.
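To make the three feature engineering stages concrete, the following is a minimal plain-Python sketch of what string indexing, one-hot encoding and vector assembly each do. It is an illustration only, not the pipeline used in this work: the field names (unit, source, value), the toy records and the helper function names are all hypothetical, and ordering categories by descending frequency is an assumption made for the example.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each category to an integer index, most frequent first.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


def one_hot(index, size):
    """Return a vector of length `size` with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def assemble(*parts):
    """Concatenate scalars and vectors into one flat feature vector."""
    out = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            out.extend(part)
        else:
            out.append(float(part))
    return out


def build_features(records, categorical, numeric):
    """Index and one-hot encode the categorical fields, then assemble each row."""
    indexers = {c: fit_string_indexer([r[c] for r in records]) for c in categorical}
    rows = []
    for r in records:
        encoded = [one_hot(indexers[c][r[c]], len(indexers[c])) for c in categorical]
        rows.append(assemble(*encoded, *[r[n] for n in numeric]))
    return rows


# Hypothetical toy records; real EMR laboratory fields will differ.
records = [
    {"unit": "g/dL", "source": "site_a", "value": 4.1},
    {"unit": "g/L", "source": "site_b", "value": 41.0},
    {"unit": "g/dL", "source": "site_a", "value": 3.8},
]
features = build_features(records, categorical=["unit", "source"], numeric=["value"])
```

Each record becomes a single numeric vector (here, two one-hot blocks followed by the numeric value), which is the form a classifier such as Naïve Bayes or Random Forest expects as input.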
The primary data source was EMR laboratory data. A single laboratory test, albumin, was used, and a random sample of approximately 200,000 records was generated for analysis.
A basic model was trained to examine the relationship between the record type (good or not sure) and the other features. In this context, we fit a model of the form
Y (outcome) = X1 + X2 + … + Xn + error
where the outcome is either good or not sure, and X1, X2, …, Xn are the predictor variables.
A basic relationship between the outcome and a minimal set of predictors was examined first. The final model used all the features in the dataset to predict the outcome.
Accuracy, precision, recall and F1 score were used to evaluate the performance of the model.
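As a sketch of how these metrics can be computed, the plain-Python function below derives accuracy and support-weighted precision, recall and F1 from true and predicted labels. It assumes that the reported F1 score is also weighted by class support, which the abstract does not state; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus precision, recall and F1 averaged with class-support weights."""
    n = len(y_true)
    support = Counter(y_true)      # number of true records per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n         # share of records belonging to this class
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels: 6 "good" and 4 "not sure" records.
y_true = ["good"] * 6 + ["not sure"] * 4
y_pred = ["good"] * 5 + ["not sure"] + ["not sure"] * 3 + ["good"]
metrics = weighted_metrics(y_true, y_pred)
```

Note that support-weighted recall always equals accuracy, because weighting each class's recall by its share of records sums the correct predictions over all records. This is consistent with the reported results, where the two columns are identical for every model and dataset.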
The results of the machine learning models on both the sample and the full dataset are given below.

Sample dataset (albumin): 200,448 records
Model                         Accuracy  Weighted Precision  Weighted Recall  F1 Score
Naïve Bayes (baseline model)  0.5217    0.5223              0.5217           0.5145
Random Forest                 0.8932    0.8979              0.8932           0.8929

Full dataset (albumin): 75,199,337 records
Model                         Accuracy  Weighted Precision  Weighted Recall  F1 Score
Naïve Bayes (baseline model)  0.5213    0.5218              0.5213           0.5131
Random Forest                 0.9541    0.9581              0.9541           0.9449
Given that this was a binary classification problem, Naïve Bayes was chosen as the baseline model.
A Random Forest classifier was then applied to assess its ability to predict whether a laboratory record was good or not sure.
With a semi-supervised approach, we were able to predict the type of laboratory record (good or not sure) with high precision by using a Random Forest classifier. This approach can be extended to other EMR laboratory data to classify any laboratory measure that may be of interest.
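For readers unfamiliar with the technique, the sketch below shows the core idea of a random forest in plain Python: each tree is trained on a bootstrap sample using a random subset of features, and the forest predicts by majority vote. To stay short, every tree here is a single-split tree (a stump), so this is a deliberately tiny illustration and not the model used in this work. The abstract does not report the library or hyperparameters, so the function names, n_trees, n_features and the toy data are all assumptions.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    """Most common label in the list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature_ids):
    """Choose the (feature, threshold) split with the lowest weighted Gini impurity."""
    best = None
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold, majority(left), majority(right))
    if best is None:  # no usable split: always predict the majority class
        m = majority(labels)
        return (0, float("inf"), m, m)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]           # sample rows with replacement
        feats = rng.sample(range(len(rows[0])), n_features)  # random feature subset
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over the predictions of all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Hypothetical toy data: two numeric features that separate the two record types.
rows = [[0, 0], [1, 0], [0, 1], [1, 1]] * 3 + [[5, 6], [6, 5], [5, 5], [6, 6]] * 3
labels = ["good"] * 12 + ["not sure"] * 12
forest = fit_forest(rows, labels)
```

Averaging many trees trained on different resamples reduces the variance of any single tree, which is one common explanation for why a forest can outperform a simple baseline on tabular data of this kind.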
Sapthagirishwaran Thennal Sivaramakrishnan, R&D Information Systems Analyst II, Pharmacovigilance & Epidemiology, Gilead Sciences Inc.
To know more about Gilead Sciences, please visit www.gilead.com