
T 31: Comparison of Feature Encoding Methods for Automated Document Classification in Adverse Event Detection

Poster Presenter

      Joshua Ainsley

      • Data Scientist
      • Fino Consulting
        United States


Multiple methods for converting the unstructured text of medical research article abstracts into representations suitable for machine learning classification algorithms are compared for their effectiveness in identifying articles that describe potential adverse drug events.


Medical article abstracts and case reports were used to train machine learning models to classify documents based on whether or not they suggested an adverse drug reaction. Comparisons were made between bag-of-words, bigram, trigram, TF-IDF, and distributed language representation methods.
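The compared encodings can be illustrated in a minimal sketch. The use of scikit-learn here is an assumption (the poster names only R, Python, and Azure Machine Learning as its tools), and the two example documents are invented stand-ins:

```python
# Minimal sketch of the compared text encodings using scikit-learn
# (an assumption; the poster does not specify the libraries used).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "patient developed rash after starting drug therapy",
    "symptoms resolved without intervention",
]

# Bag-of-words: raw counts of single words per document
bow = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs)

# Bigrams and trigrams: two- and three-word phrases counted alongside unigrams
ngrams = CountVectorizer(ngram_range=(1, 3)).fit_transform(docs)

# TF-IDF: word counts reweighted by inverse document frequency
tfidf = TfidfVectorizer().fit_transform(docs)

print(bow.shape, ngrams.shape, tfidf.shape)
```

Each encoder yields a sparse, fixed-length vector per document; adding bigrams and trigrams widens the vocabulary, so the n-gram matrix has more columns than the unigram one.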


A corpus of medical research article abstracts and case reports detailing patient problems, treatments, and outcomes for a variety of medical disorders and diseases was collected. These documents were classified into two groups based on whether or not the cause of the problem was likely due to an adverse drug reaction. Data was prepared by tokenizing each abstract into vectors of single words, which were then filtered to remove single-occurrence words, highly common words, and numbers. Encoding the tokenized word vectors for machine learning was accomplished by counting the occurrences in each document of single words, two-word phrases (bigrams), and three-word phrases (trigrams). Another data set was created by weighting each vector using the term frequency-inverse document frequency (TF-IDF) method. This resulted in a sparse, fixed-length numeric vector for each document that was used as input for training a logistic regression model for document classification. In addition, we created a model using the distributed language representation method word2vec, which produces dense numeric representations of words based on their surrounding context. When turned into a classifier through inversion using Bayes rule, this yielded more accurate results than logistic regression. For this research, we used the open source programming languages R and Python and trained custom models on the Azure Machine Learning platform.
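The inversion step can be sketched in isolation. A common form of this technique trains a separate word2vec model per class and scores a new document's log-likelihood under each; that per-class training is assumed here (the poster does not give details), and the likelihood and prior numbers below are hypothetical stand-ins for such scores:

```python
# Sketch of the Bayes-rule inversion step. Assumption: one word2vec model is
# trained per class, and each scores log P(doc | class) for a new document.
# The numeric scores and priors below are invented for illustration.
import math

def classify_by_inversion(log_lik_per_class, log_prior_per_class):
    """Return (best_class, posterior) from per-class document
    log-likelihoods log P(doc | class) and log priors log P(class)."""
    # Unnormalized log posterior: log P(class | doc) = log P(doc | class)
    # + log P(class) + const.
    log_post = {c: log_lik_per_class[c] + log_prior_per_class[c]
                for c in log_lik_per_class}
    # Normalize with log-sum-exp to obtain a proper posterior probability.
    m = max(log_post.values())
    z = m + math.log(sum(math.exp(v - m) for v in log_post.values()))
    posts = {c: math.exp(v - z) for c, v in log_post.items()}
    best = max(posts, key=posts.get)
    return best, posts[best]

# Hypothetical document scores from two class-specific word2vec models
log_lik = {"adverse": -120.4, "not_adverse": -131.9}
log_prior = {"adverse": math.log(0.3), "not_adverse": math.log(0.7)}
label, p = classify_by_inversion(log_lik, log_prior)
```

The classifier simply picks the class whose language model, weighted by the class prior, best explains the document; the log-sum-exp normalization keeps the computation numerically stable for large-magnitude log-likelihoods.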


Post-approval pharmacovigilance, the identification of potential new adverse drug events after drug approval, is a critical goal of the healthcare system. These activities are often mandated by regulatory agencies to ensure the continuous collection of safety data on a drug during normal use and across a more diverse group of patients than is usually possible during pre-approval clinical trials. The continually increasing number of medical journal articles and case studies published each year has made it more difficult to monitor for potential adverse effects in an accurate and timely manner. Numerous data mining and machine learning algorithms have been utilized to reduce the amount of manual expert time required to evaluate a publication for potential adverse effects; however, no standardized methodology has been implemented. The development of distributed language representation methods such as word2vec, which utilize neural networks to learn words and their contextual meaning, has the potential to lead to highly accurate numerical representations of words that can be used for classification models. In this study, we sought to compare distributed language representation methods with more established language processing methodology. The success of distributed language representation methods here suggests that their use should become more common in medical language processing.