
T-01: A Novel Approach to Standardizing Data and Detecting Duplicates Across Adverse Events Data Sources Using Machine Learning

Poster Presenter

      Sameen Desai

      • Executive Director, IT Worldwide Patient Safety
      • Bristol-Myers Squibb Company
        United States


This paper aims to solve the problem of detecting duplicate adverse event cases within and across multiple data sets, using a novel approach that incorporates machine learning.


Data from FDA AERS (2004–present) and WHO VigiBase (1968–present) were processed in three stages:

1. Case ingestion and standardization.
2. Case clustering: Spark-based locality-sensitive hashing and calculation of Jaccard similarities.
3. Case deduplication: supervised learning, with PV scientists providing labelled data and model tuning.
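The clustering stage groups cases by approximate similarity so that exact comparisons run only on candidate pairs. The following is a minimal single-machine sketch of that idea using MinHash signatures banded into LSH buckets; the paper's Spark implementation, shingle size, and band/row counts are not specified, so all parameters here are illustrative assumptions.

```python
import hashlib

def shingles(text, k=3):
    """Character k-shingles of a normalized case narrative."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def jaccard(a, b):
    """Exact Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def minhash_signature(shingle_set, num_hashes=64):
    """MinHash signature: for each seeded hash, the minimum over shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def lsh_buckets(signatures, bands=16, rows=4):
    """Band each signature; cases sharing any band land in the same bucket."""
    buckets = {}
    for case_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(case_id)
    return buckets
```

Only cases that collide in some bucket become candidate pairs, and the exact Jaccard similarity is then computed for those pairs alone, which is what makes the approach scale to millions of reports.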


Celgene Inc. and Enigma Technologies, using supervised learning models trained on labels from PV scientists, probabilistically stratified 5.2% of FDA AERS and WHO VigiBase cases into low- and high-confidence duplicates for further manual curation. In the FDA AERS data alone, 59,000 adverse event reports were found to be high-confidence duplicates, and over 18,000 reports were detected as duplicates in the WHO VigiBase data. The scaled, cloud-computing approach achieved 70,000 pairwise duplicate checks per minute, and this novel duplicate detection mechanism processed over 20 million unique adverse event reports in under 24 hours.
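The deduplication step scores each candidate pair with a supervised model and maps the score to a review tier. The abstract does not name the model, the features, or the confidence cutoffs, so the sketch below assumes a plain logistic regression over illustrative pair-level similarity features and hypothetical thresholds.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Gradient-descent logistic regression, a stand-in for the
    paper's unspecified supervised duplicate classifier."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def duplicate_probability(w, b, features):
    """Model's probability that a candidate pair is a duplicate."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features)) + b)

def stratify(p, low=0.5, high=0.9):
    """Bucket a score into review tiers (cutoffs are illustrative)."""
    if p >= high:
        return "high-confidence duplicate"
    if p >= low:
        return "low-confidence duplicate"
    return "unlikely duplicate"

# Hypothetical PV-scientist-labelled pairs; each row is
# [narrative Jaccard, same patient age (0/1), same drug (0/1)].
X = [[0.95, 1, 1], [0.90, 1, 1], [0.85, 0, 1],
     [0.10, 0, 0], [0.20, 1, 0], [0.05, 0, 1]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)
```

Pairs landing in the high-confidence tier would go straight to curation, while low-confidence pairs receive closer manual review, mirroring the stratification described above.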


Accurate signal detection relies on resolved and timely data. The team's approach provides a novel, scalable way to detect duplicates efficiently among large, messy datasets. Furthermore, the process reveals key systematic patterns in how duplicates emerge in adverse event reporting systems, patterns that only this probabilistic, non-rules-based approach could surface; previous attempts built on narrow heuristics could not. We believe this approach allows PV departments to analyze public and private AE data rapidly and limits duplicates without delaying data release. It also creates a data-based framework that evolves the PV discipline by leveraging emerging tools and methods built for big data.