
P117: Quality and Reliability of Machine-Learning Derived Variables in Clinico-Genomic, Real-World Datasets

Poster Presenter

      Timothe Claude Menard

      • Head of Quality Data Science
      • F. Hoffmann-La Roche


Establish a framework to define quality thresholds for machine learning (ML)-derived variables used in clinico-genomic, real-world datasets. We used published inference algorithms (for potential germline pathogenic variants (PGPV) and microsatellite instability (MSI)) as a case study.


We reviewed and evaluated two published inference methods (one for PGPV and one for MSI), and leveraged recent good machine learning practice (GMLP) and real-world data (RWD) guidance from health authorities, to define a fit-for-purpose framework for the quality and reliability of ML-derived variables.


For our case study, we used the MSI inference method by Wang et al. (Scientific Reports, 2018) and the PGPV inference method by Sun et al. (PLOS Comput Biol, 2018). Our framework followed a two-step approach: first, a quality assessment of the algorithm to understand its performance and limitations; second, a set of rule-based data quality checks. Both ML models were evaluated for their applicability in light of the original training set and validation techniques used, since any classification modeling technique must be adequately validated.

For the PGPV method, most performance metrics were unavailable (i.e., the true positive, true negative, and false negative rates that would allow the entire confusion matrix to be reconstructed), making it difficult to evaluate its true real-world performance. The MSI method, by contrast, reported all metrics, including the area under the ROC curve. For both PGPV and MSI, the requirements to ensure optimal performance of the inference methods (e.g., the breadth and depth of sequencing coverage required, and the tumor purity range in which the model performs best) were disclosed, so a quality threshold could be defined, with any data generated outside these specifications to be discarded. In addition, rule-based data quality checks should be implemented to give further assurance of data quality. Finally, RWD generated within the models' specifications (for both methods) that also pass the rule-based data quality checks would be suitable at least for exploratory research; appropriate, more stringent quality thresholds must be set where RWD are used to generate real-world evidence (RWE) supporting a regulatory decision.
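The two-step framework above can be sketched in code: step 1 checks whether a published algorithm reports enough metrics to reconstruct its confusion matrix, and step 2 discards records generated outside the model's stated operating specifications. All function names, field names, metric values, and threshold values below are illustrative assumptions for the sketch, not the actual specifications from Wang et al. or Sun et al.

```python
# Illustrative sketch of the two-step quality framework for ML-derived
# RWD variables. Names and numbers are hypothetical, not the published specs.

# Step 1: metrics needed to reconstruct the full confusion matrix,
# plus the reported AUC.
REQUIRED_METRICS = {"tpr", "tnr", "fnr", "auc"}

def step1_metrics_complete(reported: dict) -> bool:
    """True if the publication reports every metric required for a
    quality assessment (i.e., the confusion matrix can be rebuilt)."""
    return REQUIRED_METRICS.issubset(reported)

# Step 2: hypothetical specification window — minimum sequencing depth
# and an acceptable tumor-purity range.
SPEC = {"min_depth": 100, "min_purity": 0.2, "max_purity": 1.0}

def step2_record_in_spec(record: dict, spec: dict = SPEC) -> bool:
    """Rule-based check: keep only records generated within the model's
    stated operating specifications; everything else is discarded."""
    return (
        record.get("seq_depth", 0) >= spec["min_depth"]
        and spec["min_purity"]
        <= record.get("tumor_purity", -1.0)
        <= spec["max_purity"]
    )

# Example run with made-up values: one method reports a full metric set,
# the other does not; one record is in spec, the other fails on depth.
full_report = {"tpr": 0.95, "tnr": 0.97, "fnr": 0.05, "auc": 0.98}
partial_report = {"auc": 0.90}

records = [
    {"seq_depth": 250, "tumor_purity": 0.4},  # in spec -> keep
    {"seq_depth": 40, "tumor_purity": 0.4},   # depth too low -> discard
]
usable = [r for r in records if step2_record_in_spec(r)]
```

In this sketch, data passing both steps would be considered suitable for exploratory research, mirroring the framework's conclusion that out-of-specification records are discarded before any downstream use.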


The increasing use of genomic testing in oncology has resulted in clinico-genomic databases (CGDBs), which connect patient-level data from electronic health records to genomic testing results. There is also an increasing number of real-world datasets containing variables inferred using ML algorithms. For example, PGPV and MSI data can be generated either by a "gold standard" method (for PGPV: sequencing using a matched normal sample; for MSI: polymerase chain reaction) or by an inference (ML) approach. The major challenge with the latter is that such RWD cannot be directly verified for quality and reliability, as source data verification and/or concordance analyses cannot always be performed. This poses a risk to the integrity of research conducted using these datasets (e.g., to generate RWE). We therefore developed a quality framework based on two published inference methods (for PGPV and MSI data). This framework can be used to assess RWD containing ML-derived variables and to set quality and reliability thresholds, i.e., when and how these variables could be used, and under which guardrails and specifications. Ultimately, the RWE use case, and whether data are used for exploratory research vs. to support a regulatory submission, should help define the optimal quality and reliability thresholds.