W-05: What's the Best Download Format from NIH - Comparing Data Availability in Different Export Options

      John Willmore

      • BizInt Solutions, Inc.
        United States


The registry website (NIH) offers several data formats for downloading search results. We examine differences in data field availability between formats. We further examine differences in the presentation of data fields when those differences affect the use of the data.


Field availability in each export format (CSV, text, XML) offered by the registry website is compared based on field labels. The resulting matrix of available data elements is then validated by comparing actual clinical trial study details exported in each format. Differences between formats are noted.
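The label comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the study: the function name `availability_matrix` and the label sets are hypothetical placeholders, and it assumes the labels from each format have already been normalized to a common vocabulary.

```python
def availability_matrix(fields_by_format):
    """Map each field label to a {format: present?} row."""
    formats = sorted(fields_by_format)
    all_labels = sorted(set().union(*fields_by_format.values()))
    return {
        label: {fmt: label in fields_by_format[fmt] for fmt in formats}
        for label in all_labels
    }


# Hypothetical, already-normalized label sets (illustration only).
example_labels = {
    "CSV": {"Title", "Age"},
    "Text": {"Title", "Age"},
    "XML": {"Title", "Age", "Criteria"},
}

matrix = availability_matrix(example_labels)

# Labels present in the XML but in neither text format.
xml_only = [
    label
    for label, row in matrix.items()
    if row["XML"] and not (row["CSV"] or row["Text"])
]
```

A matrix built this way only reflects labels; as noted above, it still has to be checked against actual exported study records.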


As expected, the XML export of study details is the most comprehensive of the export formats. (The display of study details on the registry website is equally comprehensive, but shows only one study at a time.) We found no study details in the other export formats that were not also present in the XML. The other available export formats (comma separated, tab delimited, and plain text) all contain the same data elements. Furthermore, although these file formats could allow different formatting within a data element, in practice there are no differences between them within a field.

Table 1 shows the study detail fields available in all export formats. Table 2 shows the study detail fields available only in the XML export format. Study results data are available only in the XML export format.

It is worth noting that the XML format for study details requires additional software to translate the structured data into a format usable in an office application such as Microsoft Excel. The comma separated and tab delimited formats can be easily imported into Excel. Depending on the question being studied, additional processing may be required within Excel to extract data suitable for analysis. This additional processing (via formulae or macros) may add to the cost of using the CSV format.

We performed an unconstrained search for gabapentin (n = 314 studies) and downloaded the results in each format. The presentation of a field can differ between the text formats and the XML. For example, Subject Age is presented in the "Age" field in the plain text, CSV, and tab delimited formats as:

      10 Years to 19 Years (Child, Adult)

In the XML this is available as two separate elements:

      minimum_age = 10 Years
      maximum_age = 19 Years

Note also that the patient classifications "Child" and "Adult" are not present in the XML, but are instead derived from the minimum and maximum age.
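To illustrate the kind of extra processing the combined "Age" field can require, the sketch below splits the text-format value into the two components that the XML provides separately. It handles only the "X to Y (groups)" shape quoted above; the function name `split_age` and the pattern are assumptions made for illustration, and any value with a different shape simply returns `None`.

```python
import re

# Matches "<minimum> to <maximum>" with an optional "(group, group)" suffix.
AGE_PATTERN = re.compile(
    r"^(?P<minimum>.+?) to (?P<maximum>.+?)(?:\s*\((?P<groups>[^)]*)\))?$"
)


def split_age(age_text):
    """Split a combined Age value into minimum, maximum, and group labels."""
    match = AGE_PATTERN.match(age_text.strip())
    if match is None:
        return None  # shape not covered by this sketch
    groups = match.group("groups")
    return {
        "minimum_age": match.group("minimum"),
        "maximum_age": match.group("maximum"),
        "groups": [g.strip() for g in groups.split(",")] if groups else [],
    }


parsed = split_age("10 Years to 19 Years (Child, Adult)")
# parsed["minimum_age"] == "10 Years", parsed["maximum_age"] == "19 Years"
```

The same logic could be written as a spreadsheet formula or macro; the point is only that the text formats need a parsing step that the XML does not.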


When reporting on the results of a single study, the browser display on the registry website is easy to use and comprehensive. When performing an analysis of many studies, the download formats allow the user to export a large number of results. Most users of the registry website use one of the text export formats, such as CSV, when creating a report or visualization of a collection of studies. A primary reason is the easy access to office suite tools like Microsoft Office.

As we have shown, a significant fraction of the study details is not available in the text exports but is available in the XML. Many concepts, such as inclusion and exclusion criteria and locations, are available only in the XML. Other concepts, such as outcome measures and ages, are present in both XML and CSV, but more detail is available in the XML. Study results are available for export only in XML format. Researchers working with study details should be aware of the additional study details available in the XML format and consider adopting tools capable of translating the XML.
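As a rough idea of what translating the XML can involve, the following sketch uses only the Python standard library to pull selected elements out of a study record and write them as CSV rows that a spreadsheet can open. The sample document, its `study` and `id` elements, and the function names are hypothetical; only the `minimum_age` and `maximum_age` element names come from the example above, and a real export would need its own list of tags.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical record; the real export schema is not reproduced here.
SAMPLE_XML = """<study>
  <id>EXAMPLE-0001</id>
  <minimum_age>10 Years</minimum_age>
  <maximum_age>19 Years</maximum_age>
</study>"""


def study_to_row(xml_text, tags):
    """Return {tag: text} for the first matching descendant of each tag."""
    root = ET.fromstring(xml_text)
    row = {}
    for tag in tags:
        element = root.find(f".//{tag}")
        has_text = element is not None and element.text
        row[tag] = element.text.strip() if has_text else ""
    return row


def rows_to_csv(rows, tags):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=tags, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


TAGS = ["id", "minimum_age", "maximum_age"]
row = study_to_row(SAMPLE_XML, TAGS)
csv_text = rows_to_csv([row], TAGS)
```

Taking only the first match per tag is a simplification: repeating concepts such as locations would need one row per occurrence or a joined cell.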