Beware of public datasets in healthcare.

@April 28, 2022

The Boom in Healthcare Datasets

As interest in data science and machine learning in healthcare has exploded, it's become popular to release large, open datasets. Some famous examples are the canonical critical care EMR dataset MIMIC, the chest x-ray dataset CXR-14, and the EyePACS dataset of eye fundus images.

As these datasets have grown in renown, the incentive to gather and release public datasets for the accompanying publications and fame has grown. The positive side of this is that the number of datasets in healthcare has increased tremendously. We're now at a point where there are meta-analyses of healthcare datasets, like this one in the Lancet on ophthalmology datasets.

The Challenge of Bad Datasets

The downside, however, is that the amount of bad data has skyrocketed. Because the emphasis is on releasing large amounts of data, without agreed standards or the resources to support such datasets for the long run, there is now a proliferation of large, noisy datasets.

Luke Oakden-Rayner published a great analysis of the Chest X-Ray14 dataset at the time it was released, with clear explanations of why the dataset is so noisy. I myself have had many experiences with public datasets that seem promising at first but prove hopelessly noisy upon further examination.

I worked with an optometrist to evaluate the Kermany dataset, which is the largest and most commonly used (~30K downloads) public OCT dataset. We found that the dataset had a disproportionate number of myopic eyes, too much imaging device variation, and quality control issues with image processing (intensity averaging and field-of-view cutoffs). All of this combined to make the dataset useless for our medical AI application of diagnosing retinal disease in the real world.

It’s important to note that bad public datasets are not specific to healthcare. Northcutt and Athalye of MIT found that common machine learning benchmark datasets like ImageNet were of poor quality, with “label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets”. Galileo has found numerous errors in NLP datasets. Hasty.ai found that cleaning up the PASCAL computer vision dataset increased model performance by 13% (!), a massive improvement.

Bad data is rampant in healthcare and everywhere.

The Consequences of Bad Datasets

The negative results of all this bad data proliferating are several-fold.

It's hard to build anything meaningful with public data. With the avalanche of poor quality datasets only increasing, it's getting harder. Rather than being empowered by the vast amount of data, the data scientist or ML engineer wastes a lot of time parsing and characterizing the noise in the dataset.

It diminishes trust in medical applications of AI, because systems trained on large amounts of public data by people without healthcare experience generalize poorly. This is frustrating both for developers, who find the problem area to be harder than initially expected, and for customers, who despise the accompanying hype cycle.

It perpetuates a broken cycle for medical AI maturity, in which the field acts as if an ImageNet moment is waiting just around the corner. The reality is that medical AI depends heavily on the context in which the algorithm is applied, not so much on the quantity of data used to train a model. Rather than spending time developing huge datasets, it would be better to get the community focused on standards for AI deployment, as well as open standards for model training. In a nutshell, a shortage of data has never been the problem for healthcare AI; people just assumed it was because it was the problem in other areas (e.g., image recognition).

Ideas for Releasing Better Healthcare Datasets

All of this raises the question of how to actually put together a good healthcare dataset. Here are some ideas for how to release higher-quality public datasets:

Structured documentation: Many medical datasets are released with an accompanying paper. These papers vary widely in how much detail they give about how the dataset was created. We can do better by embracing standards for documentation and being more structured. Every dataset should catalogue the classes inherent in the data, along with the underlying demographic and contextual statistics. In the same way that any major OSS release has structured documentation, every medical dataset should follow a high-quality standard for documentation. Rajpurkar et al. put forward an excellent example for the CheXpert dataset.
Better technology: Medical datasets are frequently released as a bunch of JPEG files with accompanying CSVs of labels. This is an outdated, irresponsible format to use with medical data. It doesn’t allow for standard field definitions or native versioning, and it willfully ignores all the progress that has been made in data storage and sharing. Medical datasets should be released in standard medical formats like DICOM with identifying fields stripped, or using database technology like DoltHub that makes sharing and version control easy. While it can be expensive, I very much appreciate the example set by MIMIC-III of being distributed via GCP or AWS. Nightingale Open Science also does a great job of this.
Emphasize context: Medical data is generated in specific clinical contexts. A routine fundus image exam for a diabetic patient is very different from a fundus image for a patient known to have AMD. These contexts shape the detail and precision with which data is generated, as well as the settings in which insights from the data can be applied. Medical dataset authors should describe in more detail the context in which their data was created and be clearer about where it should not be applied.
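
To make the documentation and context ideas above a bit more concrete, here is a minimal sketch of what a machine-readable dataset card could look like. It is not an existing standard: the `DatasetCard` class, its field names, and the example values are all hypothetical, chosen only to illustrate cataloguing classes, demographics, clinical context, and out-of-scope uses in one structured, versionable place.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetCard:
    """Hypothetical, minimal structured documentation for a medical dataset."""

    name: str
    version: str
    modality: str                # e.g. "OCT", "fundus", "chest x-ray"
    clinical_context: str        # where and why the data was acquired
    intended_use: list[str]      # applications the authors consider appropriate
    out_of_scope_use: list[str]  # applications the data should NOT be used for
    devices: list[str]           # imaging devices represented in the data
    class_counts: dict[str, int] = field(default_factory=dict)
    demographics: dict[str, dict[str, int]] = field(default_factory=dict)

    def class_fractions(self) -> dict[str, float]:
        """Share of each label, so class imbalance is visible at a glance."""
        total = sum(self.class_counts.values())
        if total == 0:
            return {}
        return {label: n / total for label, n in self.class_counts.items()}

    def missing_fields(self) -> list[str]:
        """Names of fields left empty, so gaps can be caught before release."""
        return [key for key, value in asdict(self).items() if not value]

    def to_json(self) -> str:
        """Serialize the card so it can be versioned alongside the data."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Made-up example values, for illustration only.
card = DatasetCard(
    name="example-oct",
    version="1.0.0",
    modality="OCT",
    clinical_context="Routine screening at a single outpatient clinic",
    intended_use=["research on retinal disease classification"],
    out_of_scope_use=[],
    devices=["DeviceA"],
    class_counts={"normal": 750, "drusen": 250},
)

print(card.class_fractions())
print(card.missing_fields())
```

In this example, `missing_fields()` flags `out_of_scope_use` and `demographics` as undocumented. A release process could run a check like this and refuse to publish until the gaps are filled.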