Following the rise of deep learning, AI is making its way into the medical sector. A recent study showed that AI could detect breast cancer in screening mammograms with accuracy comparable to that of expert radiologists. Radiology and pathology together form the core of cancer detection, and both are essential for making a correct diagnosis. The consequences could be profound, as AI could assist radiologists and pathologists in detecting malignant tissue that would otherwise be overlooked.
The study achieved these excellent results not only because of the model developed by Google DeepMind, but also because of a dataset that consisted of almost 29,000 mammograms. As with every machine learning project, both the quantity and quality of the underlying data are key to its success.
The availability of large labelled datasets in the medical sector is usually limited. New data can often only be generated by an expert, who has to manually annotate or classify the images, which is an expensive and time-consuming process.
Mammograms are only one type of medical image where AI could add value. Whole Slide Imaging (WSI) is another promising one. Whole slide scanners capture multiple images of tissue sections, which are then digitally assembled (“stitched”) into one digital image of the entire slide.
These images can reach several gigabytes in size and are too large to feed directly into a machine learning algorithm. A common practice has been to overlay the slide with a grid and extract smaller patches that are then fed to, for example, a Convolutional Neural Network.
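As a rough illustration of the grid approach, the sketch below walks a regular grid over a slide and yields the top-left corner of every full tile. The function name `grid_coordinates` and the slide dimensions are made up for this example. A real pipeline would read the pixels at each coordinate with a slide-reading library.

```python
def grid_coordinates(width, height, patch_size):
    """Yield the (x, y) top-left corner of every full tile in a regular grid."""
    # Step through the slide in non-overlapping tiles. Partial tiles at the
    # right and bottom edges are skipped.
    for y in range(0, height - patch_size + 1, patch_size):
        for x in range(0, width - patch_size + 1, patch_size):
            yield x, y


# Hypothetical slide region of 1500 x 1000 pixels, cut into 256-pixel tiles.
coordinates = list(grid_coordinates(width=1500, height=1000, patch_size=256))
print(len(coordinates))  # 5 columns x 3 rows = 15 tiles
```

Note that the number of patches is fixed by the slide size and the tile size, and every tile has to be extracted before training can start.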
Although this is a valid approach, there can be some shortcomings:
To address these shortcomings, we developed a new data generator that can create an infinite stream of patches on the fly and automatically balance the different classes. Instead of using a fixed grid, the generator uses stochastic sampling to produce a steady stream of patches. As this leads to an “infinite” dataset, the epoch size is no longer dictated by the data and can instead be set as a hyperparameter, which gives additional flexibility.
To avoid sampling from empty white space, we first apply a background color mask based on k-means clustering. This acts as a first filter that keeps only the regions of the slide that contain tissue.
Now that we have a first rough indication of the interesting regions, we can start the stochastic sampling process. All random numbers are generated using the Mersenne Twister pseudo-random number generator (PRNG).
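A minimal sketch of such a background mask is shown below, assuming a low-resolution RGB thumbnail of the slide is available as a NumPy array. It runs a plain 2-cluster k-means on the pixel colors and treats the brighter cluster as background. The function name `tissue_mask`, the number of clusters, and the initialisation are assumptions made for this example, not necessarily what our generator does.

```python
import numpy as np


def tissue_mask(thumbnail, n_iter=10):
    """Return a boolean mask that is True where the thumbnail shows tissue.

    Assumption: the slide background is the brightest color in the image.
    """
    pixels = thumbnail.reshape(-1, 3).astype(np.float64)
    # Initialise the two cluster centers at the darkest and brightest pixels.
    brightness = pixels.sum(axis=1)
    centers = np.stack([pixels[brightness.argmin()], pixels[brightness.argmax()]])
    for _ in range(n_iter):
        # Assignment step: find the nearest center for every pixel.
        distances = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean color of its pixels.
        for k in range(2):
            if np.any(labels == k):
                centers[k] = pixels[labels == k].mean(axis=0)
    background = centers.sum(axis=1).argmax()
    return (labels != background).reshape(thumbnail.shape[:2])


# Toy thumbnail: white background with a pinkish block standing in for tissue.
thumbnail = np.full((64, 64, 3), 255, dtype=np.uint8)
thumbnail[16:48, 16:48] = (200, 120, 170)
mask = tissue_mask(thumbnail)
print(mask.sum())  # number of pixels flagged as tissue
```

Working on a thumbnail rather than the full-resolution slide keeps the clustering cheap. The resulting mask only has to be precise enough to rule out large empty regions.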
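Python's built-in `random` module uses the Mersenne Twister as its core generator, so a seeded `random.Random` instance is one simple way to get reproducible draws. The sketch below illustrates this; the helper `draw_coordinates` and the slide dimensions are hypothetical.

```python
import random


def draw_coordinates(seed, n, width, height):
    """Draw n pseudo-random (x, y) positions inside a width x height slide."""
    # A dedicated generator instance keeps the sequence independent of any
    # other code that uses the global random state.
    rng = random.Random(seed)
    return [(rng.randrange(width), rng.randrange(height)) for _ in range(n)]


first = draw_coordinates(seed=42, n=3, width=10_000, height=8_000)
second = draw_coordinates(seed=42, n=3, width=10_000, height=8_000)
print(first == second)  # True: the same seed yields the same sequence
```

Storing the seed together with the experiment configuration is then enough to regenerate exactly the same stream of patches later.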
We use the following algorithm in pseudo-code:
- Specify classes (e.g. background, target)
- Set epoch size
- For epoch size:
  - Draw random class, select slides that contain the class
  - Draw random slide from selection
  - Draw random annotation from slide, create annotation mask
  - Draw random patch from annotation mask
  - Calculate if patch falls in the annotation mask
  - Return patch if a specified threshold is reached
  - Else draw new patch
The stochastic patch generator allows us to quickly set up a dataset for any machine learning algorithm that uses WSI. It eliminates the need to unpack slides into patches and store them on a drive, while keeping the results reproducible and without delaying the process flow.
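The steps above could be sketched in Python roughly as follows. This is a simplified illustration, not our actual implementation: annotations are assumed to be available as boolean NumPy masks, and the names `patch_generator`, `slides`, `threshold` and `max_tries` are hypothetical. A patch is drawn by picking a random annotated pixel as its center, which is one possible reading of "draw random patch from annotation mask". The generator yields coordinates only, since reading the pixels depends on the slide library in use.

```python
import random

import numpy as np


def patch_generator(slides, classes, patch_size, threshold, epoch_size, seed,
                    max_tries=100):
    """Yield (slide_id, class_name, row, col) for one epoch of patches.

    `slides` maps a slide id to {class_name: [boolean annotation masks]}.
    """
    rng = random.Random(seed)  # Mersenne Twister, seeded for reproducibility
    # Select, per class, the slides that contain at least one annotation.
    by_class = {c: sorted(s for s, ann in slides.items() if ann.get(c))
                for c in classes}
    missing = [c for c, ids in by_class.items() if not ids]
    if missing:
        raise ValueError(f"no slide contains classes: {missing}")

    for _ in range(epoch_size):
        # Drawing the class first, uniformly, is what balances the classes.
        class_name = rng.choice(classes)
        slide_id = rng.choice(by_class[class_name])
        mask = rng.choice(slides[slide_id][class_name])
        height, width = mask.shape
        rows, cols = np.nonzero(mask)
        for _ in range(max_tries):
            # Center the patch on a random annotated pixel, clipped to the slide.
            i = rng.randrange(len(rows))
            row = min(max(int(rows[i]) - patch_size // 2, 0), height - patch_size)
            col = min(max(int(cols[i]) - patch_size // 2, 0), width - patch_size)
            # Fraction of the patch that falls inside the annotation mask.
            coverage = mask[row:row + patch_size, col:col + patch_size].mean()
            if coverage >= threshold:
                yield slide_id, class_name, row, col
                break
        else:
            raise RuntimeError("no patch reached the coverage threshold")


# Toy data: slide_a is half background and half target, slide_b is all background.
left = np.zeros((512, 512), dtype=bool)
left[:, :256] = True
slides = {
    "slide_a": {"background": [left], "target": [~left]},
    "slide_b": {"background": [np.ones((512, 512), dtype=bool)]},
}
settings = dict(classes=["background", "target"], patch_size=64,
                threshold=0.9, epoch_size=8, seed=0)
samples = list(patch_generator(slides, **settings))
again = list(patch_generator(slides, **settings))
print(len(samples), samples == again)
```

Because `epoch_size` is just an argument, the same generator can serve a short smoke test or a long training run, and rerunning with the same seed reproduces the same patches without storing any of them on disk.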
Sorace, J., Aberle, D.R., Elimam, D. et al. Integrating pathology and radiology disciplines: an emerging opportunity? BMC Med 10, 100 (2012). doi:10.1186/1741-7015-10-100
Zarella, M.D., Bowman, D., Aeffner, F., Farahani, N., Xthona, A., Absar, S.F., Parwani, A., Bui, M., Hartman, D.J. A Practical Guide to Whole Slide Imaging: A White Paper From the Digital Pathology Association. Archives of Pathology & Laboratory Medicine 143(2), 222–234 (2019).