Stochastic patch generation for Whole Slide Imaging

Lukas Verret - 2020-01-28


Following the rise of deep learning, AI is making its way into the medical sector. A recent study showed that AI could detect breast cancer in screening mammograms with accuracy comparable to that of expert radiologists. Radiology and pathology together form the core of cancer detection, and both are essential in making a correct diagnosis [1]. The consequences could be profound: such systems could assist radiologists and pathologists in detecting malignant tissue that would otherwise be overlooked.

Data, data and more data

The study achieved these excellent results not only because of the model developed by Google DeepMind, but also because of a dataset of almost 29,000 mammograms. As with every machine learning project, both the quantity and the quality of the underlying data are key to its success.

The availability of large labelled datasets in the medical sector is usually limited. New data can often only be generated by an expert, who has to manually annotate or classify the images, which is an expensive and time-consuming process.

Mammograms are only one type of medical image where AI could add value; Whole Slide Imaging (WSI) is another promising one. Whole slide scanners capture multiple images of tissue sections, which are then digitally assembled (“stitched”) into one digital image of the entire slide [2].

These images can take up several gigabytes and are too large to feed directly into a machine learning algorithm. A common practice is to overlay the slide with a grid and extract smaller patches, which are then fed to, for example, a Convolutional Neural Network:


Example of a breast tissue slide with dummy annotations on our Ixorthink portal
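
As a minimal sketch of the grid approach described above, the function below lists the top-left corner of every full patch on a regular grid. The function name, the slide dimensions and the patch size are illustrative assumptions, not values taken from our pipeline.

```python
def grid_patch_origins(slide_width, slide_height, patch_size, stride=None):
    """Return the top-left (x, y) corner of every full patch on a regular grid."""
    stride = stride or patch_size  # non-overlapping patches by default
    origins = []
    for y in range(0, slide_height - patch_size + 1, stride):
        for x in range(0, slide_width - patch_size + 1, stride):
            origins.append((x, y))
    return origins


# Hypothetical 1000 x 600 pixel region cut into 256 x 256 patches.
origins = grid_patch_origins(1000, 600, patch_size=256)
print(len(origins), origins[0], origins[-1])
```

Every one of these patches would have to be written to disk and labelled, including the many that contain only background.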

Although this is a valid approach, there can be some shortcomings:

  • Unbalanced dataset: usually only a small portion of the slide is annotated and the majority can be considered background. This imbalance can bias the algorithm if it is not properly addressed.
  • Patches have to be regenerated if the dataset changes: if annotations are added or corrected, the entire dataset has to be regenerated.
  • Patches need to be physically stored and can take up a lot of space in the case of a large dataset.

Stochastic patch generation

To address these shortcomings, we developed a new data generator that creates a stream of patches on the fly and automatically balances the different classes. Instead of using a fixed grid, the generator uses stochastic sampling to draw a steady stream of patches. As this leads to an “infinite” dataset, the epoch size is no longer dictated by the data but can be set as a hyperparameter, which adds flexibility.

To avoid sampling from empty white space, a background color mask based on k-means clustering is applied first. This acts as a first filter that keeps only the regions of the slide that contain tissue.
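
The idea can be sketched as follows. This is a simplified illustration, not our production code: it assumes a small grayscale thumbnail, two clusters, and that the brightest cluster is background. The article does not fix the color space or the number of clusters, so all of these are assumptions, as are the function names.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Plain Lloyd's k-means on scalar values (k >= 2); returns sorted centers."""
    lo, hi = min(values), max(values)
    # Spread the initial centers evenly between the darkest and brightest value.
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


def tissue_mask(gray_image, k=2):
    """True where a pixel does NOT belong to the brightest (background) cluster."""
    pixels = [p for row in gray_image for p in row]
    centers = kmeans_1d(pixels, k)
    background = len(centers) - 1  # assumption: brightest cluster is empty glass

    def label(p):
        return min(range(k), key=lambda i: abs(p - centers[i]))

    return [[label(p) != background for p in row] for row in gray_image]


# Hypothetical 4 x 4 thumbnail: bright border, darker tissue in the middle.
thumbnail = [
    [250, 248, 252, 251],
    [249, 120, 110, 250],
    [251, 115, 130, 247],
    [250, 249, 251, 252],
]
mask = tissue_mask(thumbnail)
```

Running the clustering on a downsampled thumbnail rather than the full-resolution slide keeps this step cheap; the resulting mask can then be scaled back to slide coordinates.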


Black regions are excluded from the sampling process

Now that we have a first rough indication of the interesting regions, we can start the stochastic sampling process. All random numbers are generated using the Mersenne Twister pseudo-random number generator (PRNG).
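
As an illustration of why a seeded PRNG keeps the process reproducible: Python's standard `random` module uses the Mersenne Twister as its core generator, so two generators created with the same seed produce the same sequence of patch positions. The function name and the slide dimensions below are hypothetical.

```python
import random


def draw_patch_origins(seed, n, slide_width, slide_height, patch_size):
    """Draw n random top-left patch corners that fit inside the slide."""
    rng = random.Random(seed)  # private Mersenne Twister instance
    return [
        (rng.randrange(slide_width - patch_size + 1),
         rng.randrange(slide_height - patch_size + 1))
        for _ in range(n)
    ]


run_a = draw_patch_origins(seed=42, n=5, slide_width=10_000,
                           slide_height=8_000, patch_size=256)
run_b = draw_patch_origins(seed=42, n=5, slide_width=10_000,
                           slide_height=8_000, patch_size=256)
print(run_a == run_b)  # the same seed gives the same patch positions
```

Using a dedicated `random.Random` instance instead of the module-level functions keeps the sampler's sequence independent of any other code that draws random numbers.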

We use the following algorithm in pseudo-code:

- Specify classes (e.g. background, target)
- Set epoch size
- Repeat until the epoch size is reached:
     - Draw random class, select slides that contain the class
     - Draw random slide from selection
     - Draw random annotation from slide, create annotation mask
     - Draw random patch from annotation mask
          - Calculate if patch falls in the annotation mask
          - Return patch if a specified threshold is reached
          - Else draw new patch
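
The pseudo-code above could be sketched in Python roughly as follows. This is a simplified illustration under several assumptions: annotations are axis-aligned boxes instead of arbitrary polygons, the score is the fraction of the patch covered by the box, and the slide dictionaries, the function names and the `max_tries` guard are all hypothetical.

```python
import random
from itertools import islice


def overlap_fraction(patch, box):
    """Fraction of the patch area covered by an axis-aligned annotation box."""
    px0, py0, px1, py1 = patch
    bx0, by0, bx1, by1 = box
    w = max(0, min(px1, bx1) - max(px0, bx0))
    h = max(0, min(py1, by1) - max(py0, by0))
    return (w * h) / ((px1 - px0) * (py1 - py0))


def has_class(slide, label):
    return any(a["label"] == label for a in slide["annotations"])


def patch_stream(slides, classes, patch_size, threshold=0.8,
                 seed=0, class_weights=None, max_tries=100):
    """Yield (slide_name, label, patch_box) tuples without ever ending."""
    for label in classes:
        if not any(has_class(s, label) for s in slides):
            raise ValueError(f"no slide contains class {label!r}")
    rng = random.Random(seed)
    half = patch_size // 2
    while True:
        # Draw a class: uniform by default, weighted if class_weights is given.
        label = rng.choices(classes, weights=class_weights, k=1)[0]
        # Draw a slide that contains the class, then one of its annotations.
        slide = rng.choice([s for s in slides if has_class(s, label)])
        annotation = rng.choice(
            [a for a in slide["annotations"] if a["label"] == label])
        x0, y0, x1, y1 = annotation["box"]
        width, height = slide["size"]
        # Rejection sampling: redraw until the patch overlaps enough.
        # After max_tries failures, start over with a fresh class draw.
        for _ in range(max_tries):
            px = min(max(rng.randint(x0 - half, x1 - half), 0), width - patch_size)
            py = min(max(rng.randint(y0 - half, y1 - half), 0), height - patch_size)
            patch = (px, py, px + patch_size, py + patch_size)
            if overlap_fraction(patch, annotation["box"]) >= threshold:
                yield slide["name"], label, patch
                break


# Hypothetical slides with box annotations given as (x0, y0, x1, y1).
slides = [
    {"name": "slide_a", "size": (6000, 4000), "annotations": [
        {"label": "target", "box": (1000, 1000, 2000, 2000)},
        {"label": "background", "box": (3000, 0, 5000, 2000)}]},
    {"name": "slide_b", "size": (4000, 3000), "annotations": [
        {"label": "background", "box": (0, 0, 1500, 1500)}]},
]
classes = ["background", "target"]
epoch_size = 200  # a hyperparameter, not a property of the data
epoch = list(islice(patch_stream(slides, classes, 256, seed=7), epoch_size))
```

Because the generator never ends, `islice` is what defines an epoch here. Passing `class_weights` turns the uniform class draw into a weighted one.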


  • The first random draw selects a class at random, so that every class is sampled with equal probability. If weighted sampling is required, only a minor modification is needed.
  • We check whether a patch falls within the annotation mask by calculating a patch annotation score from the annotation mask. Setting a threshold on this score ensures that the patch belongs to the class and falls within the annotation with sufficient accuracy. The patch annotation score is calculated as:


Irregular annotations could otherwise lead to patches falling outside the annotation mask
  • Having the flexibility to set the epoch size eliminates the need for random crops as data augmentation.
  • Patches are not physically stored but generated on the fly. The random generation may seem a time-consuming process, but as the bottleneck is in the training of the model, there is no delay in the process flow.
  • Setting the seed of the PRNG still allows us to have reproducible results.
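
To make the patch annotation score concrete, here is a minimal sketch. The exact formula is not reproduced in this text, so the code assumes that the score is the fraction of a patch's pixels that lie inside the binary annotation mask. The function name and the toy L-shaped mask are illustrative.

```python
def patch_annotation_score(annotation_mask, x, y, patch_size):
    """Fraction of the patch's pixels that lie inside the annotation mask.

    annotation_mask is a 2-D list of 0/1 values; pixels that fall outside
    the mask array count as not annotated.
    """
    inside = 0
    for row in annotation_mask[y:y + patch_size]:
        inside += sum(row[x:x + patch_size])
    return inside / (patch_size * patch_size)


# Hypothetical L-shaped annotation: its bounding box is the full 6 x 6 area,
# but the top-right corner is not annotated.
mask = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
threshold = 0.75
for x, y in [(0, 0), (3, 0), (2, 2), (2, 1)]:
    score = patch_annotation_score(mask, x, y, patch_size=3)
    print((x, y), round(score, 3), "keep" if score >= threshold else "redraw")
```

A patch drawn from the bounding box of an irregular annotation can land mostly outside the annotated area, as the patch at (3, 0) does here; the threshold is what rejects such patches.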


Using the stochastic patch generator allows us to quickly set up a dataset for any machine learning algorithm that uses WSI. It eliminates the need to unpack slides into patches and store them on a drive, while keeping results reproducible and without delaying the process flow.


[1] Sorace, J., Aberle, D.R., Elimam, D. et al. Integrating pathology and radiology disciplines: an emerging opportunity? BMC Med 10, 100 (2012). doi:10.1186/1741-7015-10-100

[2] Mark D. Zarella, Douglas Bowman, Famke Aeffner, Navid Farahani, Albert Xthona, Syeda Fatima Absar, Anil Parwani, Marilyn Bui, and Douglas J. Hartman (2019) A Practical Guide to Whole Slide Imaging: A White Paper From the Digital Pathology Association. Archives of Pathology & Laboratory Medicine: February 2019, Vol. 143, No. 2, pp. 222–234.
