Week 2 — Histopathologic Cancer Detection

Furkan Kaya
Published in bbm406f19
Dec 11, 2019

Hello everyone! This is the second post in the series on our Machine Learning course project, cancer detection with histopathological data, which we started with great excitement last week. This week we introduce the dataset we will use in the project. We hope you enjoy the post. Let's take a look at this week's agenda!

Week 1 — Histopathologic Cancer Detection

Example images from our dataset PatchCamelyon (PCam). Green boxes indicate tumor tissue in the center region.

The original PatchCamelyon (PCam) benchmark dataset contains duplicate images. This matters when building training and test sets: removing duplicates improves the quality of the training data, and if the same image appears in both the training and the test data, the test score no longer measures how well the model generalizes, so the evaluation becomes unreliable. Therefore, to get more trustworthy results, we decided to work with a version of PCam from which the duplicate images have been removed.
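To make the idea of duplicate removal concrete, here is a minimal sketch that groups images by a hash of their raw bytes. It is an illustration under stated assumptions, not the procedure actually used to clean PCam: the function names and the toy image ids are hypothetical, and hashing only catches byte-for-byte identical images, not near-duplicates.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(images):
    """Group image ids whose raw bytes are identical.

    `images` maps an image id to its raw bytes (hypothetical layout).
    Returns one list of ids per group that contains more than one image.
    """
    groups = defaultdict(list)
    for image_id, raw in images.items():
        groups[hashlib.sha256(raw).hexdigest()].append(image_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def drop_duplicates(images):
    """Keep only the first image seen for every distinct byte content."""
    seen = set()
    unique = {}
    for image_id, raw in images.items():
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[image_id] = raw
    return unique


# Toy data standing in for image files; the ids are made up.
images = {
    "patch_a": bytes([10, 20, 30, 40]),
    "patch_b": bytes([1, 2, 3, 4]),
    "patch_a_copy": bytes([10, 20, 30, 40]),
}
duplicate_groups = find_exact_duplicates(images)
unique_images = drop_duplicates(images)
print(duplicate_groups)
print(sorted(unique_images))
```

Running the deduplication before splitting into training and test sets is what prevents the same image from landing on both sides of the split.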


Numbers

The PatchCamelyon dataset is divided into three parts: a training set, a validation set, and a test set. There are 262,144 images in the training set and 32,768 in each of the validation and test sets. Each of these splits is balanced between positive and negative samples (tumor tissue vs. healthy tissue). The dataset we will use, however, is a subset of PatchCamelyon with about 220k training images and 57k evaluation images. The 220,025 training images contain 89,117 positive and 130,908 negative samples, so the labels are not equally distributed as they are in the original PatchCamelyon.

Data Label Distribution of Training Data
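The imbalance can be quantified with a few lines of arithmetic. The sketch below takes the two class sizes quoted above, keyed by size ("majority" and "minority") rather than by label, and computes each class's share together with a "balanced" class weight, total / (number of classes × class count). The weighting is one common heuristic and an assumption on our side, not something we have settled on for the project; the function names are hypothetical.

```python
def class_shares(counts):
    """Return the fraction of all samples that falls into each class."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * class_count).

    Rarer classes get a weight above 1, more frequent ones below 1.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * count) for name, count in counts.items()}


# Class sizes of the training subset, as quoted in the text.
train_counts = {"majority": 130908, "minority": 89117}

shares = class_shares(train_counts)
weights = balanced_class_weights(train_counts)

print(sum(train_counts.values()))
for name in train_counts:
    print(f"{name}: share={shares[name]:.3f}, weight={weights[name]:.3f}")
```

Such weights could be passed to a loss function so that errors on the smaller class count more, which is one way to compensate for the roughly 60/40 split.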

Labeling

A positive label indicates that the center 32x32 px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable the design of fully-convolutional models that do not use any zero-padding, to ensure consistent behavior when applied to a whole-slide image.
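The labeling rule can be sketched in plain Python. The example assumes 96x96 px patches (the PCam patch size, which is not stated above) and a binary tumor mask given as a list of rows; the function and variable names are hypothetical and this is an illustration of the rule, not the dataset's actual labeling code.

```python
def center_label(mask, center=32):
    """Return 1 if the central `center` x `center` region of a binary
    tumor mask contains at least one tumor pixel, otherwise 0.

    `mask` is a list of rows, each a list of 0/1 values.
    """
    height, width = len(mask), len(mask[0])
    top = (height - center) // 2
    left = (width - center) // 2
    return int(any(
        mask[row][col]
        for row in range(top, top + center)
        for col in range(left, left + center)
    ))


def blank_mask(size=96):
    """Create a size x size mask with no tumor pixels."""
    return [[0] * size for _ in range(size)]


# Tumor pixel only in the outer region: the label stays negative.
outer_only = blank_mask()
outer_only[5][5] = 1

# A single tumor pixel inside the center region: the label is positive.
center_hit = blank_mask()
center_hit[40][50] = 1

print(center_label(outer_only), center_label(center_hit))
```

For a 96x96 patch the center region covers rows and columns 32 to 63, so everything outside that square is context only.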
