Tile Generation Tutorial

Welcome to the tile generation tutorial!

Because a whole slide image is too large to feed directly into a deep learning model, a slide is often divided into a set of small tiles that are then used for training. For tile-based whole slide image analysis, generating tiles and labels is an important and laborious step. With the LUNA tiling CLIs and tutorials, you can easily generate tile labels and get your data ready for downstream analysis. In this notebook, we will see how to generate tiles and labels using the LUNA tiling CLIs. Here are the main steps we will review:

  • Load slides

  • Generate tiles, labels

  • Collect tiles for model training

Throughout this notebook, we will use different method parameter files. Please refer to the example parameter files in the configs directory to follow these steps.

Load slides

The first step in generating tiles is to load the slides into a data store, where our results will be written. We will use the load_slide CLI to prepare slides from a whole slide image (WSI) table at our analysis location. Each slide is represented as a WholeSlideImage data type.

All LUNA tiling CLIs offer a help option. To check the CLI arguments, simply run the CLI with the --help option.

[1]:
%%bash

load_slide --help
Usage: load_slide [OPTIONS]

Options:
  -a, --app_config TEXT         application configuration yaml file. See
                                config.yaml.template for details.  [required]

  -s, --datastore_id TEXT       datastore name. usually a slide id.
                                [required]

  -m, --method_param_path TEXT  json parameter file with path to a WSI delta
                                table.  [required]

  --help                        Show this message and exit.
[2]:
import multiprocessing
import subprocess

slide_ids = ['2551571', '2551531', '2551028', '2551389', '2551129']

# simple wrapper that runs a CLI call for multiple slides in parallel
def pool_process(func, slides):
    # three worker processes; map() blocks until every slide has been processed
    with multiprocessing.Pool(3) as pool:
        pool.map(func, slides)

[3]:
# call load_slide as subprocess
def call_load_slide(slide):
    subprocess.run(f"load_slide -a configs/app_config.yaml -s {slide} -m configs/load_slides.yaml", shell=True)
    return slide

pool_process(call_load_slide, slide_ids)
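
The wrapper above does not look at the exit status of each CLI call, so a slide that fails to load can go unnoticed. Here is a minimal sketch of one alternative that uses only the standard library: it builds each command as an argument list (no shell), runs the slides in a thread pool, and collects the return codes. The helper names build_command and run_for_slides are hypothetical and are not part of LUNA.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(cli, slide, method_params, app_config="configs/app_config.yaml"):
    """Return the CLI call as an argument list, so no shell quoting is needed."""
    return [cli, "-a", app_config, "-s", str(slide), "-m", method_params]


def run_for_slides(commands, max_workers=3):
    """Run each command in a worker thread and return the exit codes in input order."""
    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


# Example usage with the slide list from this notebook:
# commands = [build_command("load_slide", s, "configs/load_slides.yaml") for s in slide_ids]
# failed = [s for s, rc in zip(slide_ids, run_for_slides(commands)) if rc != 0]
```

Threads are sufficient here because each worker only waits on an external process; the heavy lifting happens inside the CLI itself.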

Once this step is done, the data store will be created at your datastore_path, or at PRO_12-123/tiles if you use the example method parameters.

Let’s take a look at the WholeSlideImage location for slide 2551571. We’ll see that this process created a softlink pointing to the svs image path, along with a metadata.json file.

[4]:
%%bash

ls -lhtr PRO_12-123/tiles/2551571/ov_slides/WholeSlideImage/
total 1.0K
lrwxrwxrwx 1 pashaa pashaa  104 Jul 13 13:29 data -> /gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/PRO_12-123/data/toy_data_set/2551571.svs
-rwxrwxrwx 1 pashaa pashaa 3.1K Jul 13 13:29 metadata.json
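
The same check can be done from Python. The sketch below resolves the data softlink and lists the top-level keys of metadata.json. The function name describe_slide_store is hypothetical, and we assume metadata.json holds a JSON object; its actual keys are not shown in this tutorial, so the code only reports whatever it finds.

```python
import json
import os


def describe_slide_store(store_dir):
    """Summarize a WholeSlideImage directory: link status, link target, metadata keys."""
    data_link = os.path.join(store_dir, "data")
    with open(os.path.join(store_dir, "metadata.json")) as fh:
        metadata = json.load(fh)  # assumed to be a JSON object (dict)
    return {
        "is_symlink": os.path.islink(data_link),
        "target": os.path.realpath(data_link),
        "metadata_keys": sorted(metadata),
    }


# Example usage:
# describe_slide_store("PRO_12-123/tiles/2551571/ov_slides/WholeSlideImage")
```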

Generate tiles and labels

This is the main tiling step. The CLI generates tiles and populates the otsu and purple scores along with the regional annotation label. The otsu score is calculated with Otsu's foreground/background detection algorithm, which is commonly used to filter out the slide background. Purple scores are calculated to provide additional guidance for H&E slide analysis.
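
To make the otsu score more concrete, here is a minimal pure-Python sketch of Otsu's method on 8-bit grayscale values. It is not LUNA's implementation: the function names are hypothetical, and treating the score as the fraction of a tile's pixels at or below the threshold (tissue being darker than the glass background) is our assumption.

```python
def otsu_threshold(gray_values):
    """Return the 8-bit threshold that maximizes the between-class variance."""
    hist = [0] * 256
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def foreground_fraction(tile_gray, threshold):
    """Fraction of a tile's pixels at or below the threshold (assumed to be tissue)."""
    return sum(1 for v in tile_gray if v <= threshold) / len(tile_gray)
```

In this sketch the threshold would be computed once per slide (for example on a thumbnail), and each tile is then scored against it.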

[5]:
%%bash

generate_tiles --help
Usage: generate_tiles [OPTIONS]

Options:
  -a, --app_config TEXT         application configuration yaml file. See
                                config.yaml.template for details.  [required]

  -s, --datastore_id TEXT       datastore name. usually a slide id.
                                [required]

  -m, --method_param_path TEXT  json file with method parameters for tile
                                generation and filtering.  [required]

  --help                        Show this message and exit.

With this method configuration, the tile size is set to 128, scale factor to 16 and slide magnification (from slide metadata) to 20. In this example, we label the tiles with the default labels provided by the regional annotations. Note that we keep only the tiles that have been annotated and have an otsu score above 0.5 for our analysis. Please refer to configs/generate_tiles.yaml for more details on the method parameters.
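
The filter rule described above can be written down in a few lines. This is a sketch of the rule as stated in the text, not the CLI's code; keep_tile is a hypothetical helper, and the row is assumed to be a dict with the otsu_score and regional_label fields that appear in the output CSV.

```python
def keep_tile(row, min_otsu=0.5, require_label=True):
    """Return True if a tile row passes the otsu filter and, optionally, has a label."""
    if row["otsu_score"] <= min_otsu:
        return False  # mostly background
    if require_label and not row.get("regional_label"):
        return False  # not covered by a regional annotation
    return True


# Example usage:
# keep_tile({"otsu_score": 0.859375, "regional_label": "veins"})
```

Setting require_label=False corresponds to the "all tissue regions" case used for the test slide.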

Here we reserve 4 slides for model training, and 1 slide for testing. For training, we will only generate tiles for the areas that have been annotated by the pathologists, so the model will have ground-truth labels. For testing, we will generate tiles for the whole slide.

We hold out the test slide so that the model can annotate it in the inference notebook. For this test slide, as mentioned before, we generate tiles for all tissue regions (otsu score > 0.5). Note that here we use a different config file, configs/generate_tiles_all_tissues.yaml, which excludes the parameters project_id, labelset, and annotation_table_path, all of which pertain to the regional annotation.
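
As an illustration of how the two configurations relate, the sketch below starts from the parameters that generate_tiles prints in its log for slide 2551571 (shown later in this notebook) and drops the three annotation-related keys. The real configs/generate_tiles_all_tissues.yaml may differ in other values; only the three excluded keys are confirmed by the text.

```python
# Parameters as printed in the generate_tiles log for an annotated (training) slide
annotated_params = {
    "input_wsi_tag": "ov_slides",
    "job_tag": "ov_default_labels",
    "tile_size": 128,
    "scale_factor": 16,
    "magnification": 20,
    "project_id": "PRO_12-123",
    "labelset": "DEFAULT_LABELS",
    "filter": {"otsu_score": 0.5},
    "root_path": "/gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/PRO_12-123/tiles",
    "annotation_table_path": "/gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/PRO_12-123/tables/REGIONAL_METADATA_RESULTS",
}

# Keys that only make sense when regional annotations are used
ANNOTATION_KEYS = ("project_id", "labelset", "annotation_table_path")

# Illustrative "all tissues" parameters: same settings, annotation keys removed
all_tissue_params = {k: v for k, v in annotated_params.items() if k not in ANNOTATION_KEYS}
```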

Depending on the size of the WSI and tiles, this step can take up to 10 minutes per slide.

[6]:
slide_ids_train = ['2551571', '2551531', '2551028', '2551389']
slide_ids_test = '2551129'

# call generate_tiles as subprocess
def call_generate_tiles(slide):
    subprocess.run(f"generate_tiles -a configs/app_config.yaml -s {slide} -m configs/generate_tiles.yaml", shell=True)
    return slide

pool_process(call_generate_tiles, slide_ids_train)
[ ]:
%%bash

generate_tiles -a configs/app_config.yaml -s 2551129 -m configs/generate_tiles_all_tissues.yaml

Once the step is done, you can find the tiles and the score CSV for your slide at your output location. For slide ID 2551571, the tile images and metadata are stored at PRO_12-123/tiles/2551571/ov_default_labels/TileImages/data.

[7]:
%%bash

ls -lhtr PRO_12-123/tiles/2551571/ov_default_labels/TileImages/data
total 3.6G
-rwxrwxrwx 1 pashaa pashaa 3.6G Jul 13 13:44 tiles.slice.pil
-rwxrwxrwx 1 pashaa pashaa 207K Jul 13 13:45 address.slice.csv
-rwxrwxrwx 1 pashaa pashaa  635 Jul 13 13:45 metadata.json

Let’s look at the tile metadata in the output CSV.

The tile otsu_score, purple_score and regional_label values are stored alongside tile metadata such as address, coordinates, size, and offset. From the log, we see that out of a total of 206830 tiles, only the subset that meets the filter criteria has been kept.

[39]:
import pandas as pd

# For a train slide, we have generated tiles for annotated regions, and populated regional_labels
df = pd.read_csv("PRO_12-123/tiles/2551571/ov_default_labels/TileImages/data/address.slice.csv")
df
[39]:
      address        coordinates  otsu_score  purple_score  regional_label      tile_image_offset  tile_image_length  tile_image_size_xy  tile_image_mode
0     x107_y183_z20  (107, 183)   0.859375    0.984375      veins               2.845409e+08       49152.0            128.0               RGB
1     x107_y184_z20  (107, 184)   0.890625    0.984375      veins               2.845901e+08       49152.0            128.0               RGB
2     x107_y185_z20  (107, 185)   1.000000    1.000000      veins               2.846392e+08       49152.0            128.0               RGB
3     x107_y192_z20  (107, 192)   0.593750    1.000000      veins               2.849833e+08       49152.0            128.0               RGB
4     x108_y183_z20  (108, 183)   0.875000    0.953125      veins               2.918154e+08       49152.0            128.0               RGB
...   ...            ...          ...         ...           ...                 ...                ...                ...                 ...
2615  x453_y148_z20  (453, 148)   1.000000    1.000000      lympho_rich_tumor   3.706257e+09       49152.0            128.0               RGB
2616  x454_y146_z20  (454, 146)   1.000000    1.000000      lympho_rich_tumor   3.713237e+09       49152.0            128.0               RGB
2617  x454_y147_z20  (454, 147)   1.000000    1.000000      lympho_rich_tumor   3.713286e+09       49152.0            128.0               RGB
2618  x454_y148_z20  (454, 148)   1.000000    1.000000      lympho_rich_tumor   3.713335e+09       49152.0            128.0               RGB
2619  x455_y143_z20  (455, 143)   1.000000    1.000000      lympho_rich_stroma  3.719725e+09       49152.0            128.0               RGB

2620 rows × 9 columns
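
The tile_image_offset and tile_image_length columns suggest how a single tile can be pulled out of tiles.slice.pil. Note that 49152 = 128 × 128 × 3, which is consistent with uncompressed RGB tiles, although the tutorial does not state the storage format, so treat that as an assumption. The helper read_tile_bytes below is hypothetical and uses only the standard library.

```python
def read_tile_bytes(pil_path, offset, length):
    """Read one tile's raw bytes from the concatenated tile file."""
    offset, length = int(offset), int(length)  # the CSV stores these as floats
    with open(pil_path, "rb") as fh:
        fh.seek(offset)
        data = fh.read(length)
    if len(data) != length:
        raise ValueError("tile extends past the end of the file")
    return data


# Example usage with one row of the DataFrame above, assuming raw RGB bytes
# and that Pillow is installed:
# from PIL import Image
# row = df.iloc[0]
# raw = read_tile_bytes(".../tiles.slice.pil", row.tile_image_offset, row.tile_image_length)
# size = int(row.tile_image_size_xy)
# tile = Image.frombytes(row.tile_image_mode, (size, size), raw)
```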

=== LOGS ===
2021-07-06 17:58:24,533 - INFO - data_processing.pathology.common.preprocess - Params = {'input_wsi_tag': 'ov_slides', 'job_tag': 'ov_default_labels', 'tile_size': 128, 'scale_factor': 16, 'magnification': 20, 'project_id': 'PRO_12-123', 'labelset': 'DEFAULT_LABELS', 'filter': {'otsu_score': 0.5}, 'root_path': '/gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/PRO_12-123/tiles', 'annotation_table_path': '/gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/PRO_12-123/tables/REGIONAL_METADATA_RESULTS'}
2021-07-06 17:58:24,645 - INFO - data_processing.pathology.common.preprocess - Slide size = [71711,47602]
2021-07-06 17:58:24,646 - INFO - data_processing.pathology.common.preprocess - Normalized magnification scale factor for 20x is 1, overall thumbnail scale factor is 16
2021-07-06 17:58:24,646 - INFO - data_processing.pathology.common.preprocess - Requested tile size=128, tile size at full magnficiation=128, tile size at thumbnail=8
2021-07-06 17:58:26,508 - INFO - data_processing.pathology.common.preprocess - tiles x 561, tiles y 372
2021-07-06 17:58:27,282 - INFO - data_processing.pathology.common.preprocess - Number of tiles in raster: 206830
2021-07-06 18:01:42,894 - INFO - data_processing.pathology.common.preprocess - Proccessing tiles [10000,78149]
2021-07-06 18:02:41,410 - INFO - data_processing.pathology.common.preprocess - Proccessing tiles [20000,78149]
2021-07-06 18:03:39,555 - INFO - data_processing.pathology.common.preprocess - Proccessing tiles [30000,78149]
2021-07-06 18:04:36,638 - INFO - data_processing.pathology.common.preprocess - Proccessing tiles [40000,78149]
2021-07-06 18:05:33,419 - INFO - data_processing.pathology.common.preprocess - Proccessing tiles [50000,78149]
2021-07-06 18:06:30,591 - INFO - data_processing.pathology.common.preprocess - Proccessing tiles [60000,78149]
2021-07-06 18:07:31,272 - INFO - data_processing.pathology.common.preprocess - Proccessing tiles [70000,78149]
2021-07-06 18:08:21,267 - INFO - data_processing.pathology.common.preprocess - Saved tile scores and images at /gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/PRO_12-123/tiles/2551571/ov_default_labels/TileImages/data
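
Several numbers in this log can be reproduced with a little arithmetic, which is a useful sanity check when you change tile_size or scale_factor. The sketch below uses only values copied from the log; the two "observation" lines show relationships that happen to hold for this slide and are not statements about how the CLI computes them.

```python
import math

# Values copied from the generate_tiles log for slide 2551571
slide_width, slide_height = 71711, 47602
tile_size, scale_factor = 128, 16

tiles_x = math.ceil(slide_width / tile_size)    # 561, matches "tiles x 561"
tiles_y = math.ceil(slide_height / tile_size)   # 372, matches "tiles y 372"
thumbnail_tile = tile_size // scale_factor      # 8, matches "tile size at thumbnail=8"

# Observation: the logged raster count (206830) equals the grid with one
# row/column of tiles trimmed from each edge. The reason is not stated in the log.
raster_guess = (tiles_x - 2) * (tiles_y - 2)

# Observation: scores in the table are whole multiples of 1/64, which fits a
# fraction taken over the 8 x 8 pixels of a thumbnail tile.
foreground_pixels = 0.859375 * thumbnail_tile ** 2

print(tiles_x, tiles_y, thumbnail_tile, raster_guess, foreground_pixels)
```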

For the test slide, we keep all tissue regions, so we have far more tiles generated. Notice we don’t have the regional labels.

[40]:
# For the test slide, we have generated tiles for all tissue regions
df = pd.read_csv("PRO_12-123/tiles/2551129/ov_default_labels/TileImages/data/address.slice.csv")
df
[40]:
       address      coordinates  otsu_score  purple_score  tile_image_offset  tile_image_length  tile_image_size_xy  tile_image_mode
0      x1_y1_z20    (1, 1)       1.0000      0.0           0.000000e+00       49152.0            128.0               RGB
1      x1_y2_z20    (1, 2)       1.0000      0.0           4.915200e+04       49152.0            128.0               RGB
2      x2_y1_z20    (2, 1)       1.0000      0.0           9.830400e+04       49152.0            128.0               RGB
3      x2_y2_z20    (2, 2)       1.0000      0.0           1.474560e+05       49152.0            128.0               RGB
4      x3_y1_z20    (3, 1)       1.0000      0.0           1.966080e+05       49152.0            128.0               RGB
...    ...          ...          ...         ...           ...                ...                ...                 ...
28750  x636_y2_z20  (636, 2)     0.8750      0.0           1.413120e+09       49152.0            128.0               RGB
28751  x636_y3_z20  (636, 3)     0.8125      0.0           1.413169e+09       49152.0            128.0               RGB
28752  x637_y1_z20  (637, 1)     1.0000      0.0           1.413218e+09       49152.0            128.0               RGB
28753  x637_y2_z20  (637, 2)     0.9375      0.0           1.413267e+09       49152.0            128.0               RGB
28754  x637_y3_z20  (637, 3)     0.8750      0.0           1.413317e+09       49152.0            128.0               RGB

28755 rows × 8 columns
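
For the rows shown in this output, the offsets follow a simple pattern: each tile starts where the previous one ends, so the offset is the row index times the tile length. The short sketch below checks this against the displayed values. It is an observation about this table only; in the training-slide table above, the offsets do not follow row order, so do not rely on this rule in general. The name expected_offset is hypothetical.

```python
TILE_BYTES = 128 * 128 * 3  # 49152, the tile_image_length shown in the table


def expected_offset(row_index, tile_bytes=TILE_BYTES):
    """Byte offset of a tile if tiles are packed back to back in row order."""
    return row_index * tile_bytes


# Compare with the table: row 4 shows 1.966080e+05, row 28750 shows 1.413120e+09
print(expected_offset(4), expected_offset(28750))
```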

Collect tiles for model training

Now that we have created tile labels, we can use the collect_tiles CLI to collect the tile metadata as a set of parquet tables and save the outputs for multiple slide IDs in the same dataset. This step gathers our dataset for model training.

[9]:
%%bash

collect_tiles --help
Usage: collect_tiles [OPTIONS]

Options:
  -a, --app_config TEXT         application configuration yaml file. See
                                config.yaml.template for details.  [required]

  -s, --datastore_id TEXT       datastore name. usually a slide id.
                                [required]

  -m, --method_param_path TEXT  json file with method parameters including
                                input, output details.  [required]

  --help                        Show this message and exit.

At this point, it is critical to note that our model will train on the 4 slides reserved for training. We have held one slide out of the model training step in order to use it for the inference step.

We will call collect_tiles on the training slides to prepare a dataset for training.

[ ]:
slide_ids_train = ['2551571', '2551531', '2551028', '2551389']

# call collect_tiles as subprocess
def call_collect_tiles(slide):
    subprocess.run(f"collect_tiles -a configs/app_config.yaml -s {slide} -m configs/collect_tiles.yaml", shell=True)

pool_process(call_collect_tiles, slide_ids_train)

Let’s check the output. The collected parquet files can be loaded as a pyarrow ParquetDataset and converted to a pandas DataFrame.

You’ll notice the table is indexed by patient_id, slide ID (id_slide_container) and address. The data_path column points to the tile image file. The rest of the metadata stored in this table is similar to the output of the generate_tiles CLI.

[37]:
from pyarrow.parquet import ParquetDataset

ds = ParquetDataset('PRO_12-123/tiles/ov_tileset').read().to_pandas()
ds
[37]:
                                               coordinates  otsu_score  purple_score  regional_label      tile_image_offset  tile_image_length  tile_image_size_xy  tile_image_mode  data_path
patient_id  id_slide_container  address
4           2551028             x128_y72_z20   (128, 72)    1.0         1.0           arteries            4.446781e+08       49152.0            128.0               RGB              /gpfs/mskmindhdp_emc/user/shared_data_folder/p...
                                x128_y73_z20   (128, 73)    1.0         1.0           arteries            4.447273e+08       49152.0            128.0               RGB              /gpfs/mskmindhdp_emc/user/shared_data_folder/p...
                                x128_y74_z20   (128, 74)    1.0         1.0           arteries            4.447764e+08       49152.0            128.0               RGB              /gpfs/mskmindhdp_emc/user/shared_data_folder/p...
                                x128_y75_z20   (128, 75)    1.0         1.0           arteries            4.448256e+08       49152.0            128.0               RGB              /gpfs/mskmindhdp_emc/user/shared_data_folder/p...
                                x128_y76_z20   (128, 76)    1.0         1.0           arteries            4.448748e+08       49152.0            128.0               RGB              /gpfs/mskmindhdp_emc/user/shared_data_folder/p...
...         ...                 ...            ...          ...         ...           ...                 ...                ...                ...                 ...              ...
1           2551571             x453_y148_z20  (453, 148)   1.0         1.0           lympho_rich_tumor   3.706257e+09       49152.0            128.0               RGB              /gpfs/mskmindhdp_emc/user/shared_data_folder/p...
                                x454_y146_z20  (454, 146)   1.0         1.0           lympho_rich_tumor   3.713237e+09       49152.0            128.0               RGB              /gpfs/mskmindhdp_emc/user/shared_data_folder/p...
                                x454_y147_z20  (454, 147)   1.0         1.0           lympho_rich_tumor   3.713286e+09       49152.0            128.0               RGB              /gpfs/mskmindhdp_emc/user/shared_data_folder/p...
                                x454_y148_z20  (454, 148)   1.0         1.0           lympho_rich_tumor   3.713335e+09       49152.0            128.0               RGB              /gpfs/mskmindhdp_emc/user/shared_data_folder/p...
                                x455_y143_z20  (455, 143)   1.0         1.0           lympho_rich_stroma  3.719725e+09       49152.0            128.0               RGB              /gpfs/mskmindhdp_emc/user/shared_data_folder/p...

14696 rows × 9 columns
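
Since address is part of the index, it can be handy to turn an address string back into numbers, for example to place predictions on a grid. The sketch below parses the x<col>_y<row>_z<n> pattern seen in the tables. The helper parse_address is hypothetical, and reading the z component as the magnification (20 in this tutorial) is an inference from the examples rather than something the tutorial states.

```python
import re

# Pattern inferred from addresses such as "x453_y148_z20"
ADDRESS_RE = re.compile(r"^x(\d+)_y(\d+)_z(\d+)$")


def parse_address(address):
    """Split a tile address into integer (x, y, z) components."""
    match = ADDRESS_RE.match(address)
    if match is None:
        raise ValueError(f"unexpected tile address: {address!r}")
    x, y, z = (int(part) for part in match.groups())
    return x, y, z


# Example usage:
# parse_address("x453_y148_z20")
```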

Congratulations! You now have the tile images and labels ready to train your model.