{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dataset Preparation Tutorial\n",
"\n",
"Welcome to the dataset preparation tutorial! In this notebook, we will download the toy data set for the tutorial and prepare the necessary tables used for later analysis. Here are the steps we will review:\n",
"\n",
"- Check server connection\n",
"- Create a new directory for your project\n",
"- Download data\n",
"- Set up project configuration files\n",
"- Build the proxy table\n",
"- Run regional annotation ETL\n",
"\n",
"Please note that, for the remainder of the tutorial, we assume that you are on the LUNA servers. The following steps will not execute properly if this is not the case. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check server connection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check that that you are connected to the LUNA servers, make sure you can run the following without errors:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['/gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/pathology-tutorial-sandbox/data-processing/data_processing']"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import data_processing\n",
"data_processing.__path__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If so, congratulations! It is as simple as that. You are ready to start making the project workspace and preparing the data!"
]
},
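{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the import above fails, a slightly more defensive check can make the cause easier to see. The cell below is a minimal sketch that relies only on the Python standard library; `check_package` is a hypothetical helper introduced here for illustration and is not part of `data_processing`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import importlib.util\n",
"\n",
"\n",
"def check_package(name):\n",
"    \"\"\"Return the package's search path if it is importable, else None.\"\"\"\n",
"    spec = importlib.util.find_spec(name)\n",
"    if spec is None:\n",
"        return None\n",
"    # Packages expose their directories via submodule_search_locations;\n",
"    # plain modules only have an origin file.\n",
"    return list(spec.submodule_search_locations or [spec.origin])\n",
"\n",
"\n",
"location = check_package(\"data_processing\")\n",
"if location is None:\n",
"    print(\"data_processing is not importable; are you on the LUNA servers?\")\n",
"else:\n",
"    print(\"data_processing found at:\", location)"
]
},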
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create a new directory for your project\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will create a project space for your configurations, data, models, and outputs to go for this tutorial.\n",
"\n",
"To do so, we first need to create a file called *manifest.yaml* and populate its contents with those of the template configuration file."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"‘configs/manifest.yaml’ -> ‘manifest.yaml’\n"
]
}
],
"source": [
"%%bash\n",
"\n",
"touch manifest.yaml\n",
"cp -v configs/manifest.yaml manifest.yaml"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will use this file to create a new project space for this tutorial using a CLI from the repository."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"python3 -m data_processing.project.generate --manifest_file /gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/manifest.yaml\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should now see a new folder called *PRO_12-123* in this directory. This will be your project name!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download data\n",
"\n",
"The data that you will be using for this tutorial is a set of 5 whole slide images of ovarian cancer H&E slides, available in the svs file format. Whole slide imaging refers to the scanning of conventional glass slides for research purposes; in this case, these are slides that oncologists have used to inspecting cancer samples! We will download these images from Synapse, a data warehouse used for digital research. \n",
"\n",
"We will now make a folder for your data and the toy data set in this new project workspace."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cd PRO_12-123\n",
"mkdir data && cd data\n",
"mkdir toy_data_set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can find the pathology slides for your toy data set on Synapse. First, you must navigate to the Synapse website (https://www.synapse.org/) and create an account if you do not already have one. Once your account is created, open the site, search for the project ID (syn25946167) in the righthand corner, click the \"Files\" tab, and download the tar.gz file as a file (not as a package). This process may take a while, as you will be downloading a little under 5 GB of data onto your machine. Once downloaded, expand the tar file, and then relocate the five svs files into the *toy_data_set* folder."
]
},
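{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to script the extraction step, the cell below is a minimal sketch using only the Python standard library. It assumes the archive has already been copied to a location that this notebook can read; the archive path, the scratch directory name, and the `unpack_slides` helper are all hypothetical, so adjust them to match your setup."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import shutil\n",
"import tarfile\n",
"from pathlib import Path\n",
"\n",
"# Hypothetical locations: adjust both to match your download and project.\n",
"archive = Path(\"~/Downloads/toy_data_set.tar.gz\").expanduser()\n",
"target = Path(\"PRO_12-123/data/toy_data_set\")\n",
"\n",
"\n",
"def unpack_slides(archive, target, scratch=\"extracted_tmp\"):\n",
"    \"\"\"Extract a .tar.gz archive and move every .svs file into target.\"\"\"\n",
"    scratch = Path(scratch)\n",
"    target.mkdir(parents=True, exist_ok=True)\n",
"    with tarfile.open(archive, \"r:gz\") as tar:\n",
"        tar.extractall(scratch)\n",
"    moved = []\n",
"    # sorted() materializes the file list before anything is moved.\n",
"    for svs in sorted(scratch.rglob(\"*.svs\")):\n",
"        shutil.move(str(svs), str(target / svs.name))\n",
"        moved.append(svs.name)\n",
"    shutil.rmtree(scratch)  # remove the temporary extraction directory\n",
"    return moved\n",
"\n",
"\n",
"if archive.exists():\n",
"    print(unpack_slides(archive, target))\n",
"else:\n",
"    print(f\"Archive not found at {archive}; update the path first.\")"
]
},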
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set up project configuration files\n",
"\n",
"Next, you must set up your configuration files.\n",
"\n",
"In your project workspace, make a new directory called *my_conf* and copy the contents of the *configs/* file into it."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"\n",
"cp -R configs/ PRO_12-123/my_conf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: while you do not have to change the contents of *my_conf/app_confi.yaml*, you must fill out a few personal fields in *my_conf/data_config.yaml* and *regional_annotation_config.yaml*, namely: REQUESTOR, REQUESTOR_EMAIL, and DATE. Please take a moment to do so now, manually."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build the proxy table\n",
"\n",
"Now, we will run the Whole Slide Image (WSI) ETL to database the slides and build a proxy table. For reference, ETL stands for extract-transform-load; it is the method that often involves clearning data, transforming data types, and loading data into different systems. We will use to translate and obtain data hosted on LUNA servers into our project environment. Additionally, a proxy table is a local table that points to a remote object. The table that we will create will point to data from the LUNA servers that is relevant for our whole slide image slides that we just downloaded for the toy data set.\n",
"\n",
"First, make sure that your environment variables are set to the right destinations: "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/gpfs/mskmindhdp_emc/sw/env/bin/python3\n",
"/gpfs/mskmindhdp_emc/sw/env/bin/python3\n",
"/opt/spark-3.0.0-bin-hadoop3.2\n"
]
}
],
"source": [
"%%bash\n",
"\n",
"export PYSPARK_PYTHON=/gpfs/mskmindhdp_emc/sw/env/bin/python3\n",
"export PYSPARK_DRIVER_PYTHON=/gpfs/mskmindhdp_emc/sw/env/bin/python3\n",
"\n",
"echo $PYSPARK_PYTHON\n",
"echo $PYSPARK_DRIVER_PYTHON\n",
"echo $SPARK_HOME"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, to run the ETL, run the following command:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%bash\n",
"python3 -m data_processing.pathology.proxy_table.generate \\\n",
" -d configs/wsi_config.yaml \\\n",
" -a configs/app_config.yaml \\\n",
" -p delta\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step may take a while. At the end, your proxy table should be generated!\n",
"\n",
"Before we view the table, we must first update it to associate patient ID's with the slides. This is necessary for correctly training and validating the machine learning model in the coming notebooks. Once the slides are divided into \"tiles\" in the next notebook, the tiles are split between the training and validation sets for the ML model. If the tiles do not have patient ID's associated with them, then it is possible for tiles from one individual to appear in both the training and validation of the model; this would cause researchers to have an exaggerated interpretation of the model's accuracy, since we would essentially be validating the model on information that is too near to what it has already seen. \n",
"\n",
"Note that we will not be using patient IDs associated with MSK. Instead, we will be using spoof IDs that will suffice for this tutorial. When running this workflow with real data, make sure to include the IDs safely and securely. Run the following block of code to add a 'patient_id' column to the table and store it using Spark."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql import SparkSession\n",
"\n",
"# setup spark session\n",
"spark = SparkSession.builder \\\n",
" .appName(\"test\") \\\n",
" .master('local[*]') \\\n",
" .config(\"spark.driver.host\", \"127.0.0.1\") \\\n",
" .config(\"spark.jars.packages\", \"io.delta:delta-core_2.12:0.7.0\") \\\n",
" .config(\"spark.delta.logStore.class\", \"org.apache.spark.sql.delta.storage.HDFSLogStore\") \\\n",
" .config(\"spark.sql.extensions\", \"io.delta.sql.DeltaSparkSessionExtension\") \\\n",
" .config(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\") \\\n",
" .config(\"spark.hadoop.dfs.client.use.datanode.hostname\", \"true\") \\\n",
" .config(\"spark.driver.memory\", \"6g\") \\\n",
" .config(\"spark.executor.memory\", \"6g\") \\\n",
" .getOrCreate()\n",
"\n",
"# read WSI delta table\n",
"wsi_table = spark.read.format(\"delta\") \n",
" .load(\"file:////gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/PRO_12-123/tables/WSI_toy_data_set\").toPandas()\n",
"\n",
"# insert spoof patient ids\n",
"patient_id=[1,2,3,4,5]\n",
"wsi_table['patient_id']=patient_id\n",
"\n",
"wsi_table\n",
"\n",
"# convert back to a spark table (update table)\n",
"x = spark.createDataFrame(wsi_table)\n",
"x.write.format(\"delta\").mode(\"overwrite\").option(\"mergeSchema\", \"true\").save(\"file:////gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/PRO_12-123/tables/WSI_toy_data_set\")"
]
},
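{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the leakage concern described earlier concrete, the following cell sketches a patient-level split on a made-up tile table. It is only an illustration: the `tiles` DataFrame, the `split_by_patient` helper, and the `val_fraction` and `seed` parameters are hypothetical and are not part of the tutorial pipeline. The key idea is that whole patients, rather than individual tiles, are assigned to one set or the other."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for a tile table: 5 slides with 4 tiles each (values are made up).\n",
"tiles = pd.DataFrame({\n",
"    \"slide_id\": [s for s in range(5) for _ in range(4)],\n",
"    \"patient_id\": [p for p in [1, 2, 3, 4, 5] for _ in range(4)],\n",
"})\n",
"\n",
"\n",
"def split_by_patient(df, val_fraction=0.4, seed=0):\n",
"    \"\"\"Assign whole patients to either the training or the validation set.\"\"\"\n",
"    patients = sorted(df[\"patient_id\"].unique())\n",
"    random.Random(seed).shuffle(patients)\n",
"    n_val = max(1, round(len(patients) * val_fraction))\n",
"    val_patients = patients[:n_val]\n",
"    is_val = df[\"patient_id\"].isin(val_patients)\n",
"    return df[~is_val], df[is_val]\n",
"\n",
"\n",
"train, val = split_by_patient(tiles)\n",
"\n",
"# No patient contributes tiles to both sets.\n",
"assert set(train[\"patient_id\"]).isdisjoint(val[\"patient_id\"])\n",
"print(len(train), \"training tiles,\", len(val), \"validation tiles\")"
]
},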
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we may view the WSI table! This table should have the metadata associated with the WSI slides that you just collected, including the patient IDs. "
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['/gpfs/mskmindhdp_emc/etl-runner/data-processing/data_processing']"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import data_processing\n",
"data_processing.__path__"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" path | \n",
" modificationTime | \n",
" length | \n",
" wsi_record_uuid | \n",
" slide_id | \n",
" metadata | \n",
" patient_id | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" file:/gpfs/mskmindhdp_emc/user/shared_data_fol... | \n",
" 2021-07-06 11:42:52 | \n",
" 584611357 | \n",
" WSI-93ccfd50a210d0b8c7589352be9036ef5abf6b4f81... | \n",
" 2551129 | \n",
" {'aperio_User': 'd9286672-cd53-4139-87ba-d68c4... | \n",
" 2 | \n",
"
\n",
" \n",
" | 1 | \n",
" file:/gpfs/mskmindhdp_emc/user/shared_data_fol... | \n",
" 2021-07-06 11:53:51 | \n",
" 1413574341 | \n",
" WSI-03662b6be585f8bdb1a16a175a7cfda07c4057afe5... | \n",
" 2551571 | \n",
" {'aperio_User': 'd9286672-cd53-4139-87ba-d68c4... | \n",
" 1 | \n",
"
\n",
" \n",
" | 2 | \n",
" file:/gpfs/mskmindhdp_emc/user/shared_data_fol... | \n",
" 2021-07-06 11:46:00 | \n",
" 520642043 | \n",
" WSI-12677b7d98691d1eef8043727f27878eb9fda14b65... | \n",
" 2551531 | \n",
" {'aperio_User': 'd9286672-cd53-4139-87ba-d68c4... | \n",
" 3 | \n",
"
\n",
" \n",
" | 3 | \n",
" file:/gpfs/mskmindhdp_emc/user/shared_data_fol... | \n",
" 2021-07-06 11:43:07 | \n",
" 1322921471 | \n",
" WSI-1ba07f58166fc2073c854dd9b00a11eaca2203ff20... | \n",
" 2551028 | \n",
" {'aperio_User': 'd9286672-cd53-4139-87ba-d68c4... | \n",
" 4 | \n",
"
\n",
" \n",
" | 4 | \n",
" file:/gpfs/mskmindhdp_emc/user/shared_data_fol... | \n",
" 2021-07-06 14:18:26 | \n",
" 966069709 | \n",
" WSI-f3890775a7f36c982aae28ac58de43b1852652fc20... | \n",
" 2551389 | \n",
" {'aperio_User': 'd9286672-cd53-4139-87ba-d68c4... | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" path modificationTime \\\n",
"0 file:/gpfs/mskmindhdp_emc/user/shared_data_fol... 2021-07-06 11:42:52 \n",
"1 file:/gpfs/mskmindhdp_emc/user/shared_data_fol... 2021-07-06 11:53:51 \n",
"2 file:/gpfs/mskmindhdp_emc/user/shared_data_fol... 2021-07-06 11:46:00 \n",
"3 file:/gpfs/mskmindhdp_emc/user/shared_data_fol... 2021-07-06 11:43:07 \n",
"4 file:/gpfs/mskmindhdp_emc/user/shared_data_fol... 2021-07-06 14:18:26 \n",
"\n",
" length wsi_record_uuid slide_id \\\n",
"0 584611357 WSI-93ccfd50a210d0b8c7589352be9036ef5abf6b4f81... 2551129 \n",
"1 1413574341 WSI-03662b6be585f8bdb1a16a175a7cfda07c4057afe5... 2551571 \n",
"2 520642043 WSI-12677b7d98691d1eef8043727f27878eb9fda14b65... 2551531 \n",
"3 1322921471 WSI-1ba07f58166fc2073c854dd9b00a11eaca2203ff20... 2551028 \n",
"4 966069709 WSI-f3890775a7f36c982aae28ac58de43b1852652fc20... 2551389 \n",
"\n",
" metadata patient_id \n",
"0 {'aperio_User': 'd9286672-cd53-4139-87ba-d68c4... 2 \n",
"1 {'aperio_User': 'd9286672-cd53-4139-87ba-d68c4... 1 \n",
"2 {'aperio_User': 'd9286672-cd53-4139-87ba-d68c4... 3 \n",
"3 {'aperio_User': 'd9286672-cd53-4139-87ba-d68c4... 4 \n",
"4 {'aperio_User': 'd9286672-cd53-4139-87ba-d68c4... 5 "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# read WSI delta table\n",
"wsi_table = spark.read.format(\"delta\") \\\n",
" .load(\"file:////gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/PRO_12-123/tables/WSI_toy_data_set\").toPandas()\n",
"\n",
"# view table\n",
"wsi_table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the table is depicted above, congratulations, you have successfully run the Whole Slide Image (WSI) ETL to database the slides!\n",
"\n",
"## Run the regional annotation ETL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The whole slide images that you downloaded are images of ovarian cancer, but not every pixel on each slide is a tumor. In fact, the images show tumor cells, normal ovarian cells, necrosis (dead cells), fibrosis (scarred cells), and more. Pathologists at Memorial Sloan Kettering examined each slide and denoted these different features by hand, providing us with regional annotations. You may think of regional annotations as scientific highlighter marks over the different regions of the image.\n",
"\n",
"What actually happens when the regional annotation ETL is run? First, annotation bitmaps are downloaded from SlideViewer, a repository which stores WSI images and their annotation data. These bitmaps are converted into numpy arrays, which are then converted into GeoJSON files and organized in the proxy table. The GeoJSON files store the annotation regions marked by pathologists as polygons, which makes the data simpler to store and analyze. Once the annotation files are loaded into QuPath- a software used for digital pathology- later in the pipeline, this data format becomes incredibly useful and easy to work with.\n",
"\n",
"To run the regional annotation ETL, try:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" >>>>>>> Processing [2551389] <<<<<<<<\n",
"No label 1 found\n",
"Building contours for label 2\n",
"num_pixels with label 344474790\n",
"num_contours 2\n",
"[-1, 0]\n",
"No label 3 found\n",
"Building contours for label 4\n",
"num_pixels with label 62336170\n",
"num_contours 3\n",
"[-1, 0, 0]\n",
"No label 5 found\n",
"No label 6 found\n",
"Building contours for label 7\n",
"num_pixels with label 2720232\n",
"num_contours 2\n",
"[-1, -1]\n",
"No label 8 found\n",
"No label 9 found\n",
"No label 10 found\n",
"No label 11 found\n",
"No label 12 found\n",
"No label 13 found\n",
"No label 14 found\n",
"No label 15 found\n",
" >>>>>>> Processing [2551571] <<<<<<<<\n",
"Building contours for label 1\n",
"num_pixels with label 3612930\n",
"num_contours 3\n",
"[-1, -1, -1]\n",
"No label 2 found\n",
"Building contours for label 3\n",
"num_pixels with label 20257170\n",
"num_contours 3\n",
"[-1, -1, -1]\n",
"No label 4 found\n",
"Building contours for label 5\n",
"num_pixels with label 38403188\n",
"num_contours 2\n",
"[-1, -1]\n",
"Building contours for label 6\n",
"num_pixels with label 28809658\n",
"num_contours 29\n",
"[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]\n",
"Building contours for label 7\n",
"num_pixels with label 4451030\n",
"num_contours 11\n",
"[-1, -1, -1, -1, -1, 2, 2, -1, -1, -1, -1]\n",
"Building contours for label 8\n",
"num_pixels with label 1248\n",
"num_contours 1\n",
"[-1]\n",
"No label 9 found\n",
"No label 10 found\n",
"No label 11 found\n",
"No label 12 found\n",
"No label 13 found\n",
"No label 14 found\n",
"No label 15 found\n",
" >>>>>>> Processing [2551028] <<<<<<<<\n",
"No label 1 found\n",
"Building contours for label 2\n",
"num_pixels with label 385750\n",
"num_contours 1\n",
"[-1]\n",
"Building contours for label 3\n",
"num_pixels with label 1205304\n",
"num_contours 5\n",
"[-1, -1, -1, -1, -1]\n",
"Building contours for label 4\n",
"num_pixels with label 6615826\n",
"num_contours 11\n",
"[-1, -1, 0, -1, -1, -1, -1, -1, -1, -1, -1]\n",
"No label 5 found\n",
"Building contours for label 6\n",
"num_pixels with label 51908714\n",
"num_contours 8\n",
"[-1, -1, -1, -1, -1, -1, 4, -1]\n",
"Building contours for label 7\n",
"num_pixels with label 3202104\n",
"num_contours 2\n",
"[-1, -1]\n",
"No label 8 found\n",
"No label 9 found\n",
"No label 10 found\n",
"No label 11 found\n",
"No label 12 found\n",
"No label 13 found\n",
"No label 14 found\n",
"No label 15 found\n",
" >>>>>>> Processing [2551129] <<<<<<<<\n",
"Building contours for label 1\n",
"num_pixels with label 394812\n",
"num_contours 3\n",
"[-1, -1, -1]\n",
"No label 2 found\n",
"Building contours for label 3\n",
"num_pixels with label 880568\n",
"num_contours 1\n",
"[-1]\n",
"Building contours for label 4\n",
"num_pixels with label 255604\n",
"num_contours 2\n",
"[-1, -1]\n",
"Building contours for label 5\n",
"num_pixels with label 3428922\n",
"num_contours 3\n",
"[-1, -1, -1]\n",
"No label 6 found\n",
"No label 7 found\n",
"No label 8 found\n",
"No label 9 found\n",
"Building contours for label 10\n",
"num_pixels with label 2587508\n",
"num_contours 8\n",
"[-1, -1, -1, -1, -1, -1, 5, 5]\n",
"No label 11 found\n",
"Building contours for label 12\n",
"num_pixels with label 11236\n",
"num_contours 1\n",
"[-1]\n",
"No label 13 found\n",
"No label 14 found\n",
"No label 15 found\n",
" >>>>>>> Processing [2551531] <<<<<<<<\n",
"Building contours for label 1\n",
"num_pixels with label 16475072\n",
"num_contours 5\n",
"[-1, -1, -1, 2, 2]\n",
"Building contours for label 2\n",
"num_pixels with label 3750146\n",
"num_contours 3\n",
"[-1, 0, 0]\n",
"Building contours for label 3\n",
"num_pixels with label 115784442\n",
"num_contours 11\n",
"[-1, -1, 0, 1, 1, 1, -1, 1, -1, -1, 9]\n",
"No label 4 found\n",
"Building contours for label 5\n",
"num_pixels with label 62971884\n",
"num_contours 27\n",
"[-1, 0, -1, -1, 0, -1, -1, 6, 0, -1, -1, 3, 3, 0, 0, 0, -1, 0, 0, 16, 16, 16, 16, 16, 16, 16, -1]\n",
"No label 6 found\n",
"No label 7 found\n",
"No label 8 found\n",
"No label 9 found\n",
"No label 10 found\n",
"No label 11 found\n",
"No label 12 found\n",
"No label 13 found\n",
"No label 14 found\n",
"No label 15 found\n",
"\n",
"Table output directory = /gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/PRO_12-123/tables/REGIONAL_METADATA_RESULTS\n",
" sv_project_id ... labelset\n",
"0 134 ... DEFAULT_LABELS\n",
"1 134 ... PIXEL_CLASSIFIER_LABELS\n",
"2 134 ... OBJECT_CLASSIFIER_LABELS\n",
"3 134 ... SIMPLIFIED_PIXEL_CLASSIFIER_LABELS\n",
"4 134 ... DEFAULT_LABELS\n",
"5 134 ... PIXEL_CLASSIFIER_LABELS\n",
"6 134 ... OBJECT_CLASSIFIER_LABELS\n",
"7 134 ... SIMPLIFIED_PIXEL_CLASSIFIER_LABELS\n",
"\n",
"[8 rows x 10 columns]\n",
" sv_project_id ... labelset\n",
"0 134 ... DEFAULT_LABELS\n",
"1 134 ... PIXEL_CLASSIFIER_LABELS\n",
"2 134 ... OBJECT_CLASSIFIER_LABELS\n",
"3 134 ... SIMPLIFIED_PIXEL_CLASSIFIER_LABELS\n",
"4 134 ... DEFAULT_LABELS\n",
"5 134 ... PIXEL_CLASSIFIER_LABELS\n",
"6 134 ... OBJECT_CLASSIFIER_LABELS\n",
"7 134 ... SIMPLIFIED_PIXEL_CLASSIFIER_LABELS\n",
"\n",
"[8 rows x 10 columns]\n",
" sv_project_id ... labelset\n",
"0 134 ... DEFAULT_LABELS\n",
"1 134 ... PIXEL_CLASSIFIER_LABELS\n",
"2 134 ... OBJECT_CLASSIFIER_LABELS\n",
"3 134 ... SIMPLIFIED_PIXEL_CLASSIFIER_LABELS\n",
"4 134 ... DEFAULT_LABELS\n",
"5 134 ... PIXEL_CLASSIFIER_LABELS\n",
"6 134 ... OBJECT_CLASSIFIER_LABELS\n",
"7 134 ... SIMPLIFIED_PIXEL_CLASSIFIER_LABELS\n",
"\n",
"[8 rows x 10 columns]\n",
" sv_project_id ... labelset\n",
"0 134 ... DEFAULT_LABELS\n",
"1 134 ... PIXEL_CLASSIFIER_LABELS\n",
"2 134 ... OBJECT_CLASSIFIER_LABELS\n",
"3 134 ... SIMPLIFIED_PIXEL_CLASSIFIER_LABELS\n",
"4 134 ... DEFAULT_LABELS\n",
"5 134 ... PIXEL_CLASSIFIER_LABELS\n",
"6 134 ... OBJECT_CLASSIFIER_LABELS\n",
"7 134 ... SIMPLIFIED_PIXEL_CLASSIFIER_LABELS\n",
"\n",
"[8 rows x 10 columns]\n",
" sv_project_id ... labelset\n",
"0 134 ... DEFAULT_LABELS\n",
"1 134 ... PIXEL_CLASSIFIER_LABELS\n",
"2 134 ... OBJECT_CLASSIFIER_LABELS\n",
"3 134 ... SIMPLIFIED_PIXEL_CLASSIFIER_LABELS\n",
"4 134 ... DEFAULT_LABELS\n",
"5 134 ... PIXEL_CLASSIFIER_LABELS\n",
"6 134 ... OBJECT_CLASSIFIER_LABELS\n",
"7 134 ... SIMPLIFIED_PIXEL_CLASSIFIER_LABELS\n",
"\n",
"[8 rows x 10 columns]\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2021-07-21 15:16:11,351 - ERROR - asyncio - Task was destroyed but it is pending!\n",
"2021-07-21 15:16:11,351 - ERROR - asyncio - task: wait_for=()]> cb=[_HandlerDelegate.execute..() at /gpfs/mskmindhdp_emc/sw/env/lib64/python3.6/site-packages/tornado/web.py:2333]>\n"
]
}
],
"source": [
"%%bash\n",
"\n",
"python3 -m data_processing.pathology.refined_table.regional_annotation.dask_generate \\\n",
" -d configs/regional_annotation_config.yaml \\\n",
" -a configs/app_config.yaml\n"
]
},
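{
"cell_type": "markdown",
"metadata": {},
"source": [
"To give a feel for the bitmap-to-GeoJSON conversion described earlier, the cell below is a minimal sketch and not the ETL's actual implementation. It assumes that OpenCV (`cv2`) and NumPy are installed; the `mask_to_geojson` helper, the property names, and the label names are hypothetical. Only outer contours are traced here, so holes inside a region are ignored."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"import cv2  # assumption: OpenCV is available in your environment\n",
"import numpy as np\n",
"\n",
"\n",
"def mask_to_geojson(label_mask, label_names):\n",
"    \"\"\"Convert an integer label mask into a GeoJSON FeatureCollection.\"\"\"\n",
"    features = []\n",
"    for label, name in label_names.items():\n",
"        binary = (label_mask == label).astype(np.uint8)\n",
"        if not binary.any():\n",
"            continue  # this label does not occur in the mask\n",
"        # Index [-2] selects the contour list in both OpenCV 3.x and 4.x.\n",
"        contours = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]\n",
"        for contour in contours:\n",
"            ring = contour.reshape(-1, 2).tolist()\n",
"            if len(ring) < 3:\n",
"                continue  # too few points to form a polygon\n",
"            ring.append(ring[0])  # GeoJSON linear rings must be closed\n",
"            features.append({\n",
"                \"type\": \"Feature\",\n",
"                \"properties\": {\"label_num\": int(label), \"label_name\": name},\n",
"                \"geometry\": {\"type\": \"Polygon\", \"coordinates\": [ring]},\n",
"            })\n",
"    return {\"type\": \"FeatureCollection\", \"features\": features}\n",
"\n",
"\n",
"# Toy 8x8 mask with one square region labelled 2 (label names are made up).\n",
"mask = np.zeros((8, 8), dtype=np.uint8)\n",
"mask[2:6, 2:6] = 2\n",
"geojson = mask_to_geojson(mask, {1: \"tumor\", 2: \"stroma\"})\n",
"print(json.dumps(geojson, indent=2))"
]
},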
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check that the regional annotation ETL was correctly run, after the Jupyter cell finishes, you may load the regional annotations table! This table contains the metadata saved from running the ETL. It includes paths to the bitmap files, numpy files, and geoJSON files that were mentioned before. To load the table, run the following cell: "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" sv_project_id | \n",
" slideviewer_path | \n",
" slide_id | \n",
" user | \n",
" bmp_filepath | \n",
" npy_filepath | \n",
" geojson_path | \n",
" date | \n",
" labelset | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 134 | \n",
" 2019;HobS19-409411851898;2551028.svs | \n",
" 2551028 | \n",
" ellensol | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810254 | \n",
" DEFAULT_LABELS | \n",
"
\n",
" \n",
" | 1 | \n",
" 134 | \n",
" 2019;HobS19-409411851898;2551028.svs | \n",
" 2551028 | \n",
" ellensol | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810254 | \n",
" PIXEL_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 2 | \n",
" 134 | \n",
" 2019;HobS19-409411851898;2551028.svs | \n",
" 2551028 | \n",
" ellensol | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810254 | \n",
" OBJECT_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 3 | \n",
" 134 | \n",
" 2019;HobS19-409411851898;2551028.svs | \n",
" 2551028 | \n",
" ellensol | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810254 | \n",
" SIMPLIFIED_PIXEL_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 4 | \n",
" 134 | \n",
" 2019;HobS19-159147602774;2551129.svs | \n",
" 2551129 | \n",
" ellensol | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810066 | \n",
" DEFAULT_LABELS | \n",
"
\n",
" \n",
" | 5 | \n",
" 134 | \n",
" 2019;HobS19-159147602774;2551129.svs | \n",
" 2551129 | \n",
" ellensol | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810066 | \n",
" PIXEL_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 6 | \n",
" 134 | \n",
" 2019;HobS19-159147602774;2551129.svs | \n",
" 2551129 | \n",
" ellensol | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810066 | \n",
" OBJECT_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 7 | \n",
" 134 | \n",
" 2019;HobS19-159147602774;2551129.svs | \n",
" 2551129 | \n",
" ellensol | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810066 | \n",
" SIMPLIFIED_PIXEL_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 8 | \n",
" 134 | \n",
" 2019;HobS19-475053909405;2551389.svs | \n",
" 2551389 | \n",
" soslowr | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810126 | \n",
" DEFAULT_LABELS | \n",
"
\n",
" \n",
" | 9 | \n",
" 134 | \n",
" 2019;HobS19-475053909405;2551389.svs | \n",
" 2551389 | \n",
" soslowr | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810126 | \n",
" PIXEL_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 10 | \n",
" 134 | \n",
" 2019;HobS19-475053909405;2551389.svs | \n",
" 2551389 | \n",
" soslowr | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810126 | \n",
" OBJECT_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 11 | \n",
" 134 | \n",
" 2019;HobS19-475053909405;2551389.svs | \n",
" 2551389 | \n",
" soslowr | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810126 | \n",
" SIMPLIFIED_PIXEL_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 12 | \n",
" 134 | \n",
" 2019;HobS19-176164505079;2551531.svs | \n",
" 2551531 | \n",
" soslowr | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810039 | \n",
" DEFAULT_LABELS | \n",
"
\n",
" \n",
" | 13 | \n",
" 134 | \n",
" 2019;HobS19-176164505079;2551531.svs | \n",
" 2551531 | \n",
" soslowr | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810039 | \n",
" PIXEL_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 14 | \n",
" 134 | \n",
" 2019;HobS19-176164505079;2551531.svs | \n",
" 2551531 | \n",
" soslowr | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810039 | \n",
" OBJECT_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 15 | \n",
" 134 | \n",
" 2019;HobS19-176164505079;2551531.svs | \n",
" 2551531 | \n",
" soslowr | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.810039 | \n",
" SIMPLIFIED_PIXEL_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 16 | \n",
" 134 | \n",
" 2019;HobS19-030513574376;2551571.svs | \n",
" 2551571 | \n",
" soslowr | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.809901 | \n",
" DEFAULT_LABELS | \n",
"
\n",
" \n",
" | 17 | \n",
" 134 | \n",
" 2019;HobS19-030513574376;2551571.svs | \n",
" 2551571 | \n",
" soslowr | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.809901 | \n",
" PIXEL_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 18 | \n",
" 134 | \n",
" 2019;HobS19-030513574376;2551571.svs | \n",
" 2551571 | \n",
" soslowr | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.809901 | \n",
" OBJECT_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
" | 19 | \n",
" 134 | \n",
" 2019;HobS19-030513574376;2551571.svs | \n",
" 2551571 | \n",
" soslowr | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" /gpfs/mskmindhdp_emc/user/shared_data_folder/p... | \n",
" 2021-07-06 14:02:04.809901 | \n",
" SIMPLIFIED_PIXEL_CLASSIFIER_LABELS | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" sv_project_id slideviewer_path slide_id user \\\n",
"0 134 2019;HobS19-409411851898;2551028.svs 2551028 ellensol \n",
"1 134 2019;HobS19-409411851898;2551028.svs 2551028 ellensol \n",
"2 134 2019;HobS19-409411851898;2551028.svs 2551028 ellensol \n",
"3 134 2019;HobS19-409411851898;2551028.svs 2551028 ellensol \n",
"4 134 2019;HobS19-159147602774;2551129.svs 2551129 ellensol \n",
"5 134 2019;HobS19-159147602774;2551129.svs 2551129 ellensol \n",
"6 134 2019;HobS19-159147602774;2551129.svs 2551129 ellensol \n",
"7 134 2019;HobS19-159147602774;2551129.svs 2551129 ellensol \n",
"8 134 2019;HobS19-475053909405;2551389.svs 2551389 soslowr \n",
"9 134 2019;HobS19-475053909405;2551389.svs 2551389 soslowr \n",
"10 134 2019;HobS19-475053909405;2551389.svs 2551389 soslowr \n",
"11 134 2019;HobS19-475053909405;2551389.svs 2551389 soslowr \n",
"12 134 2019;HobS19-176164505079;2551531.svs 2551531 soslowr \n",
"13 134 2019;HobS19-176164505079;2551531.svs 2551531 soslowr \n",
"14 134 2019;HobS19-176164505079;2551531.svs 2551531 soslowr \n",
"15 134 2019;HobS19-176164505079;2551531.svs 2551531 soslowr \n",
"16 134 2019;HobS19-030513574376;2551571.svs 2551571 soslowr \n",
"17 134 2019;HobS19-030513574376;2551571.svs 2551571 soslowr \n",
"18 134 2019;HobS19-030513574376;2551571.svs 2551571 soslowr \n",
"19 134 2019;HobS19-030513574376;2551571.svs 2551571 soslowr \n",
"\n",
" bmp_filepath \\\n",
"0 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"1 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"2 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"3 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"4 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"5 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"6 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"7 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"8 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"9 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"10 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"11 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"12 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"13 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"14 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"15 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"16 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"17 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"18 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"19 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"\n",
" npy_filepath \\\n",
"0 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"1 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"2 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"3 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"4 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"5 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"6 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"7 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"8 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"9 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"10 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"11 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"12 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"13 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"14 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"15 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"16 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"17 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"18 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"19 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"\n",
" geojson_path \\\n",
"0 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"1 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"2 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"3 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"4 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"5 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"6 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"7 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"8 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"9 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"10 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"11 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"12 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"13 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"14 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"15 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"16 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"17 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"18 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"19 /gpfs/mskmindhdp_emc/user/shared_data_folder/p... \n",
"\n",
" date labelset \n",
"0 2021-07-06 14:02:04.810254 DEFAULT_LABELS \n",
"1 2021-07-06 14:02:04.810254 PIXEL_CLASSIFIER_LABELS \n",
"2 2021-07-06 14:02:04.810254 OBJECT_CLASSIFIER_LABELS \n",
"3 2021-07-06 14:02:04.810254 SIMPLIFIED_PIXEL_CLASSIFIER_LABELS \n",
"4 2021-07-06 14:02:04.810066 DEFAULT_LABELS \n",
"5 2021-07-06 14:02:04.810066 PIXEL_CLASSIFIER_LABELS \n",
"6 2021-07-06 14:02:04.810066 OBJECT_CLASSIFIER_LABELS \n",
"7 2021-07-06 14:02:04.810066 SIMPLIFIED_PIXEL_CLASSIFIER_LABELS \n",
"8 2021-07-06 14:02:04.810126 DEFAULT_LABELS \n",
"9 2021-07-06 14:02:04.810126 PIXEL_CLASSIFIER_LABELS \n",
"10 2021-07-06 14:02:04.810126 OBJECT_CLASSIFIER_LABELS \n",
"11 2021-07-06 14:02:04.810126 SIMPLIFIED_PIXEL_CLASSIFIER_LABELS \n",
"12 2021-07-06 14:02:04.810039 DEFAULT_LABELS \n",
"13 2021-07-06 14:02:04.810039 PIXEL_CLASSIFIER_LABELS \n",
"14 2021-07-06 14:02:04.810039 OBJECT_CLASSIFIER_LABELS \n",
"15 2021-07-06 14:02:04.810039 SIMPLIFIED_PIXEL_CLASSIFIER_LABELS \n",
"16 2021-07-06 14:02:04.809901 DEFAULT_LABELS \n",
"17 2021-07-06 14:02:04.809901 PIXEL_CLASSIFIER_LABELS \n",
"18 2021-07-06 14:02:04.809901 OBJECT_CLASSIFIER_LABELS \n",
"19 2021-07-06 14:02:04.809901 SIMPLIFIED_PIXEL_CLASSIFIER_LABELS "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from pyarrow.parquet import read_table\n",
"\n",
"# Load the regional annotation table, skipping rows whose user is 'CONCAT'\n",
"regional_annotation_table = read_table(\"PRO_12-123/tables/REGIONAL_METADATA_RESULTS\",\n",
"                                filters=[('user', '!=', 'CONCAT')]).to_pandas()\n",
"regional_annotation_table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point, you have successfully set up your workspace, downloaded the data, and run both the pathology and regional annotation ETLs to prepare your data. You are ready to move on to the tiling notebook!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}