{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset Preparation Tutorial\n", "\n", "Welcome to the dataset preparation tutorial! In this notebook, we will download the toy dataset for the tutorial and prepare the tables used for later analysis. Here are the steps we will review:\n", "\n", "1. Verify prerequisites\n", "2. Create a new project workspace\n", "3. Review sample dataset\n", "4. Build the proxy table\n", "5. Run regional annotation ETL\n", "\n", "**NOTE**: All of the configuration files for this tutorial have been provided in the container. The host and port values in the configuration files are dynamically set based on your system. \n", "\n", "**NOTE**: The current working directory is '~/vmount/notebooks'. All file and directory paths specified in the configuration files are relative to the current working directory. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Verify prerequisites" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the software prerequisites for executing tasks with luna packages. These prerequisites have already been baked into this Docker container. To view the setup, please see the corresponding Dockerfile. 
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2021-11-16T14:11:02.591178Z", "iopub.status.busy": "2021-11-16T14:11:02.590864Z", "iopub.status.idle": "2021-11-16T14:11:05.749394Z", "shell.execute_reply": "2021-11-16T14:11:05.748361Z", "shell.execute_reply.started": "2021-11-16T14:11:02.591150Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python 3.6.9\n", "openjdk version \"11.0.13\" 2021-10-19\n", "OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.18.04)\n", "OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)\n", "env: JAVA_HOME=/usr\n", "PYSPARK_PYTHON: /usr/bin/python3\n", "PYSPARK_DRIVER_PYTHON: /usr/bin/python3\n", "JAVA_HOME: /usr\n", "LUNA_HOME: /home/rosed2/vmount\n", "/home/rosed2/.local/bin/jupyter\n", "pyluna-common 0.1.0\n", "pyluna-core 0.1.0\n", "pyluna-pathology 0.1.0\n" ] }, { "data": { "text/plain": [ "['/home/rosed2/.local/lib/python3.6/site-packages/luna/pathology']" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "!python3 --version\n", "!java -version\n", "%env JAVA_HOME=/usr\n", "!echo PYSPARK_PYTHON: $PYSPARK_PYTHON\n", "!echo PYSPARK_DRIVER_PYTHON: $PYSPARK_DRIVER_PYTHON\n", "!echo JAVA_HOME: $JAVA_HOME\n", "!echo LUNA_HOME: $LUNA_HOME\n", "!which jupyter\n", "!pip list | grep luna-\n", "import luna\n", "luna.__path__\n", "import luna.pathology\n", "luna.pathology.__path__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Create a new project workspace\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create a luna home directory and place the configuration files there. Using a manifest file, we then create a project workspace that will hold the configurations, data, models, and outputs for this tutorial." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# project manifest template\n", "\n", "# MIND project id\n", "PROJECT: PRO_12-123\n", "\n", "# IRB\n", "IRB:\n", "\n", "# project title\n", "TITLE: pathology-tutorial\n", "\n", "# project description\n", "DESCRIPTION: End-to-end pathology analysis tutorial\n", "\n", "DATA_MODALITIES: pathology\n", "\n", "ROOT_PATH: ../\n", "/home/rosed2/vmount/PRO_12-123\n", "├── data\n", "│   └── toy_data_set\n", "│   ├── 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs\n", "│   ├── 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs\n", "│   ├── 01OV007-9b90eb78-2f50-4aeb-b010-d642f9.svs\n", "│   ├── 01OV008-308ad404-7079-4ff8-8232-12ee2e.svs\n", "│   └── 01OV008-7579323e-2fae-43a9-b00f-a15c28.svs\n", "└── manifest.yaml\n", "\n", "2 directories, 6 files\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-12-20 20:14:04,669 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [ (INFO)>, ]\n", "2021-12-20 20:14:04,671 - INFO - luna.common.config - loading config file /home/rosed2/luna/conf/manifest.yaml\n", "2021-12-20 20:14:04,695 - INFO - root - config files copied to ../PRO_12-123\n", "2021-12-20 20:14:04,696 - INFO - root - Code block 'generate project folder' took: 0.02510554599984971s\n" ] } ], "source": [ "%%bash\n", "mkdir -p ~/luna\n", "cp -R ~/vmount/conf ~/luna\n", "cat ~/luna/conf/manifest.yaml\n", "python3 -m luna.project.generate --manifest_file ~/luna/conf/manifest.yaml\n", "tree ~/vmount/PRO_12-123" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should now see a new directory called *PRO_12-123* with the manifest file in it. This will be your project workspace!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Review sample dataset\n", "\n", "The data that you will be using for this tutorial is a set of 5 whole slide images of ovarian cancer H&E slides, available in the svs file format. Whole slide imaging refers to the scanning of conventional glass slides for research purposes; in this case, these are slides that oncologists have used to inspect cancer samples!\n", "\n", "While bringing up the DSA container, we already ran a script to get the data and set up DSA. The `vmount/provision.py` script ran these steps:\n", " \n", " - Set up admin user and default assetstore\n", " \n", " - Download sample data from [public kitware site](https://data.kitware.com/#user/61b9f3dc4acac99f42ca7678/folder/61b9f4564acac99f42ca7692) to `~/vmount/PRO_12-123/data/toy_data_set/`\n", " \n", " - Create a collection and add slides/annotations to your local DSA\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/rosed2/vmount/PRO_12-123/data/toy_data_set\n", "├── 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs\n", "├── 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs\n", "├── 01OV007-9b90eb78-2f50-4aeb-b010-d642f9.svs\n", "├── 01OV008-308ad404-7079-4ff8-8232-12ee2e.svs\n", "└── 01OV008-7579323e-2fae-43a9-b00f-a15c28.svs\n", "\n", "0 directories, 5 files\n" ] } ], "source": [ "%%bash\n", "tree ~/vmount/PRO_12-123/data/toy_data_set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to import your own data, you can do so from your local filesystem as well as an object store. 
For more details, refer to the [girder user documentation](https://girder.readthedocs.io/en/latest/user-guide.html#assetstores).\n", "\n", "To import images from your local filesystem:\n", "\n", "- Log in to DSA with admin/password\n", "- Add images to your local computer at `vmount/assetstore` \n", "- Navigate to **Admin Console** -> **Assetstores**\n", "- From the default assetstore, click on **Import data**\n", "- Specify the path to the images you wish to import, e.g. `/assetstore/yourimage`, and click import\n", "\n", "As the `/assetstore` mount is available to DSA, this import should be much faster than uploading the image through the **Upload files** option in the UI.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Build the proxy table\n", "\n", "Now, we will run the Whole Slide Image (WSI) ETL to build a metadata catalog of the slides in a proxy table. \n", "\n", "For reference, ETL stands for extract-transform-load; it is a process that typically involves extracting data from a source, cleaning it and transforming its data types, and loading it into a different system. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "REQUESTOR: viki mancoridis # The name of the requestor. You are likely the requestor\r\n", "REQUESTOR_DEPARTMENT: computational oncology # The department to which the requestor belongs\r\n", "REQUESTOR_EMAIL: MancoriV@mskcc.org # The email address of the requestor\r\n", "PROJECT: PRO_12-123 # The project name decided by data coordination\r\n", "SOURCE: toy_set # Source name of the input data file\r\n", "MODALITY: radiology # Data modality\r\n", "DATA_TYPE: WSI # Data type within this modality\r\n", "COMMENTS: # Description of template defined by requestor. 
You may leave blank\r\n", "DATE: 2021-07-06 # The date on which the request was made, likely today\r\n", "DATASET_NAME: toy_data_set # Name to be given to the dataset\r\n", "ETL_TYPE: proxy # Type of ETL\r\n", "FILE_TYPE: svs # Input source file\r\n", "FORMAT_TYPE: delta # Format type of the output proxy table\r\n", "NUM_PARTITION: 1 # Number of partitions for the delta table creation\r\n", "HOST: localhost # IP or hostname of machine where source data fil(s) reside\r\n", "ROOT_PATH: ../ # File path to the root of your local folder of data\r\n", "SOURCE_PATH: ../PRO_12-123/data/toy_data_set # Path to your specific data folder\r\n", "LANDING_PATH: . # Path for tables and file transfer\r\n", "RAW_DATA_PATH: . # Path to data transfer on destination machine\r\n", "INCLUDE: --include=.svs # Specifies inclusion of svs files\r\n", "CHUNK_FILE: ../PRO_12-123/data/chunks.txt # Output text will redirect here\r\n", "FILE_COUNT: 5 # Number of files for the table\r\n", "DATA_SIZE: 5000000000 # Upper bound for the number of bytes of data\r\n", "BWLIMIT: 5G # Amount of network bandwidth to utilize for the data transfer\r\n" ] } ], "source": [ "!cat ~/luna/conf/wsi_config.yaml " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ":: loading settings :: url = jar:file:/home/rosed2/.local/lib/python3.6/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n", "root\n", " |-- path: string (nullable = true)\n", " |-- modificationTime: timestamp (nullable = true)\n", " |-- length: long (nullable = true)\n", " |-- wsi_record_uuid: string (nullable = true)\n", " |-- slide_id: string (nullable = true)\n", " |-- metadata: map (nullable = true)\n", " | |-- key: string\n", " | |-- value: string (valueContainsNull = true)\n", "\n", "+--------------------+--------------------+---------+--------------------+--------------------+--------------------+\n", "| path| modificationTime| 
length| wsi_record_uuid| slide_id| metadata|\n", "+--------------------+--------------------+---------+--------------------+--------------------+--------------------+\n", "|file:/home/rosed2...|2021-12-20 20:14:...|262796337|WSI-214d5f6dc60e5...|01OV007-9b90eb78-...|{aperio_User -> b...|\n", "|file:/home/rosed2...|2021-12-20 20:14:...|240691747|WSI-754715472db56...|01OV002-ed65cf94-...|{aperio_User -> b...|\n", "|file:/home/rosed2...|2021-12-20 20:13:...|237047223|WSI-c9bd1f11b93b8...|01OV002-bd8cdc70-...|{aperio_User -> b...|\n", "|file:/home/rosed2...|2021-12-20 20:15:...|215796305|WSI-9dbf3d29fac30...|01OV008-7579323e-...|{aperio_User -> b...|\n", "|file:/home/rosed2...|2021-12-20 20:15:...|207479411|WSI-1ac0d0b8337e3...|01OV008-308ad404-...|{aperio_User -> b...|\n", "+--------------------+--------------------+---------+--------------------+--------------------+--------------------+\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2021-12-20 20:29:08,048 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [ (INFO)>, ]\n", "2021-12-20 20:29:08,050 - INFO - root - data_ingestions_template: /home/rosed2/luna/conf/wsi_config.yaml\n", "2021-12-20 20:29:08,052 - INFO - root - config_file: /home/rosed2/luna/conf/app_config.yaml\n", "2021-12-20 20:29:08,053 - INFO - root - processes: ['delta']\n", "2021-12-20 20:29:08,054 - INFO - luna.common.config - loading config file /home/rosed2/luna/conf/app_config.yaml\n", "2021-12-20 20:29:08,058 - INFO - luna.common.config - loading config file /home/rosed2/luna/conf/wsi_config.yaml\n", "2021-12-20 20:29:08,063 - INFO - luna.common.config - validating config /home/rosed2/luna/conf/wsi_config.yaml against schema /home/rosed2/.local/lib/python3.6/site-packages/luna/pathology/proxy_table/data_ingestion_template_schema.yml for DATA_CFG\n", "2021-12-20 20:29:08,114 - INFO - root - config files copied to ../PRO_12-123/configs/WSI_toy_data_set\n", "Warning: Ignoring non-Spark 
config property: fs.defaultFS\n", "WARNING: An illegal reflective access operation has occurred\n", "WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/rosed2/.local/lib/python3.6/site-packages/pyspark/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)\n", "WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform\n", "WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations\n", "WARNING: All illegal access operations will be denied in a future release\n", "Ivy Default Cache set to: /home/rosed2/.ivy2/cache\n", "The jars for the packages stored in: /home/rosed2/.ivy2/jars\n", "io.delta#delta-core_2.12 added as a dependency\n", ":: resolving dependencies :: org.apache.spark#spark-submit-parent-7ac9b111-a068-4c2c-b2a1-edcffb4c556e;1.0\n", "\tconfs: [default]\n", "\tfound io.delta#delta-core_2.12;0.7.0 in central\n", "\tfound org.antlr#antlr4;4.7 in central\n", "\tfound org.antlr#antlr4-runtime;4.7 in central\n", "\tfound org.antlr#antlr-runtime;3.5.2 in central\n", "\tfound org.antlr#ST4;4.0.8 in central\n", "\tfound org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central\n", "\tfound org.glassfish#javax.json;1.0.4 in central\n", "\tfound com.ibm.icu#icu4j;58.2 in central\n", "downloading https://repo1.maven.org/maven2/io/delta/delta-core_2.12/0.7.0/delta-core_2.12-0.7.0.jar ...\n", "\t[SUCCESSFUL ] io.delta#delta-core_2.12;0.7.0!delta-core_2.12.jar (141ms)\n", "downloading https://repo1.maven.org/maven2/org/antlr/antlr4/4.7/antlr4-4.7.jar ...\n", "\t[SUCCESSFUL ] org.antlr#antlr4;4.7!antlr4.jar (79ms)\n", "downloading https://repo1.maven.org/maven2/org/antlr/antlr4-runtime/4.7/antlr4-runtime-4.7.jar ...\n", "\t[SUCCESSFUL ] org.antlr#antlr4-runtime;4.7!antlr4-runtime.jar (38ms)\n", "downloading https://repo1.maven.org/maven2/org/antlr/antlr-runtime/3.5.2/antlr-runtime-3.5.2.jar ...\n", "\t[SUCCESSFUL ] 
org.antlr#antlr-runtime;3.5.2!antlr-runtime.jar (29ms)\n", "downloading https://repo1.maven.org/maven2/org/antlr/ST4/4.0.8/ST4-4.0.8.jar ...\n", "\t[SUCCESSFUL ] org.antlr#ST4;4.0.8!ST4.jar (32ms)\n", "downloading https://repo1.maven.org/maven2/org/abego/treelayout/org.abego.treelayout.core/1.0.3/org.abego.treelayout.core-1.0.3.jar ...\n", "\t[SUCCESSFUL ] org.abego.treelayout#org.abego.treelayout.core;1.0.3!org.abego.treelayout.core.jar(bundle) (25ms)\n", "downloading https://repo1.maven.org/maven2/org/glassfish/javax.json/1.0.4/javax.json-1.0.4.jar ...\n", "\t[SUCCESSFUL ] org.glassfish#javax.json;1.0.4!javax.json.jar(bundle) (18ms)\n", "downloading https://repo1.maven.org/maven2/com/ibm/icu/icu4j/58.2/icu4j-58.2.jar ...\n", "\t[SUCCESSFUL ] com.ibm.icu#icu4j;58.2!icu4j.jar (509ms)\n", ":: resolution report :: resolve 3367ms :: artifacts dl 878ms\n", "\t:: modules in use:\n", "\tcom.ibm.icu#icu4j;58.2 from central in [default]\n", "\tio.delta#delta-core_2.12;0.7.0 from central in [default]\n", "\torg.abego.treelayout#org.abego.treelayout.core;1.0.3 from central in [default]\n", "\torg.antlr#ST4;4.0.8 from central in [default]\n", "\torg.antlr#antlr-runtime;3.5.2 from central in [default]\n", "\torg.antlr#antlr4;4.7 from central in [default]\n", "\torg.antlr#antlr4-runtime;4.7 from central in [default]\n", "\torg.glassfish#javax.json;1.0.4 from central in [default]\n", "\t---------------------------------------------------------------------\n", "\t| | modules || artifacts |\n", "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n", "\t---------------------------------------------------------------------\n", "\t| default | 8 | 8 | 8 | 0 || 8 | 8 |\n", "\t---------------------------------------------------------------------\n", ":: retrieving :: org.apache.spark#spark-submit-parent-7ac9b111-a068-4c2c-b2a1-edcffb4c556e\n", "\tconfs: [default]\n", "\t8 artifacts copied, 0 already retrieved (15071kB/60ms)\n", "21/12/20 20:29:14 WARN NativeCodeLoader: Unable 
to load native-hadoop library for your platform... using builtin-java classes where applicable\n", "Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", "2021-12-20 20:29:17,725 - INFO - __main__ - generating binary proxy table...\n", "2021-12-20 20:29:17,767 - INFO - __main__ - Writing to ../PRO_12-123/tables/WSI_toy_data_set\n", "2021-12-20 20:29:19,552 - INFO - __main__ - Code block 'load wsi metadata' took: 1.783534656000029s\n", "\r", "[Stage 0:> (0 + 5) / 5]\r", "Current mem limits: -1 of max -1\n", "\n", "Setting mem limits to 1073741824 of max 1073741824\n", "\n", "Current mem limits: -1 of max -1\n", "\n", "Setting mem limits to 1073741824 of max 1073741824\n", "\n", "Current mem limits: -1 of max -1\n", "\n", "Setting mem limits to 1073741824 of max 1073741824\n", "\n", "Current mem limits: -1 of max -1\n", "\n", "Setting mem limits to 1073741824 of max 1073741824\n", "\n", "Current mem limits: -1 of max -1\n", "\n", "Setting mem limits to 1073741824 of max 1073741824\n", "\n", "\r", " \r", "\r", "[Stage 2:========> (7 + 6) / 50]\r", "\r", "[Stage 2:==============> (13 + 7) / 50]\r", "\r", "[Stage 2:===========================> (24 + 6) / 50]\r", "\r", "[Stage 2:=======================================> (35 + 7) / 50]\r", "\r", " \r", "2021-12-20 20:29:27,968 - INFO - __main__ - Processed 5 whole slide images out of total 5 files\n", "Current mem limits: 1073741824 of max 1073741824\n", "\n", "\r", "[Stage 6:> (0 + 1) / 1]\r", "\r", " \r", "Current mem limits: 1073741824 of max 1073741824\n", "\n", "Current mem limits: 1073741824 of max 1073741824\n", "\n", "Current mem limits: 1073741824 of max 1073741824\n", "\n", "Current mem limits: 1073741824 of max 1073741824\n", "\n", "\r", "[Stage 7:> (0 + 4) / 4]\r", "\r", "[Stage 7:==============> (1 + 3) / 4]\r", "\r", " \r", "2021-12-20 
20:29:30,010 - INFO - root - Code block 'generate proxy table' took: 21.959749678999742s\n" ] } ], "source": [ "%%bash\n", "python3 -m luna.pathology.proxy_table.generate \\\n", " -d ~/luna/conf/wsi_config.yaml \\\n", " -a ~/luna/conf/app_config.yaml \\\n", " -p delta\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This step may take a while. At the end, your proxy table should be generated!\n", "\n", "Before we view the table, we must first update it to associate patient IDs with the slides. This is necessary for correctly training and validating the machine learning model in the coming notebooks. Once the slides are divided into \"tiles\" in the next notebook, the tiles are split between the training and validation sets for the ML model. If the tiles do not have patient IDs associated with them, tiles from one individual could appear in both the training and validation sets; this would inflate the model's apparent accuracy, since we would essentially be validating the model on data too similar to what it has already seen. \n", "\n", "Note that we will not be using patient IDs associated with MSK. Instead, we will be using spoof IDs that will suffice for this tutorial. When running this workflow with real data, make sure to include the IDs safely and securely. Run the following block of code to add a 'patient_id' column to the table and store it using Spark." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "from pyspark.sql import SparkSession\n", "\n", "# setup spark session\n", "spark = SparkSession.builder \\\n", " .appName(\"test\") \\\n", " .master('local[*]') \\\n", " .config(\"spark.driver.host\", \"127.0.0.1\") \\\n", " .config(\"spark.jars.packages\", \"io.delta:delta-core_2.12:0.7.0\") \\\n", " .config(\"spark.delta.logStore.class\", \"org.apache.spark.sql.delta.storage.HDFSLogStore\") \\\n", " .config(\"spark.sql.extensions\", \"io.delta.sql.DeltaSparkSessionExtension\") \\\n", " .config(\"spark.sql.catalog.spark_catalog\", \"org.apache.spark.sql.delta.catalog.DeltaCatalog\") \\\n", " .config(\"spark.databricks.delta.retentionDurationCheck.enabled\", \"false\") \\\n", " .config(\"spark.hadoop.dfs.client.use.datanode.hostname\", \"true\") \\\n", " .config(\"spark.driver.memory\", \"6g\") \\\n", " .config(\"spark.executor.memory\", \"6g\") \\\n", " .getOrCreate()\n", "\n", "print(spark)\n", "\n", "# read WSI delta table into a pandas DataFrame\n", "wsi_table = spark.read.format(\"delta\").load(\"../PRO_12-123/tables/WSI_toy_data_set\").toPandas()\n", "\n", "# insert spoof patient ids (one per slide; the toy table has 5 rows)\n", "patient_id=[1,2,3,4,5]\n", "wsi_table['patient_id']=patient_id\n", "\n", "wsi_table\n", "\n", "# convert back to a spark table (update table)\n", "x = spark.createDataFrame(wsi_table)\n", "x.write.format(\"delta\").mode(\"overwrite\").option(\"mergeSchema\", \"true\").save(\"../PRO_12-123/tables/WSI_toy_data_set\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vacuum the delta table to remove the files left over from earlier table versions. This reduces the table to a single version, so all of its data can be read as a parquet table." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DataFrame[]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from delta.tables import DeltaTable\n", "wsi_table = DeltaTable.forPath(spark, \"../PRO_12-123/tables/WSI_toy_data_set\") \n", "# a retention of 0 hours removes every file not referenced by the current table version\n", "wsi_table.vacuum(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we may view the WSI table! This table should have the metadata associated with the whole slide images that you just cataloged, including the patient IDs. " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n", " <th>path</th>\n", " <th>modificationTime</th>\n", " <th>length</th>\n", " <th>wsi_record_uuid</th>\n", " <th>slide_id</th>\n", " <th>metadata</th>\n", " <th>patient_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n", " <th>0</th>\n", " <td>file:/home/rosed2/vmount/PRO_12-123/data/toy_d...</td>\n", " <td>2021-12-20 20:14:21.163</td>\n", " <td>240691747</td>\n", " <td>WSI-754715472db565938da32db34ccb625a10a2c9905d...</td>\n", " <td>01OV002-ed65cf94-8bc6-492b-9149-adc16f</td>\n", " <td>{'aperio_StripeWidth': '1000', 'aperio_User': ...</td>\n", " <td>2</td>\n", " </tr>\n",
" <tr>\n", " <th>1</th>\n", " <td>file:/home/rosed2/vmount/PRO_12-123/data/toy_d...</td>\n", " <td>2021-12-20 20:14:56.701</td>\n", " <td>262796337</td>\n", " <td>WSI-214d5f6dc60e5e9b4a697067a7e897b49c7fdc72eb...</td>\n", " <td>01OV007-9b90eb78-2f50-4aeb-b010-d642f9</td>\n", " <td>{'aperio_StripeWidth': '1000', 'aperio_User': ...</td>\n", " <td>1</td>\n", " </tr>\n",
" <tr>\n", " <th>2</th>\n", " <td>file:/home/rosed2/vmount/PRO_12-123/data/toy_d...</td>\n", " <td>2021-12-20 20:15:59.008</td>\n", " <td>215796305</td>\n", " <td>WSI-9dbf3d29fac304fe27dac95b20bfff7ed338293cae...</td>\n", " <td>01OV008-7579323e-2fae-43a9-b00f-a15c28</td>\n", " <td>{'aperio_StripeWidth': '1000', 'aperio_User': ...</td>\n", " <td>3</td>\n", " </tr>\n",
" <tr>\n", " <th>3</th>\n", " <td>file:/home/rosed2/vmount/PRO_12-123/data/toy_d...</td>\n", " <td>2021-12-20 20:13:47.194</td>\n", " <td>237047223</td>\n", " <td>WSI-c9bd1f11b93b81f9e0ed532082e2d64b7e9daac496...</td>\n", " <td>01OV002-bd8cdc70-3d46-40ae-99c4-90ef77</td>\n", " <td>{'aperio_StripeWidth': '1000', 'aperio_User': ...</td>\n", " <td>4</td>\n", " </tr>\n",
" <tr>\n", " <th>4</th>\n", " <td>file:/home/rosed2/vmount/PRO_12-123/data/toy_d...</td>\n", " <td>2021-12-20 20:15:28.100</td>\n", " <td>207479411</td>\n", " <td>WSI-1ac0d0b8337e30284de658fab536204fd7d9c9ef45...</td>\n", " <td>01OV008-308ad404-7079-4ff8-8232-12ee2e</td>\n", " <td>{'aperio_Left': '29.130108', 'aperio_StripeWid...</td>\n", " <td>5</td>\n", " </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " path modificationTime \\\n", "0 file:/home/rosed2/vmount/PRO_12-123/data/toy_d... 2021-12-20 20:14:21.163 \n", "1 file:/home/rosed2/vmount/PRO_12-123/data/toy_d... 2021-12-20 20:14:56.701 \n", "2 file:/home/rosed2/vmount/PRO_12-123/data/toy_d... 2021-12-20 20:15:59.008 \n", "3 file:/home/rosed2/vmount/PRO_12-123/data/toy_d... 2021-12-20 20:13:47.194 \n", "4 file:/home/rosed2/vmount/PRO_12-123/data/toy_d... 2021-12-20 20:15:28.100 \n", "\n", " length wsi_record_uuid \\\n", "0 240691747 WSI-754715472db565938da32db34ccb625a10a2c9905d... \n", "1 262796337 WSI-214d5f6dc60e5e9b4a697067a7e897b49c7fdc72eb... \n", "2 215796305 WSI-9dbf3d29fac304fe27dac95b20bfff7ed338293cae... \n", "3 237047223 WSI-c9bd1f11b93b81f9e0ed532082e2d64b7e9daac496... \n", "4 207479411 WSI-1ac0d0b8337e30284de658fab536204fd7d9c9ef45... \n", "\n", " slide_id \\\n", "0 01OV002-ed65cf94-8bc6-492b-9149-adc16f \n", "1 01OV007-9b90eb78-2f50-4aeb-b010-d642f9 \n", "2 01OV008-7579323e-2fae-43a9-b00f-a15c28 \n", "3 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77 \n", "4 01OV008-308ad404-7079-4ff8-8232-12ee2e \n", "\n", " metadata patient_id \n", "0 {'aperio_StripeWidth': '1000', 'aperio_User': ... 2 \n", "1 {'aperio_StripeWidth': '1000', 'aperio_User': ... 1 \n", "2 {'aperio_StripeWidth': '1000', 'aperio_User': ... 3 \n", "3 {'aperio_StripeWidth': '1000', 'aperio_User': ... 4 \n", "4 {'aperio_Left': '29.130108', 'aperio_StripeWid... 
5 " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read WSI delta table\n", "wsi_table = spark.read.format(\"delta\") \\\n", " .load(\"../PRO_12-123/tables/WSI_toy_data_set\").toPandas()\n", "\n", "# view table\n", "wsi_table\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the table is displayed above, congratulations, you have successfully run the Whole Slide Image (WSI) ETL to catalog the slides!\n", "\n", "## 5. Run regional annotation ETL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The whole slide images that you downloaded are images of ovarian cancer, but not every pixel on each slide is a tumor. In fact, the images show tumor cells, normal ovarian cells and more. A non-expert annotated these slides for demo purposes only.\n", "\n", "The regional annotation ETL performs the following steps:\n", "\n", "- Downloads DSA json annotations\n", "- Converts DSA jsons to GeoJSON format, which is compatible with downstream applications\n", "- Saves configs in your `~/vmount/PRO_12-123/configs/REGIONAL_METADATA_RESULTS`\n", "- Saves parquet table in your `~/vmount/PRO_12-123/tables/REGIONAL_METADATA_RESULTS`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To run the regional annotation ETL, we use the `dsa_annotation` CLI. For more details on `dsa_annotation` and the annotations we support, please check out the `7_dsa-annotation.ipynb` notebook.\n", "\n", "**Note**: The details of your DSA instance are specified as `DSA_URI` in `../conf/dsa_regional_annotation.yaml` and should be updated to reflect your DSA setup. 
If you are using Docker, replace `localhost` with the IP address you get from running:\n", "\n", "```docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' luna_tutorial_girder_1```\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2021-12-20 20:41:21,000 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [ (INFO)>, ]\n", "2021-12-20 20:41:21,002 - INFO - luna.common.config - loading config file ../conf/dsa_regional_annotation.yaml\n", "2021-12-20 20:41:21,007 - INFO - luna.common.config - loading config file ../conf/dsa_app_config.yaml\n", "2021-12-20 20:41:21,009 - INFO - root - data template: ../conf/dsa_regional_annotation.yaml\n", "2021-12-20 20:41:21,010 - INFO - root - config_file: ../conf/dsa_app_config.yaml\n", "2021-12-20 20:41:21,048 - INFO - root - config files copied to ../PRO_12-123/configs/REGIONAL_METADATA_RESULTS\n", "2021-12-20 20:41:21,153 - INFO - luna.pathology.cli.dsa.dsa_annotations - Table output directory: ../PRO_12-123/tables/REGIONAL_METADATA_RESULTS\n", "Successfully connected to DSA\n", "collection_id_dict {'_accessLevel': 2, '_id': '61c0e3dd4903324c7fe38073', '_modelType': 'collection', '_textScore': 15.0, 'created': '2021-12-20T20:13:17.286000+00:00', 'description': '', 'meta': {}, 'name': 'TCGA collection', 'public': True, 'size': 1163811023, 'updated': '2021-12-20T20:13:17.286000+00:00'}\n", "Collection TCGA collection found with id: 61c0e3dd4903324c7fe38073\n", "retreived collection uuid\n", "No stylesheet in collection: 61c0e3dd4903324c7fe38073\n", "2021-12-20 20:41:21,390 - INFO - luna.pathology.cli.dsa.dsa_annotations - Retrieved collection metadata\n", "2021-12-20 20:41:22,746 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [ (INFO)>, ]\n", "2021-12-20 20:41:22,747 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with 
handlers: [ (INFO)>, ]\n", "2021-12-20 20:41:22,747 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [ (INFO)>, ]\n", "2021-12-20 20:41:22,750 - INFO - luna.pathology.cli.dsa.dsa_annotations - \n", "collection_id_dict {'_accessLevel': 2, '_id': '61c0e3dd4903324c7fe38073', '_modelType': 'collection', '_textScore': 15.0, 'created': '2021-12-20T20:13:17.286000+00:00', 'description': '', 'meta': {}, 'name': 'TCGA collection', 'public': True, 'size': 1163811023, 'updated': '2021-12-20T20:13:17.286000+00:00'}\n", "Collection TCGA collection found with id: 61c0e3dd4903324c7fe38073\n", "collection_id_dict {'_accessLevel': 2, '_id': '61c0e3dd4903324c7fe38073', '_modelType': 'collection', '_textScore': 15.0, 'created': '2021-12-20T20:13:17.286000+00:00', 'description': '', 'meta': {}, 'name': 'TCGA collection', 'public': True, 'size': 1163811023, 'updated': '2021-12-20T20:13:17.286000+00:00'}\n", "Collection TCGA collection found with id: 61c0e3dd4903324c7fe38073\n", "collection_id_dict {'_accessLevel': 2, '_id': '61c0e3dd4903324c7fe38073', '_modelType': 'collection', '_textScore': 15.0, 'created': '2021-12-20T20:13:17.286000+00:00', 'description': '', 'meta': {}, 'name': 'TCGA collection', 'public': True, 'size': 1163811023, 'updated': '2021-12-20T20:13:17.286000+00:00'}\n", "Collection TCGA collection found with id: 61c0e3dd4903324c7fe38073\n", "Image file 01OV008-7579323e-2fae-43a9-b00f-a15c28.svs found with id: 61c0e4644903324c7fe3809f\n", "Starting request for annotation\n", "Image file 01OV008-308ad404-7079-4ff8-8232-12ee2e.svs found with id: 61c0e4464903324c7fe38094\n", "Starting request for annotation\n", "Image file 01OV007-9b90eb78-2f50-4aeb-b010-d642f9.svs found with id: 61c0e4224903324c7fe38089\n", "Starting request for annotation\n", "2021-12-20 20:41:22,885 - INFO - luna.common.config - loading config file /home/rosed2/vmount/conf/datastore.cfg\n", "collection_id_dict {'_accessLevel': 2, '_id': '61c0e3dd4903324c7fe38073', 
'_modelType': 'collection', '_textScore': 15.0, 'created': '2021-12-20T20:13:17.286000+00:00', 'description': '', 'meta': {}, 'name': 'TCGA collection', 'public': True, 'size': 1163811023, 'updated': '2021-12-20T20:13:17.286000+00:00'}\n", "Collection TCGA collection found with id: 61c0e3dd4903324c7fe38073\n", "2021-12-20 20:41:22,901 - INFO - luna.common.DataStore - Configured datastore with {'GRAPH_STORE_ENABLED': False, 'GRAPH_URI': 'neo4j://localhost:7687', 'GRAPH_USER': 'neo4j', 'GRAPH_PASSWORD': 'password', 'OBJECT_STORE_ENABLED': False, 'MINIO_URI': 'localhost:8001', 'MINIO_USER': 'minio', 'MINIO_PASSWORD': 'password', 'DOC_STORE_ENABLED': False, 'MONGODB_URI': 'mongodb://localhost:27017/'}\n", "collection_id_dict {'_accessLevel': 2, '_id': '61c0e3dd4903324c7fe38073', '_modelType': 'collection', '_textScore': 15.0, 'created': '2021-12-20T20:13:17.286000+00:00', 'description': '', 'meta': {}, 'name': 'TCGA collection', 'public': True, 'size': 1163811023, 'updated': '2021-12-20T20:13:17.286000+00:00'}\n", "Collection TCGA collection found with id: 61c0e3dd4903324c7fe38073\n", "Image file 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs found with id: 61c0e3dd4903324c7fe38075\n", "Starting request for annotation\n", "2021-12-20 20:41:22,911 - INFO - luna.common.DataStore - Datstore file backend= ../PRO_12-123/slides\n", "Image file 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs found with id: 61c0e4014903324c7fe38080\n", "Starting request for annotation\n", "2021-12-20 20:41:22,940 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV007-9b90eb78-2f50-4aeb-b010-d642f9/admin/REGIONAL_METADATA_RESULTS_DSA_JSON/DEFAULT_LABELS\n", "2021-12-20 20:41:22,981 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV007-9b90eb78-2f50-4aeb-b010-d642f9/CONCAT/REGIONAL_METADATA_RESULTS/DEFAULT_LABELS\n", "2021-12-20 20:41:23,010 - INFO - luna.common.config - loading config file /home/rosed2/vmount/conf/datastore.cfg\n", "2021-12-20 20:41:23,012 - INFO - 
luna.common.config - loading config file /home/rosed2/vmount/conf/datastore.cfg\n", "2021-12-20 20:41:23,016 - INFO - luna.common.DataStore - Configured datastore with {'GRAPH_STORE_ENABLED': False, 'GRAPH_URI': 'neo4j://localhost:7687', 'GRAPH_USER': 'neo4j', 'GRAPH_PASSWORD': 'password', 'OBJECT_STORE_ENABLED': False, 'MINIO_URI': 'localhost:8001', 'MINIO_USER': 'minio', 'MINIO_PASSWORD': 'password', 'DOC_STORE_ENABLED': False, 'MONGODB_URI': 'mongodb://localhost:27017/'}\n", "2021-12-20 20:41:23,022 - INFO - luna.common.DataStore - Configured datastore with {'GRAPH_STORE_ENABLED': False, 'GRAPH_URI': 'neo4j://localhost:7687', 'GRAPH_USER': 'neo4j', 'GRAPH_PASSWORD': 'password', 'OBJECT_STORE_ENABLED': False, 'MINIO_URI': 'localhost:8001', 'MINIO_USER': 'minio', 'MINIO_PASSWORD': 'password', 'DOC_STORE_ENABLED': False, 'MONGODB_URI': 'mongodb://localhost:27017/'}\n", "2021-12-20 20:41:23,023 - INFO - luna.common.DataStore - Datstore file backend= ../PRO_12-123/slides\n", "2021-12-20 20:41:23,024 - INFO - luna.common.DataStore - Datstore file backend= ../PRO_12-123/slides\n", "2021-12-20 20:41:23,033 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV008-308ad404-7079-4ff8-8232-12ee2e/admin/REGIONAL_METADATA_RESULTS_DSA_JSON/DEFAULT_LABELS\n", "2021-12-20 20:41:23,037 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV008-7579323e-2fae-43a9-b00f-a15c28/admin/REGIONAL_METADATA_RESULTS_DSA_JSON/DEFAULT_LABELS\n", "2021-12-20 20:41:23,037 - INFO - luna.common.DataStore - Configured datastore with {'GRAPH_STORE_ENABLED': False, 'GRAPH_URI': 'neo4j://localhost:7687', 'GRAPH_USER': 'neo4j', 'GRAPH_PASSWORD': 'password', 'OBJECT_STORE_ENABLED': False, 'MINIO_URI': 'localhost:8001', 'MINIO_USER': 'minio', 'MINIO_PASSWORD': 'password', 'DOC_STORE_ENABLED': False, 'MONGODB_URI': 'mongodb://localhost:27017/'}\n", "2021-12-20 20:41:23,044 - INFO - luna.common.DataStore - Datstore file backend= ../PRO_12-123/slides\n", "2021-12-20 20:41:23,054 - 
INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV008-308ad404-7079-4ff8-8232-12ee2e/CONCAT/REGIONAL_METADATA_RESULTS/DEFAULT_LABELS\n", "2021-12-20 20:41:23,054 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV002-bd8cdc70-3d46-40ae-99c4-90ef77/admin/REGIONAL_METADATA_RESULTS_DSA_JSON/DEFAULT_LABELS\n", "2021-12-20 20:41:23,059 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV008-7579323e-2fae-43a9-b00f-a15c28/CONCAT/REGIONAL_METADATA_RESULTS/DEFAULT_LABELS\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "2021-12-20 20:41:23,068 - INFO - luna.common.DataStore - Configured datastore with {'GRAPH_STORE_ENABLED': False, 'GRAPH_URI': 'neo4j://localhost:7687', 'GRAPH_USER': 'neo4j', 'GRAPH_PASSWORD': 'password', 'OBJECT_STORE_ENABLED': False, 'MINIO_URI': 'localhost:8001', 'MINIO_USER': 'minio', 'MINIO_PASSWORD': 'password', 'DOC_STORE_ENABLED': False, 'MONGODB_URI': 'mongodb://localhost:27017/'}\n", "2021-12-20 20:41:23,073 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV002-bd8cdc70-3d46-40ae-99c4-90ef77/CONCAT/REGIONAL_METADATA_RESULTS/DEFAULT_LABELS\n", "2021-12-20 20:41:23,075 - INFO - luna.common.DataStore - Datstore file backend= ../PRO_12-123/slides\n", "2021-12-20 20:41:23,083 - INFO - luna.pathology.cli.dsa.dsa_annotations - Annotation for slide 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs generated successfully\n", "2021-12-20 20:41:23,097 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV002-ed65cf94-8bc6-492b-9149-adc16f/admin/REGIONAL_METADATA_RESULTS_DSA_JSON/DEFAULT_LABELS\n", "2021-12-20 20:41:23,101 - INFO - luna.pathology.cli.dsa.dsa_annotations - Annotation for slide 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs generated successfully\n", "2021-12-20 20:41:23,108 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV002-ed65cf94-8bc6-492b-9149-adc16f/CONCAT/REGIONAL_METADATA_RESULTS/DEFAULT_LABELS\n", "2021-12-20 20:41:23,109 - INFO - 
luna.pathology.cli.dsa.dsa_annotations - Annotation for slide 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs generated successfully\n", "2021-12-20 20:41:23,121 - INFO - luna.pathology.cli.dsa.dsa_annotations - Annotation for slide 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs generated successfully\n", "2021-12-20 20:41:23,130 - INFO - luna.pathology.cli.dsa.dsa_annotations - Annotation for slide 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs generated successfully\n", "2021-12-20 20:41:24,580 - INFO - root - Code block 'generate DSA annotation geojson table' took: 3.5710523679999824s\n" ] } ], "source": [ "!dsa_annotation \\\n", "-d ../conf/dsa_regional_annotation.yaml \\\n", "-a ../conf/dsa_app_config.yaml \\\n", "-u admin -p password" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check that the regional annotation ETL ran correctly, load the regional annotations table once the cell above has finished. This table contains the metadata saved by the ETL, one row per slide, including the paths to the DSA JSON and GeoJSON annotation files (the `dsa_json_path` and `geojson_path` columns). To load the table, run the following code cell:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n",
" <th></th>\n", " <th>project_name</th>\n", " <th>slide_id</th>\n", " <th>user</th>\n", " <th>dsa_json_path</th>\n", " <th>geojson_path</th>\n", " <th>date_updated</th>\n", " <th>date_created</th>\n", " <th>labelset</th>\n", " <th>annotation_name</th>\n", " <th>annotation_type</th>\n",
" </tr>\n", " </thead>\n", " <tbody>\n",
" <tr>\n", " <th>0</th>\n", " <td>TCGA collection</td>\n", " <td>01OV007-9b90eb78-2f50-4aeb-b010-d642f9</td>\n", " <td>CONCAT</td>\n", " <td>../PRO_12-123/slides/01OV007-9b90eb78-2f50-4ae...</td>\n", " <td>../PRO_12-123/slides/01OV007-9b90eb78-2f50-4ae...</td>\n", " <td>2021-12-20T20:15:02.631000+00:00</td>\n", " <td>2021-12-20T20:15:02.631000+00:00</td>\n", " <td>DEFAULT_LABELS</td>\n", " <td>ov_regional</td>\n", " <td>RegionalAnnotationJSON</td>\n", " </tr>\n",
" <tr>\n", " <th>1</th>\n", " <td>TCGA collection</td>\n", " <td>01OV008-308ad404-7079-4ff8-8232-12ee2e</td>\n", " <td>CONCAT</td>\n", " <td>../PRO_12-123/slides/01OV008-308ad404-7079-4ff...</td>\n", " <td>../PRO_12-123/slides/01OV008-308ad404-7079-4ff...</td>\n", " <td>2021-12-20T20:15:32.585000+00:00</td>\n", " <td>2021-12-20T20:15:32.585000+00:00</td>\n", " <td>DEFAULT_LABELS</td>\n", " <td>ov_regional</td>\n", " <td>RegionalAnnotationJSON</td>\n", " </tr>\n",
" <tr>\n", " <th>2</th>\n", " <td>TCGA collection</td>\n", " <td>01OV008-7579323e-2fae-43a9-b00f-a15c28</td>\n", " <td>CONCAT</td>\n", " <td>../PRO_12-123/slides/01OV008-7579323e-2fae-43a...</td>\n", " <td>../PRO_12-123/slides/01OV008-7579323e-2fae-43a...</td>\n", " <td>2021-12-20T20:16:04.062000+00:00</td>\n", " <td>2021-12-20T20:16:04.062000+00:00</td>\n", " <td>DEFAULT_LABELS</td>\n", " <td>ov_regional</td>\n", " <td>RegionalAnnotationJSON</td>\n", " </tr>\n",
" <tr>\n", " <th>3</th>\n", " <td>TCGA collection</td>\n", " <td>01OV002-bd8cdc70-3d46-40ae-99c4-90ef77</td>\n", " <td>CONCAT</td>\n", " <td>../PRO_12-123/slides/01OV002-bd8cdc70-3d46-40a...</td>\n", " <td>../PRO_12-123/slides/01OV002-bd8cdc70-3d46-40a...</td>\n", " <td>2021-12-20T20:13:53.187000+00:00</td>\n", " <td>2021-12-20T20:13:53.187000+00:00</td>\n", " <td>DEFAULT_LABELS</td>\n", " <td>ov_regional</td>\n", " <td>RegionalAnnotationJSON</td>\n", " </tr>\n",
" <tr>\n", " <th>4</th>\n", " <td>TCGA collection</td>\n", " <td>01OV002-ed65cf94-8bc6-492b-9149-adc16f</td>\n", " <td>CONCAT</td>\n", " <td>../PRO_12-123/slides/01OV002-ed65cf94-8bc6-492...</td>\n", " <td>../PRO_12-123/slides/01OV002-ed65cf94-8bc6-492...</td>\n", " <td>2021-12-20T20:14:26.703000+00:00</td>\n", " <td>2021-12-20T20:14:26.703000+00:00</td>\n", " <td>DEFAULT_LABELS</td>\n", " <td>ov_regional</td>\n", " <td>RegionalAnnotationJSON</td>\n", " </tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " project_name slide_id user \\\n", "0 TCGA collection 01OV007-9b90eb78-2f50-4aeb-b010-d642f9 CONCAT \n", "1 TCGA collection 01OV008-308ad404-7079-4ff8-8232-12ee2e CONCAT \n", "2 TCGA collection 01OV008-7579323e-2fae-43a9-b00f-a15c28 CONCAT \n", "3 TCGA collection 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77 CONCAT \n", "4 TCGA collection 01OV002-ed65cf94-8bc6-492b-9149-adc16f CONCAT \n", "\n", " dsa_json_path \\\n", "0 ../PRO_12-123/slides/01OV007-9b90eb78-2f50-4ae... \n", "1 ../PRO_12-123/slides/01OV008-308ad404-7079-4ff... \n", "2 ../PRO_12-123/slides/01OV008-7579323e-2fae-43a... \n", "3 ../PRO_12-123/slides/01OV002-bd8cdc70-3d46-40a... \n", "4 ../PRO_12-123/slides/01OV002-ed65cf94-8bc6-492... \n", "\n", " geojson_path \\\n", "0 ../PRO_12-123/slides/01OV007-9b90eb78-2f50-4ae... \n", "1 ../PRO_12-123/slides/01OV008-308ad404-7079-4ff... \n", "2 ../PRO_12-123/slides/01OV008-7579323e-2fae-43a... \n", "3 ../PRO_12-123/slides/01OV002-bd8cdc70-3d46-40a... \n", "4 ../PRO_12-123/slides/01OV002-ed65cf94-8bc6-492... 
\n", "\n", " date_updated date_created \\\n", "0 2021-12-20T20:15:02.631000+00:00 2021-12-20T20:15:02.631000+00:00 \n", "1 2021-12-20T20:15:32.585000+00:00 2021-12-20T20:15:32.585000+00:00 \n", "2 2021-12-20T20:16:04.062000+00:00 2021-12-20T20:16:04.062000+00:00 \n", "3 2021-12-20T20:13:53.187000+00:00 2021-12-20T20:13:53.187000+00:00 \n", "4 2021-12-20T20:14:26.703000+00:00 2021-12-20T20:14:26.703000+00:00 \n", "\n", " labelset annotation_name annotation_type \n", "0 DEFAULT_LABELS ov_regional RegionalAnnotationJSON \n", "1 DEFAULT_LABELS ov_regional RegionalAnnotationJSON \n", "2 DEFAULT_LABELS ov_regional RegionalAnnotationJSON \n", "3 DEFAULT_LABELS ov_regional RegionalAnnotationJSON \n", "4 DEFAULT_LABELS ov_regional RegionalAnnotationJSON " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pyarrow.parquet import read_table\n", "\n", "regional_annotation_table = read_table(\"../PRO_12-123/tables/REGIONAL_METADATA_RESULTS\").to_pandas()\n", "regional_annotation_table\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, you have set up your workspace, downloaded the data, and run both the pathology and regional annotation ETLs to prepare your data. You are ready to move on to the tiling notebook!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }