Dataset Preparation Tutorial
Welcome to the dataset preparation tutorial! In this notebook, we will download the toy data set for the tutorial and prepare the necessary tables used for later analysis. Here are the steps we will review:
Verify prerequisites
Create a new project workspace
Review sample dataset
Build the proxy table
Run regional annotation ETL
NOTE: All of the configuration files for this tutorial have been provided in the container, but you will have to download the input data and add it to the container’s volume mount as shown in the steps below.
NOTE: The current working directory is ‘~/vmount/notebooks’. All file and directory paths specified in the configuration files are relative to the current working directory.
1. Verify prerequisites
Here are the software prerequisites for executing tasks with luna packages. These prerequisites have already been baked into this Docker container. To view the setup, please see the corresponding Dockerfile.
[1]:
!python3 --version
!java -version
%env JAVA_HOME=/usr
!echo PYSPARK_PYTHON: $PYSPARK_PYTHON
!echo PYSPARK_DRIVER_PYTHON: $PYSPARK_DRIVER_PYTHON
!echo JAVA_HOME: $JAVA_HOME
!echo LUNA_HOME: $LUNA_HOME
!which jupyter
!pip list | grep luna-
import luna
luna.__path__
import luna.pathology
luna.pathology.__path__
Python 3.6.9
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.18.04)
OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)
env: JAVA_HOME=/usr
PYSPARK_PYTHON: /usr/bin/python3
PYSPARK_DRIVER_PYTHON: /usr/bin/python3
JAVA_HOME: /usr
LUNA_HOME: /home/rosed2/vmount
/home/rosed2/.local/bin/jupyter
pyluna-common 0.1.0
pyluna-core 0.1.0
pyluna-pathology 0.1.0
[1]:
['/home/rosed2/.local/lib/python3.6/site-packages/luna/pathology']
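The same checks can also be scripted. The following is a minimal sketch, not part of luna: the helper name missing_prerequisites is hypothetical, and the lists of variables and executables are simply the ones echoed in the cell above.

```python
import os
import shutil

# Environment variables echoed in the prerequisites cell.
REQUIRED_ENV_VARS = ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "JAVA_HOME", "LUNA_HOME")
# Executables invoked in the prerequisites cell.
REQUIRED_TOOLS = ("python3", "java", "jupyter")


def missing_prerequisites(env=None, which=shutil.which):
    """Return (missing_env_vars, missing_tools) for the given environment.

    Hypothetical helper for illustration; `env` and `which` are injectable
    so the check can be exercised without touching the real system.
    """
    env = os.environ if env is None else env
    missing_vars = [name for name in REQUIRED_ENV_VARS if not env.get(name)]
    missing_tools = [tool for tool in REQUIRED_TOOLS if which(tool) is None]
    return missing_vars, missing_tools


if __name__ == "__main__":
    missing_vars, missing_tools = missing_prerequisites()
    print("missing environment variables:", missing_vars or "none")
    print("missing executables:", missing_tools or "none")
```

Inside the tutorial container both lists should come back empty; anything reported as missing points at a setup problem to fix before continuing.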
2. Create a new project workspace
Next, we create a luna home space and place the configuration files there. Using a manifest file, we will then create a project workspace to hold the configurations, data, models, and outputs for this tutorial.
[2]:
%%bash
mkdir -p ~/luna
cp -R ~/vmount/conf ~/luna
cat ~/luna/conf/manifest.yaml
python3 -m luna.project.generate --manifest_file ~/luna/conf/manifest.yaml
tree ~/vmount/PRO_12-123
# project manifest template
# MIND project id
PROJECT: PRO_12-123
# IRB
IRB:
# project title
TITLE: pathology-tutorial
# project description
DESCRIPTION: End-to-end pathology analysis tutorial
DATA_MODALITIES: pathology
ROOT_PATH: ../
/home/rosed2/vmount/PRO_12-123
├── data
│ └── toy_data_set
│ ├── 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs
│ ├── 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs
│ ├── 01OV007-9b90eb78-2f50-4aeb-b010-d642f9.svs
│ ├── 01OV008-308ad404-7079-4ff8-8232-12ee2e.svs
│ └── 01OV008-7579323e-2fae-43a9-b00f-a15c28.svs
└── manifest.yaml
2 directories, 6 files
2021-12-20 20:14:04,669 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [<StreamHandler <stderr> (INFO)>, <RotatingFileHandler /home/rosed2/vmount/notebooks/data-processing.log (INFO)>]
2021-12-20 20:14:04,671 - INFO - luna.common.config - loading config file /home/rosed2/luna/conf/manifest.yaml
2021-12-20 20:14:04,695 - INFO - root - config files copied to ../PRO_12-123
2021-12-20 20:14:04,696 - INFO - root - Code block 'generate project folder' took: 0.02510554599984971s
You should now see a new directory called PRO_12-123 with the manifest file in it. This will be your project workspace!
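If you prefer to confirm the workspace from Python, here is a small sketch. It assumes, based on the log line "config files copied to ../PRO_12-123", that the workspace path is ROOT_PATH joined with PROJECT; the function names are hypothetical and not part of luna.

```python
from pathlib import Path


def project_dir(root_path, project):
    """Join ROOT_PATH and PROJECT from the manifest into the workspace path.

    Assumption: the workspace lives at ROOT_PATH/PROJECT, as the log output
    of the generate step suggests.
    """
    return Path(root_path) / project


def workspace_is_ready(workspace):
    """Return True when the workspace directory exists and holds manifest.yaml."""
    workspace = Path(workspace)
    return workspace.is_dir() and (workspace / "manifest.yaml").is_file()
```

From the notebook's working directory, `workspace_is_ready(project_dir("../", "PRO_12-123"))` should be true once the cell above has run.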
3. Review sample dataset
The data that you will be using for this tutorial is a set of 5 whole slide images of ovarian cancer H&E slides, available in the svs file format. Whole slide imaging refers to the scanning of conventional glass slides for research purposes; in this case, these are slides that oncologists have used to inspect cancer samples!
While bringing up the Digital Slide Archive (DSA) container, we already ran a script to download the data and set up DSA. The vmount/provision.py script ran these steps:
Set up admin user and default assetstore
Download sample data from the public Kitware site to ~/vmount/PRO_12-123/data/toy_data_set/
Create a collection and add slides/annotations to your local DSA
[5]:
%%bash
tree ~/vmount/PRO_12-123/data/toy_data_set
/home/rosed2/vmount/PRO_12-123/data/toy_data_set
├── 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs
├── 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs
├── 01OV007-9b90eb78-2f50-4aeb-b010-d642f9.svs
├── 01OV008-308ad404-7079-4ff8-8232-12ee2e.svs
└── 01OV008-7579323e-2fae-43a9-b00f-a15c28.svs
0 directories, 5 files
If you want to import your own data, you can do so from your local filesystem as well as from an object store. For more details, refer to the Girder user documentation.
To import images from your local filesystem:
Log in to DSA with admin/password
Add images to your local computer at vmount/assetstore
Navigate to Admin Console -> Assetstores
From the default assetstore, click on Import data
Specify the path to the images you wish to import, e.g. /assetstore/yourimage, and click Import
Because the /assetstore mount is available to DSA, this import should be much faster than uploading the image through the Upload files option in the UI.
4. Build the proxy table
Now, we will run the Whole Slide Image (WSI) ETL to build a metadata catalog of the slides in a proxy table.
For reference, ETL stands for extract-transform-load; it is a process that typically involves extracting data from a source, cleaning it and transforming data types, and loading the result into a different system.
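To make the three ETL stages concrete, here is a toy sketch in plain Python. It is not the luna implementation: it only records a slide ID (taken to be the file stem), the path, and the file length, and it writes a CSV file as a stand-in for the Delta table. All function names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def extract(source_dir, suffix=".svs"):
    """Extract: find the raw slide files in the source directory."""
    return sorted(p for p in Path(source_dir).iterdir() if p.suffix == suffix)


def transform(paths):
    """Transform: turn each path into a flat metadata record."""
    return [
        {"slide_id": p.stem, "path": str(p), "length": p.stat().st_size}
        for p in paths
    ]


def load(records, table_path):
    """Load: write the records to a CSV file standing in for the table."""
    with open(table_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["slide_id", "path", "length"])
        writer.writeheader()
        writer.writerows(records)
    return table_path


def run_etl(source_dir, table_path):
    """Chain the three stages."""
    return load(transform(extract(source_dir)), table_path)


if __name__ == "__main__":
    # Demo on a throwaway directory with one fake slide file.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "slide_a.svs").write_bytes(b"\0" * 10)
        table = run_etl(tmp, Path(tmp) / "table.csv")
        print(Path(table).read_text())
```

The real WSI ETL also extracts the scanner metadata embedded in each svs file and generates a record UUID, which this sketch leaves out.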
[6]:
!cat ~/luna/conf/wsi_config.yaml
REQUESTOR: viki mancoridis # The name of the requestor. You are likely the requestor
REQUESTOR_DEPARTMENT: computational oncology # The department to which the requestor belongs
REQUESTOR_EMAIL: MancoriV@mskcc.org # The email address of the requestor
PROJECT: PRO_12-123 # The project name decided by data coordination
SOURCE: toy_set # Source name of the input data file
MODALITY: radiology # Data modality
DATA_TYPE: WSI # Data type within this modality
COMMENTS: # Description of template defined by requestor. You may leave blank
DATE: 2021-07-06 # The date on which the request was made, likely today
DATASET_NAME: toy_data_set # Name to be given to the dataset
ETL_TYPE: proxy # Type of ETL
FILE_TYPE: svs # Input source file
FORMAT_TYPE: delta # Format type of the output proxy table
NUM_PARTITION: 1 # Number of partitions for the delta table creation
HOST: localhost # IP or hostname of machine where source data fil(s) reside
ROOT_PATH: ../ # File path to the root of your local folder of data
SOURCE_PATH: ../PRO_12-123/data/toy_data_set # Path to your specific data folder
LANDING_PATH: . # Path for tables and file transfer
RAW_DATA_PATH: . # Path to data transfer on destination machine
INCLUDE: --include=.svs # Specifies inclusion of svs files
CHUNK_FILE: ../PRO_12-123/data/chunks.txt # Output text will redirect here
FILE_COUNT: 5 # Number of files for the table
DATA_SIZE: 5000000000 # Upper bound for the number of bytes of data
BWLIMIT: 5G # Amount of network bandwidth to utilize for the data transfer
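Before launching the ETL, it can help to confirm that the source folder agrees with FILE_COUNT and DATA_SIZE. The sketch below is an assumption about how these fields relate to the data (an exact file count and an upper bound in bytes, as the config comments describe); how luna itself enforces them is not shown here, and the function name check_source is hypothetical.

```python
from pathlib import Path


def check_source(source_path, file_type, file_count, data_size):
    """Compare a source folder against FILE_TYPE, FILE_COUNT and DATA_SIZE.

    Returns a list of human-readable problems; an empty list means the
    folder matches the config values.
    """
    files = [
        p for p in Path(source_path).iterdir()
        if p.is_file() and p.suffix == "." + file_type
    ]
    total_bytes = sum(p.stat().st_size for p in files)
    problems = []
    if len(files) != file_count:
        problems.append(
            f"expected {file_count} .{file_type} files, found {len(files)}"
        )
    if total_bytes > data_size:
        problems.append(
            f"{total_bytes} bytes exceeds DATA_SIZE of {data_size}"
        )
    return problems
```

With the values from the config above, the call would be `check_source("../PRO_12-123/data/toy_data_set", "svs", 5, 5000000000)`.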
[7]:
%%bash
python3 -m luna.pathology.proxy_table.generate \
-d ~/luna/conf/wsi_config.yaml \
-a ~/luna/conf/app_config.yaml \
-p delta
:: loading settings :: url = jar:file:/home/rosed2/.local/lib/python3.6/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
root
|-- path: string (nullable = true)
|-- modificationTime: timestamp (nullable = true)
|-- length: long (nullable = true)
|-- wsi_record_uuid: string (nullable = true)
|-- slide_id: string (nullable = true)
|-- metadata: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+
| path| modificationTime| length| wsi_record_uuid| slide_id| metadata|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+
|file:/home/rosed2...|2021-12-20 20:14:...|262796337|WSI-214d5f6dc60e5...|01OV007-9b90eb78-...|{aperio_User -> b...|
|file:/home/rosed2...|2021-12-20 20:14:...|240691747|WSI-754715472db56...|01OV002-ed65cf94-...|{aperio_User -> b...|
|file:/home/rosed2...|2021-12-20 20:13:...|237047223|WSI-c9bd1f11b93b8...|01OV002-bd8cdc70-...|{aperio_User -> b...|
|file:/home/rosed2...|2021-12-20 20:15:...|215796305|WSI-9dbf3d29fac30...|01OV008-7579323e-...|{aperio_User -> b...|
|file:/home/rosed2...|2021-12-20 20:15:...|207479411|WSI-1ac0d0b8337e3...|01OV008-308ad404-...|{aperio_User -> b...|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+
2021-12-20 20:29:08,048 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [<StreamHandler <stderr> (INFO)>, <RotatingFileHandler /home/rosed2/vmount/notebooks/data-processing.log (INFO)>]
2021-12-20 20:29:08,050 - INFO - root - data_ingestions_template: /home/rosed2/luna/conf/wsi_config.yaml
2021-12-20 20:29:08,052 - INFO - root - config_file: /home/rosed2/luna/conf/app_config.yaml
2021-12-20 20:29:08,053 - INFO - root - processes: ['delta']
2021-12-20 20:29:08,054 - INFO - luna.common.config - loading config file /home/rosed2/luna/conf/app_config.yaml
2021-12-20 20:29:08,058 - INFO - luna.common.config - loading config file /home/rosed2/luna/conf/wsi_config.yaml
2021-12-20 20:29:08,063 - INFO - luna.common.config - validating config /home/rosed2/luna/conf/wsi_config.yaml against schema /home/rosed2/.local/lib/python3.6/site-packages/luna/pathology/proxy_table/data_ingestion_template_schema.yml for DATA_CFG
2021-12-20 20:29:08,114 - INFO - root - config files copied to ../PRO_12-123/configs/WSI_toy_data_set
Warning: Ignoring non-Spark config property: fs.defaultFS
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/rosed2/.local/lib/python3.6/site-packages/pyspark/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Ivy Default Cache set to: /home/rosed2/.ivy2/cache
The jars for the packages stored in: /home/rosed2/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-7ac9b111-a068-4c2c-b2a1-edcffb4c556e;1.0
confs: [default]
found io.delta#delta-core_2.12;0.7.0 in central
found org.antlr#antlr4;4.7 in central
found org.antlr#antlr4-runtime;4.7 in central
found org.antlr#antlr-runtime;3.5.2 in central
found org.antlr#ST4;4.0.8 in central
found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
found org.glassfish#javax.json;1.0.4 in central
found com.ibm.icu#icu4j;58.2 in central
downloading https://repo1.maven.org/maven2/io/delta/delta-core_2.12/0.7.0/delta-core_2.12-0.7.0.jar ...
[SUCCESSFUL ] io.delta#delta-core_2.12;0.7.0!delta-core_2.12.jar (141ms)
downloading https://repo1.maven.org/maven2/org/antlr/antlr4/4.7/antlr4-4.7.jar ...
[SUCCESSFUL ] org.antlr#antlr4;4.7!antlr4.jar (79ms)
downloading https://repo1.maven.org/maven2/org/antlr/antlr4-runtime/4.7/antlr4-runtime-4.7.jar ...
[SUCCESSFUL ] org.antlr#antlr4-runtime;4.7!antlr4-runtime.jar (38ms)
downloading https://repo1.maven.org/maven2/org/antlr/antlr-runtime/3.5.2/antlr-runtime-3.5.2.jar ...
[SUCCESSFUL ] org.antlr#antlr-runtime;3.5.2!antlr-runtime.jar (29ms)
downloading https://repo1.maven.org/maven2/org/antlr/ST4/4.0.8/ST4-4.0.8.jar ...
[SUCCESSFUL ] org.antlr#ST4;4.0.8!ST4.jar (32ms)
downloading https://repo1.maven.org/maven2/org/abego/treelayout/org.abego.treelayout.core/1.0.3/org.abego.treelayout.core-1.0.3.jar ...
[SUCCESSFUL ] org.abego.treelayout#org.abego.treelayout.core;1.0.3!org.abego.treelayout.core.jar(bundle) (25ms)
downloading https://repo1.maven.org/maven2/org/glassfish/javax.json/1.0.4/javax.json-1.0.4.jar ...
[SUCCESSFUL ] org.glassfish#javax.json;1.0.4!javax.json.jar(bundle) (18ms)
downloading https://repo1.maven.org/maven2/com/ibm/icu/icu4j/58.2/icu4j-58.2.jar ...
[SUCCESSFUL ] com.ibm.icu#icu4j;58.2!icu4j.jar (509ms)
:: resolution report :: resolve 3367ms :: artifacts dl 878ms
:: modules in use:
com.ibm.icu#icu4j;58.2 from central in [default]
io.delta#delta-core_2.12;0.7.0 from central in [default]
org.abego.treelayout#org.abego.treelayout.core;1.0.3 from central in [default]
org.antlr#ST4;4.0.8 from central in [default]
org.antlr#antlr-runtime;3.5.2 from central in [default]
org.antlr#antlr4;4.7 from central in [default]
org.antlr#antlr4-runtime;4.7 from central in [default]
org.glassfish#javax.json;1.0.4 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 8 | 8 | 8 | 0 || 8 | 8 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-7ac9b111-a068-4c2c-b2a1-edcffb4c556e
confs: [default]
8 artifacts copied, 0 already retrieved (15071kB/60ms)
21/12/20 20:29:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-12-20 20:29:17,725 - INFO - __main__ - generating binary proxy table...
2021-12-20 20:29:17,767 - INFO - __main__ - Writing to ../PRO_12-123/tables/WSI_toy_data_set
2021-12-20 20:29:19,552 - INFO - __main__ - Code block 'load wsi metadata' took: 1.783534656000029s
Current mem limits: -1 of max -1
Setting mem limits to 1073741824 of max 1073741824
Current mem limits: -1 of max -1
Setting mem limits to 1073741824 of max 1073741824
Current mem limits: -1 of max -1
Setting mem limits to 1073741824 of max 1073741824
Current mem limits: -1 of max -1
Setting mem limits to 1073741824 of max 1073741824
Current mem limits: -1 of max -1
Setting mem limits to 1073741824 of max 1073741824
2021-12-20 20:29:27,968 - INFO - __main__ - Processed 5 whole slide images out of total 5 files
Current mem limits: 1073741824 of max 1073741824
Current mem limits: 1073741824 of max 1073741824
Current mem limits: 1073741824 of max 1073741824
Current mem limits: 1073741824 of max 1073741824
Current mem limits: 1073741824 of max 1073741824
2021-12-20 20:29:30,010 - INFO - root - Code block 'generate proxy table' took: 21.959749678999742s
This step may take a while. At the end, your proxy table should be generated!
Before we view the table, we must first update it to associate patient IDs with the slides. This is necessary for correctly training and validating the machine learning model in the coming notebooks. Once the slides are divided into “tiles” in the next notebook, the tiles are split between the training and validation sets for the ML model. If the tiles do not have patient IDs associated with them, tiles from one individual can appear in both the training and validation sets; this would inflate the measured accuracy of the model, since we would essentially be validating it on data too similar to what it has already seen.
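To illustrate why the patient ID matters, here is a minimal sketch of a patient-level split in plain Python. It is not the splitting code used in the later notebooks; the function name, the tile ID format, and the default validation fraction are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_patient(tiles, validation_fraction=0.2, seed=0):
    """Split tiles so that no patient contributes to both sets.

    tiles: iterable of (tile_id, patient_id) pairs.
    Returns (train_tile_ids, validation_tile_ids).
    """
    by_patient = defaultdict(list)
    for tile_id, patient_id in tiles:
        by_patient[patient_id].append(tile_id)

    # Shuffle patients, not tiles, so each patient lands in exactly one set.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # Hold out at least one patient for validation.
    n_validation = max(1, round(len(patients) * validation_fraction))
    validation_patients = set(patients[:n_validation])

    train = [t for p in patients if p not in validation_patients for t in by_patient[p]]
    validation = [t for p in patients if p in validation_patients for t in by_patient[p]]
    return train, validation
```

Because the shuffle operates on patients rather than tiles, every tile from a given patient ends up on the same side of the split. With a single patient, the training set would be empty, so a real workflow needs more than one patient.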
Note that we will not be using patient IDs associated with MSK. Instead, we will be using spoof IDs that will suffice for this tutorial. When running this workflow with real data, make sure to include the IDs safely and securely. Run the following block of code to add a ‘patient_id’ column to the table and store it using Spark.
[8]:
from pyspark.sql import SparkSession
# setup spark session
spark = SparkSession.builder \
.appName("test") \
.master('local[*]') \
.config("spark.driver.host", "127.0.0.1") \
.config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0") \
.config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.HDFSLogStore") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.config("spark.databricks.delta.retentionDurationCheck.enabled", "false") \
.config("spark.hadoop.dfs.client.use.datanode.hostname", "true") \
.config("spark.driver.memory", "6g") \
.config("spark.executor.memory", "6g") \
.getOrCreate()
print(spark)
# read the WSI delta table into a pandas DataFrame
wsi_table = spark.read.format("delta").load("../PRO_12-123/tables/WSI_toy_data_set").toPandas()
# insert spoof patient ids, one per slide in row order
patient_id = [1, 2, 3, 4, 5]
wsi_table['patient_id'] = patient_id
# convert back to a spark table and overwrite the delta table with the new column
wsi_spark_table = spark.createDataFrame(wsi_table)
wsi_spark_table.write.format("delta").mode("overwrite").option("mergeSchema", "true").save("../PRO_12-123/tables/WSI_toy_data_set")
<pyspark.sql.session.SparkSession object at 0x7f1b4b6e2ac8>
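Vacuum the Delta table with a retention period of 0 hours. This deletes data files that the current table version no longer references, so the remaining files can be read directly as a parquet table. (The retention duration check was disabled in the Spark session config above to allow a retention of 0.)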
[9]:
from delta.tables import DeltaTable
wsi_table = DeltaTable.forPath(spark, "../PRO_12-123/tables/WSI_toy_data_set")
wsi_table.vacuum(0)
[9]:
DataFrame[]
Next, we may view the WSI table! This table should have the metadata associated with the whole slide images that you just cataloged, including the patient IDs.
[16]:
# read WSI delta table
wsi_table = spark.read.format("delta") \
.load("../PRO_12-123/tables/WSI_toy_data_set").toPandas()
# view table
wsi_table
[16]:
| | path | modificationTime | length | wsi_record_uuid | slide_id | metadata | patient_id |
|---|---|---|---|---|---|---|---|
| 0 | file:/home/rosed2/vmount/PRO_12-123/data/toy_d... | 2021-12-20 20:14:21.163 | 240691747 | WSI-754715472db565938da32db34ccb625a10a2c9905d... | 01OV002-ed65cf94-8bc6-492b-9149-adc16f | {'aperio_StripeWidth': '1000', 'aperio_User': ... | 2 |
| 1 | file:/home/rosed2/vmount/PRO_12-123/data/toy_d... | 2021-12-20 20:14:56.701 | 262796337 | WSI-214d5f6dc60e5e9b4a697067a7e897b49c7fdc72eb... | 01OV007-9b90eb78-2f50-4aeb-b010-d642f9 | {'aperio_StripeWidth': '1000', 'aperio_User': ... | 1 |
| 2 | file:/home/rosed2/vmount/PRO_12-123/data/toy_d... | 2021-12-20 20:15:59.008 | 215796305 | WSI-9dbf3d29fac304fe27dac95b20bfff7ed338293cae... | 01OV008-7579323e-2fae-43a9-b00f-a15c28 | {'aperio_StripeWidth': '1000', 'aperio_User': ... | 3 |
| 3 | file:/home/rosed2/vmount/PRO_12-123/data/toy_d... | 2021-12-20 20:13:47.194 | 237047223 | WSI-c9bd1f11b93b81f9e0ed532082e2d64b7e9daac496... | 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77 | {'aperio_StripeWidth': '1000', 'aperio_User': ... | 4 |
| 4 | file:/home/rosed2/vmount/PRO_12-123/data/toy_d... | 2021-12-20 20:15:28.100 | 207479411 | WSI-1ac0d0b8337e30284de658fab536204fd7d9c9ef45... | 01OV008-308ad404-7079-4ff8-8232-12ee2e | {'aperio_Left': '29.130108', 'aperio_StripeWid... | 5 |
If the table is displayed above, congratulations, you have successfully run the Whole Slide Image (WSI) ETL to catalog the slides!
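The spoof IDs are assigned by row order, so the two slides whose IDs start with 01OV002 receive different patient IDs (2 and 4). That is fine for a toy run. If you wanted slides that share a prefix to share an ID, one option is sketched below. It rests on an assumption that the source does not confirm, namely that the text before the first hyphen identifies a case; the function name is hypothetical.

```python
def spoof_patient_ids(slide_ids):
    """Give slides with the same prefix the same integer ID.

    Assumption for illustration only: the part of the slide ID before the
    first hyphen (for example "01OV002") identifies one case.
    """
    prefix_to_id = {}
    result = {}
    for slide_id in slide_ids:
        prefix = slide_id.split("-", 1)[0]
        # Reuse the ID for a known prefix, otherwise assign the next integer.
        patient_id = prefix_to_id.setdefault(prefix, len(prefix_to_id) + 1)
        result[slide_id] = patient_id
    return result
```

With a pandas table, the mapping could be applied as `wsi_table['patient_id'] = wsi_table['slide_id'].map(spoof_patient_ids(wsi_table['slide_id']))`.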
5. Run the regional annotation ETL
The whole slide images that you downloaded are images of ovarian cancer, but not every pixel on each slide is a tumor. In fact, the images show tumor cells, normal ovarian cells and more. A non-expert annotated these slides for demo purposes only.
The regional annotation ETL performs the following steps:
Downloads DSA JSON annotations
Converts DSA JSON to GeoJSON format, which is compatible with downstream applications
Saves configs in your ~/vmount/PRO_12-123/configs/REGIONAL_METADATA_RESULTS
Saves parquet table in your ~/vmount/PRO_12-123/tables/REGIONAL_METADATA_RESULTS
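The conversion step can be pictured with a simplified sketch. It assumes DSA annotation elements of type "polyline" with a "closed" flag, a "points" list of [x, y, z] coordinates, and a "label" object holding a "value"; it is not luna's converter, which may handle more element types and map labels differently. The function name and the example label are hypothetical.

```python
import json


def dsa_elements_to_geojson(elements):
    """Convert closed DSA polyline elements into a GeoJSON FeatureCollection."""
    features = []
    for element in elements:
        # This sketch only handles closed polylines and skips everything else.
        if element.get("type") != "polyline" or not element.get("closed"):
            continue
        points = element.get("points", [])
        if len(points) < 3:
            continue  # not enough vertices to form a polygon

        # Drop the z coordinate; GeoJSON positions here are [x, y].
        ring = [[x, y] for x, y, *_ in points]
        # GeoJSON linear rings must end on their starting position.
        if ring[0] != ring[-1]:
            ring.append(list(ring[0]))

        features.append({
            "type": "Feature",
            "properties": {"label": element.get("label", {}).get("value")},
            "geometry": {"type": "Polygon", "coordinates": [ring]},
        })
    return {"type": "FeatureCollection", "features": features}


if __name__ == "__main__":
    example = [{
        "type": "polyline",
        "closed": True,
        "points": [[0, 0, 0], [10, 0, 0], [10, 10, 0]],
        "label": {"value": "tumor"},  # hypothetical label name
    }]
    print(json.dumps(dsa_elements_to_geojson(example), indent=2))
```

Each closed polyline becomes one Polygon feature, with the annotation label carried over as a property so that downstream tools can filter regions by class.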
To run the regional annotation ETL, we use the dsa_annotation CLI. For more details on dsa_annotation and the annotations we support, please check out the 7_dsa-annotation.ipynb notebook.
Note: details of your DSA instance are specified as DSA_URI in ../conf/dsa_regional_annotation.yaml and should be updated to reflect your DSA setup. If you are using Docker, replace localhost with the IP you get from running:
docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' luna_tutorial_girder_1
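If you would rather compute the updated URI than edit it by hand, the sketch below swaps the host while keeping the scheme, port, and path. It assumes the URI includes a scheme such as http://; the example URI and IP address in the usage note are hypothetical, so check the actual DSA_URI value in your config. Any user:password part of the URI is dropped.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_host(uri, new_host):
    """Return uri with its hostname replaced by new_host.

    Assumes the URI carries a scheme (for example "http://..."); without
    one the host cannot be located reliably, so a ValueError is raised.
    """
    parts = urlsplit(uri)
    if parts.hostname is None:
        raise ValueError(f"no host found in {uri!r}; is the scheme missing?")
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))
```

For example, `replace_host("http://localhost:8080/api/v1", "172.18.0.2")` returns `"http://172.18.0.2:8080/api/v1"`.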
[13]:
!dsa_annotation \
-d ../conf/dsa_regional_annotation.yaml \
-a ../conf/dsa_app_config.yaml \
-u admin -p password
2021-12-20 20:41:21,000 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [<StreamHandler <stderr> (INFO)>, <RotatingFileHandler /home/rosed2/vmount/notebooks/data-processing.log (INFO)>]
2021-12-20 20:41:21,002 - INFO - luna.common.config - loading config file ../conf/dsa_regional_annotation.yaml
2021-12-20 20:41:21,007 - INFO - luna.common.config - loading config file ../conf/dsa_app_config.yaml
2021-12-20 20:41:21,009 - INFO - root - data template: ../conf/dsa_regional_annotation.yaml
2021-12-20 20:41:21,010 - INFO - root - config_file: ../conf/dsa_app_config.yaml
2021-12-20 20:41:21,048 - INFO - root - config files copied to ../PRO_12-123/configs/REGIONAL_METADATA_RESULTS
2021-12-20 20:41:21,153 - INFO - luna.pathology.cli.dsa.dsa_annotations - Table output directory: ../PRO_12-123/tables/REGIONAL_METADATA_RESULTS
Successfully connected to DSA
collection_id_dict {'_accessLevel': 2, '_id': '61c0e3dd4903324c7fe38073', '_modelType': 'collection', '_textScore': 15.0, 'created': '2021-12-20T20:13:17.286000+00:00', 'description': '', 'meta': {}, 'name': 'TCGA collection', 'public': True, 'size': 1163811023, 'updated': '2021-12-20T20:13:17.286000+00:00'}
Collection TCGA collection found with id: 61c0e3dd4903324c7fe38073
retreived collection uuid
No stylesheet in collection: 61c0e3dd4903324c7fe38073
2021-12-20 20:41:21,390 - INFO - luna.pathology.cli.dsa.dsa_annotations - Retrieved collection metadata
2021-12-20 20:41:22,746 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [<StreamHandler <stderr> (INFO)>, <RotatingFileHandler /home/rosed2/vmount/notebooks/data-processing.log (INFO)>]
2021-12-20 20:41:22,747 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [<StreamHandler <stderr> (INFO)>, <RotatingFileHandler /home/rosed2/vmount/notebooks/data-processing.log (INFO)>]
2021-12-20 20:41:22,747 - INFO - root - FYI: Initalized logger, log file at: data-processing.log with handlers: [<StreamHandler <stderr> (INFO)>, <RotatingFileHandler /home/rosed2/vmount/notebooks/data-processing.log (INFO)>]
2021-12-20 20:41:22,750 - INFO - luna.pathology.cli.dsa.dsa_annotations - <Client: 'tcp://127.0.0.1:36525' processes=3 threads=3, memory=1.50 GB>
collection_id_dict {'_accessLevel': 2, '_id': '61c0e3dd4903324c7fe38073', '_modelType': 'collection', '_textScore': 15.0, 'created': '2021-12-20T20:13:17.286000+00:00', 'description': '', 'meta': {}, 'name': 'TCGA collection', 'public': True, 'size': 1163811023, 'updated': '2021-12-20T20:13:17.286000+00:00'}
Collection TCGA collection found with id: 61c0e3dd4903324c7fe38073
collection_id_dict {'_accessLevel': 2, '_id': '61c0e3dd4903324c7fe38073', '_modelType': 'collection', '_textScore': 15.0, 'created': '2021-12-20T20:13:17.286000+00:00', 'description': '', 'meta': {}, 'name': 'TCGA collection', 'public': True, 'size': 1163811023, 'updated': '2021-12-20T20:13:17.286000+00:00'}
Collection TCGA collection found with id: 61c0e3dd4903324c7fe38073
collection_id_dict {'_accessLevel': 2, '_id': '61c0e3dd4903324c7fe38073', '_modelType': 'collection', '_textScore': 15.0, 'created': '2021-12-20T20:13:17.286000+00:00', 'description': '', 'meta': {}, 'name': 'TCGA collection', 'public': True, 'size': 1163811023, 'updated': '2021-12-20T20:13:17.286000+00:00'}
Collection TCGA collection found with id: 61c0e3dd4903324c7fe38073
Image file 01OV008-7579323e-2fae-43a9-b00f-a15c28.svs found with id: 61c0e4644903324c7fe3809f
Starting request for annotation
Image file 01OV008-308ad404-7079-4ff8-8232-12ee2e.svs found with id: 61c0e4464903324c7fe38094
Starting request for annotation
Image file 01OV007-9b90eb78-2f50-4aeb-b010-d642f9.svs found with id: 61c0e4224903324c7fe38089
Starting request for annotation
2021-12-20 20:41:22,885 - INFO - luna.common.config - loading config file /home/rosed2/vmount/conf/datastore.cfg
collection_id_dict {'_accessLevel': 2, '_id': '61c0e3dd4903324c7fe38073', '_modelType': 'collection', '_textScore': 15.0, 'created': '2021-12-20T20:13:17.286000+00:00', 'description': '', 'meta': {}, 'name': 'TCGA collection', 'public': True, 'size': 1163811023, 'updated': '2021-12-20T20:13:17.286000+00:00'}
Collection TCGA collection found with id: 61c0e3dd4903324c7fe38073
2021-12-20 20:41:22,901 - INFO - luna.common.DataStore - Configured datastore with {'GRAPH_STORE_ENABLED': False, 'GRAPH_URI': 'neo4j://localhost:7687', 'GRAPH_USER': 'neo4j', 'GRAPH_PASSWORD': 'password', 'OBJECT_STORE_ENABLED': False, 'MINIO_URI': 'localhost:8001', 'MINIO_USER': 'minio', 'MINIO_PASSWORD': 'password', 'DOC_STORE_ENABLED': False, 'MONGODB_URI': 'mongodb://localhost:27017/'}
collection_id_dict {'_accessLevel': 2, '_id': '61c0e3dd4903324c7fe38073', '_modelType': 'collection', '_textScore': 15.0, 'created': '2021-12-20T20:13:17.286000+00:00', 'description': '', 'meta': {}, 'name': 'TCGA collection', 'public': True, 'size': 1163811023, 'updated': '2021-12-20T20:13:17.286000+00:00'}
Collection TCGA collection found with id: 61c0e3dd4903324c7fe38073
Image file 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77.svs found with id: 61c0e3dd4903324c7fe38075
Starting request for annotation
2021-12-20 20:41:22,911 - INFO - luna.common.DataStore - Datstore file backend= ../PRO_12-123/slides
Image file 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs found with id: 61c0e4014903324c7fe38080
Starting request for annotation
2021-12-20 20:41:22,940 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV007-9b90eb78-2f50-4aeb-b010-d642f9/admin/REGIONAL_METADATA_RESULTS_DSA_JSON/DEFAULT_LABELS
2021-12-20 20:41:22,981 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV007-9b90eb78-2f50-4aeb-b010-d642f9/CONCAT/REGIONAL_METADATA_RESULTS/DEFAULT_LABELS
2021-12-20 20:41:23,010 - INFO - luna.common.config - loading config file /home/rosed2/vmount/conf/datastore.cfg
2021-12-20 20:41:23,012 - INFO - luna.common.config - loading config file /home/rosed2/vmount/conf/datastore.cfg
2021-12-20 20:41:23,016 - INFO - luna.common.DataStore - Configured datastore with {'GRAPH_STORE_ENABLED': False, 'GRAPH_URI': 'neo4j://localhost:7687', 'GRAPH_USER': 'neo4j', 'GRAPH_PASSWORD': 'password', 'OBJECT_STORE_ENABLED': False, 'MINIO_URI': 'localhost:8001', 'MINIO_USER': 'minio', 'MINIO_PASSWORD': 'password', 'DOC_STORE_ENABLED': False, 'MONGODB_URI': 'mongodb://localhost:27017/'}
2021-12-20 20:41:23,022 - INFO - luna.common.DataStore - Configured datastore with {'GRAPH_STORE_ENABLED': False, 'GRAPH_URI': 'neo4j://localhost:7687', 'GRAPH_USER': 'neo4j', 'GRAPH_PASSWORD': 'password', 'OBJECT_STORE_ENABLED': False, 'MINIO_URI': 'localhost:8001', 'MINIO_USER': 'minio', 'MINIO_PASSWORD': 'password', 'DOC_STORE_ENABLED': False, 'MONGODB_URI': 'mongodb://localhost:27017/'}
2021-12-20 20:41:23,023 - INFO - luna.common.DataStore - Datstore file backend= ../PRO_12-123/slides
2021-12-20 20:41:23,024 - INFO - luna.common.DataStore - Datstore file backend= ../PRO_12-123/slides
2021-12-20 20:41:23,033 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV008-308ad404-7079-4ff8-8232-12ee2e/admin/REGIONAL_METADATA_RESULTS_DSA_JSON/DEFAULT_LABELS
2021-12-20 20:41:23,037 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV008-7579323e-2fae-43a9-b00f-a15c28/admin/REGIONAL_METADATA_RESULTS_DSA_JSON/DEFAULT_LABELS
2021-12-20 20:41:23,037 - INFO - luna.common.DataStore - Configured datastore with {'GRAPH_STORE_ENABLED': False, 'GRAPH_URI': 'neo4j://localhost:7687', 'GRAPH_USER': 'neo4j', 'GRAPH_PASSWORD': 'password', 'OBJECT_STORE_ENABLED': False, 'MINIO_URI': 'localhost:8001', 'MINIO_USER': 'minio', 'MINIO_PASSWORD': 'password', 'DOC_STORE_ENABLED': False, 'MONGODB_URI': 'mongodb://localhost:27017/'}
2021-12-20 20:41:23,044 - INFO - luna.common.DataStore - Datstore file backend= ../PRO_12-123/slides
2021-12-20 20:41:23,054 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV008-308ad404-7079-4ff8-8232-12ee2e/CONCAT/REGIONAL_METADATA_RESULTS/DEFAULT_LABELS
2021-12-20 20:41:23,054 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV002-bd8cdc70-3d46-40ae-99c4-90ef77/admin/REGIONAL_METADATA_RESULTS_DSA_JSON/DEFAULT_LABELS
2021-12-20 20:41:23,059 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV008-7579323e-2fae-43a9-b00f-a15c28/CONCAT/REGIONAL_METADATA_RESULTS/DEFAULT_LABELS
2021-12-20 20:41:23,068 - INFO - luna.common.DataStore - Configured datastore with {'GRAPH_STORE_ENABLED': False, 'GRAPH_URI': 'neo4j://localhost:7687', 'GRAPH_USER': 'neo4j', 'GRAPH_PASSWORD': 'password', 'OBJECT_STORE_ENABLED': False, 'MINIO_URI': 'localhost:8001', 'MINIO_USER': 'minio', 'MINIO_PASSWORD': 'password', 'DOC_STORE_ENABLED': False, 'MONGODB_URI': 'mongodb://localhost:27017/'}
2021-12-20 20:41:23,073 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV002-bd8cdc70-3d46-40ae-99c4-90ef77/CONCAT/REGIONAL_METADATA_RESULTS/DEFAULT_LABELS
2021-12-20 20:41:23,075 - INFO - luna.common.DataStore - Datstore file backend= ../PRO_12-123/slides
2021-12-20 20:41:23,083 - INFO - luna.pathology.cli.dsa.dsa_annotations - Annotation for slide 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs generated successfully
2021-12-20 20:41:23,097 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV002-ed65cf94-8bc6-492b-9149-adc16f/admin/REGIONAL_METADATA_RESULTS_DSA_JSON/DEFAULT_LABELS
2021-12-20 20:41:23,101 - INFO - luna.pathology.cli.dsa.dsa_annotations - Annotation for slide 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs generated successfully
2021-12-20 20:41:23,108 - INFO - luna.common.DataStore - Save -> ../PRO_12-123/slides/01OV002-ed65cf94-8bc6-492b-9149-adc16f/CONCAT/REGIONAL_METADATA_RESULTS/DEFAULT_LABELS
2021-12-20 20:41:23,109 - INFO - luna.pathology.cli.dsa.dsa_annotations - Annotation for slide 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs generated successfully
2021-12-20 20:41:23,121 - INFO - luna.pathology.cli.dsa.dsa_annotations - Annotation for slide 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs generated successfully
2021-12-20 20:41:23,130 - INFO - luna.pathology.cli.dsa.dsa_annotations - Annotation for slide 01OV002-ed65cf94-8bc6-492b-9149-adc16f.svs generated successfully
2021-12-20 20:41:24,580 - INFO - root - Code block 'generate DSA annotation geojson table' took: 3.5710523679999824s
To check that the regional annotation ETL ran correctly, you may load the regional annotations table after the Jupyter cell finishes! This table contains the metadata saved from running the ETL, including paths to the DSA JSON and GeoJSON annotation files. To load the table, run the following code cell:
[15]:
from pyarrow.parquet import read_table
regional_annotation_table = read_table("../PRO_12-123/tables/REGIONAL_METADATA_RESULTS").to_pandas()
regional_annotation_table
[15]:
| | project_name | slide_id | user | dsa_json_path | geojson_path | date_updated | date_created | labelset | annotation_name | annotation_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCGA collection | 01OV007-9b90eb78-2f50-4aeb-b010-d642f9 | CONCAT | ../PRO_12-123/slides/01OV007-9b90eb78-2f50-4ae... | ../PRO_12-123/slides/01OV007-9b90eb78-2f50-4ae... | 2021-12-20T20:15:02.631000+00:00 | 2021-12-20T20:15:02.631000+00:00 | DEFAULT_LABELS | ov_regional | RegionalAnnotationJSON |
| 1 | TCGA collection | 01OV008-308ad404-7079-4ff8-8232-12ee2e | CONCAT | ../PRO_12-123/slides/01OV008-308ad404-7079-4ff... | ../PRO_12-123/slides/01OV008-308ad404-7079-4ff... | 2021-12-20T20:15:32.585000+00:00 | 2021-12-20T20:15:32.585000+00:00 | DEFAULT_LABELS | ov_regional | RegionalAnnotationJSON |
| 2 | TCGA collection | 01OV008-7579323e-2fae-43a9-b00f-a15c28 | CONCAT | ../PRO_12-123/slides/01OV008-7579323e-2fae-43a... | ../PRO_12-123/slides/01OV008-7579323e-2fae-43a... | 2021-12-20T20:16:04.062000+00:00 | 2021-12-20T20:16:04.062000+00:00 | DEFAULT_LABELS | ov_regional | RegionalAnnotationJSON |
| 3 | TCGA collection | 01OV002-bd8cdc70-3d46-40ae-99c4-90ef77 | CONCAT | ../PRO_12-123/slides/01OV002-bd8cdc70-3d46-40a... | ../PRO_12-123/slides/01OV002-bd8cdc70-3d46-40a... | 2021-12-20T20:13:53.187000+00:00 | 2021-12-20T20:13:53.187000+00:00 | DEFAULT_LABELS | ov_regional | RegionalAnnotationJSON |
| 4 | TCGA collection | 01OV002-ed65cf94-8bc6-492b-9149-adc16f | CONCAT | ../PRO_12-123/slides/01OV002-ed65cf94-8bc6-492... | ../PRO_12-123/slides/01OV002-ed65cf94-8bc6-492... | 2021-12-20T20:14:26.703000+00:00 | 2021-12-20T20:14:26.703000+00:00 | DEFAULT_LABELS | ov_regional | RegionalAnnotationJSON |
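As a final sanity check, you can confirm that every slide in the WSI table also has a row in the regional annotation table. The helper below is a hypothetical sketch that works on any two collections of slide IDs.

```python
def slides_missing_annotations(wsi_slide_ids, annotated_slide_ids):
    """Return the sorted slide IDs present in the WSI table but not annotated."""
    return sorted(set(wsi_slide_ids) - set(annotated_slide_ids))
```

With the two pandas tables loaded in this notebook, `slides_missing_annotations(wsi_table["slide_id"], regional_annotation_table["slide_id"])` should return an empty list when all five slides were annotated.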
At this point, you have successfully set up your workspace, downloaded the data, and run both the pathology and regional annotation ETLs to prepare your data. You are ready to move on to the tiling notebook!