{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model Training Tutorial\n",
    "\n",
    "Welcome to the model training tutorial! In this tutorial, we will train a neural network to classify tiles from our toy data set and visualize its efficacy. For now, the classifier code in full can be found at: https://github.com/msk-mind/sandbox/blob/master/classifier/crc_tile_classifier/tile_classifier.py. Our model is essentially a wrapper around PyTorch's ResNet 18 deep residual network; the LUNA team modified it to suit their work with tiling the slides. Here are the steps we will review in this notebook:\n",
    "\n",
    "- Initialize project space for model\n",
    "- Import libraries and set constants\n",
    "- Define parameters and set output directory\n",
    "- Load and prepare data\n",
    "- Define model\n",
    "- Train and save model\n",
    "- Visualize TensorBoard logging"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize project space for model\n",
    "\n",
    "First, you must initialize a directory for the model and its outputs. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "mkdir PRO_12-123/model/\n",
    "mkdir PRO_12-123/model/outputs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import libraries and set constants\n",
    "\n",
    "Since we are not using CLIs with built-in import statements for this notebook, it is necessary to include this on our own. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 274,
   "metadata": {},
   "outputs": [],
   "source": [
    "# todo: remove libs\n",
    "import datetime\n",
    "import sys\n",
    "import os\n",
    "\n",
    "from collections import Counter\n",
    "from typing import List, Optional, Tuple\n",
    "\n",
    "sys.path.append(os.pardir)\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import pickle\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import pyarrow.dataset as ds\n",
    "import pyarrow.parquet as pq\n",
    "\n",
    "import torch\n",
    "\n",
    "import torch.nn as nn\n",
    "import torch.optim as optim\n",
    "import torchvision.models as models\n",
    "\n",
    "from PIL import Image\n",
    "from pyarrow import fs\n",
    "from tensorboard import notebook\n",
    "from torchvision import transforms\n",
    "from torch.utils.data import Dataset, DataLoader\n",
    "from torch.utils.data.sampler import SubsetRandomSampler\n",
    "\n",
    "from classifier.model import Classifier\n",
    "from classifier.data import (\n",
    "    TileDataset,\n",
    "    get_stratified_sampler,\n",
    "    get_group_stratified_sampler,\n",
    ")\n",
    "from classifier.utils import set_seed, seed_workers\n",
    "\n",
    "# constants\n",
    "torch.set_default_tensor_type(torch.FloatTensor)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define parameters and set output directories\n",
    "\n",
    "Next, you will define parameters that are crucial to your model. You will also  Here are some notes to explain the parameters:\n",
    "\n",
    "- `batch_size` refers to the number of training examples used in one iteration (number of times the algorithm's parameters are updated); this model runs with mini-batch mode, where the batch size is less than the total dataset size but greater than one. \n",
    "- `validation_split` refers to the percentage of data that will be used to validate the efficacy of the model as opposed to data that will be used to train the model.\n",
    "- `shuffle_dataset` is set to true in order for the model to randomize the order of samples within batches between epochs \n",
    "- `random_seed` is the starting point for generating random numbers in a number sequence. Randomness will be used in many different aspects of the machine learning algorithm, such as shuffling the datset between epochs and randomly initializing weights for the parameters. \n",
    "- `lr` refers to the learning rate of the gradient descent optimizer.\n",
    "- `epochs` refers to the number of times the algorithm will loop through the entire dataset during training. Setting this variable to 10 specifies that the dataset will be run through 10 times before the model is finished training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 275,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define params\n",
    "batch_size = 64\n",
    "validation_split = 0.50\n",
    "shuffle_dataset = False\n",
    "random_seed = 42 \n",
    "lr = 1e-3\n",
    "epochs = 10\n",
    "n_workers = 20\n",
    "\n",
    "# set random seed\n",
    "set_seed(random_seed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You will also set your output directory. This directory will include the checkpoints from the model as well as outputs that will enable the TensorBoard extension to generate figures documenting the model's progression as it trains on the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 276,
   "metadata": {},
   "outputs": [],
   "source": [
    "# set output directory\n",
    "output_dir = 'PRO_12-123/model/outputs/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load and prepare data\n",
    "\n",
    "First, we must specify attributes of our data: the label set and path. The regional annotations referenced many properties of each image, and we will train our model on five of the most prevalent properties. Note that the `tumor` class must correspond to the first label in the label set; otherwise, our inference will be incorrect further down the pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 277,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load data\n",
    "label_set = {\n",
    "    \"tumor\":0,\n",
    "    \"vessels\":1,\n",
    "    \"stroma\":2\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We must also specify the path to our data and load it into a pandas dataframe. We may see the labels present in our data and the number of instances in each data by taking a closer look at the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 278,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "regional_label\n",
      "adipocytes              45\n",
      "arteries              3921\n",
      "lympho_poor_tumor     3806\n",
      "lympho_rich_stroma     991\n",
      "lympho_rich_tumor     5739\n",
      "veins                  194\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "path_to_tile_manifest = 'PRO_12-123/tiles/ov_tileset'\n",
    "df = pq.ParquetDataset(path_to_tile_manifest).read().to_pandas()\n",
    "\n",
    "print(df.groupby(\"regional_label\").size())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make classification simpler and more effective, we will do additional processing of the data. Specifically, we will group together `lympho_poor_tumor` and `lympho_rich tumor` under a generic `tumor` class, link `arteries` and `veins` under a `vessels` class, as well as remove classes with relatively few samples to get rid of noise. After doing so, we may check the regional labels again. Note that the tumor classes grew in size and that there are fewer labels in total. Our regional labels now match the `label_set` that we defined earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 290,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "regional_label\n",
      "stroma      991\n",
      "tumor      9545\n",
      "vessels    4115\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "df['regional_label'] = df['regional_label'].str.replace('arteries', 'vessels')\n",
    "df['regional_label'] = df['regional_label'].str.replace('veins', 'vessels')\n",
    "\n",
    "df['regional_label'] = df['regional_label'].str.replace('lympho_poor_tumor', 'tumor')\n",
    "df['regional_label'] = df['regional_label'].str.replace('lympho_rich_tumor', 'tumor')\n",
    "\n",
    "df['regional_label'] = df['regional_label'].str.replace('lympho_rich_stroma', 'stroma')\n",
    "\n",
    "df = df[df['regional_label'] != 'adipocytes']\n",
    "\n",
    "print(df.groupby(\"regional_label\").size())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TileDataset is a class that inherits from the PyTorch Dataset class. It transforms our dataset into a Tensor, specifies the labelset, and defines functions to calculate the length of the dataset and to get an item from the dataset. The main difference between this class and the PyTorch class is the latter function; to retrieve an item from the dataset, TileDataset uses image offset to retrieve a tile from an image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 291,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_local = TileDataset(df, label_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With just a little more processing, we can reset the indices in our dataframe, convert patient IDs to numerical values if need be (in this case, our patient IDs are already numeric, so this line is commented out), and convert regional labels to integers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 292,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>patient_id</th>\n",
       "      <th>id_slide_container</th>\n",
       "      <th>address</th>\n",
       "      <th>coordinates</th>\n",
       "      <th>otsu_score</th>\n",
       "      <th>purple_score</th>\n",
       "      <th>regional_label</th>\n",
       "      <th>tile_image_offset</th>\n",
       "      <th>tile_image_length</th>\n",
       "      <th>tile_image_size_xy</th>\n",
       "      <th>tile_image_mode</th>\n",
       "      <th>data_path</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4</td>\n",
       "      <td>2551028</td>\n",
       "      <td>x128_y72_z20</td>\n",
       "      <td>(128, 72)</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>4.446781e+08</td>\n",
       "      <td>49152.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>RGB</td>\n",
       "      <td>/gpfs/mskmindhdp_emc/user/shared_data_folder/p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>2551028</td>\n",
       "      <td>x128_y73_z20</td>\n",
       "      <td>(128, 73)</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>4.447273e+08</td>\n",
       "      <td>49152.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>RGB</td>\n",
       "      <td>/gpfs/mskmindhdp_emc/user/shared_data_folder/p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>2551028</td>\n",
       "      <td>x128_y74_z20</td>\n",
       "      <td>(128, 74)</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>4.447764e+08</td>\n",
       "      <td>49152.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>RGB</td>\n",
       "      <td>/gpfs/mskmindhdp_emc/user/shared_data_folder/p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>2551028</td>\n",
       "      <td>x128_y75_z20</td>\n",
       "      <td>(128, 75)</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>4.448256e+08</td>\n",
       "      <td>49152.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>RGB</td>\n",
       "      <td>/gpfs/mskmindhdp_emc/user/shared_data_folder/p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>2551028</td>\n",
       "      <td>x128_y76_z20</td>\n",
       "      <td>(128, 76)</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>4.448748e+08</td>\n",
       "      <td>49152.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>RGB</td>\n",
       "      <td>/gpfs/mskmindhdp_emc/user/shared_data_folder/p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14646</th>\n",
       "      <td>1</td>\n",
       "      <td>2551571</td>\n",
       "      <td>x453_y148_z20</td>\n",
       "      <td>(453, 148)</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0</td>\n",
       "      <td>3.706257e+09</td>\n",
       "      <td>49152.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>RGB</td>\n",
       "      <td>/gpfs/mskmindhdp_emc/user/shared_data_folder/p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14647</th>\n",
       "      <td>1</td>\n",
       "      <td>2551571</td>\n",
       "      <td>x454_y146_z20</td>\n",
       "      <td>(454, 146)</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0</td>\n",
       "      <td>3.713237e+09</td>\n",
       "      <td>49152.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>RGB</td>\n",
       "      <td>/gpfs/mskmindhdp_emc/user/shared_data_folder/p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14648</th>\n",
       "      <td>1</td>\n",
       "      <td>2551571</td>\n",
       "      <td>x454_y147_z20</td>\n",
       "      <td>(454, 147)</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0</td>\n",
       "      <td>3.713286e+09</td>\n",
       "      <td>49152.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>RGB</td>\n",
       "      <td>/gpfs/mskmindhdp_emc/user/shared_data_folder/p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14649</th>\n",
       "      <td>1</td>\n",
       "      <td>2551571</td>\n",
       "      <td>x454_y148_z20</td>\n",
       "      <td>(454, 148)</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0</td>\n",
       "      <td>3.713335e+09</td>\n",
       "      <td>49152.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>RGB</td>\n",
       "      <td>/gpfs/mskmindhdp_emc/user/shared_data_folder/p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14650</th>\n",
       "      <td>1</td>\n",
       "      <td>2551571</td>\n",
       "      <td>x455_y143_z20</td>\n",
       "      <td>(455, 143)</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2</td>\n",
       "      <td>3.719725e+09</td>\n",
       "      <td>49152.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>RGB</td>\n",
       "      <td>/gpfs/mskmindhdp_emc/user/shared_data_folder/p...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>14651 rows × 12 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      patient_id id_slide_container        address coordinates  otsu_score  \\\n",
       "0              4            2551028   x128_y72_z20   (128, 72)         1.0   \n",
       "1              4            2551028   x128_y73_z20   (128, 73)         1.0   \n",
       "2              4            2551028   x128_y74_z20   (128, 74)         1.0   \n",
       "3              4            2551028   x128_y75_z20   (128, 75)         1.0   \n",
       "4              4            2551028   x128_y76_z20   (128, 76)         1.0   \n",
       "...          ...                ...            ...         ...         ...   \n",
       "14646          1            2551571  x453_y148_z20  (453, 148)         1.0   \n",
       "14647          1            2551571  x454_y146_z20  (454, 146)         1.0   \n",
       "14648          1            2551571  x454_y147_z20  (454, 147)         1.0   \n",
       "14649          1            2551571  x454_y148_z20  (454, 148)         1.0   \n",
       "14650          1            2551571  x455_y143_z20  (455, 143)         1.0   \n",
       "\n",
       "       purple_score  regional_label  tile_image_offset  tile_image_length  \\\n",
       "0               1.0               1       4.446781e+08            49152.0   \n",
       "1               1.0               1       4.447273e+08            49152.0   \n",
       "2               1.0               1       4.447764e+08            49152.0   \n",
       "3               1.0               1       4.448256e+08            49152.0   \n",
       "4               1.0               1       4.448748e+08            49152.0   \n",
       "...             ...             ...                ...                ...   \n",
       "14646           1.0               0       3.706257e+09            49152.0   \n",
       "14647           1.0               0       3.713237e+09            49152.0   \n",
       "14648           1.0               0       3.713286e+09            49152.0   \n",
       "14649           1.0               0       3.713335e+09            49152.0   \n",
       "14650           1.0               2       3.719725e+09            49152.0   \n",
       "\n",
       "       tile_image_size_xy tile_image_mode  \\\n",
       "0                   128.0             RGB   \n",
       "1                   128.0             RGB   \n",
       "2                   128.0             RGB   \n",
       "3                   128.0             RGB   \n",
       "4                   128.0             RGB   \n",
       "...                   ...             ...   \n",
       "14646               128.0             RGB   \n",
       "14647               128.0             RGB   \n",
       "14648               128.0             RGB   \n",
       "14649               128.0             RGB   \n",
       "14650               128.0             RGB   \n",
       "\n",
       "                                               data_path  \n",
       "0      /gpfs/mskmindhdp_emc/user/shared_data_folder/p...  \n",
       "1      /gpfs/mskmindhdp_emc/user/shared_data_folder/p...  \n",
       "2      /gpfs/mskmindhdp_emc/user/shared_data_folder/p...  \n",
       "3      /gpfs/mskmindhdp_emc/user/shared_data_folder/p...  \n",
       "4      /gpfs/mskmindhdp_emc/user/shared_data_folder/p...  \n",
       "...                                                  ...  \n",
       "14646  /gpfs/mskmindhdp_emc/user/shared_data_folder/p...  \n",
       "14647  /gpfs/mskmindhdp_emc/user/shared_data_folder/p...  \n",
       "14648  /gpfs/mskmindhdp_emc/user/shared_data_folder/p...  \n",
       "14649  /gpfs/mskmindhdp_emc/user/shared_data_folder/p...  \n",
       "14650  /gpfs/mskmindhdp_emc/user/shared_data_folder/p...  \n",
       "\n",
       "[14651 rows x 12 columns]"
      ]
     },
     "execution_count": 292,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# reset index\n",
    "df_nh = df.reset_index()\n",
    "# convert patient_ids to numerical values\n",
    "# df_nh['patient_id'] = [id[2:] for id in df_nh['patient_id']]\n",
    "# convert labels to ints\n",
    "df_nh['regional_label'] = [label_set[label] for label in df_nh['regional_label']]\n",
    "df_nh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our model would have biased predictions if data from the same patient appeared in both the training and the validation set. This is because the model would be validating on near-identical data to data it has already seen.\n",
    "\n",
    "Additionally, in order to maximize the efficacy of our model, it is important to protect against an unbalanced amount of data from each label set appearing in the test and validation sets. For instance, if our model was trained on mostly tumor and arteries data (data with many instances in the dataset), then it might incorrectly identify data labeled as a vein if it saw that data in the validation set.\n",
    "\n",
    "For these two reasons, it is critical to stratisfy our data by patient ID while balancing regional labels. We pass our data through a function to do this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 293,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10804\n",
      "2125\n"
     ]
    }
   ],
   "source": [
    "# stratify by patient id while balancing regional_label\n",
    "train_sampler, val_sampler = get_group_stratified_sampler(\n",
    "    df_nh, df_nh[\"regional_label\"], df_nh[\"patient_id\"], split=validation_split\n",
    ")\n",
    "print(next(iter(train_sampler)))\n",
    "print(next(iter(val_sampler)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 294,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "127\n",
      "103\n"
     ]
    }
   ],
   "source": [
    "train_loader = DataLoader(\n",
    "        dataset_local,\n",
    "        num_workers=n_workers,\n",
    "        worker_init_fn=seed_workers,\n",
    "        shuffle=shuffle_dataset,\n",
    "        batch_size=batch_size,\n",
    "        sampler=train_sampler,\n",
    "    )\n",
    "\n",
    "validation_loader = DataLoader(\n",
    "        dataset_local,\n",
    "        num_workers=n_workers,\n",
    "        worker_init_fn=seed_workers,\n",
    "        shuffle=shuffle_dataset,\n",
    "        batch_size=batch_size,\n",
    "        sampler=val_sampler,\n",
    "    )\n",
    "\n",
    "data = {\"train\": train_loader, \"val\": validation_loader}\n",
    "print(len(train_loader))\n",
    "print(len(validation_loader))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will define the model. This model uses a ResNet18 architecture built into PyTorch, a commonly used deep residual network, and optimized the Adam gradient descent optimizer "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 295,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define model\n",
    "network = models.resnet18(num_classes=len(label_set))\n",
    "optimizer = optim.Adam(network.parameters(), lr=lr)\n",
    "\n",
    "# class balanced cross entropy \n",
    "train_labels = df_nh[\"regional_label\"][train_sampler]\n",
    "#class_weights = torch.Tensor([1/count for count in Counter(train_labels).values()])\n",
    "\n",
    "#criterion = nn.CrossEntropyLoss(weight = class_weights.to(device='cuda'))\n",
    "criterion = nn.CrossEntropyLoss() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train and save model\n",
    "\n",
    "Finally, we are ready to train our classifier and save it in the output file!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 296,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n",
      "validating\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/classifier/analyze.py:30: RuntimeWarning: invalid value encountered in true_divide\n",
      "  cm_perc = cm / cm_sum.astype(float) * 100\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "saved model checkpoint\n",
      "1\n",
      "2\n",
      "validating\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/classifier/analyze.py:30: RuntimeWarning: invalid value encountered in true_divide\n",
      "  cm_perc = cm / cm_sum.astype(float) * 100\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "saved model checkpoint\n",
      "3\n",
      "4\n",
      "validating\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/classifier/analyze.py:30: RuntimeWarning: invalid value encountered in true_divide\n",
      "  cm_perc = cm / cm_sum.astype(float) * 100\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "saved model checkpoint\n",
      "5\n",
      "6\n",
      "validating\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/classifier/analyze.py:30: RuntimeWarning: invalid value encountered in true_divide\n",
      "  cm_perc = cm / cm_sum.astype(float) * 100\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "saved model checkpoint\n",
      "7\n",
      "8\n",
      "validating\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/gpfs/mskmindhdp_emc/user/shared_data_folder/pathology-tutorial/classifier/analyze.py:30: RuntimeWarning: invalid value encountered in true_divide\n",
      "  cm_perc = cm / cm_sum.astype(float) * 100\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "saved model checkpoint\n",
      "9\n",
      "CPU times: user 2min 25s, sys: 1min 38s, total: 4min 3s\n",
      "Wall time: 5min 9s\n"
     ]
    }
   ],
   "source": [
    "%%time \n",
    "\n",
    "model = Classifier(\n",
    "        network,\n",
    "        criterion,\n",
    "        optimizer,\n",
    "        data,\n",
    "        label_set,\n",
    "        output_dir=output_dir,\n",
    "    )\n",
    "\n",
    "for n_epoch in range(epochs):\n",
    "    print(n_epoch)\n",
    "    _ = model.train(n_epoch)\n",
    "\n",
    "    if n_epoch % 2 == 0:\n",
    "        print(\"validating\")\n",
    "        _ = model.validate(n_epoch)\n",
    "        model.save_ckpt(os.path.join(output_dir, 'ckpts'))\n",
    "\n",
    "pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize TensorBoard logging\n",
    "\n",
    "Tensorboard is a tool used to inspect and understand machine learning models. We will use it to provide measurements for our model and view the outputs. First, we must load the TensorBoard notebook extension and import corresponding package. In the following code block, we do so and list the TensorBoard instances running on our ports currently:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 297,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The tensorboard extension is already loaded. To reload it, use:\n",
      "  %reload_ext tensorboard\n",
      "Known TensorBoard instances:\n",
      "  - port 6009: logdir tensorboard (started 9 days, 18:48:12 ago; pid 125888)\n",
      "  - port 6010: logdir tensorboard/ (started 9 days, 18:29:58 ago; pid 2118)\n",
      "  - port 6011: logdir ./PRO_12-123/model/outputs/tensorboard (started 9 days, 17:53:38 ago; pid 63424)\n"
     ]
    }
   ],
   "source": [
    "%load_ext tensorboard\n",
    "\n",
    "notebook.list()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To kill any of the processes running on your ports, you may run the lsof kill command. Replace 6007 with your desired port."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 298,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# kill $(lsof -t -i:6008)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TensorBoard can generate figures that describe the accuracies and failings of our model based on the output files that we directed to *output_dir*. To view these figures, we can run a simple command that points the log directory to our local destination. When you run the following code block, you should see a TensorBoard extension within this notebook.\n",
    "\n",
    "The first page, under the header **SCALARS**, depicts the progression of our model's accuracy and loss scalars as it trains on the data. We can expect the accuracy to rise almost logistically and the loss to tend to zero. \n",
    "\n",
    "The second page, under the header **IMAGES**, shows the validation confusion matrix. Note that the model should have higher marks for the labels that were more abundant in the training set, namely the tumors and arteries. Since the model is trained on so little data, do not be surprised if the confusion matrix differs quite drastically from the identity matrix.\n",
    "\n",
    "The remaining pages will depict additional figures for our model, including a diagram relating our input and output to the model (residual neural network). Feel free to spend time exploring this extension!"
   ]
  },
    {
     "cell_type": "code",
     "execution_count": 299,
     "metadata": {},
     "outputs": [
      {
       "data": {
        "text/plain": [
         "Reusing TensorBoard on port 6011 (pid 63424), started 9 days, 17:53:38 ago. (Use '!kill 63424' to kill it.)"
        ]
       },
       "metadata": {},
       "output_type": "display_data"
      }
     ],
     "source": [
      "tensorboard --logdir=./PRO_12-123/model/outputs/tensorboard --bind_all"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "![Tensorboard Screenshot](img/tensorboard-screenshot.png)"
     ]
    },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point, you have completed this notebook. Congratulations! You have successfully trained a deep learning model on your toy dataset! You are now ready to continue to the inference stage of the pipeline. Please navigate to the inference-and-visualization notebook."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}