{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deep Otello AI\n",
    "\n",
    "The game reversi is a very good game to apply deep learning methods to.\n",
    "\n",
    "Othello, also known as reversi, is a board game first published in 1883 by either Lewis Waterman or John W. Mollet in England (each one was denouncing the other as fraud).\n",
    "It is a strict turn-based zero-sum game with a clear Markov chain and no hidden states, unlike card games with an unknown distribution of cards or unknown player allegiance.\n",
    "The game is played with one set of stones with two colors, which is much easier to abstract than chess with its 6 unique pieces.\n",
    "The game board is symmetrical and allows for playing with rotating the state around an axis or flipping/mirroring the board, which can allow for a breaking of sequences or interesting ANN architectures, quadruple the data generation by simulation, or interesting test cases where symmetry in turns should be observable if the AI reaches an \"objective\" policy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The game rules\n",
    "\n",
    "Othello is a turn-based, two-player board game played on an 8 x 8 board, with a similar geometry to a chess game. The game pieces are black on one side and white on the other.\n",
    "\n",
    "![Othello game board example](reversi_example.png)\n",
    "\n",
    "The players take turns placing their stones on the board, and the objective is to surround the opponent's stones with your own stones. A player can only place a stone when it surrounds at least one of the opponent's stones with their own stones, either horizontally, vertically, or diagonally. When a player places a stone, all the surrounded stones will flip to become the player's color. If a player cannot make a move, they are skipped. The game ends when both players cannot make any more moves. The player with the most stones on the board wins, and any unclaimed fields go to the player with the most stones of their color on the board. The game starts with four stones placed in the center of the board, with each player getting two, which are placed diagonally opposite to each other.\n",
    "\n",
    "\n",
    "<img alt=\"Startaufstellung.png\" src=\"Startaufstellung.png\"/>\n",
    "\n",
    "## Some common Othello strategies\n",
    "\n",
    "The placement of stones on the board is always a careful balance of attack and defense. Occupying large homogeneous stretches on the board can make it easier for the opponent to attack. The board's corners provide safety, from which occupied territory is impossible to lose, but they are difficult to obtain. The enemy must be forced to allow reaching the corners or calculate the cost of giving a stable base to the opponent. Some Othello computer strategies implement greedy algorithms based on a modified score for each field. Different values serve as score modifiers for a traditional greedy algorithm. When a player's stone captures a field, the score reached is multiplied by the modifier. The total score is the score reached by the player minus the score of the opponent. The scores change during the game and converge towards one, which gives some indications of what to expect from an Othello AI.\n",
    "\n",
    "<img alt=\"ComputerPossitionScore\" src=\"computer-score.png\"/>\n",
    "\n",
    "\n",
    "## Initial design decisions\n",
    "\n",
    "At the beginning of this project, I made some design decisions. The first one was that I did not want to use a gym library because it limits the data formats accessible. I chose to implement the whole game as an entry in a stack of NumPy arrays to be able to accommodate interfacing with a neural network easier and to use SciPy pattern recognition tools to implement some game mechanics for a fast simulation cycle. In the array format, stones from the player are marked as 1, and stones by the enemy are marked as -1. I chose to ignore player colors as far as I could; instead, a player perspective was used, which allowed changing the perspective with a flipping of the sign (multiplying with -1). The array format should also allow for data multiplication or the breaking of strict sequences by flipping the game along one of the four axes (horizontal, vertical, transpose along both diagonals).\n",
    "\n",
    "I wanted to implement different agents as classes that act on those game stacks. Since computation time is critical, all computational results are saved. The analysis of those is then repeated in real-time. If a recalculation of such a section is required, the save file can be deleted, and the code should be executed again.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:09.663807Z",
     "end_time": "2023-03-30T23:51:12.506631Z"
    }
   },
   "outputs": [],
   "source": [
    "%load_ext blackcellmagic\n",
    "%load_ext line_profiler\n",
    "%load_ext memory_profiler"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports and dependencies\n",
    "\n",
    "The following direct dependencies where used for this project:\n",
    "```toml\n",
    "jupyter = \"^1.0.0\"\n",
    "matplotlib = \"^3.6.3\"\n",
    "numpy = \"^1.24.1\"\n",
    "pytest = \"^7.2.1\"\n",
    "python = \"3.10.*\"\n",
    "scipy = \"^1.10.0\"\n",
    "tqdm = \"^4.64.1\"\n",
    "jupyterlab = \"^3.6.1\"\n",
    "torchvision = \"^0.14.1\"\n",
    "torchaudio = \"^0.13.1\"\n",
    "```\n",
    "* `Jupyter` and `jupyterlab` on pycharm was used as an IDE / Ipython was used to implement this code.\n",
    "* `matplotlib` was used for visualisation and statistics.\n",
    "* `numpy` was used for array support and mathematical functions\n",
    "* `tqdm` was used for progress bars\n",
    "* `scipy` contains fast pattern recognition tools for images. It was used to make an initial estimation about where possible turns should be.\n",
    "* `torch` supplied the ANN functionalities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:09.705213Z",
     "end_time": "2023-03-30T23:51:12.569337Z"
    }
   },
   "outputs": [],
   "source": [
    "import pickle\n",
    "import abc\n",
    "import itertools\n",
    "import os.path\n",
    "from abc import ABC\n",
    "from enum import Enum\n",
    "from typing import Final\n",
    "from IPython.display import clear_output, display\n",
    "from pathlib import Path\n",
    "import glob\n",
    "import copy\n",
    "from functools import lru_cache, wraps\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "from torch.nn import functional\n",
    "from ipywidgets import interact\n",
    "from scipy.ndimage import binary_dilation\n",
    "from tqdm.notebook import tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Constants\n",
    "\n",
    "Some general constants needed to be defined. Such as board game size and Player and Enemy representations. Also, directional offsets and the initial placement of blocks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:09.735610Z",
     "end_time": "2023-03-30T23:51:12.615852Z"
    }
   },
   "outputs": [],
   "source": [
    "BOARD_SIZE: Final[int] = 8  # defines the board side length as 8\n",
    "PLAYER: Final[int] = 1  # defines the number symbolising the player as 1\n",
    "ENEMY: Final[int] = -1  # defines the number symbolising the enemy as -1\n",
    "EXAMPLE_STACK_SIZE: Final[int] = 1000  # defines the game stack size for examples\n",
    "IMPOSSIBLE: Final[np.ndarray] = np.array([-1, -1], dtype=int)\n",
    "IMPOSSIBLE.setflags(write=False)\n",
    "SIMULATE_TURNS: Final[int] = 70\n",
    "VERIFY_POLICY: Final[bool] = False\n",
    "TRAINING_RESULT_PATH: Final[Path] = Path(\"training_data\")\n",
    "if not os.path.exists(TRAINING_RESULT_PATH):\n",
    "    os.mkdir(TRAINING_RESULT_PATH)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The directions array contains all the numerical offsets needed to move along one of the 8 directions in a 2 dimensional grid. This will allow an iteration over the game board.\n",
    "\n",
    "![8-directions.png](8-directions.png \"Offset in 8 directions\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:09.784263Z",
     "end_time": "2023-03-30T23:51:12.653589Z"
    }
   },
   "outputs": [],
   "source": [
    "DIRECTIONS: Final[np.ndarray] = np.array(\n",
    "    [[i, j] for i in range(-1, 2) for j in range(-1, 2) if j != 0 or i != 0],\n",
    "    dtype=int,\n",
    ")\n",
    "DIRECTIONS.setflags(write=False)\n",
    "DIRECTIONS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another constant needed is the initial start square at the center of the board."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:09.819690Z",
     "end_time": "2023-03-30T23:51:12.680563Z"
    }
   },
   "outputs": [],
   "source": [
    "START_SQUARE: Final[np.ndarray] = np.array(\n",
    "    [[ENEMY, PLAYER], [PLAYER, ENEMY]], dtype=int\n",
    ")\n",
    "START_SQUARE.setflags(write=False)\n",
    "START_SQUARE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating new boards\n",
    "\n",
    "The first function implemented and tested is a function to generate the starting environment as a stack of games.\n",
    "As described above I simply placed a 2 by 2 square in the center of an empty stack of boards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:09.855798Z",
     "end_time": "2023-03-30T23:51:12.745419Z"
    }
   },
   "outputs": [],
   "source": [
    "def get_new_games(number_of_games: int) -> np.ndarray:\n",
    "    \"\"\"Generates a stack of initialised game boards.\n",
    "\n",
    "    Args:\n",
    "        number_of_games: The size of the board stack.\n",
    "\n",
    "    Returns: The generates stack of games as a stack n x 8 x 8.\n",
    "\n",
    "    \"\"\"\n",
    "    empty = np.zeros([number_of_games, BOARD_SIZE, BOARD_SIZE], dtype=int)\n",
    "    empty[:, 3:5, 3:5] = START_SQUARE\n",
    "    return empty\n",
    "\n",
    "\n",
    "get_new_games(1)[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:09.873794Z",
     "end_time": "2023-03-30T23:51:12.776309Z"
    }
   },
   "outputs": [],
   "source": [
    "test_number_of_games = 3\n",
    "assert get_new_games(test_number_of_games).shape == (\n",
    "    test_number_of_games,\n",
    "    BOARD_SIZE,\n",
    "    BOARD_SIZE,\n",
    ")\n",
    "np.testing.assert_equal(\n",
    "    get_new_games(test_number_of_games).sum(axis=1),\n",
    "    np.zeros(\n",
    "        [\n",
    "            test_number_of_games,\n",
    "            8,\n",
    "        ]\n",
    "    ),\n",
    ")\n",
    "np.testing.assert_equal(\n",
    "    get_new_games(test_number_of_games).sum(axis=2),\n",
    "    np.zeros(\n",
    "        [\n",
    "            test_number_of_games,\n",
    "            8,\n",
    "        ]\n",
    "    ),\n",
    ")\n",
    "assert np.all(get_new_games(test_number_of_games)[:, 3:4, 3:4] != 0)\n",
    "del test_number_of_games"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualisation tools\n",
    "\n",
    "In this section a visualisation help was implemented for debugging of the game and a proper display of the results.\n",
    "For this visualisation ChatGPT was used as a prompted code generator that was later reviewed and refactored by hand to integrate seamlessly into the project as a whole.\n",
    "White stones represent the player, black stones the enemy. A single plot can be used as a subplot when the `ax` argument is used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:09.900138Z",
     "end_time": "2023-03-30T23:51:12.915811Z"
    }
   },
   "outputs": [],
   "source": [
    "def plot_othello_board(\n",
    "    board: np.ndarray | torch.Tensor,\n",
    "    action: np.ndarray | None = None,\n",
    "    score: float | None = None,\n",
    "    ax=None,\n",
    ") -> None:\n",
    "    \"\"\"Plots a single otello board.\n",
    "\n",
    "    If a matplot axis object is given the board will be plotted into that axis. If not an axis object will be generated.\n",
    "    The image generated will be shown directly.\n",
    "\n",
    "    Args:\n",
    "        board: The bord that should be plotted. Only a single games is allowed. A numpy array of the form 8x8 is expected.\n",
    "        action: The action taken on each board.\n",
    "        score: The score reached with the turn.\n",
    "        ax: If needed a matplotlib axis object can be defined that is used to place the board as a sublot into a bigger context.\n",
    "    \"\"\"\n",
    "    # convert a tensor into an array\n",
    "    if isinstance(board, torch.Tensor):\n",
    "        board = board.cpu().detach().numpy()\n",
    "\n",
    "    # ensure the shape of the array fits\n",
    "    assert board.shape == (8, 8)\n",
    "    plot_all = False\n",
    "\n",
    "    # create a figure if no axis is given\n",
    "    if ax is None:\n",
    "        fig_size = 3\n",
    "        plot_all = True\n",
    "        fig, ax = plt.subplots(figsize=(fig_size, fig_size))\n",
    "\n",
    "    # set the background color\n",
    "    ax.set_facecolor(\"#0f6b28\")\n",
    "\n",
    "    # plot the actions\n",
    "    if action is not None:\n",
    "        ax.scatter(action[0], action[1], s=350 if plot_all else 200, c=\"red\")\n",
    "\n",
    "    # plot black and white stones\n",
    "    for x_pos, y_pos in itertools.product(range(BOARD_SIZE), range(BOARD_SIZE)):\n",
    "        if board[x_pos, y_pos] == ENEMY:\n",
    "            color = \"white\"\n",
    "        elif board[x_pos, y_pos] == PLAYER:\n",
    "            color = \"black\"\n",
    "        else:\n",
    "            continue\n",
    "        ax.scatter(x_pos, y_pos, s=260 if plot_all else 140, c=color)\n",
    "\n",
    "    # plot the lines separating the fields\n",
    "    for x_pos in range(-1, 8):\n",
    "        ax.axhline(x_pos + 0.5, color=\"black\", lw=2)\n",
    "        ax.axvline(x_pos + 0.5, color=\"black\", lw=2)\n",
    "\n",
    "    # define the size of the plot\n",
    "    ax.set_xlim(-0.5, 7.5)\n",
    "    ax.set_ylim(7.5, -0.5)\n",
    "\n",
    "    # set the axis labels\n",
    "    ax.set_xticks(np.arange(8))\n",
    "    ax.set_xticklabels(list(\"ABCDEFGH\"))\n",
    "    ax.set_yticks(np.arange(8))\n",
    "    ax.set_yticklabels(list(\"12345678\"))\n",
    "\n",
    "    # overrides the x_label text if a score should be plotted\n",
    "    if score is None:\n",
    "        ax.set_xlabel(\n",
    "            f\"B{np.sum(board == PLAYER)} / {np.sum(board == 0)} / W{np.sum(board == ENEMY)}\"\n",
    "        )\n",
    "    else:\n",
    "        ax.set_xlabel(f\"Score: {score}\")\n",
    "    if plot_all:\n",
    "        plt.tight_layout()\n",
    "        plt.show()\n",
    "\n",
    "\n",
    "plot_othello_board(get_new_games(1)[0], action=np.array([3, 3]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:10.199792Z",
     "end_time": "2023-03-30T23:51:12.916815Z"
    }
   },
   "outputs": [],
   "source": [
    "PLOTS_PER_ROW = 4\n",
    "\n",
    "\n",
    "def plot_othello_boards(\n",
    "    boards: np.ndarray,\n",
    "    actions: np.ndarray | None = None,\n",
    "    scores: np.ndarray | None = None,\n",
    ") -> None:\n",
    "    \"\"\"Plots multiple boards into subplots.\n",
    "\n",
    "    The plots are shown directly.\n",
    "\n",
    "    Args:\n",
    "        boards: Plots the boards given into subplots. The maximum number of boards accepted is 70.\n",
    "        actions: A list of actions taken on each of the boards.\n",
    "        scores: A list of scores reached at each board.\n",
    "    \"\"\"\n",
    "    # checking if the array input shapes do fit\n",
    "    assert len(boards.shape) == 3\n",
    "    assert boards.shape[1:] == (BOARD_SIZE, BOARD_SIZE)\n",
    "    assert boards.shape[0] < 70\n",
    "\n",
    "    if actions is not None:\n",
    "        assert len(actions.shape) == 2\n",
    "        assert actions.shape[1] == 2\n",
    "        assert boards.shape[0] == actions.shape[0]\n",
    "\n",
    "    if scores is not None:\n",
    "        assert len(scores.shape) == 1\n",
    "        assert boards.shape[0] == scores.shape[0]\n",
    "\n",
    "    # plots the boards\n",
    "    rows = int(np.ceil(boards.shape[0] / PLOTS_PER_ROW))\n",
    "    fig, axs = plt.subplots(rows, PLOTS_PER_ROW, figsize=(12, 3 * rows))\n",
    "    for game_index, ax in enumerate(axs.flatten()):\n",
    "        if game_index >= boards.shape[0]:\n",
    "            fig.delaxes(ax)\n",
    "        else:\n",
    "            action = actions[game_index] if actions is not None else None\n",
    "            score = scores[game_index] if scores is not None else None\n",
    "            plot_othello_board(boards[game_index], action=action, score=score, ax=ax)\n",
    "    plt.tight_layout()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:10.215799Z",
     "end_time": "2023-03-30T23:51:12.916815Z"
    }
   },
   "outputs": [],
   "source": [
    "def drop_duplicate_boards(\n",
    "    boards: np.ndarray,\n",
    "    actions: np.ndarray | None,\n",
    ") -> tuple[np.ndarray, np.ndarray | None]:\n",
    "    \"\"\"Takes a sequence of boards and drops all boards that are unchanged.\n",
    "\n",
    "    Args:\n",
    "        boards: A list of boards to be reduced.\n",
    "        actions: A list of actions to be reduced alongside the bords.\n",
    "\n",
    "    Returns:\n",
    "        A sequence of boards where boards that where equal are dropped.\n",
    "    \"\"\"\n",
    "    non_duplicates = ~np.all(boards == np.roll(boards, axis=0, shift=1), axis=(1, 2))\n",
    "    return (\n",
    "        boards[non_duplicates],\n",
    "        np.roll(actions, axis=0, shift=1)[non_duplicates]\n",
    "        if actions is not None\n",
    "        else None,\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hash Otello Boards\n",
    "\n",
    "A challenge for training any reinforcement learning algorithm is how to properly calibrate the exploration rate.\n",
    "To make huge numbers of boards comparable it is easier to work with hashes than with the acutal boards. For that purpose a functionalty to hash a board and a stack of boards was added."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [],
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:10.232792Z",
     "end_time": "2023-03-30T23:51:12.992902Z"
    }
   },
   "outputs": [],
   "source": [
    "def hash_board(board: np.ndarray) -> int:\n",
    "    assert board.shape == (8, 8) or board.shape == (64,)\n",
    "    return hash(tuple(board.reshape(-1)))\n",
    "\n",
    "\n",
    "def count_unique_boards(boards: np.ndarray) -> int:\n",
    "    return np.unique(\n",
    "        np.apply_along_axis(hash_board, axis=1, arr=boards.reshape(-1, 64))\n",
    "    ).size\n",
    "\n",
    "\n",
    "a = count_unique_boards(np.random.randint(-1, 2, size=(10_000, 8, 8)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find possible actions to take\n",
    "\n",
    "The frist step in the implementation of an AI like this is to get an overview over the possible actions that can be taken in a situation.\n",
    "Here was the design choice taken to first find fields that are empty and have at least one neighbouring enemy stone.\n",
    "This was implemented with element wise check for a stone and a binary dilation marking all fields neighboring an enemy stone.\n",
    "For that the `SURROUNDING` mask was used. Both aries are then element wise combined using and.\n",
    "The resulting array contains all filed where a turn could potentially be made. Those are then check in detail.\n",
    "The previous element wise operations on the numpy array increase the spead for this operation dramatically.\n",
    "\n",
    "The check for a possible turn is done in detail by following each direction step by step as long as there are enemy stones in that direction.\n",
    "If the board end is reached or en empty filed before reaching a field occupied by the player that direction does not surround enemy stones.\n",
    "If one direction surrounds enemy stone a turn is possible.\n",
    "This detailed step is implemented as a recursion and need to go at leas one step to return True."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [],
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:10.362177Z",
     "end_time": "2023-03-30T23:51:13.002888Z"
    }
   },
   "outputs": [],
   "source": [
    "SURROUNDING: Final = np.array(\n",
    "    [[[1, 1, 1], [1, 0, 1], [1, 1, 1]]]\n",
    ")  # defines the binary dilation mask to check if a field is next to an enemy stones\n",
    "SURROUNDING"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To optimize the computation of this game, the `lru_cache` decorator was utilized. LRU cache stores the hash of the arguments and returns the previously calculated result of a computationally heavy operation. However, since Numpy arrays are mutable and unhashable, a code snippet was modified to include conversion to tuples, caching layer, and reconversion to Numpy arrays. This allows for the caching to be implemented. As a result, the calculation time of possible actions to take was reduced to only 30% of the time it takes without the lru_cache decorator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [],
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:10.380173Z",
     "end_time": "2023-03-30T23:51:13.003895Z"
    }
   },
   "outputs": [],
   "source": [
    "# Source https://gist.github.com/Susensio/61f4fee01150caaac1e10fc5f005eb75\n",
    "def np_cache(*lru_args, array_argument_elements: tuple[int, ...] = (0,), **lru_kwargs):\n",
    "    \"\"\"\n",
    "    LRU cache implementation for functions whose parameter at ``array_argument_index`` is a numpy array of dimensions <= 2\n",
    "\n",
    "    Example:\n",
    "    >>> from sem_env.utils.cache import np_cache\n",
    "    >>> array = np.array([[1, 2, 3], [4, 5, 6]])\n",
    "    >>> @np_cache(maxsize=256)\n",
    "    ... def multiply(array, factor):\n",
    "    ...     return factor * array\n",
    "    >>> multiply(array, 2)\n",
    "    >>> multiply(array, 2)\n",
    "    >>> multiply.cache_info()\n",
    "    CacheInfo(hits=1, misses=1, maxsize=256, currsize=1)\n",
    "    \"\"\"\n",
    "\n",
    "    def decorator(function):\n",
    "        @wraps(function)\n",
    "        def wrapper(*args, **kwargs):\n",
    "            args = list(args)\n",
    "            for array_argument_index in array_argument_elements:\n",
    "                np_array = args[array_argument_index]\n",
    "                if len(np_array.shape) > 2:\n",
    "                    raise RuntimeError(\n",
    "                        f\"np_cache is currently only supported for arrays of dim. less than 3 but got shape: {np_array.shape}\"\n",
    "                    )\n",
    "                hashable_array = tuple(np_array.reshape(-1))\n",
    "\n",
    "                args[array_argument_index] = hashable_array\n",
    "            return cached_wrapper(*args, **kwargs)\n",
    "\n",
    "        @lru_cache(*lru_args, **lru_kwargs)\n",
    "        def cached_wrapper(*args, **kwargs):\n",
    "            args = list(args)\n",
    "            for array_argument_index in array_argument_elements:\n",
    "                hashable_array = args[array_argument_index]\n",
    "                array = np.array(hashable_array).reshape(8, 8)\n",
    "                args[array_argument_index] = array\n",
    "            return function(*args, **kwargs)\n",
    "\n",
    "        # copy lru_cache attributes over too\n",
    "        wrapper.cache_info = cached_wrapper.cache_info\n",
    "        wrapper.cache_clear = cached_wrapper.cache_clear\n",
    "        return wrapper\n",
    "\n",
    "    return decorator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:10.405144Z",
     "end_time": "2023-03-30T23:51:30.997965Z"
    }
   },
   "outputs": [],
   "source": [
    "def _recursive_steps(\n",
    "    board: np.ndarray,\n",
    "    rec_direction: np.ndarray,\n",
    "    rec_position: np.ndarray,\n",
    "    step_one: int = 0,\n",
    ") -> int:\n",
    "    \"\"\"Check if a player can place a stone on the board specified in the direction specified and direction specified.\n",
    "\n",
    "    Args:\n",
    "        board: The board that should be checked for a playable action.\n",
    "        rec_direction: The direction that should be checked.\n",
    "        rec_position: The position that should be checked.\n",
    "        step_one: Defines if the call of this function is the firs or not. Should be kept to the default value for proper functionality.\n",
    "\n",
    "    Returns:\n",
    "        True if a turn is possible for possition and direction on the board defined.\n",
    "    \"\"\"\n",
    "    rec_position = rec_position + rec_direction\n",
    "    if np.any((rec_position >= BOARD_SIZE) | (rec_position < 0)):\n",
    "        return 0\n",
    "    next_field = board[tuple(rec_position.tolist())]\n",
    "    if next_field == 0:\n",
    "        return 0\n",
    "    if next_field == -1:\n",
    "        return _recursive_steps(\n",
    "            board, rec_direction, rec_position, step_one=step_one + 1\n",
    "        )\n",
    "    if next_field == 1:\n",
    "        return step_one\n",
    "\n",
    "\n",
    "@np_cache(maxsize=2000, array_argument_elements=(0, 1))\n",
    "def _get_possible_turns_for_board(\n",
    "    board: np.ndarray, poss_turns: np.ndarray\n",
    ") -> np.ndarray:\n",
    "    \"\"\"Calcualtes where turns are possible.\n",
    "\n",
    "    Args:\n",
    "        board: The board that should be checked for a playable action.\n",
    "        poss_turns: An array of actions that could be possible. All true fileds are empty and next to an enemy stone.\n",
    "    \"\"\"\n",
    "    for idx, idy in itertools.product(range(BOARD_SIZE), range(BOARD_SIZE)):\n",
    "        if poss_turns[idx, idy]:\n",
    "            position = idx, idy\n",
    "            poss_turns[idx, idy] = any(\n",
    "                _recursive_steps(board[:, :], direction, position) > 0\n",
    "                for direction in DIRECTIONS\n",
    "            )\n",
    "    return poss_turns\n",
    "\n",
    "\n",
    "def get_possible_turns(boards: np.ndarray, tqdm_on: bool = False) -> np.ndarray:\n",
    "    \"\"\"Analyses a stack of boards.\n",
    "\n",
    "    Args:\n",
    "        boards: A stack of boards to check.\n",
    "        tqdm_on: Uses tqdm to track the progress.\n",
    "\n",
    "    Returns:\n",
    "        A stack of game boards containing boolean values showing where turns are possible for the player.\n",
    "    \"\"\"\n",
    "    assert len(boards.shape) == 3, \"The number fo input dimensions does not fit.\"\n",
    "    assert boards.shape[1:] == (\n",
    "        BOARD_SIZE,\n",
    "        BOARD_SIZE,\n",
    "    ), \"The input dimensions do not fit.\"\n",
    "\n",
    "    poss_turns = boards == 0  # checks where fields are empty.\n",
    "    poss_turns &= binary_dilation(\n",
    "        boards == -1, SURROUNDING\n",
    "    )  # checks where fields are next to an enemy filed an empty\n",
    "    iterate_over = range(boards.shape[0])\n",
    "\n",
    "    if tqdm_on:\n",
    "        iterate_over = tqdm(iterate_over, total=np.prod(boards.shape))\n",
    "    for game in iterate_over:\n",
    "        poss_turns[game] = _get_possible_turns_for_board(boards[game], poss_turns[game])\n",
    "    return poss_turns\n",
    "\n",
    "\n",
    "# some simple testing to ensure the function works after simple changes\n",
    "# this testing is complete, its more of a smoke-test\n",
    "test_array = get_new_games(3)\n",
    "expected_result = np.zeros_like(test_array, dtype=bool)\n",
    "expected_result[:, 4, 5] = expected_result[:, 2, 3] = True\n",
    "expected_result[:, 5, 4] = expected_result[:, 3, 2] = True\n",
    "np.testing.assert_equal(get_possible_turns(test_array), expected_result)\n",
    "\n",
    "\n",
    "%timeit get_possible_turns(get_new_games(10))  # checks turn possibility evaluation time for 10 initial games\n",
    "# %timeit get_possible_turns(get_new_games(EXAMPLE_STACK_SIZE))  # check turn possibility evaluation time for EXAMPLE_STACK_SIZE initial games\n",
    "\n",
    "# shows a singe game\n",
    "get_possible_turns(get_new_games(3))[:1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides the ability to generate an array of possible turns there needs to be a functions that check if a given turn is possible.\n",
    "On is needed for the action space validation. The other is for validating a players turn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:31.003968Z",
     "end_time": "2023-03-30T23:51:31.089196Z"
    }
   },
   "outputs": [],
   "source": [
    "def move_possible(board: np.ndarray, move: np.ndarray) -> bool:\n",
    "    \"\"\"Checks if a turn is possible.\n",
    "\n",
    "    Checks if a turn is possible. If no turn is possible to input array [-1, -1] is expected.\n",
    "\n",
    "    Args:\n",
    "        board: A board where it should be checkt if a turn is possible.\n",
    "        move: The move that should be taken. Expected is the index of the filed where a stone should be placed [x, y]. If no placement is possible [-1, -1] is expected as an input.\n",
    "\n",
    "    Returns:\n",
    "        True if the move is possible\n",
    "    \"\"\"\n",
    "    if np.all(move == -1):\n",
    "        return not np.any(get_possible_turns(np.reshape(board, (1, 8, 8))))\n",
    "    return any(\n",
    "        _recursive_steps(board[:, :], direction, move) > 0 for direction in DIRECTIONS\n",
    "    )\n",
    "\n",
    "\n",
    "# Some testing for this function and the underlying recursive functions that are called.\n",
    "assert move_possible(get_new_games(1)[0], np.array([2, 3])) is True\n",
    "assert move_possible(get_new_games(1)[0], np.array([3, 2])) is True\n",
    "assert move_possible(get_new_games(1)[0], np.array([2, 2])) is False\n",
    "assert move_possible(np.zeros((8, 8)), np.array([3, 2])) is False\n",
    "assert move_possible(np.ones((8, 8)) * 1, np.array([-1, -1])) is True\n",
    "assert move_possible(np.ones((8, 8)) * -1, np.array([-1, -1])) is True\n",
    "assert move_possible(np.ones((8, 8)) * 0, np.array([-1, -1])) is True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:31.021960Z",
     "end_time": "2023-03-30T23:51:31.106199Z"
    }
   },
   "outputs": [],
   "source": [
    "def moves_possible(boards: np.ndarray, moves: np.ndarray) -> np.ndarray:\n",
    "    \"\"\"Checks if a stack of moves can be executed on a stack of boards.\n",
    "\n",
    "    Args:\n",
    "        boards: A board where the next stone should be placed.\n",
    "        moves: A stack stones to be placed. Each move is formatted as an array in the form of [x, y] if no turn is possible the value [-1, -1] is expected.\n",
    "\n",
    "    Returns:\n",
    "        An array marking for each and every game and move in the stack if the move can be executed.\n",
    "    \"\"\"\n",
    "    arr_moves_possible = np.zeros(boards.shape[0], dtype=bool)\n",
    "    for game in range(boards.shape[0]):\n",
    "        if np.all(\n",
    "            moves[game] == -1\n",
    "        ):  # can be all or any. All should be faster since most times neither value will be -1.\n",
    "            arr_moves_possible[game] = not np.any(\n",
    "                get_possible_turns(np.reshape(boards[game], (1, 8, 8)))\n",
    "            )\n",
    "        else:\n",
    "            arr_moves_possible[game] = any(\n",
    "                _recursive_steps(boards[game, :, :], direction, moves[game]) > 0\n",
    "                for direction in DIRECTIONS\n",
    "            )\n",
    "    return arr_moves_possible\n",
    "\n",
    "\n",
    "np.testing.assert_array_equal(\n",
    "    moves_possible(np.ones((3, 8, 8)) * 1, np.array([[-1, -1]] * 3)),\n",
    "    np.array([True] * 3),\n",
    ")\n",
    "\n",
    "np.testing.assert_array_equal(\n",
    "    moves_possible(get_new_games(3), np.array([[2, 3], [3, 2], [3, 2]])),\n",
    "    np.array([True] * 3),\n",
    ")\n",
    "np.testing.assert_array_equal(\n",
    "    moves_possible(get_new_games(3), np.array([[2, 2], [1, 1], [0, 0]])),\n",
    "    np.array([False] * 3),\n",
    ")\n",
    "np.testing.assert_array_equal(\n",
    "    moves_possible(np.ones((3, 8, 8)) * -1, np.array([[-1, -1]] * 3)),\n",
    "    np.array([True] * 3),\n",
    ")\n",
    "np.testing.assert_array_equal(\n",
    "    moves_possible(np.zeros((3, 8, 8)), np.array([[-1, -1]] * 3)),\n",
    "    np.array([True] * 3),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Execute a chosen action\n",
    "\n",
    "After an evaluation what turns are possible there needs to be a function that executes a turn.\n",
    "This next sections does that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:31.046196Z",
     "end_time": "2023-03-30T23:51:31.106199Z"
    }
   },
   "outputs": [],
   "source": [
    "class InvalidTurn(ValueError):\n",
    "    \"\"\"\n",
    "    This error is thrown if a given turn is not valid.\n",
    "    \"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:31.074204Z",
     "end_time": "2023-03-30T23:51:42.365334Z"
    }
   },
   "outputs": [],
   "source": [
    "def do_moves(boards: np.ndarray, moves: np.ndarray) -> np.ndarray:\n",
    "    \"\"\"Executes a single move on a stack o Othello boards.\n",
    "\n",
    "    Args:\n",
    "        boards: A stack of Othello boards where the next stone should be placed.\n",
    "        moves: A stack of stone placement orders for the game. Formatted as coordinates in an array [x, y] of the place where the stone should be placed. Should contain [-1,-1] if no new placement is possible.\n",
    "\n",
    "    Returns:\n",
    "        The new state of the board.\n",
    "    \"\"\"\n",
    "\n",
    "    def _do_directional_move(\n",
    "        board: np.ndarray, rec_move: np.ndarray, rev_direction, step_one=True\n",
    "    ) -> bool:\n",
    "        \"\"\"Changes the color of enemy stones in one direction.\n",
    "\n",
    "        This function works recursive. The argument step_one should always be used in its default value.\n",
    "\n",
    "        Args:\n",
    "            board: A bord on which a stone was placed.\n",
    "            rec_move: The position on the board in x and y where this function is called from. Will be moved by recursive called.\n",
    "            rev_direction: The position where the stone was placed. Inside this recursion it will also be the last step that was checked.\n",
    "            step_one: Set to true if this is the first step in the recursion. False later on.\n",
    "\n",
    "        Returns:\n",
    "            True if a stone could be flipped.\n",
    "            All changes are made on the view of the numpy array and therefore not included in the return value.\n",
    "        \"\"\"\n",
    "        rec_position = rec_move + rev_direction\n",
    "        if np.any((rec_position >= 8) | (rec_position < 0)):\n",
    "            return False\n",
    "        next_field = board[tuple(rec_position.tolist())]\n",
    "        if next_field == 0:\n",
    "            return False\n",
    "        if next_field == 1:\n",
    "            return not step_one\n",
    "        if next_field == -1:\n",
    "            if _do_directional_move(board, rec_position, rev_direction, step_one=False):\n",
    "                board[tuple(rec_position.tolist())] = 1\n",
    "                return True\n",
    "            return False\n",
    "\n",
    "    def _do_move(_board: np.ndarray, move: np.ndarray) -> None:\n",
    "        \"\"\"Executes a turn on a board.\n",
    "\n",
    "        Args:\n",
    "            _board: The game board on wich to place a stone.\n",
    "            move: The coordinates of a stone that should be placed. Should be formatted as an array of the form [x, y]. The value [-1, -1] is expected if no turn is possible.\n",
    "\n",
    "        Returns:\n",
    "            All changes are made on the view of the numpy array.\n",
    "        \"\"\"\n",
    "        if np.all(move == -1):\n",
    "            if not move_possible(_board, move):\n",
    "                raise InvalidTurn(\"An action should be taken. A turn is possible.\")\n",
    "            return\n",
    "\n",
    "        # noinspection PyTypeChecker\n",
    "        if _board[tuple(move.tolist())] != 0:\n",
    "            raise InvalidTurn(\"This turn is not possible.\")\n",
    "\n",
    "        action = False\n",
    "        for direction in DIRECTIONS:\n",
    "            if _do_directional_move(_board, move, direction):\n",
    "                action = True\n",
    "        if not action:\n",
    "            raise InvalidTurn(\"This turn is not possible.\")\n",
    "\n",
    "        # noinspection PyTypeChecker\n",
    "        _board[tuple(move.tolist())] = 1\n",
    "\n",
    "    boards = boards.copy()\n",
    "    for game in range(boards.shape[0]):\n",
    "        _do_move(boards[game], moves[game])\n",
    "    return boards\n",
    "\n",
    "\n",
    "%timeit do_moves(get_new_games(EXAMPLE_STACK_SIZE), np.array([[2, 3]] * EXAMPLE_STACK_SIZE))[0]\n",
    "\n",
    "plot_othello_board(\n",
    "    do_moves(\n",
    "        get_new_games(EXAMPLE_STACK_SIZE), np.array([[2, 3]] * EXAMPLE_STACK_SIZE)\n",
    "    )[0],\n",
    "    action=np.array([2, 3]),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## An abstract reversi game policy\n",
    "\n",
    "For an easy use of policies an abstract class containing the policy generation / requests an action in an inherited instance of this class.\n",
    "This class filters the policy to only propose valid actions. Inherited instance do not need to care about this. This super class also manges exploration and exploitation with the epsilon value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:42.374342Z",
     "end_time": "2023-03-30T23:51:42.422334Z"
    }
   },
   "outputs": [],
   "source": [
    "class GamePolicy(ABC):\n",
    "    \"\"\"\n",
    "    A game policy. Proposes where to place a stone next.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, epsilon: float):\n",
    "        \"\"\"\n",
    "\n",
    "        Args:\n",
    "            epsilon: the epsilon / greedy value. Should be between zero and one. Set the mixture of policy and exploration. One means only the policy is used. Zero means only random policies are used. All mixtures inbetween between are possible.\n",
    "        \"\"\"\n",
    "        if 0 > epsilon > 1:\n",
    "            raise ValueError(\"Epsilon should be between zero and one.\")\n",
    "        self._epsilon: float = epsilon\n",
    "\n",
    "    @property\n",
    "    def epsilon(self):\n",
    "        return self._epsilon\n",
    "\n",
    "    @property\n",
    "    @abc.abstractmethod\n",
    "    def policy_name(self) -> str:\n",
    "        \"\"\"The name of this policy\"\"\"\n",
    "        raise NotImplementedError()\n",
    "\n",
    "    @abc.abstractmethod\n",
    "    def _internal_policy(self, boards: np.ndarray) -> np.ndarray:\n",
    "        \"\"\"The internal policy is an unfiltered policy. It should only be called from inside this function\n",
    "\n",
    "        Args:\n",
    "            boards: A board where a policy should be calculated for.\n",
    "\n",
    "        Returns:\n",
    "            The policy for this board. Should have the same size as the boards array.\n",
    "        \"\"\"\n",
    "        raise NotImplementedError()\n",
    "\n",
    "    def get_policy(self, boards: np.ndarray) -> np.ndarray:\n",
    "        \"\"\"Calculates the policy that should be followed.\n",
    "\n",
    "        Calculates the policy that should be followed.\n",
    "        This function does include the usage of epsilon to configure greediness and exploration.\n",
    "\n",
    "        Args:\n",
    "            boards: A set of boards that show the environment where the policy should be calculated for.\n",
    "\n",
    "        Returns:\n",
    "            A vector of indices. Should be formatted as an array of the form [x, y]. The value [-1, -1] is expected if no turn is possible.\n",
    "        \"\"\"\n",
    "        assert len(boards.shape) == 3\n",
    "        assert boards.shape[1:] == (BOARD_SIZE, BOARD_SIZE)\n",
    "\n",
    "        if self.epsilon <= 0:\n",
    "            policies = np.random.rand(*boards.shape)\n",
    "        else:\n",
    "            policies = self._internal_policy(boards)\n",
    "            if self.epsilon < 1:\n",
    "                random_choices = self.epsilon <= np.random.rand((boards.shape[0]))\n",
    "                policies[random_choices] = np.random.rand(np.sum(random_choices), 8, 8)\n",
    "\n",
    "        # todo possibly change this function to only validate the purpose turn and not all turns\n",
    "        possible_turns = get_possible_turns(boards)\n",
    "        policies[possible_turns == False] = -1.0\n",
    "        max_indices = [\n",
    "            np.unravel_index(policy.argmax(), policy.shape) for policy in policies\n",
    "        ]\n",
    "        policy_vector = np.array(max_indices, dtype=int)\n",
    "        no_turn_possible = np.all(policy_vector == 0, 1) & (policies[:, 0, 0] == -1.0)\n",
    "\n",
    "        policy_vector[no_turn_possible, :] = IMPOSSIBLE\n",
    "        return policy_vector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Some first policies\n",
    "\n",
    "To quantify the quality of a game AI there needs to be some benchmarks.\n",
    "The easiest benchmark is to play against a random player.\n",
    "The easiest player to use as a benchmark is the random player.\n",
    "For this and testing purpose the random policy was implemented."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:42.400400Z",
     "end_time": "2023-03-30T23:51:42.424333Z"
    }
   },
   "outputs": [],
   "source": [
    "class RandomPolicy(GamePolicy):\n",
    "    \"\"\"\n",
    "    A policy playing a random turn by setting epsilon to 0.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, epsilon: float = 0):\n",
    "        _ = epsilon\n",
    "        super().__init__(epsilon=0)\n",
    "\n",
    "    @property\n",
    "    def policy_name(self) -> str:\n",
    "        return \"random\"\n",
    "\n",
    "    def _internal_policy(self, boards: np.ndarray) -> np.ndarray:\n",
    "        pass\n",
    "\n",
    "\n",
    "rnd_policy = RandomPolicy(1)\n",
    "assert rnd_policy.policy_name == \"random\"\n",
    "assert rnd_policy.epsilon == 0\n",
    "\n",
    "rnd_policy_result = rnd_policy.get_policy(get_new_games(10))\n",
    "assert np.any((5 >= rnd_policy_result) & (rnd_policy_result >= 3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An alternative benchmark policy is a greedy policy that takes always the maximum number of stones possible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:42.416329Z",
     "end_time": "2023-03-30T23:51:42.453635Z"
    }
   },
   "outputs": [],
   "source": [
    "class GreedyPolicy(GamePolicy):\n",
    "    \"\"\"\n",
    "    A policy playing always one of the strongest turns.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, epsilon: float = 1):\n",
    "        _ = epsilon\n",
    "        super().__init__(1)\n",
    "\n",
    "    @property\n",
    "    def policy_name(self) -> str:\n",
    "        return \"greedy_policy\"\n",
    "\n",
    "    def _internal_policy(self, boards: np.ndarray) -> np.ndarray:\n",
    "        policies = np.random.rand(*boards.shape)\n",
    "        poss_turns = boards == 0  # checks where fields are empty.\n",
    "        poss_turns &= binary_dilation(boards == -1, SURROUNDING)\n",
    "        for game, idx, idy in itertools.product(\n",
    "            range(boards.shape[0]), range(BOARD_SIZE), range(BOARD_SIZE)\n",
    "        ):\n",
    "\n",
    "            if poss_turns[game, idx, idy]:\n",
    "                position = idx, idy\n",
    "                policies[game, idx, idy] += np.sum(\n",
    "                    np.array(\n",
    "                        list(\n",
    "                            _recursive_steps(boards[game, :, :], direction, position)\n",
    "                            for direction in DIRECTIONS\n",
    "                        )\n",
    "                    )\n",
    "                )\n",
    "        return policies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Putting the game simulation together\n",
    "Now it's time to bring all together for a proper simulation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Playing a single turn\n",
    "\n",
    "The next function needed is used to request a policy, verify that the turn is legit and place a stone and turn enemy stones if possible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:42.448649Z",
     "end_time": "2023-03-30T23:51:42.564324Z"
    }
   },
   "outputs": [],
   "source": [
    "def single_turn(\n",
    "    current_boards: np, policy: GamePolicy\n",
    ") -> tuple[np.ndarray, np.ndarray]:\n",
    "    \"\"\"Execute a single turn on a board.\n",
    "\n",
    "    Places a new stone on the board. Turns captured enemy stones.\n",
    "\n",
    "    Args:\n",
    "        current_boards: The current board before the game.\n",
    "        policy: The game policy to be used.\n",
    "\n",
    "    Returns:\n",
    "        The new game board and the policy vector containing the index of the action used.\n",
    "    \"\"\"\n",
    "    policy_results = policy.get_policy(current_boards)\n",
    "\n",
    "    # if the constant VERIFY_POLICY is set to true the policy is verified. Should be good though.\n",
    "    # todo deactivate the policy verification after some testing.\n",
    "    if VERIFY_POLICY:\n",
    "        assert np.all(moves_possible(current_boards, policy_results)), (\n",
    "            current_boards[(moves_possible(current_boards, policy_results) == False)],\n",
    "            policy_results[(moves_possible(current_boards, policy_results) == False)],\n",
    "            np.where(moves_possible(current_boards, policy_results) == False),\n",
    "        )\n",
    "    return do_moves(current_boards, policy_results), policy_results"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "%timeit single_turn(get_new_games(EXAMPLE_STACK_SIZE), RandomPolicy(1))\n",
    "VERIFY_POLICY = False  # type: ignore\n",
    "%timeit single_turn(get_new_games(EXAMPLE_STACK_SIZE), RandomPolicy(1))\n",
    "VERIFY_POLICY = True  # type: ignore\n",
    "_turn_result = single_turn(get_new_games(EXAMPLE_STACK_SIZE), RandomPolicy(1))\n",
    "plot_othello_boards(_turn_result[0][:8], _turn_result[1][:8])\n",
    "del _turn_result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Simulate a stack of games\n",
    "This function will simulate a stack of games and return an array of policies and histories.\n",
    "\n",
    "This will return an arrays with the size of (70 x n x 8 x 8) and (70 x n x 2).\n",
    "The first will contain the boards. The second will contain the actions. If no action is taken the action will be noted as played in (-1, -1)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2023-03-30T23:51:42.461638Z",
     "end_time": "2023-03-30T23:51:42.565331Z"
    }
   },
   "outputs": [],
   "source": [
    "def simulate_game(\n",
    "    nr_of_games: int,\n",
    "    policies: tuple[GamePolicy, GamePolicy],\n",
    "    tqdm_on: bool = False,\n",
    ") -> tuple[np.ndarray, np.ndarray]:\n",
    "    \"\"\"Simulates a stack of games.\n",
    "\n",
    "    Args:\n",
    "        nr_of_games: The number of games that should be simulated.\n",
    "        policies: The policies that should be used to simulate the game.\n",
    "        tqdm_on: Switches tqdm on.\n",
    "\n",
    "    Returns:\n",
    "        A stack of board histories and actions.\n",
    "    \"\"\"\n",
    "    board_history_stack = np.zeros((SIMULATE_TURNS, nr_of_games, 8, 8), dtype=np.int8)\n",
    "    action_history_stack = np.zeros((SIMULATE_TURNS, nr_of_games, 2), dtype=np.int8)\n",
    "    current_boards = get_new_games(nr_of_games)\n",
    "    for turn_index in tqdm(range(SIMULATE_TURNS)) if tqdm_on else range(SIMULATE_TURNS):\n",
    "        policy_index = turn_index % 2\n",
    "        policy = policies[policy_index]\n",
    "        board_history_stack[turn_index, :, :, :] = current_boards\n",
    "        if policy_index == 1:\n",
    "            current_boards *= -1\n",
    "        current_boards, action_taken = single_turn(current_boards, policy)\n",
    "        action_history_stack[turn_index, :] = action_taken\n",
    "\n",
    "        if policy_index == 1:\n",
    "            current_boards *= -1\n",
    "\n",
    "    return board_history_stack, action_history_stack"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first thing to do now is try out how the player act."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Simulating games\n",
    "\n",
    "Since now a simulator, a tool for visualisation and two policies exist a few games need to be simulated to verify proper function off all three elements.\n",
    "\n",
    "### Random vs. random policy\n",
    "First there is a simulation of a game between two random polices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "ename": "KeyboardInterrupt",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001B[1;31m---------------------------------------------------------------------------\u001B[0m",
      "\u001B[1;31mKeyboardInterrupt\u001B[0m                         Traceback (most recent call last)",
      "Cell \u001B[1;32mIn[105], line 6\u001B[0m\n\u001B[0;32m      2\u001B[0m simulation_results \u001B[38;5;241m=\u001B[39m simulate_game(\u001B[38;5;241m1\u001B[39m, (RandomPolicy(\u001B[38;5;241m1\u001B[39m), RandomPolicy(\u001B[38;5;241m1\u001B[39m)))\n\u001B[0;32m      3\u001B[0m _unique_bords, _unique_actions \u001B[38;5;241m=\u001B[39m drop_duplicate_boards(\n\u001B[0;32m      4\u001B[0m     simulation_results[\u001B[38;5;241m0\u001B[39m]\u001B[38;5;241m.\u001B[39mreshape(\u001B[38;5;241m-\u001B[39m\u001B[38;5;241m1\u001B[39m, \u001B[38;5;241m8\u001B[39m, \u001B[38;5;241m8\u001B[39m), simulation_results[\u001B[38;5;241m1\u001B[39m]\u001B[38;5;241m.\u001B[39mreshape(\u001B[38;5;241m-\u001B[39m\u001B[38;5;241m1\u001B[39m, \u001B[38;5;241m2\u001B[39m)\n\u001B[0;32m      5\u001B[0m )\n\u001B[1;32m----> 6\u001B[0m \u001B[43mplot_othello_boards\u001B[49m\u001B[43m(\u001B[49m\u001B[43m_unique_bords\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mactions\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43m_unique_actions\u001B[49m\u001B[43m)\u001B[49m\n",
      "Cell \u001B[1;32mIn[90], line 43\u001B[0m, in \u001B[0;36mplot_othello_boards\u001B[1;34m(boards, actions, scores)\u001B[0m\n\u001B[0;32m     41\u001B[0m         plot_othello_board(boards[game_index], action\u001B[38;5;241m=\u001B[39maction, score\u001B[38;5;241m=\u001B[39mscore, ax\u001B[38;5;241m=\u001B[39max)\n\u001B[0;32m     42\u001B[0m plt\u001B[38;5;241m.\u001B[39mtight_layout()\n\u001B[1;32m---> 43\u001B[0m \u001B[43mplt\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mshow\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\pyplot.py:445\u001B[0m, in \u001B[0;36mshow\u001B[1;34m(*args, **kwargs)\u001B[0m\n\u001B[0;32m    401\u001B[0m \u001B[38;5;250m\u001B[39m\u001B[38;5;124;03m\"\"\"\u001B[39;00m\n\u001B[0;32m    402\u001B[0m \u001B[38;5;124;03mDisplay all open figures.\u001B[39;00m\n\u001B[0;32m    403\u001B[0m \n\u001B[1;32m   (...)\u001B[0m\n\u001B[0;32m    442\u001B[0m \u001B[38;5;124;03mexplicitly there.\u001B[39;00m\n\u001B[0;32m    443\u001B[0m \u001B[38;5;124;03m\"\"\"\u001B[39;00m\n\u001B[0;32m    444\u001B[0m _warn_if_gui_out_of_main_thread()\n\u001B[1;32m--> 445\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m _get_backend_mod()\u001B[38;5;241m.\u001B[39mshow(\u001B[38;5;241m*\u001B[39margs, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkwargs)\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib_inline\\backend_inline.py:90\u001B[0m, in \u001B[0;36mshow\u001B[1;34m(close, block)\u001B[0m\n\u001B[0;32m     88\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m     89\u001B[0m     \u001B[38;5;28;01mfor\u001B[39;00m figure_manager \u001B[38;5;129;01min\u001B[39;00m Gcf\u001B[38;5;241m.\u001B[39mget_all_fig_managers():\n\u001B[1;32m---> 90\u001B[0m         \u001B[43mdisplay\u001B[49m\u001B[43m(\u001B[49m\n\u001B[0;32m     91\u001B[0m \u001B[43m            \u001B[49m\u001B[43mfigure_manager\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcanvas\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfigure\u001B[49m\u001B[43m,\u001B[49m\n\u001B[0;32m     92\u001B[0m \u001B[43m            \u001B[49m\u001B[43mmetadata\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43m_fetch_figure_metadata\u001B[49m\u001B[43m(\u001B[49m\u001B[43mfigure_manager\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mcanvas\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfigure\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m     93\u001B[0m \u001B[43m        \u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m     94\u001B[0m \u001B[38;5;28;01mfinally\u001B[39;00m:\n\u001B[0;32m     95\u001B[0m     show\u001B[38;5;241m.\u001B[39m_to_draw \u001B[38;5;241m=\u001B[39m []\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\IPython\\core\\display_functions.py:298\u001B[0m, in \u001B[0;36mdisplay\u001B[1;34m(include, exclude, metadata, transient, display_id, raw, clear, *objs, **kwargs)\u001B[0m\n\u001B[0;32m    296\u001B[0m     publish_display_data(data\u001B[38;5;241m=\u001B[39mobj, metadata\u001B[38;5;241m=\u001B[39mmetadata, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkwargs)\n\u001B[0;32m    297\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m--> 298\u001B[0m     format_dict, md_dict \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;43mformat\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43mobj\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43minclude\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43minclude\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mexclude\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mexclude\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m    299\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m format_dict:\n\u001B[0;32m    300\u001B[0m         \u001B[38;5;66;03m# nothing to display (e.g. _ipython_display_ took over)\u001B[39;00m\n\u001B[0;32m    301\u001B[0m         \u001B[38;5;28;01mcontinue\u001B[39;00m\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\IPython\\core\\formatters.py:177\u001B[0m, in \u001B[0;36mDisplayFormatter.format\u001B[1;34m(self, obj, include, exclude)\u001B[0m\n\u001B[0;32m    175\u001B[0m md \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m\n\u001B[0;32m    176\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[1;32m--> 177\u001B[0m     data \u001B[38;5;241m=\u001B[39m \u001B[43mformatter\u001B[49m\u001B[43m(\u001B[49m\u001B[43mobj\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m    178\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m:\n\u001B[0;32m    179\u001B[0m     \u001B[38;5;66;03m# FIXME: log the exception\u001B[39;00m\n\u001B[0;32m    180\u001B[0m     \u001B[38;5;28;01mraise\u001B[39;00m\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\decorator.py:232\u001B[0m, in \u001B[0;36mdecorate.<locals>.fun\u001B[1;34m(*args, **kw)\u001B[0m\n\u001B[0;32m    230\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m kwsyntax:\n\u001B[0;32m    231\u001B[0m     args, kw \u001B[38;5;241m=\u001B[39m fix(args, kw, sig)\n\u001B[1;32m--> 232\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m caller(func, \u001B[38;5;241m*\u001B[39m(extras \u001B[38;5;241m+\u001B[39m args), \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkw)\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\IPython\\core\\formatters.py:221\u001B[0m, in \u001B[0;36mcatch_format_error\u001B[1;34m(method, self, *args, **kwargs)\u001B[0m\n\u001B[0;32m    219\u001B[0m \u001B[38;5;250m\u001B[39m\u001B[38;5;124;03m\"\"\"show traceback on failed format call\"\"\"\u001B[39;00m\n\u001B[0;32m    220\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[1;32m--> 221\u001B[0m     r \u001B[38;5;241m=\u001B[39m method(\u001B[38;5;28mself\u001B[39m, \u001B[38;5;241m*\u001B[39margs, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkwargs)\n\u001B[0;32m    222\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m \u001B[38;5;167;01mNotImplementedError\u001B[39;00m:\n\u001B[0;32m    223\u001B[0m     \u001B[38;5;66;03m# don't warn on NotImplementedErrors\u001B[39;00m\n\u001B[0;32m    224\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_check_return(\u001B[38;5;28;01mNone\u001B[39;00m, args[\u001B[38;5;241m0\u001B[39m])\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\IPython\\core\\formatters.py:338\u001B[0m, in \u001B[0;36mBaseFormatter.__call__\u001B[1;34m(self, obj)\u001B[0m\n\u001B[0;32m    336\u001B[0m     \u001B[38;5;28;01mpass\u001B[39;00m\n\u001B[0;32m    337\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m--> 338\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mprinter\u001B[49m\u001B[43m(\u001B[49m\u001B[43mobj\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m    339\u001B[0m \u001B[38;5;66;03m# Finally look for special method names\u001B[39;00m\n\u001B[0;32m    340\u001B[0m method \u001B[38;5;241m=\u001B[39m get_real_method(obj, \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mprint_method)\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\IPython\\core\\pylabtools.py:152\u001B[0m, in \u001B[0;36mprint_figure\u001B[1;34m(fig, fmt, bbox_inches, base64, **kwargs)\u001B[0m\n\u001B[0;32m    149\u001B[0m     \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01mmatplotlib\u001B[39;00m\u001B[38;5;21;01m.\u001B[39;00m\u001B[38;5;21;01mbackend_bases\u001B[39;00m \u001B[38;5;28;01mimport\u001B[39;00m FigureCanvasBase\n\u001B[0;32m    150\u001B[0m     FigureCanvasBase(fig)\n\u001B[1;32m--> 152\u001B[0m fig\u001B[38;5;241m.\u001B[39mcanvas\u001B[38;5;241m.\u001B[39mprint_figure(bytes_io, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkw)\n\u001B[0;32m    153\u001B[0m data \u001B[38;5;241m=\u001B[39m bytes_io\u001B[38;5;241m.\u001B[39mgetvalue()\n\u001B[0;32m    154\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m fmt \u001B[38;5;241m==\u001B[39m \u001B[38;5;124m'\u001B[39m\u001B[38;5;124msvg\u001B[39m\u001B[38;5;124m'\u001B[39m:\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\backend_bases.py:2338\u001B[0m, in \u001B[0;36mFigureCanvasBase.print_figure\u001B[1;34m(self, filename, dpi, facecolor, edgecolor, orientation, format, bbox_inches, pad_inches, bbox_extra_artists, backend, **kwargs)\u001B[0m\n\u001B[0;32m   2332\u001B[0m     renderer \u001B[38;5;241m=\u001B[39m _get_renderer(\n\u001B[0;32m   2333\u001B[0m         \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mfigure,\n\u001B[0;32m   2334\u001B[0m         functools\u001B[38;5;241m.\u001B[39mpartial(\n\u001B[0;32m   2335\u001B[0m             print_method, orientation\u001B[38;5;241m=\u001B[39morientation)\n\u001B[0;32m   2336\u001B[0m     )\n\u001B[0;32m   2337\u001B[0m     \u001B[38;5;28;01mwith\u001B[39;00m \u001B[38;5;28mgetattr\u001B[39m(renderer, \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124m_draw_disabled\u001B[39m\u001B[38;5;124m\"\u001B[39m, nullcontext)():\n\u001B[1;32m-> 2338\u001B[0m         \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfigure\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mdraw\u001B[49m\u001B[43m(\u001B[49m\u001B[43mrenderer\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m   2340\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m bbox_inches:\n\u001B[0;32m   2341\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m bbox_inches \u001B[38;5;241m==\u001B[39m \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mtight\u001B[39m\u001B[38;5;124m\"\u001B[39m:\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\artist.py:95\u001B[0m, in \u001B[0;36m_finalize_rasterization.<locals>.draw_wrapper\u001B[1;34m(artist, renderer, *args, **kwargs)\u001B[0m\n\u001B[0;32m     93\u001B[0m \u001B[38;5;129m@wraps\u001B[39m(draw)\n\u001B[0;32m     94\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mdraw_wrapper\u001B[39m(artist, renderer, \u001B[38;5;241m*\u001B[39margs, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkwargs):\n\u001B[1;32m---> 95\u001B[0m     result \u001B[38;5;241m=\u001B[39m draw(artist, renderer, \u001B[38;5;241m*\u001B[39margs, \u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkwargs)\n\u001B[0;32m     96\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m renderer\u001B[38;5;241m.\u001B[39m_rasterizing:\n\u001B[0;32m     97\u001B[0m         renderer\u001B[38;5;241m.\u001B[39mstop_rasterizing()\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\artist.py:72\u001B[0m, in \u001B[0;36mallow_rasterization.<locals>.draw_wrapper\u001B[1;34m(artist, renderer)\u001B[0m\n\u001B[0;32m     69\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m artist\u001B[38;5;241m.\u001B[39mget_agg_filter() \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[0;32m     70\u001B[0m         renderer\u001B[38;5;241m.\u001B[39mstart_filter()\n\u001B[1;32m---> 72\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mdraw\u001B[49m\u001B[43m(\u001B[49m\u001B[43martist\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mrenderer\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m     73\u001B[0m \u001B[38;5;28;01mfinally\u001B[39;00m:\n\u001B[0;32m     74\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m artist\u001B[38;5;241m.\u001B[39mget_agg_filter() \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\figure.py:3125\u001B[0m, in \u001B[0;36mFigure.draw\u001B[1;34m(self, renderer)\u001B[0m\n\u001B[0;32m   3122\u001B[0m         \u001B[38;5;66;03m# ValueError can occur when resizing a window.\u001B[39;00m\n\u001B[0;32m   3124\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mpatch\u001B[38;5;241m.\u001B[39mdraw(renderer)\n\u001B[1;32m-> 3125\u001B[0m \u001B[43mmimage\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_draw_list_compositing_images\u001B[49m\u001B[43m(\u001B[49m\n\u001B[0;32m   3126\u001B[0m \u001B[43m    \u001B[49m\u001B[43mrenderer\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43martists\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43msuppressComposite\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m   3128\u001B[0m \u001B[38;5;28;01mfor\u001B[39;00m sfig \u001B[38;5;129;01min\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39msubfigs:\n\u001B[0;32m   3129\u001B[0m     sfig\u001B[38;5;241m.\u001B[39mdraw(renderer)\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\image.py:131\u001B[0m, in \u001B[0;36m_draw_list_compositing_images\u001B[1;34m(renderer, parent, artists, suppress_composite)\u001B[0m\n\u001B[0;32m    129\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m not_composite \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m has_images:\n\u001B[0;32m    130\u001B[0m     \u001B[38;5;28;01mfor\u001B[39;00m a \u001B[38;5;129;01min\u001B[39;00m artists:\n\u001B[1;32m--> 131\u001B[0m         \u001B[43ma\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mdraw\u001B[49m\u001B[43m(\u001B[49m\u001B[43mrenderer\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m    132\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[0;32m    133\u001B[0m     \u001B[38;5;66;03m# Composite any adjacent images together\u001B[39;00m\n\u001B[0;32m    134\u001B[0m     image_group \u001B[38;5;241m=\u001B[39m []\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\artist.py:72\u001B[0m, in \u001B[0;36mallow_rasterization.<locals>.draw_wrapper\u001B[1;34m(artist, renderer)\u001B[0m\n\u001B[0;32m     69\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m artist\u001B[38;5;241m.\u001B[39mget_agg_filter() \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[0;32m     70\u001B[0m         renderer\u001B[38;5;241m.\u001B[39mstart_filter()\n\u001B[1;32m---> 72\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mdraw\u001B[49m\u001B[43m(\u001B[49m\u001B[43martist\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mrenderer\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m     73\u001B[0m \u001B[38;5;28;01mfinally\u001B[39;00m:\n\u001B[0;32m     74\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m artist\u001B[38;5;241m.\u001B[39mget_agg_filter() \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\axes\\_base.py:3066\u001B[0m, in \u001B[0;36m_AxesBase.draw\u001B[1;34m(self, renderer)\u001B[0m\n\u001B[0;32m   3063\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m artists_rasterized:\n\u001B[0;32m   3064\u001B[0m     _draw_rasterized(\u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mfigure, artists_rasterized, renderer)\n\u001B[1;32m-> 3066\u001B[0m \u001B[43mmimage\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_draw_list_compositing_images\u001B[49m\u001B[43m(\u001B[49m\n\u001B[0;32m   3067\u001B[0m \u001B[43m    \u001B[49m\u001B[43mrenderer\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43martists\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfigure\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43msuppressComposite\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m   3069\u001B[0m renderer\u001B[38;5;241m.\u001B[39mclose_group(\u001B[38;5;124m'\u001B[39m\u001B[38;5;124maxes\u001B[39m\u001B[38;5;124m'\u001B[39m)\n\u001B[0;32m   3070\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mstale \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mFalse\u001B[39;00m\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\image.py:131\u001B[0m, in \u001B[0;36m_draw_list_compositing_images\u001B[1;34m(renderer, parent, artists, suppress_composite)\u001B[0m\n\u001B[0;32m    129\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m not_composite \u001B[38;5;129;01mor\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m has_images:\n\u001B[0;32m    130\u001B[0m     \u001B[38;5;28;01mfor\u001B[39;00m a \u001B[38;5;129;01min\u001B[39;00m artists:\n\u001B[1;32m--> 131\u001B[0m         \u001B[43ma\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mdraw\u001B[49m\u001B[43m(\u001B[49m\u001B[43mrenderer\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m    132\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[0;32m    133\u001B[0m     \u001B[38;5;66;03m# Composite any adjacent images together\u001B[39;00m\n\u001B[0;32m    134\u001B[0m     image_group \u001B[38;5;241m=\u001B[39m []\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\artist.py:72\u001B[0m, in \u001B[0;36mallow_rasterization.<locals>.draw_wrapper\u001B[1;34m(artist, renderer)\u001B[0m\n\u001B[0;32m     69\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m artist\u001B[38;5;241m.\u001B[39mget_agg_filter() \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[0;32m     70\u001B[0m         renderer\u001B[38;5;241m.\u001B[39mstart_filter()\n\u001B[1;32m---> 72\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mdraw\u001B[49m\u001B[43m(\u001B[49m\u001B[43martist\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mrenderer\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m     73\u001B[0m \u001B[38;5;28;01mfinally\u001B[39;00m:\n\u001B[0;32m     74\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m artist\u001B[38;5;241m.\u001B[39mget_agg_filter() \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\collections.py:972\u001B[0m, in \u001B[0;36m_CollectionWithSizes.draw\u001B[1;34m(self, renderer)\u001B[0m\n\u001B[0;32m    969\u001B[0m \u001B[38;5;129m@artist\u001B[39m\u001B[38;5;241m.\u001B[39mallow_rasterization\n\u001B[0;32m    970\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21mdraw\u001B[39m(\u001B[38;5;28mself\u001B[39m, renderer):\n\u001B[0;32m    971\u001B[0m     \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mset_sizes(\u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_sizes, \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mfigure\u001B[38;5;241m.\u001B[39mdpi)\n\u001B[1;32m--> 972\u001B[0m     \u001B[38;5;28;43msuper\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mdraw\u001B[49m\u001B[43m(\u001B[49m\u001B[43mrenderer\u001B[49m\u001B[43m)\u001B[49m\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\artist.py:72\u001B[0m, in \u001B[0;36mallow_rasterization.<locals>.draw_wrapper\u001B[1;34m(artist, renderer)\u001B[0m\n\u001B[0;32m     69\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m artist\u001B[38;5;241m.\u001B[39mget_agg_filter() \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[0;32m     70\u001B[0m         renderer\u001B[38;5;241m.\u001B[39mstart_filter()\n\u001B[1;32m---> 72\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mdraw\u001B[49m\u001B[43m(\u001B[49m\u001B[43martist\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mrenderer\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m     73\u001B[0m \u001B[38;5;28;01mfinally\u001B[39;00m:\n\u001B[0;32m     74\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m artist\u001B[38;5;241m.\u001B[39mget_agg_filter() \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;129;01mnot\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\collections.py:388\u001B[0m, in \u001B[0;36mCollection.draw\u001B[1;34m(self, renderer)\u001B[0m\n\u001B[0;32m    386\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[0;32m    387\u001B[0m     combined_transform \u001B[38;5;241m=\u001B[39m transform\n\u001B[1;32m--> 388\u001B[0m extents \u001B[38;5;241m=\u001B[39m \u001B[43mpaths\u001B[49m\u001B[43m[\u001B[49m\u001B[38;5;241;43m0\u001B[39;49m\u001B[43m]\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mget_extents\u001B[49m\u001B[43m(\u001B[49m\u001B[43mcombined_transform\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m    389\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m (extents\u001B[38;5;241m.\u001B[39mwidth \u001B[38;5;241m<\u001B[39m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mfigure\u001B[38;5;241m.\u001B[39mbbox\u001B[38;5;241m.\u001B[39mwidth\n\u001B[0;32m    390\u001B[0m         \u001B[38;5;129;01mand\u001B[39;00m extents\u001B[38;5;241m.\u001B[39mheight \u001B[38;5;241m<\u001B[39m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mfigure\u001B[38;5;241m.\u001B[39mbbox\u001B[38;5;241m.\u001B[39mheight):\n\u001B[0;32m    391\u001B[0m     do_single_path_optimization \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mTrue\u001B[39;00m\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\path.py:633\u001B[0m, in \u001B[0;36mPath.get_extents\u001B[1;34m(self, transform, **kwargs)\u001B[0m\n\u001B[0;32m    631\u001B[0m \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[0;32m    632\u001B[0m     xys \u001B[38;5;241m=\u001B[39m []\n\u001B[1;32m--> 633\u001B[0m     \u001B[38;5;28;01mfor\u001B[39;00m curve, code \u001B[38;5;129;01min\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39miter_bezier(\u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkwargs):\n\u001B[0;32m    634\u001B[0m         \u001B[38;5;66;03m# places where the derivative is zero can be extrema\u001B[39;00m\n\u001B[0;32m    635\u001B[0m         _, dzeros \u001B[38;5;241m=\u001B[39m curve\u001B[38;5;241m.\u001B[39maxis_aligned_extrema()\n\u001B[0;32m    636\u001B[0m         \u001B[38;5;66;03m# as can the ends of the curve\u001B[39;00m\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\path.py:443\u001B[0m, in \u001B[0;36mPath.iter_bezier\u001B[1;34m(self, **kwargs)\u001B[0m\n\u001B[0;32m    441\u001B[0m first_vert \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m\n\u001B[0;32m    442\u001B[0m prev_vert \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mNone\u001B[39;00m\n\u001B[1;32m--> 443\u001B[0m \u001B[38;5;28;01mfor\u001B[39;00m verts, code \u001B[38;5;129;01min\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39miter_segments(\u001B[38;5;241m*\u001B[39m\u001B[38;5;241m*\u001B[39mkwargs):\n\u001B[0;32m    444\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m first_vert \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[0;32m    445\u001B[0m         \u001B[38;5;28;01mif\u001B[39;00m code \u001B[38;5;241m!=\u001B[39m Path\u001B[38;5;241m.\u001B[39mMOVETO:\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\matplotlib\\path.py:416\u001B[0m, in \u001B[0;36mPath.iter_segments\u001B[1;34m(self, transform, remove_nans, clip, snap, stroke_width, simplify, curves, sketch)\u001B[0m\n\u001B[0;32m    414\u001B[0m     \u001B[38;5;28;01mfor\u001B[39;00m i \u001B[38;5;129;01min\u001B[39;00m \u001B[38;5;28mrange\u001B[39m(extra_vertices):\n\u001B[0;32m    415\u001B[0m         \u001B[38;5;28mnext\u001B[39m(codes)\n\u001B[1;32m--> 416\u001B[0m         curr_vertices \u001B[38;5;241m=\u001B[39m \u001B[43mnp\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mappend\u001B[49m\u001B[43m(\u001B[49m\u001B[43mcurr_vertices\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;28;43mnext\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43mvertices\u001B[49m\u001B[43m)\u001B[49m\u001B[43m)\u001B[49m\n\u001B[0;32m    417\u001B[0m \u001B[38;5;28;01myield\u001B[39;00m curr_vertices, code\n",
      "File \u001B[1;32m<__array_function__ internals>:200\u001B[0m, in \u001B[0;36mappend\u001B[1;34m(*args, **kwargs)\u001B[0m\n",
      "File \u001B[1;32m~\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\reversi-SkjoUH1O-py3.10\\lib\\site-packages\\numpy\\lib\\function_base.py:5499\u001B[0m, in \u001B[0;36mappend\u001B[1;34m(arr, values, axis)\u001B[0m\n\u001B[0;32m   5497\u001B[0m     values \u001B[38;5;241m=\u001B[39m ravel(values)\n\u001B[0;32m   5498\u001B[0m     axis \u001B[38;5;241m=\u001B[39m arr\u001B[38;5;241m.\u001B[39mndim\u001B[38;5;241m-\u001B[39m\u001B[38;5;241m1\u001B[39m\n\u001B[1;32m-> 5499\u001B[0m \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mconcatenate\u001B[49m\u001B[43m(\u001B[49m\u001B[43m(\u001B[49m\u001B[43marr\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mvalues\u001B[49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43maxis\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43maxis\u001B[49m\u001B[43m)\u001B[49m\n",
      "File \u001B[1;32m<__array_function__ internals>:200\u001B[0m, in \u001B[0;36mconcatenate\u001B[1;34m(*args, **kwargs)\u001B[0m\n",
      "\u001B[1;31mKeyboardInterrupt\u001B[0m: "
     ]
    }
   ],
   "source": [
    "np.random.seed(0)\n",
    "simulation_results = simulate_game(1, (RandomPolicy(1), RandomPolicy(1)))\n",
    "_unique_bords, _unique_actions = drop_duplicate_boards(\n",
    "    simulation_results[0].reshape(-1, 8, 8), simulation_results[1].reshape(-1, 2)\n",
    ")\n",
    "plot_othello_boards(_unique_bords, actions=_unique_actions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%memit simulate_game(100, (RandomPolicy(1), RandomPolicy(1)))\n",
    "%timeit simulate_game(100, (RandomPolicy(1), RandomPolicy(1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Greedy vs. greedy policy\n",
    "Then there is a simulation of a game between two greedy policies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "np.random.seed(1)\n",
    "simulation_results = simulate_game(1, (GreedyPolicy(1), GreedyPolicy(1)))\n",
    "_unique_bords, _unique_actions = drop_duplicate_boards(\n",
    "    simulation_results[0].reshape(-1, 8, 8), simulation_results[1].reshape(-1, 2)\n",
    ")\n",
    "plot_othello_boards(_unique_bords, actions=_unique_actions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%memit simulate_game(100, (GreedyPolicy(1), GreedyPolicy(1)))\n",
    "%timeit simulate_game(100, (GreedyPolicy(1), GreedyPolicy(1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Random vs. greedy policy\n",
    "\n",
    "Last there was a simulation between random and greedy policy. Random playing as black and the greedy as white."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "np.random.seed(2)\n",
    "simulation_results = simulate_game(1, (RandomPolicy(1), GreedyPolicy(1)))\n",
    "_unique_bords, _unique_actions = drop_duplicate_boards(\n",
    "    simulation_results[0].reshape(-1, 8, 8), simulation_results[1].reshape(-1, 2)\n",
    ")\n",
    "plot_othello_boards(_unique_bords, actions=_unique_actions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Greedy vs. random policy\n",
    "\n",
    "Last there was a simulation between the greedy policy as black and the random policy as white."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "np.random.seed(3)\n",
    "simulation_results = simulate_game(1, (GreedyPolicy(1), RandomPolicy(1)))\n",
    "_unique_bords, _unique_actions = drop_duplicate_boards(\n",
    "    simulation_results[0].reshape(-1, 8, 8), simulation_results[1].reshape(-1, 2)\n",
    ")\n",
    "plot_othello_boards(_unique_bords, actions=_unique_actions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Statistical examination of the natural action space and game result\n",
    "As for many project some evaluation of the project is in order.\n",
    "\n",
    "1. What is the expected distribution of scores\n",
    "2. What is the expected distribution of possible actions\n",
    "\n",
    "    a. over time\n",
    "    \n",
    "    b. ober space\n",
    "\n",
    "The easiest and robustest way to analyse this is when analyzing randomly played games."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this purpose we played a sample of 10,000 games and saved them for later analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.exists(\"rnd_history.npy\") and not os.path.exists(\"rnd_action.npy\"):\n",
    "    simulation_results = simulate_game(\n",
    "        10_000, (RandomPolicy(1), RandomPolicy(1)), tqdm_on=True\n",
    "    )\n",
    "    _board_history, _action_history = simulation_results\n",
    "    np.save(\"rnd_history.npy\", _board_history.astype(np.int8))\n",
    "    np.save(\"rnd_action.npy\", _action_history.astype(np.int8))\n",
    "else:\n",
    "    _board_history = np.load(\"rnd_history.npy\")\n",
    "    _action_history = np.load(\"rnd_action.npy\")\n",
    "_board_history.shape, _action_history.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For those 10k games the possible actions where evaluated and saved for each and every turn in the game."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.exists(\"turn_possible.npy\"):\n",
    "    __board_history = _board_history.copy()\n",
    "    __board_history[1::2] = __board_history[1::2] * -1\n",
    "\n",
    "    _poss_turns = get_possible_turns(\n",
    "        __board_history.reshape((-1, 8, 8)), tqdm_on=True\n",
    "    ).reshape((SIMULATE_TURNS, -1, 8, 8))\n",
    "    np.save(\"turn_possible.npy\", _poss_turns)\n",
    "    del __board_history\n",
    "_poss_turns = np.load(\"turn_possible.npy\")\n",
    "_poss_turns.shape, _action_history.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Those possible turms then where counted for all games in the history stack.\n",
    "\n",
    "### Action space over time / tree search size estimation\n",
    "The action space size can be drawn into a histogram by turn and a curve over the mean action space size.\n",
    "This can be used to analyse in which area of the game that cant be solved absolutely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "count_poss_turns = np.sum(_poss_turns, axis=(2, 3))\n",
    "mean_possibility_count = np.mean(count_poss_turns, axis=1)\n",
    "std_possibility_count = np.std(count_poss_turns, axis=1)\n",
    "cum_prod = count_poss_turns\n",
    "\n",
    "\n",
    "@interact(turn=(0, 69))\n",
    "def poss_turn_count(turn):\n",
    "    fig, axes = plt.subplots(2, 2, figsize=(15, 8))\n",
    "    ax1, ax2, ax3, ax4 = axes.flatten()\n",
    "    _mean_possibility_count = mean_possibility_count.copy()\n",
    "    _std_possibility_count = std_possibility_count.copy()\n",
    "    _mean_possibility_count[_mean_possibility_count <= 1] = 1\n",
    "    _std_possibility_count[_std_possibility_count <= 1] = 1\n",
    "    # np.cumprod(_mean_possibility_count[::-1], axis=0)[::-1]\n",
    "    # todo what happens here=\n",
    "    fig.suptitle(\n",
    "        f\"Action space size analysis\\nThe total size is estimated to be around {np.prod(_mean_possibility_count):.2E}\"\n",
    "    )\n",
    "    ax1.hist(count_poss_turns[turn], density=True)\n",
    "    ax1.set_title(f\"Histogram of the action space size for turn {turn}\")\n",
    "    ax1.set_xlabel(\"Action space size\")\n",
    "    ax1.set_ylabel(\"Action space size probability\")\n",
    "    ax2.set_title(f\"Mean size of the action space per turn\")\n",
    "    ax2.set_xlabel(\"Turn\")\n",
    "    ax2.set_ylabel(\"Average possible moves\")\n",
    "\n",
    "    ax2.errorbar(\n",
    "        range(70),\n",
    "        mean_possibility_count,\n",
    "        yerr=std_possibility_count,\n",
    "        label=\"Mean action space size with error bars\",\n",
    "    )\n",
    "    ax2.scatter(turn, mean_possibility_count[turn], marker=\"x\")\n",
    "    ax2.legend()\n",
    "\n",
    "    action_space_cumprod = np.cumprod(_mean_possibility_count[::-1], axis=0)[::-1]\n",
    "    ax4.plot(range(70), action_space_cumprod)\n",
    "\n",
    "    ax4.scatter(turn, action_space_cumprod[turn], marker=\"x\")\n",
    "    ax4.set_yscale(\"log\", base=10)\n",
    "    ax4.set_xlabel(\"Turn\")\n",
    "    ax4.set_ylabel(\"Mean remaining total action space size\")\n",
    "    ax4.set_title(\n",
    "        f\"Remaining action space at {turn} = {action_space_cumprod[turn].round():.2E}\"\n",
    "    )\n",
    "    fig.delaxes(ax3)\n",
    "    fig.tight_layout()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The action space analysis can be used to siwtch betwen a \"normal\" algorithm and a ANN powered alogrithm. If the average remaining decision tree is small enough.\n",
    "Depending on the performence of the network and the speed requirements.\n",
    "After step 52 it should be easely possible. But if wanted to this could start at step 50 or even 48."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is interesting to see that the action space for the first player (black) is much smaller than for the second player. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "black = mean_possibility_count[0::2]\n",
    "white = mean_possibility_count[1::2]\n",
    "df = pd.DataFrame(\n",
    "    [\n",
    "        {\n",
    "            \"white\": np.prod(np.extract(white, white)),\n",
    "            \"black\": np.prod(np.extract(black, black)),\n",
    "        }\n",
    "    ],\n",
    "    index=[\"Total mean action-space\"],\n",
    ").T\n",
    "del white, black\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Posiblilty that an action is possible at a specifc turn\n",
    "\n",
    "The diagramm below shows where and wann a stone can be placed at each of the turns.\n",
    "This can be used to compare learning behavior for different Policies and show for example the behavior around the corners.\n",
    "A very low possiblity for a corner would mean that the AI tires not to give the corners to the enemy an tries to capture them themselve if possible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mean_poss_turn = np.mean(_poss_turns, axis=1)\n",
    "\n",
    "\n",
    "@interact(turn=(0, 69))\n",
    "def turn_distribution_heatmap(turn):\n",
    "    turn_possibility_on_field = mean_poss_turn[turn]\n",
    "\n",
    "    sns.heatmap(\n",
    "        turn_possibility_on_field,\n",
    "        linewidth=0.5,\n",
    "        square=True,\n",
    "        annot=True,\n",
    "        xticklabels=\"ABCDEFGH\",\n",
    "        yticklabels=list(range(1, 9)),\n",
    "    )\n",
    "    plt.title(f\"Headmap of where stones can be placed on turn {turn}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Statistic of skipped actions\n",
    "\n",
    "Not all turns can be played. Ploted as a mean over the curse of the game it can be clearly seen that the first time a turn can be skipped is is turn 9 and increases over time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def history_changed(board_history: np.ndarray) -> np.ndarray:\n",
    "    \"\"\"Calculates if the board changed between actions.\n",
    "\n",
    "    Args:\n",
    "        board_history: A history of game baords. Shaped (70 * n * 8 * 8)\n",
    "    \"\"\"\n",
    "    return ~np.all(\n",
    "        np.roll(board_history, shift=1, axis=0) == board_history, axis=(2, 3)\n",
    "    )\n",
    "\n",
    "\n",
    "plt.title(\"Share of turns skipped\")\n",
    "plt.plot(1 - np.mean(history_changed(_board_history), axis=1))\n",
    "plt.xlabel(\"Turn\")\n",
    "plt.ylabel(\"Factor of skipped turns\")\n",
    "plt.yscale(\"log\", base=10)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Hash branching\n",
    "To calibrate the explration rate properly we compared all the games in a stack of games. The graph shows the number of unique game boards at each of the game turns.\n",
    "As can be seen below for random games the games start to be unique very fast.\n",
    "For a proper directed exploration I assume the rate needs to be calbrated that the game still have some duplications of the best knwon game at the end of an game simulatin left."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def calculate_board_branching(board_history) -> pd.Series:\n",
    "    assert len(board_history.shape) == 4\n",
    "    assert board_history.shape[-2:] == (8, 8)\n",
    "    assert board_history.shape[0] == SIMULATE_TURNS\n",
    "    return pd.Series(\n",
    "        [count_unique_boards(board_history[turn]) for turn in range(SIMULATE_TURNS)]\n",
    "    )\n",
    "\n",
    "\n",
    "_ = calculate_board_branching(_board_history).plot(\n",
    "    title=f\"Exploration history over {_board_history.shape[0]} turns\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "_direct_score = np.sum(_board_history, axis=(-2, -1))\n",
    "_score = np.zeros_like(_direct_score)\n",
    "_score[:-1] = _direct_score[1:] - _direct_score[:-1]\n",
    "print(np.mean(_score, axis=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reword functions\n",
    "\n",
    "For any kind of reinforcement learning is a reword function needed.\n",
    "For otello this would be the final score, the information who won, changes to the score and the sum of the board.\n",
    "A combination of those three also be possible.\n",
    "It is probably not be possible to weight the current score to high in a reword function since that would be to close to a classic greedy algorithm.\n",
    "But some direct influence would increase the learning speed.\n",
    "In the next section are all three reword functions implemented to be combined and weight later on as needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate the final game score\n",
    "\n",
    "When playing Otello the empty fileds at the end of the game are conted for the player with more stones.\n",
    "The folowing function calucates that. The result result will be the delta between the score for player 1 (black)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def final_boards_evaluation(boards: np.ndarray) -> np.ndarray:\n",
    "    \"\"\"Evaluates the board at the end of the game.\n",
    "\n",
    "    All unused fields are added to the score of the player that has more stones with his color up.\n",
    "    This score only applies to the end of the game.\n",
    "    Normally the score is represented by the number of stones each player has.\n",
    "    In this case the score was combined by building the difference.\n",
    "\n",
    "    Args:\n",
    "        boards: A stack of game bords ot the end of the game. Shaped (n * 8 * 8)\n",
    "\n",
    "    Returns:\n",
    "        the combined score for both player.\n",
    "    \"\"\"\n",
    "    score1, score2 = np.sum(boards == 1, axis=(1, 2)), np.sum(boards == -1, axis=(1, 2))\n",
    "    player_1_won = score1 > score2\n",
    "    player_2_won = score1 < score2\n",
    "    score1_final = 64 - score2[player_1_won] * 2\n",
    "    score2_final = 64 - score1[player_2_won] * 2\n",
    "    score = np.zeros(boards.shape[0])\n",
    "    score[player_1_won] = score1_final\n",
    "    score[player_2_won] = -score2_final\n",
    "    return score\n",
    "\n",
    "\n",
    "np.random.seed(2)\n",
    "_baords = simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0]\n",
    "np.testing.assert_array_equal(\n",
    "    np.sum(_baords[-1], axis=(1, 2)), final_boards_evaluation(_baords[-1])\n",
    ")\n",
    "np.random.seed(2)\n",
    "np.testing.assert_array_equal(\n",
    "    np.array([-6.0, -36.0, -12.0, -16.0, 38.0, -12.0, 2.0, -22.0, 2.0, 10.0]),\n",
    "    final_boards_evaluation(\n",
    "        simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0][-1]\n",
    "    ),\n",
    ")\n",
    "\n",
    "np.random.seed(2)\n",
    "boards = simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0][-1]\n",
    "boards[:, 4, :] = 0\n",
    "np.testing.assert_array_equal(\n",
    "    np.array([-14.0, -38.0, -14.0, -22.0, 40.0, -16.0, -14.0, -28.0, 0.0, 20.0]),\n",
    "    final_boards_evaluation(boards),\n",
    ")\n",
    "\n",
    "_boards = get_new_games(EXAMPLE_STACK_SIZE)\n",
    "%timeit final_boards_evaluation(_boards)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def calculate_final_evaluation_for_history(board_history: np.ndarray) -> np.ndarray:\n",
    "    \"\"\"Calculates the final scores for a stack of game histories.\n",
    "\n",
    "    Args:\n",
    "        board_history: A stack of game histories.\n",
    "    \"\"\"\n",
    "    final_evaluation = final_boards_evaluation(board_history[-1])\n",
    "    return final_evaluation / 64\n",
    "\n",
    "\n",
    "np.random.seed(2)\n",
    "_boards = simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0]\n",
    "np.testing.assert_array_equal(\n",
    "    np.array([-6.0, -36.0, -12.0, -16.0, 38.0, -12.0, 2.0, -22.0, 2.0, 10.0]) / 64,\n",
    "    calculate_final_evaluation_for_history(_boards),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "assert len(calculate_final_evaluation_for_history(_board_history).shape) == 1\n",
    "_final_eval = calculate_final_evaluation_for_history(_board_history)\n",
    "plt.title(\"Histogram over the final score distribution\")\n",
    "plt.hist((_final_eval * 64), density=True)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Evaluation game by stones only\n",
    "\n",
    "The next evaluation is just by counting stones by color and building the difference between both. In this implementation it can also be called the sum of a board."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_boards(boards: np.ndarray) -> np.ndarray:\n",
    "    \"\"\"Counts the stones each player has on the board.\n",
    "\n",
    "    Args:\n",
    "        boards: A stack of boards for evaluation. Shaped (n * 8 * 8)\n",
    "\n",
    "    Returns:\n",
    "        the combined score for both players.\n",
    "    \"\"\"\n",
    "    assert boards.shape[-2:] == (8, 8)\n",
    "    return np.sum(boards, axis=(-1, -2))\n",
    "\n",
    "\n",
    "np.random.seed(1)\n",
    "np.testing.assert_array_equal(\n",
    "    evaluate_boards(simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0][-1]),\n",
    "    np.array([-30, -14, -8, 4, -4, -8, -36, 14, -16, -4]),\n",
    ")\n",
    "np.random.seed(2)\n",
    "np.testing.assert_array_equal(\n",
    "    evaluate_boards(simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0][-1]),\n",
    "    np.array([-6, -36, -12, -16, 38, -12, 2, -22, 2, 10]),\n",
    ")\n",
    "np.testing.assert_array_equal(\n",
    "    evaluate_boards(simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0]).shape,\n",
    "    (70, 10),\n",
    ")\n",
    "np.random.seed(3)\n",
    "np.testing.assert_array_equal(\n",
    "    evaluate_boards(simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0][:4, :3]),\n",
    "    np.array([[0, 0, 0], [3, 3, 3], [0, 0, 0], [5, 3, 3]]),\n",
    ")\n",
    "\n",
    "_boards = get_new_games(EXAMPLE_STACK_SIZE)\n",
    "%timeit evaluate_boards(_boards)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "_eval = evaluate_boards(_board_history[-1])\n",
    "plt.title(\"Histogram over the final score distribution\")\n",
    "plt.hist(_eval, density=True)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate the winner of a game\n",
    "\n",
    "The last function evaluates who won by calculating who signum function of the sum of the numpy array representing the baord.\n",
    "The resulting number would be one if the game was wone by the player (white) or -1 if the enemy (black) won. The result would also "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculate_who_won(board_history: np.ndarray) -> np.ndarray:\n",
    "    \"\"\"Checks who won or is winning a game.\n",
    "\n",
    "    Args:\n",
    "        board_history: A stack of boards for evaluation. Shaped (70 * n * 8 * 8)\n",
    "\n",
    "    Returns:\n",
    "        The information who won for both player. 1 meaning the player won, -1 means the opponent lost. 0 represents a patt.\n",
    "    \"\"\"\n",
    "    assert board_history.shape[-2:] == (8, 8)\n",
    "    assert board_history.shape[0] == 70\n",
    "    return np.sign(np.sum(board_history[-1], axis=(1, 2)))\n",
    "\n",
    "\n",
    "np.random.seed(1)\n",
    "np.testing.assert_array_equal(\n",
    "    calculate_who_won(simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0]),\n",
    "    np.array([-1, -1, -1, 1, -1, -1, -1, 1, -1, -1]),\n",
    ")\n",
    "np.random.seed(2)\n",
    "np.testing.assert_array_equal(\n",
    "    calculate_who_won(simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0]),\n",
    "    np.array([-1, -1, -1, -1, 1, -1, 1, -1, 1, 1]),\n",
    ")\n",
    "\n",
    "\n",
    "_boards = simulate_game(EXAMPLE_STACK_SIZE, (RandomPolicy(1), RandomPolicy(1)))[0]\n",
    "%timeit calculate_who_won(_boards)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "plt.title(\"Win distribution\")\n",
    "plt.bar(\n",
    "    [\"black\", \"draw\", \"white\"],\n",
    "    pd.Series(calculate_who_won(_board_history)).value_counts().sort_index()\n",
    "    / _board_history.shape[1],\n",
    ")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Direct turn evaluation\n",
    "\n",
    "Besides evaluating the turn there is always the possibility to calculate how much of an direct impact a single turn had."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculate_direct_score(board_history: np.ndarray) -> np.ndarray:\n",
    "    \"\"\"Calculates the delta score for all actions.\n",
    "\n",
    "    Args:\n",
    "        board_history: A history of board games or a stack of board games.  Shaped (70 * n * 8 * 8)\n",
    "    \"\"\"\n",
    "    assert board_history.shape[0] == 70\n",
    "    assert board_history.shape[-2:] == (8, 8)\n",
    "    direct_score = np.sum(board_history, axis=(-2, -1))\n",
    "    score = np.zeros_like(direct_score)\n",
    "    score[:-1] = direct_score[1:] - direct_score[:-1]\n",
    "    return score / 64\n",
    "\n",
    "\n",
    "assert len(calculate_direct_score(_board_history).shape) == 2\n",
    "assert calculate_direct_score(_board_history).shape[0] == SIMULATE_TURNS\n",
    "np.random.seed(2)\n",
    "np.testing.assert_equal(\n",
    "    calculate_direct_score(simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0])[\n",
    "        :, 0\n",
    "    ][:10]\n",
    "    * 64,\n",
    "    np.array(\n",
    "        [3.0, -3.0, 3.0, -5.0, 5.0, -5.0, 3.0, -3.0, 7.0, -5.0],\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When plotting the direct score it can be easily seen that the later turnse are point-wise more important. A bad opening however will not allow the player to keep those points. But it is easy to see that points not made at the beginning of the game can be made at the end of the game. This allows for concentration on the gameplay and some preparation at the start of the game."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "score_history = calculate_direct_score(_board_history)\n",
    "score_history *= 64\n",
    "\n",
    "\n",
    "@interact(turn=(0, 59))\n",
    "def hist_direct_score(turn):\n",
    "    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))\n",
    "    fig.suptitle(\n",
    "        f\"Action space size analysis / total size estimate {np.prod(np.extract(mean_possibility_count, mean_possibility_count)):.4g}\"\n",
    "    )\n",
    "\n",
    "    ax1.set_title(\n",
    "        f\"Histogram of scores changes on turn {turn} by {'white' if turn % 2 == 0 else 'black'}\"\n",
    "    )\n",
    "    score = score_history[turn]\n",
    "    bins = max(1, int(max(score) - min(score)))\n",
    "    ax1.hist(score, density=True, bins=bins)\n",
    "    ax1.set_xlabel(\"Points made\")\n",
    "    ax1.set_ylabel(\"Score probability\")\n",
    "    ax2.set_title(\"Points scored at turn\")\n",
    "    ax2.set_xlabel(\"Turn\")\n",
    "    ax2.set_ylabel(\"Average points scored\")\n",
    "    ax2.errorbar(\n",
    "        range(60),\n",
    "        np.abs(np.mean(score_history, axis=1)[:60]),\n",
    "        yerr=np.std(score_history, axis=1)[:60],\n",
    "        label=\"Mean score at turn\",\n",
    "    )\n",
    "    ax2.scatter(\n",
    "        turn, np.abs(np.mean(score_history, axis=1))[:60][turn], marker=\"x\", color=\"red\"\n",
    "    )\n",
    "    ax2.legend()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Q-Learning Policies\n",
    "Q-learning is a classic reinforcement learning technique. The Q-function is an action-value function that returns the expected value of an action in a given state.\n",
    "\n",
    "$Q^\\pi(s_t,a_t)=\\sum^{60}_{t=turn}\\gamma^{60-t} \\cdot R_t$\n",
    "\n",
    "With this function, all actions in a given state can be evaluated, and the most beneficial action can be taken. With classical reinforcement learning, a table for situations and actions is explored and slowly filled. With ANNs, there is the possibility to use an AI model that can interpolate between situations and should not need to explore the complete game tree to solve some situations.\n",
    "\n",
    "### Calculating discount tables\n",
    "\n",
    "Since the game stack contains all steps, even if no action is possible, this needs to be corrected. The normal formula for a reward is:\n",
    "\n",
    "$E(s_{turn},a_{turn}) = \\sum^{60}_{t=turn}\\gamma^{60-t} \\cdot R_t$\n",
    "\n",
    "Since turns that can't be taken do not have the element of uncertainty, the discounting has to be excluded by setting the value to $1$ instead of $\\gamma$.\n",
    "\n",
    "$\\gamma^*_t =\\begin{cases}1 & |a_t|=0\\\\gamma & |a_t|>0\\end{cases}$\n",
    "\n",
    "$E(s_{turn},a_{turn}) = \\prod_{t=turn}^{70}\\gamma^*_t \\cdot R_t$\n",
    "\n",
    "The table below contains the aggregated discount factors ($\\prod_{t=turn}^{70}\\gamma^*_t$) for each reward fitting to the state history. This setup also allows to reward the certainty gained by taking the choice of the action from the opponent. It can be argued that also all turns where a player had no choice how to act should not be discounted. But this will increase calculation requirements to nearly double, which is currently not acceptable since computation time and code complexity are bottlenecks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_gamma_table(board_history: np.ndarray, gamma_value: float) -> np.ndarray:\n",
    "    \"\"\"Calculates a discount table for a board history.\n",
    "\n",
    "    Args:\n",
    "        board_history: A history of game boards. Shaped (70 * n * 8 * 8)\n",
    "        gamma_value: The default discount factor.\n",
    "    \"\"\"\n",
    "    unchanged = history_changed(board_history)\n",
    "    gamma_values = np.ones_like(unchanged, dtype=float)\n",
    "    gamma_values[unchanged] = gamma_value\n",
    "    return gamma_values\n",
    "\n",
    "\n",
    "assert get_gamma_table(_board_history, 0.8).shape == _board_history.shape[:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calculating rewords\n",
    "\n",
    "To calculate the rewards for the Reversi AI, we will use the $\\gamma$ values to combine the rewards obtained during the game with the functions calculate_direct_score, calculate_final_evaluation_for_history, and calculate_who_won.\n",
    "\n",
    "The rewards obtained will be used to build a weighted sum of rewards, where most of the rewards are terminal rewards awarded at the end of the game and discounted back over the course of the game. The sum of the reward weights is always 1, with the third value calculated from the first two.\n",
    "\n",
    "The direct score is the only part of the reward that is awarded before the terminal reward. This setup allows for experimentation with different types of rewards to train the model, with different definitions of what is considered \"best\" depending on factors such as initial startup time, stability, and quality of results.\n",
    "\n",
    "Although $Q^\\pi$ depends on state and action, the rewards are returned and do not require a specified action to be given, since the action is implied by the structure of the data.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculate_q_reword(\n",
    "    board_history: np.ndarray,\n",
    "    who_won_fraction: float = 0.2,\n",
    "    final_score_fraction: float = 0.2,\n",
    "    gamma: float = 0.8,\n",
    ") -> np.ndarray:\n",
    "    \"\"\"Calculates a Q reword for a stack of states.\n",
    "\n",
    "    Args:\n",
    "        board_history: A stack ob board histories to calculate q_rewords for.\n",
    "        who_won_fraction: This factor describes how the winner of the game should be weighted. Expected value is in [0, 1].\n",
    "        final_score_fraction: This factor describes how important the final score of the game should be weighted. Expected value is in [0, 1].\n",
    "        gamma: The discount value fo all turns that had a choice.\n",
    "    \"\"\"\n",
    "    assert who_won_fraction + final_score_fraction <= 1\n",
    "    assert final_score_fraction >= 0\n",
    "    assert who_won_fraction >= 0\n",
    "\n",
    "    gama_table = get_gamma_table(board_history, gamma)\n",
    "    combined_score = np.zeros_like(gama_table)\n",
    "    combined_score += calculate_direct_score(board_history) * (\n",
    "        1 - (who_won_fraction + final_score_fraction)\n",
    "    )\n",
    "    combined_score[-1] += (\n",
    "        calculate_final_evaluation_for_history(board_history)\n",
    "        * final_score_fraction\n",
    "        / 0.7\n",
    "    )\n",
    "    combined_score[-1] += calculate_who_won(board_history) * who_won_fraction\n",
    "    for turn in range(SIMULATE_TURNS - 1, 0, -1):\n",
    "        values = gama_table[turn] * combined_score[turn]\n",
    "        combined_score[turn - 1] += values\n",
    "\n",
    "    return combined_score"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "The calculated q_learning rewords look than as shown below. For the different distributions of rewords by factor."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "calculate_q_reword(\n",
    "    _board_history, gamma=0.7, who_won_fraction=0, final_score_fraction=1\n",
    ")"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "calculate_q_reword(\n",
    "    _board_history, gamma=0.8, who_won_fraction=1, final_score_fraction=0\n",
    ")[:, 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "calculate_q_reword(\n",
    "    _board_history, gamma=0.8, who_won_fraction=0, final_score_fraction=0\n",
    ")[:, 0] * 64"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### Proposed ANN Structures\n",
    "\n",
    "For the purpose of creating an AI to play the board game Otello, we propose a specific structure for an Artificial Neural Network (ANN) that can be trained to make intelligent moves in the game.\n",
    "\n",
    "##### Layer Initialization Function\n",
    "\n",
    "When training the ANN, it is important to initialize the weights and biases of its layers before starting the training process. This can be done using various techniques such as random initialization or using pre-trained weights. The choice of initialization function can have an impact on the training speed and the quality of the learned model. Therefore, we will carefully select an appropriate initialization function to ensure the best possible training outcome for our Otello AI."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def weights_init_normal(m):\n",
    "    \"\"\"Takes in a module and initializes all linear layers with weight\n",
    "       values taken from a normal distribution.\n",
    "    Source: https://stackoverflow.com/a/55546528/11003343\n",
    "    \"\"\"\n",
    "\n",
    "    classname = m.__class__.__name__\n",
    "    # for every Linear layer in a model\n",
    "    if classname.find(\"Linear\") != -1:\n",
    "        y = m.in_features\n",
    "        # m.weight.data should be taken from a normal distribution\n",
    "        m.weight.data.normal_(0.0, 1 / np.sqrt(y))\n",
    "        # m.bias.data should be 0\n",
    "        m.bias.data.fill_(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "The architecture of an Artificial Neural Network (ANN) is defined by its inputs and outputs. In the case of Q-Learning ANNs for the board game Otello, the network should calculate the expected reward for the current state and a proposed action.\n",
    "\n",
    "The state of the game board is represented by a 8x8 array, where each cell can be either empty, occupied by a black stone, or occupied by a white stone. The proposed action is represented by an array of equal size, using one-hot encoding to indicate the cell where a stone is proposed to be placed.\n",
    "\n",
    "The output of the ANN should be a value between -1 and 1, which suggests using a $\\tanh$ activation function. Since the game board is small and the position of the stones is crucial, it was decided to use only conventional layers, without any Long Short-Term Memory (LSTM) layer or other architectures for sequences and time series, as the history of the game does not matter.\n",
    "\n",
    "Therefore, the input size of the network is (8 x 8 x 2) and the output size is 1, representing the expected reward for the proposed action in the current state of the game board."
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "class DQLNet(nn.Module):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.fc1 = nn.Linear(8 * 8 * 2, 128 * 2)\n",
    "        self.fc2 = nn.Linear(128 * 2, 128 * 3)\n",
    "        self.fc3 = nn.Linear(128 * 3, 128 * 2)\n",
    "        self.fc4 = nn.Linear(128 * 2, 1)\n",
    "\n",
    "    def forward(self, x):\n",
    "        if isinstance(x, np.ndarray):\n",
    "            x = torch.from_numpy(x).float()\n",
    "        x = torch.flatten(x, 1)\n",
    "        x = self.fc1(x)\n",
    "        x = functional.relu(x)\n",
    "        x = self.fc2(x)\n",
    "        x = functional.relu(x)\n",
    "        x = self.fc3(x)\n",
    "        x = functional.relu(x)\n",
    "        x = self.fc4(x)\n",
    "        x = torch.tanh(x)\n",
    "        return x\n",
    "\n",
    "\n",
    "class DQLSimple(nn.Module):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.fc1 = nn.Linear(8 * 8 * 2, 64 * 3)\n",
    "        self.fc2 = nn.Linear(64 * 3, 128 * 2)\n",
    "        self.fc3 = nn.Linear(128 * 2, 1)\n",
    "\n",
    "    def forward(self, x):\n",
    "        if isinstance(x, np.ndarray):\n",
    "            x = torch.from_numpy(x).float()\n",
    "        x = torch.flatten(x, 1)\n",
    "        x = self.fc1(x)\n",
    "        x = functional.relu(x)\n",
    "        x = self.fc2(x)\n",
    "        x = functional.relu(x)\n",
    "        x = self.fc3(x)\n",
    "        x = torch.tanh(x)\n",
    "        return x\n",
    "\n",
    "\n",
    "assert DQLNet().forward(np.zeros((5, 2, 8, 8))).shape == (5, 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "class SymmetryMode(Enum):\n",
    "    MULTIPLY = \"MULTIPLY\"\n",
    "    BREAK_SEQUENCE = \"BREAK_SEQUENCE\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "_board_history, _action_history = simulate_game(100, (RandomPolicy(1), RandomPolicy(1)))\n",
    "_board_history.shape, _action_history.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def action_to_q_learning_format(\n",
    "    board_history: np.ndarray, action_history: np.ndarray\n",
    ") -> np.ndarray:\n",
    "    q_learning_format = np.zeros(\n",
    "        (SIMULATE_TURNS, board_history.shape[1], 2, 8, 8), dtype=float\n",
    "    )\n",
    "    q_learning_format[:, :, 0, :, :] = board_history\n",
    "    q_learning_format[:, :, 1, :, :] = -1\n",
    "\n",
    "    game_index = list(range(board_history.shape[1]))\n",
    "    for turn_index in range(SIMULATE_TURNS):\n",
    "        q_learning_format[\n",
    "            turn_index,\n",
    "            game_index,\n",
    "            1,\n",
    "            action_history[turn_index, game_index, 0],\n",
    "            action_history[turn_index, game_index, 1],\n",
    "        ] = 1\n",
    "    return q_learning_format\n",
    "\n",
    "\n",
    "%timeit action_to_q_learning_format(_board_history, _action_history)\n",
    "%memit action_to_q_learning_format(_board_history, _action_history)\n",
    "\n",
    "\n",
    "plot_othello_boards(\n",
    "    action_to_q_learning_format(_board_history, _action_history)[:8, 0, 0]\n",
    ")\n",
    "plot_othello_boards(\n",
    "    action_to_q_learning_format(_board_history, _action_history)[:8, 0, 1]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def build_symetry_action(\n",
    "    board_history: np.ndarray, action_history: np.ndarray\n",
    ") -> np.ndarray:\n",
    "    board_history = board_history.copy()\n",
    "    board_history[1::2] *= -1\n",
    "    q_learning_format = np.zeros(\n",
    "        (2, 2, 2, SIMULATE_TURNS, board_history.shape[1], 2, 8, 8)\n",
    "    )\n",
    "    q_learning_format[0, 0, 0, :, :, :, :, :] = action_to_q_learning_format(\n",
    "        board_history, action_history\n",
    "    )\n",
    "    q_learning_format[1, 0, 0, :, :, :, :, :] = np.transpose(\n",
    "        q_learning_format[0, 0, 0, :, :, :, :, :], [0, 1, 2, 4, 3]\n",
    "    )\n",
    "    q_learning_format[:, 1, 0, :, :, :, :, :] = q_learning_format[\n",
    "        :, 0, 0, :, :, :, ::-1, :\n",
    "    ]\n",
    "    q_learning_format[:, :, 1, :, :, :, :, :] = q_learning_format[\n",
    "        :, :, 0, :, :, :, :, ::-1\n",
    "    ]\n",
    "    return q_learning_format\n",
    "\n",
    "\n",
    "%timeit build_symetry_action(_board_history, _action_history)\n",
    "%memit build_symetry_action(_board_history, _action_history)\n",
    "build_symetry_action(_board_history, _action_history).shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {
    "tags": [],
    "ExecuteTime": {
     "start_time": "2023-03-31T00:14:24.370778Z",
     "end_time": "2023-03-31T00:14:24.420801Z"
    }
   },
   "outputs": [],
   "source": [
    "def live_history(training_history: pd.DataFrame, ai_name: str | None):\n",
    "    training_history.index = training_history.index.to_series().apply(\n",
    "        lambda x: \"{0} {1}\".format(*x)\n",
    "    )\n",
    "    clear_output(wait=True)\n",
    "    fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(14, 7))\n",
    "    _ = (\n",
    "        training_history[\n",
    "            [c for c in training_history.columns if c.endswith(\"final_score\")]\n",
    "        ]\n",
    "        .rename(lambda c: c.split(\" \")[0], axis=1)\n",
    "        .plot(ax=ax1)\n",
    "    )\n",
    "    plt.title(\"Final score\")\n",
    "    ax1.xlabel(\"epochs\")\n",
    "    _ = (\n",
    "        training_history[[c for c in training_history.columns if c.endswith(\"win\")]]\n",
    "        .rename(lambda c: c.split(\" \")[0], axis=1)\n",
    "        .plot(ax=ax2)\n",
    "    )\n",
    "    ax2.title(\"Win score\")\n",
    "    ax2.xlabel(\"epochs\")\n",
    "    _ = (\n",
    "        np.sqrt(\n",
    "            training_history[\n",
    "                [c for c in training_history.columns if c.startswith(\"loss\")]\n",
    "            ]\n",
    "        )\n",
    "        .rename(lambda c: c.split(\" \")[0], axis=1)\n",
    "        .plot(ax=ax3)\n",
    "    )\n",
    "    ax3.set_yscale(\"log\")\n",
    "    ax3.xlabel(\"epochs\")\n",
    "\n",
    "    ax3.title(\"Loss\")\n",
    "    fig.suptitle(f\"Training history of {ai_name}\" if ai_name else f\"Training history\")\n",
    "    fig.tight_layout()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class QLPolicy(GamePolicy):\n",
    "    # noinspection PyProtectedMember\n",
    "    def __init__(\n",
    "        self,\n",
    "        epsilon: float,\n",
    "        neural_network: DQLNet | DQLSimple,\n",
    "        symmetry_mode: SymmetryMode,\n",
    "        gamma: float = 0.8,\n",
    "        who_won_fraction: float = 0,\n",
    "        final_score_fraction: float = 0,\n",
    "        optimizer: torch.optim.Optimizer | None = None,\n",
    "        loss: nn.modules.loss._Loss | None = None,\n",
    "    ):\n",
    "        super().__init__(epsilon)\n",
    "        assert 0 <= gamma <= 1\n",
    "        self.gamma: float = gamma\n",
    "        del gamma\n",
    "        self.symmetry_mode: SymmetryMode = symmetry_mode\n",
    "        del symmetry_mode\n",
    "        self.neural_network: DQLNet | DQLSimple = neural_network\n",
    "        del neural_network\n",
    "        self.who_won_fraction: float = who_won_fraction\n",
    "        del who_won_fraction\n",
    "        self.final_score_fraction: float = final_score_fraction\n",
    "        del final_score_fraction\n",
    "\n",
    "        if optimizer is None:\n",
    "            self.optimizer = torch.optim.Adam(self.neural_network.parameters(), lr=5e-5)\n",
    "        else:\n",
    "            self.optimizer = optimizer\n",
    "        if loss is None:\n",
    "            self.loss = nn.MSELoss()\n",
    "        else:\n",
    "            self.loss = loss\n",
    "        self.training_results: list[dict[tuple[str, str], float]] = []\n",
    "        self.loss_history: list[float] = []\n",
    "\n",
    "    @property\n",
    "    def policy_name(self) -> str:\n",
    "        symmetry_name = {SymmetryMode.MULTIPLY: \"M\", SymmetryMode.BREAK_SEQUENCE: \"B\"}\n",
    "        g = f\"{self.gamma:.1f}\".replace(\".\", \"\")\n",
    "        ww = f\"{self.who_won_fraction:.1f}\".replace(\".\", \"\")\n",
    "        fsf = f\"{self.final_score_fraction:.1f}\".replace(\".\", \"\")\n",
    "        return f\"QL-{symmetry_name[self.symmetry_mode]}-G{g}-WW{ww}-FSF{fsf}-{self.neural_network.__class__.__name__}-{self.loss.__class__.__name__}\"\n",
    "\n",
    "    def __repr__(self) -> str:\n",
    "        return self.policy_name\n",
    "\n",
    "    def __str__(self) -> str:\n",
    "        return self.policy_name\n",
    "\n",
    "    def _internal_policy(self, boards: np.ndarray) -> np.ndarray:\n",
    "        results = np.zeros_like(boards, dtype=float)\n",
    "        results = torch.from_numpy(results).float()\n",
    "        q_learning_boards = np.zeros((boards.shape[0], 2, 8, 8))\n",
    "        q_learning_boards[:, 0, :, :] = boards\n",
    "        poss_turns = boards == 0  # checks where fields are empty.\n",
    "        poss_turns &= binary_dilation(boards == -1, SURROUNDING)\n",
    "        turn_possible = np.any(poss_turns, axis=0)\n",
    "        for action_x, action_y in itertools.product(range(8), range(8)):\n",
    "            if not turn_possible[action_x, action_y]:\n",
    "                continue\n",
    "            _q_learning_board = q_learning_boards[\n",
    "                poss_turns[range(boards.shape[0]), action_x, action_y]\n",
    "            ].copy()\n",
    "            _q_learning_board[\n",
    "                range(_q_learning_board.shape[0]), 1, action_x, action_y\n",
    "            ] = 1\n",
    "\n",
    "            ql_result = self.neural_network.forward(_q_learning_board)\n",
    "            results[poss_turns[:, action_x, action_y], action_x, action_y] = (\n",
    "                ql_result.reshape(-1) + 0.1\n",
    "            )\n",
    "        return results.cpu().detach().numpy()\n",
    "\n",
    "    def generate_trainings_data(\n",
    "        self, generate_data_size: int\n",
    "    ) -> tuple[torch.Tensor, torch.Tensor]:\n",
    "        train_boards, train_actions = simulate_game(generate_data_size, (self, self))\n",
    "        action_possible = ~np.all(train_actions[:, :] == -1, axis=2)\n",
    "        q_leaning_formatted_action = build_symetry_action(train_boards, train_actions)\n",
    "        q_rewords = calculate_q_reword(\n",
    "            board_history=train_boards,\n",
    "            who_won_fraction=self.who_won_fraction,\n",
    "            final_score_fraction=self.final_score_fraction,\n",
    "        )\n",
    "        q_rewords[1::2, :] *= -1\n",
    "        if self.symmetry_mode == SymmetryMode.MULTIPLY:\n",
    "            new_q_rewords = np.zeros((2, 2, 2) + q_rewords.shape)\n",
    "            for i, k, j in itertools.product((0, 1), (0, 1), (0, 1)):\n",
    "                new_q_rewords[i, k, j] = q_rewords\n",
    "            q_rewords = new_q_rewords\n",
    "            action_possible = np.array([action_possible] * 8).reshape(-1)\n",
    "\n",
    "        elif self.symmetry_mode == SymmetryMode.BREAK_SEQUENCE:\n",
    "            axis1 = np.random.randint(0, high=2, size=SIMULATE_TURNS, dtype=int)\n",
    "            axis2 = np.random.randint(0, high=2, size=SIMULATE_TURNS, dtype=int)\n",
    "            axis3 = np.random.randint(0, high=2, size=SIMULATE_TURNS, dtype=int)\n",
    "            q_leaning_formatted_action = q_leaning_formatted_action[\n",
    "                axis1, axis2, axis3, range(SIMULATE_TURNS)\n",
    "            ]\n",
    "            action_possible = action_possible.reshape(-1)\n",
    "\n",
    "        return (\n",
    "            torch.from_numpy(\n",
    "                q_leaning_formatted_action.reshape(-1, 2, BOARD_SIZE, BOARD_SIZE)[\n",
    "                    action_possible\n",
    "                ]\n",
    "            ).float(),\n",
    "            torch.from_numpy(q_rewords.reshape(-1, 1)[action_possible]).float(),\n",
    "        )\n",
    "\n",
    "    def train_batch(self, nr_of_games: int) -> None:\n",
    "        x_train, y_train = self.generate_trainings_data(nr_of_games)\n",
    "        y_pred = self.neural_network.forward(x_train)\n",
    "        loss_score = self.loss(y_pred, y_train)\n",
    "        self.optimizer.zero_grad()\n",
    "        self.loss_history.append(loss_score.item())\n",
    "        loss_score.backward()\n",
    "        # Update the parameters\n",
    "        self.optimizer.step()\n",
    "        # generate trainings data\n",
    "\n",
    "    def evaluate_model(self, compare_models: list[GamePolicy], nr_of_games: int):\n",
    "        result_dict: dict[tuple[str, str], float] = {}\n",
    "        eval_copy = copy.copy(self)\n",
    "        eval_copy._epsilon = 1\n",
    "        for model in compare_models:\n",
    "            boards_black, _ = simulate_game(nr_of_games, (eval_copy, model))\n",
    "            boards_white, _ = simulate_game(nr_of_games, (model, eval_copy))\n",
    "            win_eval_white = calculate_who_won(boards_white)\n",
    "            win_eval_black = calculate_who_won(boards_black)\n",
    "            result_dict[(model.policy_name, \"final_score\")] = (\n",
    "                float(\n",
    "                    np.mean(\n",
    "                        calculate_final_evaluation_for_history(boards_black)\n",
    "                        + (calculate_final_evaluation_for_history(boards_white) * -1)\n",
    "                    )\n",
    "                )\n",
    "                * 64\n",
    "            )\n",
    "            result_dict[(model.policy_name, \"white_win\")] = (\n",
    "                np.sum(win_eval_white == -1) / nr_of_games\n",
    "            )\n",
    "            result_dict[(model.policy_name, \"white_lose\")] = (\n",
    "                np.sum(win_eval_white == 1) / nr_of_games\n",
    "            )\n",
    "            result_dict[(model.policy_name, \"black_win\")] = (\n",
    "                np.sum(win_eval_black == 1) / nr_of_games\n",
    "            )\n",
    "            result_dict[(model.policy_name, \"black_lose\")] = (\n",
    "                np.sum(win_eval_black == -1) / nr_of_games\n",
    "            )\n",
    "\n",
    "        result_dict[(\"loss\", \"mean\")] = float(np.mean(np.array(self.loss_history)))\n",
    "        result_dict[(\"loss\", \"min\")] = np.min(np.array(self.loss_history))\n",
    "        result_dict[(\"loss\", \"max\")] = np.max(np.array(self.loss_history))\n",
    "        self.loss_history = []\n",
    "        result_dict[(\"base\", \"base\")] = nr_of_games\n",
    "        return result_dict\n",
    "\n",
    "    def save(self):\n",
    "        filename: str = f\"{self.policy_name}-{len(self.training_results)}\"\n",
    "        with open(TRAINING_RESULT_PATH / Path(f\"{filename}.pickle\"), \"wb\") as f:\n",
    "            pickle.dump(self.training_results, f)\n",
    "        torch.save(\n",
    "            self.neural_network.state_dict(),\n",
    "            TRAINING_RESULT_PATH / Path(f\"{filename}.torch\"),\n",
    "        )\n",
    "\n",
    "    def load(self):\n",
    "        pickle_files = glob.glob(f\"{TRAINING_RESULT_PATH}/{self.policy_name}-*.pickle\")\n",
    "        torch_files = glob.glob(f\"{TRAINING_RESULT_PATH}/{self.policy_name}-*.torch\")\n",
    "\n",
    "        assert len(pickle_files) == len(torch_files)\n",
    "        if not pickle_files:\n",
    "            return\n",
    "\n",
    "        pickle_dict = {\n",
    "            int(file.split(\"-\")[-1].split(\".\")[0]): file for file in pickle_files\n",
    "        }\n",
    "        torch_dict = {\n",
    "            int(file.split(\"-\")[-1].split(\".\")[0]): file for file in torch_files\n",
    "        }\n",
    "        pickle_file = pickle_dict[max(pickle_dict.keys())]\n",
    "        torch_file = torch_dict[max(torch_dict.keys())]\n",
    "\n",
    "        with open(pickle_file, \"rb\") as f:\n",
    "            self.training_results = pickle.load(f)\n",
    "\n",
    "        self.neural_network.load_state_dict(torch.load(Path(torch_file)))\n",
    "\n",
    "    def train(\n",
    "        self,\n",
    "        epochs: int,\n",
    "        batches: int,\n",
    "        batch_size: int,\n",
    "        eval_batch_size: int,\n",
    "        compare_with: list[GamePolicy],\n",
    "        save_every_epoch: bool = True,\n",
    "        live_plot: bool = True,\n",
    "    ) -> pd.DataFrame:\n",
    "        assert epochs > 0\n",
    "        epoch_progress = tqdm(range(epochs), unit=\"epoch\")\n",
    "        for _ in epoch_progress:\n",
    "            for _ in tqdm(range(batches), unit=\"batch\"):\n",
    "                self.train_batch(batch_size)\n",
    "            self.training_results.append(\n",
    "                self.evaluate_model(compare_with, eval_batch_size)\n",
    "            )\n",
    "            if save_every_epoch:\n",
    "                self.save()\n",
    "            if live_plot:\n",
    "                self.plot_history()\n",
    "            display(epoch_progress.container)\n",
    "        return self.history\n",
    "\n",
    "    def plot_history(self) -> None:\n",
    "        if not self.training_results:\n",
    "            return None\n",
    "        return live_history(self.history, None)\n",
    "\n",
    "    @property\n",
    "    def history(self) -> pd.DataFrame:\n",
    "        if not self.training_results:\n",
    "            return pd.DataFrame()\n",
    "        pandas_result = pd.DataFrame(self.training_results)\n",
    "        pandas_result.columns = pd.MultiIndex.from_tuples(pandas_result.columns)\n",
    "        return pandas_result\n",
    "\n",
    "\n",
    "ql_policy1 = QLPolicy(\n",
    "    0.95,\n",
    "    neural_network=DQLNet(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.8,\n",
    "    who_won_fraction=1,\n",
    "    final_score_fraction=0,\n",
    ")\n",
    "\n",
    "assert copy.copy(ql_policy1) is not ql_policy1\n",
    "assert copy.copy(ql_policy1).neural_network is ql_policy1.neural_network\n",
    "\n",
    "# noinspection PyProtectedMember\n",
    "t1, t2 = ql_policy1._internal_policy(get_new_games(2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Symmetry debug"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "_train_boards, _train_actions = simulate_game(10, (RandomPolicy(0), RandomPolicy(0)))\n",
    "_action_possible = ~np.all(_train_actions[:, :] == -1, axis=2)\n",
    "_q_leaning_formatted_action = action_to_q_learning_format(_train_boards, _train_actions)\n",
    "_train_boards.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ql_policy1 = QLPolicy(\n",
    "    0.92,\n",
    "    neural_network=DQLSimple(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.9,\n",
    "    who_won_fraction=0,\n",
    "    final_score_fraction=1,\n",
    ")\n",
    "ql_policy1.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ql_policies = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ql_policy1 = QLPolicy(\n",
    "    0.92,\n",
    "    neural_network=DQLSimple(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.9,\n",
    "    who_won_fraction=0,\n",
    "    final_score_fraction=1,\n",
    ")\n",
    "ql_policies.append(ql_policy1)\n",
    "ql_policy1.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ql_policy2 = QLPolicy(\n",
    "    0.92,\n",
    "    neural_network=DQLSimple(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.8,\n",
    "    who_won_fraction=0,\n",
    "    final_score_fraction=1,\n",
    ")\n",
    "ql_policies.append(ql_policy2)\n",
    "ql_policy2.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ql_policy3 = QLPolicy(\n",
    "    0.92,\n",
    "    neural_network=DQLSimple(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=1,\n",
    "    who_won_fraction=0,\n",
    "    final_score_fraction=1,\n",
    ")\n",
    "ql_policies.append(ql_policy3)\n",
    "ql_policy3.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ql_policy4 = QLPolicy(\n",
    "    0.92,\n",
    "    neural_network=DQLSimple(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.9,\n",
    "    who_won_fraction=1,\n",
    "    final_score_fraction=0,\n",
    ")\n",
    "ql_policies.append(ql_policy4)\n",
    "ql_policy4.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "ql_policy5 = QLPolicy(\n",
    "    0.95,\n",
    "    neural_network=DQLSimple(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.9,\n",
    "    who_won_fraction=0.3,\n",
    "    final_score_fraction=0.3,\n",
    ")\n",
    "ql_policies.append(ql_policy5)\n",
    "ql_policy5.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "ql_policy6 = QLPolicy(\n",
    "    0.95,\n",
    "    neural_network=DQLSimple(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.92,\n",
    "    who_won_fraction=0.3,\n",
    "    final_score_fraction=0.65,\n",
    ")\n",
    "ql_policies.append(ql_policy6)\n",
    "ql_policy6.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ql_policy7 = QLPolicy(\n",
    "    0.95,\n",
    "    neural_network=DQLSimple(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.92,\n",
    "    who_won_fraction=0.2,\n",
    "    final_score_fraction=0.65,\n",
    ")\n",
    "ql_policies.append(ql_policy7)\n",
    "ql_policy7.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ql_policy8 = QLPolicy(\n",
    "    0.92,\n",
    "    neural_network=DQLNet(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.8,\n",
    "    who_won_fraction=0,\n",
    "    final_score_fraction=1,\n",
    ")\n",
    "ql_policies.append(ql_policy8)\n",
    "ql_policy8.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ql_policy9 = QLPolicy(\n",
    "    0.92,\n",
    "    neural_network=DQLNet(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=1,\n",
    "    who_won_fraction=0,\n",
    "    final_score_fraction=1,\n",
    ")\n",
    "ql_policies.append(ql_policy9)\n",
    "ql_policy9.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ql_policy10 = QLPolicy(\n",
    "    0.92,\n",
    "    neural_network=DQLNet(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.9,\n",
    "    who_won_fraction=1,\n",
    "    final_score_fraction=0,\n",
    ")\n",
    "ql_policies.append(ql_policy10)\n",
    "ql_policy10.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "ql_policy11 = QLPolicy(\n",
    "    0.95,\n",
    "    neural_network=DQLNet(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.9,\n",
    "    who_won_fraction=0.3,\n",
    "    final_score_fraction=0.3,\n",
    ")\n",
    "ql_policies.append(ql_policy11)\n",
    "ql_policy11.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "outputs": [],
   "source": [
    "ql_policy12 = QLPolicy(\n",
    "    0.95,\n",
    "    neural_network=DQLNet(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.92,\n",
    "    who_won_fraction=0.3,\n",
    "    final_score_fraction=0.65,\n",
    ")\n",
    "ql_policies.append(ql_policy12)\n",
    "ql_policy12.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ql_policy13 = QLPolicy(\n",
    "    0.95,\n",
    "    neural_network=DQLNet(),\n",
    "    symmetry_mode=SymmetryMode.MULTIPLY,\n",
    "    gamma=0.92,\n",
    "    who_won_fraction=0.2,\n",
    "    final_score_fraction=0.65,\n",
    ")\n",
    "ql_policies.append(ql_policy13)\n",
    "ql_policy13.policy_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "probes: int = 1000\n",
    "_ = (\n",
    "    calculate_board_branching(simulate_game(probes, (ql_policy1, ql_policy1))[0])\n",
    "    / probes\n",
    ").plot(\n",
    "    ylim=(0, 1),\n",
    "    title=f\"Branching rate for a QL policy with epsilon={ql_policy1.epsilon}\",\n",
    ")"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "constant_metric_policies = [RandomPolicy(0), GreedyPolicy(0)]"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "raw",
   "source": [
    "for i in range(100):\n",
    "    for ql_policy in ql_policys:\n",
    "        ql_policy.load()\n",
    "        ql_policy.train(1, 10, 1000, 250, constant_metric_policies)"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "for ql_policy in ql_policies:\n",
    "    ql_policy.load()\n",
    "    ql_policy.plot_history()"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "policy_list = constant_metric_policies + ql_policies\n",
    "\n",
    "RESULTS_FILE: Final[str] = \"results.pickle\"\n",
    "if not os.path.exists(RESULTS_FILE):\n",
    "    result_df = pd.DataFrame(\n",
    "        index=[policy.policy_name for policy in policy_list],\n",
    "        columns=[policy.policy_name for policy in policy_list],\n",
    "    )\n",
    "else:\n",
    "    result_df = pd.read_pickle(RESULTS_FILE)\n",
    "nr_of_eval_games = 2000\n",
    "for policy1, policy2 in tqdm(list(itertools.product(policy_list, policy_list))):\n",
    "    if not pd.isna(result_df.at[policy1.policy_name, policy2.policy_name]):\n",
    "        continue\n",
    "    _result_dict = {}\n",
    "    _boards_black, _ = simulate_game(nr_of_eval_games, (policy1, policy2))\n",
    "    _win_eval_black = calculate_who_won(_boards_black)\n",
    "    _result_dict[\"final_score\"] = float(\n",
    "        np.mean(calculate_final_evaluation_for_history(_boards_black))\n",
    "    )\n",
    "    _result_dict[\"win\"] = np.sum(_win_eval_black == 1) / nr_of_eval_games\n",
    "    _result_dict[\"lose\"] = np.sum(_win_eval_black == -1) / nr_of_eval_games\n",
    "    result_df.at[policy1.policy_name, policy2.policy_name] = _result_dict\n",
    "    result_df.to_pickle(RESULTS_FILE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "result_df.applymap(lambda x: x[\"win\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_df.applymap(lambda x: x[\"final_score\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "boards_and_actions, _score = ql_policy.generate_trainings_data(1)\n",
    "print(boards_and_actions.shape)\n",
    "print(_score.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "boards_and_actions.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "plot_othello_boards(boards_and_actions[:8, 0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "_score[:8, 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "plot_othello_boards(boards1[:60, 0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train a model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sources\n",
    "\n",
    "* Game rules and example board images [https://en.wikipedia.org/wiki/Reversi](https://en.wikipedia.org/wiki/Reversi)\n",
    "* Game rules and example game images [https://de.wikipedia.org/wiki/Othello_(Spiel)](https://de.wikipedia.org/wiki/Othello_(Spiel))\n",
    "* Game strategy examples [https://de.wikipedia.org/wiki/Computer-Othello](https://de.wikipedia.org/wiki/Computer-Othello)\n",
    "* Image for 8 directions [https://www.researchgate.net/journal/EURASIP-Journal-on-Image-and-Video-Processing-1687-5281](https://www.researchgate.net/journal/EURASIP-Journal-on-Image-and-Video-Processing-1687-5281)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "\n",
    "def sizeof_fmt(num, suffix=\"B\"):\n",
    "    \"\"\"by Fred Cirera,  https://stackoverflow.com/a/1094933/1870254, modified\"\"\"\n",
    "    for unit in [\"\", \"Ki\", \"Mi\", \"Gi\", \"Ti\", \"Pi\", \"Ei\", \"Zi\"]:\n",
    "        if abs(num) < 1024.0:\n",
    "            return \"%3.1f %s%s\" % (num, unit, suffix)\n",
    "        num /= 1024.0\n",
    "    return \"%.1f %s%s\" % (num, \"Yi\", suffix)\n",
    "\n",
    "\n",
    "for name, size in sorted(\n",
    "    ((name, sys.getsizeof(value)) for name, value in list(locals().items())),\n",
    "    key=lambda x: -x[1],\n",
    ")[:20]:\n",
    "    print(\"{:>30}: {:>8}\".format(name, sizeof_fmt(size)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Write story about mixed oreder!\n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  },
  "toc-autonumbering": true,
  "toc-showcode": false
 },
 "nbformat": 4,
 "nbformat_minor": 4
}