Added most of the lost text back in again.

Philipp Horstenkamp 2023-03-31 00:15:27 +02:00
parent 25187122d8
commit a18cf0beb6
Signed by: Philipp
GPG Key ID: DD53EAC36AFB61B4


@ -20,55 +20,27 @@
"source": [
"## The game rules\n",
"\n",
"Othello is played on a board with 8 x 8 fields for two player.\n",
"The board geometry is equal to a chess game.\n",
"The game is played with game stones that are black on one siede and white on the other.\n",
"Othello is a turn-based, two-player board game played on an 8 x 8 board, with a similar geometry to a chess game. The game pieces are black on one side and white on the other.\n",
"\n",
"![Othello game board example](reversi_example.png)\n",
"\n",
"The player take turns.\n",
"A player places a stone with his or her color up on the game board.\n",
"The player can only place stones when he surrounds a number of stones with the opponents color with the new stone and already placed stones of his color.\n",
"Those surrounded stones can either be horizontally, vertically and/or diagonally be placed.\n",
"All stones thus surrounded will be flipped to be of the players color.\n",
"Turns are only possible if the player is also changing the color of the opponents stones. If a player can't act he is skipped.\n",
"The game ends if both players can't act. The player with the most stones wins.\n",
"If the score is counted in detail unclaimed fields go to the player with more stones of his or her color on the board.\n",
"The game begins with four stones places in the center of the game. Each player gets two. They are placed diagonally to each other.\n",
"The players take turns placing their stones on the board, and the objective is to surround the opponent's stones with your own stones. A player can only place a stone when it surrounds at least one of the opponent's stones with their own stones, either horizontally, vertically, or diagonally. When a player places a stone, all the surrounded stones will flip to become the player's color. If a player cannot make a move, they are skipped. The game ends when both players cannot make any more moves. The player with the most stones on the board wins, and any unclaimed fields go to the player with the most stones of their color on the board. The game starts with four stones placed in the center of the board, with each player getting two, which are placed diagonally opposite to each other.\n",
"\n",
"\n",
"<img alt=\"Startaufstellung.png\" src=\"Startaufstellung.png\"/>\n",
"\n",
"## Some common Othello strategies\n",
"\n",
"As can be easily understood the placement of stones and on the bord is always a careful balance of attack and defence.\n",
"If the player occupies huge homogenous stretches on the board it can be attacked easier.\n",
"The boards corners provide safety from wich occupied territory is impossible to loos but since it is only possible to reach the corners if the enemy is forced to allow this or calculates the cost of giving a stable base to the enemy it is difficult to obtain.\n",
"There are some text on otello computer strategies which implement greedy algorithms for reversi based on a modified score to each field.\n",
"Those different values are score modifiers for a traditional greedy algorithm.\n",
"If a players stone has captured such a filed the score reached is multiplied by the modifier.\n",
"The total score is the score reached by the player subtracted with the score of the enemy.\n",
"The scores change in the course of the game and converges against one. This gives some indications of what to expect from an Othello AI.\n",
"The placement of stones on the board is always a careful balance of attack and defense. Occupying large homogeneous stretches on the board can make it easier for the opponent to attack. The board's corners provide safety, from which occupied territory is impossible to lose, but they are difficult to obtain. The enemy must be forced to allow reaching the corners or calculate the cost of giving a stable base to the opponent. Some Othello computer strategies implement greedy algorithms based on a modified score for each field. Different values serve as score modifiers for a traditional greedy algorithm. When a player's stone captures a field, the score reached is multiplied by the modifier. The total score is the score reached by the player minus the score of the opponent. The scores change during the game and converge towards one, which gives some indications of what to expect from an Othello AI.\n",
"\n",
"<img alt=\"ComputerPossitionScore\" src=\"computer-score.png\"/>\n",
"\n",
"\n",
"<img alt=\"ComputerPossitionScore\" src=\"computer-score.png\"/>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initial design decisions\n",
"\n",
"At the beginning of this project I made some design decisions.\n",
"The first onw was that I do not want to use a gym library because it limits the data formats accessible.\n",
"I choose to implement the hole game as entry in a stack in numpy arrays to be able to accommodate interfacing with a neural network easier and to use scipy pattern recognition tools to implement some game mechanics for a fast simulation cycle.\n",
"I chose to ignore player colors as far as I could instead a player perspective was used. Which allowed to change the perspective with a flipping of the sign. (multiplying with -1).\n",
"The array format should also allow for data multiplication or the breaking of strikt sequences by flipping the game along one the for axis, (horizontal, vertical, transpose along both diagonals).\n",
"At the beginning of this project, I made some design decisions. The first one was that I did not want to use a gym library because it limits the data formats accessible. I chose to implement the whole game as an entry in a stack of NumPy arrays to be able to accommodate interfacing with a neural network easier and to use SciPy pattern recognition tools to implement some game mechanics for a fast simulation cycle. In the array format, stones from the player are marked as 1, and stones by the enemy are marked as -1. I chose to ignore player colors as far as I could; instead, a player perspective was used, which allowed changing the perspective with a flipping of the sign (multiplying with -1). The array format should also allow for data multiplication or the breaking of strict sequences by flipping the game along one of the four axes (horizontal, vertical, transpose along both diagonals).\n",
"\n",
"I wanted to implement different agents as classes that act on those game stacks.\n",
"\n",
"Since computation time is critical all computational have results are saved.\n",
"The analysis of those is then repeated in real time. If a recalculation of such a section is required the save file can be deleted and the code should be executed again."
"I wanted to implement different agents as classes that act on those game stacks. Since computation time is critical, all computational results are saved. The analysis of those is then repeated in real-time. If a recalculation of such a section is required, the save file can be deleted, and the code should be executed again.\n"
]
},
{
@ -579,8 +551,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the computation of this game more feasable the `lru_cache` decorater was used. LRU uses a hash of the arguments to looke up and return a prevusly calculated result of a computationaly heavy operation. For this a code snippet was modified. Numpy arrays are mutable and unhashable. So the decorater was layed into a converion to tuples, use of the caching layer and the back conversion to numpy arrays.\n",
"Those then are converted back into numpy arrays. This reduces the calculation time to 30% of what it is without when calculating possible actions to take."
"To optimize the computation of this game, the `lru_cache` decorator was utilized. LRU cache stores the hash of the arguments and returns the previously calculated result of a computationally heavy operation. However, since Numpy arrays are mutable and unhashable, a code snippet was modified to include conversion to tuples, caching layer, and reconversion to Numpy arrays. This allows for the caching to be implemented. As a result, the calculation time of possible actions to take was reduced to only 30% of the time it takes without the lru_cache decorator."
]
},
{
@ -2172,6 +2143,15 @@
"metadata": {},
"source": [
"### Calculating rewords\n",
"\n",
"To calculate the rewards for the Reversi AI, we will use the $\\gamma$ values to combine the rewards obtained during the game with the functions calculate_direct_score, calculate_final_evaluation_for_history, and calculate_who_won.\n",
"\n",
"The rewards obtained will be used to build a weighted sum of rewards, where most of the rewards are terminal rewards awarded at the end of the game and discounted back over the course of the game. The sum of the reward weights is always 1, with the third value calculated from the first two.\n",
"\n",
"The direct score is the only part of the reward that is awarded before the terminal reward. This setup allows for experimentation with different types of rewards to train the model, with different definitions of what is considered \"best\" depending on factors such as initial startup time, stability, and quality of results.\n",
"\n",
"Although $Q^\\pi$ depends on state and action, the rewards are returned and do not require a specified action to be given, since the action is implied by the structure of the data.\n",
"\n",
"\n"
]
},
@ -2187,13 +2167,13 @@
" final_score_fraction: float = 0.2,\n",
" gamma: float = 0.8,\n",
") -> np.ndarray:\n",
" \"\"\"\n",
" \"\"\"Calculates a Q reword for a stack of states.\n",
"\n",
" Args:\n",
" board_history:\n",
" who_won_fraction:\n",
" final_score_fraction:\n",
" gamma:\n",
" board_history: A stack ob board histories to calculate q_rewords for.\n",
" who_won_fraction: This factor describes how the winner of the game should be weighted. Expected value is in [0, 1].\n",
" final_score_fraction: This factor describes how important the final score of the game should be weighted. Expected value is in [0, 1].\n",
" gamma: The discount value fo all turns that had a choice.\n",
" \"\"\"\n",
" assert who_won_fraction + final_score_fraction <= 1\n",
" assert final_score_fraction >= 0\n",
@ -2214,13 +2194,30 @@
" values = gama_table[turn] * combined_score[turn]\n",
" combined_score[turn - 1] += values\n",
"\n",
" return combined_score\n",
"\n",
"\n",
" return combined_score"
]
},
{
"cell_type": "markdown",
"source": [
"The calculated q_learning rewords look than as shown below. For the different distributions of rewords by factor."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"calculate_q_reword(\n",
" _board_history, gamma=0.7, who_won_fraction=0, final_score_fraction=1\n",
")"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
@ -2244,6 +2241,21 @@
")[:, 0] * 64"
]
},
{
"cell_type": "markdown",
"source": [
"#### Proposed ANN Structures\n",
"\n",
"For the purpose of creating an AI to play the board game Otello, we propose a specific structure for an Artificial Neural Network (ANN) that can be trained to make intelligent moves in the game.\n",
"\n",
"##### Layer Initialization Function\n",
"\n",
"When training the ANN, it is important to initialize the weights and biases of its layers before starting the training process. This can be done using various techniques such as random initialization or using pre-trained weights. The choice of initialization function can have an impact on the training speed and the quality of the learned model. Therefore, we will carefully select an appropriate initialization function to ensure the best possible training outcome for our Otello AI."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
@ -2268,6 +2280,21 @@
" m.bias.data.fill_(0)"
]
},
{
"cell_type": "markdown",
"source": [
"The architecture of an Artificial Neural Network (ANN) is defined by its inputs and outputs. In the case of Q-Learning ANNs for the board game Otello, the network should calculate the expected reward for the current state and a proposed action.\n",
"\n",
"The state of the game board is represented by a 8x8 array, where each cell can be either empty, occupied by a black stone, or occupied by a white stone. The proposed action is represented by an array of equal size, using one-hot encoding to indicate the cell where a stone is proposed to be placed.\n",
"\n",
"The output of the ANN should be a value between -1 and 1, which suggests using a $\\tanh$ activation function. Since the game board is small and the position of the stones is crucial, it was decided to use only conventional layers, without any Long Short-Term Memory (LSTM) layer or other architectures for sequences and time series, as the history of the game does not matter.\n",
"\n",
"Therefore, the input size of the network is (8 x 8 x 2) and the output size is 1, representing the expected reward for the proposed action in the current state of the game board."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
@ -2426,33 +2453,53 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 106,
"metadata": {
"tags": []
"tags": [],
"ExecuteTime": {
"start_time": "2023-03-31T00:14:24.370778Z",
"end_time": "2023-03-31T00:14:24.420801Z"
}
},
"outputs": [],
"source": [
"def live_history(training_history: pd.DataFrame, ai_name: str | None):\n",
" training_history.index = training_history.index.to_series().apply(\n",
" lambda x: \"{0} {1}\".format(*x)\n",
" )\n",
" clear_output(wait=True)\n",
" fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(14, 7))\n",
" _ = training_history[\n",
" [c for c in training_history.columns if c[1] == \"final_score\"]\n",
" ].plot(ax=ax1)\n",
" plt.title(\"Final score\")\n",
" plt.xlabel(\"epochs\")\n",
" _ = training_history[[c for c in training_history.columns if \"win\" in c[1]]].plot(\n",
" ax=ax2\n",
" _ = (\n",
" training_history[\n",
" [c for c in training_history.columns if c.endswith(\"final_score\")]\n",
" ]\n",
" .rename(lambda c: c.split(\" \")[0], axis=1)\n",
" .plot(ax=ax1)\n",
" )\n",
" plt.title(\"Final score\")\n",
" ax1.xlabel(\"epochs\")\n",
" _ = (\n",
" training_history[[c for c in training_history.columns if c.endswith(\"win\")]]\n",
" .rename(lambda c: c.split(\" \")[0], axis=1)\n",
" .plot(ax=ax2)\n",
" )\n",
" ax2.title(\"Win score\")\n",
" ax2.xlabel(\"epochs\")\n",
" _ = (\n",
" np.sqrt(\n",
" training_history[\n",
" [c for c in training_history.columns if c.startswith(\"loss\")]\n",
" ]\n",
" )\n",
" .rename(lambda c: c.split(\" \")[0], axis=1)\n",
" .plot(ax=ax3)\n",
" )\n",
" plt.title(\"Win score\")\n",
" plt.xlabel(\"epochs\")\n",
" _ = np.sqrt(\n",
" training_history[[c for c in training_history.columns if \"loss\" == c[0]]]\n",
" ).plot(ax=ax3)\n",
" ax3.set_yscale(\"log\")\n",
" plt.xlabel(\"epochs\")\n",
" ax3.xlabel(\"epochs\")\n",
"\n",
" plt.title(\"Loss\")\n",
" fig.subtitle(f\"Training history of {ai_name}\" if ai_name else f\"Training history\")\n",
" ax3.title(\"Loss\")\n",
" fig.suptitle(f\"Training history of {ai_name}\" if ai_name else f\"Training history\")\n",
" fig.tight_layout()\n",
" plt.show()"
]
},
@ -2733,50 +2780,6 @@
"_train_boards.shape"
]
},
{
"cell_type": "raw",
"metadata": {
"tags": []
},
"source": [
"plot_othello_boards(train_boards[:8, 0])\n",
"plot_othello_boards(q_leaning_formatted_action[0:8, 0, 1])"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ql_policy = QLPolicy(\n",
" 0.95,\n",
" neural_network=DQLNet(),\n",
" symmetry_mode=SymmetryMode.MULTIPLY,\n",
" gamma=0.8,\n",
" who_won_fraction=0,\n",
" final_score_fraction=0,\n",
")\n",
"_batch_size = 100\n",
"%timeit ql_policy.train_batch(_batch_size)\n",
"%memit ql_policy.train_batch(_batch_size)\n",
"%timeit ql_policy.evaluate_model([RandomPolicy(0)], _batch_size)\n",
"%memit ql_policy.evaluate_model([RandomPolicy(0)], _batch_size)"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"ql_policy = QLPolicy(\n",
" 0.95,\n",
" neural_network=DQLNet(),\n",
" symmetry_mode=SymmetryMode.MULTIPLY,\n",
" gamma=0.8,\n",
" who_won_fraction=1,\n",
" final_score_fraction=0,\n",
")\n",
"ql_policy.policy_name"
]
},
{
"cell_type": "code",
"execution_count": null,