Final version up.

This commit is contained in:
Philipp Horstenkamp 2023-03-31 22:29:56 +02:00
parent 2f64f17026
commit 742ac0a4e1
Signed by: Philipp
GPG Key ID: DD53EAC36AFB61B4
2 changed files with 228 additions and 81 deletions

README.md

@ -1,8 +1,165 @@
# reversi
# Otello Q-Learning
A Deep Learning implementation of the game Reversi aka. Otello.
A Deep Learning implementation of the game Otello, also known as Reversi.
This is a Jupyter implementation only because it was requested in such a format for a class in my master's degree. Enjoy the read or ignore it.
## Comments from Gawron
Please note that the notebook contains interactive decorators that can only be used when executed as a Jupyter notebook.
- Use Zobrist hashing for symmetry
## Dependencies
The project was developed with the following dependencies:
```
aiofiles==22.1.0
aiosqlite==0.18.0
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
attrs==22.2.0
Babel==2.11.0
backcall==0.2.0
beautifulsoup4==4.11.2
black==21.12b0
blackcellmagic==0.0.3
bleach==6.0.0
certifi==2022.12.7
cffi==1.15.1
cfgv==3.3.1
charset-normalizer==3.0.1
click==8.1.3
cloudpickle==2.2.1
colorama==0.4.6
comm==0.1.2
contourpy==1.0.7
cycler==0.11.0
debugpy==1.6.6
decorator==5.1.1
defusedxml==0.7.1
distlib==0.3.6
exceptiongroup==1.1.0
executing==1.2.0
fastjsonschema==2.16.2
filelock==3.9.0
fonttools==4.38.0
fqdn==1.5.1
gitdb==4.0.10
GitPython==3.1.31
gym==0.26.2
gym-notices==0.0.8
identify==2.5.18
idna==3.4
iniconfig==2.0.0
ipykernel==6.21.2
ipython==8.10.0
ipython-genutils==0.2.0
ipywidgets==8.0.4
isoduration==20.11.0
isort==5.12.0
jedi==0.18.2
Jinja2==3.1.2
joblib==1.2.0
json5==0.9.11
jsonpointer==2.3
jsonschema==4.17.3
jupyter==1.0.0
jupyter-console==6.5.1
jupyter-events==0.5.0
jupyter-server-mathjax==0.2.6
jupyter-ydoc==0.2.2
jupyter_client==8.0.2
jupyter_core==5.2.0
jupyter_server==2.3.0
jupyter_server_fileid==0.6.0
jupyter_server_terminals==0.4.4
jupyter_server_ydoc==0.6.1
jupyterlab==3.6.1
jupyterlab-git==0.41.0
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.5
jupyterlab_server==2.19.0
KDEpy==1.1.0
kiwisolver==1.4.4
line-profiler==4.0.2
MarkupSafe==2.1.2
matplotlib==3.7.0
matplotlib-inline==0.1.6
memory-profiler==0.61.0
mistune==2.0.5
mypy-extensions==1.0.0
nbclassic==0.5.1
nbclient==0.7.2
nbconvert==7.2.9
nbdime==3.1.1
nbformat==5.7.3
nest-asyncio==1.5.6
nodeenv==1.7.0
notebook==6.5.2
notebook_shim==0.2.2
numpy==1.24.2
packaging==23.0
pandas==1.5.3
pandocfilters==1.5.0
parso==0.8.3
pathspec==0.11.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
platformdirs==3.0.0
plotly==5.13.0
pluggy==1.0.0
pre-commit==3.0.4
prometheus-client==0.16.0
prompt-toolkit==3.0.36
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pycparser==2.21
Pygments==2.14.0
pyparsing==3.0.9
pyrsistent==0.19.3
pytest==7.2.1
python-dateutil==2.8.2
python-json-logger==2.0.6
pytz==2022.7.1
pywin32==305
pywinpty==2.0.10
PyYAML==6.0
pyzmq==25.0.0
qtconsole==5.4.0
QtPy==2.3.0
requests==2.28.2
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
scipy==1.10.0
seaborn==0.12.2
Send2Trash==1.8.0
setuptools-scm==7.1.0
six==1.16.0
smmap==5.0.0
sniffio==1.3.0
soupsieve==2.4
stack-data==0.6.2
tenacity==8.2.1
terminado==0.17.1
tinycss2==1.2.1
tomli==1.2.3
torch==1.13.1
torchaudio==0.13.1
torchvision==0.14.1
tornado==6.2
tqdm==4.64.1
traitlets==5.9.0
typing_extensions==4.5.0
uri-template==1.2.0
urllib3==1.26.14
virtualenv==20.19.0
wcwidth==0.2.6
webcolors==1.12
webencodings==0.5.1
websocket-client==1.5.1
widgetsnbextension==4.0.5
y-py==0.5.5
ypy-websocket==0.8.2
```


@ -159,7 +159,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The directions array contains all the numerical offsets needed to move along one of the 8 directions in a 2 dimensional grid. This will allow an iteration over the game board.\n",
"The directions array contains all the numerical offsets needed to move along one of the 8 directions in a two-dimensional grid. This will allow an iteration over the game board.\n",
"\n",
"![8-directions.png](8-directions.png \"Offset in 8 directions\")"
]
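For illustration, a minimal sketch of such an offset table (the name `DIRECTIONS` is assumed here, not taken from the notebook):

```python
import itertools

import numpy as np

# A minimal sketch of such an offset table: every (row, column) pair from
# {-1, 0, 1} x {-1, 0, 1} except (0, 0). The name DIRECTIONS is illustrative.
DIRECTIONS = np.array(
    [pair for pair in itertools.product((-1, 0, 1), repeat=2) if pair != (0, 0)]
)
print(DIRECTIONS.shape)  # (8, 2)
```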
@ -539,7 +539,7 @@
"## Hash Otello Boards\n",
"\n",
"A challenge for training any reinforcement learning algorithm is how to properly calibrate the exploration rate.\n",
"To make huge numbers of boards comparable it is easier to work with hashes than with the acutal boards. For that purpose a functionalty to hash a board and a stack of boards was added."
"To make huge numbers of boards comparable it is easier to work with hashes than with the actual boards. For that purpose a functionality to hash a board and a stack of boards was added."
]
},
{
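As a hedged sketch of the idea (the notebook's own implementation may differ, e.g. it may use the Zobrist hashing suggested in the README comments), a byte-level hash already makes large stacks of boards cheap to compare:

```python
import numpy as np


def hash_board(board: np.ndarray) -> int:
    """Hash a single 8x8 board by its raw byte content (illustrative sketch)."""
    return hash(board.tobytes())


def hash_board_stack(boards: np.ndarray) -> np.ndarray:
    """Hash every board in a stack shaped (n, 8, 8)."""
    return np.array([hash_board(board) for board in boards])


# Identical boards map to identical hashes, which is all the comparison needs.
rng = np.random.default_rng(0)
stack = rng.integers(-1, 2, size=(3, 8, 8))
stack[1] = stack[0]
hashes = hash_board_stack(stack)
assert hashes[0] == hashes[1]
```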
@ -757,11 +757,11 @@
"def _get_possible_turns_for_board(\n",
" board: np.ndarray, poss_turns: np.ndarray\n",
") -> np.ndarray:\n",
" \"\"\"Calcualtes where turns are possible.\n",
" \"\"\"Calculates where turns are possible.\n",
"\n",
" Args:\n",
" board: The board that should be checked for a playable action.\n",
" poss_turns: An array of actions that could be possible. All true fileds are empty and next to an enemy stone.\n",
" poss_turns: An array of actions that could be possible. All true fields are empty and next to an enemy stone.\n",
" \"\"\"\n",
" for idx, idy in itertools.product(range(BOARD_SIZE), range(BOARD_SIZE)):\n",
" if poss_turns[idx, idy]:\n",
@ -1851,11 +1851,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Posiblilty that an action is possible at a specifc turn\n",
"### Possibility that an action can be taken at a specifc turn\n",
"\n",
"The diagramm below shows where and when a stone can be placed at each of the turns.\n",
"This can be used to compare learning behavior for different Policies and show for example the behavior around the corners.\n",
"A very low possiblity for a corner would mean that the AI tires not to give the corners to the enemy an tries to capture them themselve if possible."
"The following diagram displays the available positions for placing a stone at each turn. This can help compare the learning behavior of different policies, particularly with regard to the corners. A low probability for a corner suggests that the AI avoids giving corners to the opponent and tries to capture them for itself whenever possible."
]
},
{
@ -1905,7 +1903,7 @@
"source": [
"### Statistic of skipped actions\n",
"\n",
"Not all turns can be played. Ploted as a mean over the curse of the game it can be clearly seen that the first time a turn can be skipped is is turn 9 and increases over time."
"Not all turns can be played. Plotted as a mean over the curse of the game it can be clearly seen that the first time a turn can be skipped is turn 9 and increases over time."
]
},
{
@ -1931,7 +1929,7 @@
" \"\"\"Calculates if the board changed between actions.\n",
"\n",
" Args:\n",
" board_history: A history of game baords. Shaped (70 * n * 8 * 8)\n",
" board_history: A history of game boards. Shaped (70 * n * 8 * 8)\n",
" \"\"\"\n",
" return ~np.all(\n",
" np.roll(board_history, shift=1, axis=0) == board_history, axis=(2, 3)\n",
@ -1953,9 +1951,9 @@
},
"source": [
"## Hash branching\n",
"To calibrate the explration rate properly we compared all the games in a stack of games. The graph shows the number of unique game boards at each of the game turns.\n",
"To calibrate the exploration rate properly we compared all the games in a stack of games. The graph shows the number of unique game boards at each of the game turns.\n",
"As can be seen below for random games the games start to be unique very fast.\n",
"For a proper directed exploration I assume the rate needs to be calbrated that the game still have some duplications of the best knwon game at the end of an game simulatin left."
"For a proper directed exploration I assume the rate needs to be calibrated that the game still have some duplications of the best knwon game at the end of a game simulation left."
]
},
{
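A minimal sketch of that measurement, assuming the `(turns, games, 8, 8)` board-history layout used elsewhere in the notebook and using raw board bytes in place of the notebook's own hash function:

```python
import numpy as np


def unique_boards_per_turn(board_history: np.ndarray) -> np.ndarray:
    """Count distinct boards at every turn of a (turns, games, 8, 8) history."""
    return np.array(
        [len({board.tobytes() for board in turn_boards}) for turn_boards in board_history]
    )


# With random boards almost every game is unique from the first turn on;
# directed exploration should keep this number well below the batch size.
rng = np.random.default_rng(0)
demo_history = rng.integers(-1, 2, size=(5, 100, 8, 8))
print(unique_boards_per_turn(demo_history))
```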
@ -2044,8 +2042,7 @@
"source": [
"### Evaluate the final game score\n",
"\n",
"When playing Otello the empty fileds at the end of the game are conted for the player with more stones.\n",
"The folowing function calucates that. The result result will be the delta between the score for player 1 (black)."
"In Otello, the empty fields at the end of the game are counted for the player with more stones. The function below calculates this and returns the score difference for player 1 (black)."
]
},
{
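A hedged sketch of that scoring rule, assuming a +1/-1/0 board encoding and that positive values favour the +1 player; for completely filled boards it reduces to the plain stone sum, which matches the assertion against `np.sum` further down:

```python
import numpy as np


def final_score_sketch(final_boards: np.ndarray) -> np.ndarray:
    """Score delta for a stack of final boards shaped (n, 8, 8).

    Sketch only, not the notebook's final_boards_evaluation: stones are assumed
    to be encoded as +1/-1 and empty fields as 0; empty fields are awarded to
    whichever player currently leads.
    """
    stone_delta = np.sum(final_boards, axis=(1, 2))
    empty_fields = np.sum(final_boards == 0, axis=(1, 2))
    return stone_delta + np.sign(stone_delta) * empty_fields
```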
@ -2088,9 +2085,9 @@
"\n",
"\n",
"np.random.seed(2)\n",
"_baords = simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0]\n",
"_boards = simulate_game(10, (RandomPolicy(1), RandomPolicy(1)))[0]\n",
"np.testing.assert_array_equal(\n",
" np.sum(_baords[-1], axis=(1, 2)), final_boards_evaluation(_baords[-1])\n",
" np.sum(_boards[-1], axis=(1, 2)), final_boards_evaluation(_boards[-1])\n",
")\n",
"np.random.seed(2)\n",
"np.testing.assert_array_equal(\n",
@ -2257,8 +2254,8 @@
"source": [
"### Evaluate the winner of a game\n",
"\n",
"The last function evaluates who won by calculating who signum function of the sum of the numpy array representing the baord.\n",
"The resulting number would be one if the game was wone by the player (white) or -1 if the enemy (black) won. The result would also "
"The last function evaluates who won by calculating who signum function of the sum of the numpy array representing the board.\n",
"The resulting number would be one of the game was won by the player (white) or -1 if the enemy (black) won. The result would also"
]
},
{
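In sketch form, assuming the same +1/-1 encoding as above:

```python
import numpy as np


def winner_sign(final_boards: np.ndarray) -> np.ndarray:
    # The sign of the stone sum over each (8, 8) board decides the winner:
    # +1 and -1 for the two players, 0 for a draw.
    return np.sign(np.sum(final_boards, axis=(1, 2)))
```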
@ -2706,7 +2703,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### ANN Arcitecture \n",
"### ANN Architecture\n",
"\n",
"The architecture of an Artificial Neural Network (ANN) is defined by its inputs and outputs. In the case of Q-Learning ANNs for the board game Otello, the network should calculate the expected reward for the current state and a proposed action.\n",
"\n",
@ -2754,7 +2751,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the first ANN was initially not very succesfull a reduced version was also tried."
"Since the first ANN was initially not very successful a reduced version was also tried."
]
},
{
@ -2854,7 +2851,7 @@
"def action_to_q_learning_format(\n",
" board_history: np.ndarray, action_history: np.ndarray\n",
") -> np.ndarray:\n",
" \"\"\"Fomrats the board history and the action history into Q-Learning inputs.\n",
" \"\"\"Formats the board history and the action history into Q-Learning inputs.\n",
"\n",
" Args:\n",
" board_history: A stack of board histories.\n",
@ -2920,7 +2917,7 @@
"def build_symetry_action(\n",
" board_history: np.ndarray, action_history: np.ndarray\n",
") -> np.ndarray:\n",
" \"\"\"Build a set of symetrical game histories in the style of a q_learning learning policy.\n",
" \"\"\"Build a set of symmetrical game histories in the style of a q_learning learning policy.\n",
"\n",
" Args:\n",
" board_history: A history of game states.\n",
@ -3034,7 +3031,7 @@
"source": [
"class QLPolicy(GamePolicy):\n",
" \"\"\"\n",
" A simple Q-Learining policy.\n",
" A simple Q-Learning policy.\n",
" \"\"\"\n",
"\n",
" # noinspection PyProtectedMember\n",
@ -3087,7 +3084,7 @@
"\n",
" @property\n",
" def policy_name(self) -> str:\n",
" \"\"\"Geneartes a name for a Q-Learning policy from all their arguments\"\"\"\n",
" \"\"\"Generates a name for a Q-Learning policy from all their arguments\"\"\"\n",
" symmetry_name = {SymmetryMode.MULTIPLY: \"M\", SymmetryMode.BREAK_SEQUENCE: \"B\"}\n",
" g = f\"{self.gamma:.1f}\".replace(\".\", \"\")\n",
" ww = f\"{self.who_won_fraction:.1f}\".replace(\".\", \"\")\n",
@ -3242,7 +3239,7 @@
" )\n",
"\n",
" def load(self):\n",
" \"\"\"Loads the latest itteration of a model with the same configuration\"\"\"\n",
" \"\"\"Loads the latest iteration of a model with the same configuration\"\"\"\n",
" pickle_files = glob.glob(f\"{TRAINING_RESULT_PATH}/{self.policy_name}-*.pickle\")\n",
" torch_files = glob.glob(f\"{TRAINING_RESULT_PATH}/{self.policy_name}-*.torch\")\n",
"\n",
@ -3303,7 +3300,7 @@
" return self.history\n",
"\n",
" def plot_history(self) -> None:\n",
" \"\"\"Polts the training history.\"\"\"\n",
" \"\"\"Plots the training history.\"\"\"\n",
" if not self.training_results:\n",
" return None\n",
" return live_history(self.history, str(self))\n",
@ -3340,12 +3337,9 @@
"source": [
"### Calibrating the / greedy factor ($\\epsilon$)\n",
"\n",
"Since the polcies above to not use memory replay it is prosuably a good idear to calbirate the greedy factor $\\epsilon$ to be high enough that some policies stay unchanged to the end of the simulation.\n",
"I here assume that we simulate the game with 1000 probes per batch.\n",
"Some trial and error lead to a greedy factor of at least 0.92.\n",
"Given the policy of not using memory replay, it is advisable to set the greedy factor $\\epsilon$ high enough such that some policies remain unchanged until the end of the simulation. Assuming a simulation of the game with 1000 probes per batch, trial and error suggests a minimum greedy factor of 0.92.\n",
"\n",
"Later experimentation however did show a big instability in the ANN models and a convergence that did not lead to a very high uptake in win ratio and final score.\n",
"I prosume this would be much better with either more samples per batch or an memory replay buffer. More on that later."
"However, further experimentation revealed significant instability in the ANN models and poor convergence, leading to low win ratios and final scores. To address this issue, increasing the number of samples per batch or implementing a memory replay buffer may be beneficial. More on this will be discussed later."
]
},
{
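A rough sanity check of that choice, under stated assumptions (epsilon acts per move, a policy places roughly 30 stones per game, 1000 probes per batch; the notebook's exact numbers may differ):

```python
# Back-of-the-envelope check of the epsilon choice.
epsilon = 0.92
moves_per_game = 30          # assumed stones placed by one policy per game
games_per_batch = 1000       # probes per batch, as stated above

p_fully_greedy = epsilon ** moves_per_game              # ~0.08
expected_unchanged = games_per_batch * p_fully_greedy   # ~82 fully greedy games per batch
print(f"{p_fully_greedy:.3f} -> about {expected_unchanged:.0f} unchanged games per batch")
```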
@ -3858,7 +3852,7 @@
"source": [
"### Defining some metric polices\n",
"\n",
"Since the AI Polcies are in flux and change while trainig a continues comparison with unchanging policies is needed. The `RandomPolicy` and the `GreedyPolicy` fullfill that role."
"Since the ANN policies are in flux and change while trainig a continues comparison with unchanging policies is needed. The `RandomPolicy` and the `GreedyPolicy` fulfill that role."
]
},
{
@ -3889,7 +3883,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Below the training pogress for the Q-Learning policy is shown."
"Below the training progress for the Q-Learning policy is shown."
]
},
{
@ -4045,7 +4039,7 @@
"source": [
"The training consistency varies from policy to policy, with some policies being more stable and exhibiting less variation. Moreover, the highly interdependent metrics indicate that the evaluation sample size may be too small.\n",
"\n",
"Since batch size and metics have similar sizes this suggest that the batch size should also be increads. Seadly with the current Q-Learning setup this is not possible due to CPU constraints."
"Since batch size and metrics have similar sizes this suggests that the batch size should also be increased. Sadly with the current Q-Learning setup this is not possible due to CPU constraints."
]
},
{
@ -4054,8 +4048,8 @@
"source": [
"### Example simulations\n",
"\n",
"The section below shows the trained policies, playing as black against a white oponent.\n",
"The Policy can be choosen via the dropdown menu."
"The section below shows the trained policies, playing as black against a white opponent.\n",
"The Policy can be chosen via the dropdown menu."
]
},
{
@ -4121,8 +4115,8 @@
}
],
"source": [
"longest_trained_policy = ql_policies[4]\n",
"longest_trained_policy"
"example_policy = ql_policies[4]\n",
"example_policy"
]
},
{
@ -4131,7 +4125,7 @@
"source": [
"## Analysis of the Policies\n",
"\n",
"When creating such policies it is necessary to evaluate the results. One way of doing so is to play a huge turnerment where all Policies corss paths at least twice.\n",
"When creating such policies it is necessary to evaluate the results. One way of doing so is to play a huge tournament where all Policies cross paths at least twice.\n",
"The resulting dataframe contains the results as a json where on the index the first policy is noted and the second by the column.\n",
"This allows for different metrics for all the policies while simulating them only once."
]
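A minimal sketch of such a tournament table; `play_pairing` is a hypothetical stand-in for whatever wraps `simulate_game` in the notebook:

```python
import json

import pandas as pd


# Hypothetical stand-in for one pairing; in the notebook this would wrap
# simulate_game and return the raw results for (first_policy, second_policy).
def play_pairing(first_name: str, second_name: str) -> dict:
    return {"first": first_name, "second": second_name, "score": 0.0}


policy_names = ["RandomPolicy", "GreedyPolicy", "QLPolicy-A", "QLPolicy-B"]

# Every ordered pair is played, so each pairing occurs at least twice
# (once with each policy going first). Results are stored as JSON strings,
# indexed by the first policy with the second policy as the column.
results = pd.DataFrame(
    {
        second: {first: json.dumps(play_pairing(first, second)) for first in policy_names}
        for second in policy_names
    }
)
print(results.loc["RandomPolicy", "GreedyPolicy"])
```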
@ -4200,7 +4194,7 @@
},
"source": [
"### Analysing the policy by the win score\n",
"The table below shows the win ratio by color. It shows directionally how one policy played against another as balck or white. Since nearly all values are positve the color bias that the player black has an advantage can't be overcome with the arcitecutre as is. It does however show some learning behavor."
"The table presented below displays the win ratio for each color, providing directional insight into how one policy played against the other as black or white. While most values are positive, indicating that the player who played as black had an advantage, this bias could not be overcome with the current architecture. Nonetheless, the table does demonstrate some learning behavior."
]
},
{
@ -5420,7 +5414,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Seadly there is no clear best policy to identify."
"Sadly there is no clear best policy to identify."
]
},
{
@ -5428,8 +5422,8 @@
"metadata": {},
"source": [
"### Analysing the polices by the final score\n",
"When anlaysing the final score it is also not easy to see any QL-policy clearly outpacing the other QL-policies.\n",
"Score color bias compensation is done the same as above. The multiplicaten by 64 compensates for the score beeing normed to [1, -1]. The score now calculatees the advantage in points a player has."
"When analysing the final score it is also not easy to see any QL-policy clearly outpacing the other QL-policies.\n",
"Score color bias compensation is done the same as above. The multiplication by 64 compensates for the score beeing normed to [1, -1]. The score now calculates the advantage in points a player has."
]
},
{
@ -6067,7 +6061,7 @@
}
],
"source": [
"longest_trained_policy"
"example_policy"
]
},
{
@ -6108,10 +6102,10 @@
],
"source": [
"_l_board_history, _l_action_history = simulate_game(\n",
" 1000, (longest_trained_policy, RandomPolicy(1)), True\n",
" 1000, (example_policy, RandomPolicy(1)), True\n",
")\n",
"_r_board_history, _r_action_history = simulate_game(\n",
" 1000, (RandomPolicy(1), longest_trained_policy), True\n",
" 1000, (RandomPolicy(1), example_policy), True\n",
")"
]
},
@ -6135,23 +6129,23 @@
],
"source": [
"_l_actions_possible = get_possible_turns(_l_board_history.reshape(-1, 8, 8)).reshape(\n",
" 70, -1, 8, 8\n",
" (70, -1, 8, 8)\n",
")\n",
"_r_actions_possible = get_possible_turns(_r_board_history.reshape(-1, 8, 8)).reshape(\n",
" 70, -1, 8, 8\n",
" (70, -1, 8, 8)\n",
")\n",
"\n",
"\n",
"_l_mean_actions_possible = np.mean(_l_actions_possible, axis=(1))\n",
"_r_mean_actions_possible = np.mean(_r_actions_possible, axis=(1))"
"_l_mean_actions_possible = np.mean(_l_actions_possible, axis=1)\n",
"_r_mean_actions_possible = np.mean(_r_actions_possible, axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Corner stone analysis via possiblity space\n",
"As notet in the analysis of the game and its behavior over time a heat map can be used to calculate how a game was won compared to a random strategy."
"#### Corner stone analysis via possible action space\n",
"As noted in the analysis of the game and its behavior over time a heat map can be used to calculate how a game was won compared to a random strategy."
]
},
{
@ -6277,7 +6271,7 @@
"metadata": {},
"source": [
"#### Corner stone capture count\n",
"When counting how many corners where captured it is clear to see that the QL policy understood the importance of the corners somewhat. Since it is not absolut is can't be saad that they should overly priorities those and i would have expexted an outkome around $3$. The simulated $2.7$ corners as an average at the end of the game is not that high but shows a good trend."
"When counting the number of corners captured, it's evident that the QL policy has some understanding of their importance. However, since the result is not absolutely dependent on those corners it can't be concluded that they should be overly prioritized, and an outcome around 3 would have been expected. The average of 2.7 corners captured at the end of the game, though not very high, shows a positive trend."
]
},
{
@ -6302,9 +6296,9 @@
}
],
"source": [
"a = np.sum(np.mean(_l_board_history[-1][:, [0, -1]][:, :, [0, -1]], axis=(0)))\n",
"b = -np.sum(np.mean(_r_board_history[-1][:, [0, -1]][:, :, [0, -1]], axis=(0)))\n",
"c = np.sum(np.mean(_board_history[-1][:, [0, -1]][:, :, [0, -1]], axis=(0)))\n",
"a = np.sum(np.mean(_l_board_history[-1][:, [0, -1]][:, :, [0, -1]], axis=0))\n",
"b = -np.sum(np.mean(_r_board_history[-1][:, [0, -1]][:, :, [0, -1]], axis=0))\n",
"c = np.sum(np.mean(_board_history[-1][:, [0, -1]][:, :, [0, -1]], axis=0))\n",
"\n",
"(\n",
" pd.Series(\n",
@ -6323,7 +6317,7 @@
"source": [
"### Time till capture\n",
"\n",
"An additional aspect to analyze is the time it takes to capture a corner when it becomes available. As capturing a corner is critical in Othello strategy, it is expected that the AI would priorites its capture. However, since the opponent is usually not able to capture the same corner at the same moment, the AI may not need to hurry. To test this hypothesis, we measured the time it took for the AI to capture a corner and compared it to the time it took for the AI to capture a non-corner cell. The results showed that the AI captured corners faster, although not by as much as expected."
"An additional aspect to analyze is the time it takes to capture a corner when it becomes available. As capturing a corner is critical in Othello strategy, it is expected that the AI would priorities its capture. However, since the opponent is usually not able to capture the same corner at the same moment, the AI may not need to hurry. To test this hypothesis, we measured the time it took for the AI to capture a corner and compared it to the time it took for the AI to capture a non-corner cell. The results showed that the AI captured corners faster, although not by as much as expected."
]
},
{
@ -6369,7 +6363,7 @@
"tags": []
},
"source": [
"With the coner capture of this first policy I was quit happy even if it did not went as far as I expected."
"With the corner capture behavior of this policy I was quit happy even if it was not as extreme as I expected. It goes into the right direction."
]
},
{
@ -6378,22 +6372,20 @@
"tags": []
},
"source": [
"### Analysing the symmetry behavior for the policy\n",
"### Analyzing Symmetry in the ANN's Behavior\n",
"\n",
"There are different ways to analyse the the result. I chose to use a normed error also sometimes called R2 meteic in DL.\n",
"This is ofcourse no loss just a measurement for how good the symmetry is.\n",
"There are several ways to assess symmetry, but I used the R2 metric, which is sometimes referred to as the normed error in Deep Learning. It's important to note that this is not a loss function, but rather a way to measure how symmetric the ANN's behavior is.\n",
"\n",
"The R2 score is calculated as \n",
"$R_2 = 1 - \\frac{MSE}{VAR}$\n",
"Here calculated as the error when fliping an axis calculating the policy and reversing the flip to its directly calculated oposit. The big etwantage of this formula is that the error is normed.\n",
"An result of 1 means there is no error while 0 means there is only random noise and negative value mean there is a worse than random connection."
"The R2 score is computed as follows: $R_2 = 1 - \\frac{MSE}{VAR}$ In my analysis, I calculated the error by flipping an axis, generating a policy, and then reversing the flip to its directly calculated opposite. The primary advantage of this formula is that it normalizes the error.\n",
"\n",
"A score of 1 indicates that there is no error, while a score of 0 suggests that there is only random noise. Negative values indicate that the connection is worse than random."
]
},
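A minimal sketch of that metric, with `base` and `mirrored` standing in for the original and the flip-then-unflip network outputs:

```python
import numpy as np


def r2_symmetry_score(base: np.ndarray, mirrored: np.ndarray) -> float:
    """R2-style symmetry score 1 - MSE/VAR (illustrative sketch).

    base: Q-values for the original boards; mirrored: Q-values for the flipped
    boards with the flip undone afterwards, so both arrays align element-wise.
    """
    mse = np.mean((base - mirrored) ** 2)
    var = np.var(base)
    return float(1.0 - mse / var)


# Identical outputs give exactly 1; any disagreement pushes the score towards
# (and possibly below) zero.
rng = np.random.default_rng(0)
outputs = rng.normal(size=(100, 8, 8))
print(r2_symmetry_score(outputs, outputs))                                        # 1.0
print(r2_symmetry_score(outputs, outputs + rng.normal(size=outputs.shape)) < 1.0)  # True
```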
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The SQT for this results the following."
"The VAR value for the R2 value calucates as follows."
]
},
{
@ -6416,7 +6408,7 @@
},
"outputs": [],
"source": [
"base_policy_results = longest_trained_policy._internal_policy(\n",
"base_policy_results = example_policy._internal_policy(\n",
" _board_history.reshape((-1, 8, 8))\n",
")"
]
@ -6476,7 +6468,7 @@
" 1\n",
" - np.var(\n",
" base_policy_results\n",
" - longest_trained_policy._internal_policy(\n",
" - example_policy._internal_policy(\n",
" _board_history.reshape((-1, 8, 8))[:, ::-1, :]\n",
" )[:, ::-1, :]\n",
" )\n",
@ -6488,7 +6480,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When flipping the second axis the score reached is very similar which supports that the symetry of the game was learnd somewhat OK."
"When flipping the second axis the score reached is very similar which supports that the symetry of the game was learned somewhat OK."
]
},
{
@ -6514,7 +6506,7 @@
" 1\n",
" - np.var(\n",
" base_policy_results\n",
" - longest_trained_policy._internal_policy(\n",
" - example_policy._internal_policy(\n",
" _board_history.reshape((-1, 8, 8))[:, :, ::-1]\n",
" )[:, :, ::-1]\n",
" )\n",
@ -6528,7 +6520,7 @@
"tags": []
},
"source": [
"Since discount factors make out a huge part of the Q-Value this should be compensated for by this analyses.Those discount values can be found quit clearly in the Q-Valuenvariances below."
"Since discount factors make out a huge part of the Q-Value this should be compensated for by this analysis.Those discount values can be found quit clearly in the Q-Valuenvariances below."
]
},
{
@ -6611,9 +6603,7 @@
" [\n",
" np.var(\n",
" base_policy_results.reshape(70, -1, 8, 8)[i]\n",
" - longest_trained_policy._internal_policy(_board_history[i, :, ::-1, :])[\n",
" :, ::-1, :\n",
" ]\n",
" - example_policy._internal_policy(_board_history[i, :, ::-1, :])[:, ::-1, :]\n",
" )\n",
" for i in range(70)\n",
" ]\n",
@ -6690,7 +6680,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Concluseion\n",
"## Conclusion\n",
"\n",
"The implementation of Q-learning using artificial neural networks (ANNs) for the board game Othello proved to be a challenging task. Although I had planned to implement increasingly complex policies, the amount of work required for a reinforcement learning (RL) project, including Q-learning, was much larger than expected. Nonetheless, I am satisfied with the progress made, considering the limited amount of time available. Although the networks did not clearly converge, they showed promise, and I am uncertain how to define the best values for parameters such as $\\gamma$, $\\epsilon$, who_won_fraction, and final_score_fraction.\n",
"\n",
@ -6713,7 +6703,7 @@
"* Game rules and example game images [https://de.wikipedia.org/wiki/Othello_(Spiel)](https://de.wikipedia.org/wiki/Othello_(Spiel))\n",
"* Game strategy examples [https://de.wikipedia.org/wiki/Computer-Othello](https://de.wikipedia.org/wiki/Computer-Othello)\n",
"* Image for 8 directions [https://www.researchgate.net/journal/EURASIP-Journal-on-Image-and-Video-Processing-1687-5281](https://www.researchgate.net/journal/EURASIP-Journal-on-Image-and-Video-Processing-1687-5281)\n",
"* Deepl Leraning with PyTorch 1.x (ISBN 978-1-83855-300-5)\n",
"* Deepl Learning with PyTorch 1.x (ISBN 978-1-83855-300-5)\n",
"\n",
"ChatGPG was used to refactor some code snippets and as an advanced spell checker."
]
@ -6724,7 +6714,7 @@
"source": [
"# Memory usage\n",
"\n",
"The jupyter notebook uses a lot of memory. The cell below monitors the staticly allocated memory to get some kind of feeling where memroy is completly wasted.\n",
"The jupyter notebook uses a lot of memory. The cell below monitors the statically allocated memory to get some kind of feeling where memory is completely wasted.\n",
"\n",
"The code snipped was copied and only slightly modified."
]
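A minimal sketch of such a monitor, assuming "statically allocated" refers to objects bound to global names; the notebook's actual cell may measure memory differently (e.g. via `psutil`):

```python
import sys


def largest_globals(namespace: dict, top: int = 10) -> list:
    """Return the `top` global names with the biggest sys.getsizeof footprint."""
    sizes = {
        name: sys.getsizeof(obj)
        for name, obj in namespace.items()
        if not name.startswith("_")
    }
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top]


for name, size in largest_globals(globals()):
    print(f"{name:30s} {size / 1024:8.1f} KiB")
```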