Philipp Horstenkamp a9304201af
(chore): Initialised devops tools (#29)
* Added a first action

* Repaired a typo

* Repaired a typo2

* Added flake8 action

* Repaired a typo in the flake8 action.

* Added a first bandit action

* Added a first batch

* Added the flake8-prebuild as a need to flake8

* Added the docker socket to the volume.

* Added the flake8-prebuild as a need to flake8

* Removed latest part from container.

* Reworked flake8

* Reworked flake8 poetry

* Changed to 64bit

* Some edits to the runner

* Added python setup

* Added python -m to python docker image.

* Added ra run linter

* Removed redundant version

* Added isort

* Added poetry install

* Added flake8 as lint.

* Uses nodejs and python image

* Added flake8 as lint.

* Removed self-hosted runner

* Added black and flake8 tests

* Removed self-hosted runner

* Removed unneeded actions

* Added a mypy error.

* Removed poetry call before poetry setup

* Added a test to understand the poetry action better

* Added the snook poetry builder

* Reworked the repo a bit

* Removed unneeded poetry installation

* Added the isort action

* Added isort test

* Added ruff

* Added full ruff configuration

* Added full ruff configuration2

* Removed duplicate configurations

* Removed some redundant pre-commit hooks

* Removed unneeded actions.

* Repaired ruff

* Added tests.

* Removed

* Removed a missing file

* Added reports as artifacts

* Removed the unneeded poetry test

* Added a license checker.

* Removed some unneeded configuration.

* Removed the import reformatted.

* Added doc generation.

* Added license summary.

* Add

* Add lint

* Switched pip-licenses to poetry.

* Remove some more packages.

* Added a make file

* Added version codes to the main package

* Changed the format of the md files

* Presentation first draft

* Version up and added extensions

* Removed the venv path from docbuild

* Actions version up

* Experiments with sphinx

* First draft of the sphinx documentation.

* Added the protocol to the time series.

* First draft of a first build pipeline

* Added mermaid version support

* Added documentations pull and branch request requirements.

* Tests should now be passing

* Add safety

* Added the action on pull_request_target

* Added a pytest coverage report

* Added a build step

* Changed the lint action to work only on python changes.

* Added the ability to compile a html report

* Coverage

* Finished test and build workflow

* Repaired a bug.

* Added a github branch.ref

* Removed a poetry install

* Docbuild now excludes templates

* Added the seminar presentation to the documentation build

* Added a few images

* Changed the pre-commit image

* Presentation done

* Never executing jupyter for sphinx
2023-06-23 18:47:04 +02:00

1001 lines
37 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# FinBert\n",
"\n",
"FinBert is a sentiment analysis model for financial text.\n",
"Since we want to evaluate news articles, we need a way to analyse the sentiment of such texts.\n",
"This notebook shows a first use of the tool on a few sample texts.\n",
"In particular, it tests how well the model handles German texts.\n",
"\n",
"## Sources\n",
"\n",
"* [Hugging Face](https://huggingface.co/ProsusAI/finbert)\n",
"* [Tutorial](https://medium.com/codex/stocks-news-sentiment-analysis-with-deep-learning-transformers-and-machine-learning-cdcdb827fc06)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Libraries\n",
"\n",
"* transformers\n",
"* tqdm\n",
"* pandas\n",
"* numpy\n",
"* torch\n",
"* torchvision\n",
"* torchaudio\n",
"* sentencepiece\n",
"* sacremoses"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-01T13:16:13.740927Z",
"start_time": "2023-05-01T13:16:08.554998Z"
},
"jupyter": {
"outputs_hidden": false
},
"slideshow": {
"slide_type": "skip"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: transformers in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (4.28.1)\n",
"Requirement already satisfied: tqdm in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (4.65.0)\n",
"Requirement already satisfied: pandas in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (2.0.1)\n",
"Requirement already satisfied: numpy in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (1.24.3)\n",
"Requirement already satisfied: torch in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (2.0.0)\n",
"Requirement already satisfied: torchvision in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (0.15.1)\n",
"Requirement already satisfied: torchaudio in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (2.0.1)\n",
"Requirement already satisfied: sentencepiece in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (0.1.98)\n",
"Requirement already satisfied: sacremoses in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (0.0.53)\n",
"Requirement already satisfied: filelock in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from transformers) (3.8.0)\n",
"Requirement already satisfied: huggingface-hub<1.0,>=0.11.0 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from transformers) (0.14.1)\n",
"Requirement already satisfied: packaging>=20.0 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from transformers) (23.1)\n",
"Requirement already satisfied: pyyaml>=5.1 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from transformers) (6.0)\n",
"Requirement already satisfied: regex!=2019.12.17 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from transformers) (2023.3.23)\n",
"Requirement already satisfied: requests in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from transformers) (2.28.1)\n",
"Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from transformers) (0.13.3)\n",
"Requirement already satisfied: colorama in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from tqdm) (0.4.6)\n",
"Requirement already satisfied: python-dateutil>=2.8.2 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from pandas) (2.8.2)\n",
"Requirement already satisfied: pytz>=2020.1 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from pandas) (2022.7)\n",
"Requirement already satisfied: tzdata>=2022.1 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from pandas) (2023.3)\n",
"Requirement already satisfied: typing-extensions in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from torch) (4.5.0)\n",
"Requirement already satisfied: sympy in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from torch) (1.11.1)\n",
"Requirement already satisfied: networkx in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from torch) (3.1)\n",
"Requirement already satisfied: jinja2 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from torch) (3.1.2)\n",
"Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from torchvision) (9.4.0)\n",
"Requirement already satisfied: six in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from sacremoses) (1.16.0)\n",
"Requirement already satisfied: click in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from sacremoses) (8.1.3)\n",
"Requirement already satisfied: joblib in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from sacremoses) (1.2.0)\n",
"Requirement already satisfied: fsspec in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from huggingface-hub<1.0,>=0.11.0->transformers) (2023.4.0)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from jinja2->torch) (2.1.2)\n",
"Requirement already satisfied: charset-normalizer<3,>=2 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from requests->transformers) (2.1.1)\n",
"Requirement already satisfied: idna<4,>=2.5 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from requests->transformers) (3.4)\n",
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from requests->transformers) (1.26.12)\n",
"Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from requests->transformers) (2022.9.24)\n",
"Requirement already satisfied: mpmath>=0.19 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from sympy->torch) (1.3.0)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"[notice] A new release of pip is available: 23.0.1 -> 23.1.2\n",
"[notice] To update, run: python.exe -m pip install --upgrade pip\n"
]
}
],
"source": [
"!pip install transformers tqdm pandas numpy torch torchvision torchaudio sentencepiece sacremoses -U"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Importing libraries and creating the tokenizer and model"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-01T13:16:15.121662Z",
"start_time": "2023-05-01T13:16:13.743921Z"
},
"jupyter": {
"outputs_hidden": false
},
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import torch\n",
"\n",
"from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
"\n",
"# create a tokenizer object\n",
"tokenizer = AutoTokenizer.from_pretrained(\"ProsusAI/finbert\")\n",
"\n",
"# fetch the pretrained model\n",
"model = AutoModelForSequenceClassification.from_pretrained(\"ProsusAI/finbert\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Analyze the sentiment of a single text"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-01T13:16:15.194193Z",
"start_time": "2023-05-01T13:16:15.122665Z"
},
"jupyter": {
"outputs_hidden": false
},
"slideshow": {
"slide_type": "-"
}
},
"outputs": [
{
"data": {
"text/plain": [
"+    0.034084\n",
"0    0.932933\n",
"-    0.032982\n",
"dtype: float32"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
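],
"source": [
"def analyze_sentiment(text: str) -> pd.Series:\n",
"    \"\"\"Return the FinBert class probabilities for a single text.\"\"\"\n",
"    input_tokens = tokenizer(text, padding=True, truncation=True, return_tensors=\"pt\")\n",
"    with torch.no_grad():  # inference only, so no gradients are needed\n",
"        output = model(**input_tokens)\n",
"    return pd.Series(\n",
"        torch.nn.functional.softmax(output.logits, dim=-1)[0].numpy(),\n",
"        # NOTE: this label order is hard-coded; compare it with model.config.id2label.\n",
"        index=[\"+\", \"0\", \"-\"],\n",
"    )\n",
"\n",
"\n",
"headline = \"Microsoft fails to hit profit expectations\"\n",
"tf = analyze_sentiment(headline)\n",
"tf"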
]
},
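{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function above hard-codes the index `[\"+\", \"0\", \"-\"]`, which is only correct if it matches the class order of the model head.\n",
"That order is stored in the model configuration, so it can be checked directly.\n",
"A minimal sketch, assuming the `model` object loaded earlier; the variable name `id_to_label` is only illustrative and not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: list the class names in logit order, as stored in the model config.\n",
"# `id2label` maps the integer class id to its label string.\n",
"id_to_label = [model.config.id2label[i] for i in range(model.config.num_labels)]\n",
"id_to_label"
]
},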
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Creating test data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-01T13:16:15.208856Z",
"start_time": "2023-05-01T13:16:15.198186Z"
},
"slideshow": {
"slide_type": "skip"
},
"tags": []
},
"outputs": [],
"source": [
"text_df = pd.DataFrame(\n",
"    [\n",
"        {\"text\": \"Microsoft fails to hit profit expectations\", \"lan\": \"en\"},\n",
"        {\n",
"            \"text\": \"Am Aktienmarkt überwieg weiter die Zuversicht, wie der Kursverlauf des DAX zeigt.\",\n",
"            \"lan\": \"de\",\n",
"        },\n",
"        {\"text\": \"Stocks rallied and the British pound gained.\", \"lan\": \"en\"},\n",
"        {\n",
"            \"text\": \"Meyer Burger bedient ab sofort australischen Markt und präsentiert sich auf Smart Energy Expo in Sydney.\",\n",
"            \"lan\": \"de\",\n",
"        },\n",
"        {\n",
"            \"text\": \"Meyer Burger enters Australian market and exhibits at Smart Energy Expo in Sydney.\",\n",
"            \"lan\": \"en\",\n",
"        },\n",
"        {\n",
"            \"text\": \"J&T Express Vietnam hilft lokalen Handwerksdörfern, ihre Reichweite zu vergrößern.\",\n",
"            \"lan\": \"de\",\n",
"        },\n",
"        {\n",
"            \"text\": \"7 Experten empfehlen die Aktie zum Kauf, 1 Experte empfiehlt, die Aktie zu halten.\",\n",
"            \"lan\": \"de\",\n",
"        },\n",
"        {\"text\": \"Microsoft aktie fällt.\", \"lan\": \"de\"},\n",
"        {\"text\": \"Microsoft aktie steigt.\", \"lan\": \"de\"},\n",
"    ]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-01T13:16:15.208856Z",
"start_time": "2023-05-01T13:16:15.198186Z"
},
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>lan</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Microsoft fails to hit profit expectations</td>\n",
" <td>en</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Am Aktienmarkt überwieg weiter die Zuversicht,...</td>\n",
" <td>de</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Stocks rallied and the British pound gained.</td>\n",
" <td>en</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Meyer Burger bedient ab sofort australischen M...</td>\n",
" <td>de</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Meyer Burger enters Australian market and exhi...</td>\n",
" <td>en</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>J&amp;T Express Vietnam hilft lokalen Handwerksdör...</td>\n",
" <td>de</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7 Experten empfehlen die Aktie zum Kauf, 1 Exp...</td>\n",
" <td>de</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Microsoft aktie fällt.</td>\n",
" <td>de</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Microsoft aktie steigt.</td>\n",
" <td>de</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text lan\n",
"0 Microsoft fails to hit profit expectations en\n",
"1 Am Aktienmarkt überwieg weiter die Zuversicht,... de\n",
"2 Stocks rallied and the British pound gained. en\n",
"3 Meyer Burger bedient ab sofort australischen M... de\n",
"4 Meyer Burger enters Australian market and exhi... en\n",
"5 J&T Express Vietnam hilft lokalen Handwerksdör... de\n",
"6 7 Experten empfehlen die Aktie zum Kauf, 1 Exp... de\n",
"7 Microsoft aktie fällt. de\n",
"8 Microsoft aktie steigt. de"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analyze the sentiment of multiple texts"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-01T13:16:16.132009Z",
"start_time": "2023-05-01T13:16:15.211858Z"
},
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>text</th>\n",
" <th>lan</th>\n",
" <th>+</th>\n",
" <th>0</th>\n",
" <th>-</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Microsoft fails to hit profit expectations</td>\n",
" <td>en</td>\n",
" <td>0.034084</td>\n",
" <td>0.932933</td>\n",
" <td>0.032982</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Am Aktienmarkt überwieg weiter die Zuversicht,...</td>\n",
" <td>de</td>\n",
" <td>0.053528</td>\n",
" <td>0.027950</td>\n",
" <td>0.918522</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Stocks rallied and the British pound gained.</td>\n",
" <td>en</td>\n",
" <td>0.898361</td>\n",
" <td>0.034474</td>\n",
" <td>0.067165</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Meyer Burger bedient ab sofort australischen M...</td>\n",
" <td>de</td>\n",
" <td>0.116597</td>\n",
" <td>0.012790</td>\n",
" <td>0.870613</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Meyer Burger enters Australian market and exhi...</td>\n",
" <td>en</td>\n",
" <td>0.187527</td>\n",
" <td>0.008846</td>\n",
" <td>0.803627</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>J&amp;T Express Vietnam hilft lokalen Handwerksdör...</td>\n",
" <td>de</td>\n",
" <td>0.066277</td>\n",
" <td>0.020608</td>\n",
" <td>0.913115</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7 Experten empfehlen die Aktie zum Kauf, 1 Exp...</td>\n",
" <td>de</td>\n",
" <td>0.050346</td>\n",
" <td>0.022004</td>\n",
" <td>0.927650</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Microsoft aktie fällt.</td>\n",
" <td>de</td>\n",
" <td>0.066061</td>\n",
" <td>0.016440</td>\n",
" <td>0.917498</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Microsoft aktie steigt.</td>\n",
" <td>de</td>\n",
" <td>0.041449</td>\n",
" <td>0.018471</td>\n",
" <td>0.940080</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" text lan + 0 \n",
"0 Microsoft fails to hit profit expectations en 0.034084 0.932933 \\\n",
"1 Am Aktienmarkt überwieg weiter die Zuversicht,... de 0.053528 0.027950 \n",
"2 Stocks rallied and the British pound gained. en 0.898361 0.034474 \n",
"3 Meyer Burger bedient ab sofort australischen M... de 0.116597 0.012790 \n",
"4 Meyer Burger enters Australian market and exhi... en 0.187527 0.008846 \n",
"5 J&T Express Vietnam hilft lokalen Handwerksdör... de 0.066277 0.020608 \n",
"6 7 Experten empfehlen die Aktie zum Kauf, 1 Exp... de 0.050346 0.022004 \n",
"7 Microsoft aktie fällt. de 0.066061 0.016440 \n",
"8 Microsoft aktie steigt. de 0.041449 0.018471 \n",
"\n",
" - \n",
"0 0.032982 \n",
"1 0.918522 \n",
"2 0.067165 \n",
"3 0.870613 \n",
"4 0.803627 \n",
"5 0.913115 \n",
"6 0.927650 \n",
"7 0.917498 \n",
"8 0.940080 "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def analyse_sentiments(texts: pd.DataFrame) -> pd.DataFrame:\n",
"    \"\"\"Add one probability column per sentiment class to the given frame.\"\"\"\n",
"    values = texts[\"text\"].apply(analyze_sentiment)\n",
"    texts[[\"+\", \"0\", \"-\"]] = values\n",
"    return texts\n",
"\n",
"\n",
"analyse_sentiments(text_df.copy())"
]
},
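{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three probabilities per text can be reduced to the single most likely class, which makes the rows of the table above easier to compare.\n",
"A small sketch, assuming `text_df` and `analyse_sentiments` from the cells above; the names `scored_df` and `predicted` are illustrative and not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scored_df = analyse_sentiments(text_df.copy())\n",
"# idxmax(axis=1) returns, per row, the column label with the highest probability.\n",
"scored_df[\"predicted\"] = scored_df[[\"+\", \"0\", \"-\"]].idxmax(axis=1)\n",
"scored_df[[\"text\", \"lan\", \"predicted\"]]"
]
},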
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion about FinBert\n",
"\n",
"In its current form, this model cannot be used for German text: the German samples above receive nearly identical scores regardless of their content, for example \"Microsoft aktie fällt.\" and \"Microsoft aktie steigt.\".\n",
"It could be used if the text is translated to English beforehand, although it is questionable whether that works well.\n",
"Another option would be to retrain the model on a German translation of its training data, but I do not believe this to be feasible."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Translating texts before analysing them with FinBert\n",
"\n",
"The language limitation of FinBert may be worked around by translating the input to English before passing it to FinBert.\n",
"The functions below explore this approach.\n",
"\n",
"* [Translator: Helsinki-NLP/opus-mt-de-en](https://huggingface.co/Helsinki-NLP/opus-mt-de-en)\n",
"* [MarianMTModel documentation](https://huggingface.co/docs/transformers/main/en/model_doc/marian#transformers.MarianMTModel)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-01T13:16:19.308043Z",
"start_time": "2023-05-01T13:16:16.135009Z"
}
},
"outputs": [],
"source": [
"from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
"\n",
"translation_tokenizer = AutoTokenizer.from_pretrained(\"Helsinki-NLP/opus-mt-de-en\")\n",
"\n",
"translation_model = AutoModelForSeq2SeqLM.from_pretrained(\"Helsinki-NLP/opus-mt-de-en\")"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-01T13:16:19.928232Z",
"start_time": "2023-05-01T13:16:19.310046Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\phhor\\PycharmProjects\\aki_prj23_transparenzregister\\venv\\Lib\\site-packages\\transformers\\generation\\utils.py:1313: UserWarning: Using `max_length`'s default (512) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/plain": [
"'J&T Express Vietnam helps local craft villages increase their reach.'"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def translate_sentiment(text: str) -> str:\n",
"    \"\"\"Translate a single German text to English.\"\"\"\n",
"    input_tokens = translation_tokenizer([text], return_tensors=\"pt\")\n",
"    generated_ids = translation_model.generate(**input_tokens)\n",
"    return translation_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[\n",
"        0\n",
"    ]\n",
"\n",
"\n",
"headline = (\n",
"    \"J&T Express Vietnam hilft lokalen Handwerksdörfern, ihre Reichweite zu vergrößern.\"\n",
")\n",
"tf = translate_sentiment(headline)\n",
"tf"
]
},
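{
"cell_type": "markdown",
"metadata": {},
"source": [
"The warning above recommends `max_new_tokens` for limiting the generation length, and the tokenizer can also process several texts at once.\n",
"A hedged sketch of a batched variant, assuming the `translation_tokenizer` and `translation_model` objects from the previous cells; the function name `translate_batch` and the default of 128 new tokens are illustrative choices, not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def translate_batch(texts: list[str], max_new_tokens: int = 128) -> list[str]:\n",
"    \"\"\"Translate several German texts to English in one batch (sketch).\"\"\"\n",
"    # Pad to a common length so the texts can be stacked into one tensor.\n",
"    input_tokens = translation_tokenizer(\n",
"        texts, return_tensors=\"pt\", padding=True, truncation=True\n",
"    )\n",
"    # Limit the output length explicitly instead of relying on the default max_length.\n",
"    generated_ids = translation_model.generate(\n",
"        **input_tokens, max_new_tokens=max_new_tokens\n",
"    )\n",
"    return translation_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)\n",
"\n",
"\n",
"translate_batch([\"Microsoft aktie fällt.\", \"Microsoft aktie steigt.\"])"
]
},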
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-01T13:16:23.381261Z",
"start_time": "2023-05-01T13:16:19.933234Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Am Aktienmarkt überwieg weiter die Zuversicht, wie der Kursverlauf des DAX zeigt.\n",
"Meyer Burger bedient ab sofort australischen Markt und präsentiert sich auf Smart Energy Expo in Sydney.\n",
"J&T Express Vietnam hilft lokalen Handwerksdörfern, ihre Reichweite zu vergrößern.\n",
"7 Experten empfehlen die Aktie zum Kauf, 1 Experte empfiehlt, die Aktie zu halten.\n",
"Microsoft aktie fällt.\n",
"Microsoft aktie steigt.\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>lan</th>\n",
" <th>orig</th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>en</td>\n",
" <td>NaN</td>\n",
" <td>Microsoft fails to hit profit expectations</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>de_translated</td>\n",
" <td>Am Aktienmarkt überwieg weiter die Zuversicht,...</td>\n",
" <td>On the stock market, confidence continued to p...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>en</td>\n",
" <td>NaN</td>\n",
" <td>Stocks rallied and the British pound gained.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>de_translated</td>\n",
" <td>Meyer Burger bedient ab sofort australischen M...</td>\n",
" <td>Meyer Burger is now serving the Australian mar...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>en</td>\n",
" <td>NaN</td>\n",
" <td>Meyer Burger enters Australian market and exhi...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>de_translated</td>\n",
" <td>J&amp;T Express Vietnam hilft lokalen Handwerksdör...</td>\n",
" <td>J&amp;T Express Vietnam helps local craft villages...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>de_translated</td>\n",
" <td>7 Experten empfehlen die Aktie zum Kauf, 1 Exp...</td>\n",
" <td>7 experts recommend the stock for purchase, 1 ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>de_translated</td>\n",
" <td>Microsoft aktie fällt.</td>\n",
" <td>Microsoft Aktie falls.</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>de_translated</td>\n",
" <td>Microsoft aktie steigt.</td>\n",
" <td>Microsoft share is rising.</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" lan orig \n",
"0 en NaN \\\n",
"1 de_translated Am Aktienmarkt überwieg weiter die Zuversicht,... \n",
"2 en NaN \n",
"3 de_translated Meyer Burger bedient ab sofort australischen M... \n",
"4 en NaN \n",
"5 de_translated J&T Express Vietnam hilft lokalen Handwerksdör... \n",
"6 de_translated 7 Experten empfehlen die Aktie zum Kauf, 1 Exp... \n",
"7 de_translated Microsoft aktie fällt. \n",
"8 de_translated Microsoft aktie steigt. \n",
"\n",
" text \n",
"0 Microsoft fails to hit profit expectations \n",
"1 On the stock market, confidence continued to p... \n",
"2 Stocks rallied and the British pound gained. \n",
"3 Meyer Burger is now serving the Australian mar... \n",
"4 Meyer Burger enters Australian market and exhi... \n",
"5 J&T Express Vietnam helps local craft villages... \n",
"6 7 experts recommend the stock for purchase, 1 ... \n",
"7 Microsoft Aktie falls. \n",
"8 Microsoft share is rising. "
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def translate_sentiment_series(series: pd.Series) -> pd.Series:\n",
"    \"\"\"Translate the text of one row if it is German; pass English rows through.\"\"\"\n",
"    if series[\"lan\"] == \"en\":\n",
"        return series\n",
"    elif series[\"lan\"] == \"de\":\n",
"        print(series[\"text\"])\n",
"        return pd.Series(\n",
"            {\n",
"                \"text\": translate_sentiment(series[\"text\"]),\n",
"                \"lan\": \"de_translated\",\n",
"                \"orig\": series[\"text\"],\n",
"            }\n",
"        )\n",
"    raise ValueError(f\"Language {series['lan']} is not known.\")\n",
"\n",
"\n",
"def translate_sentiments(texts: pd.DataFrame) -> pd.DataFrame:\n",
"    texts = texts.apply(translate_sentiment_series, axis=1)\n",
"    return texts\n",
"\n",
"\n",
"translated_df = translate_sentiments(text_df.copy())\n",
"translated_df"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"ExecuteTime": {
"end_time": "2023-05-01T13:16:24.076261Z",
"start_time": "2023-05-01T13:16:23.383269Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>lan</th>\n",
" <th>orig</th>\n",
" <th>text</th>\n",
" <th>+</th>\n",
" <th>0</th>\n",
" <th>-</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>en</td>\n",
" <td>NaN</td>\n",
" <td>Microsoft fails to hit profit expectations</td>\n",
" <td>0.034084</td>\n",
" <td>0.932933</td>\n",
" <td>0.032982</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>de_translated</td>\n",
" <td>Am Aktienmarkt überwieg weiter die Zuversicht,...</td>\n",
" <td>On the stock market, confidence continued to p...</td>\n",
" <td>0.919673</td>\n",
" <td>0.018426</td>\n",
" <td>0.061901</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>en</td>\n",
" <td>NaN</td>\n",
" <td>Stocks rallied and the British pound gained.</td>\n",
" <td>0.898361</td>\n",
" <td>0.034474</td>\n",
" <td>0.067165</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>de_translated</td>\n",
" <td>Meyer Burger bedient ab sofort australischen M...</td>\n",
" <td>Meyer Burger is now serving the Australian mar...</td>\n",
" <td>0.221019</td>\n",
" <td>0.006844</td>\n",
" <td>0.772137</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>en</td>\n",
" <td>NaN</td>\n",
" <td>Meyer Burger enters Australian market and exhi...</td>\n",
" <td>0.187527</td>\n",
" <td>0.008846</td>\n",
" <td>0.803627</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>de_translated</td>\n",
" <td>J&amp;T Express Vietnam hilft lokalen Handwerksdör...</td>\n",
" <td>J&amp;T Express Vietnam helps local craft villages...</td>\n",
" <td>0.891114</td>\n",
" <td>0.007633</td>\n",
" <td>0.101254</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>de_translated</td>\n",
" <td>7 Experten empfehlen die Aktie zum Kauf, 1 Exp...</td>\n",
" <td>7 experts recommend the stock for purchase, 1 ...</td>\n",
" <td>0.040850</td>\n",
" <td>0.016722</td>\n",
" <td>0.942427</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>de_translated</td>\n",
" <td>Microsoft aktie fällt.</td>\n",
" <td>Microsoft Aktie falls.</td>\n",
" <td>0.027456</td>\n",
" <td>0.889160</td>\n",
" <td>0.083384</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>de_translated</td>\n",
" <td>Microsoft aktie steigt.</td>\n",
" <td>Microsoft share is rising.</td>\n",
" <td>0.952216</td>\n",
" <td>0.019054</td>\n",
" <td>0.028730</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" lan orig \n",
"0 en NaN \\\n",
"1 de_translated Am Aktienmarkt überwieg weiter die Zuversicht,... \n",
"2 en NaN \n",
"3 de_translated Meyer Burger bedient ab sofort australischen M... \n",
"4 en NaN \n",
"5 de_translated J&T Express Vietnam hilft lokalen Handwerksdör... \n",
"6 de_translated 7 Experten empfehlen die Aktie zum Kauf, 1 Exp... \n",
"7 de_translated Microsoft aktie fällt. \n",
"8 de_translated Microsoft aktie steigt. \n",
"\n",
" text + 0 \n",
"0 Microsoft fails to hit profit expectations 0.034084 0.932933 \\\n",
"1 On the stock market, confidence continued to p... 0.919673 0.018426 \n",
"2 Stocks rallied and the British pound gained. 0.898361 0.034474 \n",
"3 Meyer Burger is now serving the Australian mar... 0.221019 0.006844 \n",
"4 Meyer Burger enters Australian market and exhi... 0.187527 0.008846 \n",
"5 J&T Express Vietnam helps local craft villages... 0.891114 0.007633 \n",
"6 7 experts recommend the stock for purchase, 1 ... 0.040850 0.016722 \n",
"7 Microsoft Aktie falls. 0.027456 0.889160 \n",
"8 Microsoft share is rising. 0.952216 0.019054 \n",
"\n",
" - \n",
"0 0.032982 \n",
"1 0.061901 \n",
"2 0.067165 \n",
"3 0.772137 \n",
"4 0.803627 \n",
"5 0.101254 \n",
"6 0.942427 \n",
"7 0.083384 \n",
"8 0.028730 "
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sentiments = analyse_sentiments(translated_df)\n",
"sentiments"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion about a translated FinBERT\n",
"\n",
"When a German text is translated to English before it is passed to FinBERT, the results look much better and could be used for our project.\n",
"The main drawback is that the extra translation step needs even more CPU time.\n",
"It should probably be combined with language detection and could then accept multiple languages, since many variants of this translation model exist."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}