{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# FinBERT\n",
"\n",
"FinBERT is a sentiment-analysis model for financial text.\n",
"Since we want to evaluate news articles, we need a tool like this to analyse their texts.\n",
"This notebook shows a first use of the tool.\n",
"Several sample texts are analysed, with a particular focus on German texts.\n",
"\n",
"## Sources\n",
"\n",
"* [Hugging Face](https://huggingface.co/ProsusAI/finbert)\n",
"* [Tutorial](https://medium.com/codex/stocks-news-sentiment-analysis-with-deep-learning-transformers-and-machine-learning-cdcdb827fc06)"
]
},
{
"cell_type": "markdown",
"source": [
"## Libraries\n",
"\n",
"* transformers\n",
"* tqdm\n",
"* pandas\n",
"* numpy\n",
"* torch\n",
"* torchvision\n",
"* torchaudio\n",
"* sentencepiece\n",
"* sacremoses"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"ExecuteTime": {
"start_time": "2023-05-01T13:16:08.554998Z",
"end_time": "2023-05-01T13:16:13.740927Z"
},
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: transformers in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (4.28.1)\n",
"Requirement already satisfied: tqdm in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (4.65.0)\n",
"Requirement already satisfied: pandas in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (2.0.1)\n",
"Requirement already satisfied: numpy in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (1.24.3)\n",
"Requirement already satisfied: torch in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (2.0.0)\n",
"Requirement already satisfied: torchvision in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (0.15.1)\n",
"Requirement already satisfied: torchaudio in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (2.0.1)\n",
"Requirement already satisfied: sentencepiece in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (0.1.98)\n",
"Requirement already satisfied: sacremoses in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (0.0.53)\n",
"Requirement already satisfied: filelock in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from transformers) (3.8.0)\n",
"Requirement already satisfied: huggingface-hub<1.0,>=0.11.0 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from transformers) (0.14.1)\n",
"Requirement already satisfied: packaging>=20.0 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from transformers) (23.1)\n",
"Requirement already satisfied: pyyaml>=5.1 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from transformers) (6.0)\n",
"Requirement already satisfied: regex!=2019.12.17 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from transformers) (2023.3.23)\n",
"Requirement already satisfied: requests in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from transformers) (2.28.1)\n",
"Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from transformers) (0.13.3)\n",
"Requirement already satisfied: colorama in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from tqdm) (0.4.6)\n",
"Requirement already satisfied: python-dateutil>=2.8.2 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from pandas) (2.8.2)\n",
"Requirement already satisfied: pytz>=2020.1 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from pandas) (2022.7)\n",
"Requirement already satisfied: tzdata>=2022.1 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from pandas) (2023.3)\n",
"Requirement already satisfied: typing-extensions in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from torch) (4.5.0)\n",
"Requirement already satisfied: sympy in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from torch) (1.11.1)\n",
"Requirement already satisfied: networkx in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from torch) (3.1)\n",
"Requirement already satisfied: jinja2 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from torch) (3.1.2)\n",
"Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from torchvision) (9.4.0)\n",
"Requirement already satisfied: six in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from sacremoses) (1.16.0)\n",
"Requirement already satisfied: click in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from sacremoses) (8.1.3)\n",
"Requirement already satisfied: joblib in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from sacremoses) (1.2.0)\n",
"Requirement already satisfied: fsspec in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from huggingface-hub<1.0,>=0.11.0->transformers) (2023.4.0)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from jinja2->torch) (2.1.2)\n",
"Requirement already satisfied: charset-normalizer<3,>=2 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from requests->transformers) (2.1.1)\n",
"Requirement already satisfied: idna<4,>=2.5 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from requests->transformers) (3.4)\n",
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from requests->transformers) (1.26.12)\n",
"Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\phhor\\appdata\\roaming\\python\\python311\\site-packages (from requests->transformers) (2022.9.24)\n",
"Requirement already satisfied: mpmath>=0.19 in c:\\users\\phhor\\pycharmprojects\\aki_prj23_transparenzregister\\venv\\lib\\site-packages (from sympy->torch) (1.3.0)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"[notice] A new release of pip is available: 23.0.1 -> 23.1.2\n",
"[notice] To update, run: python.exe -m pip install --upgrade pip\n"
]
}
],
"source": [
"!pip install transformers tqdm pandas numpy torch torchvision torchaudio sentencepiece sacremoses -U"
]
},
{
"cell_type": "markdown",
"source": [
"### Importing and creating the model and tokenizer"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"tags": [],
"ExecuteTime": {
"start_time": "2023-05-01T13:16:13.743921Z",
"end_time": "2023-05-01T13:16:15.121662Z"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import torch\n",
"\n",
"from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
"\n",
"# create a tokenizer object\n",
"tokenizer = AutoTokenizer.from_pretrained(\"ProsusAI/finbert\")\n",
"\n",
"# fetch the pretrained model\n",
"model = AutoModelForSequenceClassification.from_pretrained(\"ProsusAI/finbert\")"
]
},
{
"cell_type": "markdown",
"source": [
"### Analyze the sentiment of a single text"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"ExecuteTime": {
"start_time": "2023-05-01T13:16:15.122665Z",
"end_time": "2023-05-01T13:16:15.194193Z"
}
},
"outputs": [
{
"data": {
"text/plain": "+ 0.034084\n0 0.932933\n- 0.032982\ndtype: float32"
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
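],
"source": [
"def analyze_sentiment(text: str) -> pd.Series:\n",
"    # Tokenize the text and run FinBERT without tracking gradients.\n",
"    input_tokens = tokenizer(text, padding=True, truncation=True, return_tensors=\"pt\")\n",
"    with torch.no_grad():\n",
"        output = model(**input_tokens)\n",
"    # model.config.id2label lists the classes as positive, negative, neutral.\n",
"    return pd.Series(\n",
"        torch.nn.functional.softmax(output.logits, dim=-1)[0].numpy(),\n",
"        index=[\"+\", \"-\", \"0\"],\n",
"    )\n",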
"\n",
"\n",
"headline = \"Microsoft fails to hit profit expectations\"\n",
"tf = analyze_sentiment(headline)\n",
"tf"
]
},
{
"cell_type": "markdown",
"source": [
"### Creating test data"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"tags": [],
"ExecuteTime": {
"start_time": "2023-05-01T13:16:15.198186Z",
"end_time": "2023-05-01T13:16:15.208856Z"
}
},
"outputs": [
{
"data": {
"text/plain": " text lan\n0 Microsoft fails to hit profit expectations en\n1 Am Aktienmarkt überwieg weiter die Zuversicht,... de\n2 Stocks rallied and the British pound gained. en\n3 Meyer Burger bedient ab sofort australischen M... de\n4 Meyer Burger enters Australian market and exhi... en\n5 J&T Express Vietnam hilft lokalen Handwerksdör... de\n6 7 Experten empfehlen die Aktie zum Kauf, 1 Exp... de\n7 Microsoft aktie fällt. de\n8 Microsoft aktie steigt. de",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>text</th>\n <th>lan</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Microsoft fails to hit profit expectations</td>\n <td>en</td>\n </tr>\n <tr>\n <th>1</th>\n <td>Am Aktienmarkt überwieg weiter die Zuversicht,...</td>\n <td>de</td>\n </tr>\n <tr>\n <th>2</th>\n <td>Stocks rallied and the British pound gained.</td>\n <td>en</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Meyer Burger bedient ab sofort australischen M...</td>\n <td>de</td>\n </tr>\n <tr>\n <th>4</th>\n <td>Meyer Burger enters Australian market and exhi...</td>\n <td>en</td>\n </tr>\n <tr>\n <th>5</th>\n <td>J&amp;T Express Vietnam hilft lokalen Handwerksdör...</td>\n <td>de</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7 Experten empfehlen die Aktie zum Kauf, 1 Exp...</td>\n <td>de</td>\n </tr>\n <tr>\n <th>7</th>\n <td>Microsoft aktie fällt.</td>\n <td>de</td>\n </tr>\n <tr>\n <th>8</th>\n <td>Microsoft aktie steigt.</td>\n <td>de</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_df = pd.DataFrame(\n",
"    [\n",
"        {\"text\": \"Microsoft fails to hit profit expectations\", \"lan\": \"en\"},\n",
"        {\n",
"            \"text\": \"Am Aktienmarkt überwieg weiter die Zuversicht, wie der Kursverlauf des DAX zeigt.\",\n",
"            \"lan\": \"de\",\n",
"        },\n",
"        {\"text\": \"Stocks rallied and the British pound gained.\", \"lan\": \"en\"},\n",
"        {\n",
"            \"text\": \"Meyer Burger bedient ab sofort australischen Markt und präsentiert sich auf Smart Energy Expo in Sydney.\",\n",
"            \"lan\": \"de\",\n",
"        },\n",
"        {\n",
"            \"text\": \"Meyer Burger enters Australian market and exhibits at Smart Energy Expo in Sydney.\",\n",
"            \"lan\": \"en\",\n",
"        },\n",
"        {\n",
"            \"text\": \"J&T Express Vietnam hilft lokalen Handwerksdörfern, ihre Reichweite zu vergrößern.\",\n",
"            \"lan\": \"de\",\n",
"        },\n",
"        {\n",
"            \"text\": \"7 Experten empfehlen die Aktie zum Kauf, 1 Experte empfiehlt, die Aktie zu halten.\",\n",
"            \"lan\": \"de\",\n",
"        },\n",
"        {\"text\": \"Microsoft aktie fällt.\", \"lan\": \"de\"},\n",
"        {\"text\": \"Microsoft aktie steigt.\", \"lan\": \"de\"},\n",
"    ]\n",
")\n",
"text_df"
]
},
{
"cell_type": "markdown",
"source": [
"### Analyze the sentiment of multiple texts"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"ExecuteTime": {
"start_time": "2023-05-01T13:16:15.211858Z",
"end_time": "2023-05-01T13:16:16.132009Z"
}
},
"outputs": [
{
"data": {
"text/plain": " text lan + 0 \n0 Microsoft fails to hit profit expectations en 0.034084 0.932933 \\\n1 Am Aktienmarkt überwieg weiter die Zuversicht,... de 0.053528 0.027950 \n2 Stocks rallied and the British pound gained. en 0.898361 0.034474 \n3 Meyer Burger bedient ab sofort australischen M... de 0.116597 0.012790 \n4 Meyer Burger enters Australian market and exhi... en 0.187527 0.008846 \n5 J&T Express Vietnam hilft lokalen Handwerksdör... de 0.066277 0.020608 \n6 7 Experten empfehlen die Aktie zum Kauf, 1 Exp... de 0.050346 0.022004 \n7 Microsoft aktie fällt. de 0.066061 0.016440 \n8 Microsoft aktie steigt. de 0.041449 0.018471 \n\n - \n0 0.032982 \n1 0.918522 \n2 0.067165 \n3 0.870613 \n4 0.803627 \n5 0.913115 \n6 0.927650 \n7 0.917498 \n8 0.940080 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>text</th>\n <th>lan</th>\n <th>+</th>\n <th>0</th>\n <th>-</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Microsoft fails to hit profit expectations</td>\n <td>en</td>\n <td>0.034084</td>\n <td>0.932933</td>\n <td>0.032982</td>\n </tr>\n <tr>\n <th>1</th>\n <td>Am Aktienmarkt überwieg weiter die Zuversicht,...</td>\n <td>de</td>\n <td>0.053528</td>\n <td>0.027950</td>\n <td>0.918522</td>\n </tr>\n <tr>\n <th>2</th>\n <td>Stocks rallied and the British pound gained.</td>\n <td>en</td>\n <td>0.898361</td>\n <td>0.034474</td>\n <td>0.067165</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Meyer Burger bedient ab sofort australischen M...</td>\n <td>de</td>\n <td>0.116597</td>\n <td>0.012790</td>\n <td>0.870613</td>\n </tr>\n <tr>\n <th>4</th>\n <td>Meyer Burger enters Australian market and exhi...</td>\n <td>en</td>\n <td>0.187527</td>\n <td>0.008846</td>\n <td>0.803627</td>\n </tr>\n <tr>\n <th>5</th>\n <td>J&amp;T Express Vietnam hilft lokalen Handwerksdör...</td>\n <td>de</td>\n <td>0.066277</td>\n <td>0.020608</td>\n <td>0.913115</td>\n </tr>\n <tr>\n <th>6</th>\n <td>7 Experten empfehlen die Aktie zum Kauf, 1 Exp...</td>\n <td>de</td>\n <td>0.050346</td>\n <td>0.022004</td>\n <td>0.927650</td>\n </tr>\n <tr>\n <th>7</th>\n <td>Microsoft aktie fällt.</td>\n <td>de</td>\n <td>0.066061</td>\n <td>0.016440</td>\n <td>0.917498</td>\n </tr>\n <tr>\n <th>8</th>\n <td>Microsoft aktie steigt.</td>\n <td>de</td>\n <td>0.041449</td>\n <td>0.018471</td>\n <td>0.940080</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
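],
"source": [
"def analyse_sentiments(texts: pd.DataFrame) -> pd.DataFrame:\n",
"    # Score every text; each result becomes one row of class probabilities.\n",
"    values = texts[\"text\"].apply(analyze_sentiment)\n",
"    # Columns are assigned by position: positive, negative, neutral.\n",
"    texts[[\"+\", \"-\", \"0\"]] = values\n",
"    return texts\n",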
"\n",
"\n",
"analyse_sentiments(text_df.copy())"
]
},
{
"cell_type": "markdown",
"source": [
"## Conclusion about FinBERT\n",
"\n",
"In its current form, the model cannot be used for German text: all German samples receive almost identical scores, regardless of their content.\n",
"It could be used if the text is translated to English beforehand, but it is questionable whether that works well.\n",
"Another option would be to retrain the same model on a German translation of its training data, but I do not believe this to be feasible."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"# Translating texts before analysing them with FinBERT\n",
"\n",
"FinBERT's limitation to English can be worked around by translating the input before passing it to the model.\n",
"The functions below explore this.\n",
"\n",
"* [Translator: Helsinki-NLP/opus-mt-de-en](https://huggingface.co/Helsinki-NLP/opus-mt-de-en)\n",
"* [MarianMT documentation](https://huggingface.co/docs/transformers/main/en/model_doc/marian#transformers.MarianMTModel)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 30,
"outputs": [],
"source": [
"from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
"\n",
"translation_tokenizer = AutoTokenizer.from_pretrained(\"Helsinki-NLP/opus-mt-de-en\")\n",
"\n",
"translation_model = AutoModelForSeq2SeqLM.from_pretrained(\"Helsinki-NLP/opus-mt-de-en\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"start_time": "2023-05-01T13:16:16.135009Z",
"end_time": "2023-05-01T13:16:19.308043Z"
}
}
},
{
"cell_type": "code",
"execution_count": 31,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\phhor\\PycharmProjects\\aki_prj23_transparenzregister\\venv\\Lib\\site-packages\\transformers\\generation\\utils.py:1313: UserWarning: Using `max_length`'s default (512) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.\n",
" warnings.warn(\n"
]
},
{
"data": {
"text/plain": "'J&T Express Vietnam helps local craft villages increase their reach.'"
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def translate_sentiment(text: str) -> str:\n",
"    # Tokenize the German text, generate the English translation, and decode it.\n",
"    input_tokens = translation_tokenizer([text], return_tensors=\"pt\")\n",
"    generated_ids = translation_model.generate(**input_tokens)\n",
"    return translation_tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[\n",
"        0\n",
"    ]\n",
"\n",
"\n",
"headline = (\n",
"    \"J&T Express Vietnam hilft lokalen Handwerksdörfern, ihre Reichweite zu vergrößern.\"\n",
")\n",
"tf = translate_sentiment(headline)\n",
"tf"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"start_time": "2023-05-01T13:16:19.310046Z",
"end_time": "2023-05-01T13:16:19.928232Z"
}
}
},
{
"cell_type": "code",
"execution_count": 32,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Am Aktienmarkt überwieg weiter die Zuversicht, wie der Kursverlauf des DAX zeigt.\n",
"Meyer Burger bedient ab sofort australischen Markt und präsentiert sich auf Smart Energy Expo in Sydney.\n",
"J&T Express Vietnam hilft lokalen Handwerksdörfern, ihre Reichweite zu vergrößern.\n",
"7 Experten empfehlen die Aktie zum Kauf, 1 Experte empfiehlt, die Aktie zu halten.\n",
"Microsoft aktie fällt.\n",
"Microsoft aktie steigt.\n"
]
},
{
"data": {
"text/plain": " lan orig \n0 en NaN \\\n1 de_translated Am Aktienmarkt überwieg weiter die Zuversicht,... \n2 en NaN \n3 de_translated Meyer Burger bedient ab sofort australischen M... \n4 en NaN \n5 de_translated J&T Express Vietnam hilft lokalen Handwerksdör... \n6 de_translated 7 Experten empfehlen die Aktie zum Kauf, 1 Exp... \n7 de_translated Microsoft aktie fällt. \n8 de_translated Microsoft aktie steigt. \n\n text \n0 Microsoft fails to hit profit expectations \n1 On the stock market, confidence continued to p... \n2 Stocks rallied and the British pound gained. \n3 Meyer Burger is now serving the Australian mar... \n4 Meyer Burger enters Australian market and exhi... \n5 J&T Express Vietnam helps local craft villages... \n6 7 experts recommend the stock for purchase, 1 ... \n7 Microsoft Aktie falls. \n8 Microsoft share is rising. ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>lan</th>\n <th>orig</th>\n <th>text</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>en</td>\n <td>NaN</td>\n <td>Microsoft fails to hit profit expectations</td>\n </tr>\n <tr>\n <th>1</th>\n <td>de_translated</td>\n <td>Am Aktienmarkt überwieg weiter die Zuversicht,...</td>\n <td>On the stock market, confidence continued to p...</td>\n </tr>\n <tr>\n <th>2</th>\n <td>en</td>\n <td>NaN</td>\n <td>Stocks rallied and the British pound gained.</td>\n </tr>\n <tr>\n <th>3</th>\n <td>de_translated</td>\n <td>Meyer Burger bedient ab sofort australischen M...</td>\n <td>Meyer Burger is now serving the Australian mar...</td>\n </tr>\n <tr>\n <th>4</th>\n <td>en</td>\n <td>NaN</td>\n <td>Meyer Burger enters Australian market and exhi...</td>\n </tr>\n <tr>\n <th>5</th>\n <td>de_translated</td>\n <td>J&amp;T Express Vietnam hilft lokalen Handwerksdör...</td>\n <td>J&amp;T Express Vietnam helps local craft villages...</td>\n </tr>\n <tr>\n <th>6</th>\n <td>de_translated</td>\n <td>7 Experten empfehlen die Aktie zum Kauf, 1 Exp...</td>\n <td>7 experts recommend the stock for purchase, 1 ...</td>\n </tr>\n <tr>\n <th>7</th>\n <td>de_translated</td>\n <td>Microsoft aktie fällt.</td>\n <td>Microsoft Aktie falls.</td>\n </tr>\n <tr>\n <th>8</th>\n <td>de_translated</td>\n <td>Microsoft aktie steigt.</td>\n <td>Microsoft share is rising.</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def translate_sentiment_series(series: pd.Series) -> pd.Series:\n",
"    # English rows pass through unchanged; German rows are translated.\n",
"    if series[\"lan\"] == \"en\":\n",
"        return series\n",
"    elif series[\"lan\"] == \"de\":\n",
"        print(series[\"text\"])\n",
"        return pd.Series(\n",
"            {\n",
"                \"text\": translate_sentiment(series[\"text\"]),\n",
"                \"lan\": \"de_translated\",\n",
"                \"orig\": series[\"text\"],\n",
"            }\n",
"        )\n",
"    raise ValueError(f\"Language {series['lan']} is not known.\")\n",
"\n",
"\n",
"def translate_sentiments(texts: pd.DataFrame) -> pd.DataFrame:\n",
"    texts = texts.apply(translate_sentiment_series, axis=1)\n",
"    return texts\n",
"\n",
"\n",
"translated_df = translate_sentiments(text_df.copy())\n",
"translated_df"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"start_time": "2023-05-01T13:16:19.933234Z",
"end_time": "2023-05-01T13:16:23.381261Z"
}
}
},
{
"cell_type": "code",
"execution_count": 33,
"outputs": [
{
"data": {
"text/plain": " lan orig \n0 en NaN \\\n1 de_translated Am Aktienmarkt überwieg weiter die Zuversicht,... \n2 en NaN \n3 de_translated Meyer Burger bedient ab sofort australischen M... \n4 en NaN \n5 de_translated J&T Express Vietnam hilft lokalen Handwerksdör... \n6 de_translated 7 Experten empfehlen die Aktie zum Kauf, 1 Exp... \n7 de_translated Microsoft aktie fällt. \n8 de_translated Microsoft aktie steigt. \n\n text + 0 \n0 Microsoft fails to hit profit expectations 0.034084 0.932933 \\\n1 On the stock market, confidence continued to p... 0.919673 0.018426 \n2 Stocks rallied and the British pound gained. 0.898361 0.034474 \n3 Meyer Burger is now serving the Australian mar... 0.221019 0.006844 \n4 Meyer Burger enters Australian market and exhi... 0.187527 0.008846 \n5 J&T Express Vietnam helps local craft villages... 0.891114 0.007633 \n6 7 experts recommend the stock for purchase, 1 ... 0.040850 0.016722 \n7 Microsoft Aktie falls. 0.027456 0.889160 \n8 Microsoft share is rising. 0.952216 0.019054 \n\n - \n0 0.032982 \n1 0.061901 \n2 0.067165 \n3 0.772137 \n4 0.803627 \n5 0.101254 \n6 0.942427 \n7 0.083384 \n8 0.028730 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>lan</th>\n <th>orig</th>\n <th>text</th>\n <th>+</th>\n <th>0</th>\n <th>-</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>en</td>\n <td>NaN</td>\n <td>Microsoft fails to hit profit expectations</td>\n <td>0.034084</td>\n <td>0.932933</td>\n <td>0.032982</td>\n </tr>\n <tr>\n <th>1</th>\n <td>de_translated</td>\n <td>Am Aktienmarkt überwieg weiter die Zuversicht,...</td>\n <td>On the stock market, confidence continued to p...</td>\n <td>0.919673</td>\n <td>0.018426</td>\n <td>0.061901</td>\n </tr>\n <tr>\n <th>2</th>\n <td>en</td>\n <td>NaN</td>\n <td>Stocks rallied and the British pound gained.</td>\n <td>0.898361</td>\n <td>0.034474</td>\n <td>0.067165</td>\n </tr>\n <tr>\n <th>3</th>\n <td>de_translated</td>\n <td>Meyer Burger bedient ab sofort australischen M...</td>\n <td>Meyer Burger is now serving the Australian mar...</td>\n <td>0.221019</td>\n <td>0.006844</td>\n <td>0.772137</td>\n </tr>\n <tr>\n <th>4</th>\n <td>en</td>\n <td>NaN</td>\n <td>Meyer Burger enters Australian market and exhi...</td>\n <td>0.187527</td>\n <td>0.008846</td>\n <td>0.803627</td>\n </tr>\n <tr>\n <th>5</th>\n <td>de_translated</td>\n <td>J&amp;T Express Vietnam hilft lokalen Handwerksdör...</td>\n <td>J&amp;T Express Vietnam helps local craft villages...</td>\n <td>0.891114</td>\n <td>0.007633</td>\n <td>0.101254</td>\n </tr>\n <tr>\n <th>6</th>\n <td>de_translated</td>\n <td>7 Experten empfehlen die Aktie zum Kauf, 1 Exp...</td>\n <td>7 experts recommend the stock for purchase, 1 ...</td>\n <td>0.040850</td>\n <td>0.016722</td>\n <td>0.942427</td>\n </tr>\n <tr>\n <th>7</th>\n <td>de_translated</td>\n <td>Microsoft aktie fällt.</td>\n <td>Microsoft Aktie falls.</td>\n <td>0.027456</td>\n <td>0.889160</td>\n <td>0.083384</td>\n </tr>\n <tr>\n <th>8</th>\n <td>de_translated</td>\n <td>Microsoft aktie steigt.</td>\n <td>Microsoft share is rising.</td>\n <td>0.952216</td>\n <td>0.019054</td>\n <td>0.028730</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sentiments = analyse_sentiments(translated_df)\n",
"sentiments"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"start_time": "2023-05-01T13:16:23.383269Z",
"end_time": "2023-05-01T13:16:24.076261Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Conclusion about a translated FinBERT\n",
"\n",
"When German text is translated to English before it is passed to FinBERT, the results look much better and could be used for our project.\n",
"The main drawback is the extra CPU time: in this run, translating the six German sentences took about 3.4 s, while scoring all nine texts took under 1 s.\n",
"The approach should probably be combined with language detection; it could then handle further languages, since many variants of this translation model exist."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}