mirror of
https://github.com/fhswf/aki_prj23_transparenzregister.git
synced 2025-04-22 12:12:55 +02:00
Reverting black for the jupyter notebooks gets old. Can we just run black over all of them?
333 lines
12 KiB
Plaintext
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sentiment Analysis using VADER\n",
    "*Based on: [Social Media Sentiment Analysis in Python with VADER](https://towardsdatascience.com/social-media-sentiment-analysis-in-python-with-vader-no-training-required-4bc6a21e87b8)*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "VADER is a lexicon- and rule-based model for sentiment analysis. Its result is a dictionary with four keys:\n",
    "* neg (negative)\n",
    "* neu (neutral)\n",
    "* pos (positive)\n",
    "* compound (overall score that expresses the degree of the sentiment)\n",
    "\n",
    "The neg, neu and pos values add up to 1. The compound is a value between -1 and +1. A compound greater than or equal to 0.05 indicates a positive sentiment, and a compound less than or equal to -0.05 indicates a negative sentiment. Otherwise the text is considered neutral."
   ]
  },
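  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the threshold rule described above. The helper `label_from_compound` and the example scores 0.6 and -0.3 are illustrative assumptions and not part of `nltk`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper (not part of nltk): map a compound score to a label\n",
    "def label_from_compound(compound, threshold=0.05):\n",
    "    if compound >= threshold:\n",
    "        return \"positive\"\n",
    "    if compound <= -threshold:\n",
    "        return \"negative\"\n",
    "    return \"neutral\"\n",
    "\n",
    "\n",
    "# Illustrative scores only, chosen to hit each of the three branches\n",
    "for score in (0.6, 0.0258, -0.3):\n",
    "    print(score, label_from_compound(score))"
   ]
  },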
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use the VADER library, we need to install `nltk`. We will also install `pandas` for our example data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: nltk in /Users/kim/opt/anaconda3/lib/python3.9/site-packages (3.6.5)\n",
      "Requirement already satisfied: pandas in /Users/kim/opt/anaconda3/lib/python3.9/site-packages (1.3.4)\n",
      "Requirement already satisfied: click in /Users/kim/opt/anaconda3/lib/python3.9/site-packages (from nltk) (8.0.3)\n",
      "Requirement already satisfied: joblib in /Users/kim/opt/anaconda3/lib/python3.9/site-packages (from nltk) (1.1.0)\n",
      "Requirement already satisfied: regex>=2021.8.3 in /Users/kim/opt/anaconda3/lib/python3.9/site-packages (from nltk) (2021.8.3)\n",
      "Requirement already satisfied: tqdm in /Users/kim/opt/anaconda3/lib/python3.9/site-packages (from nltk) (4.62.3)\n",
      "Requirement already satisfied: python-dateutil>=2.7.3 in /Users/kim/opt/anaconda3/lib/python3.9/site-packages (from pandas) (2.8.2)\n",
      "Requirement already satisfied: pytz>=2017.3 in /Users/kim/opt/anaconda3/lib/python3.9/site-packages (from pandas) (2021.3)\n",
      "Requirement already satisfied: numpy>=1.17.3 in /Users/kim/opt/anaconda3/lib/python3.9/site-packages (from pandas) (1.20.3)\n",
      "Requirement already satisfied: six>=1.5 in /Users/kim/opt/anaconda3/lib/python3.9/site-packages (from python-dateutil>=2.7.3->pandas) (1.16.0)\n"
     ]
    }
   ],
   "source": [
    "!pip install nltk pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use VADER for our analysis, we import the `nltk` package, download the VADER lexicon, and create a `SentimentIntensityAnalyzer`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package vader_lexicon to\n",
      "[nltk_data] /Users/kim/nltk_data...\n",
      "[nltk_data] Package vader_lexicon is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "\n",
    "# Download the lexicon\n",
    "nltk.download(\"vader_lexicon\")\n",
    "\n",
    "# Import the analyzer class\n",
    "from nltk.sentiment.vader import SentimentIntensityAnalyzer\n",
    "\n",
    "# Create an instance of SentimentIntensityAnalyzer\n",
    "sent_analyzer = SentimentIntensityAnalyzer()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To compare our results with FinBERT, we analyze the same headline as in the FinBERT notebook:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'neg': 0.289, 'neu': 0.412, 'pos': 0.299, 'compound': 0.0258}\n"
     ]
    }
   ],
   "source": [
    "headline = \"Microsoft fails to hit profit expectations\"\n",
    "print(sent_analyzer.polarity_scores(headline))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since `compound` = 0.0258 lies between -0.05 and 0.05, this first headline is considered neutral. Now we create the same test data as in the FinBERT notebook. As VADER only works with English texts, we directly use the translations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "text_df = pd.DataFrame(\n",
    "    [\n",
    "        {\"text\": \"Microsoft fails to hit profit expectations.\"},\n",
    "        {\n",
    "            \"text\": \"Confidence continues to prevail on the stock market, as the performance of the DAX shows.\"\n",
    "        },\n",
    "        {\"text\": \"Stocks rallied and the British pound gained.\"},\n",
    "        {\n",
    "            \"text\": \"Meyer Burger now serves Australian market and presents itself at Smart Energy Expo in Sydney.\"\n",
    "        },\n",
    "        {\n",
    "            \"text\": \"Meyer Burger enters Australian market and exhibits at Smart Energy Expo in Sydney.\"\n",
    "        },\n",
    "        {\n",
    "            \"text\": \"J&T Express Vietnam helps local craft villages increase their reach.\"\n",
    "        },\n",
    "        {\n",
    "            \"text\": \"7 experts recommend the stock for purchase, 1 expert recommends holding the stock.\"\n",
    "        },\n",
    "        {\"text\": \"Microsoft share falls.\"},\n",
    "        {\"text\": \"Microsoft share is rising.\"},\n",
    "    ]\n",
    ")"
   ]
  },
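  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a rough idea of what drives the scores, we can look up single words in the analyzer's lexicon. This is only a sketch: it assumes the `sent_analyzer` instance created above, splits on whitespace instead of using VADER's own tokenization, and prints `None` for words that are not in the lexicon:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assumes `sent_analyzer` from the cell above; `lexicon` is its word -> valence dict\n",
    "for sentence in [\"Microsoft share falls.\", \"Microsoft share is rising.\"]:\n",
    "    # Simplified tokenization: strip the final period, lowercase, split on spaces\n",
    "    for word in sentence.rstrip(\".\").lower().split():\n",
    "        print(word, sent_analyzer.lexicon.get(word))"
   ]
  },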
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Analysis of the sample data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       " .dataframe tbody tr th:only-of-type {\n",
       " vertical-align: middle;\n",
       " }\n",
       "\n",
       " .dataframe tbody tr th {\n",
       " vertical-align: top;\n",
       " }\n",
       "\n",
       " .dataframe thead th {\n",
       " text-align: right;\n",
       " }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       " <thead>\n",
       " <tr style=\"text-align: right;\">\n",
       " <th></th>\n",
       " <th>text</th>\n",
       " <th>vader_prediction</th>\n",
       " </tr>\n",
       " </thead>\n",
       " <tbody>\n",
       " <tr>\n",
       " <th>0</th>\n",
       " <td>Microsoft fails to hit profit expectations.</td>\n",
       " <td>{'neg': 0.289, 'neu': 0.412, 'pos': 0.299, 'co...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>1</th>\n",
       " <td>Confidence continues to prevail on the stock m...</td>\n",
       " <td>{'neg': 0.0, 'neu': 0.809, 'pos': 0.191, 'comp...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>2</th>\n",
       " <td>Stocks rallied and the British pound gained.</td>\n",
       " <td>{'neg': 0.0, 'neu': 0.698, 'pos': 0.302, 'comp...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>3</th>\n",
       " <td>Meyer Burger now serves Australian market and ...</td>\n",
       " <td>{'neg': 0.0, 'neu': 0.73, 'pos': 0.27, 'compou...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>4</th>\n",
       " <td>Meyer Burger enters Australian market and exhi...</td>\n",
       " <td>{'neg': 0.0, 'neu': 0.696, 'pos': 0.304, 'comp...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>5</th>\n",
       " <td>J&T Express Vietnam helps local craft villages...</td>\n",
       " <td>{'neg': 0.0, 'neu': 0.538, 'pos': 0.462, 'comp...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>6</th>\n",
       " <td>7 experts recommend the stock for purchase, 1 ...</td>\n",
       " <td>{'neg': 0.0, 'neu': 0.672, 'pos': 0.328, 'comp...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>7</th>\n",
       " <td>Microsoft share falls.</td>\n",
       " <td>{'neg': 0.0, 'neu': 0.476, 'pos': 0.524, 'comp...</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>8</th>\n",
       " <td>Microsoft share is rising.</td>\n",
       " <td>{'neg': 0.0, 'neu': 0.577, 'pos': 0.423, 'comp...</td>\n",
       " </tr>\n",
       " </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       " text \\\n",
       "0 Microsoft fails to hit profit expectations. \n",
       "1 Confidence continues to prevail on the stock m... \n",
       "2 Stocks rallied and the British pound gained. \n",
       "3 Meyer Burger now serves Australian market and ... \n",
       "4 Meyer Burger enters Australian market and exhi... \n",
       "5 J&T Express Vietnam helps local craft villages... \n",
       "6 7 experts recommend the stock for purchase, 1 ... \n",
       "7 Microsoft share falls. \n",
       "8 Microsoft share is rising. \n",
       "\n",
       " vader_prediction \n",
       "0 {'neg': 0.289, 'neu': 0.412, 'pos': 0.299, 'co... \n",
       "1 {'neg': 0.0, 'neu': 0.809, 'pos': 0.191, 'comp... \n",
       "2 {'neg': 0.0, 'neu': 0.698, 'pos': 0.302, 'comp... \n",
       "3 {'neg': 0.0, 'neu': 0.73, 'pos': 0.27, 'compou... \n",
       "4 {'neg': 0.0, 'neu': 0.696, 'pos': 0.304, 'comp... \n",
       "5 {'neg': 0.0, 'neu': 0.538, 'pos': 0.462, 'comp... \n",
       "6 {'neg': 0.0, 'neu': 0.672, 'pos': 0.328, 'comp... \n",
       "7 {'neg': 0.0, 'neu': 0.476, 'pos': 0.524, 'comp... \n",
       "8 {'neg': 0.0, 'neu': 0.577, 'pos': 0.423, 'comp... "
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def format_output(output_dict):\n",
    "    # Map the compound score to a polarity label using the +/-0.05 thresholds\n",
    "    polarity = \"neutral\"\n",
    "\n",
    "    if output_dict[\"compound\"] >= 0.05:\n",
    "        polarity = \"positive\"\n",
    "\n",
    "    elif output_dict[\"compound\"] <= -0.05:\n",
    "        polarity = \"negative\"\n",
    "\n",
    "    return polarity\n",
    "\n",
    "\n",
    "def predict_sentiment(text):\n",
    "    # Return the raw VADER score dictionary (neg, neu, pos, compound)\n",
    "    output_dict = sent_analyzer.polarity_scores(text)\n",
    "    return output_dict\n",
    "\n",
    "\n",
    "# Run the predictions\n",
    "text_df[\"vader_prediction\"] = text_df[\"text\"].apply(predict_sentiment)\n",
    "\n",
    "# Show results\n",
    "text_df"
   ]
  },
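  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dictionaries are hard to compare at a glance. As a sketch, the `format_output` helper defined above can be applied to each prediction to obtain a label. The column names `compound` and `vader_label` are our own choice for illustration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Assumes `text_df` and `format_output` from the cell above\n",
    "# Extract the compound score into its own (hypothetically named) column\n",
    "text_df[\"compound\"] = text_df[\"vader_prediction\"].apply(\n",
    "    lambda scores: scores[\"compound\"]\n",
    ")\n",
    "\n",
    "# Derive the polarity label from each score dictionary\n",
    "text_df[\"vader_label\"] = text_df[\"vader_prediction\"].apply(format_output)\n",
    "\n",
    "# Show text, score and label side by side\n",
    "text_df[[\"text\", \"compound\", \"vader_label\"]]"
   ]
  },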
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "Since VADER essentially evaluates the sentiment of single words, it does not seem suitable for our financial texts. In particular, the two examples \"Microsoft share falls\" and \"Microsoft share is rising\" should yield opposite sentiments, yet both receive a `neg` score of 0.0 and a high `pos` score (0.524 and 0.423)."
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
  },
  "kernelspec": {
   "display_name": "Python 3.10.1 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.1"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}