VADER analysis 🚧
Currently based on this tutorial:
https://www.kaggle.com/code/robikscube/sentiment-analysis-python-youtube-tutorial https://www.youtube.com/watch?v=QpzMWQvxXWk
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
First, we import all the libraries we need.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Then we read the .csv file.
df = pd.read_csv("/path/to/the/file.csv")
Let's take a closer look at the data before we start working with it.
print(df.shape)
print(df.head(500))
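Since the plots further down group the scores by the star rating column "Score", it can also help to look at how the ratings are distributed first. A small sketch, assuming the CSV actually has such a column:
# distribution of star ratings (assumes a "Score" column in the CSV)
ax = df["Score"].value_counts().sort_index().plot(kind="bar", title="Count of Reviews by Stars")
ax.set_xlabel("Review Stars")
plt.show()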
Now we start with the sentiment analysis. For this we use the SentimentIntensityAnalyzer from nltk, which uses VADER under the hood.
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
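If the VADER lexicon has not been downloaded yet, constructing the analyzer fails; it can be fetched once through nltk's downloader:
import nltk
nltk.download("vader_lexicon")  # one-time download of the lexicon VADER uses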
With sia.polarity_scores() we can now analyze any text we like.
print(sia.polarity_scores("I am so happy"))
As output we get a dictionary with four fields: "neg" for negative, "neu" for neutral, "pos" for positive, and "compound" for a cumulative value that expresses how positive, negative, or neutral the text is as a whole.
Here is a quote from the "vaderSentiment" project's GitHub on how the compound score is computed:
The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.
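The "normalized" part of that quote is, in the vaderSentiment source, a simple rescaling of the raw valence sum; roughly this (the alpha = 15 default approximates the largest raw sum one expects to see):
import math

def normalize(score, alpha=15):
    # rescale the raw sum of valence scores into the range (-1, 1);
    # alpha approximates the maximum expected raw value
    return score / math.sqrt(score * score + alpha)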
And here is what "pos", "neu" and "neg" are:
The pos, neu, and neg scores are ratios for proportions of text that fall in each category (so these should all add up to be 1... or close to it with float operation). These are the most useful metrics if you want to analyze the context & presentation of how sentiment is conveyed or embedded in rhetoric for a given sentence. For example, different writing styles may embed strongly positive or negative sentiment within varying proportions of neutral text -- i.e., some writing styles may reflect a penchant for strongly flavored rhetoric, whereas other styles may use a great deal of neutral text while still conveying a similar overall (compound) sentiment. As another example: researchers analyzing information presentation in journalistic or editorical news might desire to establish whether the proportions of text (associated with a topic or named entity, for example) are balanced with similar amounts of positively and negatively framed text versus being "biased" towards one polarity or the other for the topic/entity.
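A quick way to check the "should all add up to be 1" part is to sum the three ratios for any example sentence (the concrete numbers depend on the input and the VADER version):
scores = sia.polarity_scores("The food was amazing, but the service was painfully slow.")
print(scores)
print(scores["neg"] + scores["neu"] + scores["pos"])  # ~1.0, up to float rounding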
Now we can loop over all of our data:
res = {}
for i, row in df.iterrows():
    # score each text and store the result under the row's id
    res[row["id"]] = sia.polarity_scores(row["text"])
print(res)
res = pd.DataFrame(res).T  # one row per review, one column per score
# join the scores back onto the original rows so the "Score" (star rating) column is available for plotting
res = res.reset_index().rename(columns={"index": "id"}).merge(df, on="id")
res.head()
And with seaborn we plot a bar chart of the compound score for each star rating.
ax = sns.barplot(data=res, x="Score", y="compound")
ax.set_title("Compound Score by Star Review")
plt.show()
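We can break the same comparison down into the three component scores "pos", "neu" and "neg":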
fig, axs = plt.subplots(1, 3, figsize=(12, 3))
sns.barplot(data=res, x="Score", y="pos", ax=axs[0])
sns.barplot(data=res, x="Score", y="neu", ax=axs[1])
sns.barplot(data=res, x="Score", y="neg", ax=axs[2])
axs[0].set_title('Positive')
axs[1].set_title('Neutral')
axs[2].set_title('Negative')
plt.tight_layout()
plt.show()