We analyze the Twitter stream for the German #Tatort “Am Ende des Flurs” from 04.05.2014.
This post is an (unformatted) copy of the IPython notebook, which can be found on our GitHub.
OK, let's go…
»I hob scho immer Frauen mögn, wo ma übern Zaun steigen muass.« ([Bavarian] “I have always liked women you have to climb over a fence for.”)
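Before we dig in, a note on the data: the tweets object used throughout is a pandas DataFrame with one row per tweet, indexed by tweet time, and with columns like user, text and follower. The collection step is not part of this notebook; a minimal sketch of the assumed structure (the file name is made up):

import pandas as pd

# Assumed structure: one row per tweet, indexed by creation time
tweets = pd.read_csv('tatort_tweets.csv', parse_dates=['created_at'],
                     index_col='created_at')
tweets[['user', 'text', 'follower']].head()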
Some Example Tweets from the Database
Tweets per Minute
Can you guess when the Tatort started? 🙂
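The plot itself is not reproduced in this unformatted copy; counting tweets per minute with pandas could look like this (assuming the DatetimeIndex sketched above):

tweets.text.resample('1min', how='count').plot(figsize=(16, 4))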
Geolocation of the Tweets
Text Processing with the Natural Language Toolkit
This great book covers almost everything shown here:
Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper O’Reilly Media, 2009
import nltk
from nltk.corpus import stopwords
from nltk import FreqDist
text = tweets['text']
Common words of a language that we filter out
stop_eng = stopwords.words('english')
stop_ger = stopwords.words('german')
customstopwords = ['tatort', 'mal', 'heute', 'gerade', 'erst', 'macht', 'eigentlich', 'warum', 'gibt', 'gar', 'immer', 'schon', 'beim', 'ganz', 'dass', 'wer', 'mehr', 'gleich', 'wohl']
Clean the tweets of the stuff we are not interested in
tokens = []
sentences = []
for txt in text.values:
    sentences.append(txt.lower())
    tokens.extend([t.lower().encode('utf-8').strip(":,.!?") for t in txt.split()])

# Separate out hashtags, mentions and links
hashtags = [w for w in tokens if w.startswith('#')]
mentions = [w for w in tokens if w.startswith('@')]
links = [w for w in tokens if w.startswith('http') or w.startswith('www')]

# Keep only alphabetic words of 3+ characters that are not stopwords,
# hashtags, mentions or links
filtered_tokens = [w for w in tokens
                   if w not in stop_eng
                   and w not in stop_ger
                   and w not in customstopwords
                   and w.isalpha()
                   and len(w) >= 3
                   and w not in hashtags
                   and w not in links
                   and w not in mentions]
Top 30 Words
freq_dist = nltk.FreqDist(filtered_tokens)
freq_dist
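Printed on its own, freq_dist shows the whole distribution; with the notebook-era NLTK (2.x), the top 30 can be sliced off the sorted items (newer NLTK has most_common(30) for this), or plotted directly:

freq_dist.items()[:30]   # top 30 (word, count) pairs, sorted by frequency
freq_dist.plot(30)       # or as a frequency plot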
When did the community figure out who the murderer was?
The murderer was the neighbour, Ms Höllerer, a pharmacist (German: ‘Apothekerin’).
tweets[tweets.text.str.contains('Apothekerin')==True][['user','text','follower']].head(5)
Congrats @ClaudeeyaS, you were the first one on Twitter to get it!
Let's take a look at the percentage of tweets mentioning the names that point to the murderer.
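The corresponding plot did not survive the unformatted copy; computing such a per-minute share with pandas could look like this (shown for one name, the others work the same way):

# Fraction of tweets per minute that mention the pharmacist, in percent
is_mention = tweets.text.str.lower().str.contains('apothekerin')
(is_mention.resample('1min', how='mean') * 100).plot(figsize=(16, 4))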
The Tatort ended at 21:45. The ‘Apothekerin’ peaks after that come from reviews and, mostly, from bots: #Tatort was a trending topic, so bots picked up the hashtag while real people stopped writing about it.
tweets[tweets.text.str.contains('Apothekerin')==True]['201405042145':][['user','text','follower']].sort('follower', ascending=False).head(10)
Concordance
Shows each use of a word in its context.
Praktikant
What else did the community say about the young man?
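Note: rawtweettext is not defined in this excerpt; presumably it is an nltk.Text built from the raw, unfiltered tokens, something like:

rawtweettext = nltk.Text(tokens)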
rawtweettext.similar('praktikant')
(Justin) Bieber
rawtweettext.concordance("Bieber")
Collocations
In corpus linguistics, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance.
tweettext = nltk.Text(filtered_tokens)
tweettext.collocations()
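Text.collocations() is a convenience wrapper; roughly the same result can be produced explicitly with NLTK's collocation finders (a sketch; the frequency filter value is a guess):

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(filtered_tokens)
finder.apply_freq_filter(2)   # ignore pairs that occur only once
print finder.nbest(bigram_measures.likelihood_ratio, 10)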
Search for Words
fdist = nltk.FreqDist([w.lower() for w in tweettext])
modals = ['apothekerin', 'angst', 'leitmayr', 'nutte', 'messer', 'irre', 'professionelle', 'praktikant']
for m in modals:
    print m + ':', fdist[m],
Names in this Tatort
Dispersion Plot
Determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot.
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 2))
rawtweettext.dispersion_plot(["franz", u"mike", "justin", "johnny"])
Sentiment Analysis
We use SentiWS as the training set.
R. Remus, U. Quasthoff & G. Heyer: SentiWS - a Publicly Available German-language Resource for Sentiment Analysis.
In: Proceedings of the 7th International Language Resources and Evaluation (LREC'10), pp. 1168–1171, 2010
SentiWS is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
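The training_set used below (a list of (word, sentiment) pairs) is not built in this excerpt. A minimal sketch, assuming the standard SentiWS v1.8 text files with one entry per line in the format word|POS<tab>weight<tab>inflections (file names may differ):

import codecs

def read_sentiws(path, sentiment):
    # Take the base word before the '|' and tag it with the given sentiment
    pairs = []
    for line in codecs.open(path, encoding='utf-8'):
        word = line.split('|')[0].lower()
        pairs.append((word, sentiment))
    return pairs

training_set = (read_sentiws('SentiWS_v1.8c_Positive.txt', 'positive') +
                read_sentiws('SentiWS_v1.8c_Negative.txt', 'negative'))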
Train a Naive Bayes Classifier
Basically, this is supervised machine learning, and Jake Vanderplas gave a great talk about it: Machine Learning with Scikit-Learn – Jake Vanderplas on Vimeo
First, we need samples. Our samples are the 2000 most frequent words across all tweets (we take them from freq_dist so that every word appears only once):
samples = freq_dist.keys()[:2000]
samples[:10]
Second, we need features. A feature here works as follows:
- Every word from the collected tweets gets a feature with a True or False value, depending on whether it occurs in the tweet. So, by iterating over every tweet, every word in the sample set should get the feature True at least once.
- Because we use a training set with known sentiment values (supervised learning), these True or False features will later be mapped to positive or negative values.
That is the simplest form of sentiment analysis. It will not handle negations like “this was not a good movie”, because it just checks for “good” and “movie”.
The dictionary that is returned by this function is called a feature set and maps from features’ names to their values. Feature names are case-sensitive strings that typically provide a short human-readable description of the feature. Feature values are values with simple types, such as Booleans, numbers, and strings.
def tweet_features(tweet):
    # Map every sample word to True/False, depending on whether it occurs in the tweet
    features = {}
    for word in samples:
        features['contains(%s)' % word] = (word in tweet)
    return features
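A quick sanity check of the feature extractor (assuming ‘apothekerin’ and ‘praktikant’ made it into the samples, which the frequency list above suggests):

feats = tweet_features('die apothekerin war es!')
print feats['contains(apothekerin)']   # True
print feats['contains(praktikant)']    # False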
Create a Training Featureset
Now we take our SentiWS training set and treat each entry as if it were a tweet. So, if a word from the SentiWS training set appears in our samples list of words from all the tweets, we also have a sentiment (positive or negative) to classify it with. All of that is saved in trainingfeatureset.
trainingfeatureset = [(tweet_features(word), sentiment) for (word, sentiment) in training_set]
Build the Classifier
The classifier now learns which words are more likely to be tagged positive and which negative.
classifier = nltk.NaiveBayesClassifier.train(trainingfeatureset)
And here are some of those words:
classifier.show_most_informative_features(14)
These ratios are known as likelihood ratios, and can be useful for comparing different feature-outcome relationships.
Notice the last one shown: if a tweet contains RTL (a German TV channel), it is 4.5x more likely to be negative. 🙂
Example Automatic Sentiment Classification based on the SentiWS Training Set
Just the tweets from the 10 seconds after the end of the Tatort:
fr = '201405042145'
to = '20140504214510'
positivtweets = []
negativtweets = []
for tt in tweets[fr:to].text:
    ts = classifier.classify(tweet_features(tt))
    if ts == 'positive':
        positivtweets.append(tt)
    else:
        negativtweets.append(tt)
Positive
Negative
Not bad for such a simple classifier!
Now let’s do it for all collected Tweets
Define a function that returns the sentiment from our classifier
def classifytweet(row):
    # 'row' is one row of the tweets DataFrame (passed in via apply with axis=1)
    return classifier.classify(tweet_features(row.text))
Apply to all Tweets (takes a while!)
tweets['sentiment'] = tweets.apply(classifytweet, axis=1)
Now we can see how the mood of the crowd developed over the evening.
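The resulting chart is not reproduced in this copy; one way to plot it, using the sentiment column we just created:

# Percentage of positive tweets per 5-minute window
positive_share = (tweets.sentiment == 'positive').resample('5min', how='mean')
(positive_share * 100).plot(figsize=(16, 4))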
Thanks for watching. 🙂