In this example, we’ll connect to the Twitter Streaming API, gather tweets (based on a keyword), calculate the sentiment of each tweet, and build a real-time dashboard using the Elasticsearch DB and Kibana to visualize the results.
Follow the official Docker documentation to install both Docker and boot2docker. Then, with boot2docker up and running, run docker version to test the Docker installation. Create a directory to house your project, grab the Dockerfile from the repository, and build the image:
Once built, run the container:
Finally, run the next two commands in new terminal windows to map the IP address/port combo used by the boot2docker VM to your localhost:
Twitter Streaming API
In order to access the Twitter Streaming API, you need to register an application at http://apps.twitter.com. Once created, you should be redirected to your app’s page, where you can get the consumer key and consumer secret and create an access token under the “Keys and Access Tokens” tab. Add these to a new file called config.py:
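Here's a minimal sketch of what config.py might look like. The variable names are assumptions chosen for illustration, and the values are placeholders that you would swap for the credentials from your app's page:

```python
# config.py
# Placeholder credentials: replace each value with the one shown under
# "Keys and Access Tokens". The variable names here are assumptions.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```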
Since this file contains sensitive information, do not add it to your Git repository.
According to the Twitter Streaming documentation, “establishing a connection to the streaming APIs means making a very long lived HTTP request, and parsing the response incrementally. Conceptually, you can think of it as downloading an infinitely long file over HTTP.”
So, you make a request, filter it by a specific keyword, user, and/or geographic area and then leave the connection open, collecting as many tweets as possible.
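To make the "infinitely long file" idea concrete, here's a rough sketch of incremental parsing. It assumes the stream arrives as one JSON object per line, with blank lines acting as keep-alives. The function name and the sample data are hypothetical, and the keyword check is done client-side purely for illustration (with the real API, the keyword is sent along with the request):

```python
import json
from typing import Iterable, Iterator


def iter_matching_tweets(lines: Iterable[str], keyword: str) -> Iterator[dict]:
    """Yield decoded tweets whose text mentions `keyword` (case-insensitive)."""
    needle = keyword.lower()
    for line in lines:
        line = line.strip()
        if not line:
            # Blank keep-alive line: nothing to parse.
            continue
        tweet = json.loads(line)
        if needle in tweet.get("text", "").lower():
            yield tweet


# Hypothetical stand-in for the open HTTP connection.
sample_stream = [
    '{"text": "Congress votes today", "user": {"screen_name": "alice"}}',
    "",
    '{"text": "Nice weather", "user": {"screen_name": "bob"}}',
]
matches = list(iter_matching_tweets(sample_stream, "congress"))
```

Because the function is a generator, it handles each tweet as soon as its line arrives, so it never needs the stream to end.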
This sounds complicated, but Tweepy makes it easy.
Tweepy uses a “listener” to not only grab the streaming tweets, but filter them as well.
Save the following code as sentiment.py:
- Connect to the Twitter Streaming API;
- Filter the data by a keyword (“congress” in this example);
- Decode the results (the tweets);
- Calculate the sentiment via TextBlob;
- Determine if the overall sentiment is positive, negative, or neutral; and,
- Finally, add the relevant sentiment and tweet data to the Elasticsearch DB.
Follow the inline comments for further details.
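As a rough illustration of the decode and store steps, here's a minimal sketch that turns one raw tweet into the kind of document you might index. The helper name build_document is hypothetical. The author, message, and sentiment fields mirror the ones used in the searches later in this post, while date and polarity are assumed extras. The polarity and label are passed in so that the sketch runs without TextBlob installed:

```python
import json


def build_document(raw_tweet: str, polarity: float, sentiment: str) -> dict:
    """Decode a raw JSON tweet and keep only the fields we want to index."""
    tweet = json.loads(raw_tweet)
    return {
        "author": tweet["user"]["screen_name"],
        "date": tweet["created_at"],
        "message": tweet["text"],
        "polarity": polarity,    # assumed to come from TextBlob
        "sentiment": sentiment,  # "positive", "neutral", or "negative"
    }


# Hypothetical sample tweet, trimmed to the fields used above.
raw = json.dumps({
    "created_at": "Mon Jan 05 12:00:00 +0000 2015",
    "text": "Great session in congress today",
    "user": {"screen_name": "alice"},
})
doc = build_document(raw, polarity=0.8, sentiment="positive")
```

In the full script, the resulting dictionary would be what gets sent to the "sentiment" index in Elasticsearch.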
TextBlob sentiment basics
To calculate the overall sentiment, we look at the polarity score:
- Positive: from 0.01 to 1
- Neutral: 0
- Negative: from -0.01 to -1
Refer to the official documentation for more information on how TextBlob calculates sentiment.
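The mapping from polarity to label can be sketched as a small function. This assumes any score above zero counts as positive and any score below zero as negative. The function name is hypothetical, and in practice the score would come from TextBlob's sentiment.polarity attribute:

```python
def classify_polarity(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


labels = [classify_polarity(score) for score in (0.5, 0.0, -0.25)]
```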
Over a two-hour period, as I wrote this blog post, I pulled over 9,500 tweets with the keyword “congress”. At this point, go ahead and perform a search of your own on a subject of interest to you. Once you have a sizable number of tweets, stop the script. Now you can perform some quick searches/analysis…
Using the index ("sentiment") from the sentiment.py script, you can use the Elasticsearch search API to gather some basic insights.
- Full text search for “obama”: http://localhost:9200/sentiment/_search?q=obama
- Author/Twitter username search: http://localhost:9200/sentiment/_search?q=author:allvoices
- Sentiment search: http://localhost:9200/sentiment/_search?q=sentiment:positive
- Sentiment and “obama” search: http://localhost:9200/sentiment/_search?q=sentiment:positive%20AND%20message:obama
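If you'd rather build these URLs from a script than type them by hand, here's a small sketch using only the standard library. The helper name search_url is hypothetical, and the base URL assumes Elasticsearch is reachable on localhost:9200, as in the examples above. Combining two field filters with AND relies on Elasticsearch's query-string syntax:

```python
from urllib.parse import urlencode

# Assumed base URL, matching the examples above.
BASE_URL = "http://localhost:9200/sentiment/_search"


def search_url(query: str, base: str = BASE_URL) -> str:
    """Return a URI-search URL with the query safely percent-encoded."""
    return f"{base}?{urlencode({'q': query})}"


full_text = search_url("obama")
by_author = search_url("author:allvoices")
positive_obama = search_url("sentiment:positive AND message:obama")
```

Encoding the query this way means that spaces and colons can't break the URL, which matters once queries get longer than a single term.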
There’s much, much more you can do with Elasticsearch besides just searching and filtering results. Check out the Analyze API as well as the Elasticsearch – The Definitive Guide for more ideas on how to analyze and model your data.
The pie chart at the top of this post came directly from Kibana. It shows the proportion of each sentiment (positive, neutral, and negative) across all of the tweets I pulled. Here are a few more graphs from Kibana…
All tweets filtered by the word “obama”
Top twitter users by tweet count
Notice how the top author has 76 tweets. That’s definitely worthy of a deeper look, since that’s a lot of tweets in a two-hour period. As it turns out, that author basically tweeted the same tweet 76 times, so you would want to filter out 75 of these, since they currently skew the overall results.
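One way to filter out such repeats is to keep only the first occurrence of each author and message pair. This is a minimal sketch that assumes each document is a dictionary with author and message fields. The function name and sample data are hypothetical:

```python
def drop_repeated_tweets(docs: list) -> list:
    """Keep the first occurrence of each (author, message) pair, in order."""
    seen = set()
    unique = []
    for doc in docs:
        key = (doc["author"], doc["message"])
        if key in seen:
            continue  # same author already posted this exact text
        seen.add(key)
        unique.append(doc)
    return unique


# Hypothetical sample: one author repeating the same tweet three times.
docs = [{"author": "spammer", "message": "Call congress now!"}] * 3
docs.append({"author": "alice", "message": "Call congress now!"})
deduped = drop_repeated_tweets(docs)
```

Note that this only catches exact duplicates. Tweets that differ by a link or a hashtag would still slip through.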
Aside from these charts, it’s worth visualizing sentiment by location. Try this on your own. You’ll have to alter the data you are grabbing from each tweet. You may also want to try visualizing the data with a histogram.
- Grab the code from the repository.
- Leave comments/questions below.