Great article from MIT
Scaling With Limited Resources
Thanks to AWS, companies these days are not as hard to scale as they used to be. However, college dorm-room projects still are. For a company, paying for servers is one of the less important things to worry about. That’s not the case for us, though, since we’ve been short on money and pretty greedy with data.
Here are some rough numbers about the volume of the data that we analyzed and got our results from.
User data: ~ 70 GB
Tweets: > 1 TB and growing
Analysis results: ~ 300 GB
> 10 billion rows in our databases.
Given that we use almost all of this data to run experiments every day, there was no way we could afford to put it up on Amazon on a student budget. So we had to take a trip down to the closest hardware store and put together two desktops. That’s what we’re still using now.
We did lots of research about choosing the right database. Since we are only working on two nodes and are mainly bottlenecked by insertion speeds, we decided to go with good old MySQL. All other solutions were too slow on a couple nodes, or were too restrictive for running diverse queries. We wanted flexibility so we could experiment more easily.
Dealing with the I/O limitations on a single node is not easy. We had to use all kinds of tricks to get around our limitations: SSD caching, bulk insertions, MySQL partitioning, dumping to files, extensive use of bitmaps, and the list goes on.
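To give a feel for one of these tricks, here is a minimal Python sketch of bulk insertion. It is an illustration rather than the code behind this post: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the table name word_counts, its columns, the batch size, and the sample rows are all assumptions made up for the example.

```python
import sqlite3

def bulk_insert(conn, rows, batch_size=1000):
    """Insert (hour, word, n) rows in batches, committing once per batch."""
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # executemany reuses one prepared statement for the whole batch.
        cur.executemany(
            "INSERT INTO word_counts (hour, word, n) VALUES (?, ?, ?)", batch
        )
        conn.commit()  # one commit per batch instead of one per row
    return len(rows)

# Hypothetical in-memory table and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_counts (hour TEXT, word TEXT, n INTEGER)")
rows = [("2012-01-24 21", "obama", i) for i in range(2500)]
inserted = bulk_insert(conn, rows)
total = conn.execute("SELECT COUNT(*) FROM word_counts").fetchone()[0]
```

The point of the pattern is that the per-statement and per-commit overhead is paid once per batch rather than once per row, which usually matters most when insertion speed is the bottleneck.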
If you have advice or questions regarding all that fun stuff, we’re all ears. : )
Now on to the more interesting stuff, our results.
Word-based Signal Detection
Our first step towards signal and event detection was counting single words. We counted the number of occurrences of every word in our tweets during every hour. Sort of like Google’s Ngrams, but for tweets.
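A minimal sketch of this kind of counting, in Python. It is an illustration, not the pipeline behind the results here: the tokenizer, the function name hourly_word_counts, and the sample tweets are assumptions made up for the example.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed tokenizer: lowercase runs of letters, digits, #, @ and apostrophes.
WORD_RE = re.compile(r"[a-z0-9#@']+")

def hourly_word_counts(tweets):
    """Count occurrences of every word, keyed by (hour, word)."""
    counts = Counter()
    for created_at, text in tweets:
        # Truncate the timestamp to the start of its hour.
        hour = created_at.replace(minute=0, second=0, microsecond=0)
        for word in WORD_RE.findall(text.lower()):
            counts[(hour, word)] += 1
    return counts

# Made-up sample tweets, for illustration only.
tweets = [
    (datetime(2012, 1, 24, 21, 5), "Great speech Obama #SOTU"),
    (datetime(2012, 1, 24, 21, 40), "obama obama everywhere"),
    (datetime(2012, 1, 24, 22, 1), "time for bed"),
]
counts = hourly_word_counts(tweets)
obama_at_9pm = counts[(datetime(2012, 1, 24, 21), "obama")]
```

Plotting the count for one word across consecutive hours gives the kind of curve discussed in the rest of this post.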
Make sure you play around with some of our experimental results here. The times are EST/EDT. *
Here is some cool stuff we found by looking at the counts. If you find anything interesting by looking at different words, please share it with us!
Daily and Weekly Fluctuations
If you look up a very common word like ‘a’ to get an estimate of the total volume of tweets, you clearly see a daily and weekly pattern. Tweeting peaks at around 7 pm PST (10 pm EST) and hits a low at around 3 am PST every day. There’s also generally less tweeting on Fridays and Saturdays, probably because people have better things to do with their lives than tweet!
A side note:
We’ve tried to focus on English-speaking tweeters within the States. Note that the percentage of tweets containing ‘a’ also fluctuates during the day, which is surprising at first. But this is because non-English tweets that we have discarded are much more frequent during the night in our time zone, and they contain the word ‘a’ less often than English tweets do.
I’m a night owl myself, and I had always been curious to know exactly what time the average person goes to sleep, or at least thinks about it! I looked for the words “sleep”, “sleeping”, and “bed”. You can do this yourself, but the only problem you’ll see is that not all the tweets come from the same time zone. To solve this issue, we isolated several million tweets from users who had set their location to Boston or Cambridge, MA. Then, we created a histogram of their average sleeping hours. Here’s the result:
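For readers who want to try something similar, here is a rough Python sketch of building such a histogram. It is not the code used for the figure: it assumes timestamps arrive in UTC, applies a fixed EST offset of UTC-5 without handling daylight saving time, matches whole words only, and uses made-up sample tweets.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5))  # assumption: fixed offset, no DST
SLEEP_WORDS = {"sleep", "sleeping", "bed"}

def sleep_hour_histogram(tweets):
    """Count sleep-related tweets by the local hour at which they were sent."""
    hist = Counter()
    for created_utc, text in tweets:
        words = set(text.lower().split())
        if words & SLEEP_WORDS:
            local = created_utc.astimezone(EST)
            hist[local.hour] += 1
    return hist

# Made-up sample tweets with UTC timestamps, for illustration only.
tweets = [
    (datetime(2012, 2, 1, 5, 10, tzinfo=timezone.utc), "going to bed"),
    (datetime(2012, 2, 1, 4, 55, tzinfo=timezone.utc), "cannot sleep"),
    (datetime(2012, 2, 1, 5, 30, tzinfo=timezone.utc), "time to sleep now"),
    (datetime(2012, 2, 1, 17, 0, tzinfo=timezone.utc), "lunch was great"),
]
hist = sleep_hour_histogram(tweets)
peak_hour = hist.most_common(1)[0][0]
```

Restricting the input to users from a single city is what makes a single offset reasonable; with mixed time zones the local hours would blur together.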
It seems like the average Bostonian goes to sleep at around midnight! Of course, that’s probably not the average everywhere. After all, a fourth of our city are nocturnal college students!
You can look at all kinds of words relating to recurring events like ‘lunch’, ‘class’, ‘work’, ‘hungry’ and whatever you can imagine. I promise you, you’ll be fascinated.
Here are some suggestions:
Coke, Valentine, Hugo and other oscars-related words, IPO.
(please suggest other interesting things I should add to this list)
I’m Obsessed With Linguistics
As we were looking at different words, we noticed that the words Monday, Tuesday, etc. show very interesting weekly patterns. They reflect a signal that has its peak on the respective day, as you’d expect, and which rises as you get closer to that day. This means that people have more anticipation for days that are closer, more or less linearly. But if you pay closer attention, you’ll see that the day immediately before the day you searched for corresponds to a clear valley in the curve. This points to a very interesting linguistic phenomenon: in English, we never refer to the next day by the name of that weekday, and instead use the word ‘tomorrow’.
On a Wednesday, people don’t say ‘Thursday’
We tried to find events that we thought would have a strong reflection in the Twitter sphere. ‘superbowl’, ‘sopa’, and ‘goldman’ were pretty interesting. Here are the graphs for those three, which you can also recreate yourself.
Tweets about ‘Superbowl’ during each hour
Tweets about ‘Sopa’ during each hour
Tweets about ‘Goldman Sachs’ during each hour
We’ll post more about our attempt to dissect exactly what happened on Twitter during these events as time progressed. In the Goldman Sachs case, for example, the peak happens on the day of the controversial public exit of a GS employee, which was reported in the NYTimes. The earliest news release time was 7 am GMT, which is when the first signs of a rise appear in our signal.
Politics and Public Sentiment
If you query the word Obama, this is what you’ll see:
When we first saw this spike, we were very suspicious. The spike seemed way too prominent (~25 times the average signal amplitude) to be associated with an event. But guess what? The spike was at 9 PM on Jan 24th, when the State of the Union speech happened!
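One simple way to flag a spike like this is to compare each hour against the average of the whole series. The Python sketch below is a hedged illustration of that idea, not the method used for the graph: the function name find_spikes, the threshold factor of 10, and the hourly counts are all made up for the example.

```python
from statistics import mean

def find_spikes(hourly_counts, factor=10.0):
    """Return (index, ratio) for hours whose count exceeds factor x the mean.

    The mean includes the spike itself, so the ratio slightly understates
    how far the spike stands above the usual level.
    """
    avg = mean(hourly_counts)
    if avg == 0:
        return []
    return [
        (i, count / avg)
        for i, count in enumerate(hourly_counts)
        if count > factor * avg
    ]

# Made-up series: 99 ordinary hours followed by one very busy hour.
counts = [100] * 99 + [5000]
spikes = find_spikes(counts)
```

A median or a rolling baseline would be less sensitive to the spike itself, but the plain average is the closest match to the "times the average signal amplitude" comparison above.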
We were curious to see some of the 250 K sample tweets containing ‘obama’ from that hour. Here are a few of them along with some self-declared descriptions from users:
ok. the obama / giffords embrace made me choke up a little.
A teacher from Killeen, Texas. ~200 followers.
I love that Obama starts out with a tribute to our military. #SOTU
A liker of progressive politics from Utah. ~100 followers.
Great Speech Obama #SOTU
A CEO from NYC. ~700 followers.
There were both positive and negative tweets. But we wanted to know whether the tweets were mostly positive or mostly negative overall, because that’s what really matters in a context like this. Here are some results and an explanation of how our sentiment analysis works in general.
Our sentiment analysis is done by training our model on several thousand manually classified sample tweets. According to our tests, it reaches very good prediction accuracy, as I’ll explain below.
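The model itself is not described in this post, so the Python sketch below should be read purely as an illustration of training a classifier on hand-labeled tweets. It swaps in a small multinomial naive Bayes classifier with Laplace smoothing as a stand-in; the class name NaiveBayesSentiment, the whitespace tokenizer, and the four training examples are assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentiment:
    """Tiny multinomial naive Bayes classifier (a stand-in, not the real model)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of tweets
        self.vocab = set()

    def train(self, labeled_tweets):
        for text, label in labeled_tweets:
            self.label_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Laplace smoothing so unseen words do not zero out a class.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up, hand-labeled training examples, for illustration only.
training = [
    ("great speech tonight", "pos"),
    ("i love this tribute", "pos"),
    ("what a terrible plan", "neg"),
    ("i hate this awful speech", "neg"),
]
model = NaiveBayesSentiment()
model.train(training)
first = model.predict("love the great tribute")
second = model.predict("awful terrible plan")
```

With only four training tweets this is a toy; several thousand labeled tweets, as mentioned above, give a classifier of this kind far more vocabulary to work with.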
The graph below shows the normalized sentiments of several different sets of tweets for each day during a 3 month period.
The two graphs here are sentiments of millions of independent, randomly chosen tweets from our set. The fact that they follow each other so closely is the important achievement of our system. It means that the signal-to-noise ratio is so high that the sentiment is clearly measurable.
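One way to put a number on "follow each other so closely" is the Pearson correlation between the daily averages of the two samples. The Python sketch below illustrates that check on synthetic data only: the weekend lift of 0.2, the noise level, the sample sizes, and the random seed are all invented for the example and are not measurements from our tweets.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def daily_means(rng, days=28, tweets_per_day=2000):
    """Average synthetic sentiment per day: weekly signal plus per-tweet noise."""
    means = []
    for day in range(days):
        signal = 0.2 if day % 7 in (5, 6) else 0.0  # made-up weekend lift
        scores = [signal + rng.gauss(0, 0.5) for _ in range(tweets_per_day)]
        means.append(sum(scores) / len(scores))
    return means

rng = random.Random(42)
sample_a = daily_means(rng)  # first independent sample
sample_b = daily_means(rng)  # second independent sample
r = pearson(sample_a, sample_b)
```

Because each daily average is taken over many tweets, the per-tweet noise mostly cancels out, and two independent samples end up tracking the same underlying signal.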
You can also observe a weekly periodicity in the general sentiment. Interestingly, it shows that people have happier and more positive tweets during the weekends compared to the rest of the week! In addition, the signal acts sort of unusually around January 1st and February 14th!
The graphs below, on the other hand, show the sentiments of tweets that mention some combination of keywords related to ‘economy’ or ‘energy’. As you can see, the patterns in the graphs are fairly stable other than on a single day, January 24th, where the sentiment drops significantly. That’s when Obama’s State of the Union speech was, and it looks like his speech triggered a lot of negative tweets related to energy and the economy.
We were curious to see whether the sentiment was only negative for this bag of tweets (those that contain energy, economy), or if tweets about Obama during the State of the Union speech were negative in general. Here are our results of running sentiment analysis on all the data containing the word ‘obama’ in the past 4 months.
Here, the blue curve is the average sentiment of tweets for each day and the red curve shows the amount of variance in the sentiment. (If it’s high, there were more happy and sad tweets; when it’s low, tweets were more neutral.)
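The two curves correspond to two per-day statistics, which can be sketched in Python as follows. This is an illustration only: it assumes each tweet already has a numeric sentiment score, uses the population variance as the spread measure, and runs on made-up scores.

```python
from collections import defaultdict
from statistics import mean, pvariance

def daily_sentiment_stats(scored_tweets):
    """Map each day to (average sentiment, variance of sentiment)."""
    by_day = defaultdict(list)
    for day, score in scored_tweets:
        by_day[day].append(score)
    return {
        day: (mean(scores), pvariance(scores))
        for day, scores in sorted(by_day.items())
    }

# Made-up scores in [-1, 1]: a quiet day, then a polarized day.
scored = [
    ("2012-01-23", 0.1), ("2012-01-23", 0.0), ("2012-01-23", -0.1),
    ("2012-01-24", 1.0), ("2012-01-24", -1.0), ("2012-01-24", -1.0),
    ("2012-01-24", 0.8), ("2012-01-24", -0.9),
]
stats = daily_sentiment_stats(scored)
quiet_mean, quiet_var = stats["2012-01-23"]
heated_mean, heated_var = stats["2012-01-24"]
```

A day with strongly positive and strongly negative tweets has a high variance even if its average is close to neutral, which is why the two curves are worth reading together.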
The graph clearly shows that there was some heated debate about Obama on exactly January 24th. On the same day, the sentiment dropped, so we can tell that there was more negative tweeting than positive. It looks like people weren’t too happy during the #sotu speech.
As we looked at the graph, it was also hard to ignore the other peak in variance (polarity) that appears around the 15th to 17th of January. It seems like sentiment about Obama fell and rose rapidly within only a few days in that time span. This is likely a result of tweets about SOPA/PIPA and Obama’s disagreement with the bill, which happened during those days.
The Growth of Twitter
Twitter has had periods of slow and rapid growth since its inception on March 21st of 2006 up until now. We tried to capture the growth of Twitter, from its very first user until earlier this year. Here is the result showing the number of people joining Twitter during every hour since day one:
As you can see, abnormally large numbers of people joined during specific hours throughout Twitter’s lifetime. One is apparently in April 2009, when they released their search feature. Another is probably when they rolled out the then New Twitter for everyone.
As we started looking at the data for the first time, we were absolutely blown away by all the cool insights you can extract from it. It makes you wonder why people aren’t doing these sorts of things more often with all this cheap data and computing power that they have these days.
This was our first blog post and we hope you liked what you saw. You’re awesome if you’re still reading. Stay tuned for more and please give us any feedback you may have!
*The count results aren’t perfectly accurate. There is a general upward trend because Twitter deletes history before 3200 tweets. There is also a discontinuity around Feb 11th, which is because of a temporary glitch we had.