Storage market

For a long time, the storage market has followed a trend of ever cheaper capacity. The price per GB kept falling until the industry crisis of 2011–2012 (the flooding in Thailand). Even if the overall trend remains, it seems that something has changed.

1) The market is no longer seeking cheaper capacity but faster storage. SSD technology will change the market much as multi-core designs changed the CPU market, when power and heat issues forced chip makers in the early 2000s to move from single to multiple cores so that CPU design could keep pace with Moore’s Law.

2) Even the demand for storage capacity has eased for personal files. People no longer need to download and store their movies and music locally. Instead, they can subscribe to legal offerings such as Netflix or Spotify, or use services like Google Drive or Dropbox. Not only cheaper but also simpler to use, these services have changed the market for times to come.


1956, IBM’s hard drive … 10 MB



The recent trend illustrated

[Charts: cost per GB over time, and historical vs. actual prices]



Processing data with Drake

Introducing ‘Drake’, a “Make for Data”

We call this tool Drake, and today we are excited to share Drake with the world, as an open source project. It is written in Clojure.

Drake is a text-based command line data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs.  It automatically resolves dependencies and provides a rich set of options for controlling the workflow. It supports multiple inputs and outputs and has HDFS support built-in.

We use Drake at Factual on various internal projects. It serves as a primary way to define, run, and manage data workflows. Some core benefits we’ve seen:
    • Non-programmers can run Drake and fully manage a workflow
    • Encourages repeatability of the overall data building process
    • Encourages consistent organization (e.g., where supporting scripts live, and how they’re run)
    • Precise control over steps (for more effective testing, debugging, etc.)
    • Unifies different tools in a single workflow (shell commands, Ruby, Python, Clojure, pushing data to production, etc.)
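As a concrete sketch, a two-step workflow might look like this (step syntax approximated from Drake’s documentation; the file names and commands are hypothetical):

```
; Each step declares outputs <- inputs, followed by indented
; shell commands; Drake works out the run order from the dependencies.
sorted.csv <- raw.csv
  sort $INPUT > $OUTPUT

counts.csv <- sorted.csv
  uniq -c $INPUT > $OUTPUT
```

Running drake then executes only the steps whose outputs are missing or out of date, much as make does for compiled code.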

Drake official blog:


Guardian dataset from their Datablog

All datasets from the Guardian, through their Datablog. Thanks to ScraperWiki, they are all available via Google Spreadsheets:

Available here:

Built from original data:

Majestic Million, which helped build LinkDAQ ( ), is now being published as free data under a Creative Commons license.

The Majestic Million database is a list of the top 1 million websites in the world, ordered by the number of referring subnets. A subnet is a slightly complex notion, but to a layman it is basically anything within an IP range, ignoring the last three digits of the IP number.
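To make that counting concrete, here is a small Python sketch (the IP addresses are made up) that groups referring IPs by their /24 subnet, i.e. ignoring the last octet:

```python
from ipaddress import ip_network

# Hypothetical sample: IP addresses of pages linking to one site.
referring_ips = ["203.0.113.7", "203.0.113.200", "198.51.100.14"]

# "Ignoring the last three digits" of an IP amounts to grouping
# addresses by their /24 network (the first three octets).
subnets = {ip_network(ip + "/24", strict=False) for ip in referring_ips}

print(len(subnets))  # 2 distinct referring subnets
```

The first two addresses share the 203.0.113.0/24 subnet, so they count once; many links from the same subnet are a much weaker signal than links from many subnets.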


Regular Updates

We have set up a CRON job so that every day we recompile the data, which is based on our Fresh Index. It is POSSIBLE that on two consecutive days the data is the same, if the Fresh Index did not update for some reason, but usually the data will change daily. Please do not try to download the data more than once every 24 hours; otherwise you will simply end up getting banned, or we will have to reconsider giving the data away or put it behind a walled garden.

Download Location

Be wary, scrapers… this will do your head in if you weren’t expecting it… The Majestic Million CSV can regularly be downloaded here:


LinkDAQ is a (hopefully) fun trading game based on the top 50,000 websites in the world.

Play on


MIT: Looking At The World Through Twitter Data

Great article from MIT

My friend Kang, also an MIT CS undergrad, and I started playing with some data from Twitter a little while ago. I wrote this post to summarize some of the challenges we faced, some things we learned along the way, and some of our results so far. I hope they’ll show you, as they did us, how valuable social data can be.

Scaling With Limited Resources

Thanks to AWS, companies these days are not as hard to scale as they used to be. However, college dorm-room projects still are. One of the less important things a company has to worry about is paying for its servers. That’s not the case for us, though, since we’ve been short on money and pretty greedy with data.

Here are some rough numbers about the volume of the data that we analyzed and got our results from.

User data: ~ 70 GB
Tweets: > 1 TB and growing
Analysis results: ~ 300 GB

> 10 billion rows in our databases.

Given the fact that we use almost all of this data to run experiments every day, there was no way we could possibly afford to put it up on Amazon on a student budget. So we had to take a trip down to the closest hardware store and put together two desktops. That’s what we’re still using now.

We did lots of research about choosing the right database. Since we are only working on two nodes and are mainly bottlenecked by insertion speeds, we decided to go with good old MySQL. All other solutions were too slow on a couple nodes, or were too restrictive for running diverse queries. We wanted flexibility so we could experiment more easily.

Dealing with the I/O limitations on a single node is not easy. We had to use all kinds of different tricks to get around our limitations: SSD caching, bulk insertions, MySQL partitioning, dumping to files, extensive use of bitmaps, and the list goes on.
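As an illustration of the bulk-insertion trick (not their actual code; SQLite stands in for MySQL here, and the table layout is invented):

```python
import sqlite3

# In-memory database as a stand-in for a MySQL node.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER, body TEXT)")

rows = [(i, "tweet %d" % i) for i in range(10_000)]

# One executemany per batch instead of one INSERT per row:
# far fewer round trips and commits, so a much higher insert rate
# when you are bottlenecked on insertion speed.
conn.executemany("INSERT INTO tweets VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
print(count)  # 10000
```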

If you have advice or questions regarding all that fun stuff, we’re all ears. : )

Now on to the more interesting stuff, our results.

Word-based Signal Detection

Our first step towards signal and event detection was counting single words. We counted the number of occurrences of every word in our tweets during every hour. Sort of like Google’s Ngrams, but for tweets.
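A toy version of that hourly word count might look like this (the tweets are invented):

```python
from collections import Counter
from datetime import datetime

# Invented mini-stream of (timestamp, text) tweets.
tweets = [
    ("2012-01-24 21:05", "great speech obama"),
    ("2012-01-24 21:40", "obama on energy"),
    ("2012-01-24 22:10", "going to sleep"),
]

# Count occurrences of every word during every hour.
counts = Counter()
for ts, text in tweets:
    hour = datetime.strptime(ts, "%Y-%m-%d %H:%M").strftime("%Y-%m-%d %H")
    for word in text.split():
        counts[(hour, word)] += 1

print(counts[("2012-01-24 21", "obama")])  # 2
```

Each (hour, word) cell then becomes one point on the time-series graphs shown below.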

Make sure you play around with some of our experimental results here. The times are EST/EDT. *

Here are some cool things we found from looking at the counts. If you also find anything interesting from looking at different words, please share it with us!

Daily and Weekly Fluctuations

If you look for a very common word like ‘a’ to see an estimate of the total volume of tweets, you clearly see a daily and weekly pattern. Tweeting peaks at around 7 pm PST (10 pm EST) and hits a low at around 3 am PST every day. There’s also generally less tweeting during Fridays and Saturdays, probably because people have better things to do with their lives than to tweet!


A side note:

We’ve tried to focus on English-speaking tweeters within the States. Note that the percentage of tweets containing ‘a’ also fluctuates during the day, which is surprising at first. But this is because the non-English tweets that we have discarded are much more frequent during the night in our time zone, and they don’t contain the word ‘a’ as often as English tweets do.


I’m a night owl myself and I had always been curious to know at exactly what time the average person goes to sleep or at least thinks about it! I looked for the words “sleep”, “sleeping”, and “bed”. You can do this yourself, but the only problem you’ll see is that not all the tweets have the same time zones. To solve this issue, we isolated several million tweets which had users who had set their locations to Boston or Cambridge, MA. Then, we created a histogram of their average sleeping hours. Here’s the result:


It seems like the average Bostonian goes to sleep at around midnight! Of course, that’s probably not the average everywhere. After all, a fourth of our city is nocturnal college students!

You can look at all kinds of words relating to recurring events like ‘lunch’, ‘class’, ‘work’, ‘hungry’ and whatever you can imagine. I promise you, you’ll be fascinated.

Here are some suggestions:
Coke, Valentine, Hugo and other Oscars-related words, IPO.
(please suggest other interesting things I should add to this list)

I’m Obsessed With Linguistics

As we were looking at different words, we noticed that the words Monday, Tuesday, etc. show very interesting weekly patterns. They reflect a signal that peaks on the respective day, as you’d expect, and which rises as you get closer to that day. This means that people have more anticipation for days that are closer, more or less linearly. But if you pay closer attention, you’ll see that the day immediately before the search term corresponds to a clear valley in the curve. This points to a very interesting linguistic phenomenon: in English, we never refer to the next day by the name of that weekday, and instead use the word ‘tomorrow’.


On a Wednesday, people don’t say ‘Thursday’


We tried to find events that we thought would have a strong reflection in the Twitter sphere. ‘superbowl’, ‘sopa’, and ‘goldman’ were pretty interesting. Here are the graphs for those three, which you can also recreate yourself.

Tweets about ‘Superbowl’ during each hour

Tweets about ‘Sopa’ during each hour

Tweets about ‘Goldman Sachs’ during each hour

We’ll post more about our attempts to dissect exactly what happened on Twitter as these events progressed. In the Goldman Sachs case, for example, the peak happens on the day of the controversial public exit of a GS employee, which was covered in the NYTimes. The earliest news release, at 7 am GMT, coincides with the first signs of a rise in our signal.

Politics and Public Sentiment

If you query the word Obama, this is what you’ll see:


When we first saw this spike, we were very suspicious. The spike seemed way too prominent to be associated with an event (about 25 times the average signal amplitude). But guess what: the spike was at 9 PM on Jan 24th, when the State of the Union speech happened!

We were curious to see some of the 250K sample tweets containing ‘obama’ from that hour. Here are a few of them, along with some self-declared descriptions from the users:

ok. the obama / giffords embrace made me choke up a little.
A teacher from Killeen, Texas. ~200 followers.

I love that Obama starts out with a tribute to our military. #SOTU
A liker of progressive politics from Utah. ~100 followers.

Great Speech Obama #SOTU
A CEO from NYC. ~700 followers.

There were both positive and negative tweets. But we wanted to know whether the tweets were positive or negative because that’s what really matters in a context like this. Here are some results and an explanation of how our sentiment analysis works in general.

Sentiment Analysis

Our sentiment analysis is done by training a model on several thousand manually classified sample tweets. It reaches very good prediction accuracy according to our tests, as I’ll explain below.
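The post doesn’t show the model itself, so here is a minimal Naive Bayes sketch of that general approach, trained on a handful of invented hand-labeled tweets (the real model used several thousand):

```python
from collections import Counter
import math

# Tiny hand-labeled training set (invented for illustration).
train = [
    ("great speech tonight", "pos"),
    ("i love this", "pos"),
    ("what a terrible day", "neg"),
    ("this is awful", "neg"),
]

# Per-class word frequencies.
word_counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    word_counts[label].update(text.split())

vocab = set(word_counts["pos"]) | set(word_counts["neg"])

def classify(text):
    # Naive Bayes with add-one smoothing, computed in log space.
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        scores[label] = sum(
            math.log((counts[w] + 1) / (total + len(vocab)))
            for w in text.split()
        )
    return max(scores, key=scores.get)

print(classify("great day"))       # pos
print(classify("terrible awful"))  # neg
```

A real classifier would add tokenization, class priors, and handling for hashtags and emoticons, but the scoring step is the same idea.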

The graph below shows the normalized sentiments of several different sets of tweets for each day during a 3 month period.

The two graphs here are sentiments of millions of independent randomly chosen tweets from our set. The fact that they follow each other so closely is the important achievement of our system. It means that the signal to noise ratio is so high that the sentiment is clearly measurable.

You can also observe a weekly periodicity in the general sentiment. Interestingly, it shows that people have happier and more positive tweets during the weekends compared to the rest of the week! In addition, the signal acts sort of unusually around January 1st and February 14th!

The graphs below, on the other hand, are indicative of the sentiments of tweets in which some combination of keywords related to ‘economy’ or ‘energy’ were talked about. As you can see, the patterns in the graphs are fairly stable other than on a single day, January 24th, when the sentiment significantly drops. That’s when Obama’s State of the Union speech was, and it looks like his speech triggered a lot of negative tweets related to energy and the economy.

We were curious to see whether the sentiment was only negative for this bag of tweets (those that contain energy, economy), or if tweets about obama during the state of the union speech were negative in general. Here are our results of running sentiment analysis on all the data containing the word ‘obama’ in the past 4 months.

Here, the blue curve is the average sentiment of tweets for each day and the red curve shows the amount of variance in the sentiment. (When it’s high, there were both more happy and more sad tweets; when it’s low, tweets were more neutral.)
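The mean/variance distinction can be sketched in a few lines (the per-tweet scores are invented, on a -1 to +1 scale):

```python
from statistics import mean, pvariance

# Invented per-tweet sentiment scores for two hypothetical days.
polarized_day = [-0.9, 0.9, -0.8, 0.8, 0.0]  # heated debate
neutral_day = [0.1, -0.1, 0.05, -0.05, 0.0]  # quiet day

# Both days average out to roughly zero sentiment, but the variance
# separates a polarized day from a genuinely neutral one.
print(mean(polarized_day), pvariance(polarized_day))
print(mean(neutral_day), pvariance(neutral_day))
```

That is why the red (variance) curve can spike, flagging a heated debate, even when the blue (mean) curve barely moves.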

The graph clearly shows that there was some heated debate about Obama on exactly January 24th. On the same day, the sentiment dropped, so we can tell there was more negative tweeting than positive. It looks like people weren’t too happy during the #sotu speech.

As we looked at the graph, it was also hard to ignore the other peak in variance (polarity) that appears around the 15th to 17th of January. It seems that sentiment about Obama fell and rose rapidly within only a few days in that span. This is likely a result of tweets about SOPA/PIPA and Obama’s disagreement with the bill, which happened during those days.

The Growth of Twitter

Twitter has had periods of slow and rapid growth since its inception on March 21st of 2006 up until now. We tried to capture the growth of Twitter, since its very first user until earlier this year. Here is the result showing the number of people joining Twitter during every hour since day one:

As you can see, there are some abnormally large numbers of people joining during specific hours throughout Twitter’s lifetime. One is apparently in April 2009, when they released their search feature. Another is probably when they rolled out the then-new Twitter for everyone.

Summing up…

As we started looking at the data for the first time, we were absolutely blown away by all the cool insights you can extract from it. It makes you wonder why people aren’t doing these sorts of things more often with all this cheap data and computing power that they have these days.

This was our first blog post and we hope you liked what you saw. You’re awesome if you’re still reading. Stay tuned for more and please give us any feedback you may have!

*The count results aren’t perfectly accurate. There is a general upward trend because Twitter deletes history beyond 3,200 tweets. There is also a discontinuity around Feb 11th, caused by a temporary glitch we had.

About the genesis of JSON

Video from IEEE Computing Conversations

Interview with Douglas Crockford about the development of JavaScript Object Notation (JSON)


  • Crockford is likeably humble about the origins of JSON. Rather than claiming he invented JSON, he instead says he discovered it:
“I don’t claim to have invented it, because it already existed in nature. I just saw it, recognized the value of it, gave it a name, and a description, and showed its benefits. I don’t claim to be the only person to have discovered it.”


  • Crockford tried very hard to strip unnecessary stuff from JSON so it stood a better chance of being language independent. When confronted with pushback about JSON not being a “standard”, Crockford registered a domain, put up a specification that documented the data format, and declared it a standard.


  • Crockford wanted something that made his life easier. He needed JSON when building an application where a client written in JavaScript had to communicate with a server written in Java. He wanted a data serialization that matched the data structures available in both programming language environments.
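A quick illustration of that language-neutral serialization, with Python standing in for either end (the field names are invented):

```python
import json

# The kind of payload JSON was designed for: a Java server can emit
# this text and a JavaScript client can parse it, because both sides
# have objects/maps, arrays/lists, strings, numbers, and booleans.
payload = {"user": "doug", "tags": ["json", "javascript"], "active": True}

text = json.dumps(payload)
print(text)

# Any conforming parser recovers the same structures on the other side.
assert json.loads(text) == payload
```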


Harvard releases metadata for 12 million library books into the public domain

Harvard University has today put into the public domain (CC0) full bibliographic information about virtually all 12 million works in its 73 libraries. The metadata, in the standard MARC21 format, is available for bulk download from Harvard. The University has also provided the data to the Digital Public Library of America’s prototype platform for programmatic access via an API.

Official press release

Harvard’s Open Metadata policy

API details

Google now offering Docs for Takeout

Google, through its Data Liberation Blog, has just announced that Docs are now available for Takeout.

So you can now export them along with everything else on the Google Takeout menu.




Choose to download all of the Docs that you own through Takeout in any of the formats mentioned above. We’re making it more convenient for you to retrieve your information however you want — you can even Takeout just your docs if you’d like. Lastly, be sure to click on the new “Configure” menu if you’d like to choose different formats for your documents.



Benelux roads dataset under public domain license

The “Benelux Roads” dataset consists of 2 ESRI Shapefiles containing the highways and major connecting roads in the Benelux area (Belgium, Netherlands and Luxembourg). The intended use of the data is maps at a scale of around 1:1 million, so general overview maps.

The data is under a Public Domain license, so there are no restrictions on its use.

Download the data here

Learn more about Red Geographics

France's public data, freely available

A dedicated website hosting France’s public data is now online:


The hosted data are freely available in various formats (xls, csv, etc.) under a licence designed to be compatible with the Open Government Licence (OGL), the Creative Commons Attribution 2.0 licence (CC-BY 2.0), and the Open Knowledge Foundation’s Open Data Commons Attribution licence (ODC-BY).