Sword of the data, a brief History [part 1]

The Sword of the data, part #1
A Brief History of the last decade: "making things possible and making them easy"


1/ First came the plummeting price of hard drive space: storage costs have decreased steadily over the last 30 years.


2/ Second came the dot-com bubble. It cost us the 2001 financial market crash, but it also brought the world cheap and widespread wired and wireless communication networks. Within a few years, much of the world was connected to the Internet.



3/ Third came software innovation, driven by the brand new challenges faced by Internet companies.

Google, Amazon, Facebook, Twitter and LinkedIn all had to solve the same technical problem: handling huge volumes of data, at a scale never reached before. Managing and processing those very large datasets in order to deliver results in (almost) real time became the key challenge for staying ahead of the competition.

No existing software was available at the time to solve those problems. Strikingly, all those companies took the same path: they concluded that their only choice was to develop new pieces of software internally, able to handle massive datasets on a distributed architecture (meaning hundreds of computers that communicate over a network in order to achieve a common goal).
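To make the idea of a distributed architecture more concrete, here is a minimal sketch in Python of the "split the work, compute in parallel, merge the results" pattern. It is not the code of any of the companies above: the function names are made up for illustration, and threads on a single machine stand in for the hundreds of networked computers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, n_workers):
    """Deal the records round-robin into n_workers partitions."""
    return [records[i::n_workers] for i in range(n_workers)]


def count_words(partition):
    """Work done independently by one (simulated) machine."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(records, n_workers=4):
    """Count words by splitting the input, counting in parallel and merging."""
    partitions = split_into_partitions(records, n_workers)
    # Assumption: threads stand in for networked machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # Merge step: combine the partial results into the final answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total
```

Calling `distributed_word_count(["big data", "big deal"], n_workers=2)` returns a `Counter` with the word totals. In a real cluster the hard parts are the ones this sketch skips: moving data over the network and coping with machines that fail.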




4/ At last, the innovation was made public and open source


Last but not least, most of the software developed was open sourced, such as Hadoop, Cassandra (Facebook) and Voldemort (LinkedIn). Google made its MapReduce patent free to use. Amazon opened its architecture to third parties. Innovation became easily accessible to the public.

Twitter's Storm source code available

As previously announced, Twitter’s Storm, the complex event processing system, is now open source on github.com.

Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Find out more information and details here: https://github.com/nathanmarz/storm

Twitter open sourcing Storm

Storm is a “Hadoop of Real-Time Processing” developed by BackType, a company acquired last month by Twitter.

Most of BackType’s technology, including ElephantDB and Cascalog, is already open source, and Twitter has announced that BackType’s Storm will be open sourced at the Strange Loop conference in September.


Storm is a Hadoop-like system, but instead of running MapReduce “jobs” that eventually end, Storm runs never ending “topologies.” It can be used for continuous computing, processing streams of data, etc.
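The difference between a job that ends and a topology that never ends can be sketched with plain Python generators. This is only an illustration of the concept and does not use Storm's API: the function names and the sample tweets are hypothetical.

```python
import itertools
from collections import Counter


def tweet_source():
    """Hypothetical source: an endless stream of tweets."""
    samples = [
        "storm is open source",
        "hadoop runs batch jobs",
        "storm runs topologies",
    ]
    # cycle() never stops, just as a live stream has no last record.
    for text in itertools.cycle(samples):
        yield text


def split_words(tweets):
    """Processing step: turn each tweet into a stream of words."""
    for text in tweets:
        for word in text.split():
            yield word


def running_count(words):
    """Processing step: emit the updated count each time a word is seen."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


def build_topology():
    """Wire the steps together; nothing runs until a consumer pulls results."""
    return running_count(split_words(tweet_source()))
```

A batch job would read all its input and return once. Here the consumer decides how long to listen, for example `for word, n in itertools.islice(build_topology(), 5): print(word, n)`, and the pipeline itself has no end.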

Here’s the rundown of the use-cases from the Twitter Engineering blog:

  1. Stream processing: Storm can be used to process a stream of new data and update databases in realtime. Unlike the standard approach of doing stream processing with a network of queues and workers, Storm is fault-tolerant and scalable.
  2. Continuous computation: Storm can do a continuous query and stream the results to clients in realtime. An example is streaming trending topics on Twitter into browsers. The browsers will have a realtime view on what the trending topics are as they happen.
  3. Distributed RPC: Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages. When it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.
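The third use case, a set operation parallelized on the fly, can be sketched as follows. This is a toy model under stated assumptions, not Storm code: the sharding scheme (`user_id % N_SHARDS`), the function names and the use of threads in place of cluster workers are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

N_SHARDS = 4  # Assumption: number of parallel workers in this sketch.


def shard_of(user_id):
    """Route an id to a shard, so equal ids always meet on the same worker."""
    return user_id % N_SHARDS


def shard_set(members):
    """Split one large set of ids into N_SHARDS smaller sets."""
    shards = [set() for _ in range(N_SHARDS)]
    for user_id in members:
        shards[shard_of(user_id)].add(user_id)
    return shards


def intersect_shard(pair):
    """Work done by one worker: intersect its slice of both sets."""
    left, right = pair
    return left & right


def common_members(left_members, right_members):
    """Answer one "invocation": which ids belong to both sets?"""
    left_shards = shard_set(left_members)
    right_shards = shard_set(right_members)
    result = set()
    with ThreadPoolExecutor(max_workers=N_SHARDS) as pool:
        # Each shard pair is intersected independently, then merged.
        for partial in pool.map(intersect_shard, zip(left_shards, right_shards)):
            result |= partial
    return result
```

Because both sets are sharded by the same rule, no worker needs to look at another worker's data. That independence is what lets a single query be spread over many machines.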

More details in the blog post.


First came the hardware, second the software and third is the age of data

Value first lay in hardware, then in software. Now, “The age of data is upon us,” declared RedMonk’s Stephen O’Grady at the Open Source Business Conference.

A great article summarizing O’Grady’s talk is available here: http://www.ecommercetimes.com/story/72471.html

It summarizes the timeline as follows:

  1. The first stage, epitomized by IBM, held that the money was in the hardware and software was just an adjunct.
  2. Stage two, fired off by Microsoft, contended the money is in the software.
  3. Google epitomizes the third stage, where the money is not in the software, but software is a differentiator. “Google came up at a time when a lot of folks were building the Internet on the backs of some very expensive hardware and software. Google uses commodity hardware, free — meaning no-cost — software, and focuses on what it can do better than its competitors with that software.”

Wondering what the fourth stage could be? It might be Facebook and Twitter. “Now, software is not even differentiating; it’s the value of the data. Facebook and Twitter monetize their data in different ways.”

How Twitter uses NoSQL

InfoQ has released a video of Twitter’s Kevin Weil explaining how the company uses NoSQL.

Twitter’s architecture mainly relies on the following products:

  • Scribe
  • Hadoop
  • Pig
  • HBase
  • FlockDB
  • Cassandra


Video is available here: http://www.slideshare.net/nkallen/q-con-3770885


Twitter and all that data: what for?


Twitter uses all this data for, among other things, running comparisons of different types of users. Twitter analyzes data to determine whether mobile users, users who use 3rd party clients or “power users” use Twitter differently from average users. The company is also interested in determining whether certain features or occurrences “trigger” a casual user into becoming a frequent user. For example: do people become frequent users when they start following the right people or discover the right feature?

Other questions Weil says Twitter is interested in include determining which types of tweets get retweeted the most, what types of social graph structures result in the most successful networks and how to tell the difference between humans and bots.