Facebook – Everything is interesting to us

Facebook collects a lot of data. Every time you click a notification, visit a page, upload a photo, or check out a friend’s link, you generate data for the company to track. Multiply that by 950 million people and you have a lot of information to deal with.

Here are some of the stats the company provided Wednesday to demonstrate just how big Facebook’s data really is:

  • 2.5 billion content items shared per day (status updates + wall posts + photos + videos + comments)
  • 2.7 billion Likes per day
  • 300 million photos uploaded per day
  • 100+ petabytes of disk space in one of FB’s largest Hadoop (HDFS) clusters
  • 105 terabytes of data scanned every 30 minutes via Hive, the SQL-like query layer Facebook built on top of Hadoop
  • 70,000 queries executed on these databases per day
  • 500+ terabytes of new data ingested into the databases every day

“If you aren’t taking advantage of big data, then you don’t have big data, you have just a pile of data,” said Jay Parikh, VP of infrastructure at Facebook on Wednesday. “Everything is interesting to us.”


90% of the existing data was created in the last two years

And a few more facts about Big Data:

  • Wal-Mart handles more than a million customer transactions every hour.
  • Facebook hosts more than 50 billion photos.
  • Google has set up thousands of servers in huge warehouses to process searches.
  • 90% of the data that exists today was created within the last two years.


This pattern of growth is driven by rapid, relentless trends: the rise of social networks, online video, and the Web itself. For organizations struggling to stay on top of their most critical missions, extracting visibility and actionable business intelligence from this explosive surge in data has created unprecedented challenges.

That’s because big data causes big problems for companies, as well as for our economy and national security. Look no further than the financial crisis. Near the end of 2008, when the global financial system stood at the brink of collapse, the CEO of a global banking giant was repeatedly asked during a conference call with analysts to quantify the volume of mortgage-backed security holdings on the bank’s books. Despite the bank having spent a whopping $37 billion on IT operations over the previous 10 years, his best response was a sheepish: “I don’t have that information.”

Had regulators and big banks been able to accurately assess their exposure to subprime lending, we might have dampened the recession and saved the housing market from its biggest fall in 30 years.

Data generated through social media tools

A few figures about data generated through social media tools:

  • People send more than 144.8 billion email messages a day.
  • People and brands on Twitter send more than 340 million tweets a day.
  • People on Facebook share more than 684,000 bits of content a day.
  • People upload 72 hours (259,200 seconds) of new video to YouTube a minute.
  • Consumers spend $272,000 on Web shopping a day.
  • Google receives over 2 million search queries a minute.
  • Apple receives around 47,000 app downloads a minute.
  • Brands receive more than 34,000 Facebook ‘likes’ a minute.
  • Tumblr blog owners publish 27,000 new posts a minute.
  • Instagram photographers share 3,600 new photos a minute.
  • Flickr photographers upload 3,125 new photos a minute.
  • People perform over 2,000 Foursquare check-ins a minute.
  • Individuals and organizations launch 571 new websites a minute.
  • WordPress bloggers publish close to 350 new blog posts a minute.
  • The Mobile Web receives 217 new participants a minute.
    (The most up-to-date numbers are available from the sites themselves.)

Ashish Thusoo’s insights from scaling the data analytics engine at Facebook

Ashish Thusoo recently shared six insights in an article for Forbes, demonstrating he knows a lot about Big Data.

Thusoo joined Facebook in 2007 when the company had 50 million users. He left when it had some 800 million. During that time he managed Facebook’s internal data analytics team.

Here’s what Thusoo learned while scaling the data analytics engine at Facebook:

1. New technologies have shifted the conversation from “what data to store” to “what can we do with more data.” The lower comparative cost of open source technologies like Hadoop and Hive makes it possible to gather more key measurements. In the case of Facebook and other Internet properties, that means gathering a lot more data on user activity and behavior.

This reduction in cost also enables more historical data to be online. “The result,” says Thusoo, “is better data driven applications. At least in the data world, simple algorithms on more data seems to yield better results than complex algorithms on a smaller data sample, notwithstanding some exceptions.”
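Thusoo’s point about simple algorithms on more data can be illustrated with a toy simulation (hypothetical numbers, not Facebook’s actual workload): the same trivial estimator, a plain average, gets sharper as the sample grows.

```python
import random

def estimate_rate(true_rate, n, seed=42):
    """Estimate an event rate by averaging n simulated observations."""
    rng = random.Random(seed)
    hits = sum(rng.random() < true_rate for _ in range(n))
    return hits / n

# Hypothetical metric: the fraction of users who click a notification.
true_rate = 0.03
small_sample = estimate_rate(true_rate, 100)       # little data
big_sample = estimate_rate(true_rate, 1_000_000)   # lots of data

print(small_sample, big_sample)
```

With a million observations the naive average lands within a fraction of a percent of the true rate; with only a hundred it can easily be off by several points, which is the gap a cleverer algorithm on the small sample would have to close.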

2. Simplify data analytics for end users. Put another way, what Thusoo learned at Facebook was that there “was a lot of power in democratizing data for data users” such as scientists, analysts, and engineers.

His goal was to make all capabilities related to data easy, from instrumenting applications and collecting data, to understanding and analyzing it, to creating data driven applications.

“Building familiar interfaces,” and tools to deal with data was key to increasing the adoption of underlying technologies like Hadoop and Hive within Facebook.
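Concretely, the “familiar interface” Hive offers is SQL over Hadoop data. The sketch below (hypothetical event records, not a real Facebook schema) shows in plain Python the kind of group-by aggregation a Hive query like `SELECT action, COUNT(*) FROM events GROUP BY action` expresses:

```python
from collections import Counter

# Hypothetical activity-log records, the shape of data an analyst might query.
events = [
    {"user": "alice", "action": "like"},
    {"user": "bob",   "action": "comment"},
    {"user": "alice", "action": "like"},
    {"user": "carol", "action": "photo_upload"},
]

# Equivalent of: SELECT action, COUNT(*) FROM events GROUP BY action
counts = Counter(event["action"] for event in events)
print(counts["like"])  # 2
```

An analyst who knows SQL can express exactly this without ever touching the MapReduce machinery underneath, which is what drove adoption.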

3. More users means data analytics systems have to be more robust. The vision of “democratizing data” among Facebook’s “data scientists, analysts and data engineers made things harder.”

To realize that vision, Thusoo’s team had to design in the ability to handle poorly written queries so they wouldn’t crash the system. They had to build mechanisms for sharing resources fairly, including usage monitoring and limits.

“We had many different kinds of users ranging from business analysts to product engineers with varying levels of understanding of the infrastructure or the best practices of using it.”

4. Social networking works for Big Data. “We invested in making our tools more and more collaborative so that users could share analysis with each other and discover data by getting connected to expert users of a data set.”

With Facebook’s hyper-growth and data that was changing all the time, a collaboration approach “worked better than creating knowledge bases around metadata.”

5. No single infrastructure can solve all Big Data problems. When it came to real-time reports, Thusoo’s team made “a lot of investment as we discovered use cases… better solved through systems other than Hadoop. In the case of real time reports our team invested in building out Puma. There were many other examples around graph analysis as well as low latency data inspection on large data sets,” where they had to build or invest in new technologies.

6. Building software is hard, but running a service is even harder. Thusoo’s team had to do a lot of work to make the service usable. They invested a lot of time and energy in building “systems that would measure usage, point out bottlenecks and really quantify for our users how much they were using” the system. They had to build capabilities to monitor and deliver on agreed upon service levels as well.

Visualizing Activity on Facebook

In 2010, Paul Butler produced a striking image of the world by visualizing friendships. At the Where 2.0 conference earlier this year, Facebook presented visualizations that describe how people use location in their Facebook posts.

Liking places, checking in and tagging photos with a location are just a few examples of how people use location to interact and communicate with friends. The images below attempt to visually analyze how people are using location with Facebook and present some unique slices of data.

These visualizations were created by processing location data with the aptly named Processing, a Java-based open-source language for creating visualizations. Various time slices were chosen, from single-day snapshots to year-to-date images, to analyze longer-term trends.




2011 bullshit awards

And the 2011 bullshit award goes to Facebook:

  • “Our users want to interact with brands.”
  • “We value your privacy.”
  • “We’re not tracking you when you’re logged out.”


For fun… but not only!

Sword of the data, a step behind [part 2]

This article follows “Sword of the data, a brief History [part 1],” available here



The Sword of the data, part #2

A step behind "the data issue" 



1/A horizon of new possibilities

We are living in a new era, which can fairly be called the big data era. In one decade, data has driven revolutionary changes in computing and the Internet: new opportunities for generating revenue, more efficient business processes, and the ability for any organization, whatever its size, to operate worldwide in real time.


2/Data-centric enterprise

As organizations become more data-centric every day, they are capturing data at deeper levels of detail and keeping more history than ever before, and the volume is still growing at an unprecedented rate. Data is now the most important and valuable component of modern applications, websites, and the business intelligence organizations use for decision-making.


3/In this new era, “everything lives or dies strictly by the sword of data”

When competing for the same ground, the keys to success are linked to how you handle your data:

  • Safety: information security has become a requirement.
  • Quality, integrity, consistency: erroneous or conflicting data carries a major cost to a business’s bottom line, impacting customers, reputation, and revenue.
  • Accessibility: easy search and the number of clicks required to reach the wanted data make a difference.
  • Availability: downtime now has a cost.
  • Velocity: poor performance reduces productivity.



Sword of the data, a brief History [part 1]

The Sword of the data, part #1
A Brief History of the last decade: "making things possible and making them easy"


1/First came the plummeting price of hard drive space: storage cost has decreased steadily over the last 30 years.


2/Second came the 2001 financial market crash, the so-called dot-com bubble, which nevertheless left the world with cheap and widespread wired and wireless communication networks. Within a few years, most of the world got connected to the Internet.



3/Third came software innovation, driven by the brand-new challenges faced by Internet companies.

Google, Amazon, Facebook, Twitter, and LinkedIn all had to solve the same technical problem. They needed to handle huge volumes of data, at a scale never reached before. Managing and processing those very large datasets in order to deliver results in (almost) real time became the key challenge in leading the competition.

No existing software was available at the time to solve those problems. Surprisingly, all those companies took the same path: they concluded their only choice was to internally develop new software to handle massive data sets on a distributed architecture (meaning hundreds of computers communicating over a network to achieve a common goal).
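The model these companies converged on is MapReduce, which Google published and Hadoop later implemented: map each record to key/value pairs, shuffle the pairs by key across the cluster, then reduce each group. A single-process Python sketch of the canonical word-count example:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input record.
    for word in line.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Reduce: combine all values emitted for one key.
    return key, sum(values)

def map_reduce(lines):
    pairs = [kv for line in lines for kv in map_phase(line)]
    # Shuffle: group pairs by key (done over the network in a real cluster).
    pairs.sort(key=itemgetter(0))
    return dict(
        reduce_phase(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    )

print(map_reduce(["big data big deal", "big data"]))  # {'big': 3, 'data': 2, 'deal': 1}
```

Because each map call and each reduce call sees only its own slice of the data, both phases can be spread across hundreds of machines without changing the logic.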




4/At last, the innovations were made public and open source


Last but not least, most of the software developed was open sourced, such as Hadoop, Cassandra (Facebook), and Voldemort (LinkedIn). Google’s MapReduce patent was made free to use, and Amazon opened its architecture to third parties. Innovation was made easily accessible to the public.

Facts and stats, MongoDB trend

Although NoSQL data stores like Redis (VMware), Cassandra (developed and used by Facebook), CouchDB, and the broader Hadoop ecosystem (Apache Foundation) have gotten a lot of media attention lately, MongoDB appears to be the product to catch in this emerging market.

Searching Google Trends and job trends (via indeed.com) for various NoSQL solutions, the evidence points to MongoDB leading the pack.

SourceForge, Disney, and Craigslist are all using MongoDB; see the full adopter list here: http://www.mongodb.org/display/DOCS/Production+Deployments
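What sets MongoDB apart from the relational products it competes with is its data model: schema-less, JSON-like documents (stored as BSON). A minimal standard-library sketch of that document shape (the field names are hypothetical, not a real deployment’s schema):

```python
import json

# A nested document: comments live inside the post instead of a JOINed table.
post = {
    "author": "alice",
    "text": "hello world",
    "tags": ["intro", "demo"],
    "comments": [
        {"author": "bob", "text": "welcome!"},
    ],
}

# Documents serialize naturally; adding a field needs no ALTER TABLE.
restored = json.loads(json.dumps(post))
print(restored["comments"][0]["author"])  # bob
```

Keeping related data in one nested document is what lets MongoDB answer "fetch this post and its comments" with a single lookup rather than a join.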


 Google Trends result for “MongoDB”


 Job trends from indeed.com for various NoSQL solutions

 Job trends from indeed.com for “MongoDB”


Facebook stats

Facebook confirms 750 million users, sharing 4 billion items daily

Full article: