Google Insights: the NoSQL fights

Google Insights provides web search interest for the following NoSQL solutions: Cassandra, Redis, MongoDB, Hadoop and CouchDB.

No leading solution emerges from these insights, available here

Sword of the data, a step behind [part 2]

This article follows “Sword of the data, a brief History [part 1]”, available here

 

 

The Sword of the data, part #2

A step behind "the data issue" 

 

 

1/A horizon of new possibilities

We’re living in a new era, which can probably be described as the big data era. In one decade, data has driven revolutionary changes in computing and the Internet, including new opportunities for generating revenue and improving the efficiency of business processes. Any organization (whatever its size) can now operate around the world in real time.

 

2/Data-centric enterprise

As organizations become more data-centric every day, they are capturing data at deeper levels of detail and keeping more history than ever before (and the volume is still increasing at an unprecedented rate). Data is now the most important and valuable component of modern applications, websites, and the business intelligence that organizations use for decision-making.

 

3/In this new era, “whatever lives or dies strictly by the sword of data”

When competing for the same ground, the keys to success will be linked to the ways you handle your data:

  • Safety: information security has become a requirement.
  • Quality, integrity, consistency: erroneous or conflicting data has a major cost to a business’s bottom line, impacting customers, reputation and revenue.
  • Accessibility: easy search and the number of clicks required to reach the wanted data make a difference.
  • Availability: downtime now has a cost.
  • Velocity: poor performance reduces productivity.

 

 


Sword of the data, a brief History [part 1]

The Sword of the data, part #1
A Brief History of the last decade: "making things possible and making them easy"

 

1/First is the plummeting price of hard drive space: storage cost has steadily decreased over the last 30 years

 

2/Second came at the cost of the 2001 financial market crash, the so-called dot-com bubble, but it brought the world cheap and widespread wired and wireless communication networks. Within a few years, most of the world got connected to the Internet.

 

 

3/Third came the software innovation, required by the brand new challenges faced by the Internet companies.

Google, Amazon, Facebook, Twitter and LinkedIn had to solve the same technical problem. They needed to handle huge volumes of data (at a scale never reached before). Managing and processing those very large datasets in order to deliver results in (almost) real time became the key challenge in leading the competition.

No existing software was available at the time to solve those problems. Surprisingly, all those companies took the same path and concluded that their only choice was to develop new pieces of software internally to handle massive data sets, based on a distributed architecture (meaning hundreds of computers that communicate through a network in order to achieve a common goal).

 

 

 

4/At last, the innovation was made public and open source

 

Last but not least, most of the software developed was open sourced, such as Hadoop, Cassandra (Facebook) and Voldemort (LinkedIn). Google’s MapReduce patent was made free to use. Amazon opened its architecture to third parties. Innovation was made easily accessible to the public.

Google Cloud SQL

Google has created a relational database for its cloud-hosted App Engine application development and hosting platform, since this feature was one of the most requested.
For now, the database, called Google Cloud SQL, is available in limited preview mode, which means that the company will hand-select the developers who get access to it.

During this preview period, Google Cloud SQL will be free of charge. Google will announce pricing a month before it starts charging for it.

 

What is Google Cloud SQL?

Google Cloud SQL is a web service that allows you to create, configure, and use relational databases with your App Engine applications. It is a fully-managed service that maintains, manages, and administers your databases, allowing you to focus on your applications and services.

By offering the capabilities of a MySQL database, the service enables you to easily move your data, applications, and services into and out of the cloud. This allows for high data portability and helps in faster time-to-market because you can quickly leverage your existing database (using JDBC and/or DB-API) in your App Engine application.

To ensure that your critical applications and services are always running, Google Cloud SQL replicates data to multiple geographic regions to provide high data availability.

The service is currently in limited preview.

Highlights

  • Ease of Use

A rich graphical user interface allows for creating, configuring, managing, and monitoring your database instances, with just a click.

  • Fully managed

No worrying about tasks such as replication, patch management, or other database management chores. All these tasks are taken care of for you.

  • Highly Available

To meet the critical availability needs of today’s applications and services, features like replication across multiple geographic regions are built in, so the service is available even if a data center becomes unavailable.

  • Integrated with Google App Engine and other Google services

Tight integration with Google App Engine and other Google services enables you to work across multiple products easily, get more value from your data, move your data into and out of the cloud, and get better performance.

More information here:

http://code.google.com/apis/sql/index.html

Facts and stats, MongoDB trend

Although NoSQL solutions like Hadoop (Apache Foundation), Redis (VMware), Cassandra (developed and used by Facebook) or CouchDB have been getting a lot of media attention lately, MongoDB appears to be the product to watch in this emerging market.

Searching Google Trends and job trends (using indeed.com) for various NoSQL solutions, the evidence points to MongoDB leading those trends.

SourceForge, Disney and Craigslist are all using MongoDB; see the full adopter list here: http://www.mongodb.org/display/DOCS/Production+Deployments

 

Google Trends result for “MongoDB”

Job trends from indeed.com for various NoSQL solutions

Job trends from indeed.com for “MongoDB”


 

Google open sourcing LevelDB

There is one more NoSQL solution, as Google has just open sourced LevelDB!

LevelDB is a key-value storage engine written in C++, and its source code has now been released under a BSD-style license. Google designed LevelDB as a building block for higher-level storage systems, and it will in fact form the basis for the IndexedDB API (a new web standard for web apps that need a database) in future versions of Google Chrome.

The things you need to know about LevelDB:

  • Open source + great performance
  • This is not a SQL database, so it has no relational data model, no SQL queries and no support for indexes.
  • Only a single process (possibly multi-threaded) can access a particular database at a time.
  • There is no client-server support. So unlike MySQL, MongoDB and CouchDB, for instance, your application cannot connect to LevelDB and operate on it remotely: LevelDB is a library that you include in your code, just like SQLite (if you need client-server support, you have to wrap your own server around the library).
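
To make the "library, not server" point concrete, here is a minimal sketch in Python. It does not use LevelDB at all: it uses SQLite (through Python's standard sqlite3 module) as a stand-in for an embedded storage engine, since the comparison with SQLite is made above. The kv table and the put/get helpers are hypothetical names chosen for this illustration.

```python
import sqlite3

# SQLite stands in here for "an embedded storage library": the engine runs
# inside this process, with no server and no network connection involved.
# This is NOT LevelDB's API, only an illustration of the embedding model.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")


def put(key, value):
    """Store a value under a key, overwriting any previous value."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
            (key, value),
        )


def get(key):
    """Return the value stored under key, or None if the key is absent."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


put("user:1", "alice")
put("user:1", "bob")  # overwrites the previous value
print(get("user:1"))
print(get("missing"))
```

Nothing listens on a port here: if another machine needs this data, you have to write the server around these calls yourself, which is the trade-off described in the last bullet.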

 

Find out more and download the source code from the project page on Google Code

Google reloaded, at least trying to

Google, the company that brought a revolution to the web, seems to be facing a critical time.

The truth, based on facts, is that the company currently faces multiple problems, such as:

  • The stock is falling, breaking the extraordinary trend that was part of Google’s myth
  • There has been criticism from the financial community since Larry Page took over the company
  • FTC antitrust probes are back
  • Apple has been reborn during Google’s reign
  • They missed the “social” wave and now need to fight hard against Facebook (now number one in display advertising in the US)
  • They have missed with Google TV so far, but this fight is still open
  • Rolling out the people widget feature in Gmail is taking longer than expected (still not completed)
  • Technical issues, see Google: At scale everything breaks

 

Google knows it is in trouble, which is already a big step, and is trying to react. This reaction is interesting because it tries to bring back the company’s keys to success:

  • Larry Page means the initial company spirit (technical startup) is back, and the engineers should be backed up and strengthened in their choices
  • Restoring technical leadership: GFS, the Google File System at the very basis of their architecture, is being reviewed and enhanced. Google Panda is being updated to version 2.2, with an improved algorithm.

 

Urs Hölzle: Massive scale brings some complexity

Google is the massive-scale company: it operates technology that is expected to be reliable in the face of major traffic demands. In order to scale its services, the company has developed many systems, such as MapReduce and Google File System.

However, behind the scenes, the company is fighting a constant battle simply to make it work: the massive scale brings increasingly challenging levels of complexity.

Urs Hölzle was Google’s first vice president of engineering. According to Hölzle, “at scale, everything breaks”.

 

The following is an interview with Urs Hölzle; the original article can be found at this address:

http://www.zdnet.co.uk/news/cloud/2011/06/22/google-at-scale-everything-breaks-40093061/

 

 

Q: Apart from focusing on physical infrastructure, such as datacentres, are there efficiencies that Google gains from running software at massive scale?
A: I think there absolutely is a very large benefit there, probably more so than you can get from the physical efficiency. It’s because when you have an on-premise server it’s almost impossible to size the server to the load, because most servers are actually too powerful and most companies [using them] are relatively small.

[But] if you have a large-scale email service where millions of accounts are in one place, it’s much easier to size the pool of servers to that load. If you aggregate the load, it’s intrinsically much easier to keep your servers well utilised.

 

What are Google’s plans for the evolution of its internal software tools?
There’s obviously an evolution. For example, most applications don’t use [Google File System (GFS)] today. In fact, we’re phasing out GFS in favour of the next-generation file system that is very similar, but it’s not GFS anymore. It scales better and has better latency properties as well. I think three years from now we’ll try to retire that because flash memory is coming and faster networks and faster CPUs are on the way and that will change how we want to do things. 

One of the nice things is that if everyone today is using the Bigtable compressed database, suppose we have a better Bigtable down the line that does the right thing with flash — then it’s relatively easy to migrate all these applications as long as the API stays stable.

How significant is it to have these back-end systems — such as MapReduce and the Google File System — spawn open-source applications such as Hadoop through publication and adaptation by other companies?
It’s an unavoidable trend in the sense that [open source] started with the operating system, which was the lowest level that everyone needed. But the power of open source is that you can continue to build on the infrastructure that already exists [and you get] things like Apache for the web server. Now we’re getting into a broader range of services that are available through the cloud.

For instance, cluster management itself or some open-source version will happen, because everyone needs it as their computation scales and their issue becomes not the management of a single machine, but the management of a whole bunch of them. Average IT shops will have hundreds of virtual machines (VMs) or hundreds of machines they need to manage, so a lot of their work is about cluster management and not about the management of individual VMs.

Often, if computation is cheap enough, then it doesn’t pay to do your own solution. Maybe you can do your own solution but you [might not be able to] justify the software engineering effort and then the ongoing maintenance, instead of staying within an open-source system.

In your role, what is the most captivating technical problem you deal with?
I think the big challenges haven’t changed that much. I’d say that it’s dealing with failure, because at scale everything breaks no matter what you do and you have to deal reasonably cleanly with that and try to hide it from the people actually using your system.

There are two big reasons why MapReduce-Hadoop is really popular. One is that it gets rid of your parallelisation problem because it parallelises automatically and does load-balancing automatically across machines. But the second one is if you have a large computation, it deals with failures. So if one of [your] machines dies in the middle of a 10-hour computation, then you’re fine. It just happens.
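
The "deals with failures" part of that answer can be sketched in a few lines of Python. This is a toy, single-process illustration, not Hadoop or Google's MapReduce: the chunks are processed one after another rather than in parallel, and the function names (run_with_retry, map_words, flaky_map_words, map_reduce) are hypothetical. It only shows why re-running a failed map task is safe when the task has no side effects.

```python
from collections import defaultdict


def run_with_retry(task, arg, max_attempts=3):
    """Run task(arg); if it raises RuntimeError, run it again (up to max_attempts)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(arg)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def map_reduce(chunks, mapper):
    """Map every chunk (with retries), group the pairs by key, then sum each group."""
    groups = defaultdict(list)
    for chunk in chunks:
        for key, value in run_with_retry(mapper, chunk):
            groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical flaky mapper: it fails the first time it sees each chunk,
# simulating a machine dying in the middle of the job.
seen = set()


def flaky_map_words(chunk):
    if chunk not in seen:
        seen.add(chunk)
        raise RuntimeError("simulated machine failure")
    return map_words(chunk)


chunks = ["at scale everything breaks", "everything breaks at scale"]
counts = map_reduce(chunks, flaky_map_words)
print(counts)
```

Every map task fails once here and the final counts are still correct, because a map task only reads its input and returns pairs, so it can simply be run again.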

I think the second one is dealing with stateful, mutable states. MapReduce is easy because it’s a case of presenting it with a number of files and having it compute them and, if things go wrong, you can just do it again. But Gmail, IM and other stateful services have very different security [uptime and data-loss] implications.

We use tapes, still, in this age because they’re actually a very cost-effective way as a last resort for Gmail. The reason why we put it in is not physical data loss, but once in a blue moon you will have a bug that destroys all copies of the online data and your only protection is to have something that is not connected to the same software system, so you can go and redo it.

The last challenge we’re seeing is to use commodity hardware but actually make it work in the face of rapid innovation cycles. For example, here’s a new HDD (hard-disk drive). There’s a lot of pressure in the market to get it out because you want to be the first one with a 3TB drive and there’s a lot of cost pressure too, but how do you actually make these drives reliable?

As a large-scale user, we see all the corner cases and in virtually every piece of hardware we use we find bugs, even if it’s a shipping piece of hardware.

If you use the same operating system, like Linux, and run the same computation on 10,000 machines and every day 100 of them fail, you’re going to say, wow this is wrong. But if you did it by yourself, it’s a one-percent failure rate. So three times a year you’d have to change your server. You probably wouldn’t take the effort to debug and you’d think it was a random fluke or you’d debug and it wouldn’t be happening any more.
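
The arithmetic behind that remark can be checked quickly. The sketch below reads the quote as "100 out of 10,000 machines fail each day" and assumes failures are spread evenly across machines and days, which is a simplification.

```python
machines = 10_000
failures_per_day = 100

# Chance that one given machine fails on one given day.
daily_failure_rate = failures_per_day / machines

# Expected number of failures for a single machine over a year.
expected_per_machine_per_year = daily_failure_rate * 365

print(f"{daily_failure_rate:.0%} of machines fail per day")
print(f"{expected_per_machine_per_year:.2f} expected failures per machine per year")
```

At fleet scale, 100 failures a day is impossible to ignore; on a single server the same rate shows up only a few times a year, which is why it is easy to dismiss as a fluke.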

It seems you want all your services to speak to each other. But surely this introduces its own problems of complexity?
Automation is key, but it’s also dangerous. You can shut down all machines automatically if you have a bug. It’s one of the things that is very challenging to do because you want uniformity and automation, but at the same time you can’t really automate everything without lots of safeguards or you get into cascading failures.

Complexity is evil in the grand scheme of things because it makes it possible for these bugs to lurk that you see only once every two or three years, but when you see them it’s a big story because it had a large, cascading effect.

Keeping things simple and yet scalable is actually the biggest challenge. It’s really, really hard. Most things don’t work that well at scale, so you need to introduce some complexity, but you have to keep it down.

Have you looked into some of the emerging hardware, such as PCIe-linked flash?
We’re not a priori excluding anything and we’re playing with things all the time. I would expect PCIe flash to become a commodity because it’s a pretty good way of exposing flash to your operating system. But flash is still tricky because the durability is not very good.

I think these all have the promise of deeply affecting how applications are written because if you can afford to put most of your data in a storage medium like this rather than on a moving head, then a factor of a thousand makes a huge difference. But these things are not close enough yet to disk in terms of storage cost.

 

First came the hardware, second the software and third is the age of data

Value was first in the hardware; in a second stage it was in the software; and now it seems “The age of data is upon us”, as RedMonk’s Stephen O’Grady declared at the Open Source Business Conference.

A great article summarizing O’Grady’s words is available here: http://www.ecommercetimes.com/story/72471.html

It summarizes the timeline as follows:

  1. The first stage, epitomized by IBM, held that the money was in the hardware and software was just an adjunct.
  2. Stage two, fired off by Microsoft, contended the money is in the software.
  3. Google epitomizes the third stage, where the money is not in the software, but software is a differentiator. “Google came up at a time when a lot of folks were building the Internet on the backs of some very expensive hardware and software. Google uses commodity hardware, free — meaning no-cost — software, and focuses on what it can do better than its competitors with that software.”

Wondering what the fourth stage could be? It might be Facebook and Twitter. “Now, software is not even differentiating; it’s the value of the data. Facebook and Twitter monetize their data in different ways.”

The NoSQL trends according to Google

A quick look into Google’s world to determine the state of the NoSQL trends:

 

  • Google fight, popularity of NoSQL vs MySQL

http://www.googlefight.com/index.php?lang=en_GB&word1=nosql&word2=mysql