At the Google I/O developer conference in San Francisco last month, Google introduced a slew of new products, including Google Compute Engine. What wasn’t talked about was Google’s big data play. Google is definitely in the big data business, and it will be the 800-pound gorilla in the space.
Google has two major things going for it: 1) Google has an amazing infrastructure and network inside its core operations; and 2) Google owns lots of data, let’s just say about 90% of the world’s data, covering both information and people.
Google’s infrastructure strength and direction with big data will shape not only applications but also the enterprise business. Why? Because Google can provide infrastructure and data to anyone who wants it.
Watch out for Google, because soon it will be competing with everyone in the enterprise, including the big boys like EMC/Greenplum, IBM/Netezza, HP, and Microsoft.
David Floyer, Chief Technology Officer and head of research at Wikibon.org, wrote a great research paper today called “Google and VMware Provide Virtualization of Hadoop and Big Data.” David addressed the Google (and VMware) angle in that piece.
If you’re interested in what Google is doing in big data, you have to read the Wikibon research.
Google Compute Engine Review (source: Wikibon.org)
At the 2012 Google I/O conference, Google announced Compute Engine. The service makes 700,000 virtual cores available for users to spin up and tear down very rapidly for big data applications in general, and MapReduce and Hadoop in particular, all without setting up any data center infrastructure. It works with the Google Cloud Storage service to provide the data, and the data is encrypted at rest. This is a different service from Google App Engine, but complementary to it.
Compute Engine uses the KVM hypervisor on top of the Linux operating system. In discussions with Wikibon, Google pointed out the improvements it had made to the open source KVM code to improve performance and security in a multi-core, multi-threaded Intel CPU environment. This allows virtual cores (one thread, one core) to be used as the building block for spinning up very efficient virtual machines.
To help with data ingestion, Google is offering access to the full resources of Google’s private networks. This makes it possible to move ingested data across the network at very high speed and at large scale, and allows replication to a specific data center. The location(s) can be defined, allowing compliance with specific country or regional requirements to retain data within the country. If the user can bring the data cost-effectively and with sufficient bandwidth to a Google edge location, the Google network services will take over.
The Google Hadoop service can utilize the MapR framework in a similar way to the MapR service for Amazon. This provides improved availability and management components. John Schroeder, CEO and founder of MapR, presented a demonstration running Terasort on a 5,024-core Hadoop cluster with 1,256 disks on the Google Compute Engine service. The run completed in 1 minute 20 seconds, at a total cost of $16. He compared this with a 1,460 physical server environment with over 11,000 cores, which would take months to set up and would cost over $5 million.
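To put the demonstration figures in perspective, here is a rough back-of-envelope sketch in Python that uses only the numbers quoted above. The function names are illustrative, the implied per-core-hour rate is simply derived from the $16 total and is not a published Google price, and the comparison ignores billing granularity and the running costs of a physical cluster.

```python
# Back-of-envelope arithmetic using only the figures quoted in the demo.
CLOUD_CORES = 5_024
CLOUD_DISKS = 1_256
RUN_SECONDS = 80               # assumes the quoted run time is 1 minute 20 seconds
RUN_COST_USD = 16.0
PHYSICAL_COST_USD = 5_000_000  # "over $5 million", treated here as a lower bound


def core_hours(cores: int, seconds: float) -> float:
    """Total compute consumed, expressed in core-hours."""
    return cores * seconds / 3600.0


def implied_rate_per_core_hour(cost_usd: float, cores: int, seconds: float) -> float:
    """Cost divided by core-hours; an implied figure, not a price list entry."""
    return cost_usd / core_hours(cores, seconds)


def runs_to_match_capex(capex_usd: float, cost_per_run_usd: float) -> int:
    """How many cloud runs the quoted physical-cluster cost would pay for."""
    return int(capex_usd // cost_per_run_usd)


if __name__ == "__main__":
    print(f"Cores per disk: {CLOUD_CORES / CLOUD_DISKS:.0f}")
    print(f"Core-hours used: {core_hours(CLOUD_CORES, RUN_SECONDS):.1f}")
    rate = implied_rate_per_core_hour(RUN_COST_USD, CLOUD_CORES, RUN_SECONDS)
    print(f"Implied cost per core-hour: ${rate:.3f}")
    print(f"Runs covered by $5M: {runs_to_match_capex(PHYSICAL_COST_USD, RUN_COST_USD):,}")
```

On these assumptions the run consumed a little over 111 core-hours, which is why a short burst on thousands of cores can cost so little compared with buying the hardware outright.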
As a demonstration, this was impressive. Of course, Terasort is a highly CPU-intensive workload that can be effectively parallelized and that utilizes cores very efficiently. Other benchmark results, including ones that make more I/O-intensive use of Google Cloud Storage, are necessary to confirm that the service is of universal value.
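Why does a sort like this parallelize so well? A minimal single-machine sketch, assuming the sampled range-partitioning approach that Terasort is known for: sample the keys, pick split points, send each key range to its own worker, sort the ranges independently, and concatenate. This is not Hadoop or MapR code, the function names are hypothetical, and the thread pool only illustrates that the partitions are independent (Python threads do not give real CPU parallelism).

```python
import bisect
import random
from concurrent.futures import ThreadPoolExecutor


def choose_split_points(records, num_partitions, sample_size=100, seed=0):
    """Sample the input and pick num_partitions - 1 evenly spaced keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]


def partition(records, split_points):
    """Route each record to the bucket that owns its key range."""
    buckets = [[] for _ in range(len(split_points) + 1)]
    for record in records:
        buckets[bisect.bisect_right(split_points, record)].append(record)
    return buckets


def parallel_range_sort(records, num_partitions=4):
    """Sort each key range independently, then concatenate in range order."""
    if not records:
        return []
    split_points = choose_split_points(records, num_partitions)
    buckets = partition(records, split_points)
    # Each bucket could be sorted on a separate core or machine;
    # no worker needs to see another worker's data.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        sorted_buckets = list(pool.map(sorted, buckets))
    return [record for bucket in sorted_buckets for record in bucket]


if __name__ == "__main__":
    data = [random.randint(0, 1_000_000) for _ in range(10_000)]
    assert parallel_range_sort(data, num_partitions=8) == sorted(data)
    print("sorted", len(data), "records across 8 partitions")
```

Because every key in one bucket is smaller than every key in the next, no merge step is needed after the per-bucket sorts, so adding cores mostly just shrinks each bucket. A workload that spends more time reading and writing storage would not scale this cleanly, which is the point of the caveat above.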
Wikibon also discussed whether Google would provide other data services to allow joining of corporate data with other Google-derived and provided datasets. Google indicated that they understood the potential value of this service and understood that other service providers were offering these services (e.g., Microsoft Azure). Wikibon expects that data services of this type will be introduced by Google.
There is no doubt that Google is seriously addressing the big data market and wants to compete seriously in the enterprise space. The Google network services, data replication services and encryption services reflect this drive to compete strongly with Amazon.