Google is definitely in the Big Data business

At the Google I/O developer conference in San Francisco last month, Google introduced a slew of new products, including Google Compute Engine. What wasn't talked about was Google's big data play. Google is definitely in the Big Data business, and it is poised to be the 800-lb gorilla in the space.

Google has two major things going for it: 1) an amazing infrastructure and network inside its core operations; and 2) ownership of a staggering amount of data, let's just say about 90% of the world's data, covering both information and people.

Google's infrastructure strength and direction with big data will shape not only applications but also the enterprise business. Why? Because Google can provide infrastructure and data to anyone who wants it.

Watch out for Google, because soon it will be competing with everyone in the enterprise, including big boys like EMC/Greenplum, IBM/Netezza, HP, and Microsoft.

David Floyer, Chief Technology Officer and head of research at Wikibon.org, wrote a great research paper today called "Google and VMware Provide Virtualization of Hadoop and Big Data." David addressed the Google (and VMware) angle in that piece.

If you’re interested in what Google is doing in Big Data you have to read the Wikibon research.

http://wikibon.org/wiki/v/Google_and_VMware_Provide_Virtualization_of_Hadoop_and_Big_Data

Google Compute Engine Review (source: Wikibon.org)

At the 2012 Google I/O conference, Google announced Compute Engine. This provides 700,000 virtual cores that users can spin up and tear down very rapidly for big data applications in general, and MapReduce and Hadoop in particular, all without setting up any data center infrastructure. The service works with the Google Cloud Storage service to provide the data; the data is encrypted at rest. This is a different service from Google App Engine, but complementary to it.

Compute Engine uses the KVM hypervisor on top of the Linux operating system. In discussions with Wikibon, Google pointed out the improvements it had made to the open-source KVM code to improve performance and security in a multi-core, multi-thread Intel CPU environment. This allows virtual cores (one thread, one core) to be used as the building block for spinning up very efficient virtual machines.

To help with data ingestion, Google is offering access to the full resources of Google's private networks. This enables ingested data to be moved across the network at very high speed and replicated to a specific data center. The location(s) can be defined, allowing compliance with country or regional requirements to retain data in-country. If the user can bring the data cost-effectively and with sufficient bandwidth to a Google edge location, the Google network services take over from there.

The Google Hadoop service can utilize the MapR framework in a similar way to the MapR service for Amazon, which provides improved availability and management components. John Schroeder, CEO and founder of MapR, presented a demonstration running Terasort on a 5,024-core Hadoop cluster with 1,256 disks on the Google Compute Engine service. This completed in 1 minute 20 seconds, at a total cost of $16. He compared this with a 1,460 physical-server environment with over 11,000 cores, which would take months to set up and would cost over $5 million.
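For a rough sense of what that demo implies, here is our own back-of-the-envelope arithmetic based on the figures quoted above (not pricing published by Google or MapR):

```python
# Back-of-the-envelope check of the Terasort demo numbers quoted above.
cores = 5024          # virtual cores in the demo cluster
runtime_s = 80        # 1 minute 20 seconds
total_cost = 16.0     # dollars, as quoted

core_hours = cores * runtime_s / 3600
print(f"core-hours consumed: {core_hours:.1f}")                       # ~111.6
print(f"implied cost per core-hour: ${total_cost / core_hours:.3f}")  # ~$0.143
```

At roughly 14 cents per core-hour for a fully assembled 5,000-core cluster, the contrast with a multi-month, $5 million physical build-out is stark.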

As a demonstration this was impressive. Of course, Terasort is a highly CPU-intensive workload that can be effectively parallelized and utilizes cores very efficiently. Other benchmark results that include more IO-intensive use of Google Cloud Storage are needed to confirm that the service is of universal value.

Wikibon also discussed whether Google would provide other data services to allow joining of corporate data with Google-derived and Google-provided datasets. Google indicated that it understood the potential value of such a service and was aware that other providers (e.g., Microsoft Azure) are already offering them. Wikibon expects that data services of this type will be introduced by Google.

There is no doubt that Google is seriously addressing the big data market and wants to compete seriously in the enterprise space. The Google network, data replication, and encryption services reflect this drive to compete strongly with Amazon.

 

Article from JOHN FURRIER @ SiliconANGLE


Forbes: VMware Can Disrupt Big Data Analytics With Cetas Acquisition

Trefis Team, Contributor

VMware, the leading virtualization company, has acquired Cetas, an early-stage startup focused on making access to big data analytics easier and cheaper. Terms of the deal haven't been disclosed yet. [1] VMware competes with Microsoft, Oracle and Citrix in the virtualization space.

Pure Play Cloud Platform

Unlike most big data applications, Cetas' software is designed to run on virtual resources like Amazon Web Services and VMware's vSphere. Because there is no need to sell physical servers along with the software, the product is easier to scale and cheaper to use. This acquisition makes sense for VMware, as the application is deployed on vSphere.

This makes the business model Analytics-as-a-Service, which opens it up to small and medium businesses. Deployed in the cloud, the application is not only cheaper but also potentially much faster and more easily scalable, as its primary resources are virtual.

Cetas will continue to operate as a startup under the VMware umbrella and plans to integrate the software more tightly into the VMware product suite.

More Than Just Big Data Analytics

The leading providers of big data analytics are Hewlett-Packard and International Business Machines, but VMware's foray into this space might not be aimed at competing with the leading players. The strategic target of these acquisitions is most likely Amazon and OpenStack. The Platform-as-a-Service offering can be greatly enhanced if it comes with a built-in analytics tool such as the one provided by Cetas.

Integrating this into vSphere makes the bundled offering cheaper to run than, say, deploying an application on Amazon Web Services and buying an analytics engine on top of that. Integration can be an issue as well, so the combined offering will seem more attractive to most buyers. [2] We expect sales of the platforms to improve because of this acquisition, though it is unlikely that analytics will be a major source of revenue in the short term.

We have a $109 Trefis price estimate for VMware, which is slightly above the current market price.

Originally posted on forbes.com:

http://www.forbes.com/sites/greatspeculations/2012/05/11/vmware-can-disrupt-big-data-analytics-with-cetas-acquisition/

VMware buys big data startup Cetas

VMware has acquired Cetas, a Palo Alto, Calif.-based big data startup that provides analytics atop the Hadoop platform. Terms of the deal haven’t been disclosed yet, but Cetas is an 18-month-old company with tens of paying customers, including within the Fortune 1000, that didn’t need to rush into an acquisition. So, why did VMware make such a compelling offer?

Because VMware is all about applications, and big data applications are the next big thing. Hypervisor virtualization is the foundation of everything VMware does, but it’s just that — the foundation. VMware can only become the de facto IT platform within enterprise data centers if applications can run atop those virtualized servers.

That's why VMware bought SpringSource, GemStone and WaveMaker, then actual application providers Socialcast and SlideRocket. It's why VMware developed vFabric and created the Cloud Foundry platform-as-a-service project and service to make it as easy as possible to develop and run applications.

Cetas deployed on-premise

Cetas is the logical next step, a big data application that’s designed to run on virtual resources – specifically Amazon Web Services and VMware’s vSphere. In fact, Co-Founder and CEO Muddu Sudhakar told me, its algorithms were designed with elasticity in mind. Jobs consume resources while they’re running and then the resources go away, whether the software is running internally or in the cloud. There’s no need to sell physical servers along with the software.

It doesn't hurt, either, that Cetas can help VMware bring big data to bear on its own infrastructure software. As Splunk's huge IPO illustrated, there's a real appetite for analytics around operational data coming from both virtual machines and their physical hosts. In this regard, Cetas will be like a data layer that sits atop virtual servers, application platforms and the applications themselves, providing analytics on everything.

Sudhakar said this type of operational analysis is one of Cetas' sweet spots, along with online analytics a la Google and Facebook, and enterprise analytics. The product includes many algorithms and analytics tools designed for those specific use cases out of the box (it even surfaces some insights automatically), but it also allows skilled users to build custom jobs.

Going forward, Sudhakar said Cetas will continue to operate as a startup under the VMware umbrella — which means little will change for its customers or business model — while also working to integrate the software more tightly with the VMware family.

Redis: new disk storage to replace VM

Redis decided to implement "diskstore", which works as described below (a rough illustrative sketch of the caching logic follows the list):

 

– In diskstore, key-value pairs are stored on disk.

– Memory works as a cache for live objects. Operations are only performed on in-memory keys, so data on disk does not need to be stored in complex forms.

– The cache-max-memory limit is strict. Redis will never use more RAM, even with 2 MB of max memory and 1 billion keys. This works because keys no longer need to be held in memory.

– Data is flushed to disk asynchronously. If a key is marked as dirty, an IO operation is scheduled for that key.

– You can control the delay between modifications of a key and disk writes, so that a key modified many times in a short period is written to disk only once.

– Setting the delay to 0 means: sync as fast as possible.

– All I/O is performed by a single dedicated thread that is long-running and not spawned on demand. The thread is awakened via a condition variable.

– The system is much simpler and saner than the VM implementation, as there is no need to "undo" operations on race conditions.

– Zero start-up time, as objects are loaded on demand.

– There is negative caching: if a key is not on disk, we remember it (if there is memory to do so), so we avoid hitting the disk again and again for keys that are not there.

– The system is very fast if we mostly access our working set, and that working set happens to fit in memory. Otherwise the system is much slower (I/O bound).

– The system does not support BGSAVE currently, but it will, and what is cool is that it will do so with minimal overhead and memory usage in the saving child, as data on disk is already written using the same serialization format as .rdb files. So the child will just copy files to obtain the .rdb. In the meantime the objects in cache are not flushed, so the system may use more memory, but since it does not rely on copy-on-write it will use very, very little additional memory.

– Persistence is *PER KEY*; this means there is no point-in-time persistence.
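The description above maps cleanly onto a small write-back cache. Below is a minimal, illustrative Python sketch of that logic, a toy model and not Redis source code: the class name, the parameter names, the dict standing in for the disk, and the 50 ms polling interval are all our own assumptions. It shows the strict memory cap, dirty-key coalescing with a configurable flush delay, the single long-running IO thread woken by a condition variable, and negative caching:

```python
# A toy model of the diskstore design described above -- not Redis source code.
import threading
import time


class DiskStoreCache:
    def __init__(self, backing, max_keys=1024, flush_delay=1.0):
        self.backing = backing          # stand-in for the on-disk key-value store
        self.max_keys = max_keys        # strict cap on keys held in memory
        self.flush_delay = flush_delay  # coalescing window; 0 = sync ASAP
        self.cache = {}                 # live objects in memory
        self.dirty = {}                 # key -> time it first became dirty
        self.negative = set()           # keys known NOT to exist on disk
        self.cond = threading.Condition()
        # Single long-running IO thread, awakened via the condition variable.
        threading.Thread(target=self._io_loop, daemon=True).start()

    def set(self, key, value):
        with self.cond:
            self._evict_if_needed()
            self.cache[key] = value
            self.negative.discard(key)
            # Record only the FIRST modification time: many writes to the same
            # key inside flush_delay coalesce into a single disk write.
            self.dirty.setdefault(key, time.monotonic())
            self.cond.notify()

    def get(self, key):
        with self.cond:
            if key in self.cache:
                return self.cache[key]
            if key in self.negative:    # negative cache: skip the disk entirely
                return None
        value = self.backing.get(key)   # load on demand (zero start-up time)
        with self.cond:
            if value is None:
                self.negative.add(key)
            else:
                self._evict_if_needed()
                self.cache[key] = value
        return value

    def _evict_if_needed(self):
        # Enforce the strict memory limit; only clean keys may be dropped,
        # so dirty data is never lost. (Real code would pick victims smarter.)
        while len(self.cache) >= self.max_keys:
            clean = [k for k in self.cache if k not in self.dirty]
            if not clean:
                break                   # all dirty: the flusher must catch up
            del self.cache[clean[0]]

    def _io_loop(self):
        while True:
            with self.cond:
                now = time.monotonic()
                due = [k for k, t in self.dirty.items()
                       if now - t >= self.flush_delay]
                if not due:
                    self.cond.wait(timeout=0.05)   # crude sleep; real code
                    continue                       # would compute the deadline
                batch = {k: self.cache[k] for k in due}
                for k in due:
                    del self.dirty[k]
            self.backing.update(batch)  # the actual "disk" write, off-lock
```

With flush_delay greater than zero, a key hammered with writes lands on "disk" only once per window; setting it to 0 makes the IO thread pick dirty keys up on its next wake-up, i.e., "sync as fast as possible".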

 

Original post from Salvatore Sanfilippo available here