eBay.com summarized with pretty large numbers

Hugh E. Williams work in the Marketplaces business at eBay. He recently post an article with some insight technologies and numbers.

Few facts about our scale and size. I thought I’d share some with you:
  • We have over 10 petabytes of data stored in our Hadoop and Teradata clusters. Hadoop is primarily used by engineers who use data to build products, and Teradata is primarily used by our finance team to understand our business
  • We have over 300 million items for sale, and over a billion accessible at any time (including, for example, items that are no longer for sale but that are used by customers for price research)
  • We process around 250 million user queries per day (which become many billions of queries behind the scenes – query rewriting implies many calls to search to provide results for a single user query, and many other parts of our system use search for various reasons)
  • We serve over 2 billion pages to customers every day
  • We have over 100 million active users
  • We sold over US$68 billion in merchandize in 2011
  • We make over 75 billion database calls each day (our database tables are denormalized because doing relational joins at our scale is often too slow – and so we precompute and store the results, leading to many more queries that take much less time each)

Full article available here

Google is definitely in the Big Data business

At  Google IO developer conference in San Francisco last month, Google introduced a slew of new products including Google Compute Engine.  What wasn’t talked was Google’s big data play.   Google is definitely in the Big Data business and Google will be the 800lb gorilla in the space.

Google has two major things going for it.  1) Google has an amazing infrastructure and network inside their core operations; 2) Google owns lots of data lets just say about 90% of the worlds data including information and people.

Google’s infrastructure strength and direction with big data will shape not only applications but the enterprise business.  Why?  Because Google can provide infrastructure and data to anyone who wants it.

Watch out for Google because soon they will be competing with the everyone in the enterprise including the big boys like EMC/Greenplum, IBM/Netezza, HP, Microsoft, and everyone else.

David Floyer, Chief Technology Officer and head of research at Wikibon.org, wrote a great research paper today called Google and VMware Provide Virtualization of Hadoop and Big Data.   David addressed the Google (and VMware) angle in that piece.

If you’re interested in what Google is doing in Big Data you have to read the Wikibon research.


Google Compute Engine Review: source: Wikibon.org  

At the 2012 Google I/O conference, Google announced Compute engine. This provides 700,000 virtual cores to be available for users to spin up and tear down very rapidly for big data application in general, and MapReduce and Hadoop in particular. All without setting up any data center infrastructure. This service works with Google Cloud Storage service to provide the data; the data is encrypted at rest. This is a different service than the Google App service, but complementary.

Compute Engine uses the KVM hypervisor on top of the Linux operating system. In discussions with Wikibon, Google pointed out the improvements that they had made to the open source KVM code to improve performance and security in a multi-core multi-thread Intel CPU environment. This allows virtual cores (one thread, one core) to be used as the building block for spinning up very efficient virtual machines.

To help with data ingestion, Google are offering access to the full resources of Google’s Private Networks. This enables a large scale ability to move ingested data across the network at very high speed, and allows replication to a specific data center. The location(s) can be defined, allowing compliance with specific country or regional requirements to retain data within country. If the user can bring the data cost effectively and with sufficient bandwidth to a Google Edge, the Google network services will take over.

The Google Hadoop service can utilize the MapR framework in a similar way to the MapR service for Amazon. This provides improved availability and management components. John Schroeder, CEO and founder of MapR, presented a demonstration running Terasort on a 5,024 core Hadoop cluster with 1256 disks on the Google Compute Engine service. This completed in 1:20 seconds, at a total cost of $16. He compared this with a 1,460 physical server environment with over 11,000 cores, which would take months to set up and would cost over $5million dollars.

As a demonstration this was impressive. Of course, Terasort is a highly CPU intensive environment which can be effectively parallelized, and utilizes cores very efficiently. Other benchmark results which include more IO intensive use of the Google Cloud Storage are necessary to confirm that the service is of universal value.

Wikibon also discussed whether Google would provide other data services to allow joining of corporate data with other Google-derived and provided datasets. Google indicated that they understood the potential value of this service and understood that other service providers were offering these services (e.g., Microsoft Azure). Wikibon expects that data services of this type will be introduced by Google.

There is no doubt that Google is seriously addressing the big data market, and wanting to compete seriously in the enterprise space. The Google network services, data replication services and encryption services reflect this drive to compete strongly with Amazon.


Article from JOHN FURRIER  @ SiliconANGLE



Hadoop Streaming Support for MongoDB

From MongoDB blog, 10gen announce it has some native data processing tools, such as the built-in Javascript-oriented MapReduce framework, and a new Aggregation Framework in MongoDB v2.2. That said, there will always be a need to decouple persistance and computational layers when working with Big Data.

Enter MongoDB+Hadoop: an adapter that allows Apache’s Hadoop platform to integrate with MongoDB.

Using this adapter, it is possible to use MongoDB as a real-time datastore for your application while shifting large aggregation, batch processing, and ETL workloads to a platform better suited for the task.


Well, the engineers at 10gen have taken it one step further with the introduction of the streaming assembly for Mongo-Hadoop.

What does all that mean?

The streaming assembly lets you write MapReduce jobs in languages like Python, Ruby, and JavaScript instead of Java, making it easy for developers that are familiar with MongoDB and popular dynamic programing languages to leverage the power of Hadoop.


It works like this:

Once a developer has Java installed and Hadoop ready to rock they download and build the adapter. With the adapter built, you compile the streaming assembly, load some data into Mongo, and get down to writing some MapReduce jobs.

The assembly streams data from MongoDB into Hadoop and back out again, running it through the mappers and reducers defined in a language you feel at home with. Cool right?

Ruby support was recently added and is particularly easy to get started with. Lets take a look at an example where we analyze twitter data.

Import some data into MongoDB from twitter:

curl https://stream.twitter.com/1/statuses/sample.json -u<login>:<password> | mongoimport -d twitter -c in
view rawimport.shThis Gist brought to you by GitHub.

Next, write a Mapper and save it in a file called mapper.rb:

#!/usr/bin/env ruby
require ‘mongo-hadoop’
MongoHadoop.map do |document|
  { :_id => document[‘user’][‘time_zone’], :count => 1 }
view rawmapper.rbThis Gist brought to you by GitHub.

Now, write a Reducer and save it in a file called reducer.rb:

#!/usr/bin/env ruby
require ‘mongo-hadoop’
MongoHadoop.reduce do |key, values|
  count = sum = 0
  values.each do |value|
    count += 1
    sum += value[‘num’]
  { :_id => key, :average => sum / count }
view rawreducer.rbThis Gist brought to you by GitHub.

To run it all, create a shell script that executes hadoop with the streaming assembly jar and tells it how to find the mapper and reducer files as well as where to retrieve and store the data:

hadoop jar mongo-hadoop-streaming-assembly*.jar -mapper mapper.rb -reducer reducer.rb -inputURI mongodb:// -outputURI mongodb://
view rawtwit.shThis Gist brought to you by GitHub.

Make them all executable by running chmod +x on the all the scripts and run twit.sh to have hadoop process the job.

Step by step Oracle NoSQL Database with Hadoop

From Oracle, a step by step article to play round Oracle NoSQL database with Hadoop.

Introduced in 2011, Oracle NoSQL Database is a highly available, highly scalable, key/value storage based (nonrelational) database that provides support for CRUD operations via a Java API. A related technology, the Hadoop MapReduce framework, provides a distributed environment for developing applications that process large quantities of data in parallel on large clusters.

In this article we discuss integrating Oracle NoSQL Database with Hadoop on Windows OS via an Oracle JDeveloper project (download). We will also demonstrate processing the NoSQL Database data in Hadoop using a MapReduce job.


Access the Oracle’s official tutorial

10 companies using Hadoop

Cloudera COO Kirk Dunn I’ve recently uncovered a few ways companies are using Hadoop, here are 10 uses cases:
  1. Online travel. Dunn noted that Cloudera’s Hadoop distribution currently powers about 80 percent of all online travel booked worldwide. He didn’t mention users by name, but last year I covered how one of those customers, Orbitz Worldwide, uses Hadoop.
  2. Mobile data. This another of Dunn’s anonymous statistics — that Cloudera powers “70 percent of all smartphones in the U.S.” I assume he’s talking about the storage and processing of mobile data by wireless providers, and a little market-share math probably could help one pinpoint the customers.
  3. E-commerce. More anonymity, but Dunn says Cloudera powers more than 10 million online merchants in the United States. Dunn said one large retailer (I assume eBay, which is a major Hadoop user and manages a large marketplace of individual sellers that would help account for those 10-plus million merchants) added 3 percent to its net profits after using Hadoop for just 90 days.
  4. Energy discovery. During a panel at Cloudera’s event, a Chevron representative explained just one of many ways his company uses Hadoop: to sort and process data from ships that troll the ocean collecting seismic data that might signify the presence of oil reserves.
  5. Energy savings. At the other end of the spectrum from Chevron is Opower, whichuses Hadoop to power its service that suggests ways for consumers to save money on energy bills. A representative on the panel noted that certain capabilities, such as accurate and long-term bill forecasting were hardly feasible without Hadoop.
  6. Infrastructure management. This is a rather common use case, actually, as more companies (including Etsy, which I profiled recently) are gathering and analyzing data from their servers, switches and other IT gear. At the Cloudera event, a NetApp  rep noted how his company collects device logs (it has more than a petabyte worth at present) from its entire install base and stores them in Hadoop.
  7. Image processing. A startup called Skybox Imaging is using Hadoop to store and process images from the high-definition images its satellites will regularly capture as they attempt to detect patterns of geographic change. Skybox recently raised $70 million for its efforts.
  8. Fraud detection. This is another oldie but goodie, used by both financial services organizations and intelligence agencies. One of those users, Zions Bancorporation,explained to me recently how a move to Hadoop lets it store all the data it can on customer transactions and spot anomalies that might suggest fraudulent behavior.
  9. IT security. As with infrastructure management, companies also use Hadoop to process machine-generated data that can identify malware and cyber attack patterns. Last year, we told the story of ipTrust, which uses Hadoop to assign reputation scores to IP address, which lets other security products decide whether to accept traffic from those sources.
  10. Health care. I suspect there are many ways Hadoop can benefit health care practitioners, but one of them goes back to its search roots. Last year, I profiled Apixio, which uses Hadoop to power its service that leverages semantic analysis to provide doctors, nurses and others more-relevant answers to their questions about patients’ health.

Cascading 2.0 has been released

Cascading is a Java application framework that enables typical developers to quickly and easily develop rich Data Analytics and Data Management applications that can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop 1.0 and API compatible distributions.


Cascading 2.0 is now publicly available for download. This release includes a number of new features.

  • Apache 2.0 Licensing
  • Support for Hadoop 1.0.2
  • Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
  • HashJoin pipe for “map side joins”
  • Merge pipe for “map side merges”
  • Simple Checkpointing for capturing intermediate data as a file
  • Improved Tap and Scheme APIs

Manage your Hadoop cluster

You can get into the fun part of actually processing and analyzing big data with Hadoop, you have to configure, deploy and manage your cluster. It’s neither easy nor glamorous — data scientists get all the love — but it is necessary. Here are five tools (not from commercial distribution providers such as Cloudera or MapR) to help you do it.


  • Apache Ambari

Apache Ambari is an open source project for monitoring, administration and lifecycle management for Hadoop. It’s also the project that Hortonworks has chosen as the management component for the Hortonworks Data Platform. Ambari works with Hadoop MapReduce, HDFS, HBase, Pig, Hive, HCatalog and Zookeeper.

  • Apache Mesos

Apache Mesos is a cluster manager that lets users run multiple Hadoop jobs, or other high-performance applications, on the same cluster at the same time.According to Twitter Open Source Manager Chris Aniszczyk, Mesos “runs on hundreds of production machines and makes it easier to execute jobs that do everything from running services to handling our analytics workload.”

  • Platform MapReduce

Platform MapReduce is high-performance computing expert Platform Computing’s entre into the big data space. It’s a runtime environment that supports a variety of MapReduce applications and file systems, not just those directly associated with Hadoop, and is tuned for enterprise-class performance and reliability. Platform, now part of IBM, built a respectable business managing clusters for large financial services institutions.

  • StackIQ Rocks+ Big Data

StackIQ Rock+ Big Data is a commercial distribution of the Rocks cluster management software that the company has beefed up to also support Apache Hadoop. Rocks+ supports the Apache, Cloudera, Hortonworks and MapR distributions, and handles the entire process from configuring bare metal servers to managing an operational Hadoop cluster.

  • Zettaset Orchestrator

Zettaset Orchestrator is an end-to-end Hadoop management product that supports multiple Hadoop distributions. Zettaset touts Orchestrator’s UI-based experience and its ability to handle what the company calls MAAPS — management, availability, automation, provisioning and security. At least one large company, Zions Bancorporation, is a Zettaset customer.


What is Hadoop in 3 minutes video

What is Hadoop with Rafael Coss, “Manager Big Data Enablement”



Apache Hadoop 2.0 Alpha has been released

Apache Hadoop community has just released Apache Hadoop 2.0.0 (alpha)

While only an alpha release (read: not ready to run in production), it is still an important step forward as it represents the very first release that delivers new and important capabilities, including:

In addition to these new capabilities, there are several planned enhancements that are on the way from the community, includingHDFS Snapshots and auto-failover for HA NameNode, along with further improvements to the stability and performance with the next generation of MapReduce (YARN). There are definitely good times ahead.

Again, please note that the Apache Hadoop community has decided to use the alpha moniker for this release since it is a preview release that is not yet ready for production deployments for the following reasons:

  • We still need to iterate over some of the APIs (especially with the switch to protobufs) before we declare them stable, i.e. something that can be supported over the long run in a compatible manner.
  • Several features including HDFS HA, NextGen MapReduce et al need a lot more testing and validation before they are ready for prime time.
  • While we are excited about the progress made for supporting HA for HDFS, auto-failover for HDFS NameNode and HA for NextGen MapReduce are still a work-in-progress.

Please visit the Apache Hadoop Releases page to download hadoop-2.0.0-alpha and visit the Documentation page for more information.

Hadoop in a Microsoft environment

Microsoft announced a partnership with Hortonworks last year to bring Hadoop to Windows Server and Windows Azure. Microsoft’s vision revolves around making Hadoop and related Big Data tools trivially accessible to the regular IT end-user and to this end it integrates with SQL Server Analysis and Reporting Services as well as Excel PowerPivot.

Here are some resources to use Hadoop in a Microsoft environment