Big Data top paying skills

 According to kdnuggets the Big Data related skills led the list of top paying technical skills (six-figure salaries) in 2013.

The study focus on  technology professionals in the U.S. who enjoyed raises over the last year(2013).

Average U.S. tech salaries increased nearly three percent to $87,811 in 2013, up from $85,619 the previous year.Technology professionals understand they can easily find ways to grow their career in 2014, with two-thirds of respondents (65%) confident in finding a new, better position. That overwhelming confidence matched with declining salary satisfaction (54%, down from 57%) will keep tech-powered companies on edge about their retention strategies.

Companies are willing to pay hefty amounts to professionals with Big Data skills.

According to a report released on Jan 29, 2014 an average salary for a professional having knowledge and experience in programming language R was $115,531 in year 2013. 

Other Big Data oriented skills such as NoSQL, MapReduce, Cassandra, Pig, Hadoop, MongoDB are among top 10 paying skills. 


Source: kdnuggets

Hadoop Streaming Support for MongoDB

From MongoDB blog, 10gen announce it has some native data processing tools, such as the built-in Javascript-oriented MapReduce framework, and a new Aggregation Framework in MongoDB v2.2. That said, there will always be a need to decouple persistance and computational layers when working with Big Data.

Enter MongoDB+Hadoop: an adapter that allows Apache’s Hadoop platform to integrate with MongoDB.

Using this adapter, it is possible to use MongoDB as a real-time datastore for your application while shifting large aggregation, batch processing, and ETL workloads to a platform better suited for the task.


Well, the engineers at 10gen have taken it one step further with the introduction of the streaming assembly for Mongo-Hadoop.

What does all that mean?

The streaming assembly lets you write MapReduce jobs in languages like Python, Ruby, and JavaScript instead of Java, making it easy for developers that are familiar with MongoDB and popular dynamic programing languages to leverage the power of Hadoop.


It works like this:

Once a developer has Java installed and Hadoop ready to rock they download and build the adapter. With the adapter built, you compile the streaming assembly, load some data into Mongo, and get down to writing some MapReduce jobs.

The assembly streams data from MongoDB into Hadoop and back out again, running it through the mappers and reducers defined in a language you feel at home with. Cool right?

Ruby support was recently added and is particularly easy to get started with. Lets take a look at an example where we analyze twitter data.

Import some data into MongoDB from twitter:

curl -u<login>:<password> | mongoimport -d twitter -c in
view rawimport.shThis Gist brought to you by GitHub.

Next, write a Mapper and save it in a file called mapper.rb:

#!/usr/bin/env ruby
require ‘mongo-hadoop’ do |document|
  { :_id => document[‘user’][‘time_zone’], :count => 1 }
view rawmapper.rbThis Gist brought to you by GitHub.

Now, write a Reducer and save it in a file called reducer.rb:

#!/usr/bin/env ruby
require ‘mongo-hadoop’
MongoHadoop.reduce do |key, values|
  count = sum = 0
  values.each do |value|
    count += 1
    sum += value[‘num’]
  { :_id => key, :average => sum / count }
view rawreducer.rbThis Gist brought to you by GitHub.

To run it all, create a shell script that executes hadoop with the streaming assembly jar and tells it how to find the mapper and reducer files as well as where to retrieve and store the data:

hadoop jar mongo-hadoop-streaming-assembly*.jar -mapper mapper.rb -reducer reducer.rb -inputURI mongodb:// -outputURI mongodb://
view rawtwit.shThis Gist brought to you by GitHub.

Make them all executable by running chmod +x on the all the scripts and run to have hadoop process the job.

Manage your Hadoop cluster

You can get into the fun part of actually processing and analyzing big data with Hadoop, you have to configure, deploy and manage your cluster. It’s neither easy nor glamorous — data scientists get all the love — but it is necessary. Here are five tools (not from commercial distribution providers such as Cloudera or MapR) to help you do it.


  • Apache Ambari

Apache Ambari is an open source project for monitoring, administration and lifecycle management for Hadoop. It’s also the project that Hortonworks has chosen as the management component for the Hortonworks Data Platform. Ambari works with Hadoop MapReduce, HDFS, HBase, Pig, Hive, HCatalog and Zookeeper.

  • Apache Mesos

Apache Mesos is a cluster manager that lets users run multiple Hadoop jobs, or other high-performance applications, on the same cluster at the same time.According to Twitter Open Source Manager Chris Aniszczyk, Mesos “runs on hundreds of production machines and makes it easier to execute jobs that do everything from running services to handling our analytics workload.”

  • Platform MapReduce

Platform MapReduce is high-performance computing expert Platform Computing’s entre into the big data space. It’s a runtime environment that supports a variety of MapReduce applications and file systems, not just those directly associated with Hadoop, and is tuned for enterprise-class performance and reliability. Platform, now part of IBM, built a respectable business managing clusters for large financial services institutions.

  • StackIQ Rocks+ Big Data

StackIQ Rock+ Big Data is a commercial distribution of the Rocks cluster management software that the company has beefed up to also support Apache Hadoop. Rocks+ supports the Apache, Cloudera, Hortonworks and MapR distributions, and handles the entire process from configuring bare metal servers to managing an operational Hadoop cluster.

  • Zettaset Orchestrator

Zettaset Orchestrator is an end-to-end Hadoop management product that supports multiple Hadoop distributions. Zettaset touts Orchestrator’s UI-based experience and its ability to handle what the company calls MAAPS — management, availability, automation, provisioning and security. At least one large company, Zions Bancorporation, is a Zettaset customer.


MapReduce for the masses

Great initiative from , called “MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl”

I’ve duplicate, hereafter, their original blogpost from:


Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information into the Amazon cloud, the net total of 5 billion crawled pages. In this blog post, we’ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.

When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.

With the advent of the Hadoop project, it became possible for those outside the Googleplex to tap into the power of the MapReduce pattern, but one outstanding question remained: where do we get the source data to feed this unbelievably powerful tool?

This is the very question we hope to answer with this blog post, and the example we’ll use to demonstrate how is a riff on the canonical Hadoop Hello World program, a simple word counter, but the twist is that we’ll be running it against the Internet.

When you’ve got a taste of what’s possible when open source meets open data, we’d like to whet your appetite by asking you to remix this code. Show us what you can do with Common Crawl and stay tuned as we feature some of the results!

Ready to get started?  Watch our screencast and follow along below:

Step 1 – Install Git and Eclipse

We first need to install a few important tools to get started:

Eclipse (for writing Hadoop code)

How to install (Windows and OS X):

Download the “Eclipse IDE for Java developers” installer package located at:

How to install (Linux):

Run the following command in a terminal:


 # sudo yum install eclipse


 # sudo apt-get install eclipse

Git (for retrieving our sample application)

How to install (Windows)

Install the latest .EXE from:

How to install (OS X)

Install the appropriate .DMG from:

How to install (Linux)

Run the following command in a terminal:


# sudo yum install git


# sudo apt-get install git

Step 2 – Check out the code and compile the HelloWorld JAR

Now that you’ve installed the packages you need to play with our code, run the following command from a terminal/command prompt to pull down the code:

# git clone git://

Next, start Eclipse.  Open the File menu then select “Project” from the “New” menu.  Open the “Java” folder and select “Java Project from Existing Ant Buildfile”.  Click Browse, then locate the folder containing the code you just checked out (if you didn’t change the directory when you opened the terminal, it should be in your home directory) and select the “build.xml” file.  Eclipse will find the right targets, and tick the “Link to the buildfile in the file system” box, as this will enable you to share the edits you make to it in Eclipse with git.

We now need to tell Eclipse how to build our JAR, so right click on the base project folder (by default it’s named “Hello World”) and select “Properties” from the menu that appears.  Navigate to the Builders tab in the left hand panel of the Properties window, then click “New”.  Select “Ant Builder” from the dialog which appears, then click OK.

To configure our new Ant builder, we need to specify three pieces of information here: where the buildfile is located, where the root directory of the project is, and which ant build target we wish to execute.  To set the buildfile, click the “Browse File System” button under the “Buildfile:” field, and find the build.xml file which you found earlier.  To set the root directory, click the “Browse File System” button under the “Base Directory:” field, and select the folder into which you checked out our code.  To specify the target, enter “dist” without the quotes into the “Arguments” field.  Click OK and close the Properties window.

Finally, right click on the base project folder and select “Build Project”, and Ant will assemble a JAR, ready for use in Elastic MapReduce.

Step 3 – Get an Amazon Web Services account (if you don’t have one already) and find your security credentials

If you don’t already have an account with Amazon Web Services, you can sign up for one at the following URL:

Once you’ve registered, visit the following page and copy down your Access Key ID and Secret Access Key:

This information can be used by any Amazon Web Services client to authorize things that cost money, so be sure to keep this information in a safe place.

Step 4 – Upload the HelloWorld JAR to Amazon S3

Uploading the JAR we just built to Amazon S3 is a lot simpler than it sounds. First, visit the following URL:

Next, click “Create Bucket”, give your bucket a name, and click the “Create” button. Select your new S3 bucket in the left-hand pane, then click the “Upload” button, and select the JAR you just built. It should be located here:

<your checkout dir>/dist/lib/HelloWorld.jar

Step 5 – Create an Elastic MapReduce job based on your new JAR

Now that the JAR is uploaded into S3, all we need to do is to point Elastic MapReduce to it, and as it so happens, that’s pretty easy to do too! Visit the following URL:

and click the “Create New Job Flow” button. Give your new flow a name, and tick the “Run your own application” box. Select “Custom JAR” from the “Choose a Job Type” menu and click the “Continue” button.

The next field in the wizard will ask you which JAR to use and what command-line arguments to pass to it. Add the following location:

s3n://<your bucket name>/HelloWorld.jar

then add the following arguments to it:

org.commoncrawl.tutorial.HelloWorld <your aws secret key id> <your aws secret key>2010/01/07/18/1262876244253_18.arc.gz s3n://<your bucket name>/helloworld-out

CommonCrawl stores its crawl information as GZipped ARC-formatted files (, and each one is indexed using the following strategy:

/YYYY/MM/DD/the hour that the crawler ran in 24-hour format/*.arc.gz

Thus, by passing these arguments to the JAR we uploaded, we’re telling Hadoop to:

1. Run the main() method in our HelloWorld class (located at org.commoncrawl.tutorial.HelloWorld)

2. Log into Amazon S3 with your AWS access codes

3. Count all the words taken from a chunk of what the web crawler downloaded at 6:00PM on January 7th, 2010

4. Output the results as a series of CSV files into your Amazon S3 bucket (in a directory called helloworld-out)

Edit 12/21/11: Updated to use directory prefix notation instead of glob notation (thanks Petar!)

If you prefer to run against a larger subset of the crawl, you can use directory prefix notation to specify a more inclusive set of data. For instance:

2010/01/07/18 – All files from this particular crawler run (6PM, January 7th 2010)

2010/ - All crawl files from 2010

Don’t worry about the continue fields for now, just accept the default values. If you’re offered the opportunity to use debugging, I recommend enabling it to be able to see your job in action. Once you’ve clicked through them all, click the “Create Job Flow” button and your Hadoop job will be sent to the Amazon cloud.

Step 6 – Watch the show

Now just wait and watch as your job runs through the Hadoop flow; you can look for errors by using the Debug button. Within about 10 minutes, your job will be complete. You can view results in the S3 Browser panel, located here. If you download these files and load them into a text editor, you can see what came out of the job. You can take this sort of data and add it into a database, or create a new Hadoop OutputFormat to export into XML which you can render into HTML with an XSLT, the possibilities are pretty much endless.

Step 7 – Start playing!

If you find something cool in your adventures and want to share it with us, we’ll feature it on our site if we think it’s cool too. To submit a remix, push your codebase to GitHub or Gitorious and send a message to our user group about it: we promise we’ll look at it.

Tenzing – Hive done the Google way

Impressive it is, the Hive done in the Google way.

Google has publish a paper describing the architecture and implementation of Tenzing, a query engine built on top of MapReduce.


Tenzing is a query engine built on top of MapReduce for ad hoc analysis of Google data. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility. Tenzing is currently used internally at Google by 1000+ employees and serves 10000+ queries per day over 1.5 petabytes of compressed data. In this paper, we describe the architecture and implementation of Tenzing, and present benchmarks of typical analytical queries.
Citation: “Tenzing A SQL Implementation On The MapReduce Framework”, Biswapesh Chattopadhyay, Liang Lin, Weiran Liu, Sagar Mittal, Prathyusha Aragonda, Vera Lychagina, Younghee Kwon, Michael Wong, Proceedings of the VLDB Endowment, vol. 4 (2011), pp. 1318-1327.
[pdf] [search]

Solving your performance issue when dealing with big data

  1. Foresee and understand your performance issue

When dealing with big data, you will face performance problems  with the most simple and basic operation as soon as the processing require the whole data sets to be analyzed.

It is the case for instance when:

  • You aggregate data, to deliver summary statistics: action such as “count”,”min”,”avg” etc…
  • You need to sort your data

This in mind, you can easily and quickly anticipated issues in advance and start thinking about solving the problem.

  1. Solving the performance issue using technical tools

Compression is often a key solution to many performance issue as its require CPU speed which is currently always faster than i/o disk and i/o networks, so compression allow to speed up disk access, data transfer over network and eventually allow to  keep reduced data in memory.

Statistics can often apply to fasten your algorithm and are not necessarily complex, maintaining values range (min,max) or values distribution might fasten your processing resolution path.

Caching, deterministic result are provided by process independant from the data or based on data which rarely changed and forwhich you can assume they won’t change during your process time

Avoid data type conversions, because it’s always resources consuming

Balance then loadparalellized the processing and use map reduce :)


  1. Solving the performance issue by giving-up or resigning

We tend to refuse such approach, but sometimes it is a good exercise to go back and review why we do the things we do.

Can i approximate without altering significantly the result ?

Can i use a representative data sample instead of the whole data ?

At least do not avoid to think this way, wondering if solving an easier problem or looking for approximate result can’t finally bring you very close to the solution.


Video: What is Hadoop? Other big data terms like MapReduce?

What is Hadoop? Other big data terms like MapReduce? Cloudera’s CEO talks us through big data:

Gratuitous Hadoop: Stress Testing on the Cheap

Hadoop Streaming is a powerful tool and there is a nice illustration on how cool it can be: available

Following our previous article Yahoo’s RealTime MapReduce: S4 being open source

Website and twitter account are now available



Yahoo's RealTime MapReduce: S4 being open source

Yahoo is making an strong move by pushing its more advanced technology available as an Open Sourced project. Yahoo’s RealTime MapReduce, called S4, is meant to be a real-time, distributed, fault-tolerant, scalable, event driven, expandable platform  and allows programmers to easily implement applications for processing continuous unbounded streams of data.This project has been approved for being Open Sourced andwill be available on Github at You should see the S4 codebase available to you soon (the code and website content is being staged by the team).
All details about S4 available here: