Hadoop Streaming Support for MongoDB

On the MongoDB blog, 10gen notes that MongoDB ships with native data processing tools, such as the built-in JavaScript MapReduce framework and the new Aggregation Framework in MongoDB v2.2. That said, there will always be a need to decouple the persistence and computational layers when working with Big Data.

Enter MongoDB+Hadoop: an adapter that allows Apache’s Hadoop platform to integrate with MongoDB.

Using this adapter, it is possible to use MongoDB as a real-time datastore for your application while shifting large aggregation, batch processing, and ETL workloads to a platform better suited for the task.


Well, the engineers at 10gen have taken it one step further with the introduction of the streaming assembly for Mongo-Hadoop.

What does all that mean?

The streaming assembly lets you write MapReduce jobs in languages like Python, Ruby, and JavaScript instead of Java, making it easy for developers who are familiar with MongoDB and popular dynamic programming languages to leverage the power of Hadoop.


It works like this:

Once you have Java installed and Hadoop ready to rock, download and build the adapter. With the adapter built, compile the streaming assembly, load some data into MongoDB, and get down to writing some MapReduce jobs.

The assembly streams data from MongoDB into Hadoop and back out again, running it through the mappers and reducers defined in a language you feel at home with. Cool right?

Ruby support was recently added and is particularly easy to get started with. Let's take a look at an example where we analyze Twitter data.

Import some data into MongoDB from Twitter:

curl https://stream.twitter.com/1/statuses/sample.json -u<login>:<password> | mongoimport -d twitter -c in

Next, write a Mapper and save it in a file called mapper.rb:

#!/usr/bin/env ruby
require 'mongo-hadoop'

# Emit each tweet's time zone as the key, with a count of 1.
MongoHadoop.map do |document|
  { :_id => document['user']['time_zone'], :count => 1 }
end
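The mapper's logic is easy to sanity-check outside Hadoop. Here's a minimal standalone sketch (no mongo-hadoop gem required) that applies the same transformation to a couple of hand-built documents; their shape mirrors Twitter's status JSON (`user` containing `time_zone`), which is an assumption about the imported data:

```ruby
# Hand-built stand-ins for documents as mongoimport would store them.
documents = [
  { 'user' => { 'time_zone' => 'Eastern Time (US & Canada)' } },
  { 'user' => { 'time_zone' => 'London' } }
]

# The same transformation the mapper block performs: one
# (time_zone, count: 1) pair per tweet.
emitted = documents.map do |document|
  { :_id => document['user']['time_zone'], :count => 1 }
end

emitted.each { |pair| puts pair.inspect }
```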

Now, write a Reducer and save it in a file called reducer.rb:

#!/usr/bin/env ruby
require 'mongo-hadoop'

# Sum the per-document counts emitted by the mapper, giving the
# total number of tweets per time zone.
MongoHadoop.reduce do |key, values|
  count = 0
  values.each do |value|
    count += value['count']
  end
  { :_id => key, :count => count }
end
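Between the two phases, Hadoop groups the emitted pairs by key, so the reducer sees one key plus every value emitted for it. Here's a standalone sketch of that step (again, no gem needed; the grouped input is hand-built, and the string keys reflect how values arrive after round-tripping through BSON):

```ruby
# What Hadoop would hand the reducer for one key after the shuffle:
# the key itself, plus every value the mapper emitted under it.
key = 'Eastern Time (US & Canada)'
values = [{ 'count' => 1 }, { 'count' => 1 }, { 'count' => 1 }]

# The same logic as the reducer block: sum the counts.
count = 0
values.each { |value| count += value['count'] }
result = { :_id => key, :count => count }

puts result.inspect
```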

To run it all, create a shell script that executes Hadoop with the streaming assembly jar and tells it how to find the mapper and reducer files as well as where to retrieve and store the data:

hadoop jar mongo-hadoop-streaming-assembly*.jar -mapper mapper.rb -reducer reducer.rb -inputURI mongodb://127.0.0.1/twitter.in -outputURI mongodb://127.0.0.1/twitter.out

Make them all executable by running chmod +x on all the scripts, then run twit.sh to have Hadoop process the job.
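The whole flow — map, shuffle, reduce — can also be simulated in plain Ruby on sample data, which is a handy way to debug the scripts' logic before handing them to Hadoop. The document shape is an assumption as above, and this in-memory version keeps symbol keys rather than BSON's string keys:

```ruby
documents = [
  { 'user' => { 'time_zone' => 'London' } },
  { 'user' => { 'time_zone' => 'London' } },
  { 'user' => { 'time_zone' => 'Tokyo' } }
]

# Map: emit (time_zone, count: 1) for each tweet.
mapped = documents.map { |d| { :_id => d['user']['time_zone'], :count => 1 } }

# Shuffle: group the emitted pairs by key, as Hadoop does between phases.
grouped = mapped.group_by { |pair| pair[:_id] }

# Reduce: sum the counts for each key.
results = grouped.map do |key, pairs|
  { :_id => key, :count => pairs.reduce(0) { |sum, p| sum + p[:count] } }
end

results.each { |r| puts r.inspect }
```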
