Enter MongoDB+Hadoop: an adapter that allows Apache’s Hadoop platform to integrate with MongoDB.
Using this adapter, it is possible to use MongoDB as a real-time datastore for your application while shifting large aggregation, batch processing, and ETL workloads to a platform better suited for the task.
Well, the engineers at 10gen have taken it one step further with the introduction of the streaming assembly for Mongo-Hadoop.
What does all that mean?
It works like this:
Once a developer has Java installed and Hadoop ready to rock, they download and build the adapter. With the adapter built, you compile the streaming assembly, load some data into Mongo, and get down to writing some MapReduce jobs.
The assembly streams data from MongoDB into Hadoop and back out again, running it through the mappers and reducers defined in a language you feel at home with. Cool, right?
Ruby support was recently added and is particularly easy to get started with. Let's take a look at an example where we analyze Twitter data.
First, import some data into MongoDB from Twitter:
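One way to do this is to pipe a sample of public tweets straight into `mongoimport`. The endpoint, credentials, and the `twitter.in` database/collection names below are illustrative placeholders, not fixed by the adapter; substitute your own and keep them consistent with the run script you write later.

```shell
# Stream sample tweets and load each JSON document into the
# "in" collection of the "twitter" database.
curl https://stream.twitter.com/1/statuses/sample.json -u<user>:<password> | \
  mongoimport -d twitter -c in
```

Let it run for a minute or two, then Ctrl-C once you have enough documents to play with.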
Next, write a Mapper and save it in a file called mapper.rb:
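Here is one possible mapper, which emits a count of 1 keyed by each tweet's time zone. The mapping logic itself is plain Ruby; the `mongo-hadoop` gem and its `MongoHadoop.map` hook are assumptions about the streaming assembly's Ruby API, so verify them against the adapter's README.

```ruby
#!/usr/bin/env ruby
# mapper.rb -- emit one count per tweet, keyed by the author's time zone.

# Pure mapping logic, kept as a plain method so it can be tried standalone.
def map_tweet(document)
  { :_id => document['user']['time_zone'], :count => 1 }
end

# Hook into the streaming assembly. The gem name and MongoHadoop.map
# helper are assumptions here -- check the adapter's documentation.
begin
  require 'mongo-hadoop'
  MongoHadoop.map { |doc| map_tweet(doc) }
rescue LoadError
  # Gem not installed; the pure method above still works on its own.
end
```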
Now, write a Reducer and save it in a file called reducer.rb:
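A matching reducer just sums the counts the mapper emitted for each key. As with the mapper, the summing logic is plain Ruby, while the `MongoHadoop.reduce` hook is an assumed part of the gem's API:

```ruby
#!/usr/bin/env ruby
# reducer.rb -- sum the per-tweet counts emitted by the mapper.

# Pure reducing logic: total up the 'count' field across all values
# for a given key and emit one result document.
def reduce_counts(key, values)
  total = values.inject(0) { |sum, v| sum + v['count'] }
  { :_id => key, :count => total }
end

# Hook into the streaming assembly; an assumption, as in mapper.rb.
begin
  require 'mongo-hadoop'
  MongoHadoop.reduce { |key, values| reduce_counts(key, values) }
rescue LoadError
  # Gem not installed; the pure method above still works on its own.
end
```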
To run it all, create a shell script that executes hadoop with the streaming assembly jar and tells it how to find the mapper and reducer files as well as where to retrieve and store the data:
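A sketch of such a script, here called `twit.sh`. The jar filename, the flag spellings, and the `twitter.in`/`twitter.out` URIs are assumptions; confirm them against your mongo-hadoop build's usage output, and make sure the input URI matches the collection you imported into.

```shell
#!/bin/sh
# twit.sh -- run the streaming job against the local mongod.
# Jar path, flag names, and URIs are illustrative.
hadoop jar mongo-hadoop-streaming-assembly*.jar \
  -mapper mapper.rb \
  -reducer reducer.rb \
  -inputURI mongodb://127.0.0.1/twitter.in \
  -outputURI mongodb://127.0.0.1/twitter.out
```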
Make them all executable by running chmod +x on all the scripts, then run twit.sh to have Hadoop process the job.