Version 1.1 of the Mongo-Hadoop Adapter has been released. It makes it easy to use MongoDB databases, or MongoDB backup files in .bson format, as the input source or output destination for Hadoop Map/Reduce jobs. The adapter inspects the data and computes input splits, so Hadoop can process the splits in parallel and work through very large datasets quickly.
The Mongo-Hadoop Adapter also includes support for Pig and Hive, which allow sophisticated MapReduce workflows to be executed by writing simple scripts.
- Pig is a high-level scripting language for data analysis and building map/reduce workflows.
- Hive is a SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems.
Hadoop streaming is also supported, so map/reduce functions can be written in languages other than Java. The Mongo-Hadoop Adapter currently supports streaming in Ruby, Node.js and Python.
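To give a rough idea of what a map/reduce function pair looks like outside Java, here is a minimal Python sketch. It does not use the adapter's streaming API. The `author` field, the function names, and the `run_local` driver are all assumptions made for illustration, and the driver only imitates on one machine the grouping step that Hadoop performs between the map and reduce phases.

```python
from collections import defaultdict


def mapper(documents):
    """Map step: emit one (key, 1) pair per document.

    Assumes each document is a dict with a hypothetical 'author' field.
    """
    for doc in documents:
        yield doc["author"], 1


def reducer(key, values):
    """Reduce step: collapse all values for one key into one output document."""
    return {"_id": key, "count": sum(values)}


def run_local(documents):
    """Hypothetical local driver that groups mapper output by key,
    then calls the reducer once per key (keys processed in sorted order)."""
    grouped = defaultdict(list)
    for key, value in mapper(documents):
        grouped[key].append(value)
    return [reducer(key, values) for key, values in sorted(grouped.items())]


if __name__ == "__main__":
    sample = [{"author": "ann"}, {"author": "bob"}, {"author": "ann"}]
    for row in run_local(sample):
        print(row)
```

In a real streaming job the framework, not a local driver, would feed documents to the mapper and grouped values to the reducer; only the two small functions would be yours to write.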
How it Works
- The adapter examines the MongoDB collection and calculates a set of splits from the data
- Each split is assigned to a node in the Hadoop cluster
- In parallel, the Hadoop nodes pull the data for their splits from MongoDB (or BSON) and process it locally
- Hadoop merges results and streams output back to MongoDB or BSON
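The steps above can be imitated on a single machine with a small Python sketch. This is a toy model of the split, process, and merge flow, not the adapter's actual implementation: the collection is a plain list of dicts, the `value` field is hypothetical, splits are fixed-size slices, and worker threads stand in for Hadoop nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_splits(collection, split_size):
    """Step 1: cut the collection into fixed-size splits (an assumed strategy)."""
    return [collection[i:i + split_size]
            for i in range(0, len(collection), split_size)]


def process_split(split):
    """Step 3: each worker processes only its own split.

    Here the 'processing' is summing a hypothetical 'value' field.
    """
    return sum(doc["value"] for doc in split)


def run_job(collection, split_size=2, workers=2):
    splits = compute_splits(collection, split_size)
    # Step 2: hand each split to a worker; threads stand in for cluster nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_split, splits))
    # Step 4: merge the partial results into one output document.
    return {"total": sum(partials), "splits": len(splits)}


if __name__ == "__main__":
    docs = [{"value": n} for n in range(5)]
    print(run_job(docs))
```

The point of the sketch is that each worker touches only its own slice of the data, which is what lets the real job scale across many nodes.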