Apache Kafka 0.8.0 released

Kafka is a message queue developed by LinkedIn that persists messages to disk with very high performance. It provides the functionality of a messaging system, but with a unique design.

The 0.8.0 release brings new features such as:

  • [KAFKA-50] – kafka intra-cluster replication support
  • [KAFKA-188] – Support multiple data directories
  • [KAFKA-202] – Make the request processing in kafka asynchronous
  • [KAFKA-203] – Improve Kafka internal metrics
  • [KAFKA-235] – Add a ‘log.file.age’ configuration parameter to force rotation of log files after they’ve reached a certain age
  • [KAFKA-429] – Expose JMX operation to set logger level dynamically
  • [KAFKA-475] – Time based log segment rollout
  • [KAFKA-545] – Add a Performance Suite for the Log subsystem
  • [KAFKA-546] – Fix commit() in zk consumer for compressed messages

Download version 0.8.0 here: http://kafka.apache.org/downloads.html

Mesos 0.13.0 released

Mesos 0.13 has been released; it fixes many bugs and includes the following improvements:

  • [MESOS-46] – Refactor MasterTest to use fixture
  • [MESOS-134] – Add Python documentation
  • [MESOS-140] – Unrecognized command line args should fail the process
  • [MESOS-242] – Add more tests to Dominant Share Allocator
  • [MESOS-305] – Inform the frameworks / slaves about a master failover
  • [MESOS-346] – Improve OSX configure output when deprecated headers are present.
  • [MESOS-360] – Mesos jar should be built for java 6
  • [MESOS-409] – Master detector code should stat nodes before attempting to create
  • [MESOS-472] – Separate ResourceStatistics::cpu_time into ResourceStatistics::cpu_user_time and ResourceStatistics::cpu_system_time.
  • [MESOS-493] – Expose version information in http endpoints
  • [MESOS-503] – Master should log LOST messages sent to the framework
  • [MESOS-526] – Change slave command line flag from ‘safe’ to ‘strict’
  • [MESOS-602] – Allow Mesos native library to be loaded from an absolute path
  • [MESOS-603] – Add support for better test output in newer versions of autotools

Download the most recent stable release: 0.13.0. (Release Notes)


Cassandra 2.0, just big

In five years, Apache Cassandra has grown into one of the most widely used NoSQL databases in the world and serves as the backbone for some of today’s most popular applications, including Facebook, Netflix, and Twitter.



This newest version, Cassandra 2.0, just announced, includes multiple new features. But perhaps the biggest is that “Cassandra 2.0 makes it easier than ever for developers to migrate from relational databases and become productive quickly.”

New features and improvements include:

  • Lightweight transactions, which ensure operation linearizability similar to the serializable isolation level offered by relational databases, preventing conflicts during concurrent requests
  • Triggers, which enable pushing performance-critical code close to the data it deals with, and simplify integration with event-driven frameworks like Storm
  • CQL enhancements such as cursors and improved index support
  • Improved compaction, keeping read performance from deteriorating under heavy write load
  • Eager retries to avoid query timeouts by sending redundant requests to other replicas if too much time elapses on the original request
  • Custom Thrift server implementation based on LMAX Disruptor that achieves lower message processing latencies and better throughput with flexible buffer allocation strategies
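The lightweight-transaction bullet above can be made concrete with a minimal Python sketch of compare-and-set semantics, as in a query like INSERT ... IF NOT EXISTS. The class and method names here are hypothetical, and a single lock stands in for the Paxos consensus round that real Cassandra runs across replicas:

```python
import threading

class LightweightTable:
    """Toy model of Cassandra-style compare-and-set (real Cassandra uses Paxos)."""
    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # stands in for distributed consensus

    def insert_if_not_exists(self, key, value):
        """Mimics INSERT ... IF NOT EXISTS: returns (applied, current_value)."""
        with self._lock:
            if key in self._rows:
                return (False, self._rows[key])  # lost the race: not applied
            self._rows[key] = value
            return (True, value)

table = LightweightTable()
first = table.insert_if_not_exists('user:jsmith', {'email': 'jsmith@example.com'})
second = table.insert_if_not_exists('user:jsmith', {'email': 'other@example.com'})
```

The second insert reports that it was not applied and returns the winning value, which is exactly the feedback a concurrent writer needs to avoid silently clobbering data.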

Official website

Official announcement


PouchDB's alpha has been released

PouchDB is now available as an alpha version and can be downloaded here.

PouchDB is a JavaScript library that allows you to store and query data for web applications that need to work offline, and sync with an online database when you are online.

The Browser Database that Syncs

Based on the work of Apache CouchDB, PouchDB provides a simple API for storing and retrieving JSON objects. Because its API is similar to CouchDB’s HTTP API, it is possible to sync data stored in your local PouchDB to an online CouchDB, as well as sync data from CouchDB down to PouchDB (you can even sync between two PouchDB databases).

Status & Browser Support

PouchDB is currently in alpha preview. The first version will support browsers that have implemented the HTML5 IndexedDB API. Work is being done on providing support for WebSQL as well as LocalStorage and node.js. Currently tested in:

  • Firefox 12+
  • Chrome 19+

Quick start

Download pouch.js and include it in your HTML; you can then start using PouchDB straight away. Here is a quick example:

var authors = [
  {name: 'Dale Harvey', commits: 253},
  {name: 'Mikeal Rogers', commits: 42},
  {name: 'Johannes J. Schmidt', commits: 13},
  {name: 'Randall Leeds', commits: 9}
];

Pouch('idb://authors', function(err, db) {
  // Opened a new database
  db.bulkDocs({docs: authors}, function(err, results) {
    // Saved the documents into the database
    db.replicate.to('http://host.com/cloud', function() {
      // The documents are now in the cloud!
    });
  });
});

Cascading 2.0 has been released

Cascading is a Java application framework that enables typical developers to quickly and easily develop rich Data Analytics and Data Management applications that can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop 1.0 and API compatible distributions.


Cascading 2.0 is now publicly available for download. This release includes a number of new features.

  • Apache 2.0 Licensing
  • Support for Hadoop 1.0.2
  • Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
  • HashJoin pipe for “map side joins”
  • Merge pipe for “map side merges”
  • Simple Checkpointing for capturing intermediate data as a file
  • Improved Tap and Scheme APIs
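The idea behind a “map side join” such as Cascading’s HashJoin can be sketched in a few lines of Python (the function and field names are illustrative, not Cascading API): the smaller input is loaded into an in-memory hash table and the larger input is streamed past it, so no reduce-side shuffle is needed:

```python
def hash_join(large_stream, small_table, key_large, key_small):
    """Map-side (hash) join: build an in-memory index over the small side,
    then stream the large side past it -- no shuffle/reduce step needed."""
    index = {}
    for row in small_table:
        index.setdefault(row[key_small], []).append(row)
    for row in large_stream:
        for match in index.get(row[key_large], []):
            merged = dict(row)   # combine fields from both sides
            merged.update(match)
            yield merged

orders = [{'user_id': 1, 'item': 'book'}, {'user_id': 2, 'item': 'pen'}]
users = [{'id': 1, 'name': 'Ada'}, {'id': 2, 'name': 'Bob'}]
joined = list(hash_join(orders, users, 'user_id', 'id'))
```

The trade-off is memory: this only works when one side of the join is small enough to fit on each mapper, which is precisely when a map-side join beats a full reduce-side join.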

Manage your Hadoop cluster

Before you can get to the fun part of actually processing and analyzing big data with Hadoop, you have to configure, deploy and manage your cluster. It’s neither easy nor glamorous — data scientists get all the love — but it is necessary. Here are five tools (not from commercial distribution providers such as Cloudera or MapR) to help you do it.


  • Apache Ambari

Apache Ambari is an open source project for monitoring, administration and lifecycle management for Hadoop. It’s also the project that Hortonworks has chosen as the management component for the Hortonworks Data Platform. Ambari works with Hadoop MapReduce, HDFS, HBase, Pig, Hive, HCatalog and Zookeeper.

  • Apache Mesos

Apache Mesos is a cluster manager that lets users run multiple Hadoop jobs, or other high-performance applications, on the same cluster at the same time. According to Twitter Open Source Manager Chris Aniszczyk, Mesos “runs on hundreds of production machines and makes it easier to execute jobs that do everything from running services to handling our analytics workload.”

  • Platform MapReduce

Platform MapReduce is high-performance computing expert Platform Computing’s entrée into the big data space. It’s a runtime environment that supports a variety of MapReduce applications and file systems, not just those directly associated with Hadoop, and is tuned for enterprise-class performance and reliability. Platform, now part of IBM, built a respectable business managing clusters for large financial services institutions.

  • StackIQ Rocks+ Big Data

StackIQ Rocks+ Big Data is a commercial distribution of the Rocks cluster management software that the company has beefed up to also support Apache Hadoop. Rocks+ supports the Apache, Cloudera, Hortonworks and MapR distributions, and handles the entire process from configuring bare metal servers to managing an operational Hadoop cluster.

  • Zettaset Orchestrator

Zettaset Orchestrator is an end-to-end Hadoop management product that supports multiple Hadoop distributions. Zettaset touts Orchestrator’s UI-based experience and its ability to handle what the company calls MAAPS — management, availability, automation, provisioning and security. At least one large company, Zions Bancorporation, is a Zettaset customer.


Apache Hadoop 2.0 Alpha has been released

The Apache Hadoop community has just released Apache Hadoop 2.0.0 (alpha).

While only an alpha release (read: not ready to run in production), it is still an important step forward, as it represents the very first release that delivers new and important capabilities, including High Availability for the HDFS NameNode, the next generation of MapReduce (YARN), and a switch to protobuf-based APIs.

In addition to these new capabilities, there are several planned enhancements on the way from the community, including HDFS Snapshots and auto-failover for the HA NameNode, along with further improvements to the stability and performance of the next generation of MapReduce (YARN). There are definitely good times ahead.

Again, please note that the Apache Hadoop community has decided to use the alpha moniker for this release since it is a preview release that is not yet ready for production deployments for the following reasons:

  • We still need to iterate over some of the APIs (especially with the switch to protobufs) before we declare them stable, i.e. something that can be supported over the long run in a compatible manner.
  • Several features, including HDFS HA and NextGen MapReduce, need a lot more testing and validation before they are ready for prime time.
  • While we are excited about the progress made for supporting HA for HDFS, auto-failover for HDFS NameNode and HA for NextGen MapReduce are still a work-in-progress.

Please visit the Apache Hadoop Releases page to download hadoop-2.0.0-alpha and visit the Documentation page for more information.

Apache HBase 0.94 has been released

Apache HBase 0.94.0 has been released and can be downloaded here

This is the first major release since the January 22nd HBase 0.92 release.

In the HBase 0.94.0 release, the main focuses were performance enhancements and the addition of new features (along with several major bug fixes).

Performance Related JIRAs

Below are a few of the important performance related JIRAs:

  • Read caching improvements: HDFS stores data in one block file and its corresponding metadata (checksums) in another block file. This means that every read into the HBase block cache may consume up to two disk ops, one to the data file and one to the checksum file. HBASE-5074: “Support checksums in HBase block cache” adds a block-level checksum in the HFile itself in order to avoid one disk op, boosting read performance. This feature is enabled by default.
  • Seek optimizations: Until now, if there were several StoreFiles for a column family in a region, HBase would seek into each such file and merge the results, even if the row/column we are looking for is in the most recent file. HBASE-4465: “Lazy Seek optimization of StoreFile Scanners” optimizes scanner reads to read the most recent StoreFile first by lazily seeking the StoreFiles. This is achieved by introducing a fake KeyValue with its timestamp equal to the maximum timestamp present in the particular StoreFile. Thus, a disk seek is avoided until the KeyValueScanner for a StoreFile bubbles up the heap, implying a need to do a real read operation. This should provide a significant read performance boost, especially for IncrementColumnValue operations, where we care only about the latest value. This feature is enabled by default.
  • Write-to-WAL optimizations: HBase write throughput is upper-bounded by the write rate of the WAL, where the log is replicated to a number of datanodes depending on the replication factor. HBASE-4608: “HLog Compression” adds custom dictionary-based compression of HLogs for faster replication to HDFS datanodes, thus improving the overall write rate for HBase. This feature is considered experimental and is off by default.
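To illustrate the dictionary-based compression idea behind HLog Compression, here is a toy Python sketch (not the actual HBase encoding): values that repeat across WAL entries, such as table and region names, are replaced by small integer back-references after their first occurrence:

```python
def encode(values):
    """Dictionary coder: emit a literal on first sight, a short back-reference after."""
    index, out = {}, []
    for v in values:
        if v in index:
            out.append(('ref', index[v]))   # repeat: cheap integer reference
        else:
            index[v] = len(index)
            out.append(('lit', v))          # first occurrence: full literal
    return out

def decode(tokens):
    """Rebuild the original sequence; the decoder grows the same dictionary."""
    entries, out = [], []
    for kind, payload in tokens:
        v = entries[payload] if kind == 'ref' else payload
        if kind == 'lit':
            entries.append(v)
        out.append(v)
    return out

edits = ['table1/regionA', 'table1/regionA', 'table2/regionB', 'table1/regionA']
encoded = encode(edits)
```

Since WAL entries for the same region repeat the same long identifiers over and over, replacing repeats with small integers shrinks the log substantially, which is what speeds up replication to the datanodes.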

New Feature Related JIRAs

Here is a list of some of the important JIRAs related to adding new features:

  • A more powerful first-aid box: The previous hbck tool did a good job of fixing inconsistencies related to region assignments but lacked some basic features such as fixing orphaned regions, region holes, overlapping regions, etc. HBASE-5128: “Uber hbck” adds these missing features to the first-aid box.
  • Simplified Region Sizing: Deciding a region size is always tricky as it varies on a number of dynamic parameters such as data size, cluster size, workload, etc. HBASE-4365: “Heuristic for Region size” adds a heuristic where it increases the split size threshold of a table region as the data grows, thus limiting the number of region splits.
  • Smarter transaction semantics: Though HBase supports single-row transactions, if there are a number of updates (Puts/Deletes) to an individual row, it will lock the row for each of these operations. HBASE-3584: “Atomic Put & Delete in a single transaction” enhances HBase’s single-row locking semantics by allowing Puts and Deletes on a row to be executed in a single call. This feature is on by default.
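A rough Python model of the single-row atomicity in HBASE-3584 (class and method names hypothetical, not the HBase API): the whole batch of Puts and Deletes is applied under one row lock instead of locking the row once per operation:

```python
import threading

class ToyRow:
    """Sketch of single-row atomic mutation: a batch of Puts and Deletes on
    one row is applied under a single row lock, so readers never observe a
    partially applied batch."""
    def __init__(self):
        self._cells = {}
        self._lock = threading.Lock()

    def mutate(self, puts=(), deletes=()):
        with self._lock:  # one lock acquisition for the whole batch
            for column, value in puts:
                self._cells[column] = value
            for column in deletes:
                self._cells.pop(column, None)
            return dict(self._cells)  # snapshot of the row after the batch

row = ToyRow()
row.mutate(puts=[('info:name', 'Ada'), ('info:city', 'London')])
state = row.mutate(puts=[('info:city', 'Paris')], deletes=['info:name'])
```

Locking once per batch is both faster (fewer lock round-trips) and safer (no window where the Put has landed but the Delete has not).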

This major release has a number of new features and bug fixes; a total of 397 resolved JIRAs with 140 enhancements and 180 bug fixes. It is compatible with 0.92. This opens up a window of opportunity to backport some of the cool features back in CDH4, which is based on the 0.92 branch.



About the Hadoop success

Yes, Hadoop is a success: over the last few years it has become the platform for parallel computation in Java.

But that is all it is to me: leadership by default. I may have misled you; “About the Hadoop success”, while probably true, is not praise of the Hadoop solution but a small and short critical review of it:

  • No serious competitors so far: this has brought leadership by default, yet competition is an essential requirement for software evolution. Hadoop’s evolution has wandered. It all started from Google’s map-reduce concept, small and well defined, but today few of its users really understand the path it has taken.
  • Not production ready? Well, stability and efficiency seem to remain awaited by many. The 1.0 version was released and announced as a big event, but where is the maturity of a 1.0 commercial product?
  • An unfriendly ecosystem is another recurring criticism; referring to the “Hadoop ecosystem map”, one obvious remark pops up: complexity.
  • Data management is inefficient: we all agree it’s easier to code queries in Hive than by using MapReduce directly. All Hadoop data management is moving to higher-level languages, i.e., to SQL and SQL-like languages, and Hadoop should push ahead with that move.

In summary, the Gartner Group has formulated the well-known “hype cycle”  to describe the evolution of a new technology from inception onward.

Current Hadoop is promised as the “best thing since sliced bread” by its advocates.

We hope that its shortcomings can be fixed in time for it to live up to its promise.




  1. How did it all start: huge data on the web!
  2. Nutch built to crawl this web data
  3. Huge data had to be saved: HDFS was born!
  4. How to use this data?
  5. Map reduce framework built for coding and running analytics (Java, any language through streaming/pipes)
  6. How to import unstructured data: web logs, click streams – Fuse, WebDAV, Chukwa, Flume, Scribe
  7. Hiho and Sqoop for loading data into HDFS – RDBMS can join the Hadoop bandwagon!
  8. High-level interfaces required over low-level map-reduce programming – Pig, Hive, Jaql
  9. BI tools with advanced UI reporting – drilldown etc. – Intellicus
  10. Workflow tools over Map-Reduce processes and high-level languages
  11. Monitor and manage Hadoop, run jobs/Hive, view HDFS – high-level view – Hue, Karmasphere, Eclipse plugin, Cacti, Ganglia
  12. Support frameworks – Avro (serialization), Zookeeper (coordination)
  13. More high-level interfaces/uses – Mahout, Elastic MapReduce
  14. OLTP also possible – HBase

Apache Hive 0.9.0 has been released

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

The 0.9.0 release continues the trend of extending Hive’s SQL support. Hive now understands the BETWEEN operator and the NULL-safe equality operator, and several new user-defined functions (UDFs) have been added. New UDFs include printf(), sort_array(), and java_method(). Also, the concat_ws() function has been modified to support input parameters consisting of arrays of strings.
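The difference between plain SQL equality and the NULL-safe equality operator (written <=> in HiveQL) can be modeled in Python, with None standing in for SQL NULL; this is a sketch of the semantics, not Hive code:

```python
def sql_eq(a, b):
    """Plain SQL '=': yields NULL (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_eq(a, b):
    """NULL-safe '<=>': two NULLs compare equal; never returns NULL."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b
```

The NULL-safe form is handy in join conditions and filters over nullable columns, where a three-valued NULL result would otherwise silently drop rows.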

This Hive release also includes several significant improvements to the query compiler and execution engine. HIVE-2642 improved Hive’s ability to optimize UNION queries, HIVE-2881 made the map-side JOIN algorithm more efficient, and Hive’s ability to generate optimized execution plans for queries that contain multiple GROUP BY clauses was significantly improved in HIVE-2621.

HBase users will also be interested in several improvements to Hive’s HBase StorageHandler, mainly: