Apache | Data story

PouchDB’s alpha has been released

PouchDB is now available as an alpha version and can be downloaded here.

PouchDB is a JavaScript library that allows you to store and query data for web applications that need to work offline, and sync with an online database when you are online.

The Browser Database that Syncs

Based on the work of Apache CouchDB, PouchDB provides a simple API for storing and retrieving JSON objects. Because its API is similar to CouchDB’s and it speaks CouchDB’s HTTP API, you can sync data stored in your local PouchDB up to an online CouchDB, sync data from CouchDB down to PouchDB, and even sync between two PouchDB databases.

Status & Browser Support

PouchDB is currently in alpha preview. The first version will support browsers that have implemented the HTML5 IndexedDB API. Work is being done on providing support for WebSQL as well as LocalStorage and node.js. Currently tested in:

  • Firefox 12+
  • Chrome 19+

Quick start

Download pouch.js and include it in your HTML, and you can start using PouchDB straight away. Here is a quick example:

var authors = [
  {name: 'Dale Harvey', commits: 253},
  {name: 'Mikeal Rogers', commits: 42},
  {name: 'Johannes J. Schmidt', commits: 13},
  {name: 'Randall Leeds', commits: 9}
];
Pouch('idb://authors', function(err, db) {
  // Opened a new database
  db.bulkDocs({docs: authors}, function(err, results) {
    // Saved the documents into the database
    db.replicate.to('http://host.com/cloud', function() {
      // The documents are now in the cloud!
    });
  });
});

Cascading 2.0 has been released

Cascading is a Java application framework that enables typical developers to quickly and easily develop rich Data Analytics and Data Management applications that can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop 1.0 and API compatible distributions.

 

Cascading 2.0 is now publicly available for download. This release includes a number of new features.

  • Apache 2.0 Licensing
  • Support for Hadoop 1.0.2
  • Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
  • HashJoin pipe for “map side joins”
  • Merge pipe for “map side merges”
  • Simple Checkpointing for capturing intermediate data as a file
  • Improved Tap and Scheme APIs

Manage your Hadoop cluster

Before you can get into the fun part of actually processing and analyzing big data with Hadoop, you have to configure, deploy and manage your cluster. It’s neither easy nor glamorous (data scientists get all the love), but it is necessary. Here are five tools (not from commercial distribution providers such as Cloudera or MapR) to help you do it.

 

  • Apache Ambari

Apache Ambari is an open source project for monitoring, administration and lifecycle management for Hadoop. It’s also the project that Hortonworks has chosen as the management component for the Hortonworks Data Platform. Ambari works with Hadoop MapReduce, HDFS, HBase, Pig, Hive, HCatalog and Zookeeper.

  • Apache Mesos

Apache Mesos is a cluster manager that lets users run multiple Hadoop jobs, or other high-performance applications, on the same cluster at the same time. According to Twitter Open Source Manager Chris Aniszczyk, Mesos “runs on hundreds of production machines and makes it easier to execute jobs that do everything from running services to handling our analytics workload.”

  • Platform MapReduce

Platform MapReduce is high-performance computing expert Platform Computing’s entry into the big data space. It’s a runtime environment that supports a variety of MapReduce applications and file systems, not just those directly associated with Hadoop, and is tuned for enterprise-class performance and reliability. Platform, now part of IBM, built a respectable business managing clusters for large financial services institutions.

  • StackIQ Rocks+ Big Data

StackIQ Rocks+ Big Data is a commercial distribution of the Rocks cluster management software that the company has beefed up to also support Apache Hadoop. Rocks+ supports the Apache, Cloudera, Hortonworks and MapR distributions, and handles the entire process from configuring bare metal servers to managing an operational Hadoop cluster.

  • Zettaset Orchestrator

Zettaset Orchestrator is an end-to-end Hadoop management product that supports multiple Hadoop distributions. Zettaset touts Orchestrator’s UI-based experience and its ability to handle what the company calls MAAPS — management, availability, automation, provisioning and security. At least one large company, Zions Bancorporation, is a Zettaset customer.

 

Apache Hadoop 2.0 Alpha has been released

The Apache Hadoop community has just released Apache Hadoop 2.0.0 (alpha).

While only an alpha release (read: not ready to run in production), it is still an important step forward as it represents the very first release that delivers new and important capabilities, including:

In addition to these new capabilities, there are several planned enhancements that are on the way from the community, including HDFS Snapshots and auto-failover for HA NameNode, along with further improvements to the stability and performance of the next generation of MapReduce (YARN). There are definitely good times ahead.

Again, please note that the Apache Hadoop community has decided to use the alpha moniker for this release since it is a preview release that is not yet ready for production deployments for the following reasons:

  • We still need to iterate over some of the APIs (especially with the switch to protobufs) before we declare them stable, i.e. something that can be supported over the long run in a compatible manner.
  • Several features including HDFS HA, NextGen MapReduce et al need a lot more testing and validation before they are ready for prime time.
  • While we are excited about the progress made for supporting HA for HDFS, auto-failover for HDFS NameNode and HA for NextGen MapReduce are still a work-in-progress.

Please visit the Apache Hadoop Releases page to download hadoop-2.0.0-alpha and visit the Documentation page for more information.

Apache HBase 0.94 has been released

Apache HBase 0.94.0 has been released and can be downloaded here.

This is the first major release since the January 22nd HBase 0.92 release.

The main focuses of the HBase 0.94.0 release were performance enhancements and new features, along with several major bug fixes.

Performance Related JIRAs

Below are a few of the important performance related JIRAs:

  • Read Caching improvements: HDFS stores data in one block file and its corresponding metadata (checksum) in another block file. This means that every read into the HBase block cache may consume up to two disk ops, one to the data file and one to the checksum file. HBASE-5074: “Support checksums in HBase block cache” adds a block-level checksum in the HFile itself in order to avoid one disk op, boosting read performance. This feature is enabled by default.
  • Seek optimizations: Until now, if there were several StoreFiles for a column family in a region, HBase would seek in each of these files and merge the results, even if the row/column being looked up was in the most recent file. HBASE-4465: “Lazy Seek optimization of StoreFile Scanners” optimizes scanner reads to read the most recent StoreFile first by lazily seeking the StoreFiles. This is achieved by introducing a fake KeyValue with its timestamp equal to the maximum timestamp present in the particular StoreFile. Thus, a disk seek is avoided until the KeyValueScanner for a StoreFile bubbles up the heap, implying a need to do a real read operation. This should provide a significant read performance boost, especially for IncrementColumnValue operations, where only the latest value matters. This feature is enabled by default.
  • Write to WAL optimizations: HBase write throughput is upper bounded by the write rate of the WAL, where the log is replicated to a number of datanodes depending on the replication factor. HBASE-4608: “HLog Compression” adds a custom dictionary-based compression of HLogs for faster replication on HDFS datanodes, thus improving the overall write rate for HBase. This feature is considered experimental and is off by default.
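
The idea behind the inline checksum in HBASE-5074 can be illustrated with a small sketch. This is a toy model in JavaScript, not HBase’s actual HFile layout (HBase is written in Java): the function names `writeBlock` and `readBlock`, the 4-byte trailer, and the choice of CRC-32 are all assumptions made for illustration. The point is only that data and checksum travel in the same block, so a single read is enough to verify it.

```javascript
// Toy model only: names, layout and checksum choice are assumptions,
// not the real HFile format.

// Standard CRC-32 lookup table (reflected polynomial 0xEDB88320).
const CRC_TABLE = Array.from({ length: 256 }, (_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = (c & 1) ? (0xedb88320 ^ (c >>> 1)) : (c >>> 1);
  }
  return c >>> 0;
});

// Compute the CRC-32 of an array-like of byte values.
function crc32(bytes) {
  let crc = 0xffffffff;
  for (const b of bytes) {
    crc = CRC_TABLE[(crc ^ b) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Write path: append the checksum to the payload inside the same block.
function writeBlock(payload) {
  const block = new Uint8Array(payload.length + 4);
  block.set(payload, 0);
  new DataView(block.buffer).setUint32(payload.length, crc32(payload));
  return block;
}

// Read path: one read yields both data and checksum, so verification
// needs no second lookup in a separate checksum file.
function readBlock(block) {
  const payload = block.subarray(0, block.length - 4);
  const stored = new DataView(block.buffer, block.byteOffset)
    .getUint32(block.length - 4);
  if (crc32(payload) !== stored) {
    throw new Error('checksum mismatch');
  }
  return payload;
}
```

A corrupted byte anywhere in the payload makes `readBlock` throw, which is the signal a real system would use to fall back to another replica.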

New Feature Related JIRAs

Here is a list of some of the important JIRAs related to adding new features:

  • More powerful first aid box: The previous HBck tool did a good job of fixing inconsistencies related to region assignments but lacked some basic features like fixing orphaned regions, region holes, overlapping regions, etc. HBASE-5128: “Uber hbck”, adds these missing features to the first aid box.
  • Simplified Region Sizing: Deciding a region size is always tricky as it varies on a number of dynamic parameters such as data size, cluster size, workload, etc. HBASE-4365: “Heuristic for Region size” adds a heuristic where it increases the split size threshold of a table region as the data grows, thus limiting the number of region splits.
  • Smarter transaction semantics: Though HBase supports single row level transaction, if there are a number of updates (Puts/Deletes) to an individual row, it will lock the row for each of these operations. HBASE-3584: “Atomic Put & Delete in a single transaction” enhances the HBase single row locking semantics by allowing Puts and Deletes on a row to be executed in a single call. This feature is on by default.
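
To make the region sizing heuristic more concrete, here is a minimal sketch of a split threshold that grows with the table and is capped by a maximum. The exact growth rule (flush size times the square of the region count) and the default values of 128 MB and 10 GB are assumptions for illustration and are not taken from the text above; `splitThreshold` is a hypothetical name.

```javascript
// Sketch under stated assumptions: the growth rule and the defaults
// below are illustrative, not quoted from HBASE-4365.
const MB = 1024 * 1024;

// Returns the store size (in bytes) at which a region would be split,
// given how many regions of the table already exist.
function splitThreshold(regionCount, flushSize = 128 * MB, maxFileSize = 10 * 1024 * MB) {
  if (regionCount === 0) {
    return maxFileSize;
  }
  // Small tables split early; large tables converge on the fixed cap.
  return Math.min(maxFileSize, flushSize * regionCount * regionCount);
}
```

With these assumed defaults, a table with one region would split at 128 MB and a table with two regions at 512 MB, while larger tables hit the 10 GB cap, so the number of splits levels off as the data grows.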

This major release has a number of new features and bug fixes: a total of 397 resolved JIRAs, including 140 enhancements and 180 bug fixes. It is compatible with 0.92. This opens up an opportunity to backport some of the new features to CDH4, which is based on the 0.92 branch.

 

 

About the Hadoop success

Yes, Hadoop is a success: over the last few years it has become the platform for parallel computation in Java.

But to me that is all it is, a leadership by default. I have been misleading you: the title “About the Hadoop success”, while probably true, is not praise for Hadoop but introduces a short critical review of it:

  • No serious competitor has emerged so far, which is what produced this leadership by default. Competition is an essential requirement for software evolution, and Hadoop’s evolution has wandered. It all started from Google’s MapReduce concept, a small and well-defined software idea, but today hardly any user really understands the path it has taken.
  • Not production ready? Stability and efficiency still seem to be awaited by many. Version 1.0 was released and announced as a big event, but where is the maturity of a 1.0 commercial product?
  • An unfriendly ecosystem is another recurring criticism. Looking at the “Hadoop ecosystem map”, one obvious remark pops up: complexity.
  • Data management is inefficient. We all agree it’s easier to code queries in Hive than by using MapReduce directly. All Hadoop data management is moving to higher-level languages, i.e., to SQL and SQL-like languages, and Hadoop should push this move ahead.

In summary, the Gartner Group has formulated the well-known “hype cycle”  to describe the evolution of a new technology from inception onward.

Hadoop is currently promoted as the “best thing since sliced bread” by its advocates.

We hope that its shortcomings can be fixed in time for it to live up to its promise.

 

 

Legend:

  1. How did it all start: huge data on the web!
  2. Nutch built to crawl this web data
  3. Huge data had to be saved: HDFS was born!
  4. How to use this data?
  5. Map reduce framework built for coding and running analytics (Java, any language through streaming/pipes)
  6. How to import unstructured data: web logs, click streams – fuse,webdav, chukwa, flume, Scribe
  7. Hiho and sqoop for loading data into HDFS – RDBMS can join the Hadoop band wagon!
  8. High level interfaces required over low level map reduce programming– Pig, Hive, Jaql
  9. BI tools with advanced UI reporting- drilldown etc- Intellicus
  10. Workflow tools over Map-Reduce processes and High level languages
  11. Monitor and manage Hadoop, run jobs/Hive, view HDFS – high level view- Hue, Karmasphere, Eclipse plugin, Cacti, Ganglia
  12. Support frameworks- Avro (Serialization), Zookeeper (Coordination)
  13. More High level interfaces/uses- Mahout, Elastic map Reduce
  14. OLTP- also possible – HBase

Apache Hive 0.9.0 has been released

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

The 0.9.0 release continues the trend of extending Hive’s SQL support. Hive now understands the BETWEEN operator and the NULL-safe equality operator, and several new user-defined functions (UDFs) have been added. New UDFs include printf(), sort_array(), and java_method(). Also, the concat_ws() function has been modified to support input parameters consisting of arrays of strings.

This Hive release also includes several significant improvements to the query compiler and execution engine. HIVE-2642 improved Hive’s ability to optimize UNION queries, HIVE-2881 made the map-side JOIN algorithm more efficient, and Hive’s ability to generate optimized execution plans for queries that contain multiple GROUP BY clauses was significantly improved in HIVE-2621.
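
For readers unfamiliar with the two operators, the sketch below models their usual SQL semantics in JavaScript, with `null` standing in for SQL NULL. It is an illustration of standard three-valued logic and not Hive’s implementation; all function names are hypothetical.

```javascript
// null represents SQL NULL ("unknown"). Names are illustrative only.

// Ordinary equality: any comparison involving NULL is unknown.
function sqlEquals(a, b) {
  if (a === null || b === null) return null;
  return a === b;
}

// NULL-safe equality: always true or false, never unknown.
// Two NULLs compare equal; NULL and a value compare unequal.
function nullSafeEquals(a, b) {
  if (a === null || b === null) return a === null && b === null;
  return a === b;
}

// Three-valued AND: false dominates, then unknown, then true.
function and3(p, q) {
  if (p === false || q === false) return false;
  if (p === null || q === null) return null;
  return true;
}

// x BETWEEN low AND high is shorthand for (x >= low) AND (x <= high).
function between(x, low, high) {
  const lowerOk = (x === null || low === null) ? null : x >= low;
  const upperOk = (x === null || high === null) ? null : x <= high;
  return and3(lowerOk, upperOk);
}
```

The difference matters mostly in joins and filters on nullable columns: with ordinary equality two NULL values never match, while the NULL-safe operator treats them as equal.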

HBase users will also be interested in several improvements to Hive’s HBase StorageHandler, mainly:

 

Download

Apache MRUnit 0.9.0-incubating has been released

MRUnit is a Java library that helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they’re deployed to a production system.

The MRUnit project is quite active: 0.9.0 is our fourth release since entering the incubator, and we have added 4 new committers beyond the project’s initial charter! We are very interested in having new contributors and committers join the project! Please join our mailing list to find out how you can help!

The MRUnit build process has changed to produce mrunit-0.9.0-hadoop1.jar and mrunit-0.9.0-hadoop2.jar instead of mrunit-0.9.0-hadoop020.jar, mrunit-0.9.0-hadoop100.jar and mrunit-0.9.0-hadoop023.jar. The hadoop1 classifier is for all Apache Hadoop versions based off the 0.20.X line including 1.0.X. The hadoop2 classifier is for all Apache Hadoop versions based off the 0.23.X line including the unreleased 2.0.X.

This release contains 6 bug fixes, 15 improvements, and 2 new features. I will highlight a few below:

  • Support custom counter checking in MRUNIT-68
  • runTest() should optionally ignore output order in MRUNIT-91
  • Driver.runTest throws RuntimeException should it throw AssertionError in MRUNIT-54
  • o.a.h.mrunit.mapreduce.MapReduceDriver should support a combiner in MRUNIT-67
  • Better support for other serializations besides Writable: MRUNIT-70, MRUNIT-86, MRUNIT-99, MRUNIT-77
  • Better error messages from validate, null checking and forgetting to set mappers and reducers: MRUNIT-74, MRUNIT-66, MRUNIT-65
  • add static convenience methods to PipelineMapReduceDriver class in MRUNIT-89
  • Test and Deprecate Driver.{*OutputFromString,*InputFromString} Methods in MRUNIT-48

An Apache 2.2 module serving files from MongoDB

“mod_gridfs” is an Apache 2.x module that supports serving files from MongoDB GridFS.

More details available here: https://bitbucket.org/onyxmaster/mod_gridfs/

 

 

CouchDB 1.2.0 has been released

Big news for Apache CouchDB: version 1.2.0 has been released and is now available for download.

You can grab your copy here:

http://couchdb.apache.org/

Windows packages are now available. Grab them at the same download link.

This release also coincides with a revamped project homepage!

This is a big release with lots of updates. Please also note that this release contains breaking changes.

These release notes are based on the NEWS file.

Performance

  • Added a native JSON parser

    Performance critical portions of the JSON parser are now implemented in C. This improves latency and throughput for all database and view operations. We are using the fabulous yajl library.

  • Optional file compression (database and view index files)

    This feature is enabled by default.

    All storage operations for databases and views are now passed through Google’s snappy compressor. The result is simple: since less data has to be transferred from and to disk and through the CPU & RAM, all database and view accesses are now faster and on-disk files are smaller. Compression can be changed to gzip compression with options that specify the compression ratio or it can be fully disabled as well.

  • Several performance improvements, especially regarding database writes and view indexing

    Combined with the two preceding improvements, we made some less obvious algorithmic improvements that take the Erlang runtime system into account when writing data to databases and view index files. The net result is much improved performance for most common operations including building views.

    The JIRA ticket (COUCHDB-976) has more information.

  • Performance improvements for the built-in changes feed filters _doc_ids and _design

Security

The security system got a major overhaul making it way more secure to run CouchDB as a public database server for CouchApps. Unfortunately we had to break a bit of backwards compatibility with this, but we think it is well worth the trouble.

  • Documents in the _users database can no longer be read by everyone

    Documents in the _users database can now only be read by the respective authenticated user and administrators. Before, all docs were world-readable including their password hashes and salts.

  • Confidential information in the _replicator database can no longer be read by everyone

    Similar to documents in the _users database, documents in the _replicator database now get passwords and OAuth tokens stripped when read by a user that is not the creator of the replication or an administrator.

  • Password hashes are now calculated by CouchDB instead of the client

    Previously, CouchDB relied on the client to hash and salt the user’s password. Now, it accepts plain text passwords and hashes them before they are committed to disk, following traditional best practices.

  • Allow persistent authentication cookies

    Cookie based authentication can now keep a user logged in over a browser restart.

  • OAuth secrets can now be stored in the users system database

    This is better for managing large numbers of users and tokens than the old, clumsy way of storing OAuth tokens in the configuration system.

  • Updated bundled erlang_oauth library to the latest version

    The Erlang library that handles OAuth authentication has been updated to the latest version.
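
As a rough sketch of what server-side password hashing means for a client, the snippet below builds a user document that carries the password in plain text and leaves the hashing to CouchDB. The helper names, the host `http://localhost:5984` and the user `jan` used in the usage note are assumptions for illustration; the code only constructs the request and does not send it.

```javascript
// Hypothetical helpers: they build a request description, nothing is sent.

// User documents live in the _users database under the
// "org.couchdb.user:" id prefix.
function buildUserDoc(name, password, roles = []) {
  return {
    _id: 'org.couchdb.user:' + name,
    name: name,
    type: 'user',
    roles: roles,
    // Sent as plain text; the server is expected to replace it with a
    // salted hash before the document is committed to disk.
    password: password
  };
}

// Describe the HTTP request that would create the user.
function buildCreateUserRequest(baseUrl, name, password) {
  const doc = buildUserDoc(name, password);
  return {
    method: 'PUT',
    url: baseUrl + '/_users/' + encodeURIComponent(doc._id),
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(doc)
  };
}
```

Calling `buildCreateUserRequest('http://localhost:5984', 'jan', 'secret')` yields a PUT request whose body contains no client-side hash or salt. Because the password travels in plain text, such a request should go over HTTPS.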

Build System

  • cURL is no longer required to build CouchDB as it is only required by the command line JavaScript test runner

    This makes building CouchDB on certain platforms easier.

HTTP API

  • Added a data_size property to database and view group information URIs

    With this you can now calculate how much actual data is stored in a database file or view index file and compare it with the file size that is already being reported. The difference is CouchDB-specific overhead most of which can be reclaimed during compaction. This is used to power the automatic compaction feature (see below).

  • Added optional field since_seq to replication objects/documents

    This allows you to start a replication from a certain database update sequence instead of from the start.

  • The _active_tasks API now exposes more granular fields for each task type

    For example, the replication and compaction tasks report their progress in the task info.

  • Added built-in changes feed filter _view

    With this you can use a view’s map function as a changes filter instead of duplicating its logic in a separate filter function.
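
Two of the additions above can be sketched as small request builders. The helper names, database URLs and the design document `app` with view `by_author` in the usage note are hypothetical, and the `filter=_view&view=...` query form reflects my understanding of the feature and should be checked against the CouchDB documentation. Nothing is sent over the network.

```javascript
// Hypothetical helpers that only build data structures and URLs.

// Replication document that optionally starts from a given update
// sequence (since_seq) instead of from the beginning.
function buildReplicationDoc(source, target, sinceSeq) {
  const doc = { source: source, target: target };
  if (sinceSeq !== undefined) {
    doc.since_seq = sinceSeq;
  }
  return doc;
}

// Changes feed URL that reuses a view's map function as the filter,
// assuming the view is addressed as "designdoc/viewname".
function changesUrlWithViewFilter(dbUrl, designDoc, viewName) {
  const view = encodeURIComponent(designDoc + '/' + viewName);
  return dbUrl + '/_changes?filter=_view&view=' + view;
}
```

For example, `buildReplicationDoc('http://host.com/cloud', 'local_copy', 1500)` describes a replication that skips everything before sequence 1500, and `changesUrlWithViewFilter('http://localhost:5984/authors', 'app', 'by_author')` builds the filtered feed URL.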

Core Storage

  • Added support for automatic compaction

    This feature is disabled by default, but it can be enabled in the configuration page in Futon or the .ini files.

    Compaction is a regular maintenance task for CouchDB. This can now be automated based on multiple variables:

    • A threshold for the file_size to disk_size ratio (say 70%)
    • A time window specified in hours and minutes (e.g. 01:00-05:00)

    Compaction can be cancelled if it exceeds the closing time. Compaction for views and databases can be set to run in parallel, but that is only useful for setups where the database directory and view directory are on different disks.

    In addition, if there’s not enough space (2 × data_size) on the disk to complete a compaction, an error is logged and the compaction is not started.
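
The decision logic described above can be sketched as follows. This is an illustration and not CouchDB’s compaction daemon: the function names and the `config` shape are hypothetical, and the fragmentation formula (the share of the file not occupied by live data, computed from `data_size` and `disk_size`) is an assumption.

```javascript
// Illustrative sketch; names, config shape and formula are assumptions.

// Convert "HH:MM" to minutes since midnight.
function toMinutes(hhmm) {
  const parts = hhmm.split(':').map(Number);
  return parts[0] * 60 + parts[1];
}

// True if nowMin falls inside [from, to); supports windows that wrap
// past midnight such as 23:00-04:00.
function inWindow(nowMin, from, to) {
  const start = toMinutes(from);
  const end = toMinutes(to);
  return start <= end
    ? nowMin >= start && nowMin < end
    : nowMin >= start || nowMin < end;
}

// Percentage of the file that compaction could reclaim.
function fragmentation(info) {
  return ((info.disk_size - info.data_size) / info.disk_size) * 100;
}

function shouldCompact(info, freeBytes, nowMin, config) {
  if (!inWindow(nowMin, config.from, config.to)) {
    return { compact: false, reason: 'outside time window' };
  }
  if (fragmentation(info) < config.fragmentationPercent) {
    return { compact: false, reason: 'below threshold' };
  }
  // Compaction writes a new file, so require 2 x data_size free space.
  if (freeBytes < 2 * info.data_size) {
    return { compact: false, reason: 'not enough disk space' };
  }
  return { compact: true, reason: 'ok' };
}
```

Returning a reason alongside the decision mirrors the behaviour described above, where a compaction that cannot start for lack of disk space is logged as an error instead of being skipped silently.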

Replicator

  • A new replicator implementation that offers more performance and configuration options

    The replicator has been rewritten from scratch. The new implementation is more reliable, faster and has more configuration than the previous implementation. If you have had any issues with replication in previous releases, we strongly recommend giving 1.2.0 a spin.

    Configuration options include:

    • Number of worker processes
    • Batch size per worker
    • Maximum number of HTTP connections
    • Number of connection retries

    See default.ini for the full list of options and their default values.

    This allows you to fine-tune replication behaviour tailored to your environment. A spotty mobile network connection can benefit from a single worker process and small batch sizes to reliably, albeit slowly, synchronise data. A full-duplex 10GigE server-to-server connection on a LAN can benefit from more workers and higher batch sizes. The exact values depend on your particular setup and we recommend some experimentation before settling on a set of values.
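
The two scenarios above could translate into replication settings along these lines. The field names are assumed to mirror the replicator options in default.ini and should be verified there; the numeric values and the profile names are purely illustrative, not recommendations.

```javascript
// Assumed option names; values are made up for illustration and need
// experimentation in a real setup.
const PROFILES = {
  // Spotty mobile link: one worker, small batches, more retries.
  spottyMobile: {
    worker_processes: 1,
    worker_batch_size: 50,
    http_connections: 2,
    retries_per_request: 20
  },
  // Fast server-to-server LAN: more workers and larger batches.
  fastLan: {
    worker_processes: 8,
    worker_batch_size: 2000,
    http_connections: 40,
    retries_per_request: 5
  }
};

// Build a replication document with the chosen tuning profile.
function tunedReplicationDoc(source, target, profileName) {
  const profile = PROFILES[profileName];
  if (!profile) {
    throw new Error('unknown profile: ' + profileName);
  }
  return Object.assign({ source: source, target: target }, profile);
}
```

Keeping the profiles as plain data makes it easy to try several combinations and compare replication times before settling on one.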

Futon

  • Futon’s Status screen (active tasks) now displays two new task status fields: Started on and Updated on
  • Simpler replication cancellation

    Running replications can now be cancelled with a single click.

Log System

  • Log correct stack trace in all cases

    In certain error cases, CouchDB would return a stack trace from the log system itself and hide the real error. Now CouchDB always returns the correct error.

  • Improvements to log messages for file-related errors

    CouchDB requires correct permissions for a number of files. Error messages related to file permission errors were not always obvious and are now improved.

Various Bugfixes

  • Fixed old index file descriptor leaks after a view cleanup
  • Fixes to the _changes feed heartbeat option when combined with a filter. It affected continuous pull replications with a filter
  • Fix use of OAuth with VHosts and URL rewriting
  • The requested_path property of query server request objects now has the path requested by clients before VHosts and rewriting
  • Fixed incorrect reduce query results when using pagination parameters
  • Made icu_driver work with Erlang R15B and later
  • Improvements to the build system and etap test suite
  • Avoid invalidating view indexes when running out of file descriptors

Breaking Changes

This release contains breaking changes:

http://wiki.apache.org/couchdb/Breaking_changes

It is very important that you understand these changes before you upgrade.
