Riak 1.3.2 released

Performance and bugfix release on the Riak 1.3 series.
The release notes for Riak can be found here:

The packages can be found here:

New Features or Major Improvements for Riak

eLevelDB Verify Compactions

In Riak 1.2 we added code to have leveldb automatically shunt corrupted blocks to the lost/BLOCKS.bad file during a compaction. This was to keep the compactions from going into an infinite loop over an issue that A) read repair and AAE could fix behind the scenes and B) took up a bunch of customer support / engineering time to help customers fix manually.

Unfortunately, we did not realize that only one of two corruption tests was actually active during a compaction. There is a CRC test that applies to all blocks, including file metadata. Compression logic has a hash test that applies only to compressed data blocks. The CRC test was not active by default. Sadly, leveldb makes limited defensive tests beyond the CRC. A corrupted disk file could readily result in a leveldb / Riak crash … unless the bad block happened to be detected by the compression hash test.

Google’s answer to this problem is the paranoid_checks option, which defaults to false. Unfortunately setting this to true activates not only the compaction CRC test but also a CRC test of the recovery log. A CRC failure in the recovery log after a crash is expected, and utilized by the existing code logic to enable automated recovery upon next start up. paranoid_checks option will actually stop the automatic recovery if set to true. This second behavior is undesired.

This branch creates a new option, verify_compactions. The background CRC test previously controlled by paranoid_checks is now controlled by this new option. The recovery log CRC check is still controlled by paranoid_checks. verify_compactions defaults to true. paranoid_checks continues to default to false.

Note: CRC calculations are typically expensive. Riak 1.3 added code to leveldb to utilize Intel hardware CRC on 64bit servers where available. Riak 1.2 added code to leveldb to create multiple, prioritized compaction threads. These two prior features work to minimize / hide the impact of the increased CRC workload during background compactions.

Erlang Scheduler Collapse

All Erlang/OTP releases prior to R16B01 are vulnerable to the Erlang computation scheduler threads going asleep too aggressively. The sleeping periods reduce power consumption and inter-thread resource contention.

This release of Riak EDS requires a patch to the Erlang/OTP virtual machine to force sleeping scheduler threads to wake up a regular intervals. The flag +sfwi 500 must also be present in the vm.args file. This value is in milliseconds and may need tuning for your application. For the Open Source Riak release, the patch (and extra “vm.args” flags) are recommended: the patch can be found at: https://gist.github.com/evanmcc/a599f4c6374338ed672e.

Overload Protection / Work Shedding

As of Riak 1.3.2, Riak now includes built-in overload protection. If a Riak node becomes overloaded, Riak will now immediately respond {error, overload} rather than perpetually enqueuing requests and making the situation worse.

Previously, Riak would always enqueue requests. As an overload situation became worse, requests would take longer and longer to service, eventually getting to the point where requests would continually timeout. In extreme scenarios, Riak nodes could become unresponsive and ultimately crash.

The new overload protection addresses these issues.

The overload protection is configurable through app.config settings. The default settings have been tested on clusters of varying sizes and request rates and should be sufficient for all users of Riak. However, for completeness, the new settings are explained below.

There are two types of overload protection in Riak, each with different settings. The first limits the number of in-flight get and put operations initiated by a node in a Riak cluster. This is configured through the riak_kv/fsm_limit setting. The default is 50000. This limit is tracked separately for get and put requests, so the default allows up to 100000 in-flight requests in total.

The second type of overload protection limits the message queue size for individual vnodes, setting an upper bound on unserviced requests on a per-vnode basis. This is configured through the riak_core/vnode_overload_threshold setting and defaults to 10000 messages.

Setting either config setting to undefined in app.config will disable overload protection. This is not recommended. Note: not configuring the options at all will use the defaults mentioned above, ie. when missing from app.config.

The overload protection provides new stats that are exposed over the /stats endpoint.

The dropped_vnode_requests_total stat counts the number of messages discarded due to the vnode overload protection.

For the get/put overload protection, there are several new stats. The stats related to gets are listed below, there are equivalent versions for puts.

The node_get_fsm_active and node_get_fsm_active_60s stats shows how many gets are currently active on the node within the last second or last minute respectively. The node_get_fsm_in_rate and node_get_fsm_out_rate track the number of requests initiated and completed within the last second. Finally, the node_get_fsm_rejected, node_get_fsm_rejected_60s, and node_get_fsm_rejected_total track the number of requests discarded due to overload in their respective time windows.

Health Check Disabled

The health check feature that shipped in Riak 1.3.0 has been disabled as of Riak 1.3.2. The new overload protection feature serves a similar purpose and is much safer. Specifically, the health check approach was able to successfully recover from overload that was caused by slow nodes, but not from overload that was caused by incoming workload spiking beyond absolute cluster capacity. In fact, in the second case, the health check approach (divert overload traffic from one node to another) would exacerbate the problem.

Riak CS Open Sourced

Basho announced today that Riak CS is now open source under the Apache 2 license. Organizations and users can now access the source code on Github and download the latest packages from the downloads page.

Riak CS can be used to build private or public clouds or as reliable, available storage behind applications and platforms. Riak CS Enterprise is currently used by large corporations including Datapipe, Deutsche Vermögensberatung (DVAG), IDC Frontier, Rovio, and Yahoo! JAPAN.

Basho is a distributed systems company dedicated to making software that is available, fault-tolerant, and easy to operate at scale. Twenty-five percent of the Fortune 50 and thousands of open source users large and small run our flagship open source database, Riak. Riak CS takes distributed systems principles derived from production Riak users and applies it to the problem of large scale storage. We are excited to share this code with the world.

Riak CS features:

  • Highly available, fault-tolerant storage
  • Large object support
  • S3-compatible API and authentication
  • Multi-tenancy and per-user reporting
  • Simple operational model for adding capacity
  • Robust stats for monitoring and metrics

For users requiring multi-datacenter replication and enterprise-level support, Riak CS Enterprise (a commercial extension of Riak CS) is available.

New Features

Today we are also announcing several new features, available now as part of the open source edition.

  • Multipart upload. Upload very large files to Riak CS as a series of parts. Parts can be between 5MB and 5GB.
  • Support for GET range queries. Retrieve a range of bytes from a single object. This functionality is implemented in the “Range” request header of GET operations.
  • Per-bucket policies to restrict access to buckets based on source IP.
  • Riak CS Control. Riak CS Control is a standalone web administration tool for user management available on Github.

Riak v1.3 has been released

Riak v1.3 has been released,Active Anti-Entropy, a New Look for Riak Control, and Faster Multi-Datacenter Replication

Major features of this release:

  • Introduced Active Anti-Entropy
  • Improved Riak Enterprise’s multi-datacenter replication performance
  • Improved graphical user experience
  • Expanded IPv6 support
  • Improved MapReduce
  • Simplified log management

Download the new release here, check out the official release notes

Cassandra performance review


Original article available here


Four years ago, well before starting DataStax, I evaluated the then-current crop of distributed databases and explained why I chose Cassandra. In a lot of ways, Cassandra was the least mature of the options, but I chose to take a long view and wanted to work on a project that got the fundamentals right; things like documentation and distributed testscould come later.


2012 saw that validated in a big way, as the most comprehensive NoSQL benchmark to date was published at the VLDB conference by researchers at the University of Toronto. They concluded,

In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear in- creasing throughput from 1 to 12 nodes.

As a sample, here’s the throughput results from the mixed reads, writes, and (sequential) scans:

I encourage you to take a few minutes to skim the full results.

There are both architectural and implentation reasons for Cassandra’s dominating performance here. Let’s get down into the weeds and see what those are.


Cassandra incorporates a number of architectural best practices that affect performance. None are unique to Cassandra, but Cassandra is the only NoSQL system that incorporates all of them.

Fully distributed: Every Cassandra machine handles a proportionate share of every activity in the system. There are no special cases like the HDFS namenode or MongoDB mongos that require special treatment or special hardware to avoid becoming a bottleneck. And with every node the same, Cassandra is far simpler to install and operate, which has long-term implications for troubleshooting.

Log-structured storage engine: A log-structured engine that avoids overwrites to turn updates into sequential i/o is essential both on hard disks (HDD) and solid-state disks (SSD). On HDD, because the seek penalty is so high; on SSD, to avoid write amplification and disk failure. This is why you see mongodb performance go through the floor as the dataset size exceeds RAM.

Tight integration with its storage engine: Voldemort and Riak support pluggable storage engines, which both limits them to a lowest-common-denominator of key/value pairs, and limits the optimizations that can be done with the distributed replication engine.

Locally-managed storage: HBase has an integrated, log-structured storage engine, but relies on HDFS for replication instead of managing storage locally. This means HBase is architecturally incapable of supporting Cassandra-style optimizations like putting the commitlog on a separate disk, or mixing SSD and HDD in a single cluster with appropriate data pinned to each.


An architecture is only as good as its implementation. For the first years after Cassandra’s open-sourcing as an Apache project, every release was a learning experience. 0.3, 0.4, 0.5, 0.6, each attracted a new wave of users that exposed some previously unimportant weakness. Today, we estimate there are over a thousand production deployments of Cassandra, the most for any scalable database. Some are listed here. To paraphrase ESR, “With enough eyes, all performance problems are obvious.”

What are some implementation details relevant to performance? Let’s have a look at some of the options.


MongoDB can be a great alternative to MySQL, but it’s not really appropriate for the scale-out applications targeted by Cassandra. Still, as early members of the NoSQL category, the two do draw comparisons.

One important limitation in MongoDB is database-level locking. That is, only one writer may modify a given database at a time. Support for collection-level (a set of documents, analogous to a relational table) locking is planned. With either database- or collection-level locking, other writers or readers are locked out. Even a small number of writes can produce stalls in read performance.

Cassandra uses advanced concurrent structures to provide row-level isolation without locking. Cassandra eveneliminated the need for row-level locks for index updates in the recent 1.2 release.

A more subtle MongoDB limitation is that when adding or updating a field in a document, the entire document must be re-written. If you pre-allocate space for each document, you can avoid the associated fragmentation, but even with pre-allocation updating your document gets slower as it grows.

Cassandra’s storage engine only appends updated data, it never has to re-write or re-read existing data. Thus, updates to a Cassandra row or partition stay fast as your dataset grows.


Riak presents a document-based data model to the end user, but under the hood it maps everything to a key/value storage API. Thus, like MongoDB, updating any field in a document requires rewriting the whole thing.

However, Riak does emphasize the use of log-structured storage engines. Both the default BitCask backend and LevelDB are log-structured. Riak increasingly emphasizes LevelDB since BitCask does not support scan operations (which are required for indexes), but this brings its own set of problems.

LevelDB is a log-structured storage engine with a different approach to compaction than the one introduced by Bigtable. LevelDB trades more compaction i/o for less i/o at read time, which can be a good tradeoff for many workloads, but not all. Cassandra added support for leveldb-style compaction about a year ago.

LevelDB itself is designed to be an embedded database for the likes of Chrome, and clear growing pains are evident when pressed into service as a multi-user backend for Riak. (A LevelDB configuration for Voldemort also exists.) Basho cites “one stall every 2 hours for 10 to 30 seconds”, “cases that can still cause [compaction] infinite loops,” and no way to create snapshots or backups as of the recently released Riak 1.2.


HBase’s storage engine is the most similar to Cassandra’s; both drew on Bigtable’s design early on.

But despite a later start, Cassandra’s storage engine is far ahead of HBase’s today, in large part because building on HDFS instead of locally-managed storage makes everything harder for HBase. Cassandra added online snapshotsalmost four years ago; HBase still has a long ways to go.

HDFS also makes SSD support problematic for HBase, which is becoming increasingly relevant as SSD price/performance improves. Cassandra has excellent SSD support and even support for mixed SSD and HDD within the same cluster, with data pinned to the medium that makes the most sense for it.

Other differences that may not show up at benchmark time, but you would definitely notice in production:

HBase can’t delete data during minor compactions — you have to rewrite all the data in a region to reclaim disk space. Cassandra has deleted tombstones during minor compactions for over two years.

While you are running that major compaction, HBase gives you no way to throttle it and limit its impact on your application workload. Cassandra introduced this two years ago and continues to improve it. Dealing with local storage also lets Cassandra avoid polluting the page cache with sequential scans from compaction.

Compaction might seem like bookkeeping details, but it does impact the rest of the system. HBase limits you to two or three column families because of compaction and flushing limitations, forcing you to do sub-optimal things to your data model as a workaround.


I honestly think Cassandra is one to two years ahead of the competition, but I’m under no illusions that Cassandra itself is perfect. We have plenty of improvements to make still; from the recently released Cassandra 1.2 to our ticket backlog, there is no shortage of work to do.

Here are some of the areas I’d like to see Cassandra improve this year:

If working on an industry-leading, open-source database doing cutting edge performance work on the JVM sounds interesting to you, please get in touch.

Nine Databases in 45 Minutes

From http://www.datanami.com/datanami/2012-12-04/nine_databases_in_45_minutes.html


The following video provides a crash course in nine key databases:  Postgres, CouchDB, MarkLogic, Riak, VoltDB, MongoDB, Neo4j, HBase and Redis. All in just 45 minutes.

Miles Pomeroy, Chad Maughan, and Jonathan Geddes run through each database in five minutes each, thus the title, “9 Databases in 45 minutes.” The video proceeds in efficient and occasionally amusing fashion, where a countdown clock and a gong keep the presentations terse.

NoSQL Benchmark

There is probably no perfect NoSQL database. Every database has its advantages and disadvantages that become more or less important depending on your preferences and the type of tasks your trying to achieve.

Altoros Systems as performed an independent and interesting benchmark to help you sort out the current prons and crons between different solution including: HBase,Cassandra,Riak and MongoDb


What makes this research unique?

Often referred to as NoSQL, non-relational databases feature elasticity and scalability in combination with a capability to store big data and work with cloud computing systems, all of which make them extremely popular. NoSQL data management systems are inherently schema-free (with no obsessive complexity and a flexible data model) and eventually consistent (complying with BASE rather than ACID). They have a simple API, serve huge amounts of data and provide high throughput.

In 2012, the number of NoSQL products reached 120-plus and the figure is still growing. That variety makes it difficult to select the best tool for a particular case. Database vendors usually measure productivity of their products with custom hardware and software settings designed to demonstrate the advantages of their solutions. We wanted to do independent and unbiased research to complement the work done by the folks at Yahoo.

Using Amazon virtual machines to ensure verifiable results and research transparency (which also helped minimize errors due to hardware differences), we have analyzed and evaluated the following NoSQL solutions:

● Cassandra, a column family store
● HBase (column-oriented, too)
● MongoDB, a document-oriented database
● Riak, a key-value store

We also tested MySQL Cluster and sharded MySQL, taking them as benchmarks.

After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.

Riak 1.2.1 has been released

Riak 1.2.1 has been released, it includes some really important bug fixes.



  • Aggregation of non-streamed MapReduce results was improved. Previous versions used an O(n^2) process, where n is the number of outputs for a phase. The aggregation in Riak 1.2 is O(n). (riak_kv#331,333, riak-erlang-client#58,59).
  • Timeouts of MapReduce jobs now produce less error log spam. The safe-to-ignore-but-confusing {sink_died, normal} messages have been removed. (riak_pipe#45)
  • The riak-admin transfers command now reports the status of active transfers. This gives more insight into what transfers are occurring, their type, their running time, and the rate at which data is being transferred. Calling this command will no longer stall handoff.
  • The memory storage backend for Riak KV now supports secondary indexes and has a “test” mode that lets developers quickly clear all local storage (useful in the context of an external test suite).

Bugs Fixed


Riak 1.2 has been released

The Basho team just released Riak 1.2.

Here’s what’s new and improved since the Riak 1.1 release:

  • More efficiently add multiple Riak nodes to your cluster
  • Stage and review, then commit or abort cluster changes for easier operations; plus smoother handling of rolling upgrades
  • Better visibility into active handoffs
  • Repair Riak KV and Search partitions by attaching to the Riak Console and using a one-line command to recover from data corruption/loss
  • More performant stats for Riak; the addition of stats to Riak Search
  • 2i and Search usage thru the Protocol Buffers API
  • Official Support for Riak on FreeBSD
  • In Riak Enterprise: SSL encryption, better balancing and more granular control of replication across multiple data centers, NAT support




More information

Riak 1.2.0 RC2 has been released

Riak 1.2.0 RC2 has been release with the following bug fixes:
  • The new set of bugs fixed are the following:
  • riak_api – Add riak_core as application dep to riak_api.app
  • riak_api – Register riak_api_stat mod with riak_core at start up
  • Add eleveldb:close – Fixes MANIFEST file missing bug
  • riak_kv – Call eleveldb:close before destroy
  • riak_control – Resolve base64 cookie truncation race condition.
  • Fix FreeBSD package permissions on sbin
  • Create SmartOS SMF service for epmd


The downloads page has been updated with the new build:

The release notes have been updated with bugs fixed in RC2:

Riak 1.2 RC1 released

Riak 1.2 RC1 has been released, it introduces a bunch of improvments and new features such as:

  • Aggregation of non-streamed MapReduce results was improved. Previous versions used an O(n^2) process, where n is the number of outputs for a phase. The aggregation in Riak 1.2 is O(n). (riak_kv#331,333, riak-erlang-client#58,59).
  • Timeouts of MapReduce jobs now produce less error log spam. The safe-to-ignore-but-confusing {sink_died, normal}messages have been removed. (riak_pipe#45)
  • The riak-admin transfers command now reports the status of active transfers. This gives more insight into what transfers are occurring, their type, their running time, and the rate at which data is being transferred. Calling this command will no longer stall handoff.
  • The memory storage backend for Riak KV now supports secondary indexes and has a “test” mode that lets developers quickly clear all local storage (useful in the context of an external test suite).


    Download: http://basho.com/resources/downloads/

    Release note: https://github.com/basho/riak/blob/1.2/RELEASE-NOTES.md