Cassandra performance review


Original article available here


Four years ago, well before starting DataStax, I evaluated the then-current crop of distributed databases and explained why I chose Cassandra. In a lot of ways, Cassandra was the least mature of the options, but I chose to take the long view and wanted to work on a project that got the fundamentals right; things like documentation and distributed tests could come later.


2012 saw that validated in a big way, as the most comprehensive NoSQL benchmark to date was published at the VLDB conference by researchers at the University of Toronto. They concluded,

In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput from 1 to 12 nodes.

As a sample, here are the throughput results from the mixed reads, writes, and (sequential) scans (see the chart in the original article):

I encourage you to take a few minutes to skim the full results.

There are both architectural and implementation reasons for Cassandra’s dominating performance here. Let’s get down into the weeds and see what those are.

Architecture

Cassandra incorporates a number of architectural best practices that affect performance. None are unique to Cassandra, but Cassandra is the only NoSQL system that incorporates all of them.

Fully distributed: Every Cassandra machine handles a proportionate share of every activity in the system. There are no special cases like the HDFS namenode or MongoDB mongos that require special treatment or special hardware to avoid becoming a bottleneck. And with every node the same, Cassandra is far simpler to install and operate, which has long-term implications for troubleshooting.

Log-structured storage engine: A log-structured engine that avoids overwrites and turns updates into sequential i/o is essential on both hard disks (HDD) and solid-state disks (SSD): on HDD because the seek penalty is so high, and on SSD to avoid write amplification and premature disk failure. This is why you see MongoDB performance go through the floor as the dataset size exceeds RAM (see the sketch after this list).

Tight integration with its storage engine: Voldemort and Riak support pluggable storage engines, which both limits them to a lowest-common-denominator of key/value pairs, and limits the optimizations that can be done with the distributed replication engine.

Locally-managed storage: HBase has an integrated, log-structured storage engine, but relies on HDFS for replication instead of managing storage locally. This means HBase is architecturally incapable of supporting Cassandra-style optimizations like putting the commitlog on a separate disk, or mixing SSD and HDD in a single cluster with appropriate data pinned to each.
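
To make the log-structured item above concrete, here is a minimal sketch of an append-only write path in Java. The class is hypothetical, nothing like Cassandra's actual code; it only illustrates the principle that every update lands at the tail of the current file, so the disk sees purely sequential i/o.

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    // Minimal sketch of a log-structured write path (hypothetical class, not
    // Cassandra's actual code). Every update is appended at the tail of the
    // current file, never written in place, so the disk sees only sequential
    // i/o: no seeks on HDD, no read-modify-write amplification on SSD.
    public class AppendOnlyLog {
        private final DataOutputStream out;

        public AppendOnlyLog(String path) throws IOException {
            // Append mode: the write position only ever moves forward.
            this.out = new DataOutputStream(new FileOutputStream(path, true));
        }

        // An update to ANY row lands at the tail; there is no seek back to
        // the row's previous location and no need to read the old value first.
        public void append(byte[] rowKey, byte[] columnValue) throws IOException {
            out.writeInt(rowKey.length);
            out.write(rowKey);
            out.writeInt(columnValue.length);
            out.write(columnValue);
        }

        public void sync() throws IOException {
            out.flush(); // a real engine would also fsync for durability
        }
    }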

Implementation

An architecture is only as good as its implementation. For the first years after Cassandra’s open-sourcing as an Apache project, every release was a learning experience. 0.3, 0.4, 0.5, 0.6, each attracted a new wave of users that exposed some previously unimportant weakness. Today, we estimate there are over a thousand production deployments of Cassandra, the most for any scalable database. Some are listed here. To paraphrase ESR, “With enough eyes, all performance problems are obvious.”

What are some implementation details relevant to performance? Let’s have a look at some of the options.

MongoDB

MongoDB can be a great alternative to MySQL, but it’s not really appropriate for the scale-out applications targeted by Cassandra. Still, as early members of the NoSQL category, the two do draw comparisons.

One important limitation in MongoDB is database-level locking: only one writer may modify a given database at a time. Support for collection-level locking (a collection is a set of documents, analogous to a relational table) is planned. With either database- or collection-level locking, other writers and readers are locked out, so even a small number of writes can produce stalls in read performance.
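
The effect is easy to model with a single reader/writer lock standing in for the database-level lock. This is a deliberately simplified sketch of the concept, not MongoDB's actual implementation:

    import java.util.concurrent.Callable;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Deliberately simplified model of database-level locking (not MongoDB's
    // actual implementation): one lock guards the entire database, so a
    // single active writer stalls every reader, however unrelated their data.
    public class DatabaseLockModel {
        private final ReentrantReadWriteLock dbLock = new ReentrantReadWriteLock();

        public void write(Runnable mutation) {
            dbLock.writeLock().lock(); // excludes ALL readers and writers
            try {
                mutation.run();        // even a quick write stalls every reader
            } finally {
                dbLock.writeLock().unlock();
            }
        }

        public <T> T read(Callable<T> query) throws Exception {
            dbLock.readLock().lock();  // blocks whenever any writer is active
            try {
                return query.call();
            } finally {
                dbLock.readLock().unlock();
            }
        }
    }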

Cassandra uses advanced concurrent structures to provide row-level isolation without locking; the recent 1.2 release even eliminated the need for row-level locks for index updates.
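
A minimal sketch of the lock-free idea, using copy-on-write plus compare-and-swap (hypothetical code, far simpler than Cassandra's real concurrent structures): readers always see a consistent, immutable snapshot of the row, and writers never block them.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicReference;

    // Sketch of lock-free row isolation via copy-on-write + CAS (hypothetical;
    // far simpler than Cassandra's real structures). Readers get an immutable
    // snapshot; writers retry on contention instead of blocking anyone.
    public class LockFreeRow {
        private final AtomicReference<Map<String, byte[]>> columns =
                new AtomicReference<Map<String, byte[]>>(
                        Collections.<String, byte[]>emptyMap());

        // Readers never take a lock: they load a reference to an immutable
        // snapshot that no writer will ever modify in place.
        public Map<String, byte[]> snapshot() {
            return columns.get();
        }

        public void put(String column, byte[] value) {
            while (true) {
                Map<String, byte[]> current = columns.get();
                Map<String, byte[]> updated = new HashMap<String, byte[]>(current);
                updated.put(column, value); // copy-on-write: mutate the copy only
                if (columns.compareAndSet(current,
                        Collections.unmodifiableMap(updated))) {
                    return; // published atomically; no reader was ever blocked
                }
                // another writer raced us; loop and retry against its result
            }
        }
    }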

A more subtle MongoDB limitation is that when adding or updating a field in a document, the entire document must be re-written. If you pre-allocate space for each document, you can avoid the associated fragmentation, but even with pre-allocation, updating your document gets slower as it grows.
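
The usual pre-allocation workaround is the padding trick, sketched below against the Java driver of that era (the collection and field names are made up for illustration): insert the document with a throwaway filler field at roughly its final size, then $unset the filler, leaving room for the document to grow in place.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;

    // Sketch of the document pre-allocation ("padding") trick, against the
    // Java driver of that era; the collection and field names are made up.
    // The filler reserves on-disk space so later growth does not force the
    // whole document to be moved and rewritten elsewhere.
    public class Preallocate {
        public static void insertPadded(DBCollection events, Object id,
                                        int expectedBytes) {
            // 1. Insert the document at roughly its final size.
            events.insert(new BasicDBObject("_id", id)
                    .append("padding", new byte[expectedBytes]));

            // 2. Remove the filler; the freed space stays with the document,
            //    so subsequent field additions can happen in place.
            events.update(new BasicDBObject("_id", id),
                    new BasicDBObject("$unset", new BasicDBObject("padding", 1)));
        }
    }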

Cassandra’s storage engine only appends updated data; it never has to re-write or re-read existing data. Thus, updates to a Cassandra row or partition stay fast as your dataset grows.

Riak

Riak presents a document-based data model to the end user, but under the hood it maps everything to a key/value storage API. Thus, like MongoDB, updating any field in a document requires rewriting the whole thing.

However, Riak does use log-structured storage engines: both the default Bitcask backend and LevelDB are log-structured. Riak increasingly emphasizes LevelDB, since Bitcask does not support the scan operations required for secondary indexes, but this brings its own set of problems.

LevelDB is a log-structured storage engine with a different approach to compaction than the one introduced by Bigtable. LevelDB trades more compaction i/o for less i/o at read time, which can be a good tradeoff for many workloads, but not all. Cassandra added support for LevelDB-style (leveled) compaction about a year ago.

LevelDB itself is designed to be an embedded database for the likes of Chrome, and it shows clear growing pains when pressed into service as a multi-user backend for Riak. (A LevelDB configuration for Voldemort also exists.) Basho cites “one stall every 2 hours for 10 to 30 seconds”, “cases that can still cause [compaction] infinite loops,” and no way to create snapshots or backups as of the recently released Riak 1.2.

HBase

HBase’s storage engine is the most similar to Cassandra’s; both drew on Bigtable’s design early on.

But despite a later start, Cassandra’s storage engine is far ahead of HBase’s today, in large part because building on HDFS instead of locally-managed storage makes everything harder for HBase. Cassandra added online snapshots almost four years ago; HBase still has a long way to go.

HDFS also makes SSD support problematic for HBase, which is becoming increasingly relevant as SSD price/performance improves. Cassandra has excellent SSD support and even support for mixed SSD and HDD within the same cluster, with data pinned to the medium that makes the most sense for it.

Other differences that may not show up at benchmark time, but you would definitely notice in production:

HBase can’t delete data during minor compactions; you have to rewrite all the data in a region (a major compaction) to reclaim disk space. Cassandra has deleted tombstones during minor compactions for over two years (see the first sketch after this list).

While you are running that major compaction, HBase gives you no way to throttle it and limit its impact on your application workload. Cassandra introduced compaction throttling two years ago and continues to improve it (see the second sketch after this list). Managing storage locally also lets Cassandra avoid polluting the page cache with the sequential scans from compaction.

Compaction might seem like a bookkeeping detail, but it does impact the rest of the system. HBase limits you to two or three column families because of compaction and flushing limitations, forcing you into sub-optimal data models as a workaround.
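
Conceptually, purging a tombstone during compaction is just a filter applied while merging sstables. Here is a simplified sketch with hypothetical types, not Cassandra's actual code:

    import java.util.Iterator;

    // Simplified sketch of tombstone purging during compaction (hypothetical
    // types, not Cassandra's actual code). While input sstables are merged
    // into a new file, a tombstone past the grace period is dropped along
    // with the data it shadows, provided no sstable outside this compaction
    // could still hold older data for the same key.
    public class CompactionMerge {
        static class Cell {
            final String key;
            final long timestampMillis;
            final boolean tombstone;
            Cell(String key, long timestampMillis, boolean tombstone) {
                this.key = key;
                this.timestampMillis = timestampMillis;
                this.tombstone = tombstone;
            }
        }

        interface Output {
            void write(Cell cell);
        }

        // 'merged' yields the newest surviving cell per key across the inputs.
        static void compact(Iterator<Cell> merged, Output out, long nowMillis,
                            long gcGraceSeconds, boolean keyMayExistElsewhere) {
            while (merged.hasNext()) {
                Cell newest = merged.next();
                if (newest.tombstone) {
                    boolean pastGrace =
                            nowMillis - newest.timestampMillis > gcGraceSeconds * 1000L;
                    if (pastGrace && !keyMayExistElsewhere) {
                        continue; // purged: neither marker nor shadowed data is written
                    }
                }
                out.write(newest); // live data, or a tombstone that must be kept
            }
        }
    }

Throttling compaction, in turn, amounts to a rate limiter wrapped around its reads and writes. Here is a minimal sketch of the idea (hypothetical code; in Cassandra the corresponding knob is the compaction_throughput_mb_per_sec setting in cassandra.yaml):

    // Minimal sketch of compaction throttling (hypothetical code; Cassandra
    // exposes this as compaction_throughput_mb_per_sec in cassandra.yaml).
    // Compaction sleeps whenever it gets ahead of its byte budget, capping
    // its impact on the i/o available to the live workload.
    public class CompactionThrottle {
        private final long bytesPerSecond;
        private final long startNanos = System.nanoTime();
        private long bytesSoFar = 0;

        public CompactionThrottle(long megabytesPerSecond) {
            this.bytesPerSecond = megabytesPerSecond * 1024L * 1024L;
        }

        // Called after each chunk of compaction i/o with the bytes just moved.
        public void acquire(long bytes) throws InterruptedException {
            bytesSoFar += bytes;
            double elapsedSeconds = (System.nanoTime() - startNanos) / 1e9;
            long budget = (long) (elapsedSeconds * bytesPerSecond);
            if (bytesSoFar > budget) {
                // Ahead of the configured rate: sleep off the excess.
                Thread.sleep((bytesSoFar - budget) * 1000L / bytesPerSecond);
            }
        }
    }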

Cassandra

I honestly think Cassandra is one to two years ahead of the competition, but I’m under no illusions that Cassandra itself is perfect. We still have plenty of improvements to make; from the recently released Cassandra 1.2 to our ticket backlog, there is no shortage of work to do.

Here are some of the areas I’d like to see Cassandra improve this year:

If working on an industry-leading, open-source database doing cutting edge performance work on the JVM sounds interesting to you, please get in touch.

HRider 1.0.1 has been released

HRider is a UI application that provides an easier way to view and manipulate data stored in HBase™, the distributed database that supports structured data storage for large tables.


https://github.com/NiceSystems/hrider


Nine Databases in 45 Minutes

From http://www.datanami.com/datanami/2012-12-04/nine_databases_in_45_minutes.html


The following video provides a crash course in nine key databases: Postgres, CouchDB, MarkLogic, Riak, VoltDB, MongoDB, Neo4j, HBase, and Redis. All in just 45 minutes.

Miles Pomeroy, Chad Maughan, and Jonathan Geddes run through each database in five minutes, hence the title, “Nine Databases in 45 Minutes.” The video proceeds in efficient and occasionally amusing fashion, with a countdown clock and a gong keeping the presentations terse.

NoSQL Benchmark

There is probably no perfect NoSQL database. Every database has its advantages and disadvantages that become more or less important depending on your preferences and the type of tasks you’re trying to accomplish.

Altoros Systems has performed an independent and interesting benchmark to help you sort out the pros and cons of the different solutions, including HBase, Cassandra, Riak, and MongoDB:

http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html

What makes this research unique?

Often referred to as NoSQL, non-relational databases feature elasticity and scalability in combination with a capability to store big data and work with cloud computing systems, all of which make them extremely popular. NoSQL data management systems are inherently schema-free (with no obsessive complexity and a flexible data model) and eventually consistent (complying with BASE rather than ACID). They have a simple API, serve huge amounts of data and provide high throughput.

In 2012, the number of NoSQL products reached 120-plus and the figure is still growing. That variety makes it difficult to select the best tool for a particular case. Database vendors usually measure productivity of their products with custom hardware and software settings designed to demonstrate the advantages of their solutions. We wanted to do independent and unbiased research to complement the work done by the folks at Yahoo.

Using Amazon virtual machines to ensure verifiable results and research transparency (which also helped minimize errors due to hardware differences), we have analyzed and evaluated the following NoSQL solutions:

● Cassandra, a column family store
● HBase (column-oriented, too)
● MongoDB, a document-oriented database
● Riak, a key-value store

We also tested MySQL Cluster and sharded MySQL, taking them as benchmarks.

After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.

HBase 0.94.1 has been released

HBase 0.94.1 has been released. It is a bug fix release with 156 issues resolved against it.

It should be considered the current stable release of HBase at this point.

For a complete list of changes, see release notes.

Release notes: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310753&version=12320257

Download: http://www.apache.org/dyn/closer.cgi/hbase/

Apache HBase 0.94 has been released

Apache HBase 0.94.0 has been released and can be downloaded here.

This is the first major release since the January 22nd HBase 0.92 release.

The main focus of the HBase 0.94.0 release was on performance enhancements and the addition of new features, along with several major bug fixes.

Performance Related JIRAs

Below are a few of the important performance related JIRAs:

  • Read caching improvements: HDFS stores data in one block file and its corresponding metadata (checksum) in another block file. This means that every read into the HBase block cache may consume up to two disk ops, one to the data file and one to the checksum file. HBASE-5074: “Support checksums in HBase block cache” adds a block-level checksum in the HFile itself in order to avoid one of those disk ops, boosting read performance. This feature is enabled by default.
  • Seek optimizations: Until now, if there were several StoreFiles for a column family in a region, HBase would seek in each such file and merge the results, even if the row/column we are looking for is in the most recent file. HBASE-4465: “Lazy Seek optimization of StoreFile Scanners” optimizes scanner reads to read the most recent StoreFile first by lazily seeking the StoreFiles. This is achieved by introducing a fake KeyValue with its timestamp equal to the maximum timestamp present in that StoreFile, so a disk seek is avoided until the KeyValueScanner for a StoreFile bubbles up the heap, implying a real read operation is actually needed. This should provide a significant read performance boost, especially for IncrementColumnValue operations where we care only about the latest value (see the sketch after this list). This feature is enabled by default.
  • Write-to-WAL optimizations: HBase write throughput is upper-bounded by the write rate of the WAL, where the log is replicated to a number of datanodes depending on the replication factor. HBASE-4608: “HLog Compression” adds a custom dictionary-based compression of HLogs for faster replication to HDFS datanodes, thus improving the overall write rate for HBase. This feature is considered experimental and is off by default.
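
As promised in the seek-optimization item above, here is a simplified illustration of the lazy-seek idea from HBASE-4465. The types below are hypothetical stand-ins, not HBase's actual classes: scanners enter the heap ordered by a "fake" key built from each file's max timestamp (known from metadata, no i/o), and a real disk seek happens only when a scanner reaches the top of the heap.

    import java.util.PriorityQueue;

    // Simplified illustration of lazy seeking (hypothetical stand-in types,
    // not HBase's actual classes). Each scanner first enters the heap with a
    // "fake" key derived from its file's max timestamp, obtainable from file
    // metadata without disk i/o. Only the scanner that reaches the top of
    // the heap pays for a real seek; older files may never be read at all.
    public class LazySeekDemo {
        static class FileScanner implements Comparable<FileScanner> {
            final String file;
            final long maxTimestamp;  // from metadata: free to obtain
            Long seekedTimestamp;     // set only by the expensive real seek

            FileScanner(String file, long maxTimestamp) {
                this.file = file;
                this.maxTimestamp = maxTimestamp;
            }

            long effectiveTimestamp() {
                return seekedTimestamp != null ? seekedTimestamp : maxTimestamp;
            }

            void realSeek() {
                System.out.println("real disk seek into " + file);
                seekedTimestamp = maxTimestamp; // stand-in for the value read
            }

            // Newest possible data first, so poll() returns the scanner whose
            // file could hold the most recent version of the cell.
            public int compareTo(FileScanner other) {
                return Long.compare(other.effectiveTimestamp(), effectiveTimestamp());
            }
        }

        public static void main(String[] args) {
            PriorityQueue<FileScanner> heap = new PriorityQueue<FileScanner>();
            heap.add(new FileScanner("storefile-oldest", 1000L));
            heap.add(new FileScanner("storefile-middle", 2000L));
            heap.add(new FileScanner("storefile-newest", 3000L));

            // Reading the latest version of a cell: only the top scanner does
            // a real seek. (A full implementation would re-insert it and
            // re-check the ordering, since the real key may turn out to be
            // older than the optimistic fake one.)
            FileScanner top = heap.poll();
            top.realSeek();
            System.out.println("served from " + top.file + "; "
                    + heap.size() + " older files were never touched");
        }
    }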

New Feature Related JIRAs

Here is a list of some of the important JIRAs related to adding new features:

  • More powerful first aid box: The previous hbck tool did a good job of fixing inconsistencies related to region assignments but lacked basic features like fixing orphaned regions, region holes, overlapping regions, etc. HBASE-5128: “Uber hbck” adds these missing features to the first aid box.
  • Simplified region sizing: Deciding on a region size is always tricky, as it depends on a number of dynamic parameters such as data size, cluster size, workload, etc. HBASE-4365: “Heuristic for Region size” adds a heuristic that increases the split size threshold of a table’s regions as the data grows, thus limiting the number of region splits.
  • Smarter transaction semantics: Though HBase supports single-row transactions, a series of updates (Puts/Deletes) to an individual row used to take the row lock separately for each operation. HBASE-3584: “Atomic Put & Delete in a single transaction” enhances the single-row locking semantics by allowing Puts and Deletes on a row to be executed in a single call (a usage sketch follows this list). This feature is on by default.
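
Here is what the atomic call looks like against the 0.94-era client API (the table, column family, and qualifier names below are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.RowMutations;
    import org.apache.hadoop.hbase.util.Bytes;

    // Atomic Put + Delete on one row via HBASE-3584 (0.94-era client API;
    // table/family/qualifier names are made up for illustration).
    public class AtomicRowUpdate {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");
            byte[] row = Bytes.toBytes("user-42");

            Put put = new Put(row);
            put.add(Bytes.toBytes("info"), Bytes.toBytes("email"),
                    Bytes.toBytes("new@example.com"));

            Delete delete = new Delete(row);
            delete.deleteColumns(Bytes.toBytes("info"), Bytes.toBytes("pending"));

            RowMutations mutations = new RowMutations(row);
            mutations.add(put);
            mutations.add(delete);

            // Both changes apply under a single row lock: readers see either
            // the old state or the new one, never a half-applied mix.
            table.mutateRow(mutations);
            table.close();
        }
    }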

This major release has a number of new features and bug fixes: a total of 397 resolved JIRAs, with 140 enhancements and 180 bug fixes. It is compatible with 0.92, which opens up a window of opportunity to backport some of the cool features into CDH4, which is based on the 0.92 branch.


HBase 0.92.1 has been released

Apache HBase 0.92.1 is now available. This release is a marked improvement in system correctness, availability, and ease of use. It’s also backwards compatible with 0.92.0 — except for the removal of the rarely-used transform functionality from the REST interface in HBASE-5228.

Apache HBase 0.92.1 is a bug fix release covering 61 issues, including 6 blockers and 6 critical issues.

HBase 0.90.6 has been released

Apache HBase 0.90.6 is now available. It is a bug fix release covering 31 bugs and 5 improvements. Among them, 3 are blockers and 3 are critical, such as:

  • HBASE-5008: HBase cannot serve a region when it fails to flush it and considers it stuck in flushing,
  • HBASE-4773: HBaseAdmin may leak ZooKeeper connections,
  • HBASE-5060: HBase client may be blocked forever when there is a temporary network failure.

This release has improved system robustness and availability by fixing bugs that cause potential data loss, system unavailability, possible deadlocks, read inconsistencies and resource leakage.

The 0.90.6 release is backward compatible with 0.90.5. The fixes in this release will be included in CDH3u4.

Hypertable getting a new website and smashing HBase in performance

Hypertable just got a new website; check it out at http://www.hypertable.com/. The documentation section has been thoroughly revised.

But it’s not just a new website: according to highscalability.com and their benchmark, Hypertable delivers 2X better throughput than HBase in most tests; HBase fails the 41 and 167 billion record insert tests, overwhelmed by garbage collection; and both systems deliver similar results for the random read uniform test.


Read the full performance test at:

http://highscalability.com/blog/2012/2/7/hypertable-routs-hbase-in-performance-test-hbase-overwhelmed.html


HBase 0.90.5 has been released

With 81 issues resolved, HBase (the Hadoop database) version 0.90.5 has been released.

According to the roadmap, this release aims to be a bug-fix-only release and is backward compatible with 0.90.4.

The 81 issues included 5 considered blockers and 11 considered critical. Several resource leakage issues have also been resolved. The release also includes some new supporting features, including improvements to hbck and an offline meta-rebuild disaster recovery mechanism.

Download:

http://apache.mirrors.pair.com/hbase/hbase-0.90.5/