LevelDB | Data story

LevelDB: a fast and lightweight key/value database library by Google

https://code.google.com/p/leveldb/

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

Features

  • Keys and values are arbitrary byte arrays.
  • Data is stored sorted by key.
  • Callers can provide a custom comparison function to override the sort order.
  • The basic operations are Put(key,value), Get(key), and Delete(key) (a short C++ sketch of these appears after this list).
  • Multiple changes can be made in one atomic batch.
  • Users can create a transient snapshot to get a consistent view of data.
  • Forward and backward iteration is supported over the data.
  • Data is automatically compressed using the Snappy compression library.
  • External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions.
  • Detailed documentation about how to use the library is included with the source code.
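
For illustration, here is a minimal sketch of these features using LevelDB's C++ API. The database path, keys, and values are arbitrary placeholders, and error handling is reduced to asserts.

   #include <cassert>
   #include <iostream>
   #include <string>

   #include "leveldb/db.h"
   #include "leveldb/write_batch.h"

   int main() {
     // Open (or create) a database at an example path.
     leveldb::DB* db;
     leveldb::Options options;
     options.create_if_missing = true;
     leveldb::Status s = leveldb::DB::Open(options, "/tmp/exampledb", &db);
     assert(s.ok());

     // Basic operations: Put, Get, Delete.
     s = db->Put(leveldb::WriteOptions(), "key1", "value1");
     std::string value;
     s = db->Get(leveldb::ReadOptions(), "key1", &value);
     s = db->Delete(leveldb::WriteOptions(), "key1");

     // Multiple changes applied as one atomic batch.
     leveldb::WriteBatch batch;
     batch.Put("key1", "value1");
     batch.Put("key2", "value2");
     s = db->Write(leveldb::WriteOptions(), &batch);

     // A transient snapshot provides a consistent view for reads.
     leveldb::ReadOptions snapshot_options;
     snapshot_options.snapshot = db->GetSnapshot();
     s = db->Get(snapshot_options, "key2", &value);
     db->ReleaseSnapshot(snapshot_options.snapshot);

     // Forward iteration over the sorted key space (reverse iteration
     // works via SeekToLast()/Prev()).
     leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
     for (it->SeekToFirst(); it->Valid(); it->Next()) {
       std::cout << it->key().ToString() << ": " << it->value().ToString() << std::endl;
     }
     delete it;

     delete db;
     return 0;
   }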

Limitations

  • This is not a SQL database. It does not have a relational data model, it does not support SQL queries, and it has no support for indexes.
  • Only a single process (possibly multi-threaded) can access a particular database at a time.
  • There is no client-server support built in to the library. An application that needs such support will have to wrap its own server around the library.

Performance

Here is a performance report (with explanations) from the run of the included db_bench program. The results are somewhat noisy, but should be enough to get a ballpark performance estimate.

Setup

We use a database with a million entries. Each entry has a 16 byte key, and a 100 byte value. Values used by the benchmark compress to about half their original size.

   LevelDB:    version 1.1
   Date:       Sun May  1 12:11:26 2011
   CPU:        4 x Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
   CPUCache:   4096 KB
   Keys:       16 bytes each
   Values:     100 bytes each (50 bytes after compression)
   Entries:    1000000
   Raw Size:   110.6 MB (estimated)
   File Size:  62.9 MB (estimated)

 

Write performance

The “fill” benchmarks create a brand new database, in either sequential or random order. The “fillsync” benchmark flushes data from the operating system to the disk after every operation; the other write operations leave the data sitting in the operating system buffer cache for a while. The “overwrite” benchmark does random writes that update existing keys in the database.

 

   fillseq      :       1.765 micros/op;   62.7 MB/s     
   fillsync     :     268.409 micros/op;    0.4 MB/s (10000 ops)
   fillrandom   :       2.460 micros/op;   45.0 MB/s     
   overwrite    :       2.380 micros/op;   46.5 MB/s

 

Each “op” above corresponds to a write of a single key/value pair; i.e., the random write benchmark, at 2.460 micros/op, runs at approximately 400,000 writes per second.

Each “fillsync” operation costs much less (0.3 millisecond) than a disk seek (typically 10 milliseconds). We suspect that this is because the hard disk itself is buffering the update in its memory and responding before the data has been written to the platter. This may or may not be safe based on whether or not the hard disk has enough power to save its memory in the event of a power failure.
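
For reference, the difference between “fillsync” and the buffered write benchmarks corresponds to the sync flag on LevelDB's WriteOptions. A minimal sketch (the key and value here are placeholders):

   #include "leveldb/db.h"

   // Buffered write: returns once the data is in the operating system
   // buffer cache (the behaviour measured by fillseq/fillrandom/overwrite).
   void BufferedPut(leveldb::DB* db) {
     db->Put(leveldb::WriteOptions(), "key", "value");
   }

   // Synchronous write: the sync flag forces the data to be pushed from the
   // operating system buffers to the disk before the call returns, as in fillsync.
   void SyncPut(leveldb::DB* db) {
     leveldb::WriteOptions write_options;
     write_options.sync = true;
     db->Put(write_options, "key", "value");
   }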

Read performance

We list the performance of reading sequentially in both the forward and reverse direction, and also the performance of a random lookup. Note that the database created by the benchmark is quite small. Therefore the report characterizes the performance of leveldb when the working set fits in memory. The cost of reading a piece of data that is not present in the operating system buffer cache will be dominated by the one or two disk seeks needed to fetch the data from disk. Write performance will be mostly unaffected by whether or not the working set fits in memory.

 

   readrandom   :      16.677 micros/op;  (approximately 60,000 reads per second)
   readseq      :       0.476 micros/op;  232.3 MB/s    
   readreverse  :       0.724 micros/op;  152.9 MB/s

 

LevelDB compacts its underlying storage data in the background to improve read performance. The results listed above were done immediately after a lot of random writes. The results after compactions (which are usually triggered automatically) are better.

 

   readrandom   :      11.602 micros/op;  (approximately 85,000 reads per second)   
   readseq      :       0.423 micros/op;  261.8 MB/s    
   readreverse  :       0.663 micros/op;  166.9 MB/s

 

Some of the high cost of reads comes from repeated decompression of blocks read from disk. If we supply enough cache to leveldb so that it can hold the uncompressed blocks in memory, the read performance improves again:

   readrandom   :       9.775 micros/op;  (approximately 100,000 reads per second before compaction)
   readrandom   :       5.215 micros/op;  (approximately 190,000 reads per second after compaction)
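
The cache in question is LevelDB's block cache, which holds uncompressed blocks. Below is a minimal sketch of configuring it when opening a database; the 100 MB figure is just an example, not the value used in the benchmark above.

   #include <string>

   #include "leveldb/cache.h"
   #include "leveldb/db.h"

   leveldb::DB* OpenWithBlockCache(const std::string& path) {
     leveldb::Options options;
     options.create_if_missing = true;
     // Cache roughly 100 MB of uncompressed blocks so hot blocks are not
     // repeatedly decompressed. The Cache object is owned by the caller and
     // should be deleted only after the DB is closed.
     options.block_cache = leveldb::NewLRUCache(100 * 1048576);
     leveldb::DB* db = nullptr;
     leveldb::Status s = leveldb::DB::Open(options, path, &db);
     return s.ok() ? db : nullptr;
   }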


SSTable and Log Structured Storage: LevelDB

Originally posted on February 06, 2012.

If Protocol Buffers is the lingua franca of individual data records at Google, then the Sorted String Table (SSTable) is one of the most popular outputs for storing, processing, and exchanging datasets. As the name itself implies, an SSTable is a simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high throughput, sequential read/write workloads.

Unfortunately, the SSTable name itself has also been overloaded by the industry to refer to services that go well beyond just the sorted table, which has only added unnecessary confusion to what is a very simple and useful data structure on its own. Let’s take a closer look under the hood of an SSTable and how LevelDB makes use of it.

SSTable: Sorted String Table

Imagine we need to process a large workload where the input is gigabytes or terabytes in size. Additionally, we need to run multiple steps on it, which must be performed by different binaries – in other words, imagine we are running a sequence of Map-Reduce jobs! Due to the size of the input, reading and writing data can dominate the running time. Hence, random reads and writes are not an option; instead, we will want to stream the data in and, once we’re done, flush it back to disk as a streaming operation. This way, we can amortize the disk I/O costs. Nothing revolutionary, moving right along.

 

A “Sorted String Table” then is exactly what it sounds like: a file which contains a set of arbitrary, sorted key-value pairs. Duplicate keys are fine, there is no need for “padding” for keys or values, and keys and values are arbitrary blobs. Read in the entire file sequentially and you have a sorted index. Optionally, if the file is very large, we can also prepend, or create a standalone, key:offset index for fast access. That’s all an SSTable is: very simple, but also a very useful way to exchange large, sorted data segments.
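
As a toy illustration of the idea (not LevelDB's actual on-disk format), the sketch below writes already-sorted key-value records sequentially and builds a standalone key:offset index so a reader can later seek straight to any record; the function name and layout are made up for this example.

   #include <cstdint>
   #include <fstream>
   #include <map>
   #include <string>
   #include <utility>
   #include <vector>

   // Illustrative only: a toy "sorted string table" writer.
   // Records are written in key order; a key -> byte offset index is built
   // alongside so a reader can seek directly to any record.
   void WriteToySSTable(const std::map<std::string, std::string>& sorted_data,
                        const std::string& data_path,
                        const std::string& index_path) {
     std::ofstream data(data_path, std::ios::binary);
     std::vector<std::pair<std::string, uint64_t>> index;

     uint64_t offset = 0;
     for (const auto& [key, value] : sorted_data) {  // std::map iterates in key order
       index.emplace_back(key, offset);
       uint32_t klen = key.size(), vlen = value.size();
       data.write(reinterpret_cast<const char*>(&klen), sizeof(klen));
       data.write(key.data(), klen);
       data.write(reinterpret_cast<const char*>(&vlen), sizeof(vlen));
       data.write(value.data(), vlen);
       offset += sizeof(klen) + klen + sizeof(vlen) + vlen;
     }

     // Standalone key:offset index; it could just as well be appended to the
     // data file itself.
     std::ofstream idx(index_path);
     for (const auto& [key, off] : index) {
       idx << key << '\t' << off << '\n';
     }
   }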

SSTable and BigTable: Fast random access?

Once an SSTable is on disk it is effectively immutable, because an insert or delete would require a large I/O rewrite of the file. Having said that, for static indexes it is a great solution: read in the index, and you are always one disk seek away, or simply memory-map the entire file. Random reads are fast and easy.

Random writes are much harder and more expensive, that is, unless the entire table is in memory, in which case we’re back to simple pointer manipulation. Turns out, this is the very problem that Google’s BigTable set out to solve: fast read/write access for datasets petabytes in size, backed by SSTables underneath. How did they do it?

SSTables and Log Structured Merge Trees

We want to preserve the fast read access which SSTables give us, but we also want to support fast random writes. Turns out, we already have all the necessary pieces: random writes are fast when the SSTable is in memory (let’s call it MemTable), and if the table is immutable then an on-disk SSTable is also fast to read from. Now let’s introduce the following conventions:

 

  1. On-disk SSTable indexes are always loaded into memory
  2. All writes go directly to the MemTable index
  3. Reads check the MemTable first and then the SSTable indexes
  4. Periodically, the MemTable is flushed to disk as an SSTable
  5. Periodically, on-disk SSTables are “collapsed together”

What have we done here? Writes are always done in memory and hence are always fast. Once the MemTable reaches a certain size, it is flushed to disk as an immutable SSTable. However, we will maintain all the SSTable indexes in memory, which means that for any read we can check the MemTable first, and then walk the sequence of SSTable indexes to find our data. Turns out, we have just reinvented the “Log-Structured Merge-Tree” (LSM tree), described by Patrick O’Neil, and this is also the very mechanism behind “BigTable Tablets”.
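
As a toy illustration of conventions 2 and 3 above (again, not LevelDB's implementation), the sketch below keeps the MemTable as an in-memory sorted map, sends every write to it, and serves reads by checking the MemTable first and then the flushed tables from newest to oldest; the class name is invented for this example.

   #include <map>
   #include <optional>
   #include <string>
   #include <vector>

   class ToyLSM {
    public:
     // Convention 2: all writes go to the in-memory MemTable.
     void Put(const std::string& key, const std::string& value) {
       memtable_[key] = value;
       // A real implementation would flush memtable_ to an immutable on-disk
       // SSTable once it reaches a size threshold (convention 4).
     }

     // Convention 3: reads check the MemTable first, then the flushed tables
     // from newest to oldest, so the most recent value always wins.
     std::optional<std::string> Get(const std::string& key) const {
       if (auto it = memtable_.find(key); it != memtable_.end()) return it->second;
       for (auto table = flushed_.rbegin(); table != flushed_.rend(); ++table) {
         if (auto it = table->find(key); it != table->end()) return it->second;
       }
       return std::nullopt;
     }

    private:
     std::map<std::string, std::string> memtable_;
     // Stand-in for the immutable on-disk SSTables (oldest first).
     std::vector<std::map<std::string, std::string>> flushed_;
   };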

LSM & SSTables: Updates, Deletes and Maintenance

This “LSM” architecture provides a number of interesting behaviors: writes are always fast regardless of the size of dataset (append-only), and random reads are either served from memory or require a quick disk seek. However, what about updates and deletes?

Once the SSTable is on disk, it is immutable, hence updates and deletes can’t touch the data. Instead, a more recent value is simply stored in the MemTable in the case of an update, and a “tombstone” record is appended for deletes. Because we check the indexes in sequence, future reads will find the updated or the tombstone record without ever reaching the older values! Finally, having hundreds of on-disk SSTables is also not a great idea, hence periodically we will run a process to merge the on-disk SSTables, at which time the update and delete records will overwrite and remove the older data.
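
Continuing the toy sketch, a delete can be recorded as a tombstone (here, an empty optional), and merging an older table with a newer one keeps the newest record for each key and drops keys whose surviving record is a tombstone. This is only an illustration of the idea, not LevelDB's compaction code.

   #include <iterator>
   #include <map>
   #include <optional>
   #include <string>

   // A missing optional value marks a deleted key (a "tombstone").
   using ToyTable = std::map<std::string, std::optional<std::string>>;

   // Merge an older and a newer table: the newer record wins for duplicate
   // keys, and keys whose surviving record is a tombstone are dropped.
   ToyTable Merge(const ToyTable& older, const ToyTable& newer) {
     ToyTable merged = older;
     for (const auto& [key, value] : newer) merged[key] = value;  // newer overwrites
     for (auto it = merged.begin(); it != merged.end();) {
       it = it->second.has_value() ? std::next(it) : merged.erase(it);
     }
     return merged;
   }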

SSTables and LevelDB

Take an SSTable, add a MemTable and apply a set of processing conventions, and what you get is a nice database engine for certain types of workloads. In fact, Google’s BigTable, Hadoop’s HBase, and Cassandra, amongst others, are all using a variant or a direct copy of this very architecture.

Simple on the surface, but as usual, implementation details matter a great deal. Thankfully, Jeff Dean and Sanjay Ghemawat, the original contributors to the SSTable and BigTable infrastructure at Google, released LevelDB earlier last year, which is more or less an exact replica of the architecture we’ve described above:

  • SSTable under the hood, MemTable for writes
  • Keys and values are arbitrary byte arrays
  • Support for Put, Get, Delete operations
  • Forward and backward iteration over data
  • Built-in Snappy compression

Designed to be the engine for IndexedDB in WebKit (aka, embedded in your browser), it is easy to embed, fast, and, best of all, takes care of all the SSTable and MemTable flushing, merging, and other gnarly details.

Working with LevelDB: Ruby

LevelDB is a library, not a standalone server or service – although you could easily implement one on top. To get started, grab your favorite language bindings (ruby), and let’s see what we can do:

   require 'leveldb' # gem install leveldb-ruby

   db = LevelDB::DB.new "/tmp/db"

   db.put "b", "bar"
   db.put "a", "foo"
   db.put "c", "baz"

   puts db.get "a" # => foo

   db.each do |k, v|
     p [k, v] # => ["a", "foo"], ["b", "bar"], ["c", "baz"]
   end

   db.to_a # => [["a", "foo"], ["b", "bar"], ["c", "baz"]]

We can store keys, retrieve them, and perform a range scan all with a few lines of code. The mechanics of maintaining the MemTables, merging the SSTables, and the rest is taken care of for us by LevelDB – nice and simple.

LevelDB in WebKit and Beyond

SSTable is a very simple and useful data structure – a great bulk input/output format. However, what makes the SSTable fast (sorted and immutable) is also what exposes its very limitations. To address this, we’ve introduced the idea of a MemTable, and a set of “log structured” processing conventions for managing the many SSTables.

All simple rules, but as always, implementation details matter, which is why LevelDB is such a nice addition to the open-source database engine stack. Chances are, you will soon find LevelDB embedded in your browser, on your phone, and in many other places. Check out the LevelDB source, scan the docs, and take it for a spin.

 

Google open sourcing LevelDB

There is one more NoSQL solution: Google has just open sourced LevelDB!

LevelDB is a key-value storage engine written in C++, and its source code has now been released under a BSD-style license. Google designed LevelDB as a building block for a higher-level storage system, and it will in fact form the basis for the IndexedDB API – a new web standard for apps that need a database – in future versions of Google Chrome.

The things you need to know about LevelDB:

  • Open source + great performance
  • This is not a SQL database: it does not have a relational data model, it does not support SQL queries, and it has no support for indexes.
  • Only a single process (possibly multi-threaded) can access a particular database at a time.
  • There is no client-server support. So unlike MySQL, MongoDB, and CouchDB, for instance, your application cannot connect to and operate on LevelDB remotely: it is a library included in your code, just like SQLite (if client-server support is needed, you would have to wrap your own server around the library).

 

Find out more and download the source code from the project page on Google Code.
