Riak CS Open Sourced

Basho announced today that Riak CS is now open source under the Apache 2 license. Organizations and users can now access the source code on GitHub and download the latest packages from the downloads page.

Riak CS can be used to build private or public clouds or as reliable, available storage behind applications and platforms. Riak CS Enterprise is currently used by large corporations including Datapipe, Deutsche Vermögensberatung (DVAG), IDC Frontier, Rovio, and Yahoo! JAPAN.

Basho is a distributed systems company dedicated to making software that is available, fault-tolerant, and easy to operate at scale. Twenty-five percent of the Fortune 50 and thousands of open source users large and small run our flagship open source database, Riak. Riak CS takes distributed systems principles derived from production Riak users and applies them to the problem of large-scale storage. We are excited to share this code with the world.

Riak CS features:

  • Highly available, fault-tolerant storage
  • Large object support
  • S3-compatible API and authentication
  • Multi-tenancy and per-user reporting
  • Simple operational model for adding capacity
  • Robust stats for monitoring and metrics

For users requiring multi-datacenter replication and enterprise-level support, Riak CS Enterprise (a commercial extension of Riak CS) is available.

New Features

Today we are also announcing several new features, available now as part of the open source edition.

  • Multipart upload. Upload very large files to Riak CS as a series of parts. Parts can be between 5MB and 5GB.
  • Support for GET range queries. Retrieve a range of bytes from a single object. This functionality is implemented in the “Range” request header of GET operations.
  • Per-bucket policies to restrict access to buckets based on source IP.
  • Riak CS Control. Riak CS Control is a standalone web administration tool for user management, available on GitHub.
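
As a rough illustration of the multipart and range features above, the sketch below plans how a large object could be cut into parts and builds a "Range" request header. It is not Riak CS client code: plan_parts and range_header are hypothetical helper names, and the sketch assumes (as with S3) that only the final part may be shorter than the 5MB minimum.

```python
MB = 1024 * 1024
GB = 1024 * MB
MIN_PART = 5 * MB  # lower bound quoted in the announcement
MAX_PART = 5 * GB  # upper bound quoted in the announcement


def plan_parts(total_size, part_size):
    """Return (part_number, first_byte, last_byte) tuples; byte offsets are inclusive.

    Assumption: every part except the last must respect the 5MB-5GB bounds.
    """
    if not MIN_PART <= part_size <= MAX_PART:
        raise ValueError("part size must be between 5MB and 5GB")
    parts = []
    offset = 0
    number = 1
    while offset < total_size:
        last = min(offset + part_size, total_size) - 1
        parts.append((number, offset, last))
        offset = last + 1
        number += 1
    return parts


def range_header(first_byte, last_byte):
    """Build an HTTP Range header for an inclusive byte range of one object."""
    return {"Range": "bytes=%d-%d" % (first_byte, last_byte)}
```

For example, plan_parts(12 * MB, 5 * MB) yields two full 5MB parts followed by a 2MB tail, and range_header(0, 1023) asks for the first kilobyte of an object.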

Twitter's fatcache available on GitHub

fatcache is memcache on SSD. Think of fatcache as a cache for your big data.


There are two ways to think of SSDs in system design. One is to think of an SSD as an extension of disk, where it makes disks fast; the other is to think of it as an extension of memory, where it makes memory fat. The latter makes sense when persistence (non-volatility) is unnecessary and data is accessed over the network. Even though memory is a thousand times faster than SSD, network-connected, SSD-backed memory makes sense if the system is designed so that network latencies dominate SSD latencies by a large factor.

To understand why network connected SSD makes sense, it is important to understand the role distributed memory plays in large-scale web architecture. In recent years, terabyte-scale, distributed, in-memory caches have become a fundamental building block of any web architecture. In-memory indexes, hash tables, key-value stores and caches are increasingly incorporated for scaling throughput and reducing latency of persistent storage systems. However, power consumption, operational complexity and single node DRAM cost make horizontally scaling this architecture challenging. The current cost of DRAM per server increases dramatically beyond approximately 150 GB, and power cost scales similarly as DRAM density increases.

Fatcache extends a volatile, in-memory cache by incorporating SSD-backed storage.

SSD-backed memory presents a viable alternative for applications with large workloads that need to maintain high hit rate for high performance. SSDs have higher capacity per dollar and lower power consumption per byte, without degrading random read latency beyond network latency.

Fatcache achieves performance comparable to an in-memory cache by focusing on two design criteria:

  • Minimize disk reads on cache hit
  • Eliminate small, random disk writes

The latter is important due to SSDs’ unique write characteristics. Writes and in-place updates to SSDs degrade performance due to an erase-and-rewrite penalty and garbage collection of dead blocks. Fatcache batches small writes to obtain consistent performance and increased disk lifetime.

SSD reads happen at a page-size granularity, usually 4 KB. Single page read access times are approximately 50 to 70 usec and a single commodity SSD can sustain nearly 40K read IOPS at a 4 KB page size. 70 usec read latency dictates that disk latency will overtake typical network latency after a small number of reads. Fatcache reduces disk reads by maintaining an in-memory index for all on-disk data.
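
The two design criteria can be sketched with a toy model. This is not fatcache's actual code or on-disk layout: ToyFatCache, batch_bytes, and demo are hypothetical names, a plain file stands in for the SSD, and eviction and reclaiming dead space are left out. Small writes are collected in a memory buffer and appended to the file in one large write, and an in-memory index maps each key to its location so that a hit costs one read and a miss costs none.

```python
import os
import tempfile


class ToyFatCache:
    """Toy model: batched appends plus an in-memory index over on-disk data."""

    def __init__(self, path, batch_bytes=1 << 20):
        self.path = path
        self.batch_bytes = batch_bytes
        self.buffer = bytearray()  # small writes accumulate here
        self.buffered = {}         # key -> (offset in buffer, length)
        self.on_disk = {}          # key -> (offset in file, length)
        self.disk_end = 0          # current size of the backing file
        self.disk_reads = 0        # counts reads that touch the file
        open(path, "wb").close()   # start with an empty backing file

    def set(self, key, value):
        self.on_disk.pop(key, None)  # any older on-disk copy becomes dead space
        self.buffered[key] = (len(self.buffer), len(value))
        self.buffer += value
        if len(self.buffer) >= self.batch_bytes:
            self.flush()

    def flush(self):
        """Write the whole buffer with a single large append."""
        if not self.buffer:
            return
        with open(self.path, "ab") as f:
            f.write(self.buffer)
        for key, (offset, length) in self.buffered.items():
            self.on_disk[key] = (self.disk_end + offset, length)
        self.disk_end += len(self.buffer)
        self.buffer = bytearray()
        self.buffered = {}

    def get(self, key):
        if key in self.buffered:  # still in memory, no disk access
            offset, length = self.buffered[key]
            return bytes(self.buffer[offset:offset + length])
        if key not in self.on_disk:  # miss is answered from the index alone
            return None
        offset, length = self.on_disk[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            self.disk_reads += 1
            return f.read(length)


def demo():
    with tempfile.TemporaryDirectory() as d:
        cache = ToyFatCache(os.path.join(d, "slab.dat"), batch_bytes=8)
        cache.set("a", b"1234")
        cache.set("b", b"5678")  # buffer reaches 8 bytes, one batched write
        cache.set("c", b"zz")    # stays buffered
        return cache.get("b"), cache.get("c"), cache.get("missing"), cache.disk_reads
```

In the demo, only the lookup of "b" touches the file; the buffered key and the missing key are both resolved in memory.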


Intel releases open source GraphBuilder for big data

Intel has released an open source tool designed to improve firms’ handling and analysis of unstructured data.

Source: Intel official blog http://blogs.intel.com/intellabs/2012/12/06/graphbuilder/

Intel said that its GraphBuilder tool would aim to fill a market void in the handling of big data for machine learning. Currently available as a beta release, the tool allows developers to construct large graphs which can then be used with big data analysis frameworks.

“GraphBuilder not only constructs large-scale graphs fast but also offloads many of the complexities of graph construction, including graph formation, cleaning, compression, partitioning, and serialisation,” wrote Intel principal scientist Ted Willke.

“This makes it easy for just about anyone to build graphs for interesting research and commercial applications.”

Willke said that the tool was developed in a collaboration with researchers at the University of Washington in Seattle. The teams sought to address a perceived hole in the market for tools to build the graph data used for many big data analysis activities.

“Scanning the environment, we identified a more general hole in the open source ecosystem: A number of systems were out there to process, store, visualise, and mine graphs but, surprisingly, not to construct them from unstructured sources,” Willke explained.

“So, we set out to develop a demo of a scalable graph construction library for Hadoop.”

The researchers estimate that GraphBuilder can help big data platforms analyse data as much as 50 times faster than the conventional MapReduce system.

The project is one of many research efforts dedicated to improving the performance of big data analysis platforms. Last month, researchers from the University of California Berkeley showcased a pair of technologies dubbed ‘Spark’ and ‘Shark’ which promise to dramatically improve the performance of the Apache Hive big data system.

The big data market has been suffering from a general lack of qualified analysts and developers, say vendors. Companies have sought to help bridge the gap by extending training efforts and partnerships with universities.

Scalaris 0.5.0 codename "Saperda scalaris"

Scalaris 0.5.0 (codename “Saperda scalaris”) has been released. https://code.google.com/p/scalaris/

Scalaris is a scalable, transactional, distributed key-value store. It can be used for building scalable Web 2.0 services.

Scalaris uses a structured overlay with a non-blocking Paxos commit protocol for transaction processing with strong consistency over replicas. Scalaris is implemented in Erlang.


RethinkDB 1.2.2 has been released

RethinkDB Release 1.2.0 (Rashomon) was the first release of the product. Its features are listed by category below.

RethinkDB is built to store JSON documents and to scale to multiple machines with very little effort. It has a pleasant query language that supports useful queries such as table joins and group by, and it is easy to set up and learn.

Simple programming model:

  • JSON data model and immediate consistency.
  • Distributed joins, subqueries, aggregation, atomic updates.
  • Hadoop-style map/reduce.

Easy administration:

  • Friendly web and command-line administration tools.
  • Takes care of machine failures and network interrupts.
  • Multi-datacenter replication and failover.

Horizontal scalability:

  • Sharding and replication to multiple nodes.
  • Queries are automatically parallelized and distributed.
  • Lock-free operation via MVCC concurrency.


RethinkDB is simple to use but complex under the hood. Read the FAQ for technical information on advanced features, architectural tradeoffs, and limitations.

PostgreSQL 9.2 introduced JSON built-in data type

PostgreSQL 9.2 has introduced a new JSON-related feature: a built-in JSON data type. You can now store JSON fields directly in your database without an external format checker, because validation now happens inside Postgres core.

Built-in JSON data type.
Like the XML data type, we simply store JSON data as text, after checking
that it is valid. More complex operations such as canonicalization and
comparison may come later, but this is enough for now.
There are a few open issues here, such as whether we should attempt to
detect UTF-8 surrogate pairs represented as \uXXXX\uYYYY, but this gets
the basic framework in place.

A couple of system functions have also been added later to output some row or array data directly as json.

Add array_to_json and row_to_json functions.
Also move the escape_json function from explain.c to json.c where it
seems to belong.
Andrew Dunstan, Reviewed by Abhijit Menon-Sen.

What Postgres core actually does with JSON fields is store them as text fields (so the maximum size is 1GB); on top of that, a string format check is performed directly in core.

In case of a format error you will receive the following error message:

ERROR: invalid input syntax for type json
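
The "check, then store as text" behaviour can be imitated outside the database. The sketch below is only an analogy written with Python's standard json module, not PostgreSQL code: check_and_store and InvalidJSONInput are hypothetical names, and Python's parser may accept or reject edge cases differently from the Postgres one.

```python
import json


class InvalidJSONInput(ValueError):
    """Raised when the text is not valid JSON (mirrors the error message above)."""


def _reject_constant(name):
    # Python's parser accepts NaN and Infinity by default; strict JSON does not.
    raise ValueError("non-standard constant: " + name)


def check_and_store(text):
    """Validate the text as JSON, then return it unchanged, as a text column would keep it."""
    try:
        json.loads(text, parse_constant=_reject_constant)
    except ValueError as exc:
        raise InvalidJSONInput("invalid input syntax for type json") from exc
    return text  # stored verbatim: no canonicalization, whitespace preserved
```

Because the stored value is the original text, check_and_store('{"a":   1}') returns the string with its extra spaces intact, which matches the note that canonicalization and comparison are left for later.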


RedisLive the Redis dashboard

If you’re looking for something to get a better feeling for what’s going on with your Redis server, RedisLive is out and lets you visualize your Redis instances and analyze query patterns and spikes. It’s open source and freely available on GitHub.



Apache Hive 0.9.0 has been released

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

The 0.9.0 release continues the trend of extending Hive’s SQL support. Hive now understands the BETWEEN operator and the NULL-safe equality operator, and several new user-defined functions (UDFs) have been added. New UDFs include printf(), sort_array(), and java_method(). Also, the concat_ws() function has been modified to support input parameters consisting of arrays of strings.
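
The difference between ordinary and NULL-safe equality is easy to miss, so here is a small model of the semantics in Python, with None standing in for NULL. It follows the usual SQL rules (the NULL-safe operator is written <=> in HiveQL) and is not taken from Hive's source; the function names are made up for the illustration, and between() assumes non-NULL operands.

```python
def sql_equal(a, b):
    """Ordinary '=': a NULL on either side makes the result NULL (None)."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equal(a, b):
    """NULL-safe equality: always True or False, and two NULLs compare equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def between(x, low, high):
    """x BETWEEN low AND high for non-NULL operands: inclusive at both ends."""
    return low <= x <= high
```

The practical consequence is that a join or filter on null_safe_equal keeps rows where both sides are NULL, while one on sql_equal drops them because the comparison never evaluates to true.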

This Hive release also includes several significant improvements to the query compiler and execution engine. HIVE-2642 improved Hive’s ability to optimize UNION queries, HIVE-2881 made the map-side JOIN algorithm more efficient, and HIVE-2621 significantly improved Hive’s ability to generate optimized execution plans for queries that contain multiple GROUP BY clauses.

HBase users will also be interested in several improvements to Hive’s HBase StorageHandler, mainly:



Opa unified all web programming Javascript, HTML, CSS, PHP, and SQL

Opa is an open source, simple and unified platform for writing web applications. All aspects are directly written in Opa: Frontend code, backend code, database queries and configuration. And everything is strongly statically typed.

Opa is available on Mac OS X, Windows, Linux, and FreeBSD.

Designed specifically for web apps, Opa is a great tool for easily building real-time web applications and services as well as games! And thanks to its event-driven, non-blocking approach, Opa is perfect for writing any social application.


Learn more: http://www.opalang.org/

Microsoft Open Technologies released Redis on Windows

Microsoft has announced the release of an updated version of Redis on Windows, the first deliverable from the company’s Microsoft Open Technologies subsidiary.

Redis is an open-source, networked, in-memory, key-value data store with optional durability. It is written in ANSI C. The development of Redis is sponsored by VMware. In a blog post, Claudio Caldato, principal program manager for Microsoft Open Technologies, said the unit’s first deliverable is “a new and significant iteration” of Redis on Windows.

“The major improvements in this latest version involve the process of saving data on disk,” Caldato said. “Redis on Linux uses an OS feature called Fork/Copy On Write. This feature is not available on Windows, so we had to find a way to be able to mimic the same behavior without changing completely the save-on-disk process so as to avoid any future integration issues with the Redis code.”

Thus, the new version of Redis on Windows implements the Copy On Write process at the application level: Instead of relying on the OS, Microsoft added code to Redis so that some data structures are duplicated in such a way that Redis can still serve requests from clients while saving data on disk (thus achieving the same effect that Fork/Copy On Write does automatically on Linux), Caldato said.
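
The idea of doing copy-on-write in the application rather than in the OS can be shown with a toy key-value store. This is a sketch of the general technique only, not the Microsoft Open Technologies implementation: ToyStore and its methods are hypothetical. While a save is in progress, the first write to a key copies the old value aside, so the saver still sees the data as it was when the save began while clients keep reading and writing.

```python
class ToyStore:
    """Toy key-value store with an application-level copy-on-write snapshot."""

    def __init__(self):
        self.data = {}
        self.saving = False
        self.preserved = {}      # key -> value as it was when the save began
        self._snapshot_keys = []

    def _preserve(self, key):
        # Copy the old value aside only once per key, and only during a save.
        if self.saving and key in self.data and key not in self.preserved:
            self.preserved[key] = self.data[key]

    def set(self, key, value):
        self._preserve(key)
        self.data[key] = value

    def delete(self, key):
        self._preserve(key)
        self.data.pop(key, None)

    def begin_save(self):
        self.saving = True
        self.preserved = {}
        self._snapshot_keys = list(self.data)  # keys that exist at save time

    def snapshot_items(self):
        """Yield (key, value) pairs exactly as they were at begin_save()."""
        for key in self._snapshot_keys:
            if key in self.preserved:
                yield key, self.preserved[key]
            else:
                yield key, self.data[key]

    def end_save(self):
        self.saving = False
        self.preserved = {}
        self._snapshot_keys = []
```

Values are copied by reference here, so the sketch only holds for immutable values; a real store would have to duplicate the underlying data structures, which is the part the article says Microsoft added to Redis.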

Developers can find the code for this new version in the new MS Open Tech repository on GitHub, which is currently the place to work on the Windows version of Redis as per guidance from Salvatore Sanfilippo, the original author of the project, Caldato said.

“We will also continue working with the community to create a solid Windows port,” he said. “We consider this not to be production-ready code, but a solid code base to be shared with the community to solicit feedback: as such, while we pursue stabilization, we are keeping the older version as default/stable on the GitHub repository. To try out the new code, please go to the bksavecow branch.”

Meanwhile, in the next few weeks, Microsoft plans to test the code extensively so that developers can use it for more serious testing, Caldato said.