
Spark the Open Source Future of Big Data

Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much more quickly than with disk-based systems like Hadoop MapReduce.

To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark is open source under a BSD license, so download it to check it out.

MySQL man pages silently relicensed away from GPL

[Amended] According to mysql.com, this was a bug.

It has recently come to light that the MySQL man pages have been relicensed. The change was made rather silently between MySQL 5.5.30 and MySQL 5.5.31. It affects all pages in the man/ directory of the source code.

You can tell the changes have come during this short timeframe (5.5.30->5.5.31). The old manual pages were released under the following license:

This documentation is free software; you can redistribute it and/or modify it only under the terms of the GNU General Public License as published by the Free Software Foundation; version 2 of the License.

The new man pages (5.5.31 and later, still the case in 5.5.32) are released under the following license:

This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.

This is clearly not very friendly of MySQL at Oracle.

UnQLite

UnQLite is an embeddable NoSQL (key/value store and document store) database engine. Unlike most other NoSQL databases, UnQLite does not have a separate server process: it reads and writes directly to ordinary disk files. A complete database with multiple collections is contained in a single disk file. The database file format is cross-platform, so you can freely copy a database between 32-bit and 64-bit systems or between big-endian and little-endian architectures.

More information on the official website: http://www.unqlite.org/

TheBigDB of facts

TheBigDB is a very loosely structured database of facts, free and open to everybody.

http://thebigdb.com/

 

Through a very simple API you can browse the database and access facts such as:

  • { nodes: ["Gold", "atomic radius", "144 pm"] }
  • { nodes: ["Bill Clinton", "job", "President of the United States"], period: { from: "1993-01-20 12:00:00", to: "2001-01-20 11:59:59" } }
  • { nodes: ["Apple", "average weight", "150g"] }

That’s it. Really.

Anyone can create, upvote or downvote a statement.

There are no datatypes, namespaces, lists or domains. Just nodes, one after the other, with a simple and easy to use API to search through them.
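
The statement format shown above is simple enough to model locally. The following is a minimal sketch and not TheBigDB's actual API: it stores the example statements as plain Python dictionaries and searches them by node position. The `search` helper and its wildcard convention (`None` matches any node) are assumptions made for illustration.

```python
# Hypothetical in-memory model of TheBigDB-style statements (not the real API).
STATEMENTS = [
    {"nodes": ["Gold", "atomic radius", "144 pm"]},
    {
        "nodes": ["Bill Clinton", "job", "President of the United States"],
        "period": {"from": "1993-01-20 12:00:00", "to": "2001-01-20 11:59:59"},
    },
    {"nodes": ["Apple", "average weight", "150g"]},
]


def search(statements, pattern):
    """Return statements whose leading nodes match `pattern`.

    `pattern` is a list of strings; None acts as a wildcard for that position.
    """
    results = []
    for statement in statements:
        nodes = statement["nodes"]
        if len(pattern) > len(nodes):
            continue
        if all(p is None or p == n for p, n in zip(pattern, nodes)):
            results.append(statement)
    return results


# All statements about gold, then every statement whose second node is "job".
gold_facts = search(STATEMENTS, ["Gold"])
jobs = search(STATEMENTS, [None, "job"])
```

Because a statement is just an ordered list of nodes, positional matching is all the search needs; there is no schema to consult.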

Red Hat unveiled its Big Data strategy

Red Hat has outlined its big data strategy today. The company has announced that it will contribute its Storage Hadoop plug-in to the Apache Hadoop open source community as part of that strategy. Red Hat is focusing heavily on enterprise customers’ infrastructures and platforms in open hybrid cloud environments.

The Red Hat Storage Hadoop plug-in provides compatibility with Apache Hadoop, a popular framework in its segment. Ranga Rangachari, VP and GM of Red Hat’s storage business unit, claimed that opening the product to the community will help transform Storage Hadoop into a highly robust, Hadoop-compatible file system for big data. In a webcast, Rangachari said, “The Apache community is very significant. The community is the center of gravity for Hadoop development.”

He went on to explain the company’s big data strategy further: a focus on enterprise customers that are well suited to open hybrid cloud environments. He mentioned that the company is developing a network-based ecosystem in which enterprise integrator partners deliver its big data products to enterprise customers.

Red Hat is working on a commercial distribution of the OpenStack cloud controller and has also created its own OpenShift platform cloud using various open source projects. The company has acquired several existing products and has formed a mash-up of acquired and self-created code. It also acquired Gluster to obtain a clustered file system that runs on x86 servers. That file system can be used for compute in cloud-based environments and, eventually, for Hadoop MapReduce.

Red Hat plans to tell its customers that they will eventually need to drop HDFS and start using Red Hat’s Storage Server. The company believes its solution is more reliable and scalable than HDFS. It also helps, in a way, to resolve the NameNode problem (the NameNode is a single point of failure in HDFS).

The Red Hat Storage Server runs on Linux-based x86 servers with SATA/SAS drives. The drives can be arranged into RAID arrays for protection. The clustered file system then sits on top of ext3, ext4, XFS and other local file systems. This layer is essentially GlusterFS: it aggregates these file systems and presents a global namespace for accessing the cluster.

Companies looking to virtualize Hadoop and other big data environments can use Red Hat’s solutions in the long run for added flexibility. The company is also working on a Hive connector for JBoss middleware. Hive, the data warehousing system that runs on top of HDFS, allows users to run SQL-like queries against the data in HDFS. GlusterFS presents itself to Hadoop as an HDFS-compatible file system.

Red Hat’s strategy shows how the enterprise software company is focused on building a mainstream stable of tools for corporations. The company is rightly headed towards making the most of big data technology and enabling customers to find solutions that just work. It’ll be interesting to see how the company implements this strategy in 2013.

Riak CS Open Sourced

Basho announced today that Riak CS is now open source under the Apache 2 license. Organizations and users can now access the source code on GitHub and download the latest packages from the downloads page.

Riak CS can be used to build private or public clouds or as reliable, available storage behind applications and platforms. Riak CS Enterprise is currently used by large corporations including Datapipe, Deutsche Vermögensberatung (DVAG), IDC Frontier, Rovio, and Yahoo! JAPAN.

Basho is a distributed systems company dedicated to making software that is available, fault-tolerant, and easy to operate at scale. Twenty-five percent of the Fortune 50 and thousands of open source users large and small run our flagship open source database, Riak. Riak CS takes distributed systems principles derived from production Riak users and applies them to the problem of large-scale storage. We are excited to share this code with the world.

Riak CS features:

  • Highly available, fault-tolerant storage
  • Large object support
  • S3-compatible API and authentication
  • Multi-tenancy and per-user reporting
  • Simple operational model for adding capacity
  • Robust stats for monitoring and metrics

For users requiring multi-datacenter replication and enterprise-level support, Riak CS Enterprise (a commercial extension of Riak CS) is available.

New Features

Today we are also announcing several new features, available now as part of the open source edition.

  • Multipart upload. Upload very large files to Riak CS as a series of parts. Parts can be between 5MB and 5GB.
  • Support for GET range queries. Retrieve a range of bytes from a single object. This functionality is implemented in the “Range” request header of GET operations.
  • Per-bucket policies to restrict access to buckets based on source IP.
  • Riak CS Control. Riak CS Control is a standalone web administration tool for user management, available on GitHub.
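
The part-size limits and the Range header described in the list above lend themselves to a short sketch. The helper names `plan_parts` and `range_header` are hypothetical, and no Riak CS client is involved. The sketch assumes, as in the S3 API that Riak CS is compatible with, that MB and GB are binary units and that the final part may be smaller than the minimum; neither detail is stated above.

```python
MB = 1024 * 1024
GB = 1024 * MB
MIN_PART = 5 * MB   # lower bound from the feature list
MAX_PART = 5 * GB   # upper bound from the feature list


def plan_parts(object_size, part_size=64 * MB):
    """Split `object_size` bytes into (offset, length) parts.

    Assumption: the final part may be shorter than MIN_PART, as in S3.
    """
    if not MIN_PART <= part_size <= MAX_PART:
        raise ValueError("part size must be between 5 MB and 5 GB")
    parts = []
    offset = 0
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((offset, length))
        offset += length
    return parts


def range_header(first_byte, last_byte):
    """Build an HTTP Range header for an inclusive byte range."""
    if first_byte < 0 or last_byte < first_byte:
        raise ValueError("invalid byte range")
    return {"Range": f"bytes={first_byte}-{last_byte}"}


# A 200 MB object split into 64 MB parts, and a request for the first 1 KB.
parts = plan_parts(200 * MB)
header = range_header(0, 1023)
```

Each `(offset, length)` pair would correspond to one uploaded part, and the header dictionary could be passed to any HTTP client issuing the GET request.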

Twitter’s fatcache available on GitHub

fatcache is memcache on SSD. Think of fatcache as a cache for your big data.

Overview

There are two ways to think of SSDs in system design. One is to think of SSDs as an extension of disk, where they play the role of making disks fast; the other is to think of them as an extension of memory, where they play the role of making memory fat. The latter makes sense when persistence (non-volatility) is unnecessary and data is accessed over the network. Even though memory is a thousand times faster than SSD, network-connected SSD-backed memory makes sense if we design the system so that network latencies dominate SSD latencies by a large factor.

To understand why network connected SSD makes sense, it is important to understand the role distributed memory plays in large-scale web architecture. In recent years, terabyte-scale, distributed, in-memory caches have become a fundamental building block of any web architecture. In-memory indexes, hash tables, key-value stores and caches are increasingly incorporated for scaling throughput and reducing latency of persistent storage systems. However, power consumption, operational complexity and single node DRAM cost make horizontally scaling this architecture challenging. The current cost of DRAM per server increases dramatically beyond approximately 150 GB, and power cost scales similarly as DRAM density increases.

Fatcache extends a volatile, in-memory cache by incorporating SSD-backed storage.

SSD-backed memory presents a viable alternative for applications with large workloads that need to maintain high hit rate for high performance. SSDs have higher capacity per dollar and lower power consumption per byte, without degrading random read latency beyond network latency.

Fatcache achieves performance comparable to an in-memory cache by focusing on two design criteria:

  • Minimize disk reads on cache hit
  • Eliminate small, random disk writes

The latter is important due to SSDs’ unique write characteristics. Writes and in-place updates to SSDs degrade performance due to an erase-and-rewrite penalty and garbage collection of dead blocks. Fatcache batches small writes to obtain consistent performance and increased disk lifetime.

SSD reads happen at a page-size granularity, usually 4 KB. Single page read access times are approximately 50 to 70 usec and a single commodity SSD can sustain nearly 40K read IOPS at a 4 KB page size. 70 usec read latency dictates that disk latency will overtake typical network latency after a small number of reads. Fatcache reduces disk reads by maintaining an in-memory index for all on-disk data.
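
The two design criteria can be illustrated with a toy model. This is a minimal sketch and not fatcache's actual implementation: the class name `TinyFatCache`, the 4 KB flush threshold and the file layout are assumptions. Small writes are collected in a memory buffer and appended to the file in one larger sequential write, and an in-memory index maps each key to its location so that a hit costs at most one disk read and a miss costs none.

```python
import os
import tempfile


class TinyFatCache:
    """Toy SSD-style cache: batched appends plus an in-memory index."""

    def __init__(self, path, flush_threshold=4096):
        self.file = open(path, "w+b")
        self.flush_threshold = flush_threshold  # assumed batch size in bytes
        self.buffer = bytearray()               # pending small writes
        self.buffer_start = 0                   # file offset where the buffer will land
        self.index = {}                         # key -> (offset, length)
        self.disk_reads = 0
        self.disk_writes = 0

    def set(self, key, value):
        # Record where the value will live, then batch it in memory.
        offset = self.buffer_start + len(self.buffer)
        self.buffer += value
        self.index[key] = (offset, len(value))
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One sequential write for the whole batch.
        if not self.buffer:
            return
        self.file.seek(self.buffer_start)
        self.file.write(self.buffer)
        self.file.flush()
        self.disk_writes += 1
        self.buffer_start += len(self.buffer)
        self.buffer = bytearray()

    def get(self, key):
        location = self.index.get(key)
        if location is None:
            return None  # miss is answered from the index, no disk access
        offset, length = location
        if offset >= self.buffer_start:
            # Value has not been flushed yet; serve it from the buffer.
            start = offset - self.buffer_start
            return bytes(self.buffer[start:start + length])
        self.file.seek(offset)
        self.disk_reads += 1
        return self.file.read(length)

    def close(self):
        self.flush()
        self.file.close()


# 100 small writes of 100 bytes each against a 4 KB batch threshold.
path = os.path.join(tempfile.mkdtemp(), "cache.dat")
cache = TinyFatCache(path, flush_threshold=4096)
for i in range(100):
    cache.set(f"key{i}", b"x" * 100)
```

In this run the hundred small writes reach the file as a handful of batched appends rather than a hundred separate ones, which is the behaviour the second design criterion asks for.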

https://github.com/twitter/fatcache

Intel releases open source GraphBuilder for big data

Intel has released an open source tool designed to improve firms’ handling and analysis of unstructured data.

Source: Intel official blog http://blogs.intel.com/intellabs/2012/12/06/graphbuilder/

Intel said that its GraphBuilder tool would aim to fill a market void in the handling of big data for machine learning. Currently available as a beta release, the tool allows developers to construct large graphs which can then be used with big data analysis frameworks.

“GraphBuilder not only constructs large-scale graphs fast but also offloads many of the complexities of graph construction, including graph formation, cleaning, compression, partitioning, and serialisation,” wrote Intel principal scientist Ted Willke.

“This makes it easy for just about anyone to build graphs for interesting research and commercial applications.”
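
The construction stages Willke lists can be sketched on a toy input. This is a minimal illustration and not GraphBuilder's API: the input format (a page title plus the titles it links to), the function names and the hash-based partitioning are assumptions chosen for brevity.

```python
import json
import zlib

# Hypothetical raw input: (page title, titles it links to), including noise.
RAW_PAGES = [
    ("Hadoop", ["MapReduce", "HDFS", "Hadoop", "MapReduce"]),
    ("Hive", ["HDFS", "Hadoop"]),
    ("Spark", ["Hadoop", ""]),
]


def form_edges(pages):
    """Graph formation: emit one (source, target) pair per link."""
    for source, targets in pages:
        for target in targets:
            yield source, target


def clean_edges(edges):
    """Cleaning: drop empty targets, self-loops and duplicate edges."""
    seen = set()
    for source, target in edges:
        if not target or source == target or (source, target) in seen:
            continue
        seen.add((source, target))
        yield source, target


def compress(edges):
    """Compression: replace vertex names with small integer ids."""
    ids = {}
    compact = []
    for source, target in edges:
        for name in (source, target):
            ids.setdefault(name, len(ids))
        compact.append((ids[source], ids[target]))
    return ids, compact


def partition(edges, num_parts):
    """Partitioning: assign each edge to a part by hashing its source id."""
    parts = [[] for _ in range(num_parts)]
    for source, target in edges:
        slot = zlib.crc32(str(source).encode()) % num_parts
        parts[slot].append((source, target))
    return parts


def serialise(parts):
    """Serialisation: one JSON document per partition."""
    return [json.dumps(part) for part in parts]


vertex_ids, edge_list = compress(clean_edges(form_edges(RAW_PAGES)))
partitions = partition(edge_list, num_parts=2)
serialised = serialise(partitions)
```

Hashing on the source vertex keeps all outgoing edges of a vertex in the same partition, which is one common choice; a real system may partition quite differently.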

Willke said that the tool was developed in a collaboration with researchers at the University of Washington in Seattle. The teams sought to address a perceived hole in the market for tools to build the graph data used for many big data analysis activities.

“Scanning the environment, we identified a more general hole in the open source ecosystem: A number of systems were out there to process, store, visualise, and mine graphs but, surprisingly, not to construct them from unstructured sources,” Willke explained.

“So, we set out to develop a demo of a scalable graph construction library for Hadoop.”

The researchers estimate that GraphBuilder can help big data platforms analyse data as much as 50 times faster than the conventional MapReduce system.

The project is one of many research efforts dedicated to improving the performance of big data analysis platforms. Last month, researchers from the University of California, Berkeley showcased a pair of technologies dubbed ‘Spark’ and ‘Shark’ which promise to dramatically improve the performance of the Apache Hive big data system.

The big data market has been suffering from a general lack of qualified analysts and developers, say vendors. Companies have sought to help bridge the gap by extending training efforts and partnerships with universities.

Scalaris 0.5.0 codename “Saperda scalaris”

Scalaris 0.5.0 (codename “Saperda scalaris”) has been released. https://code.google.com/p/scalaris/

Scalaris is a scalable, transactional, distributed key-value store. It can be used for building scalable Web 2.0 services.

Scalaris uses a structured overlay with a non-blocking Paxos commit protocol for transaction processing with strong consistency over replicas. Scalaris is implemented in Erlang.


RethinkDB 1.2.2 has been released

RethinkDB Release 1.2.0 (Rashomon) was the first release of the product.

RethinkDB is built to store JSON documents and to scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and it is easy to set up and learn.
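
To make the join and group-by queries mentioned above concrete, here is what they compute over JSON-like documents. This is a plain-Python sketch, not RethinkDB's query language or driver: the sample tables, field names and helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical tables of JSON-like documents.
users = [
    {"id": 1, "name": "alice", "country": "LU"},
    {"id": 2, "name": "bob", "country": "FR"},
]
posts = [
    {"id": 10, "author_id": 1, "title": "Hello"},
    {"id": 11, "author_id": 1, "title": "Again"},
    {"id": 12, "author_id": 2, "title": "Salut"},
]


def inner_join(left, right, left_key, right_key):
    """Pair each left document with the right documents sharing its key."""
    by_key = defaultdict(list)
    for doc in right:
        by_key[doc[right_key]].append(doc)
    return [
        {"left": l, "right": r}
        for l in left
        for r in by_key.get(l[left_key], [])
    ]


def group_count(docs, key_func):
    """Group documents by key_func and count the members of each group."""
    counts = defaultdict(int)
    for doc in docs:
        counts[key_func(doc)] += 1
    return dict(counts)


# Join posts to their authors, then count posts per author name.
joined = inner_join(posts, users, "author_id", "id")
posts_per_author = group_count(joined, lambda pair: pair["right"]["name"])
```

A database would run the same logic across shards rather than over in-memory lists; the sketch only shows the shape of the result.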


Simple programming model:

  • JSON data model and immediate consistency.
  • Distributed joins, subqueries, aggregation, atomic updates.
  • Hadoop-style map/reduce.

Easy administration:

  • Friendly web and command-line administration tools.
  • Takes care of machine failures and network interrupts.
  • Multi-datacenter replication and failover.

Horizontal scalability:

  • Sharding and replication to multiple nodes.
  • Queries are automatically parallelized and distributed.
  • Lock-free operation via MVCC concurrency.

 

RethinkDB is simple to use but complex under the hood. Read the FAQ for technical information on advanced features, architectural tradeoffs, and limitations.
