
PostgreSQL introduced jsonb support

Binary JSON

PostgreSQL has introduced jsonb, a diamond in the crown of PostgreSQL 9.4. It is based on an elegant hash opclass for GIN, and its contains operator competes with MongoDB in performance.
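As a minimal sketch of what this looks like in practice, here is a query through the psycopg2 driver against PostgreSQL 9.4 (the table, column and connection string are illustrative; jsonb_path_ops is the hash-based GIN opclass the feature builds on):

    # Minimal sketch: psycopg2 against PostgreSQL 9.4; names are illustrative.
    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder connection string
    cur = conn.cursor()
    cur.execute("CREATE TABLE docs (id serial PRIMARY KEY, body jsonb)")
    # jsonb_path_ops is the hash-based GIN opclass optimized for containment.
    cur.execute("CREATE INDEX docs_body_idx ON docs USING gin (body jsonb_path_ops)")
    cur.execute("""INSERT INTO docs (body) VALUES ('{"name": "ada", "tags": ["math"]}')""")
    # The contains operator @> asks: does body contain this JSON fragment?
    cur.execute("""SELECT body FROM docs WHERE body @> '{"tags": ["math"]}'""")
    print(cur.fetchall())
    conn.commit()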

Feature documentation: http://www.postgresql.org/docs/devel/static/datatype-json.html

Feature story: http://obartunov.livejournal.com/177247.html

hRaven v0.9.8

The @twitterhadoop team just released hRaven v0.9.8.

hRaven collects run-time data and statistics from MapReduce jobs running on Hadoop clusters and stores the collected job history in an easily queryable format. For jobs that are run through frameworks (Pig or Scalding/Cascading) that decompose a script or application into a DAG of MapReduce jobs for actual execution, hRaven groups the job history data together under an application construct. This allows easier visualization of all of an application's component jobs and more comprehensive trending and analysis over time.

Requirements
  • Apache HBase (0.94+) – a running HBase cluster is required for the hRaven data storage
  • Apache Hadoop – hRaven currently supports collection of job data on specific versions of Hadoop:
    • CDH up to CDH3u5, Hadoop 1.x up to MAPREDUCE-1016
    • Hadoop 1.x post-MAPREDUCE-1016 and Hadoop 2.0 are supported from version 0.9.4 onwards

https://github.com/twitter/hraven

Twemproxy v0.3.0 has been released

twemproxy v0.3.0 is out: bug fixes and support for SmartOS (Solaris) / BSD (OS X).

twemproxy (pronounced “two-em-proxy”), aka nutcracker, is a fast and lightweight proxy for the memcached and redis protocols. It was primarily built to reduce the connection count on the backend caching servers.

Features

  • Fast.
  • Lightweight.
  • Maintains persistent server connections.
  • Keeps connection count on the backend caching servers low.
  • Enables pipelining of requests and responses.
  • Supports proxying to multiple servers.
  • Supports multiple server pools simultaneously.
  • Shards data automatically across multiple servers.
  • Implements the complete memcached ASCII and redis protocols.
  • Easy configuration of server pools through a YAML file (see the sample config after this list).
  • Supports multiple hashing modes including consistent hashing and distribution.
  • Can be configured to disable nodes on failures.
  • Observability through stats exposed on the stats monitoring port.
  • Works with Linux, *BSD, OS X and Solaris (SmartOS).
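For example, a minimal nutcracker YAML sketch defining one redis pool sharded across two backends (the pool name, addresses and timeout values here are illustrative, not a recommended production config):

    alpha:
      listen: 127.0.0.1:22121
      hash: fnv1a_64
      distribution: ketama        # consistent hashing across the pool
      redis: true                 # speak the redis protocol to backends
      auto_eject_hosts: true      # temporarily disable failing nodes
      server_retry_timeout: 30000
      server_failure_limit: 3
      servers:
        - 127.0.0.1:6379:1        # host:port:weight
        - 127.0.0.1:6380:1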


More details and source code available here: https://github.com/twitter/twemproxy

Apache Kafka 0.8.0 released

Kafka is a message queue developed by LinkedIn that persists messages to disk in a very performant manner. It provides the functionality of a messaging system, but with a unique design.
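As an illustration of the producer side, here is a minimal sketch using the third-party kafka-python client (the client is not part of the Kafka release itself, and the broker address and topic name are assumptions):

    # Minimal sketch using the third-party kafka-python client.
    # Assumes a broker on localhost:9092 and a topic named "events".
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b"hello, kafka")  # messages are plain bytes
    producer.flush()  # block until the message reaches the broker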

The 0.8.0 release brings new features such as:

  • [KAFKA-50] – kafka intra-cluster replication support
  • [KAFKA-188] – Support multiple data directories
  • [KAFKA-202] – Make the request processing in kafka asynchronous
  • [KAFKA-203] – Improve Kafka internal metrics
  • [KAFKA-235] – Add a ‘log.file.age’ configuration parameter to force rotation of log files after they’ve reached a certain age
  • [KAFKA-429] – Expose JMX operation to set logger level dynamically
  • [KAFKA-475] – Time based log segment rollout
  • [KAFKA-545] – Add a Performance Suite for the Log subsystem
  • [KAFKA-546] – Fix commit() in zk consumer for compressed messages

Download version 0.8.0 here: http://kafka.apache.org/downloads.html

Cassandra 2.0, just big

In five years, Apache Cassandra has grown into one of the most widely used NoSQL databases in the world and serves as the backbone for some of today’s most popular applications, including Facebook, Netflix and Twitter.

This newest version, Cassandra 2.0, just announced, includes multiple new features. But perhaps the biggest of them is that “Cassandra 2.0 makes it easier than ever for developers to migrate from relational databases and become productive quickly.”

New features and improvements include:

  • Lightweight transactions, which ensure operation linearizability similar to the serializable isolation level offered by relational databases, preventing conflicts during concurrent requests (see the sketch after this list)
  • Triggers, which enable pushing performance-critical code close to the data it deals with, and simplify integration with event-driven frameworks like Storm
  • CQL enhancements such as cursors and improved index support
  • Improved compaction, keeping read performance from deteriorating under heavy write load
  • Eager retries to avoid query timeouts by sending redundant requests to other replicas if too much time elapses on the original request
  • Custom Thrift server implementation based on LMAX Disruptor that achieves lower message processing latencies and better throughput with flexible buffer allocation strategies
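As a small illustration of lightweight transactions, here is a sketch using the DataStax Python driver (the keyspace, table and column names are assumptions):

    # Sketch using the DataStax Python driver (pip install cassandra-driver).
    # Keyspace "demo" and table "users" are assumed to exist.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")

    # IF NOT EXISTS turns the insert into a lightweight transaction:
    # Cassandra runs a Paxos round so only one concurrent insert can win.
    result = session.execute(
        "INSERT INTO users (username, email) "
        "VALUES ('ada', 'ada@example.com') IF NOT EXISTS"
    )
    print(result.was_applied)  # True if this insert won; False if the row existed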

Official website

Official announcement

Download

Spark, the Open Source Future of Big Data

Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much more quickly than with disk-based systems like Hadoop MapReduce.

To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
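A minimal PySpark sketch of that workflow (the input file name is a placeholder); once cached, the dataset stays in memory across both queries instead of being re-read from disk:

    # Minimal PySpark sketch; "data.txt" is a placeholder input file.
    from pyspark import SparkContext

    sc = SparkContext("local", "demo")
    lines = sc.textFile("data.txt").cache()  # load once, keep in memory

    # Both queries reuse the in-memory dataset.
    print(lines.filter(lambda l: "error" in l).count())
    print(lines.filter(lambda l: "warning" in l).count())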

Spark is open source under a BSD license, so download it to check it out.

MySQL man pages silently relicensed away from GPL

[Amended] According to mysql.com, this was a bug.

It has recently come to light that the MySQL man pages have been relicensed. The change was made rather silently between MySQL 5.5.30 and MySQL 5.5.31, and affects all pages in the man/ directory of the source code.

You can tell the change happened within this short timeframe (5.5.30 → 5.5.31). The old man pages were released under the following license:

This documentation is free software; you can redistribute it and/or modify it only under the terms of the GNU General Public License as published by the Free Software Foundation; version 2 of the License.

The new man pages (from 5.5.31 onwards – still the case for 5.5.32) are released under the following license:

This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.

This is clearly not very friendly of MySQL at Oracle.

UnQLite

UnQLite is an embeddable NoSQL (key/value store and document store) database engine. Unlike most other NoSQL databases, UnQLite does not have a separate server process. UnQLite reads and writes directly to ordinary disk files; a complete database with multiple collections is contained in a single disk file. The database file format is cross-platform: you can freely copy a database between 32-bit and 64-bit systems or between big-endian and little-endian architectures.
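A minimal sketch of embedded usage through the third-party unqlite-python binding (the binding, file name and field names are assumptions, not part of UnQLite itself):

    # Sketch using the third-party unqlite-python binding (an assumption).
    from unqlite import UnQLite

    db = UnQLite('sample.db')        # the whole database lives in this one file
    db['greeting'] = 'hello'         # plain key/value access
    print(db['greeting'])

    users = db.collection('users')   # document-store side: a JSON collection
    users.create()                   # create the collection if it does not exist
    users.store({'name': 'ada', 'age': 36})
    print(users.all())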

More information and the full feature list on the official website: http://www.unqlite.org/

TheBigDB of facts

TheBigDB is a very loosely structured database of facts, free and open to everybody.

http://thebigdb.com/


Through a very simple API you can browse the database and access facts such as:

  • { nodes: ["Gold", "atomic radius", "144 pm"] }
  • { nodes: ["Bill Clinton", "job", "President of the United States"], period: { from: “1993-01-20 12:00:00″, to: “2001-01-20 11:59:59″ } }
  • { nodes: ["Apple", "average weight", "150g"] }

That’s it. Really.

Anyone can create, upvote or downvote a statement.

There are no datatypes, namespaces, lists or domains. Just nodes, one after the other, with a simple and easy to use API to search through them.
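As a purely illustrative sketch of what querying such an API could look like (the endpoint, parameters and response shape here are hypothetical, not TheBigDB's documented API):

    # Hypothetical sketch only: the endpoint and parameters are illustrative,
    # not TheBigDB's documented API.
    import requests

    resp = requests.get(
        "http://api.thebigdb.com/v1/statements/search",  # hypothetical endpoint
        params={"nodes[]": ["Gold", "atomic radius"]},
    )
    for statement in resp.json().get("statements", []):  # assumed response shape
        print(statement["nodes"])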

Red Hat unveiled its Big Data strategy

Red Hat has outlined its big data strategy today. The company has announced that it is going to contribute its Storage Hadoop plug-in to the Apache Hadoop open community as part of that strategy. Red Hat is focusing heavily on enterprise customers' infrastructures and platforms in open hybrid cloud-based environments.

The Red Hat Storage Hadoop plug-in provides compatibility with Apache Hadoop, a popular framework in its segment. Ranga Rangachari, VP and GM of Red Hat's storage business unit, claimed that opening the product to the community will help transform Storage Hadoop into a highly robust, Hadoop-compatible file system for big data. In a webcast, Rangachari said, “The Apache community is very significant. The community is the center of gravity for Hadoop development.”

He went on to further explain the company's big data strategy, which focuses on enterprise customers ideally suited for open hybrid cloud-based environments. He mentioned that the company is developing a network-based ecosystem in which enterprise integrator partners deliver its big data products to enterprise customers.

Red Hat is working on a commercial version of the OpenStack cloud controller and has also created its own OpenShift platform cloud using various open source projects. The company has acquired several existing products, forming a mash-up of acquired and self-created code. Among these acquisitions, it bought Gluster to obtain a cluster-based file system running on x86, which can be used for computing in cloud-based environments and, eventually, for Hadoop MapReduce.

Red Hat plans to inform its customers that they will eventually need to dump HDFS and start using Red Hat's Storage Server, which the company believes is more reliable and scalable than HDFS. It also goes some way toward resolving the NameNode problem.

The Red Hat Storage Server runs on Linux-based x86 servers with SATA/SAS drives, which can be arranged into a RAID stack to protect the drives. The clustered file system, GlusterFS, then rides on top of ext3, ext4, XFS and other local file systems, aggregating them and presenting a global namespace through which processors access the cluster.

Companies looking to virtualize Hadoop and other big data environments can use Red Hat's solutions in the long run for added flexibility. The company is also working on a Hive connector for JBoss middleware. Hive, the data warehousing system riding on top of HDFS, lets users run SQL-like queries against the data in HDFS; GlusterFS presents itself as HDFS to Hadoop.

Red Hat's strategy reveals how the enterprise software company is focused on a mainstream stable of tools for corporations in the future. The company is rightly headed toward making the most of big data technology and enabling customers to find solutions that just work. It will be interesting to see how the company implements this strategy in 2013.

Follow LuxNoSQL on Twitter
 
Join the LuxNoSQL Community on LinkedIn