Apple acquires FoundationDB

A notice on the FoundationDB site states that it is no longer offering downloads of its database software. Financial terms of the deal were not disclosed. CEO David Rosenthal was previously VP of Engineering at Omniture and co-founded the company with COO Nick Lavezzo and Dave Scherer in 2009.

“Apple buys smaller technology companies from time to time, and we generally do not discuss our purpose or plans,” said Apple, in response to an inquiry about the company. FoundationDB has not responded to a request for comment.

FoundationDB had raised $22.7 million in two rounds from SV Angel, Sutter Hill and CrunchFund.

ArangoDB 2.5.1 released

ArangoDB 2.5.1 is available for download.

It mainly adds a slow-query log and the ability to kill running queries through the HTTP API and the web interface.

ArangoDB has an HTTP interface for retrieving the list of currently running AQL queries and the list of slow AQL queries. The new --database.slow-query-threshold option can be used to change the default slow-query threshold at server start.

Running AQL queries can also be killed on the server. ArangoDB provides a kill facility via the HTTP interface. To kill a running query, its id (as returned for the query in the list of currently running queries) must be specified. The kill flag of the query is then set, and the query is aborted as soon as it reaches a cancellation point.
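
To make this concrete, here is a minimal sketch of these calls (assuming an ArangoDB server on localhost:8529 with authentication disabled; adjust for your setup):

```python
import requests

BASE = "http://localhost:8529/_api/query"

# Fetch the list of currently running AQL queries and the slow-query log.
running = requests.get(BASE + "/current").json()
slow = requests.get(BASE + "/slow").json()
print(slow)

# Kill a running query by the id reported in the "current" list.
if running:
    requests.delete(BASE + "/" + running[0]["id"])
```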

Challenging the reliability of distributed data systems

There are two basic tasks that any computer system needs to accomplish:

  • storage
  • computation

Distributed systems let you solve the same problem that you can solve on a single computer using multiple computers – usually because the problem no longer fits on a single computer.

Distributed systems need to partition data or state across many machines to scale. Adding machines increases the probability that some machine will fail, and to address this, such systems typically keep replicas or some other form of redundancy to tolerate failures.
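
In code, the standard partition-and-replicate recipe boils down to something like this sketch (the placement scheme and node names are purely illustrative; real systems use consistent hashing to limit data movement when membership changes):

```python
import hashlib

def replica_nodes(key, nodes, replicas=3):
    # Place a key on `replicas` distinct nodes: hash the key to pick
    # a starting node, then walk the ring of nodes for the copies.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = h % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
print(replica_nodes("user:42", nodes))  # three nodes hold this key
```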

Where is the flaw in such reasoning?

It is the assumption that failures are independent. If you pick up pieces of identical hardware, run them on the same network gear and power systems, have the same people run, manage and configure them, and run the same (buggy) software on all of them, it would be incredibly unlikely that the failures on these machines would be independent of one another in the probabilistic sense that motivates a lot of distributed infrastructure. If you see a bug on one machine, the same bug is on all the machines. When you push bad config, it is usually game over no matter how many machines you push it to.

RethinkDB 2.0 RC now available

The RethinkDB 2.0 release candidate is now available.

RethinkDB got my attention last year when it introduced its changes command, which basically provides what could be called a “trigger subscription”: a way to subscribe to change notifications in the database.
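
With the Python driver of the time, a changefeed is one call. A minimal sketch, assuming a local server and an existing games table:

```python
import rethinkdb as r  # official Python driver, circa the 2.0 era

conn = r.connect("localhost", 28015)

# Blocks and streams a change document for every write to the table.
for change in r.table("games").changes().run(conn):
    print(change["old_val"], "->", change["new_val"])
```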

The 2.0 release sounds promising but still requires some testing.

Time, Clocks and the Ordering of Events in a Distributed System

Written in 1978 by Leslie Lamport, this is a must-read paper, freely available here:

http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf

Time, Clocks and the Ordering of Events in a Distributed System

Communications of the ACM 21, 7   (July 1978), 558-565.  Reprinted in several collections, including Distributed Computing: Concepts and Implementations, McEntire et al., ed.  IEEE Press, 1984.


Jim Gray once told me that he had heard two different opinions of this paper: that it’s trivial and that it’s brilliant.  I can’t argue with the former, and I am disinclined to argue with the latter.

The origin of this paper was a note titled The Maintenance of Duplicate Databases by Paul Johnson and Bob Thomas.  I believe their note introduced the idea of using message timestamps in a distributed algorithm.  I happen to have a solid, visceral understanding of special relativity (see [5]).  This enabled me to grasp immediately the essence of what they were trying to do.  Special relativity teaches us that there is no invariant total ordering of events in space-time; different observers can disagree about which of two events happened first.  There is only a partial order in which an event e1 precedes an event e2 iff e1 can causally affect e2.  I realized that the essence of Johnson and Thomas’s algorithm was the use of timestamps to provide a total ordering of events that was consistent with the causal order.  This realization may have been brilliant.  Having realized it, everything else was trivial.  Because Thomas and Johnson didn’t understand exactly what they were doing, they didn’t get the algorithm quite right; their algorithm permitted anomalous behavior that essentially violated causality.  I quickly wrote a short note pointing this out and correcting the algorithm.
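
The clock rule at the heart of the paper is tiny. Here is a minimal sketch in Python (my own illustration, not the paper's notation): each process increments a counter on local events, stamps outgoing messages, and on receipt jumps past the sender's timestamp.

```python
class LamportClock:
    """Logical clock: counts local events and never falls behind
    the timestamps observed on incoming messages."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Called for each local event, including sends.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Jump past the sender's timestamp to respect causality.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.tick()          # a sends a message stamped t = 1
print(b.receive(t))   # 2: b's clock moves past the sender's
```

Breaking ties by process id turns this causal (partial) order into the total order that the paper then exploits.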

It didn’t take me long to realize that an algorithm for totally ordering events could be used to implement any distributed system.  A distributed system can be described as a particular sequential state machine that is implemented with a network of processors.  The ability to totally order the input requests leads immediately to an algorithm to implement an arbitrary state machine by a network of processors, and hence to implement any distributed system.  So, I wrote this paper, which is about how to implement an arbitrary distributed state machine.  As an illustration, I used the simplest example of a distributed system I could think of: a distributed mutual exclusion algorithm.

This is my most often cited paper.  Many computer scientists claim to have read it.  But I have rarely encountered anyone who was aware that the paper said anything about state machines.  People seem to think that it is about either the causality relation on events in a distributed system, or the distributed mutual exclusion problem.  People have insisted that there is nothing about state machines in the paper.  I’ve even had to go back and reread it to convince myself that I really did remember what I had written.

The paper describes the synchronization of logical clocks.  As something of an afterthought, I decided to see what kind of synchronization it provided for real-time clocks.  So, I included a theorem about real-time synchronization.  I was rather surprised by how difficult the proof turned out to be.  This was an indication of what lay ahead in [62].

This paper won the 2000 PODC Influential Paper Award (later renamed the Edsger W. Dijkstra Prize in Distributed Computing).  It won an ACM SIGOPS Hall of Fame Award in 2007.

Storage market

For quite a long time we have been in a trend of ever cheaper storage capacity. The price per GB kept falling until the industry crisis of 2011-2012 (the flooding in Thailand). Even if the trend holds, it seems like something has changed.

1) The market is no longer seeking cheaper storage capacity but faster storage. SSD technologies will change the market just like multi-core changed the CPU market, when power and heat issues forced chip makers in the early 2000s to move from single to multiple cores so that CPU design could keep pace with Moore’s Law.

2) Even the demand for storage capacity has eased for personal files. People no longer need to download and store their movies and music locally. Instead they can subscribe to legal offers such as Netflix or Spotify, or benefit from services like Google Drive or Dropbox. Not only cheaper but also simpler to use, these services have changed the market for times to come.

1956: IBM’s hard drive, with a capacity of 10 MB:

http://old-photos.blogspot.be/2011/06/hard-drive.html

The recent trend illustrated (Backblaze’s charts: cost per GB over time, and historical vs. actual prices):

http://blog.backblaze.com/2013/11/26/farming-hard-drives-2-years-and-1m-later/

Postgres Outperforms MongoDB and Ushers in New Developer Reality

According to EnterpriseDB’s recent benchmark, Postgres outperforms MongoDB and “ushers in a new developer reality”.

Postgres would not only beat MongoDB on performance; it would also beat MongoDB’s data size requirement by approximately 25%.

EDB found that Postgres outperforms MongoDB in selecting, loading and inserting complex document data in key workloads involving 50 million records:

  • Ingestion of high volumes of data was approximately 2.1 times faster in Postgres
  • MongoDB consumed 33% more disk space
  • Data inserts took almost 3 times longer in MongoDB
  • Data selection took more than 2.5 times longer in MongoDB than in Postgres

Find the full article here

The benchmark tool is available on GitHub: https://github.com/EnterpriseDB/pg_nosql_benchmark
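
The document side of Postgres in this benchmark is the JSONB type. As a minimal sketch of what storing and querying documents looks like (assuming a local Postgres 9.4+ with a test database and the psycopg2 driver; the table and data are illustrative):

```python
import json
import psycopg2  # assumes a local Postgres 9.4+ and a "test" database

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

# Schemaless documents live in a JSONB column, the feature the EDB
# benchmark pits against MongoDB collections.
cur.execute("CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, body jsonb)")
cur.execute("INSERT INTO docs (body) VALUES (%s)",
            [json.dumps({"name": "alice", "tags": ["nosql", "pg"]})])

# Query inside the document with the JSONB operators.
cur.execute("SELECT body FROM docs WHERE body ->> 'name' = %s", ["alice"])
print(cur.fetchone())
conn.commit()
```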

SQL or NoSQL – understanding the underlying issues

I recently tried to explain how it is not one or the other: SQL versus NoSQL is, once again, not the question or the choice to be made. Instead, it is the underlying issues that have to be understood and used to drive your choice.

  • Ability to scale

If your application can no longer serve its users, then however good and smart it used to be, it is no longer working. So scaling, through techniques such as clustering, sharding and distributed processing, has become a must, and one requirement that few RDBMSs have been able to implement. The historical reasons, the old ways, are obviously responsible: traditionally the SQL database ran on a single machine (one big server with the fastest CPU available and all the RAM you could afford). Before scaling solutions became available, people tried to solve performance issues with caching techniques (memcached was created in 2003), but it is all the same problem: if your application or service stops serving its users, it is game over.
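
That memcached-era fix is the cache-aside pattern, which fits in a few lines. A minimal sketch, with a plain dict standing in for memcached and a hypothetical fetch_user accessor:

```python
cache = {}  # stand-in for memcached

def get_user(user_id, db):
    # Cache-aside: serve reads from the cache when possible; on a miss,
    # hit the database and populate the cache for the next reader.
    key = "user:%d" % user_id
    if key not in cache:
        cache[key] = db.fetch_user(user_id)  # hypothetical DB accessor
    return cache[key]
```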

  • ACID – transactional database

Most applications do not need transactions: the ability for a single process to perform multiple data manipulations and finally enforce the whole set of operations, or cancel them all at any step, thus rolling back to the initial state of the data (before your program started). Such a feature is available to all programs (and their instances) accessing the database concurrently. This magic and complex set of features is what provides the so-called consistency and integrity. As I said, most applications do not need transactions, and most NoSQL databases are non-ACID and do not support them. A minimal sketch of the commit-or-roll-back idea follows.
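
Here it is with SQLite, only because it ships with Python; any ACID store behaves the same way (the accounts and the overdraft rule are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    # Two manipulations that must succeed or fail together.
    conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'a'")
    conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'b'")
    (bal,) = conn.execute("SELECT balance FROM accounts WHERE name = 'a'").fetchone()
    if bal < 0:
        raise ValueError("overdraft")
    conn.commit()    # enforce the whole set of operations...
except ValueError:
    conn.rollback()  # ...or cancel them all, back to the initial state
```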

  • Data model

Traditional RDBMSs have relied on the relational model, which can be overly restrictive. A strong relational model of complex data requires skill and time to create, maintain and document (with a view to knowledge transfer). In practice the relational data model can also limit future development, since a relational model is not easy to change. NoSQL solutions provide different data structures, such as document, graph and key-value, which enable non-relational data models. To make a long story short, the data model (relational or not) will not ease your design work (which remains highly critical), but it can ease its implementation. An example follows.
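
In a document model, one self-contained record can hold what a relational schema would normalize across several tables (the shape below is purely illustrative):

```python
# An order with its customer and line items embedded in one document,
# instead of rows joined across orders, customers and order_lines tables.
order = {
    "_id": "order-1001",
    "customer": {"name": "Alice", "email": "alice@example.com"},
    "lines": [
        {"sku": "disk-4tb", "qty": 2, "price": 129.0},
        {"sku": "ssd-1tb", "qty": 1, "price": 89.0},
    ],
}
total = sum(line["qty"] * line["price"] for line in order["lines"])
print(total)  # 347.0
```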

 

YaJUG: Cassandra

Thursday, October 2, 2014, from 18:00 to 20:00 – Luxembourg City, Luxembourg

18:00: Introduction to Cassandra – Duy Hai DOAN (DataStax)

Cassandra is the column-oriented NoSQL database behind big names such as Netflix, Sony Entertainment, Apple…
A first session gives a general presentation of Cassandra and its architecture. The second session covers the data model and modelling best practices: how to move from the SQL world to the NoSQL world with Cassandra.

19:10: Tooling for Cassandra – Michaël Figuière (DataStax)

A presentation of the tools that help Java developers work efficiently with Cassandra.

Big Data top paying skills

According to KDnuggets, Big Data related skills led the list of top-paying technical skills (six-figure salaries) in 2013.

The study focuses on technology professionals in the U.S. who enjoyed raises over the past year (2013).

Average U.S. tech salaries increased nearly three percent to $87,811 in 2013, up from $85,619 the previous year. Technology professionals understand they can easily find ways to grow their careers in 2014, with two-thirds of respondents (65%) confident of finding a new, better position. That overwhelming confidence, matched with declining salary satisfaction (54%, down from 57%), will keep tech-powered companies on edge about their retention strategies.

Companies are willing to pay hefty amounts to professionals with Big Data skills.


According to a report released on January 29, 2014, the average salary for professionals with knowledge and experience in the programming language R was $115,531 in 2013.

Other Big Data oriented skills such as NoSQL, MapReduce, Cassandra, Pig, Hadoop and MongoDB are among the top 10 paying skills.

Source: KDnuggets