Challenging the reliability of distributed data system

There are two basic tasks that any computer system needs to accomplish:

  • storage
  • computation

Distributed systems allow to solve the same problem that you can solve on a single computer using multiple computers – usually, because the problem no longer fits on a single computer.

Distributed systems need to partition data or state up over lots of machines to scale. Adding machines increases the probability that some machine will fail, and to address this these systems typically have some kind of replicas or other redundancy to tolerate failures.

Where is the flaw in such reasoning?

It is the assumption that failures are independent. If you pickup pieces of identical hardware, run them on the same network gear and power systems, have the same people run and manage and configure them, and run the same (buggy) software on all of them. It would be incredibly unlikely that the failures on these machines would be independent of one another in the probabilistic sense that motivates a lot of distributed infrastructure. If you see a bug on one machine, the same bug is on all the machines. When you push bad config, it is usually game over no matter how many machines you push it to.

Time, Clocks and the Ordering of Events in a Distributed System

Written in 1978 by Leslie Lamport, this is a must read paper freely available hereafter:

Time, Clocks and the Ordering of Events in a Distributed System

Communications of the ACM 21, 7   (July 1978), 558-565.  Reprinted in several collections, including Distributed Computing: Concepts and Implementations, McEntire et al., ed.  IEEE Press, 1984.
Copyright © 1978 by the Association for Computing Machinery, Inc.Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or The definitive version of this paper can be found at ACM’s Digital Library –

Jim Gray once told me that he had heard two different opinions of this paper: that it’s trivial and that it’s brilliant.  I can’t argue with the former, and I am disinclined to argue with the latter.

The origin of this paper was a note titled The Maintenance of Duplicate Databases by Paul Johnson and Bob Thomas.  I believe their note introduced the idea of using message timestamps in a distributed algorithm.  I happen to have a solid, visceral understanding of special relativity (see [5]).  This enabled me to grasp immediately the essence of what they were trying to do.  Special relativity teaches us that there is no invariant total ordering of events in space-time; different observers can disagree about which of two events happened first.  There is only a partial order in which an event e1 precedes an event e2 iff e1 can causally affect e2.  I realized that the essence of Johnson and Thomas’s algorithm was the use of timestamps to provide a total ordering of events that was consistent with the causal order.  This realization may have been brilliant.  Having realized it, everything else was trivial.  Because Thomas and Johnson didn’t understand exactly what they were doing, they didn’t get the algorithm quite right; their algorithm permitted anomalous behavior that essentially violated causality.  I quickly wrote a short note pointing this out and correcting the algorithm.

It didn’t take me long to realize that an algorithm for totally ordering events could be used to implement any distributed system.  A distributed system can be described as a particular sequential state machine that is implemented with a network of processors.  The ability to totally order the input requests leads immediately to an algorithm to implement an arbitrary state machine by a network of processors, and hence to implement any distributed system.  So, I wrote this paper, which is about how to implement an arbitrary distributed state machine.  As an illustration, I used the simplest example of a distributed system I could think of–a distributed mutual exclusion algorithm.

This is my most often cited paper.  Many computer scientists claim to have read it.  But I have rarely encountered anyone who was aware that the paper said anything about state machines.  People seem to think that it is about either the causality relation on events in a distributed system, or the distributed mutual exclusion problem.  People have insisted that there is nothing about state machines in the paper.  I’ve even had to go back and reread it to convince myself that I really did remember what I had written.

The paper describes the synchronization of logical clocks.  As something of an afterthought, I decided to see what kind of synchronization it provided for real-time clocks.  So, I included a theorem about real-time synchronization.  I was rather surprised by how difficult the proof turned out to be.  This was an indication of what lay ahead in [62].

This paper won the 2000 PODC Influential Paper Award (later renamed the Edsger W. Dijkstra Prize in Distributed Computing).  It won an ACM SIGOPS Hall of Fame Award in 2007.

SQL or NoSQL – understanding the underlying issues

I tried recently to explain how it is not one or the other: SQL, NoSQL once again is not the question, the choice to be made. But instead, it is the the underlying issues which has to be understood and used to drive your choice.

  • Ability to scale

If you’re application can’t serve any longer its users, whatever how good and smart it used to work it is no longer working ….. So scaling, trough the techniques  of clustering,sharding and distributed process had become a must. One requirement that few RDBMS have been able to implement. Obviously the historical reasons, the old ways, are responsible: traditionally the SQL database was running on a single machine (one single big server with the biggest cpu available and all the RAM you could have afford). Before scaling solutions were made available, performance issue tried to be solved using cache techniques(memcached was created in 2003) but is all the same problem, if your application and service stop to serve its users it is game over.

  • ACID – transactional database

Most application does not need to support transaction, the ability  for a single process to perform multiple data-manipulation and finally enforce this set of operations or cancel them all, at any step, those rolling back to the initial data situation(before your program starts). Such feature, is available for all programs(and related instance) accessing a database concurrently. Such magic and complex set of features ensure to provide so called consistency and integrity. As I said, most application does not need to support transaction. Most NoSQL databases are non-ACID and does not support transaction.

  • Data model

Traditional RDBMS have relied on the relational models which can be overly restrictive. A strong relational models, when modelling complex data, requires skills and time to be created, maintained and documented(in view of knowledge transfer). In practice the relational data model will limit your future development since you can’t easily change a relational models. The NoSQL solution provides different data structure such as document,graph and key-value which enable non-relational data models. To make a long story short ,the data model (relational or not) will not ease your designs (still highly critical) but it will eventually ease its implementations.


Neo4J introducing labels in 2.0

A big new feature in Neo4j 2.0 are node labels and real, automatic indexes. Here you can quickly get an update on this extension of the property graph model.

You can watch the presentation here.

With the new node-label feature you can assign any number of types from your domain to a node. Imagine labels like Person, Location, Product, Project, User etc. Adding, querying and removing labels is supported in all Neo4j-APIs: Cypher, Java-API and REST-API (Batch-Inserter is in the works).

Starting with Neo4j 2.0 a last missing piece to Cypher functionality was added too. The new labels allow to provide label-based indexes which are handled automatically by the database. That means after an index is created all existing nodes with the label and properties are added to it behind the scenes and after the completion of that task the index will be updated transactionally.

These indexes are used by Cypher to perform index based lookups based on the label and properties that are part of the index. That either happens automatically for simple expressions or with an explicit index hint.


Neo4j Preview 2.0.0-M01 is available for download

"Taming big data" IBM's best practices for the care of big data

Infographics “Taming big data” provided by IBM.

Certain things cannot be overlooked when dealing with data. Best practices must be instituted for the care of big data just as they have long been in small data. Before enjoying big data’s amazing analytical feats, you must first get it under control – with tools that are up to the challenge of implementing best practices in a big data world.

  • availability
  • management
  • disaster recovery
  • provisioning
  • optimization
  • backup & restore
  • security
  • governance
  • auditing
  • replication
  • virtualization
  • archiving


Probability, The Analysis of Data

Probability, The Analysis of Data – Volume 1

is a free book available online, it provides educational material in the area of data analysis.

  • The project features comprehensive coverage of all relevant disciplines including probability, statistics, computing, and machine learning.
  • The content is almost self-contained and includes mathematical prerequisites and basic computing concepts.
  • The R programming language is used to demonstrate the contents. Full code is available, facilitating reproducibility of experiments and letting readers experiment with variations of the code.
  • The presentation is mathematically rigorous, and includes derivations and proofs in most cases.
  • HTML versions are freely available on the website Hardcopies are available at affordable prices.

Handling "schema" change in production

I often heard and read about such situation, you started a brand new application based on a NoSQL datastore everything goes fine so far, you’re almost happy but all of the sudden you face a critical point: you need to change the “schema” for your application and you’re already live,running production solution.

From this point in time, you must ask yourself, is my  amount of data relatively small(i.e. documents count) so I can run a batch process in order to update all the documents in bulk , writing a small conversion program.

Unfortunately won’t always turn this way and sometimes due to the big amount of data you’re dealing with,  performing bulk batch updates wouldn’t feasible due to the time and impact on performance.

In such case you must consider a Lazy Update Approach , this is where in your application you can check whether the document is in the ‘previous schema’ when you need to read it in and update it when you write it out again.

Over time this will eventually migrate documents in ‘previous schema’  to the new, though it’s possible that you may end up with documents that rarely get accessed and so remain in an ‘previous schema’. You must then wait for the number of documents that remain in the  ‘previous schema’  to be small enough so  you could run batch jobs to update these remaining documents.

During this conversion process, you need to be very careful to any process which perform operation over multiple documents, this is the downside, those process might need to be rewrited as well and at least carefully reviewed.

BigData – Key Figures

Back to basics, facts and key figures about the data:
  • Bad data or poor data quality costs US businesses $600 billion annually.
  • 247 billion e-mail messages are sent each day… up to 80% of them are spam.
  • Poor data or “lack of understanding the data” are cited as the #1 reasons for overrunning project costs.
  • 70% of data is created by individuals – but enterprises are responsible for storing and managing 80% of it. (source)
  • We can expect a 40-60 per cent projected annual growth in the volume of data generated, while media intensive sectors, including financial services, will see year on year data growth rates of over 120 per cent.
  • Every hour, enough information is consumed by internet traffic to fill 7 million DVDs.  Side by side, they’d scale Mount Everest 95 times.
  • The volume of data that businesses collect is exploding: in 15 of the US economy’s 17 sectors, for example, companies with upward of 1,000 employees store, on average, more information than the Library of Congress does (source).
  • 48 hours worth of video is posted on YouTube every hour of everyday (source).
  • Every month 30 billion pieces of content are shared on Facebook (source).
  • By 2020 the production of data will be 44 times what we produced in 2009. (source)
  • If an average Fortune 1000 company can increase the usability of its data by just 10%, the company could expect an increase of over 2 billion dollars. (Source: InsightSquared infographic)

Dead simple design with Reddit's database

Reddit’s database has two tables

Steve Huffman talks about Reddit’s approach to data storage in a High Scalability post from 2010. I was surprised to learn that they only have two tables in their database.

Lesson: Don’t worry about the schema.

[Reddit] used to spend a lot of time worrying about the database, keeping everthing nice and normalized. You shouldn’t have to worry about the database. Schema updates are very slow when you get bigger. Adding a column to 10 million rows takes locks and doesn’t work. They used replication for backup and for scaling. Schema updates and maintaining replication is a pain. They would have to restart replication and could go a day without backups. Deployments are a pain because you have to orchestrate how new software and new database upgrades happen together.

Instead, they keep a Thing Table and a Data Table. Everything in Reddit is a Thing: users, links, comments, subreddits, awards, etc. Things keep common attribute like up/down votes, a type, and creation date. The Data table has three columns: thing id, key, value. There’s a row for every attribute. There’s a row for title, url, author, spam votes, etc. When they add new features they didn’t have to worry about the database anymore. They didn’t have to add new tables for new things or worry about upgrades. Easier for development, deployment, maintenance.

The price is you can’t use cool relational features. There are no joins in the database and you must manually enforce consistency. No joins means it’s really easy to distribute data to different machines. You don’t have to worry about foreign keys are doing joins or how to split the data up. Worked out really well. Worries of using a relational database are a thing of the past.

This fits with a piece I read the other day about how MongoDB has high adoption for small projects because it lets you just start storing things, without worrying about what the schema or indexes need to be. Reddit’s approach lets them easily add more data to existing objects, without the pain of schema updates or database pivots. Of course, your mileage is going to vary, and you should think closely about your data model and what relationships you need.

White paper – Evolving Role of the Data Warehouse in the Era of Big Data

Infoworld has publish a great white paper: “The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics (Ralph Kimball)”

In this white paper, we describe the rapidly evolving landscape for designing an enterprise data warehouse (EDW) to support business analytics in the era of “big data.” We describe the scope and challenges of building and evolving a very stable and successful EDW architecture to meet new business requirements. These include extreme integration, semi- and un-structured data sources, petabytes of behavioral and image data accessed through MapReduce/Hadoop as well as massively parallel relational databases, and then structuring the EDW to support advanced analytics. This paper provides detailed guidance for designing and administering the necessary processes for deployment. This white paper has been written in response to a lack of specific guidance in the industry as to how the EDW needs to respond to the big data analytics challenge, and what necessary design elements are needed to support these new requirements.


Big_Data_Analytics” – The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics A Kimball Group White Paper By Ralph Kimball