Challenges to the reliability of distributed data systems

There are two basic tasks that any computer system needs to accomplish:

  • storage
  • computation

Distributed systems let you solve, with multiple computers, the same problems you could solve on a single computer – usually because the problem no longer fits on a single computer.

Distributed systems need to partition data or state across many machines to scale. Adding machines increases the probability that some machine will fail, so these systems typically maintain replicas or other redundancy to tolerate failures.

Where is the flaw in such reasoning?

It is the assumption that failures are independent. If you pick up pieces of identical hardware, run them on the same network gear and power systems, have the same people run, manage, and configure them, and run the same (buggy) software on all of them, it would be incredibly unlikely that the failures of these machines would be independent of one another in the probabilistic sense that motivates a lot of distributed infrastructure. If you see a bug on one machine, the same bug is on all the machines. When you push a bad config, it is usually game over no matter how many machines you push it to.
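The probabilistic reasoning can be made concrete with a short sketch. The failure probability, cluster size, and replica count below are illustrative assumptions, not measurements:

```python
# Illustrative numbers only: per-machine failure probability is assumed, not measured.
p = 0.05   # assumed probability that one machine fails in a given period
n = 100    # machines in the cluster
k = 3      # replicas of each piece of data

# With more machines, *some* failure becomes near-certain:
p_any_failure = 1 - (1 - p) ** n          # ~0.994 for p=0.05, n=100

# Replication helps only if failures are independent:
p_lose_data_independent = p ** k          # 0.05**3 = 1.25e-4

# If a correlated cause (bad config, shared software bug) takes out all
# replicas at once, the loss probability collapses back to roughly p,
# no matter how many replicas you keep:
p_lose_data_correlated = p

print(p_any_failure, p_lose_data_independent, p_lose_data_correlated)
```

The gap between the last two numbers is exactly what the independence assumption buys you – and what correlated failures take away.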

Definition: Availability

Availability = uptime / (uptime + downtime)

Availability from a technical perspective is mostly about being fault tolerant. Because the probability of a failure occurring increases with the number of components, the system should be able to compensate so as not to become less reliable as the number of components increases.

For example, a given availability rate for a service over an entire year means the following:

Availability %              Downtime allowed per year
90%      (“one nine”)       More than a month
99%      (“two nines”)      Less than 4 days
99.9%    (“three nines”)    Less than 9 hours
99.99%   (“four nines”)     Less than an hour
99.999%  (“five nines”)     ~ 5 minutes
99.9999% (“six nines”)      ~ 31 seconds
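The table values follow directly from the availability formula above; a small helper makes them easy to check (the function name is mine, not from any library):

```python
def downtime_per_year_seconds(availability_pct: float) -> float:
    """Maximum downtime per (non-leap) year for a given availability percentage."""
    seconds_per_year = 365 * 24 * 3600  # 31,536,000 seconds
    return seconds_per_year * (1 - availability_pct / 100)

# "five nines" allows roughly five minutes of downtime per year:
print(downtime_per_year_seconds(99.999) / 60)  # ~5.26 minutes
```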

The 4 V’s of Big Data

The challenges associated with Big Data are the “4 V’s”: Volume, Velocity, Variety, and Value.

The “4 V’s” of Big Data: Volume, Velocity, Variety, and Value. Source: Oracle.


  • The Volume challenge exists because most businesses generate much more data than what their systems were designed to handle.
  • The Velocity challenge exists if a company’s data analysis or data storage runs slower than its data generation. This could be because of customer clicks on your website or thousands of sales transactions every second — a good problem to have.
  • The Variety challenge exists because of the need to process different types of data to produce the desired insights. This could include, for example, analyzing data from social networks, databases and customer service call records at the same time.
  • The Value challenge applies to deriving valuable insights from data, which is the most important of all V’s in my view. A company can usually collect all the data but the challenge is to ask the right questions to get value from it.

What exactly is big data?

Explanations vary, of course, but we might agree that big data is high-volume, high-velocity, and high-variety information that requires new tools and skills to manage.

Definition “information” term

Drucker (1988) wrote that information is “data endowed with relevance and purpose.”

Definition “YAML” term

YAML is a human-readable data serialization format that takes concepts from programming languages such as C, Perl, and Python, and ideas from XML and the data format of electronic mail (RFC 2822). YAML was first proposed by Clark Evans in 2001, who designed it together with Ingy döt Net and Oren Ben-Kiki. It is available for several programming languages.

YAML is a recursive acronym for “YAML Ain’t Markup Language”. Early in its development, YAML was said to mean “Yet Another Markup Language”, but was retronymed to distinguish its purpose as data-oriented, rather than document markup.


Sample document

Data structure hierarchy is maintained by outline indentation.


receipt:     Oz-Ware Purchase Invoice
date:        2007-08-06
customer:
    given:   Dorothy
    family:  Gale

items:
    - part_no:   A4786
      descrip:   Water Bucket (Filled)
      price:     1.47
      quantity:  4

    - part_no:   E1628
      descrip:   High Heeled "Ruby" Slippers
      size:      8
      price:     100.27
      quantity:  1

bill-to:  &id001
    street: |
            123 Tornado Alley
            Suite 16
    city:   East Centerville
    state:  KS

ship-to:  *id001

specialDelivery:  >
    Follow the Yellow Brick
    Road to the Emerald City.
    Pay no attention to the
    man behind the curtain.
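Parsed with a YAML library (e.g. PyYAML, omitted here to keep the sketch dependency-free), the sample maps onto plain dictionaries and lists, and the &id001 anchor with its *id001 alias becomes two references to the same object. The resulting structure, written out as native Python:

```python
# The structure the YAML sample deserializes to, written out as native Python.
bill_to = {
    "street": "123 Tornado Alley\nSuite 16\n",  # '|' block scalar keeps newlines
    "city": "East Centerville",
    "state": "KS",
}

invoice = {
    "receipt": "Oz-Ware Purchase Invoice",
    "date": "2007-08-06",
    "bill-to": bill_to,
    "ship-to": bill_to,  # '*id001' alias: same object, not a copy
    # '>' folded scalar joins its lines with spaces
    "specialDelivery": "Follow the Yellow Brick Road to the Emerald City. "
                       "Pay no attention to the man behind the curtain.\n",
}

# The alias means the two keys point at literally the same dictionary:
print(invoice["bill-to"] is invoice["ship-to"])  # True
```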


Standard review – ISO 4217 – Currency

ISO 4217 is a standard published by the International Organization for Standardization (ISO) that delineates currency designators, country codes (alphabetic and numeric), and references to minor units in three tables.

An updated and freely available data source for country and currency codes is available here:

The ISO 4217 maintenance agency (MA), SIX Interbank Clearing, is responsible for maintaining the list of codes.
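A few well-known entries illustrate the three kinds of information the standard carries – alphabetic code, numeric code, and minor units. The dictionary below is a hand-picked illustration, not the maintained list:

```python
# (alpha code) -> (numeric code, minor units). Hand-picked sample, not the full table.
ISO_4217_SAMPLE = {
    "USD": (840, 2),  # US dollar: cents, so 2 minor-unit digits
    "EUR": (978, 2),
    "JPY": (392, 0),  # yen has no minor unit
}

def format_amount(minor_amount: int, code: str) -> str:
    """Render an integer count of minor units as a decimal string."""
    _, exponent = ISO_4217_SAMPLE[code]
    value = minor_amount / (10 ** exponent)
    return f"{value:.{exponent}f} {code}"

print(format_amount(1999, "USD"))  # 19.99 USD
print(format_amount(1999, "JPY"))  # 1999 JPY
```

Storing amounts as integer minor units and using the standard's exponent for display is a common way to avoid floating-point rounding in money handling.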

Definition “Kalman filter” term

According to Wikipedia:

The Kalman filter, also known as linear quadratic estimation (LQE), is an algorithm which uses a series of measurements observed over time, containing noise (random variations) and other inaccuracies, and produces estimates of unknown variables that tend to be more precise than those that would be based on a single measurement alone. More formally, the Kalman filter operates recursively on streams of noisy input data to produce a statistically optimal estimate of the underlying system state. The filter is named for Rudolf (Rudy) E. Kálmán, one of the primary developers of its theory.

The Kalman filter has numerous applications in technology. A common application is for guidance, navigation and control of vehicles, particularly aircraft and spacecraft. Furthermore, the Kalman filter is a widely applied concept in time series econometrics.

The algorithm works in a two-step process: in the prediction step, the Kalman filter produces estimates of the current state variables, along with their uncertainties. Once the outcome of the next measurement (necessarily corrupted with some amount of error, including random noise) is observed, these estimates are updated using a weighted average, with more weight being given to estimates with higher certainty. Because of the algorithm’s recursive nature, it can run in real time using only the present input measurements and the previously calculated state; no additional past information is required.
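The predict/update loop can be sketched for the simplest case: a one-dimensional constant hidden state observed through Gaussian noise. The variance values below are illustrative assumptions, not tuned numbers:

```python
# A minimal 1D Kalman filter sketch: constant hidden state, noisy measurements.
# process_var and meas_var are illustrative assumptions, not tuned values.
def kalman_1d(measurements, process_var=1e-4, meas_var=0.01):
    x = 0.0  # initial state estimate
    p = 1.0  # initial estimate variance (deliberately large: we know nothing yet)
    estimates = []
    for z in measurements:
        # Prediction step: the state model is "x stays the same",
        # so only the uncertainty grows by the process noise.
        p = p + process_var
        # Update step: the Kalman gain k weights the new measurement against
        # the prediction; a small p (confident estimate) yields a small k.
        k = p / (p + meas_var)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return estimates

# Noisy readings of a true value of 1.0 converge toward it:
print(kalman_1d([0.9, 1.1, 1.0, 0.95, 1.05])[-1])
```

Note how the weighted average described above appears directly in the update step: the gain k interpolates between the prediction x and the measurement z.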

From a theoretical standpoint, the main assumption of the Kalman filter is that the underlying system is a linear dynamical system and that all error terms and measurements have a Gaussian distribution (often a multivariate Gaussian distribution). Extensions and generalizations to the method have also been developed, such as the Extended Kalman Filter and the Unscented Kalman filter which work on nonlinear systems. The underlying model is a Bayesian model similar to a hidden Markov model but where the state space of the latent variables is continuous and where all latent and observed variables have Gaussian distributions.



About the genesis of JSON

Video from IEEE Computing Conversations: an interview with Douglas Crockford about the development of JavaScript Object Notation (JSON).


  • Crockford is likeably humble about the origins of JSON. Rather than claiming he invented JSON, he instead says he discovered it:
“I don’t claim to have invented it, because it already existed in nature. I just saw it, recognized the value of it, gave it a name, and a description, and showed its benefits. I don’t claim to be the only person to have discovered it.”


  • Crockford tried very hard to strip unnecessary stuff from JSON so it stood a better chance of being language independent. When confronted with pushback about JSON not being a “standard”, Crockford registered a domain, put up a specification that documented the data format, and declared it a standard.


  • Crockford wanted something that made his life easier. He needed JSON when building an application in which a client written in JavaScript needed to communicate with a server written in Java. He wanted a serialization format that matched the data structures available in both programming language environments.
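That last point – serialization matching native data structures – is easy to see in any language with JSON support. In Python, for example, the round-trip is one call in each direction:

```python
import json

# A native structure using only JSON's types:
# objects, arrays, strings, numbers, booleans, null.
message = {"user": "dorothy", "items": [1, 2, 3], "active": True, "note": None}

wire = json.dumps(message)          # serialize to the wire format
restored = json.loads(wire)        # the other side reconstructs the structure
print(restored == message)          # True: nothing was lost in transit
```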


IETF working on a convention for HTTP access to JSON resources

An IETF Internet-Draft, “A Convention for HTTP Access to JSON Resources”, is in progress.


This document codifies a convention for accessing JSON representations of resources via HTTP.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.