Challenging the reliability of distributed data systems

There are two basic tasks that any computer system needs to accomplish:

  • storage
  • computation

Distributed systems let you use multiple computers to solve the same problem you could solve on a single computer, usually because the problem no longer fits on one machine.

To scale, distributed systems need to partition data or state across many machines. Adding machines increases the probability that some machine will fail, so these systems typically add replicas or other redundancy to tolerate individual failures.
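
To make the idea concrete, here is a minimal sketch of that pattern, assuming a simple hash-partitioning scheme with neighbor replication; the node names, replication factor, and `nodes_for_key` helper are illustrative, not taken from any particular system.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster members
REPLICATION_FACTOR = 2  # each key is stored on this many distinct nodes

def nodes_for_key(key: str) -> list[str]:
    """Pick a primary partition by hashing the key, then place replicas
    on the following nodes so that one machine failing loses no data."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    primary = digest % len(NODES)
    return [NODES[(primary + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

print(nodes_for_key("user:42"))  # e.g. ['node-2', 'node-3']
```

The redundancy only pays off if the replicas are unlikely to fail at the same time, which is exactly the assumption questioned below.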

Where is the flaw in this reasoning?

It is the assumption that failures are independent. If you pick up pieces of identical hardware, run them on the same network gear and power systems, have the same people manage and configure them, and run the same (buggy) software on all of them, it would be incredibly unlikely that the failures on these machines would be independent of one another in the probabilistic sense that motivates a lot of distributed infrastructure. If you see a bug on one machine, the same bug is on all the machines. When you push a bad config, it is usually game over no matter how many machines you push it to.
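
As a rough illustration of why the independence assumption matters, here is a back-of-the-envelope calculation with made-up numbers (the failure probabilities are assumptions chosen for illustration, not measurements):

```python
# Assumed chance that an individual replica fails within some time window.
p_node = 0.01
replicas = 3

# If failures were truly independent, losing all replicas at once is vanishingly rare.
p_all_independent = p_node ** replicas  # 1e-06

# Now add a single correlated cause (bad config push, shared software bug)
# that takes out every replica together, even though it is itself fairly rare.
p_correlated_cause = 0.001
p_all_correlated = p_correlated_cause + (1 - p_correlated_cause) * p_all_independent

print(p_all_independent)  # 1e-06
print(p_all_correlated)   # ~1.001e-03, about 1000x worse, dominated by the shared cause
```

Under these assumptions the correlated cause dominates: no amount of extra replicas changes the outcome, because they all share the same failure mode.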
