The other day I was discussing with a colleague how to handle node failures in a distributed system, especially the case where node A thinks that node C is down but node B thinks otherwise. The ramifications of this could be fatal to the system unless handled properly.
Later I came across the term “Byzantine faults” in an unrelated search and started following a lot of the literature around it. Here are a couple of good reads.
- Julian Browne, in his post “Gangstas Don’t Scale”, refers to Byzantine fault tolerance and the Two Generals’ Problem, and addresses the common misconception that making everything asynchronous will solve all scaling issues in distributed systems. I agree. As in any other system, there is no silver bullet. In distributed systems you have to expect failures and design for them. Containing the effects of faults is key to success, and asynchronous communication doesn’t make that easy.
- Byzantine fault detectors for solving consensus – The authors propose algorithms to solve the consensus problem in asynchronous distributed systems that are subject to Byzantine faults, extending earlier results for systems subject to crash faults.
- Gregor Hohpe’s post “Your Coffee Shop Doesn’t Use Two-Phase Commit”.
- Trade-offs between availability and consistency – In this post, Werner Vogels talks about a different notion of consistency, “eventual consistency”, employed in the architecture of Amazon’s Dynamo.
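
To make the opening scenario concrete, here is a minimal sketch of how nodes A and B can end up with conflicting views of node C. It assumes a simple timeout-based heartbeat detector. The `FailureDetector` class, its method names, and the 3-second timeout are all hypothetical and chosen only for illustration. Note that this models a lost message, not a truly Byzantine node that lies to its peers.

```python
from dataclasses import dataclass, field


@dataclass
class FailureDetector:
    """Hypothetical timeout-based failure detector owned by one observer node."""
    owner: str
    timeout: float  # seconds of silence after which a peer is suspected
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the time at which a heartbeat from `peer` arrived.
        self.last_seen[peer] = now

    def suspects(self, peer: str, now: float) -> bool:
        # Suspect a peer that was never heard from or has been silent too long.
        seen = self.last_seen.get(peer)
        return seen is None or now - seen > self.timeout


a = FailureDetector("A", timeout=3.0)
b = FailureDetector("B", timeout=3.0)

# t=0: both A and B receive a heartbeat from C.
a.heartbeat("C", now=0.0)
b.heartbeat("C", now=0.0)

# t=4: C's heartbeat reaches B, but the one sent to A is lost on the network.
b.heartbeat("C", now=4.0)

# t=5: the two observers now disagree about whether C is alive.
a_view = a.suspects("C", now=5.0)  # 5.0 - 0.0 > 3.0, so A suspects C
b_view = b.suspects("C", now=5.0)  # 5.0 - 4.0 <= 3.0, so B does not
print(a_view, b_view)  # prints "True False"
```

Neither node is wrong given what it has observed, which is exactly why the system needs some agreed-upon way of resolving the disagreement before acting on it.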