What Does Being ‘Fault-Tolerant’ Even Mean?

By: Geeq on Oct 4, 2019

Explaining what a blockchain is can be quite complicated, even more so when you’re trying to explain a new method of building one that comes with its own rules (the Geeq Proof of Honesty protocol). With that said, we want to start breaking down the complicated parts into delightfully simple explainer articles that won’t give anyone a headache. In this article, we will explain what it means to be ‘fault-tolerant’ when it comes to distributed systems, and what that looks like. This will hopefully explain why we love telling you that our system is 99% Byzantine Fault Tolerant (BFT).

Why We’re Talking About ‘Fault Tolerance’

Being in the business of building a new blockchain ecosystem means we’re naturally obsessed with designing the most fault-tolerant platform possible.

Because a blockchain is a type of ‘distributed system’, where the individual parts of the network can be miles apart, being ‘fault-tolerant’ is extremely important. That’s why we’re going to start off by explaining a little bit about distributed systems.

The Basics of Distributed Systems

To start us off, it’s important to know a blockchain is a type of distributed system. But considering this topic gets quite tricky at times, we’ll be explaining principles as simply as possible so you can easily grasp what being fault-tolerant actually means.

On the whole, a distributed system can be classified as a group of computers operating as one to reach the same goal. While the processes amongst this network are completed separately, the system appears to be working in unison to the end-user.

There are plenty of examples of distributed systems in our everyday lives. Hybrid cars, for instance, have a motor, battery pack, on-board charger, wheels, etc. which are all indeed separate parts of the system. Yet, these parts continue to work as one unit in order to complete the objective of propelling the car backward or forwards.

In a blockchain, the separate parts of the distributed system are individual devices, with each member of the network being known as a “node”, “peer”, “miner”, “validator”, or “actor”, amongst many other terms.

What are the key features of a distributed system?

  1. Concurrency
    Each computer within the network carries out its own work at the same time as the others, meaning they are working concurrently. To put it simply, all the nodes in the network are doing their jobs at the same time.
  2. Timing difficulties
    It’s quite hard to say which computer in a distributed system did a thing first, and that makes coordination difficult. In other words, it’s tricky to see which node discovered, mined, or validated a block first. This is because computers’ clocks, even if they are set to the same time, naturally drift out of sync. That’s an unfortunate consequence, because in distributed systems we need a way of telling which event happened first.
  3. Ability to deal with faulty components
    Every system will have faults at some point or another, whether that’s a process crashing; messages being abandoned, distorted, or duplicated; or even the network partitioning, delaying, or dropping messages. Sometimes systems just go haywire.

This is why it’s important for systems to be “Fault-tolerant”, meaning they can still carry out their job despite failing components.
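To make the timing difficulty above a little more concrete, here is a minimal sketch of a logical clock (often called a Lamport clock), a well-known way of ordering events without trusting wall-clock time. It is only an illustration: this article does not say how Geeq orders events, and the class and variable names below are our own assumptions.

```python
class LamportClock:
    """A logical clock: a counter that orders events without wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event (including sending a message) advances the counter.
        self.time += 1
        return self.time

    def receive(self, sender_time):
        # On receipt, jump past the sender's timestamp so the receive
        # event is always ordered after the matching send event.
        self.time = max(self.time, sender_time) + 1
        return self.time


# Two hypothetical nodes, each with its own clock.
node_a = LamportClock()
node_b = LamportClock()

sent_at = node_a.tick()                # node A sends a message
node_b.tick()                          # node B does unrelated local work
received_at = node_b.receive(sent_at)  # node B receives A's message

print(sent_at, received_at)  # prints: 1 2
```

Even if the two nodes’ real clocks disagree, the counter guarantees that a message is always stamped as received after it was sent, which is the kind of “what happened first” answer a distributed system needs.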

One Node failing doesn’t spoil the party

Going back to our hybrid car analogy: just because the battery dies doesn’t mean the car grinds to an immediate halt. The gasoline engine will kick in, allowing the car to keep operating.

Faults and failures can be grouped into three categories

  • Crash-fail: The component stops working without warning (e.g. we’ve all had a sudden computer crash right?)
  • Omission: The component sends a message but no other node receives it
  • Byzantine: There are some malicious actors within the network who break the network on purpose by blocking, altering, or refusing to send messages.
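The three categories above can be sketched as a toy model of what the rest of the network actually sees from a faulty node. This is a simplified illustration under our own assumptions: the names `Fault` and `what_peers_receive` are hypothetical, and the Byzantine case here only models altering a message, not blocking it or refusing to send it.

```python
from enum import Enum


class Fault(Enum):
    NONE = "healthy"
    CRASH = "crash-fail"
    OMISSION = "omission"
    BYZANTINE = "byzantine"


def what_peers_receive(message, fault):
    """Return the message as the rest of the network sees it (None = nothing arrives)."""
    if fault is Fault.CRASH:
        return None                 # the node stopped before sending anything
    if fault is Fault.OMISSION:
        return None                 # the node sent the message, but it was lost
    if fault is Fault.BYZANTINE:
        return "forged:" + message  # the node deliberately altered the message
    return message                  # healthy node: the message arrives unchanged


for fault in Fault:
    print(fault.value, "->", what_peers_receive("block-7", fault))
```

Notice that a crash and an omission look identical from the outside (silence), whereas a Byzantine node sends something that looks like a real message but isn’t.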

Being Fault Tolerant

So bringing this all back to Geeq™, it’s important that our distributed system (i.e. everything within the Geeq™ ecosystem) is fault-tolerant. We’ve briefly covered the types of faults above but it’s important to know how a system deals with these faults.

There are three types of fault-tolerance you need to know:

  1. Simple fault-tolerance
    In this type of system, the network makes the assumption that each computer/node does one of two things: they either play by the rules, or they fail. In principle, this handles things like crash-failures and omissions (see above), but cannot handle malicious nodes.
  2. Byzantine Fault Tolerant (BFT)
    A Byzantine Fault Tolerant system is designed to handle nodes that choose to be “Byzantine” as well as those which simply crash.
  3. Byzantine and Rational (BAR) fault-tolerance
    While nodes can be downright malicious and Byzantine, sometimes nodes will deviate from the network rules simply because it is rational for them to do so. Therefore, a BAR fault-tolerant system understands that a node can be any one of three things: Byzantine, Honest (always following protocol), or Rational (only following protocol if it makes sense for them).
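To see why simple fault-tolerance cannot handle malicious nodes, here is a toy comparison of two ways a client might read replies from four nodes. It is a sketch under our own assumptions, not a real BFT protocol and not Geeq’s Proof of Honesty: the function names and the block labels are hypothetical, and `None` stands for a node that crashed or whose message was lost.

```python
from collections import Counter


def first_reply(replies):
    """Simple fault-tolerance: trust the first reply that actually arrives."""
    for reply in replies:
        if reply is not None:
            return reply
    return None


def majority_reply(replies):
    """Accept a value only if a strict majority of all nodes reports it."""
    counts = Counter(r for r in replies if r is not None)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    return value if votes > len(replies) / 2 else None


crash_only = [None, "block-7", "block-7", "block-7"]        # one node crashed
with_liar = ["block-666", "block-7", "block-7", "block-7"]  # one node lies

print(first_reply(crash_only))    # prints: block-7
print(first_reply(with_liar))     # prints: block-666
print(majority_reply(with_liar))  # prints: block-7
```

Trusting the first reply copes with a silent node but is fooled as soon as one node lies, while the majority rule shrugs off a single liar. Real BFT and BAR protocols are far more involved, but the intuition is the same: never rely on any one node being honest.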

In order to be a fully functioning and secure distributed system, a blockchain requires adequate protocols that manage and successfully navigate these faults. Otherwise, it wouldn’t be secure, safe, or effective enough for confident usage.

Early blockchain ecosystems that used Proof of Work, Proof of Stake, and many other protocol variations struggled when it came to BFT and BAR. This stopped many industries from adopting blockchain technology, particularly in the IoT space where security is paramount. It’s for this reason that our founders, Stephanie So and John Conley, came together to develop the bespoke and revolutionary protocol: Proof of Honesty.

And There You Have It!

Your delightfully simple explainer of what it means to be fault-tolerant! Well done on getting through it. While we’d love to dive in and thoroughly explain how Proof of Honesty remains fault-tolerant, and truly Geeq™ out over it, that requires a whole dedicated article.