§12.2. Network

A distributed system is a collection of networked computers that appear to users as a single coherent system. [1] We have already created distributed systems that involve the browser, the server and the database:

  • The web browser communicates with the web server over HTTP/HTTPS

  • The web server communicates with a database management system (e.g., PostgreSQL or MongoDB) using a database-specific network protocol

During development, these systems often run on the same computer. In production, they typically run on separate servers.

If the software grows in popularity, it may soon be necessary to run each of these components on multiple computers because no single computer can handle all of the demand.

Shifting from a small number of computers to a distributed system of networked computers creates new challenges. There are more points of failure and more ways that things can go wrong.

Fallacies of distributed computing

The eight fallacies of distributed computing [2] are assumptions that developers too often make when building distributed systems:

  1. The network is reliable

  2. Latency is zero

  3. Bandwidth is infinite

  4. The network is secure

  5. Topology doesn’t change

  6. There is one administrator

  7. Transport cost is zero

  8. The network is homogeneous

Real-world practice inevitably violates these assumptions:

The network is unreliable

IP packets are routinely lost or incorrectly routed. Firewalls can block access. Protocols and hardware can fail. Even the physical layer is unreliable: for example, a running microwave oven creates interference that can interrupt WiFi packets.
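
One practical consequence is that network code should expect individual requests to fail and retry them. Below is a minimal sketch in TypeScript, assuming an environment that provides the global fetch API and AbortSignal.timeout (a modern browser or Node.js 18 and later); the timeout and retry counts are arbitrary choices:

  // Retry a request a few times, because any single attempt may be lost or time out.
  async function fetchWithRetry(url: string, attempts = 3): Promise<Response> {
    let lastError: unknown;
    for (let i = 0; i < attempts; i++) {
      try {
        // Give up on this attempt if the server has not responded within 5 seconds.
        return await fetch(url, { signal: AbortSignal.timeout(5000) });
      } catch (error) {
        lastError = error;
        // Wait a little longer before each retry (simple exponential backoff).
        await new Promise((resolve) => setTimeout(resolve, 500 * 2 ** i));
      }
    }
    throw lastError;
  }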

Latency is high

The laws of physics prevent information from being transmitted faster than the speed of light. It takes about 240 milliseconds (each direction) over satellite internet, or about 70 milliseconds (each direction) over fiber optic cable, to send data from Sydney to Silicon Valley. In practice, there are many other reasons for delays: slow routers, overloaded or shared links, inefficient message encodings or signaling, and transmission over slower media.
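
These figures can be sanity-checked from the path length and the signal speed. The distances in the sketch below are approximate assumptions rather than measured routes; real cables take longer paths and pass through many routers, which is why observed latencies exceed these lower bounds:

  // Rough lower bounds on one-way propagation delay (illustrative figures only).
  const SPEED_OF_LIGHT_KM_PER_S = 300_000;   // radio waves in air or a vacuum
  const SPEED_IN_FIBER_KM_PER_S = 200_000;   // roughly two thirds of c inside glass

  const fiberPathKm = 12_000;                // assumed Sydney to Silicon Valley distance
  const satellitePathKm = 2 * 35_786;        // up to a geostationary satellite and back down

  console.log(Math.round((fiberPathKm / SPEED_IN_FIBER_KM_PER_S) * 1000), 'ms one way via fiber');         // ≈ 60 ms
  console.log(Math.round((satellitePathKm / SPEED_OF_LIGHT_KM_PER_S) * 1000), 'ms one way via satellite'); // ≈ 239 ms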

Bandwidth is limited

Users in rural locations, in developing nations, or onboard aircraft and cruise ships may be unable to download large files in a reasonable time. Residential high-speed broadband may also be overloaded when users are streaming video.
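
To make this concrete, a short calculation shows how long a download takes on a constrained link; the file size and link speed below are illustrative assumptions:

  // Time to transfer a file over a slow link, ignoring protocol overhead and packet loss.
  const fileMegabytes = 100;
  const linkMegabitsPerSecond = 2;   // e.g., a congested shared link (assumption)

  const seconds = (fileMegabytes * 8) / linkMegabitsPerSecond;
  console.log(seconds, 'seconds');   // 400 seconds, i.e., more than six minutes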

The network is insecure

As I noted in Chapter 9, the network is extremely hostile. Nothing can be trusted. Even internal networks can be insecure, and attacks can come from authenticated users.

Topology changes

The network is constantly changing. Your users may install software in new and unexpected configurations. End-users might change firewall configurations or switch from wired ethernet to a mobile hotspot.

Information technology departments are complex

End-users might not be able to change their network settings. In large corporations, there may be multiple people in charge of IT decisions and configuration.

Overhead is everywhere

Every part of the networking stack adds overhead: when data is streamed to the operating system, when the operating system encodes the data into TCP/IP packets, when the network card encodes those packets into Ethernet frames, and when signals propagate through the physical medium. Every layer incurs a penalty of additional headers and processing time for encoding and translation.
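
The header penalty is easy to quantify for a small message. The sketch below uses the minimum TCP, IPv4 and Ethernet header sizes, and ignores the Ethernet preamble and inter-frame gap:

  // Bytes actually placed on the wire when sending a small payload over TCP/IPv4/Ethernet.
  const payloadBytes = 100;
  const tcpHeaderBytes = 20;            // minimum TCP header
  const ipv4HeaderBytes = 20;           // minimum IPv4 header
  const ethernetFramingBytes = 14 + 4;  // Ethernet header plus frame check sequence

  const onWire = payloadBytes + tcpHeaderBytes + ipv4HeaderBytes + ethernetFramingBytes;
  const headerShare = (onWire - payloadBytes) / onWire;
  console.log(onWire, 'bytes on the wire,', Math.round(headerShare * 100) + '% of them headers');
  // 158 bytes on the wire, 37% of them headers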

The internet is heterogeneous

Users may run Linux, Windows, macOS, iOS or Android, and may browse with Firefox, Chrome, Edge, Safari, Internet Explorer, Opera or the KaiOS Browser. They may have different network configurations and use VPNs, firewalls and proxies. Some devices may be faulty. Some devices may not strictly follow internet standards.

Reliable systems from unreliable components

Users have come to expect the ability to access websites anytime and anywhere, and businesses today expect to offer 24×7 availability. [3]

The goal of 24×7 availability brings us to the most exciting problem in internet programming:

How can reliable systems be built when networks are unreliable?

Most machines stop working when an essential component is damaged or removed. For example, a fridge with a broken thermostat will stop cooling food.

However, it would be unacceptable for an entire system to fail whenever a single server crashes. Network and server failures happen far too often for this to be sensible.

Instead, we should design our systems to avoid single points of failure. In a sense, this is the challenge faced by life itself: individual cells in our bodies continuously die and are replaced, yet we live on as people.

Is it possible?

Before discussing how to build reliable systems, it is important first to consider whether it is possible to build reliable systems from unreliable parts.

Once we have established that such systems are possible, we can explore how to realize them.

The following is a simple (but highly impractical) scheme:

  • Deploy two copies of the same application on separate servers (e.g., 1.example.com and 2.example.com).

  • As a user, you must always open both URLs in separate windows.

  • As a user, you must perform every action identically on both websites (e.g., you’ll create an account on each site and then upload the same messages to each website).

  • If one of the servers crashes, then you telephone a system administrator. They replace the failed server with a copy of the server that did not fail.

Of course, very few users would have the patience to entertain this kind of impractical setup. However, it does suggest that there are ways to build reliable systems: as long as one server stays online, it should be possible for the administrators to restore the other server.
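
In programming terms, the manual duplication in this scheme amounts to issuing every request twice, once to each copy of the site. The following is a deliberately naive sketch; the hostnames match the example above, while the endpoint path and payload are hypothetical:

  // Perform every action against both copies of the application.
  async function postToBoth(path: string, body: unknown): Promise<void> {
    const servers = ['https://1.example.com', 'https://2.example.com'];
    // Both requests must succeed for the two copies to stay identical.
    await Promise.all(
      servers.map((server) =>
        fetch(server + path, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify(body),
        }),
      ),
    );
  }

  // Example: the same message is uploaded to both websites.
  postToBoth('/messages', { text: 'Hello' }).catch(console.error);

If one request succeeds and the other fails, the two copies silently diverge; keeping the copies consistent is exactly the kind of problem addressed in Chapter 13.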

Reflection: Reliability in DNS

How does the design of the domain name system (DNS) ensure reliability?

Tip
I discussed the operation of the DNS in Chapter 3.
Reflection: Automating reliability

The example above relied on users keeping two separate websites simultaneously up-to-date. How could this idea be extended so that it is performed automatically, without depending on a diligent user?

Not only reliability

Reliability is not the only concern. Systems need to be practical, easy to use, fast and responsive. As much as possible, users should experience what they believe is a single, reliable, fast computer. The distributed system hides the fact that it has many unreliable and perhaps slow components.

Performance efficiency and reliability were among the quality characteristics discussed in Chapter 10. In this chapter and the following chapters, I will focus on the end-user experience of these quality characteristics. In particular, I will consider performance efficiency and reliability in terms of the following quantitative metrics:

  • Throughput: the number of requests handled in a given duration (e.g., a website might have a throughput of 10,000 requests per second)

  • Response time: the total time to respond to a single request (e.g., on average it might take 100ms to respond to a request)

  • Latency: the amount of delay caused by the network (e.g., there might be a 140ms round-trip latency for requests between Sydney and Silicon Valley due to the speed of light)

  • Scalability: how well the site grows with demand (e.g., if the same software can support more users by installing additional servers, then the site is scalable)

  • Availability: the proportion of the time users can access the system (e.g., Google is said to routinely achieve 99.999% availability: less than 5.26 minutes of downtime per year [4]; see the short calculation below)
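
The downtime figure follows directly from the availability target: 99.999% availability leaves only 0.001% of the year for downtime.

  // Downtime budget implied by a 'five nines' availability target.
  const availability = 0.99999;
  const minutesPerYear = 365.25 * 24 * 60;   // ≈ 525,960 minutes
  const downtimeMinutes = (1 - availability) * minutesPerYear;
  console.log(downtimeMinutes.toFixed(2), 'minutes of downtime per year');   // ≈ 5.26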

Distributed architectures

In this and the following two chapters, I will address how to build reliable systems on unreliable networks, in three parts:

  1. In this chapter, I will explore the architectures that allow multiple servers to be combined to appear as a single system (Chapter 12)

  2. Next, I will explore how to ensure that combining multiple servers does not result in data loss or corruption (Chapter 13)

  3. Finally, I will explore how to increase performance so that reliable does not mean slow (Chapter 14)


1. van Steen and Tanenbaum (2016) ‘A brief introduction to distributed systems’, Computing, vol. 98, pp. 967-1009.
2. These fallacies are attributed to Peter Deutsch while working at Sun Microsystems.
3. In the early days of the web, some websites were only available during ‘business hours’ and many others had routine maintenance that prevented use every night from midnight to 1am.
4. Sometimes a distinction is made between availability and reliability. In such cases, availability refers to the proportion of total time the system can be used, while reliability refers to the proportion of the scheduled time during which the system is actually available. For example, a system that works perfectly except for scheduled maintenance on Sunday mornings from 1 am to 2 am has 99.4% availability and 100% reliability.