§13.2. Concurrency, distribution and consistency

There is a danger in building and testing web applications only on ‘localhost’, without ever testing on production servers. On ‘localhost’, the code runs only under idealized conditions:

  • one user at a time,

  • making one request at a time,

  • running on a fast computer with a high clock speed. [1]

Even if you’re using multiple browser windows at once, it will be rare for your server to need to handle two requests at precisely the same time.

Once your system is deployed on the web and serving thousands or millions of users, simultaneous requests are no longer rare.

The challenge of concurrency

Consider a simple counter. For example, the count of the number of ‘likes’ on a social media post. Incrementing the counter might involve the following three operations:

  1. Read the current count from the database (Read x)

  2. Add one to the count (Compute x' = x + 1)

  3. Save the updated count to the database (Write x')

During development, this code never appears to fail: every request increases the counter by one.
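
In code, the naive increment might look like the following sketch (Python, using an in-memory dictionary as a stand-in for the database; the names db, post_id and increment_like_count are illustrative, not taken from any particular library):

    # A minimal sketch of the three-step increment, with a dict as a toy database.
    db = {"post:42": 15}

    def increment_like_count(db: dict, post_id: str) -> None:
        count = db[post_id]        # 1. Read x
        new_count = count + 1      # 2. Compute x' = x + 1
        db[post_id] = new_count    # 3. Write x'

    increment_like_count(db, "post:42")
    print(db["post:42"])           # 16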

Suppose there are two requests, one after the other, starting with a value of 15. The result should be 17. The following table depicts the chronological steps:

Timestep   Alice               Bobby
1          Read x (15)
2          Compute x' (16)
3          Write x' (16)
4
5                              Read x (16)
6                              Compute x' (17)
7                              Write x' (17)

Now, suppose there are multiple simultaneous users and we naïvely attempt to run both requests at the same time by interleaving their operations. The term concurrency refers to this idea of performing updates (or indeed, running any code) simultaneously.

The following table depicts one possible outcome:

Timestep   Alice               Bobby
1          Read x (15)
2                              Read x (15)
3          Compute x' (16)
4                              Compute x' (16)
5          Write x' (16)
6                              Write x' (16)
7

The result of simultaneously performing two updates is 16, instead of 17. We have lost an update!

This example illustrates the challenge introduced by concurrency. Many simple algorithms designed for single-user, request-at-a-time processing are unreliable and buggy if executed simultaneously. The difficulty of concurrency is something we can understand intuitively: any work (assignments, cooking, cleaning, writing, sports) is less complex when only one person is involved; a team that works together expends effort in coordination.
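
The lost update is easy to reproduce on a single machine. The following sketch runs the naive increment in two threads; the short sleep between the read and the write is artificial and only serves to make the unlucky interleaving happen reliably:

    import threading
    import time

    db = {"likes": 15}

    def unsafe_increment() -> None:
        count = db["likes"]        # Read x
        time.sleep(0.01)           # artificially widen the race window
        db["likes"] = count + 1    # Write x' (based on a possibly stale read)

    alice = threading.Thread(target=unsafe_increment)
    bobby = threading.Thread(target=unsafe_increment)
    alice.start()
    bobby.start()
    alice.join()
    bobby.join()

    print(db["likes"])             # 16, not 17: one update has been lost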

Can concurrency be avoided?

Is it feasible to avoid concurrency altogether?

Event sourcing, discussed in Chapter 10, is one approach that can be adapted to avoid concurrency. In event sourcing, every update is an event or command. By processing these events one by one, extremely quickly, end-users might not notice that the system is not concurrent. Unfortunately, using event sourcing to avoid concurrency will only succeed if a single computer can handle all the updates. For a single computer to process so many requests, the updates must be simple and the database small enough to be stored in memory (RAM).
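
One way to picture this is a single worker draining a queue of update events: producers may enqueue concurrently, but updates are applied strictly one at a time. This is only an illustrative sketch, not a full event-sourcing implementation:

    import queue
    import threading

    state = {"likes": 0}
    events: queue.Queue = queue.Queue()

    def process_events() -> None:
        # A single worker applies events one at a time, so two updates can
        # never interleave and the read-compute-write race cannot occur.
        while True:
            event = events.get()
            if event == "stop":
                break
            if event == "like":
                state["likes"] += 1

    worker = threading.Thread(target=process_events)
    worker.start()

    for _ in range(1000):          # producers can enqueue concurrently
        events.put("like")
    events.put("stop")
    worker.join()

    print(state["likes"])          # 1000: every update applied exactly once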

In practice, updates can take a long time:

  • Disk drives are slow to read: it takes time to read data from a drive (both solid-state and hard disk drives are far slower than reading from memory)

  • Complex updates can be slow to compute (not every operation will be as simple as adding one to a counter)

  • Other systems need to be involved (e.g., when a separate system must confirm the update)

  • Disk drives are slow to write: writing data is even slower than reading data (and it is often necessary to wait for the write to be confirmed to ensure no data loss in the event of a power outage)

Together, these factors mean that a single request may take a long time to handle. [2] If each request takes a server 10 milliseconds to process sequentially, the server can handle only 100 requests per second. Most of the time, the server’s CPU will be idle, waiting for the disk drives to fetch or store data and for other systems to confirm the update. Concurrency is essential: if the server can work on 200 such requests concurrently, then even though each request still takes 10 milliseconds to complete, the system can handle 20,000 requests per second.
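
The arithmetic can be checked with a small simulation. Here, asyncio stands in for a server whose requests spend their 10 milliseconds waiting on disks and other systems rather than on the CPU; the figures are illustrative:

    import asyncio
    import time

    async def handle_request() -> None:
        await asyncio.sleep(0.010)     # a request that waits 10 ms on I/O

    async def main() -> None:
        start = time.perf_counter()
        # 200 requests in flight at once: the waits overlap instead of adding up.
        await asyncio.gather(*(handle_request() for _ in range(200)))
        elapsed = time.perf_counter() - start
        print(f"200 requests in {elapsed * 1000:.0f} ms "
              f"(roughly {200 / elapsed:,.0f} requests per second)")

    asyncio.run(main())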

The challenge of distribution

It is challenging to manage concurrency on a single computer. It is nightmarishly difficult when separate servers perform concurrent updates.

In addition to ensuring that concurrent updates do not result in data loss or corruption, it is a challenge to ensure that data is consistent across multiple servers. In particular, there are two contradictory problems:

  1. If there is just one copy of data, then there is a single point of failure: if the server is damaged, then the data is lost. [3]

  2. If there are many copies, then there is the challenge of keeping those copies synchronized and consistent.

Can we avoid distributed systems?

Ultimately, dealing with the challenges of distribution can’t be avoided. As tempting as it may be to have a single server, a single server is untenable in real production environments.

A single server is a single point of failure. Businesses that require high availability and high reliability cannot have single points of failure. If a business needs to be resilient against the crash of any given server, there must be more than one server.

Furthermore, the laws of physics dictate that a single server will create a poor experience for globally distant users. The speed of light in fiber adds about 140 milliseconds to the round-trip latency of any message between Sydney and Silicon Valley. [4] Any action that must complete in less than 140 milliseconds will be impossible over these distances (e.g., responsive remote desktop streaming or video game streaming). The only way to ensure that worldwide users experience low latencies is to deploy servers around the world.
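
Using the figures from footnote 4, the 140 millisecond figure is a straightforward back-of-envelope calculation:

    distance_m = 14_345_000        # Sydney to San Jose via the Southern Cross Cable
    speed_in_fiber = 204_190_477   # meters per second in Corning SMF-28 fiber
    one_way = distance_m / speed_in_fiber
    print(f"one way: {one_way * 1000:.1f} ms, "
          f"round trip: {2 * one_way * 1000:.1f} ms")
    # one way: 70.3 ms, round trip: 140.5 ms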


1. Desktop CPUs have a high clock speed and few cores. Server CPUs typically have more cores running at low clock speeds. In comparison to a server, your desktop CPU will have faster response times but lower overall throughput.
2. Computers and disks tend to perform less efficiently on random accesses. If a system can sort and reorder requests, then it can benefit from better cache utilization and faster disk seeks.
3. Keeping regular backups is critical but it is not enough. When a server crashes, any recent updates that are not in the latest backup may be lost.
4. Sydney to San Jose by the Southern Cross Cable is 14,345 kilometers. Traveling this distance in Corning SMF-28 fiber cable at 204,190,477 meters per second will take 70.3 milliseconds in each direction.