8.3: Data persistence (Advanced Internet Programming)

Typically, we do not want to lose data after a server restart. Therefore data must be stored in non-volatile storage (i.e., on disks in your computer’s filesystem).

A simple solution

The simplest way to store data in Node.js is by saving directly to disk. For example, the following replacement for the persistence layer saves/restores the shopping list to a file:

const fs = require('fs');

// Retrieve all items in the shopping list
// returns an array of { description: string, quantity: number }
function findAllItems() {
    try {
        // Try to load the items from the filesystem, as JSON
        return JSON.parse(fs.readFileSync('items.json'));
    } catch {
        // Otherwise use a default setting
        return [];
    }
}

// Insert a single item into the shopping list
// description is a string, quantity is an integer
// the parameters are assumed to be valid (non-null, non-empty)
function insertItem(description, quantity) {
    // Load the current items
    let items = findAllItems();
    // Add the new record
    items.push({ description, quantity });
    // Overwrite the list of items in the filesystem
    fs.writeFileSync('items.json', JSON.stringify(items));
}

module.exports = { findAllItems, insertItem };

Tip	This code is a drop-in replacement for the persistence layer of the in-memory shopping list. Changes to the domain and persistence layer are unnecessary: layering isolates persistence changes to a single module.

Why use a database management system?

While directly writing to the filesystem could be used for persistence, it has severe problems for applications running at scale:

It is very slow and becomes slower as more data increases. Every new entry requires that all the existing data be read and rewritten. On my computer, adding the first shopping list item takes 90 microseconds. However, adding one single item to a list that contains 5,000,000 items takes 8 seconds!
It can cause data corruption. If power is unexpectedly lost while writing the file, you may lose half of the data.
It does not safely handle multiple applications. If you open the underlying file items.json in an editor, your changes could be lost or corrupted if the server updates the file at the same time you make your changes.

The solution to these problems is to use a database management system ^[1] (or, informally, a “database”).

Database management systems serve at least three purposes:

Storage: They store data in disks (and memory) for fast and efficient retrieval, update and querying. Also, the database management system ensures no data loss during a crash or power outage.
Distribution: They ensure that all users over a network can access (sufficiently) up-to-date data.
Concurrency: They ensure that all users can simultaneously update data without risk of corruption.

In this chapter, I will discuss the storage aspect of the database management systems. Distribution and concurrency are covered in Chapters 13 and 14.

Tip

Strictly speaking, a database is a collection of data, and the database management system is software that manages the database. For example, Oracle, Microsoft SQL Server, MySQL and MongoDB are all database management systems. You use those systems to create and organize data for your applications. For example, you might build a shopping list database using the MongoDB database management system. In casual usage, the word database can refer to either the underlying data or the software that manages the data.

Kinds of DBMS

The mid-1970s saw the rise of relational databases and SQL (structured query language). Since then, they have been the dominant form of database management system.

Well-known examples of SQL database management systems include the following:

PostgreSQL is a fully-featured open-source system designed for scalability.
SQLite3 is a lightweight open-source system, designed for embedding into other applications.
MySQL (or MariaDB) is a fully-featured and easy-to-use system under an open-source license or a commercial license.
Microsoft SQL Server is a user-friendly but powerful commercial database system.
Oracle Database is an expensive commercial database system that is popular with large enterprises.

While competitors have produced alternatives to relational database management systems, few gained widespread recognition before the year 2000.

In the early 2000s, demand created by the web’s explosive growth drove the development of various new database technologies and products. Many of these products do not support SQL, so these new database systems are known as NoSQL. ^[2]

These new NoSQL database systems include broad categories such as the following:

Document stores that store entire XML, HTML or JSON documents in a searchable format (e.g., MongoDB, CouchDB)
Column-oriented DBMS store data in columns rather than in rows for fast querying in data analytics and data mining applications (e.g., Cassandra, HBase, SAP HANA)
Key-value stores are simplified databases that only support tables with two columns: a key, and a corresponding value (e.g., Berkeley DB, Redis)
Graph databases store complex graph structures such as social networks and genetic relationships (e.g., Neo4J, Amazon Neptune)

I will cover two main database systems:

PostgreSQL (a SQL database)
MongoDB (a document store)

These database management systems are some of the most popular choices for new web applications. You deploy these systems on your hardware or use the managed versions offered by cloud platforms:

Database management system	Managed cloud offerings
PostgreSQL	ElephantSQL, AWS RDS for PostgreSQL, Azure Database for PostgreSQL, Google Cloud SQL for PostgreSQL
MongoDB	MongoDB Atlas, Amazon DocumentDB (with MongoDB compatibility)

Database management system

Managed cloud offerings

PostgreSQL

ElephantSQL, AWS RDS for PostgreSQL, Azure Database for PostgreSQL, Google Cloud SQL for PostgreSQL

MongoDB

MongoDB Atlas, Amazon DocumentDB (with MongoDB compatibility)

Using a DBMS

It may help to consider the relationship between a DBMS and your application, through layering:

Web browser

Web server

Database server (Database management system)

A database server is a separate program that specializes in handling data requests.

The web browser is a client that connects to the web server to retrieve an HTML page. The web server is, in turn, a client that connects to a database to retrieve data.

During development, all three of these servers can run on the same computer. i.e., Your web browser connects to localhost to access the web server; the web server connects to a different port on localhost to access the database. When deployed to production, these layers will typically run on separate computers. i.e., The browser runs on your end-users’ computer; the web servers may be running on many low-cost CPU-optimized servers; you will typically have a small number of more expensive database servers with excellent disk performance.

Your web server communicates to your database over a network connection. Database queries are encoded into a proprietary binary format and sent to the database server. The database server runs the query and then will encode the results and send them back to the web server. The web server will then format the results into HTML for the end-user.

1. In theory, you could solve all of the data storage problems yourself without using a database. The solution would involve careful programming and sophisticated storage strategies. However, the result of all that work is that you’d be writing a database management system from scratch.

2. Despite the name, some NoSQL database management systems support basic SQL queries. NoSQL refers to databases that don’t provide the full suite of SQL features: SQL, relational storage and ACID transactions (I discuss transactions in Chapter 13).