Write-Ahead Logging (WAL) in PostgreSQL: How It Works

Software Engineer - Backend

PostgreSQL guarantees durability using Write-Ahead Logging (WAL). WAL ensures that once a transaction is committed, its changes survive a crash, even if the affected table or index pages had not yet been written to disk when the failure occurred.

This article walks step by step through how WAL works, why heap writes are random and inefficient, what checkpoints are, what the background writer does, and why PostgreSQL needs both checkpoints and the background writer's periodic writes.


Why WAL is Needed

Naïve design: write table and index changes directly to disk at commit.
Problems with this:

  • Random I/O: Table (heap) updates are scattered across pages. Writing them all at commit time means slow, random disk writes. HDDs suffer most (seek latency), but even SSDs pay a cost through write amplification (small random writes can force entire erase blocks to be rewritten).

  • Inconsistency risk: If the database crashes midway, table pages may be left partially written, corrupting the table.

WAL solves this by logging compact, sequential records of changes instead of writing full data pages at commit time.

  • WAL records are appended to a sequential log.

  • Sequential disk writes are much faster than random writes, and a commit only has to wait for one small append.

  • Actual data pages can be written lazily in the background.


How WAL Works (Memory and Disk)

  1. Shared Buffers (RAM):
    Table and index pages live in shared memory. When you INSERT/UPDATE/DELETE, the change is applied to the pages in shared buffers, not to the data files on disk.

  2. WAL Buffers (RAM):
    A WAL record describing the change ("insert tuple at page X, offset Y") is generated in memory.

  3. WAL Segment Files (disk):
    On commit, PostgreSQL flushes WAL buffers to disk (in pg_wal/). With the default synchronous_commit setting, the transaction is reported as committed only after its WAL is safely persisted to disk.
    At that point the heap and index files on disk may still be stale: the changed pages exist only as dirty pages in RAM.

  4. Data Files (heap/index, disk):
    Dirty data pages are written later, in bulk, by the background writer and at checkpoints.
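
The four steps above can be sketched as a toy model. This is only an illustration under simplifying assumptions: the class and its fields (ToyDatabase, wal_on_disk, data_files) are hypothetical names, plain Python lists and dicts stand in for memory and disk, and nothing here mirrors PostgreSQL's real data structures.

```python
# Toy model of the write path; not PostgreSQL's real data structures.
class ToyDatabase:
    def __init__(self):
        self.shared_buffers = {}   # page_id -> list of rows (RAM)
        self.dirty_pages = set()   # pages changed in RAM but not yet in data files
        self.wal_buffers = []      # WAL records not yet flushed (RAM)
        self.wal_on_disk = []      # stands in for segment files in pg_wal/
        self.data_files = {}       # stands in for heap/index files on disk

    def insert(self, xid, page_id, row):
        # Steps 1 and 2: change the page in memory and describe the change in WAL.
        self.shared_buffers.setdefault(page_id, []).append(row)
        self.dirty_pages.add(page_id)
        self.wal_buffers.append(("INSERT", xid, page_id, row))

    def commit(self, xid):
        # Step 3: the commit record and everything before it reach disk first.
        self.wal_buffers.append(("COMMIT", xid))
        self.wal_on_disk.extend(self.wal_buffers)  # real code would write + fsync
        self.wal_buffers.clear()

    def write_dirty_pages(self):
        # Step 4: data files catch up later (background writer / checkpoint).
        for page_id in sorted(self.dirty_pages):
            self.data_files[page_id] = list(self.shared_buffers[page_id])
        self.dirty_pages.clear()


db = ToyDatabase()
db.insert(xid=100, page_id=7, row="alice")
db.insert(xid=100, page_id=3, row="bob")
db.commit(xid=100)

# Right after commit: WAL is "on disk", the data files are still untouched.
committed_but_unwritten = (len(db.wal_on_disk), dict(db.data_files))

db.write_dirty_pages()
```

The point of the sketch is the ordering: commit() touches only the log, and the data files are brought up to date by a separate, later step.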


WAL Structure

  • WAL is stored in pg_wal/ as segment files (default 16 MB each).

  • Each segment is divided into WAL pages (8 KB by default, the same as the default heap page size).

  • WAL is append-only: written sequentially until a segment is full, then the next is created (or recycled).
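
To make the layout concrete, here is a small sketch of the arithmetic that maps a WAL position (LSN) to a segment file, assuming the default 16 MB segments, 8 KB WAL pages, and timeline 1. The helper names (parse_lsn, locate) are made up for this example; the file naming follows the 24-hex-digit scheme (timeline, high part, low part) as I understand it, so treat it as an approximation and use the built-in pg_walfile_name() on a real server.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024   # default segment size (assumed)
WAL_PAGE_SIZE = 8 * 1024              # default WAL page size (assumed)


def parse_lsn(text):
    """Convert an LSN such as '0/16B3748' to a 64-bit byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def locate(lsn_text, timeline=1):
    """Return (segment file name, byte offset in segment, WAL page index)."""
    lsn = parse_lsn(lsn_text)
    segment_no = lsn // WAL_SEGMENT_SIZE
    offset_in_segment = lsn % WAL_SEGMENT_SIZE
    # Number of segments that fit in one 4 GB "high part" of the LSN.
    segments_per_xlogid = 0x100000000 // WAL_SEGMENT_SIZE
    file_name = "%08X%08X%08X" % (
        timeline,
        segment_no // segments_per_xlogid,
        segment_no % segments_per_xlogid,
    )
    return file_name, offset_in_segment, offset_in_segment // WAL_PAGE_SIZE


pages_per_segment = WAL_SEGMENT_SIZE // WAL_PAGE_SIZE   # 2048 with the defaults
name, offset, page_index = locate("0/16B3748")
# name == "000000010000000000000001", offset == 0x6B3748, page_index == 857
```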


Why Heap Writes Are Random

Heap (table) writes are generally not sequential:

  • Inserts reuse free space from any page (tracked by the Free Space Map).

  • Updates insert a new row version, which lands on a different page whenever the current page has no room for it.

  • Deletes only mark tuples dead wherever they exist.

This means dirty pages are scattered across the table file, so writing them to disk means random writes. Random writes are inherently slower than sequential writes.

  • On HDDs → each random write triggers a disk seek (slow).

  • On SSDs → random writes cause write amplification (flash blocks must be erased and rewritten even for small updates).

👉 WAL avoids this problem because WAL writes are always sequential appends.
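
A tiny sketch of the difference, using made-up numbers: the dirty page list and the 64-byte record size are purely hypothetical, and count_jumps is a helper invented for this example. It counts how many writes do not start where the previous one ended, which is a rough stand-in for seeks.

```python
PAGE_SIZE = 8192


def count_jumps(offsets, write_size):
    """Count writes that do not start where the previous write ended."""
    jumps = 0
    for prev, cur in zip(offsets, offsets[1:]):
        if cur != prev + write_size:
            jumps += 1
    return jumps


# Hypothetical heap pages dirtied by a few inserts, updates and deletes.
dirty_pages = [912, 4, 377, 5, 12048]
heap_offsets = [page * PAGE_SIZE for page in dirty_pages]

# Hypothetical WAL records of 64 bytes each, appended back to back.
RECORD_SIZE = 64
wal_offsets = [i * RECORD_SIZE for i in range(len(dirty_pages))]

heap_jumps = count_jumps(heap_offsets, PAGE_SIZE)   # 4: every write lands elsewhere
wal_jumps = count_jumps(wal_offsets, RECORD_SIZE)   # 0: one contiguous append
```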


Recovery With WAL

If PostgreSQL crashes:

  1. On restart, it finds the last checkpoint (more below).

  2. Replays WAL records written after that checkpoint.

  3. Applies missing changes to heap/index pages on disk.

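  4. Replays records from uncommitted transactions too, but their changes stay invisible: without a commit record the transaction is never marked committed, so its row versions are treated as aborted and later removed by VACUUM.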

Because WAL is always flushed before commit returns, recovery can always replay committed changes safely.
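
The replay loop can be sketched roughly as follows. This is a heavily simplified model, not PostgreSQL's recovery code: the record format, the recover function, and redo_lsn (the WAL position of the last checkpoint) are assumptions for illustration. It does show one real idea, which is that each page remembers the LSN of the last change applied to it, so replaying a record twice is harmless.

```python
def recover(data_pages, wal, redo_lsn):
    """Replay WAL records at or after redo_lsn onto pages read from disk.

    data_pages: page_id -> {"lsn": int, "rows": list}, as found after the crash
    wal: list of (lsn, page_id, row) records in log order
    """
    for lsn, page_id, row in wal:
        if lsn < redo_lsn:
            continue  # already reflected in the data files by the checkpoint
        page = data_pages.setdefault(page_id, {"lsn": 0, "rows": []})
        if page["lsn"] >= lsn:
            continue  # this page reached disk after the change; nothing to redo
        page["rows"].append(row)
        page["lsn"] = lsn
    return data_pages


# WAL as found in pg_wal/ after the crash (toy records).
wal = [(10, 1, "a"), (20, 2, "b"), (30, 1, "c"), (40, 2, "d")]

# Data files after the crash: page 1 is stale, page 2 happened to be written.
on_disk = {
    1: {"lsn": 10, "rows": ["a"]},
    2: {"lsn": 40, "rows": ["b", "d"]},
}

recovered = recover(on_disk, wal, redo_lsn=25)
# Page 1 gains row "c"; page 2 is left alone because it is already current.
```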


What fsync Means

WAL durability relies on fsync().

  • fsync() is a system call that blocks until the buffered writes for a given file have been transferred to the storage device.

  • It does not return after queuing writes — it only returns once the OS and disk confirm persistence.

  • PostgreSQL uses this to guarantee that committed transactions survive crashes.
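
As a minimal illustration of the write-then-fsync pattern, here is an append-only log in Python using the standard os.write() and os.fsync() calls. The append_durably helper and the record contents are invented for this example, and it makes no claim about how PostgreSQL itself structures its WAL writes.

```python
import os
import tempfile


def append_durably(path, record: bytes) -> None:
    """Append one record and return only after fsync() reports success."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        # After write() the data may still sit only in the OS page cache.
        # (A robust version would also loop on short writes.)
        os.write(fd, record)
        # fsync() blocks until the OS reports the file's data is on storage.
        os.fsync(fd)
    finally:
        os.close(fd)


log_path = os.path.join(tempfile.mkdtemp(), "toy_wal.log")
append_durably(log_path, b"INSERT page=7 row=alice\n")
append_durably(log_path, b"COMMIT xid=100\n")
```

Only once append_durably() returns for the COMMIT record would a system built this way be entitled to tell the client that the transaction is committed.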


Checkpoints and the Background Writer

PostgreSQL has two processes involved in writing dirty buffers to disk: the background writer and the checkpointer, which performs checkpoints.

Background Writer

  • Runs every bgwriter_delay (default: 200 ms).

  • Periodically issues write() system calls to copy dirty pages from PostgreSQL’s shared buffers into the OS page cache.

  • Does not call fsync(). This means the data may still sit in the OS cache and not reach durable storage until the OS decides to flush it.

  • Why no fsync? Because calling fsync on every small batch would defeat the purpose — it would force frequent, expensive durability guarantees. The background writer’s job is purely to smooth I/O and reduce checkpoint spikes, not to provide durability guarantees.

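  • The pages it writes are only guaranteed to be on durable storage after the next checkpoint, which does call fsync(). Committed transactions are already durable by then through the WAL.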


Checkpoints

  • A checkpoint is the point at which PostgreSQL writes all dirty pages in shared buffers to the data files and fsyncs those files.
    It then writes a checkpoint record to WAL.

    • This does not make transactions durable — durability comes from WAL flush at commit.

    • Instead, checkpoints guarantee that WAL replay during crash recovery can start from the last checkpoint, not from the beginning of the WAL.

    • This bounds crash recovery time: the more frequent the checkpoints, the less WAL needs to be replayed after a crash.

Checkpoints can be triggered automatically, based on time (checkpoint_timeout, default 5 minutes) or WAL volume (max_wal_size, default 1 GB), or manually with the CHECKPOINT command. They work together with the background writer, which writes dirty pages gradually so the checkpoint has less work to do.


Why Both Are Needed

  • Background writer alone: spreads writes, reduces checkpoint spikes, but doesn’t give recovery a safe starting point. Without checkpoints, WAL would grow forever and recovery would require replaying the entire WAL history.

  • Checkpoint alone: would force writing all dirty pages in one burst, causing massive I/O spikes.

  • Together: background writer smooths I/O; checkpoints make the data files durable up to a known WAL position and bound recovery time.
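
A toy simulation can show the smoothing effect. All numbers and the simulate function are hypothetical, and the model ignores a great deal of real behaviour (for example, how PostgreSQL picks which buffers to write and how it paces a checkpoint). It only counts how many dirty pages are left for each checkpoint to write in one go.

```python
def simulate(ticks, dirtied_per_tick, bgwriter_pages_per_tick, checkpoint_every):
    """Return the number of pages each checkpoint had to write itself."""
    dirty = 0
    checkpoint_bursts = []
    for tick in range(1, ticks + 1):
        dirty += dirtied_per_tick
        # Background writer: write() a few pages, no fsync.
        dirty -= min(dirty, bgwriter_pages_per_tick)
        if tick % checkpoint_every == 0:
            # Checkpoint: write whatever is still dirty, then fsync.
            checkpoint_bursts.append(dirty)
            dirty = 0
    return checkpoint_bursts


without_bgwriter = simulate(20, 10, 0, 10)   # [100, 100]
with_bgwriter = simulate(20, 10, 7, 10)      # [30, 30]
```

In this model the total amount written is the same either way; the background writer just moves most of it out of the checkpoint burst.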


WAL Lifecycle After Checkpoint

  • WAL written before the last checkpoint's redo position (the point in WAL where that checkpoint started) is no longer needed for crash recovery.

  • PostgreSQL can then:

    • Recycle old WAL segments (rename and reuse them).

    • Delete them if not needed.

  • However, if archiving (archive_mode=on) or replication requires those WAL files, PostgreSQL will keep them until they are safely copied.
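
The retention rule above can be written down as a small decision function. This is a sketch under assumptions: removable_segments and its parameters are hypothetical, segments are represented as plain integers, and redo_segment means the segment that contains the position where the last checkpoint started. PostgreSQL's real logic has more inputs than this.

```python
def removable_segments(segments, redo_segment, archived=None,
                       oldest_needed_by_standby=None):
    """Return segments that could be recycled or removed after a checkpoint.

    segments: sorted list of segment numbers present in the toy pg_wal/
    redo_segment: segment holding the last checkpoint's starting position
    archived: set of segments already archived, or None if archiving is off
    oldest_needed_by_standby: oldest segment a replica still needs, or None
    """
    keep_from = redo_segment
    if oldest_needed_by_standby is not None:
        keep_from = min(keep_from, oldest_needed_by_standby)

    result = []
    for seg in segments:
        if seg >= keep_from:
            continue  # still needed for crash recovery or replication
        if archived is not None and seg not in archived:
            continue  # archiving is on and this segment is not copied yet
        result.append(seg)
    return result


present = [1, 2, 3, 4, 5]
plain = removable_segments(present, redo_segment=4)
with_archiving = removable_segments(present, redo_segment=4, archived={1, 2})
with_standby = removable_segments(present, redo_segment=4,
                                  oldest_needed_by_standby=2)
```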


Summary

  • Heap writes = random I/O → slow and unsafe at commit.

  • WAL = sequential log, fsync’d at commit → fast and durable.

  • Data files (heap/index) are written later by the background writer (smooths I/O only) and by checkpoints (write + fsync, so the data files themselves become durable).

  • Checkpoints do not make transactions durable (the WAL flush at commit does). They mark recovery-safe positions in WAL, which bounds recovery time and WAL growth.

  • Background writer smooths I/O between checkpoints.

  • WAL segments are recycled or deleted after checkpoints, unless replication or archiving needs them.

👉 Together, WAL + background writer + checkpoints make PostgreSQL both performant and crash-safe.

Backend Software Engineering with Krishna Kumar Mahto

Backend Software Engineer, focused on Java-based backend applications, PostgreSQL in databases, Kafka/ActiveMQ/RabbitMQ in messaging.