PostgreSQL WAL: Write-Ahead Logging, Checkpoints, Background Writer

PostgreSQL guarantees durability using Write-Ahead Logging (WAL). WAL ensures that once a transaction is committed, its changes survive crashes, even if the actual table or index pages were never written to disk at the time of failure.

This article walks step by step through how WAL works, why heap writes are random and inefficient, what checkpoints are, what the background writer does, and why PostgreSQL needs both checkpoints and the background writer.

Why WAL is Needed

Naïve design: write table and index changes directly to disk at commit.
Problems with this:

Random I/O: Table (heap) updates are scattered across pages. Writing them all at commit time means slow, random disk writes. HDDs suffer worst (seek latency), but even SSDs pay a cost due to write amplification (small random writes force entire erase-block rewrites).
Inconsistency risk: If the database crashes midway, table pages may be left partially written, corrupting the table.

WAL solves this by logging compact, sequential records of changes instead of writing full data pages at commit time.

WAL records are appended to a sequential log.
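Sequential writes are much faster than scattered random writes, and each WAL record carries a checksum, so an incomplete write at the tail of the log is detected during recovery.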
Actual data pages can be written lazily in the background.

How WAL Works (Memory and Disk)

Shared Buffers (RAM):
Table and index pages live in shared memory. When you INSERT/UPDATE/DELETE, only shared buffers are updated.
WAL Buffers (RAM):
A WAL record describing the change ("insert tuple at page X, offset Y") is generated in memory.
WAL Segment Files (disk):
On commit, PostgreSQL flushes WAL buffers to disk (in pg_wal/). With the default synchronous_commit setting, the transaction is reported as committed only after its WAL is safely persisted to disk.
The actual heap and index files may still have dirty pages in RAM.
Data Files (heap/index, disk):
Dirty data pages are written later, in bulk, by the background writer and at checkpoints.
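The flow above can be sketched as a toy model. This is not PostgreSQL's implementation: the class ToyDatabase, its text record format, and the file name are all hypothetical, chosen only to show the order of events (change the page in RAM, buffer a WAL record, fsync the log at commit, leave the data pages dirty).

```python
import os
import tempfile


class ToyDatabase:
    """Toy model: pages live in RAM, WAL is buffered, then fsync'd at commit."""

    def __init__(self, wal_path):
        self.shared_buffers = {}  # page number -> rows (RAM only)
        self.dirty_pages = set()  # changed in RAM, not yet in the data files
        self.wal_buffer = []      # WAL records that are not yet on disk
        self.wal_file = open(wal_path, "ab")

    def insert(self, page_no, row):
        # 1. Change the page in shared buffers only.
        self.shared_buffers.setdefault(page_no, []).append(row)
        self.dirty_pages.add(page_no)
        # 2. Describe the change in a WAL record, still in memory.
        self.wal_buffer.append(f"INSERT page={page_no} row={row}\n".encode())

    def commit(self):
        # 3. Append the buffered records plus a commit record to the log,
        #    then block until the operating system reports them persisted.
        self.wal_buffer.append(b"COMMIT\n")
        self.wal_file.write(b"".join(self.wal_buffer))
        self.wal_file.flush()
        os.fsync(self.wal_file.fileno())
        self.wal_buffer.clear()
        # The data pages stay dirty in RAM; they are written later.

    def close(self):
        self.wal_file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        db = ToyDatabase(os.path.join(directory, "toy_wal.log"))
        db.insert(7, "alice")
        db.commit()
        print("dirty pages still in RAM:", sorted(db.dirty_pages))
        db.close()
```

Note that commit() touches only the log file. The heap pages in shared_buffers are never written here, which is the point of the design.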

WAL Structure

WAL is stored in pg_wal/ as segment files (default 16 MB each).
Each segment is divided into WAL pages (8 KB by default, the same as the default heap page size).
WAL is append-only: written sequentially until a segment is full, then the next is created (or recycled).
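To make the layout concrete, here is a small sketch that maps a WAL position to a segment file name and a page within that segment. It assumes the default 16 MB segments and 8 KB WAL pages, and it uses the LSN (log sequence number, a byte position in the WAL, printed as two hex numbers such as 0/16B3D80), which this article has not introduced. The helper names are made up for illustration; on a live server the built-in function pg_walfile_name() does the file name lookup.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default segment size (16 MB)
WAL_PAGE_SIZE = 8 * 1024             # default WAL page size (8 KB)


def parse_lsn(text):
    """Convert an LSN such as '0/16B3D80' into a byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(lsn, timeline=1, segment_size=WAL_SEGMENT_SIZE):
    """Name of the segment file in pg_wal/ that contains this position."""
    segments_per_log_id = 0x100000000 // segment_size
    segment_no = lsn // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segment_no // segments_per_log_id,
        segment_no % segments_per_log_id,
    )


def page_in_segment(lsn, segment_size=WAL_SEGMENT_SIZE, page_size=WAL_PAGE_SIZE):
    """Index of the WAL page inside its segment that contains this position."""
    return (lsn % segment_size) // page_size


if __name__ == "__main__":
    position = parse_lsn("0/16B3D80")
    print(wal_file_name(position), page_in_segment(position))
    print("pages per segment:", WAL_SEGMENT_SIZE // WAL_PAGE_SIZE)
```

The timeline argument defaults to 1 here, which is an assumption; it changes after events such as promoting a standby.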

Why Heap Writes Are Random

Heap (table) writes cannot be sequential:

Inserts reuse free space from any page (tracked by the Free Space Map).
Updates insert a new row version, often into a different page.
Deletes only mark tuples dead wherever they exist.

This means dirty pages are scattered across the table file, so writing them to disk produces random writes, which are inherently slower than sequential ones.

On HDDs → each random write triggers a disk seek (slow).
On SSDs → random writes cause write amplification (flash blocks must be erased and rewritten even for small updates).

👉 WAL avoids this problem because WAL writes are always sequential appends.
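The contrast can be illustrated with a few lines of arithmetic. The sketch below counts how many separate contiguous regions of a file a batch of writes touches. The page numbers and the fixed 64-byte record size are invented for the example (real WAL records vary in size), and the function names are hypothetical.

```python
PAGE_SIZE = 8 * 1024  # default PostgreSQL page size


def contiguous_runs(offsets, size):
    """Count the separate contiguous file regions touched by equal-sized writes."""
    runs = 0
    previous_end = None
    for start in sorted(set(offsets)):
        if start != previous_end:
            runs += 1  # a gap: the device has to jump to a new location
        previous_end = start + size
    return runs


def heap_write_offsets(dirty_page_numbers):
    """File offsets of dirty heap pages, wherever they happen to be."""
    return [page_no * PAGE_SIZE for page_no in dirty_page_numbers]


def wal_append_offsets(start, record_count, record_size):
    """File offsets of WAL records appended one after another."""
    return [start + i * record_size for i in range(record_count)]


if __name__ == "__main__":
    heap = heap_write_offsets([907, 12, 4411, 13])
    wal = wal_append_offsets(0, 4, 64)
    print("heap regions:", contiguous_runs(heap, PAGE_SIZE))
    print("WAL regions:", contiguous_runs(wal, 64))
```

Four dirty heap pages land in three separate regions of the table file, while four WAL records form a single contiguous append.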

Recovery With WAL

If PostgreSQL crashes:

On restart, it finds the last checkpoint (more below).
Replays WAL records written after that checkpoint.
Applies missing changes to heap/index pages on disk.
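Leaves uncommitted transactions without effect: their WAL records may be replayed, but with no commit record in WAL they are treated as aborted and their rows stay invisible.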

Because WAL is always flushed before commit returns, recovery can always replay committed changes safely.
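The replay loop can be sketched as follows. This is a simplified model, not PostgreSQL's recovery code: the Page class, the tuple layout of the WAL records, and the replay function are hypothetical. The one detail borrowed from PostgreSQL is that every page remembers the position (LSN) of the last WAL record applied to it, so a record is skipped when the page on disk already contains that change.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    lsn: int = 0                              # position of the last change applied
    rows: list = field(default_factory=list)


def replay(wal, pages_on_disk, redo_lsn):
    """Re-apply WAL records from the checkpoint's starting position onward.

    wal is a list of (lsn, page_no, row) tuples in log order.
    Returns the number of records that actually changed a page.
    """
    applied = 0
    for lsn, page_no, row in wal:
        if lsn < redo_lsn:
            continue  # already covered by the last checkpoint
        page = pages_on_disk.setdefault(page_no, Page())
        if page.lsn >= lsn:
            continue  # the page reached disk with this change included
        page.rows.append(row)
        page.lsn = lsn
        applied += 1
    return applied


if __name__ == "__main__":
    wal = [(100, 1, "a"), (200, 2, "b"), (300, 1, "c")]
    # Page 2 was written out before the crash; page 1 missed its last change.
    disk = {1: Page(lsn=100, rows=["a"]), 2: Page(lsn=200, rows=["b"])}
    print(replay(wal, disk, redo_lsn=150), disk[1].rows)
```

Because of the page LSN check, running replay a second time over the same log changes nothing, which is what makes it safe to crash and restart in the middle of recovery.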

What fsync Means

WAL durability relies on fsync().

fsync() is a system call that blocks until all buffered writes for a given file are physically stored on stable media.
It does not return as soon as the writes are queued. It returns only once the OS and disk confirm persistence.
PostgreSQL uses this to guarantee that committed transactions survive crashes.
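The difference between write() and fsync() is easy to see from any language that exposes the system calls. The sketch below uses Python's os module; the function name durable_append and the file name are made up for the example, and the code is not taken from PostgreSQL.

```python
import os
import tempfile


def durable_append(path, data):
    """Append bytes to a file and return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        remaining = memoryview(data)
        while len(remaining) > 0:
            # write() only copies the data into the OS page cache.
            written = os.write(fd, remaining)
            remaining = remaining[written:]
        # fsync() blocks until the OS reports the file contents are on storage.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        log_path = os.path.join(directory, "commit.log")
        durable_append(log_path, b"COMMIT 42\n")
        print(os.path.getsize(log_path), "bytes written and fsync'd")
```

If the process crashed between os.write() and os.fsync(), the data could still be lost in a power failure. A newly created file may also need its parent directory fsync'd for the file itself to survive; the sketch leaves that out.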

Checkpoints and the Background Writer

PostgreSQL has two processes involved in writing dirty buffers to disk: the background writer and the checkpointer, which performs checkpoints.

Background Writer

Runs every bgwriter_delay (default: 200 ms).
Periodically issues write() system calls to copy dirty pages from PostgreSQL’s shared buffers into the OS page cache.
Does not call fsync(). This means the data may still sit in the OS cache and not reach durable storage until the OS decides to flush it.
Why no fsync? Because calling fsync on every small batch would defeat the purpose — it would force frequent, expensive durability guarantees. The background writer’s job is purely to smooth I/O and reduce checkpoint spikes, not to provide durability guarantees.
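The data files are forced to stable storage at checkpoints, which do call fsync(). Durability of committed transactions still comes from the WAL flush at commit.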

Checkpoints

A checkpoint is when PostgreSQL writes all dirty pages in shared buffers to the data files and fsyncs them.
It then records a checkpoint record in WAL.
- This does not make transactions durable — durability comes from WAL flush at commit.
- Instead, checkpoints guarantee that WAL replay during crash recovery can start from the last checkpoint, not from the beginning of the WAL.
- This bounds crash recovery time: the more frequent the checkpoints, the less WAL needs to be replayed after a crash.

Checkpoints can be triggered automatically (based on time or WAL volume) or manually. They work together with the background writer, which writes dirty pages gradually so the checkpoint has less work to do.
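The division of labor between the two can be modeled in a few lines. This is a toy state machine, not PostgreSQL's buffer manager: the class ToyBufferManager, its method names, and the integer positions are all hypothetical. It tracks only where each page currently is (dirty in shared buffers, written to the OS cache, or fsync'd).

```python
class ToyBufferManager:
    """Toy model of dirty pages moving from shared buffers to durable storage."""

    def __init__(self):
        self.dirty = set()     # modified in shared buffers, not yet written
        self.os_cache = set()  # passed to write(), not yet fsync'd
        self.durable = set()   # written and fsync'd
        self.current_lsn = 0   # position of the latest WAL record
        self.redo_lsn = 0      # where crash recovery would start replaying

    def modify(self, page_no):
        self.current_lsn += 1
        self.durable.discard(page_no)
        self.os_cache.discard(page_no)
        self.dirty.add(page_no)

    def background_writer_round(self, max_pages):
        """write() a few dirty pages. No fsync, so nothing becomes durable."""
        batch = sorted(self.dirty)[:max_pages]
        for page_no in batch:
            self.dirty.remove(page_no)
            self.os_cache.add(page_no)
        return len(batch)

    def checkpoint(self):
        """Write the remaining dirty pages, fsync everything, move the redo point."""
        start_lsn = self.current_lsn
        written_by_checkpoint = len(self.dirty)
        self.os_cache |= self.dirty
        self.dirty.clear()
        self.durable |= self.os_cache  # the fsync step
        self.os_cache.clear()
        self.redo_lsn = start_lsn      # older WAL is no longer needed for replay
        return written_by_checkpoint


if __name__ == "__main__":
    manager = ToyBufferManager()
    for page_no in range(1, 6):
        manager.modify(page_no)
    manager.background_writer_round(max_pages=3)
    print("pages left for the checkpoint:", manager.checkpoint())
```

With five dirty pages and one background writer round of three, the checkpoint has only two pages left to write itself, which is the smoothing effect described above. The model ignores pages modified while a checkpoint is running.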

Why Both Are Needed

Background writer alone: spreads writes, reduces checkpoint spikes, but doesn’t give recovery a safe starting point. Without checkpoints, WAL would grow forever and recovery would require replaying the entire WAL history.
Checkpoint alone: would force writing all dirty pages in one burst, causing massive I/O spikes.
Together: background writer smooths I/O; checkpoints force data files to disk, bound recovery time, and let old WAL be recycled.

WAL Lifecycle After Checkpoint

WAL written before the start of the last completed checkpoint is no longer needed for crash recovery.
PostgreSQL can then:
- Recycle old WAL segments (rename and reuse them).
- Delete them if not needed.
However, if archiving (archive_mode=on) or replication requires those WAL files, PostgreSQL will keep them until they are safely copied.
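The retention rule amounts to taking the oldest segment that anyone still needs and keeping everything from there on. A minimal sketch, with segments represented as plain integers and hypothetical parameter names (PostgreSQL's actual bookkeeping is more involved):

```python
def removable_segments(segments, redo_segment,
                       oldest_unarchived=None, oldest_needed_by_replica=None):
    """Return the segments that can be recycled or deleted after a checkpoint.

    redo_segment: segment where replay would start after a crash.
    oldest_unarchived: oldest segment not yet copied by archiving, if any.
    oldest_needed_by_replica: oldest segment a replica still has to read, if any.
    """
    keep_from = redo_segment
    for limit in (oldest_unarchived, oldest_needed_by_replica):
        if limit is not None:
            keep_from = min(keep_from, limit)
    return [segment for segment in sorted(segments) if segment < keep_from]


if __name__ == "__main__":
    existing = range(1, 9)  # segments 1..8 are present in pg_wal/
    print(removable_segments(existing, redo_segment=6))
    print(removable_segments(existing, redo_segment=6, oldest_unarchived=4))
    print(removable_segments(existing, redo_segment=6, oldest_needed_by_replica=2))
```

A lagging replica or a stalled archiver lowers keep_from, so fewer segments can be removed and pg_wal/ grows until the consumer catches up.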

Summary

Heap writes = random I/O → slow and unsafe at commit.
WAL = sequential log, fsync’d at commit → fast and durable.
Data files (heap/index) are written later by background writer (smooths I/O only) and checkpoints (write + fsync of the data files).
Checkpoints mark recovery-safe positions in WAL, bounding recovery time and WAL growth. The durability guarantee itself comes from the WAL flush at commit.
Background writer smooths I/O between checkpoints.
WAL segments are recycled or deleted after checkpoints, unless replication or archiving needs them.

👉 Together, WAL + background writer + checkpoints make PostgreSQL both performant and crash-safe.
