MBOX/PST archive ingest at scale: checkpoint recovery and progress determinism

March 22, 2026 · 3 min read · Dylan Dahal · Engineering

Why archive ingest matters

Email archives are how people switch clients. A Gmail export produces an MBOX file. An Outlook export produces a PST file. These are the two most common archive formats, and both are designed for batch export, not streaming ingest.

TwinMail must ingest these archives into the local vault — parsing messages, extracting metadata, building the full-text search index, and encrypting everything — without losing progress if the process is interrupted.

A 10 GB Gmail export contains roughly 500,000 messages. On a mid-range laptop, full ingest takes 15-30 minutes. Users will close their laptop, run out of battery, or simply get impatient. The ingest pipeline must handle all of these gracefully.

The checkpoint model

TwinMail's archive ingest is built around checkpoint recovery. The pipeline maintains a checkpoint file alongside the archive being ingested. The checkpoint records:

The byte offset into the archive file
The number of messages successfully ingested
The current state of the FTS (full-text search) index
A hash of the last successfully processed message

If ingest is interrupted, resuming from the checkpoint skips all messages processed before the last checkpoint write and continues from the byte offset recorded there.

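use std::path::PathBuf;

type Blake3Hash = [u8; 32]; // assumed representation: 32-byte BLAKE3 digest
type ISO8601 = String;      // assumed representation: ISO 8601 timestamp string

struct Checkpoint {
    archive_path: PathBuf,         // archive being ingested
    archive_hash: Blake3Hash,      // checked on resume to confirm the same archive
    byte_offset: u64,              // position in the archive to resume from
    messages_ingested: u64,        // messages successfully ingested so far
    fts_watermark: u64,            // highest message ID already in the FTS index
    last_message_hash: Blake3Hash, // hash of the last successfully processed message
    created_at: ISO8601,
    updated_at: ISO8601,
}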

The checkpoint is written to disk every 100 messages or every 10 seconds, whichever comes first. This bounds the reprocessing work on resume to at most about 100 messages.
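The write policy can be sketched as a small counter plus an elapsed-time check. This is a minimal illustration rather than TwinMail's actual code: `CheckpointPolicy` and `on_message` are hypothetical names, and the caller is assumed to supply the time elapsed since the previous checkpoint write.

```rust
use std::time::Duration;

/// Thresholds taken from the text: 100 messages or 10 seconds.
const MAX_MESSAGES: u64 = 100;
const MAX_INTERVAL: Duration = Duration::from_secs(10);

/// Tracks work done since the last checkpoint write (hypothetical helper).
#[derive(Debug, Default)]
struct CheckpointPolicy {
    messages_since_write: u64,
}

impl CheckpointPolicy {
    /// Record one ingested message and report whether a checkpoint is due.
    /// `since_last_write` is the time elapsed since the previous write.
    fn on_message(&mut self, since_last_write: Duration) -> bool {
        self.messages_since_write += 1;
        let due = self.messages_since_write >= MAX_MESSAGES
            || since_last_write >= MAX_INTERVAL;
        if due {
            // The caller is expected to write the checkpoint now.
            self.messages_since_write = 0;
        }
        due
    }
}
```

Passing the elapsed time in, instead of reading a clock inside the helper, keeps the policy easy to test without sleeping.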

Deterministic progress tracking

Users need to know how long an import will take. This is harder than it sounds.

MBOX files are not indexed — you cannot determine the total message count without scanning the entire file. PST files have an internal B-tree structure that provides a message count, but it is occasionally inaccurate for damaged archives.

TwinMail solves this with a two-pass approach:

Pass 1 (fast scan): Read the archive sequentially, counting message boundaries without parsing message bodies. For MBOX, this means scanning for the lines beginning with "From " that separate messages. For PST, this means walking the B-tree node headers. This pass runs at near disk-read speed (roughly 2 GB/s on SSD) and produces a total message count.

Pass 2 (ingest): Parse each message fully, extract metadata, build FTS entries, and encrypt into the vault. Progress is reported as messages_ingested / total_from_pass_1.

The fast scan adds 3-5 seconds for a 10 GB archive. The deterministic progress bar it enables is worth the cost.
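For the MBOX case, the two passes can be sketched roughly as follows. This assumes the archive escapes body lines that start with "From " (commonly as ">From "); an archive that does not would be overcounted. The function names are illustrative, not TwinMail's API, and PST scanning is not shown.

```rust
use std::io::{self, BufRead};

/// Pass 1: count lines that begin with "From " (the mbox message separator).
/// Works on raw bytes so non-UTF-8 message bodies do not cause errors.
fn count_mbox_messages<R: BufRead>(mut reader: R) -> io::Result<u64> {
    let mut count = 0u64;
    let mut line = Vec::new();
    loop {
        line.clear();
        // read_until returns 0 once the input is exhausted.
        if reader.read_until(b'\n', &mut line)? == 0 {
            break;
        }
        if line.starts_with(b"From ") {
            count += 1;
        }
    }
    Ok(count)
}

/// Pass 2 progress as a fraction in [0, 1]; a zero total reports 0.0.
fn progress(messages_ingested: u64, total_from_pass_1: u64) -> f64 {
    if total_from_pass_1 == 0 {
        return 0.0;
    }
    (messages_ingested as f64 / total_from_pass_1 as f64).min(1.0)
}
```

Clamping the fraction to 1.0 keeps the progress bar sane if pass 1 undercounts, for example on a damaged archive.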

Corruption tolerance

Real-world archives are frequently corrupt. Partial exports, encoding errors, truncated messages, and invalid MIME structures are common. TwinMail's parser is designed to recover from corruption rather than abort.

The corruption handling strategy depends on the failure type:

Failure	Strategy
Invalid MIME encoding	Ingest as raw text, flag for user review
Truncated message	Skip, log warning, continue from next message boundary
Invalid date header	Use file modification time as fallback
Duplicate Message-ID	Deduplicate, keep the first occurrence
Unrecognized attachment encoding	Store raw bytes, skip decoding

Every corruption event is logged to an ingest report that the user can review after import completes. The report shows exactly which messages were skipped or modified, with enough context to locate the original in the archive file.
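One way to picture the table and the report together is a mapping from failure type to recovery action, with every event appended to a report. The type and field names below are hypothetical; only the failure-to-strategy pairs come from the table.

```rust
/// Failure types from the table above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Failure {
    InvalidMimeEncoding,
    TruncatedMessage,
    InvalidDateHeader,
    DuplicateMessageId,
    UnrecognizedAttachmentEncoding,
}

/// Recovery actions from the table above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    IngestAsRawText,     // and flag for user review
    SkipMessage,         // log a warning, continue at the next boundary
    UseFileMtime,        // fall back to the file modification time
    KeepFirstOccurrence, // deduplicate
    StoreRawBytes,       // skip decoding
}

/// Map each failure type to its recovery action.
fn strategy(failure: Failure) -> Action {
    match failure {
        Failure::InvalidMimeEncoding => Action::IngestAsRawText,
        Failure::TruncatedMessage => Action::SkipMessage,
        Failure::InvalidDateHeader => Action::UseFileMtime,
        Failure::DuplicateMessageId => Action::KeepFirstOccurrence,
        Failure::UnrecognizedAttachmentEncoding => Action::StoreRawBytes,
    }
}

/// One entry of the ingest report (field names are assumptions).
#[derive(Debug)]
struct ReportEntry {
    byte_offset: u64, // where the affected message starts in the archive
    failure: Failure,
    action: Action,
}

#[derive(Debug, Default)]
struct IngestReport {
    entries: Vec<ReportEntry>,
}

impl IngestReport {
    /// Log a corruption event and return the action the parser should take.
    fn record(&mut self, byte_offset: u64, failure: Failure) -> Action {
        let action = strategy(failure);
        self.entries.push(ReportEntry { byte_offset, failure, action });
        action
    }

    /// Number of messages that were skipped outright.
    fn skipped(&self) -> usize {
        self.entries
            .iter()
            .filter(|e| e.action == Action::SkipMessage)
            .count()
    }
}
```

Because `strategy` is an exhaustive `match`, adding a new failure type will not compile until a recovery action is chosen for it.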

FTS integration

Every ingested message is tokenized and added to the full-text search index as part of the ingest pipeline. The FTS index uses SQLite FTS5 with a custom tokenizer that handles email-specific patterns:

Email addresses are indexed as both full addresses and local parts
Subject line prefixes (Re:, Fwd:, etc.) are stripped before indexing
Quoted reply content is indexed but with reduced weight
Attachment filenames are indexed

The FTS index is built incrementally during ingest. The checkpoint records the FTS watermark (the highest indexed message ID) so that resume does not require reindexing previously processed messages.
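Two of the email-specific rules can be illustrated as plain string preprocessing. This sketch does not use the SQLite FTS5 tokenizer API and is not TwinMail's tokenizer. The prefix list ("re:", "fwd:", "fw:") and the lowercasing of addresses are assumptions made for the example.

```rust
/// Strip leading reply/forward prefixes such as "Re:" and "Fwd:".
/// The exact prefix list is an assumption for this sketch.
fn strip_subject_prefixes(subject: &str) -> &str {
    const PREFIXES: &[&str] = &["re:", "fwd:", "fw:"];
    let mut rest = subject.trim_start();
    loop {
        // ASCII lowercasing preserves byte lengths, so slicing `rest`
        // by the prefix length stays on a character boundary.
        let lower = rest.to_ascii_lowercase();
        match PREFIXES.iter().find(|p| lower.starts_with(**p)) {
            Some(p) => rest = rest[p.len()..].trim_start(),
            None => return rest,
        }
    }
}

/// Produce index tokens for an address: the full address and its local part.
fn address_tokens(address: &str) -> Vec<String> {
    let full = address.trim().to_ascii_lowercase();
    let mut tokens = vec![full.clone()];
    if let Some((local, _domain)) = full.split_once('@') {
        if !local.is_empty() {
            tokens.push(local.to_string());
        }
    }
    tokens
}
```

Stripping prefixes in a loop handles stacked prefixes such as "Re: Fwd: Re:", which are common in long threads.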

Performance characteristics

Benchmarks on a 2024 M-series MacBook with SSD:

Archive	Size	Messages	Fast scan	Full ingest	Resume overhead
Gmail MBOX	5 GB	250K	2.1s	8 min	< 3s
Gmail MBOX	15 GB	750K	6.3s	24 min	< 3s
Outlook PST	3 GB	180K	1.4s	6 min	< 3s
Outlook PST	10 GB	500K	4.8s	18 min	< 3s

Resume overhead is the time to validate the checkpoint, verify the archive hash prefix, seek to the recorded byte offset, and begin processing. It is dominated by the FTS watermark verification.
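A simplified view of the resume steps, using only the standard library, might look like the following. The archive hash check is omitted because it needs a BLAKE3 implementation. The watermark rule shown (reject a checkpoint that claims more indexed messages than the index actually holds) is an assumption about what verification involves, and `resume` and `fts_high_water` are hypothetical names.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Subset of the checkpoint fields needed to resume (names from the post).
struct Checkpoint {
    byte_offset: u64,
    messages_ingested: u64,
    fts_watermark: u64,
}

/// Validate a checkpoint and position the reader at the recorded offset.
/// `fts_high_water` is the highest message ID actually present in the index.
/// Returns the message count to continue from.
fn resume<R: Read + Seek>(
    archive: &mut R,
    cp: &Checkpoint,
    fts_high_water: u64,
) -> io::Result<u64> {
    // The recorded offset must lie inside the archive.
    let len = archive.seek(SeekFrom::End(0))?;
    if cp.byte_offset > len {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "checkpoint offset is beyond the end of the archive",
        ));
    }
    // Assumed rule: the index must contain everything the checkpoint claims.
    if cp.fts_watermark > fts_high_water {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "FTS index is behind the checkpoint watermark",
        ));
    }
    archive.seek(SeekFrom::Start(cp.byte_offset))?;
    Ok(cp.messages_ingested)
}
```

On error, a caller would presumably fall back to a fresh ingest instead of trusting a checkpoint that does not match the archive or the index.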

Design principles

The archive ingest pipeline follows three principles:

Never lose progress. Any interruption should be recoverable with minimal reprocessing.
Never block on corruption. A corrupt message should not prevent the remaining 499,999 messages from being ingested.
Always show accurate progress. The user should know how far along the import is and how long it will take, within reasonable bounds.

These principles optimize for trust. Users are entrusting TwinMail with their entire email history. The ingest pipeline is the first interaction they have with that trust relationship.
