MBOX/PST archive ingest at scale: checkpoint recovery and progress determinism
Why archive ingest matters
Email archives are how people switch clients. A Gmail export produces an MBOX file. An Outlook export produces a PST file. These are the two most common archive formats, and both are designed for batch export, not streaming ingest.
TwinMail must ingest these archives into the local vault — parsing messages, extracting metadata, building the full-text search index, and encrypting everything — without losing progress if the process is interrupted.
A 10 GB Gmail export contains roughly 500,000 messages. On a mid-range laptop, full ingest takes 15-30 minutes. Users will close their laptop, run out of battery, or simply get impatient. The ingest pipeline must handle all of these gracefully.
The checkpoint model
TwinMail's archive ingest is built around checkpoint recovery. The pipeline maintains a checkpoint file alongside the archive being ingested. The checkpoint records:
- A hash of the archive, checked on resume
- The byte offset into the archive file
- The number of messages successfully ingested
- The FTS (full-text search) watermark: the highest message ID already in the index
- A hash of the last successfully processed message
If ingest is interrupted, resume seeks to the byte offset recorded in the most recent checkpoint and continues from there. Messages ingested before that checkpoint are skipped entirely; only the few processed after it are redone.
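struct Checkpoint {
    archive_path: PathBuf,         // archive being ingested
    archive_hash: Blake3Hash,      // verified on resume
    byte_offset: u64,              // where to seek on resume
    messages_ingested: u64,        // messages successfully ingested so far
    fts_watermark: u64,            // highest message ID already in the FTS index
    last_message_hash: Blake3Hash, // last successfully processed message
    created_at: ISO8601,
    updated_at: ISO8601,
}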
The checkpoint is written to disk every 100 messages or every 10 seconds, whichever comes first. This bounds the reprocessing work on resume to at most 100 messages.
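The flush rule can be sketched as a small counter. This is an illustration only: `FlushPolicy` and `record` are hypothetical names, and elapsed time is passed in by the caller (rather than read from a clock) so the rule is easy to test.

```rust
use std::time::Duration;

// Thresholds taken from the text: 100 messages or 10 seconds.
const MAX_MESSAGES_BETWEEN_FLUSHES: u64 = 100;
const MAX_TIME_BETWEEN_FLUSHES: Duration = Duration::from_secs(10);

/// Hypothetical helper tracking work done since the last checkpoint write.
#[derive(Debug, Default)]
struct FlushPolicy {
    messages_since_flush: u64,
    elapsed_since_flush: Duration,
}

impl FlushPolicy {
    /// Record one ingested message and the time spent on it.
    /// Returns true when the caller should write the checkpoint now.
    fn record(&mut self, time_spent: Duration) -> bool {
        self.messages_since_flush += 1;
        self.elapsed_since_flush += time_spent;
        let due = self.messages_since_flush >= MAX_MESSAGES_BETWEEN_FLUSHES
            || self.elapsed_since_flush >= MAX_TIME_BETWEEN_FLUSHES;
        if due {
            // Reset both counters; persisting the checkpoint is the caller's job.
            *self = FlushPolicy::default();
        }
        due
    }
}
```

With fast messages the message count triggers the flush; with slow ones (large attachments, say) the time limit does, so a crash never costs more than 100 messages of rework.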
Deterministic progress tracking
Users need to know how long an import will take. This is harder than it sounds.
MBOX files are not indexed — you cannot determine the total message count without scanning the entire file. PST files have an internal B-tree structure that provides a message count, but it is occasionally inaccurate for damaged archives.
TwinMail solves this with a two-pass approach:
Pass 1 (fast scan): Read the archive sequentially, counting message boundaries without parsing message bodies. For MBOX, this means scanning for separator lines, which begin with "From " (the word From followed by a space). For PST, this means walking the B-tree node headers. This pass runs at near disk-read speed (roughly 2 GB/s on SSD) and produces a total message count.
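Pass 2 (ingest): Parse each message fully, extract metadata, build FTS entries, and encrypt into the vault. Progress is reported as messages_ingested / total_from_pass_1. Messages that are skipped or deduplicated still have to count toward the numerator; otherwise an archive with corrupt messages would never reach 100%.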
The fast scan adds 3-5 seconds for a 10 GB archive. The deterministic progress bar it enables is worth the cost.
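A minimal sketch of the MBOX side of the two passes, using only the standard library. It assumes the archive escapes body lines that start with "From " (as the mboxo and mboxrd variants do, by prefixing ">"); an unescaped archive would be overcounted. The function names are hypothetical.

```rust
use std::io::{self, BufRead};

/// Pass 1: count mbox message boundaries, i.e. lines that begin with "From ".
/// Bodies are never parsed; only the start of each line is inspected.
fn count_mbox_messages<R: BufRead>(mut reader: R) -> io::Result<u64> {
    let mut count = 0u64;
    let mut line = Vec::new();
    loop {
        line.clear();
        // read_until works on raw bytes, so non-UTF-8 mail does not fail.
        if reader.read_until(b'\n', &mut line)? == 0 {
            return Ok(count); // end of file
        }
        if line.starts_with(b"From ") {
            count += 1;
        }
    }
}

/// Pass 2 progress as a fraction in [0, 1], given the pass-1 total.
fn progress(messages_ingested: u64, total_from_pass_1: u64) -> f64 {
    if total_from_pass_1 == 0 {
        return 1.0; // empty archive: nothing left to do
    }
    (messages_ingested as f64 / total_from_pass_1 as f64).min(1.0)
}
```

Reading line by line keeps memory use constant regardless of archive size. A production scanner would likely read in larger blocks to approach raw disk speed.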
Corruption tolerance
Real-world archives are frequently corrupt. Partial exports, encoding errors, truncated messages, and invalid MIME structures are common. TwinMail's parser is designed to recover from corruption rather than abort.
The corruption handling strategy depends on the failure type:
| Failure | Strategy |
|---|---|
| Invalid MIME encoding | Ingest as raw text, flag for user review |
| Truncated message | Skip, log warning, continue from next message boundary |
| Invalid date header | Use file modification time as fallback |
| Duplicate Message-ID | Deduplicate, keep the first occurrence |
| Unrecognized attachment encoding | Store raw bytes, skip decoding |
Every corruption event is logged to an ingest report that the user can review after import completes. The report shows exactly which messages were skipped or modified, with enough context to locate the original in the archive file.
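The table maps naturally onto a pair of enums and a match. The sketch below is one way to express it; the type and variant names are illustrative, and the report entry keeps only a byte offset as the "context to locate the original".

```rust
/// Failure types from the table (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Failure {
    InvalidMimeEncoding,
    TruncatedMessage,
    InvalidDateHeader,
    DuplicateMessageId,
    UnrecognizedAttachmentEncoding,
}

/// What the pipeline does with the affected message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    IngestAsRawText,
    Skip,
    UseFileMtimeAsDate,
    KeepFirstOccurrence,
    StoreRawAttachmentBytes,
}

/// One line of the ingest report.
#[derive(Debug)]
struct ReportEntry {
    byte_offset: u64, // where the message starts in the archive
    failure: Failure,
    action: Action,
}

/// The table, row by row.
fn strategy(failure: Failure) -> Action {
    match failure {
        Failure::InvalidMimeEncoding => Action::IngestAsRawText,
        Failure::TruncatedMessage => Action::Skip,
        Failure::InvalidDateHeader => Action::UseFileMtimeAsDate,
        Failure::DuplicateMessageId => Action::KeepFirstOccurrence,
        Failure::UnrecognizedAttachmentEncoding => Action::StoreRawAttachmentBytes,
    }
}

/// Log the event and return the action. Nothing here aborts the ingest.
fn handle(report: &mut Vec<ReportEntry>, byte_offset: u64, failure: Failure) -> Action {
    let action = strategy(failure);
    report.push(ReportEntry { byte_offset, failure, action });
    action
}
```

Because the match is exhaustive, adding a new failure type fails to compile until a strategy is chosen for it.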
FTS integration
Every ingested message is tokenized and added to the full-text search index as part of the ingest pipeline. The FTS index uses SQLite FTS5 with a custom tokenizer that handles email-specific patterns:
- Email addresses are indexed as both full addresses and local parts
- Subject line prefixes (Re:, Fwd:, etc.) are stripped before indexing
- Quoted reply content is indexed but with reduced weight
- Attachment filenames are indexed
The FTS index is built incrementally during ingest. The checkpoint records the FTS watermark (the highest indexed message ID) so that resume does not require reindexing previously processed messages.
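Two of the email-specific rules are simple enough to sketch as plain string helpers. These are not the SQLite FTS5 tokenizer interface itself, just the kind of preprocessing such a tokenizer might apply. The prefix list and the lowercasing are assumptions; the text only names "Re:, Fwd:, etc.".

```rust
/// Strip leading reply/forward prefixes such as "Re:" and "Fwd:".
/// The prefix list is an assumption, not TwinMail's actual list.
fn strip_subject_prefixes(subject: &str) -> &str {
    const PREFIXES: &[&str] = &["re:", "fwd:", "fw:"];
    let mut rest = subject.trim_start();
    loop {
        let lower = rest.to_ascii_lowercase();
        match PREFIXES.iter().find(|p| lower.starts_with(**p)) {
            // ASCII lowercasing preserves byte lengths, so slicing is safe.
            Some(p) => rest = rest[p.len()..].trim_start(),
            None => return rest,
        }
    }
}

/// Tokens for an address: the full address plus its local part.
/// Lowercasing is an assumption made for case-insensitive search.
fn address_tokens(address: &str) -> Vec<String> {
    let full = address.trim().to_ascii_lowercase();
    let mut tokens = vec![full.clone()];
    if let Some((local, _domain)) = full.split_once('@') {
        if !local.is_empty() {
            tokens.push(local.to_string());
        }
    }
    tokens
}
```

Stripping prefixes in a loop handles stacked chains such as "Re: Fwd: Re:", which are common in long threads.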
Performance characteristics
Benchmarks on a 2024 M-series MacBook with SSD:
| Archive | Size | Messages | Fast scan | Full ingest | Resume overhead |
|---|---|---|---|---|---|
| Gmail MBOX | 5 GB | 250K | 2.1s | 8 min | < 3s |
| Gmail MBOX | 15 GB | 750K | 6.3s | 24 min | < 3s |
| Outlook PST | 3 GB | 180K | 1.4s | 6 min | < 3s |
| Outlook PST | 10 GB | 500K | 4.8s | 18 min | < 3s |
Resume overhead is the time to validate the checkpoint, verify the archive hash prefix, seek to the recorded byte offset, and begin processing. It is dominated by the FTS watermark verification.
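The resume steps can be sketched as follows, under loud assumptions: "archive hash prefix" is read here as a hash over the first bytes of the archive, `PREFIX_LEN` is a made-up value, and the standard library's `DefaultHasher` stands in for BLAKE3. The FTS watermark verification is omitted, and the `Checkpoint` here is a reduced, hypothetical version of the real one.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Read, Seek, SeekFrom};

const PREFIX_LEN: u64 = 64; // hypothetical; the real prefix length is not stated

/// Reduced checkpoint with only the fields this sketch needs.
struct Checkpoint {
    archive_prefix_hash: u64, // stand-in for the BLAKE3 archive hash
    byte_offset: u64,
    messages_ingested: u64,
}

#[derive(Debug)]
enum ResumeError {
    Io(io::Error),
    ArchiveChanged,
    OffsetPastEnd,
}

impl From<io::Error> for ResumeError {
    fn from(e: io::Error) -> Self {
        ResumeError::Io(e)
    }
}

/// Hash the first PREFIX_LEN bytes of the archive.
fn prefix_hash<R: Read + Seek>(archive: &mut R) -> io::Result<u64> {
    archive.seek(SeekFrom::Start(0))?;
    let mut prefix = Vec::new();
    archive.by_ref().take(PREFIX_LEN).read_to_end(&mut prefix)?;
    let mut hasher = DefaultHasher::new();
    hasher.write(&prefix);
    Ok(hasher.finish())
}

/// Validate the checkpoint against the archive, then position the reader
/// at the recorded byte offset so ingest can continue from there.
fn resume<R: Read + Seek>(archive: &mut R, cp: &Checkpoint) -> Result<(), ResumeError> {
    if prefix_hash(archive)? != cp.archive_prefix_hash {
        return Err(ResumeError::ArchiveChanged);
    }
    let len = archive.seek(SeekFrom::End(0))?;
    if cp.byte_offset > len {
        return Err(ResumeError::OffsetPastEnd);
    }
    archive.seek(SeekFrom::Start(cp.byte_offset))?;
    Ok(())
}
```

Checking the hash before seeking means a checkpoint left over from a different or re-exported archive is rejected rather than applied at a meaningless offset.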
Design principles
The archive ingest pipeline follows three principles:
- Never lose progress. Any interruption should be recoverable with minimal reprocessing.
- Never block on corruption. A corrupt message should not prevent the remaining 499,999 messages from being ingested.
- Always show accurate progress. The user should know how far along the import is and how long it will take, within reasonable bounds.
These principles optimize for trust. Users are entrusting TwinMail with their entire email history, and the ingest pipeline is usually their first interaction with the product, so it is where that trust is first earned or lost.