Data Extraction Pipeline Reference

⚡ TL;DR — What the Extraction Pipeline Does

Zera Books Pipeline

✓ Normalizes all description formats
✓ Standardizes date formats (3 options)
✓ Detects duplicate transactions
✓ Validates arithmetic (balances/totals)
✓ Handles multi-currency extraction
✓ Preserves original raw values

Raw Converter Output

✗ Raw bank POS codes in descriptions
✗ Inconsistent date formats mixed in same file
✗ No duplicate detection
✗ No balance verification
✗ No multi-currency handling

Extraction Pipeline Overview

After OCR or text extraction, raw data passes through a 6-stage cleaning pipeline before being written to the output file. Each stage runs in order and flags issues without discarding data.

Raw Field Capture

Raw values from OCR or text extraction are captured exactly as found. No modification at this stage — every value is preserved as the original source record.

Description Normalization

Bank transaction descriptions are cleaned: POS codes, card suffixes, geographic noise, timestamps, and transaction type prefixes are removed. The cleaned description is written to the Description column; raw text goes to Original Description.

Date Standardization

All date formats found in the document are normalized to the user's configured output format. Ambiguous dates are resolved using statement context. Invalid dates trigger a flag for manual review.
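As a rough illustration of this stage, the sketch below tries a handful of input formats and emits one output format. The function name `standardize_date`, the list of accepted input formats, and the `day_first` hint (standing in for "statement context") are assumptions for illustration, not the product's actual implementation.

```python
from datetime import datetime

# Output formats named in this reference; the keys are hypothetical labels.
OUTPUT_FORMATS = {
    "MM/DD/YYYY": "%m/%d/%Y",
    "DD/MM/YYYY": "%d/%m/%Y",
    "ISO": "%Y-%m-%d",
}

def standardize_date(raw, output="ISO", day_first=False):
    """Return (date_string, flag); flag is None or "invalid_date"."""
    # Slash dates are ambiguous, so the reading depends on the context hint.
    slash = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    for fmt in ("%Y-%m-%d", "%d %b %Y", "%b %d, %Y", slash):
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        return parsed.strftime(OUTPUT_FORMATS[output]), None
    # Nothing matched: keep the raw value and flag it for manual review.
    return raw, "invalid_date"
```

The same input, `03/04/2025`, yields March 4 or April 3 depending on `day_first`, which is why an ambiguous date needs context from the rest of the statement.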

Amount Validation

All amounts are parsed to numeric values. Thousand separators, currency symbols, and parenthetical negatives are handled. The debit and credit sums and the running balance are then verified against the totals reported on the statement.
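The parsing half of this stage might look like the sketch below. The function name `parse_amount` is hypothetical, and the code assumes US-style formatting (comma as thousand separator, dot as decimal point); statements using the reverse convention would need different handling.

```python
import re
from decimal import Decimal, InvalidOperation

def parse_amount(raw):
    """Parse a statement amount into a Decimal, or return None if unparseable."""
    text = raw.strip()
    # Parenthetical negatives: "(45.00)" means -45.00.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Drop currency symbols, thousand separators, and whitespace.
    text = re.sub(r"[$€£,\s]", "", text)
    try:
        value = Decimal(text)
    except InvalidOperation:
        return None  # caller should flag the field, not discard the row
    return -value if negative else value
```

`Decimal` is used instead of `float` so that sums of cent values compare exactly against reported totals.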

Duplicate Detection

All transactions are compared for duplicates using date proximity (±3 days), amount (exact match), and normalized description (80%+ similarity). Likely duplicates are flagged — not removed — for user review.

Output Assembly & Flagging

Clean values are assembled into the output schema. Any flagged fields (low OCR confidence, potential duplicates, arithmetic mismatches, ambiguous dates) are marked with a flag column or highlighted cell in the output.
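The "run in order, flag without discarding" behavior of the stages above can be sketched as follows. Only two of the six stages are shown, and every name here (`Row`, `capture`, `validate_amount`, `run_pipeline`, the flag string) is a hypothetical stand-in rather than the product's schema.

```python
from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation

@dataclass
class Row:
    raw: dict                                   # values exactly as captured
    clean: dict = field(default_factory=dict)   # working copy the stages edit
    flags: list = field(default_factory=list)   # issues found along the way

def capture(row):
    # Stage 1: copy the raw values; the raw record itself is never modified.
    row.clean = dict(row.raw)

def validate_amount(row):
    # A later stage: convert the amount, or flag it and leave it in place.
    try:
        row.clean["amount"] = Decimal(row.clean["amount"].replace(",", ""))
    except InvalidOperation:
        row.flags.append("amount_unparseable")

STAGES = [capture, validate_amount]  # the pipeline described here has six

def run_pipeline(row):
    for stage in STAGES:
        stage(row)
    return row
```

A row whose amount cannot be parsed still reaches the output, carrying its flag and its original text.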

Description Normalization

Raw bank transaction descriptions are notoriously messy. Zera Books applies a 4-pass normalization to produce human-readable merchant names while preserving the original string.

Noise Type	Example Raw	Normalized Output
POS code prefix	POS PURCHASE 04241 AMAZON.COM WA US	Amazon.com
Card number suffix	VISA PURCHASE STAPLES #1234 XXXX-4521	Staples
Geographic noise	GOOGLE *ADS MOUNTAIN VIEW CA 94043	Google Ads
Date stamp embedded	ACH DEBIT QUICKBOOKS 20250103	QuickBooks
Transaction type prefix	WIRE TRANSFER OUT: SUPPLIER INC ACCT:9821	Supplier Inc
All-caps legacy format	MICROSOFT CORPORATION REDMOND WA	Microsoft Corporation
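Several rows of the table above can be approximated with ordered regular-expression rules, as in the sketch below. The rule list and the name `normalize_description` are assumptions for illustration. Regexes alone cannot strip city names such as "MOUNTAIN VIEW" or restore brand casing such as "QuickBooks"; those rows would presumably need a lookup table, which is not shown.

```python
import re

# Hypothetical rules, applied in order; each strips one kind of noise.
RULES = [
    r"^(POS PURCHASE|VISA PURCHASE|ACH DEBIT|WIRE TRANSFER OUT:)\s*",  # type prefix
    r"\bX{4}-?\d{4}\b",           # masked card suffix, e.g. XXXX-4521
    r"#\d+",                      # store number, e.g. #1234
    r"\bACCT:\d+\b",              # account reference
    r"\b\d{5,}\b",                # POS codes and embedded date stamps
    r"\s+[A-Z]{2}(\s+US)?\s*$",   # trailing state code, optional country
]

def normalize_description(raw):
    """Return a cleaned description; the caller keeps the raw string."""
    text = raw.upper()
    for pattern in RULES:
        text = re.sub(pattern, " ", text)
    # Collapse whitespace and convert all-caps words to capitalized words.
    return " ".join(word.capitalize() for word in text.split())

raw = "POS PURCHASE 04241 AMAZON.COM WA US"
row = {"Description": normalize_description(raw), "Original Description": raw}
# row["Description"] is "Amazon.com"
```

Note that the trailing state-code rule would also strip a legitimate two-letter final word, so a production rule set would need to be more careful than this one.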

Before / After Example

Before (Raw)

POS PURCHASE 04241
AMZN MKTP US WA
VISA DEBIT XXXX9821
01/15 12:34:22

After (Normalized)

Amazon Marketplace

(the embedded 01/15 date moves to the Date column)

Duplicate Detection Methodology

Duplicate transactions in financial data cause double-counting errors in accounting software. Zera Books flags potential duplicates before export using a three-signal comparison.

Date Proximity Window

Transactions within ±3 calendar days of each other are checked as potential duplicates. The 3-day window accounts for bank processing delays, statement cutoff artifacts, and pending/posted date mismatches.

Amount Exact Match

Amounts must match exactly (to 2 decimal places) to trigger a duplicate flag. A $124.99 and a $125.00 transaction will not be flagged as duplicates even if descriptions match.

Description Similarity (80%+ threshold)

Normalized descriptions are compared using fuzzy matching. 80%+ similarity triggers a flag. "Amazon.com" and "Amazon Marketplace" would be flagged; "Amazon.com" and "Adobe Systems" would not.
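The three signals combine into a single test, sketched below. This reference does not name the fuzzy-matching algorithm, so `difflib.SequenceMatcher` is used here as a stand-in; a different metric will score the same pair differently, so which pairs cross the 80% line depends on that choice. The function names and the dictionary keys are hypothetical.

```python
from datetime import date
from decimal import Decimal
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive similarity between 0.0 and 1.0 (stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_potential_duplicate(t1, t2, window_days=3, threshold=0.80):
    """All three signals must agree before a pair is flagged."""
    close_in_time = abs((t1["date"] - t2["date"]).days) <= window_days
    same_amount = t1["amount"] == t2["amount"]  # exact, to the cent
    similar_text = similarity(t1["description"], t2["description"]) >= threshold
    return close_in_time and same_amount and similar_text

posted = {"date": date(2025, 1, 15), "amount": Decimal("124.99"),
          "description": "Amazon.com"}
pending = {"date": date(2025, 1, 17), "amount": Decimal("124.99"),
           "description": "Amazon com"}
```

Here `posted` and `pending` are two days apart, match to the cent, and differ by one character, so the pair would be flagged for review.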

Cross-Document Detection

When multiple statements are uploaded in the same batch, duplicate detection runs across documents as well. Overlapping statement periods are a common source of duplicate transactions in accounting workflows.

Flag-Not-Remove Policy

Potential duplicates are flagged in a "Duplicate?" column with Y/N values. They are never silently removed. Users review flagged rows before export to prevent accidental deletion of legitimate transactions.
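Cross-document comparison and the flag-not-remove policy can be sketched together: every pair of rows in the batch is compared, and the result is the same set of rows with one extra column. The function name `flag_duplicates`, the `source` key, and the pluggable `is_dup` predicate are assumptions made for this example.

```python
from itertools import combinations

def flag_duplicates(rows, is_dup):
    """Return copies of rows with a "Duplicate?" column; no row is dropped."""
    flagged = set()
    # Compare every pair once, regardless of which document each row came from.
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        if is_dup(a, b):
            flagged.update((i, j))
    return [{**row, "Duplicate?": "Y" if i in flagged else "N"}
            for i, row in enumerate(rows)]

batch = [
    {"source": "jan.pdf", "amount": "50.00"},
    {"source": "feb.pdf", "amount": "50.00"},  # overlapping statement period
    {"source": "feb.pdf", "amount": "12.00"},
]
# A deliberately simple predicate; a real one would use all three signals.
result = flag_duplicates(batch, lambda a, b: a["amount"] == b["amount"])
```

The pairwise loop is quadratic in the number of rows, which is acceptable for a sketch; sorting by date and comparing only within the proximity window would scale better.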

Bulk Duplicate Review

A duplicate review panel shows all flagged pairs side-by-side with their similarity scores. Users confirm or dismiss each pair with one click before finalizing the export.

Arithmetic Validation & Data Quality

99.6% extraction accuracy rate

0 silent arithmetic mismatches exported

6 pipeline stages before output

3-day duplicate detection proximity window

Validation Check	Method	On Failure
Running balance	Recalculate from opening balance + each transaction	Flag row where mismatch begins; report discrepancy amount
Total debits/credits	Sum extracted debits and credits; compare to statement totals	Flag output with "Totals mismatch: $X" warning
Invoice line item sum	Sum line totals; compare to extracted subtotal and total	Flag invoice with arithmetic discrepancy
Date sequence	Verify transaction dates are in non-decreasing order	Flag out-of-order dates for review (common with pending transactions)
Opening/closing match	Calculated closing balance must equal extracted closing balance	Export blocked; user must review before download
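The running-balance check in the first table row can be sketched as below. It assumes signed amounts (credits positive, debits negative) and a `balance` field extracted for each row; the function name and field names are hypothetical.

```python
from decimal import Decimal

def check_running_balance(opening, rows):
    """Return (row_index, discrepancy) where the mismatch begins, or None."""
    balance = opening
    for i, row in enumerate(rows):
        balance += row["amount"]
        if balance != row["balance"]:
            # Discrepancy = extracted balance minus recalculated balance.
            return i, row["balance"] - balance
    return None

rows = [
    {"amount": Decimal("-200.00"), "balance": Decimal("800.00")},
    {"amount": Decimal("50.00"), "balance": Decimal("860.00")},  # off by 10.00
]
mismatch = check_running_balance(Decimal("1000.00"), rows)
```

In this example the recalculated balance after the second row is 850.00, so `mismatch` points at row index 1 with a discrepancy of 10.00.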

Frequently Asked Questions

How does Zera Books detect duplicate transactions?

Duplicate detection compares date, amount, and normalized description across all transactions in a document, and across documents when several statements are uploaded in the same batch. Transactions within 3 days of each other with identical amounts and similar descriptions (80%+ similarity) are flagged as potential duplicates. Users review and confirm before export.

How does Zera Books standardize transaction dates?

All date formats extracted from documents are normalized to a single target format chosen by the user (MM/DD/YYYY, DD/MM/YYYY, or ISO 8601). Ambiguous dates are resolved using the statement's regional context and other date patterns in the document.

What description cleaning does Zera Books apply?

Zera Books removes POS codes, card number suffixes, geographic noise, embedded timestamps, and transaction type prefixes from raw bank descriptions. "POS PURCHASE 04241 AMAZON.COM WA US" becomes "Amazon.com". The original description is always preserved in a separate column.

How does multi-currency extraction work?

When a statement contains transactions in multiple currencies, Zera Books extracts the original currency and amount for each transaction. A Currency column is added to the output. Conversion to a base currency is not performed automatically — that step is left to the accounting software.
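A minimal sketch of per-transaction currency extraction follows. The symbol mapping is partial and hypothetical, and "$" is mapped to USD purely as a default, since the same symbol is used by several currencies. The function name `extract_currency` is also an assumption, and no conversion is attempted, in line with the behavior described above.

```python
import re
from decimal import Decimal

SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}  # hypothetical, partial mapping

def extract_currency(raw, default="USD"):
    """Split an amount string into (currency_code, Decimal amount)."""
    text = raw.strip()
    code = default
    match = re.match(r"^([A-Z]{3})\s*", text)
    if match:
        # Leading ISO-style code, e.g. "EUR 1,250.00".
        code = match.group(1)
        text = text[match.end():]
    elif text and text[0] in SYMBOLS:
        # Leading currency symbol, e.g. "£45.10".
        code = SYMBOLS[text[0]]
        text = text[1:]
    # Raises decimal.InvalidOperation if the remainder is not numeric.
    return code, Decimal(text.replace(",", ""))

code, amount = extract_currency("EUR 1,250.00")
row = {"Amount": amount, "Currency": code}
```

The amount stays in its original currency; the `Currency` column simply records which one it is.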
