Data Extraction Pipeline Reference

How Zera Books cleans, standardizes, and validates extracted financial data before export — description normalization, date standardization, duplicate detection, and arithmetic validation methodology.

★★★★★ 4.9 Trustpilot | 99.6% extraction accuracy | 847M+ transactions processed | Zero balance discrepancies exported

⚡ TL;DR — What the Extraction Pipeline Does

Zera Books Pipeline
  • Normalizes all description formats
  • Standardizes date formats (3 options)
  • Detects duplicate transactions
  • Validates arithmetic (balances/totals)
  • Handles multi-currency extraction
  • Preserves original raw values
Raw Converter Output
  • Raw bank POS codes in descriptions
  • Inconsistent date formats mixed in same file
  • No duplicate detection
  • No balance verification
  • No multi-currency handling
1. Extraction Pipeline Overview

After OCR or text extraction, raw data passes through a 6-stage cleaning pipeline before being written to the output file. Each stage runs in order and flags issues without discarding data.

Stage 1: Raw Field Capture

Raw values from OCR or text extraction are captured exactly as found. No modification at this stage — every value is preserved as the original source record.

Stage 2: Description Normalization

Bank transaction descriptions are cleaned: POS codes, card suffixes, geographic noise, timestamps, and transaction type prefixes are removed. The cleaned description is written to the Description column; raw text goes to Original Description.

Stage 3: Date Standardization

All date formats found in the document are normalized to the user's configured output format. Ambiguous dates are resolved using statement context. Invalid dates trigger a flag for manual review.

Stage 4: Amount Validation

All amounts are parsed to numeric values. Thousand separators, currency symbols, and parenthetical negatives are handled. The debit total, the credit total, and the running balance are each verified against the figures reported on the statement.

Stage 5: Duplicate Detection

All transactions are compared for duplicates using date proximity (±3 days), amount (exact match), and normalized description (80%+ similarity). Likely duplicates are flagged — not removed — for user review.

Stage 6: Output Assembly & Flagging

Clean values are assembled into the output schema. Any flagged fields (low OCR confidence, potential duplicates, arithmetic mismatches, ambiguous dates) are marked with a flag column or highlighted cell in the output.
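The amount parsing described for the amount validation stage can be sketched as follows. This is a minimal illustration, not Zera Books' implementation: the function name `parse_amount` and the flag text are hypothetical, and it assumes commas are thousand separators and the period is the decimal mark.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(raw: str):
    """Return (value, flag). The raw string itself is left untouched.

    Hypothetical helper: assumes ',' is a thousand separator and '.' the decimal mark.
    """
    text = raw.strip()
    # Parenthetical negatives: "(45.00)" means -45.00.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Drop currency symbols, thousand separators, and spaces.
    text = re.sub(r"[^\d.\-]", "", text)
    try:
        value = Decimal(text)
    except InvalidOperation:
        # Flag instead of discarding, in line with the pipeline's stated policy.
        return None, "unparseable amount"
    return (-value if negative else value), None
```

Using `Decimal` rather than `float` keeps cent-level comparisons exact, which matters once sums are compared against statement totals.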


2. Description Normalization

Raw bank transaction descriptions are notoriously messy. Zera Books applies a 4-pass normalization to produce human-readable merchant names while preserving the original string.

Noise Type | Example Raw | Normalized Output
POS code prefix | POS PURCHASE 04241 AMAZON.COM WA US | Amazon.com
Card number suffix | VISA PURCHASE STAPLES #1234 XXXX-4521 | Staples
Geographic noise | GOOGLE *ADS MOUNTAIN VIEW CA 94043 | Google Ads
Date stamp embedded | ACH DEBIT QUICKBOOKS 20250103 | QuickBooks
Transaction type prefix | WIRE TRANSFER OUT: SUPPLIER INC ACCT:9821 | Supplier Inc
All-caps legacy format | MICROSOFT CORPORATION REDMOND WA | Microsoft Corporation
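
A multi-pass cleanup of this kind might look like the sketch below. The regular expressions and the function name `normalize_description` are assumptions made for illustration; the actual Zera Books rules are not published here. Plain regexes also cannot remove city names or restore brand casing (for example "QuickBooks"), which would need lookup tables.

```python
import re

# Hypothetical patterns for illustration only; a real rule set would be much larger.
TYPE_PREFIX = re.compile(r"^(?:POS PURCHASE|VISA PURCHASE|ACH DEBIT|WIRE TRANSFER OUT:?)\s*")
CODES = re.compile(r"X{4}-?\d{4}|#\d+|ACCT:\d+|\b\d{4,}\b")
TRAILING_REGION = re.compile(r"(?:\s+[A-Z]{2}){1,2}\s*$")


def normalize_description(raw: str) -> tuple[str, str]:
    """Return (cleaned, original) so the raw string is always preserved."""
    text = TYPE_PREFIX.sub("", raw.strip())   # pass 1: transaction type prefix
    text = CODES.sub(" ", text)               # pass 2: card suffixes, store and POS codes
    text = TRAILING_REGION.sub("", text)      # pass 3: trailing state/country codes
    # pass 4: collapse whitespace and convert all-caps words to capitalized words
    cleaned = " ".join(word.capitalize() for word in text.split())
    return cleaned, raw
```

Returning the original string alongside the cleaned one mirrors the Description / Original Description column pair described above.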

Before / After Example

Before (Raw)
POS PURCHASE 04241
AMZN MKTP US WA
VISA DEBIT XXXX9821
01/15 12:34:22

After (Normalized)
Amazon Marketplace
(date in Date column)

Clean Data From Day One

Zera Books normalizes, deduplicates, and validates every extracted transaction before it reaches your accounting software.

3. Duplicate Detection Methodology

Duplicate transactions in financial data cause double-counting errors in accounting software. Zera Books flags potential duplicates before export using a three-signal comparison.

Date Proximity Window

Transactions within ±3 calendar days of each other are checked as potential duplicates. The 3-day window accounts for bank processing delays, statement cutoff artifacts, and pending/posted date mismatches.

Amount Exact Match

Amounts must match exactly (to 2 decimal places) to trigger a duplicate flag. A $124.99 and a $125.00 transaction will not be flagged as duplicates even if descriptions match.

Description Similarity (80%+ threshold)

Normalized descriptions are compared using fuzzy matching. 80%+ similarity triggers a flag. "Amazon.com" and "Amazon Marketplace" would be flagged; "Amazon.com" and "Adobe Systems" would not.

Cross-Document Detection

When multiple statements are uploaded in the same batch, duplicate detection runs across documents as well. Overlapping statement periods are a common source of duplicate transactions in accounting workflows.

Flag-Not-Remove Policy

Potential duplicates are flagged in a "Duplicate?" column with Y/N values. They are never silently removed. Users review flagged rows before export to prevent accidental deletion of legitimate transactions.

Bulk Duplicate Review

A duplicate review panel shows all flagged pairs side-by-side with their similarity scores. Users confirm or dismiss each pair with one click before finalizing the export.
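
The three-signal comparison and the flag-not-remove policy can be sketched as below. The `Txn` type and function names are hypothetical, and `difflib.SequenceMatcher` is only a stand-in: the fuzzy matcher Zera Books uses is not specified, and a character-based ratio may score a given pair differently from the product. Flagging both members of a pair is also an assumption.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from difflib import SequenceMatcher
from itertools import combinations


@dataclass
class Txn:  # hypothetical record type
    posted: date
    amount: Decimal
    description: str  # already normalized


def is_likely_duplicate(a: Txn, b: Txn, window_days: int = 3, threshold: float = 0.80) -> bool:
    # Signal 1: dates within the proximity window.
    if abs((a.posted - b.posted).days) > window_days:
        return False
    # Signal 2: amounts match exactly.
    if a.amount != b.amount:
        return False
    # Signal 3: description similarity (stand-in measure, see note above).
    ratio = SequenceMatcher(None, a.description.lower(), b.description.lower()).ratio()
    return ratio >= threshold


def flag_duplicates(txns: list[Txn]) -> list[str]:
    """Return a 'Duplicate?' column of Y/N values; no rows are removed."""
    flags = ["N"] * len(txns)
    for i, j in combinations(range(len(txns)), 2):
        if is_likely_duplicate(txns[i], txns[j]):
            flags[i] = flags[j] = "Y"
    return flags
```

The pairwise loop is quadratic in the number of rows. For large batches, sorting by amount or date first and comparing only nearby rows would be one way to reduce the work.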


4. Arithmetic Validation & Data Quality

  • 99.6% extraction accuracy rate
  • 0 silent arithmetic mismatches exported
  • 6 pipeline stages before output
  • 3-day duplicate detection proximity window

Validation Check | Method | On Failure
Running balance | Recalculate from opening balance + each transaction | Flag row where mismatch begins; report discrepancy amount
Total debits/credits | Sum extracted debits and credits; compare to statement totals | Flag output with "Totals mismatch: $X" warning
Invoice line item sum | Sum line totals; compare to extracted subtotal and total | Flag invoice with arithmetic discrepancy
Date sequence | Verify transaction dates are in non-decreasing order | Flag out-of-order dates for review (common with pending transactions)
Opening/closing match | Calculated closing balance must equal extracted closing balance | Export blocked; user must review before download
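
As a rough illustration of the running balance and opening/closing checks, the sketch below recomputes the balance row by row. Function names and the row layout (signed amount, stated balance) are assumptions, not the product's schema.

```python
from decimal import Decimal


def check_running_balance(opening: Decimal, rows: list[tuple[Decimal, Decimal]]):
    """rows holds (signed_amount, stated_balance) pairs in statement order.

    Returns (row_index, discrepancy) for the first row whose stated balance
    differs from the recalculated one, or None if every row agrees.
    """
    expected = opening
    for index, (amount, stated) in enumerate(rows):
        expected += amount
        if expected != stated:
            return index, stated - expected
    return None


def closing_matches(opening: Decimal, amounts: list[Decimal], extracted_closing: Decimal) -> bool:
    """Opening/closing check: calculated closing must equal the extracted closing."""
    return opening + sum(amounts, Decimal("0")) == extracted_closing
```

For example, with an opening balance of 1000.00 and rows (-124.99, 875.01), (500.00, 1375.01), (-20.00, 1350.01), the first two rows agree and the third is reported with a discrepancy of -5.00.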

Frequently Asked Questions

How does Zera Books detect duplicate transactions?

Duplicate detection compares date, amount, and normalized description across all transactions in a document, and across documents when several statements are uploaded in the same batch. Transactions within 3 days of each other with identical amounts and similar descriptions (80%+ similarity) are flagged as potential duplicates. Users review and confirm before export.

How does Zera Books standardize transaction dates?

All date formats extracted from documents are normalized to a single target format chosen by the user (MM/DD/YYYY, DD/MM/YYYY, or ISO 8601). Ambiguous dates are resolved using the statement's regional context and other date patterns in the document.
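
One way to resolve ambiguous dates from other date patterns in the same document is sketched below. It is a simplified assumption of how such a rule could work, handling only numeric day/month/year strings; the names `infer_day_first` and `standardize` are hypothetical.

```python
import re
from datetime import date

# The three output options named above, mapped to strftime patterns.
OUTPUT_FORMATS = {"MM/DD/YYYY": "%m/%d/%Y", "DD/MM/YYYY": "%d/%m/%Y", "ISO 8601": "%Y-%m-%d"}
NUMERIC_DATE = re.compile(r"(\d{1,2})[/-](\d{1,2})[/-](\d{4})")


def infer_day_first(raw_dates: list[str], default: bool = False) -> bool:
    """Use unambiguous dates in the document to decide the field order."""
    for raw in raw_dates:
        match = NUMERIC_DATE.fullmatch(raw.strip())
        if not match:
            continue
        first, second = int(match.group(1)), int(match.group(2))
        if first > 12:
            return True   # e.g. 15/04/2025 can only be day-first
        if second > 12:
            return False  # e.g. 04/15/2025 can only be month-first
    return default        # fall back to a regional hint supplied by the caller


def standardize(raw: str, day_first: bool, target: str = "ISO 8601"):
    """Return (formatted_date, flag); invalid dates are flagged, not dropped."""
    match = NUMERIC_DATE.fullmatch(raw.strip())
    if not match:
        return None, "invalid date"
    a, b, year = (int(group) for group in match.groups())
    day, month = (a, b) if day_first else (b, a)
    try:
        parsed = date(year, month, day)
    except ValueError:
        return None, "invalid date"
    return parsed.strftime(OUTPUT_FORMATS[target]), None
```

In a statement containing both 03/04/2025 and 15/04/2025, the second date settles the order as day-first, so the first is read as 3 April 2025.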

What description cleaning does Zera Books apply?

Zera Books removes POS codes, card number suffixes, geographic noise, and transaction type prefixes from raw bank descriptions. "POS PURCHASE 04241 AMAZON.COM WA US" becomes "Amazon.com". The original description is always preserved in a separate column.

How does multi-currency extraction work?

When a statement contains transactions in multiple currencies, Zera Books extracts the original currency and amount for each transaction. A Currency column is added to the output. Conversion to a base currency is not performed automatically — that step is left to the accounting software.

Get Clean, Validated Data Every Time

Zera Books cleans, normalizes, deduplicates, and validates every extraction before it reaches your accounting software — no manual cleanup required.
