How Zera Books cleans, standardizes, and validates extracted financial data before export — description normalization, date standardization, duplicate detection, and arithmetic validation methodology.
After OCR or text extraction, raw data passes through a 6-stage cleaning pipeline before being written to the output file. Each stage runs in order and flags issues without discarding data.
Raw values from OCR or text extraction are captured exactly as found. No modification at this stage — every value is preserved as the original source record.
Bank transaction descriptions are cleaned: POS codes, card suffixes, geographic noise, timestamps, and transaction type prefixes are removed. The cleaned description is written to the Description column; raw text goes to Original Description.
All date formats found in the document are normalized to the user's configured output format. Ambiguous dates are resolved using statement context. Invalid dates trigger a flag for manual review.
All amounts are parsed to numeric values. Thousand separators, currency symbols, and parenthetical negatives are handled. Debit and credit sums are then checked against the statement's reported totals, and the running balance is recalculated and compared with the balances printed on the statement.
All transactions are compared for duplicates using date proximity (±3 days), amount (exact match), and normalized description (80%+ similarity). Likely duplicates are flagged — not removed — for user review.
Clean values are assembled into the output schema. Any flagged fields (low OCR confidence, potential duplicates, arithmetic mismatches, ambiguous dates) are marked with a flag column or highlighted cell in the output.
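The amount-parsing stage described above can be sketched in a few lines of Python. This is an illustration only, not Zera Books' implementation: the function name `parse_amount` is hypothetical, and the sketch assumes a period as the decimal mark and a comma as the thousand separator. Unparseable values return `None` so the caller can flag the row rather than discard it.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(raw: str) -> Optional[Decimal]:
    """Parse a printed amount; return None if it cannot be parsed (caller flags it)."""
    text = raw.strip()
    negative = False

    # Parenthetical negatives: "(1,234.56)" means -1234.56
    if text.startswith("(") and text.endswith(")"):
        negative = True
        text = text[1:-1]

    # Leading or trailing minus sign
    if text.startswith("-") or text.endswith("-"):
        negative = True
        text = text.strip("-")

    # Drop currency symbols, thousand separators, and whitespace.
    # Assumption: "." is the decimal mark.
    text = re.sub(r"[^\d.]", "", text)

    try:
        value = Decimal(text)
    except InvalidOperation:
        return None
    return -value if negative else value
```

Using `Decimal` rather than `float` keeps cent-level comparisons exact, which matters when sums are later checked against statement totals.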
Raw bank transaction descriptions are notoriously messy. Zera Books applies a 4-pass normalization to produce human-readable merchant names while preserving the original string.
| Noise Type | Example Raw | Normalized Output |
|---|---|---|
| POS code prefix | POS PURCHASE 04241 AMAZON.COM WA US | Amazon.com |
| Card number suffix | VISA PURCHASE STAPLES #1234 XXXX-4521 | Staples |
| Geographic noise | GOOGLE *ADS MOUNTAIN VIEW CA 94043 | Google Ads |
| Date stamp embedded | ACH DEBIT QUICKBOOKS 20250103 | QuickBooks |
| Transaction type prefix | WIRE TRANSFER OUT: SUPPLIER INC ACCT:9821 | Supplier Inc |
| All-caps legacy format | MICROSOFT CORPORATION REDMOND WA | Microsoft Corporation |
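A rough sketch of how pattern-based cleanup like the rows above might work is shown here. The regular expressions and the function name `normalize_description` are assumptions made for illustration; the actual rules Zera Books uses are not documented in this page. The sketch covers prefixes, card suffixes, store numbers, numeric codes, and trailing state codes. It does not remove city names or restore brand casing such as "QuickBooks", which would need a merchant lookup table.

```python
import re

# Hypothetical noise patterns, applied in order.
NOISE_PATTERNS = [
    r"^(?:POS PURCHASE|VISA PURCHASE|ACH DEBIT|WIRE TRANSFER OUT:?)\s+",  # type prefix
    r"\bACCT:\d+\b",               # account reference
    r"\bX{4}-\d{4}\b",             # masked card suffix
    r"#\d+\b",                     # store number
    r"\b\d{5,8}\b",                # POS codes, date stamps, ZIP codes
    r"\s+[A-Z]{2}(?:\s+US)?\s*$",  # trailing state code, optional country
]


def normalize_description(raw: str) -> str:
    """Return a cleaned description; the caller keeps `raw` as the original."""
    text = raw.upper()
    for pattern in NOISE_PATTERNS:
        text = re.sub(pattern, " ", text)
    text = text.replace("*", " ")
    # Capitalize each word: "AMAZON.COM" becomes "Amazon.com"
    return " ".join(word.capitalize() for word in text.split())


print(normalize_description("POS PURCHASE 04241 AMAZON.COM WA US"))        # Amazon.com
print(normalize_description("VISA PURCHASE STAPLES #1234 XXXX-4521"))      # Staples
print(normalize_description("WIRE TRANSFER OUT: SUPPLIER INC ACCT:9821"))  # Supplier Inc
```

One known limitation of this sketch: the trailing state-code pattern would also strip a legitimate two-letter final word, so a production rule set would need a list of valid state codes.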
Zera Books normalizes, deduplicates, and validates every extracted transaction before it reaches your accounting software.
Duplicate transactions in financial data cause double-counting errors in accounting software. Zera Books flags potential duplicates before export using a three-signal comparison.
Transactions within ±3 calendar days of each other are checked as potential duplicates. The 3-day window accounts for bank processing delays, statement cutoff artifacts, and pending/posted date mismatches.
Amounts must match exactly (to 2 decimal places) to trigger a duplicate flag. A $124.99 and a $125.00 transaction will not be flagged as duplicates even if descriptions match.
Normalized descriptions are compared using fuzzy matching. 80%+ similarity triggers a flag. "Amazon.com" and "Amazon Marketplace" would be flagged; "Amazon.com" and "Adobe Systems" would not.
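The three signals above can be combined as in the following sketch. The `Txn` class and function names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library stands in for the fuzzy matcher, which this page does not specify. Scores depend heavily on the measure chosen: this character-based ratio scores "Amazon.com" against "Amazon Marketplace" well below 0.8, so a token-based or merchant-aware measure would be needed to reproduce that example.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from difflib import SequenceMatcher
from itertools import combinations


@dataclass
class Txn:
    posted: date
    amount: Decimal
    description: str  # already normalized


def similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1]; a stand-in for the real matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_potential_duplicates(txns, max_days=3, min_similarity=0.80):
    """Return (i, j, score) for every pair that passes all three signals."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(txns), 2):
        if abs((a.posted - b.posted).days) > max_days:
            continue  # signal 1: date proximity
        if a.amount != b.amount:
            continue  # signal 2: exact amount
        score = similarity(a.description, b.description)
        if score >= min_similarity:  # signal 3: description similarity
            pairs.append((i, j, round(score, 2)))
    return pairs


txns = [
    Txn(date(2025, 1, 3), Decimal("124.99"), "Supplier Inc"),
    Txn(date(2025, 1, 5), Decimal("124.99"), "Supplier Inc."),
    Txn(date(2025, 1, 5), Decimal("125.00"), "Supplier Inc"),   # amount differs
    Txn(date(2025, 1, 20), Decimal("124.99"), "Supplier Inc"),  # outside the window
    Txn(date(2025, 1, 4), Decimal("124.99"), "Adobe Systems"),  # description differs
]
pairs = find_potential_duplicates(txns)
```

Only the first two transactions are paired here. The function returns pairs and never deletes rows, matching the flag-for-review behavior. The pairwise loop is quadratic, so a larger batch would benefit from grouping by amount first.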
When multiple statements are uploaded in the same batch, duplicate detection runs across documents as well. Overlapping statement periods are a common source of duplicate transactions in accounting workflows.
Potential duplicates are flagged in a "Duplicate?" column with Y/N values. They are never silently removed. Users review flagged rows before export to prevent accidental deletion of legitimate transactions.
A duplicate review panel shows all flagged pairs side-by-side with their similarity scores. Users confirm or dismiss each pair with one click before finalizing the export.
| Validation Check | Method | On Failure |
|---|---|---|
| Running balance | Recalculate from opening balance + each transaction | Flag row where mismatch begins; report discrepancy amount |
| Total debits/credits | Sum extracted debits and credits; compare to statement totals | Flag output with "Totals mismatch: $X" warning |
| Invoice line item sum | Sum line totals; compare to extracted subtotal and total | Flag invoice with arithmetic discrepancy |
| Date sequence | Verify transaction dates are in non-decreasing order | Flag out-of-order dates for review (common with pending transactions) |
| Opening/closing match | Calculated closing balance must equal extracted closing balance | Export blocked; user must review before download |
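The running-balance check in the table can be illustrated with a short sketch. The function name and the row layout (a signed amount plus the balance printed on the statement) are assumptions for the example, not a documented Zera Books interface.

```python
from decimal import Decimal
from typing import List, Optional, Tuple


def check_running_balance(
    opening: Decimal,
    rows: List[Tuple[Decimal, Decimal]],
) -> Optional[Tuple[int, Decimal]]:
    """rows holds (signed_amount, reported_balance); credits positive, debits negative.

    Returns (row_index, discrepancy) for the first row where the recalculated
    balance differs from the reported one, or None if every row agrees.
    """
    balance = opening
    for index, (amount, reported) in enumerate(rows):
        balance += amount
        if balance != reported:
            return index, reported - balance
    return None


rows = [
    (Decimal("-124.99"), Decimal("875.01")),
    (Decimal("500.00"), Decimal("1375.01")),
    (Decimal("-75.01"), Decimal("1290.00")),  # recalculated balance is 1300.00
]
result = check_running_balance(Decimal("1000.00"), rows)
```

Here `result` identifies the third row (index 2) with a discrepancy of -10.00, which corresponds to flagging the row where the mismatch begins and reporting the discrepancy amount.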
Duplicate detection compares date, amount, and normalized description across all transactions in a document, and across documents when several statements are uploaded in the same batch. Transactions within 3 days of each other with identical amounts and similar descriptions (80%+ similarity) are flagged as potential duplicates. Users review and confirm before export.
All date formats extracted from documents are normalized to a single target format chosen by the user (MM/DD/YYYY, DD/MM/YYYY, or ISO 8601). Ambiguous dates are resolved using the statement's regional context and other date patterns in the document.
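One way to resolve ambiguous dates from other date patterns in the document is sketched below. It is an illustration under stated assumptions: the function names are hypothetical, only slash-separated numeric dates are handled, and ISO 8601 is used as the target format. If any date in the statement has a first field above 12, the whole statement is treated as day-first; if none disambiguates, the result is `None` and the dates would be flagged for review.

```python
from datetime import datetime
from typing import Optional, Sequence


def infer_day_first(raw_dates: Sequence[str]) -> Optional[bool]:
    """True for DD/MM, False for MM/DD, None if every date is ambiguous."""
    for raw in raw_dates:
        first, second, _year = (int(part) for part in raw.split("/"))
        if first > 12:
            return True   # first field cannot be a month
        if second > 12:
            return False  # second field cannot be a month
    return None


def normalize_date(raw: str, day_first: bool) -> Optional[str]:
    """Return the date in ISO 8601, or None if it is invalid (flag for review)."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    try:
        return datetime.strptime(raw, fmt).date().isoformat()
    except ValueError:
        return None


statement_dates = ["03/01/2025", "25/01/2025", "31/02/2025"]
day_first = infer_day_first(statement_dates)  # "25/01/2025" settles it
normalized = [normalize_date(d, day_first) for d in statement_dates]
```

In this example "03/01/2025" is read as 3 January because a later row disambiguates the ordering, and "31/02/2025" comes back as `None` because no such date exists.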
Zera Books removes POS codes, card number suffixes, geographic noise, and transaction type prefixes from raw bank descriptions. "POS PURCHASE 04241 AMAZON.COM WA US" becomes "Amazon.com". The original description is always preserved in a separate column.
When a statement contains transactions in multiple currencies, Zera Books extracts the original currency and amount for each transaction. A Currency column is added to the output. Conversion to a base currency is not performed automatically — that step is left to the accounting software.