PDF Table Ingestion

Use this wizard to walk through each phase of the PDF Table Ingestion pipeline and verify the corresponding tables.

This page explains how the PDF Table Ingestion pipeline works, which database tables should contain data at each stage, and how reviewers can verify correctness throughout the process. Each stage writes its output to its own tables, so every transformation can be inspected and audited.


1. Document Registration

When a PDF is uploaded, the system stores the file and creates a root DocumentRecord. All downstream extraction, staging, and promotion steps link back to this record.

Tables that should contain data
  • DocumentRecord — one row per uploaded PDF.

This record is the anchor for the entire ingestion run. Every page, row, cell, staging record, and promoted entity traces back to this ID.
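
As an illustration of how downstream rows link back to the root record, here is a minimal Python sketch. The field names (file_name, storage_path, uploaded_at, document_id) and the register_document helper are assumptions made for this example and may not match the real schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import uuid


@dataclass
class DocumentRecord:
    # Hypothetical columns; the real DocumentRecord schema may differ.
    file_name: str
    storage_path: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    uploaded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class PageExtractionRecord:
    # Assumed foreign key pointing back at DocumentRecord.id.
    document_id: str
    page_number: int
    status: str = "pending"


def register_document(file_name: str, storage_path: str) -> DocumentRecord:
    """Create the root record that every downstream row refers to."""
    return DocumentRecord(file_name=file_name, storage_path=storage_path)


doc = register_document("fees-2024.pdf", "/uploads/fees-2024.pdf")
page = PageExtractionRecord(document_id=doc.id, page_number=1)
```

Any later row (cell, staging record, promotion log entry) would carry or reach the same document ID in the same way.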


2. Page Extraction (PDF → Page Images)

The system splits the PDF into individual pages and converts each page into an image. These images are used for table detection and extraction.

Tables that should contain data
  • PageExtractionRecord — one row per page, including page number and extraction status.

If page extraction fails for a page, no table rows or cells can be produced for that page.
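
A reviewer-side check for this rule might look like the sketch below. It assumes the extraction status is stored as a string with the values "succeeded" and "failed"; the actual status values are not specified on this page.

```python
from dataclasses import dataclass


@dataclass
class PageStatus:
    page_number: int
    status: str  # assumed values: "succeeded" or "failed"


def pages_without_tables(pages):
    """Return page numbers that cannot yield rows or cells downstream."""
    return sorted(p.page_number for p in pages if p.status != "succeeded")


pages = [
    PageStatus(1, "succeeded"),
    PageStatus(2, "failed"),
    PageStatus(3, "succeeded"),
]
blocked = pages_without_tables(pages)
```

Here blocked contains only page 2, so a reviewer should expect no ExtractedTableRow or ExtractedTableCell entries for that page.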


3. Table Extraction (Page Images → Rows & Cells)

Each page image is analyzed for tables. Detected tables are broken into rows and cells. This is the raw, uncleaned representation of the PDF’s tabular content.

Tables that should contain data
  • ExtractedTableRow — one row per detected table row.
  • ExtractedTableCell — one row per cell, including column index and extracted text.

Reviewers can inspect these rows and cells to confirm that the extraction engine correctly interpreted the PDF. This raw output is the baseline that all later mapping and domain logic builds on, so extraction errors are best caught here.
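
To compare extracted cells against the PDF, it can help to reassemble them into a grid. The sketch below assumes each cell is available as a (row_id, column_index, text) tuple; the sample values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical ExtractedTableCell data: (row_id, column_index, text).
cells = [
    (1, 1, "Amount"),
    (1, 0, "Fee"),
    (2, 0, "Parking permit"),
    (2, 1, "$25.00"),
]


def cells_to_grid(cells):
    """Group cells by row, then order each row by column index."""
    rows = defaultdict(dict)
    for row_id, column_index, text in cells:
        rows[row_id][column_index] = text
    return [
        [rows[row_id][col] for col in sorted(rows[row_id])]
        for row_id in sorted(rows)
    ]


grid = cells_to_grid(cells)
for line in grid:
    print(" | ".join(line))
```

The printed grid can then be checked side by side with the corresponding page image.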


4. Mapping & Staging (Rows → Domain JSON)

Each extracted row is transformed into a domain-specific JSON object using the DomainModelDefinition and MappingRules. The result is stored in the universal staging table.

Tables that should contain data
  • DomainModelDefinition — defines the target entity type and JSON schema.
  • MappingRules — defines how extracted columns map to domain properties.
  • UniversalStagingRow — one row per mapped extracted row, including JSON data and validation errors.

Staging is the primary reviewer checkpoint. Reviewers can see what the system thinks each row represents, whether required fields are present, and whether the row is ready for promotion.
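
The mapping step can be pictured with the following sketch. It assumes MappingRules reduce to a column-index-to-property lookup and that validation only checks for required fields; the property names feeName and amount, and the output keys, are hypothetical.

```python
import json

# Hypothetical rules: extracted column index -> domain property name.
MAPPING_RULES = {0: "feeName", 1: "amount"}
REQUIRED_FIELDS = ["feeName", "amount"]


def map_row(cell_texts, mapping_rules, required_fields):
    """Build a staging record (JSON plus validation errors) from one row."""
    data = {}
    for column_index, prop in mapping_rules.items():
        text = ""
        if column_index < len(cell_texts):
            text = cell_texts[column_index].strip()
        if text:
            data[prop] = text
    errors = [
        f"Missing required field: {name}"
        for name in required_fields
        if name not in data
    ]
    return {
        "json_data": json.dumps(data, sort_keys=True),
        "validation_errors": errors,
        "ready_for_promotion": not errors,
    }


ok = map_row(["Parking permit", "$25.00"], MAPPING_RULES, REQUIRED_FIELDS)
bad = map_row(["Parking permit", ""], MAPPING_RULES, REQUIRED_FIELDS)
```

In this sketch, ok is ready for promotion while bad carries a validation error for the missing amount, which is the kind of difference a reviewer would look for in UniversalStagingRow.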


5. Promotion (Staging → Domain Entities)

Validated staging rows are promoted into real domain entities (e.g., ResidentFee). Promotion is metadata-driven and uses the JSON stored in staging.

Tables that should contain data
  • Domain Entities (e.g., ResidentFee, ResidentFeeVersion) — the final civic data.
  • PromotionLog — links staging rows to promoted entities and records timestamps and errors.

Promotion is the final step, where data becomes official. Because each PromotionLog entry links a staging row to its promoted entity, any official record can be traced back through staging and extraction to the original DocumentRecord.
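
A simplified view of promotion and its log is sketched below. The dictionary keys (staging_row_id, promoted_entity_id, error, timestamp) and the in-memory lists are assumptions for illustration, not the real PromotionLog schema.

```python
import json
from datetime import datetime, timezone


def promote(staging_rows):
    """Promote valid staging rows and log an entry for every row."""
    entities, log = [], []
    for row in staging_rows:
        entry = {
            "staging_row_id": row["id"],
            "promoted_entity_id": None,
            "error": None,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        if row["validation_errors"]:
            # Invalid rows are not promoted; the reason is recorded instead.
            entry["error"] = "; ".join(row["validation_errors"])
        else:
            entity = {"id": len(entities) + 1, **json.loads(row["json_data"])}
            entities.append(entity)
            entry["promoted_entity_id"] = entity["id"]
        log.append(entry)
    return entities, log


staging = [
    {"id": 10, "validation_errors": [],
     "json_data": '{"feeName": "Parking permit", "amount": "$25.00"}'},
    {"id": 11, "validation_errors": ["Missing required field: amount"],
     "json_data": '{"feeName": "Dog license"}'},
]
entities, log = promote(staging)
```

Every staging row produces a log entry, so both successful promotions and failures remain visible to reviewers.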


End-to-End Table Flow

Phase            | Tables                                                   | What You Should See
Upload           | DocumentRecord                                           | One row per PDF
Page Extraction  | PageExtractionRecord                                     | One row per page
Table Extraction | ExtractedTableRow, ExtractedTableCell                    | Raw rows and cells
Mapping          | DomainModelDefinition, MappingRules, UniversalStagingRow | Domain JSON + validation
Promotion        | Domain Entities, PromotionLog                            | Final civic data + audit trail

Reviewer Checklist

  • Confirm the PDF is registered in DocumentRecord.
  • Verify all pages extracted successfully in PageExtractionRecord.
  • Inspect raw rows and cells in ExtractedTableRow and ExtractedTableCell.
  • Review mapped JSON and validation results in UniversalStagingRow.
  • Confirm promoted entities and audit trail in PromotionLog and domain tables.
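
A first pass over the checklist can be automated by counting rows per table. The sketch below uses an in-memory SQLite database with placeholder single-column tables; the real database engine, schemas, and domain tables are not described on this page and are assumed here.

```python
import sqlite3

# Table names taken from this page; domain tables are omitted because
# they vary by entity type.
CHECKLIST_TABLES = [
    "DocumentRecord",
    "PageExtractionRecord",
    "ExtractedTableRow",
    "ExtractedTableCell",
    "UniversalStagingRow",
    "PromotionLog",
]


def row_counts(conn):
    """Return {table_name: row_count} for each checklist table."""
    counts = {}
    for table in CHECKLIST_TABLES:
        # Names come from the fixed list above, never from user input.
        cursor = conn.execute(f'SELECT COUNT(*) FROM "{table}"')
        counts[table] = cursor.fetchone()[0]
    return counts


# Demo setup: placeholder tables with only an id column.
conn = sqlite3.connect(":memory:")
for table in CHECKLIST_TABLES:
    conn.execute(f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY)')
conn.execute('INSERT INTO "DocumentRecord" DEFAULT VALUES')

counts = row_counts(conn)
empty_tables = [name for name, count in counts.items() if count == 0]
```

An empty table points at the first phase that did not complete for the run under review; row counts alone do not replace inspecting the actual rows.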