PDF Table Ingestion
Use this wizard to walk through each phase of the PDF Table Ingestion pipeline and verify the corresponding tables.
This page explains how the PDF Table Ingestion pipeline works, which database tables should contain data at each stage, and how reviewers can verify correctness throughout the process. The ingestion pipeline is fully auditable and designed to make every transformation visible and reviewable.
1. Document Registration
When a PDF is uploaded, the system stores the file and creates a root DocumentRecord. All downstream extraction, staging, and promotion steps link back to this record.
- DocumentRecord — one row per uploaded PDF.
This record is the anchor for the entire ingestion run. Every page, row, cell, staging record, and promoted entity traces back to this ID.
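The anchoring idea can be sketched as follows. This is a minimal illustration, not the pipeline's actual schema: the field names (`id`, `file_name`, `uploaded_at`, `document_id`), the `register_document` helper, and the sample file name are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4


@dataclass(frozen=True)
class DocumentRecord:
    """Root record: one per uploaded PDF (field names are assumed)."""
    file_name: str
    id: str = field(default_factory=lambda: str(uuid4()))
    uploaded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass(frozen=True)
class PageExtractionRecord:
    """Example downstream record that carries the root document ID."""
    document_id: str
    page_number: int


def register_document(file_name: str) -> DocumentRecord:
    # Hypothetical helper: creates the root record for one upload.
    return DocumentRecord(file_name=file_name)


doc = register_document("fee-schedule.pdf")  # made-up file name
page = PageExtractionRecord(document_id=doc.id, page_number=1)
```

Because each downstream record stores the root ID, a reviewer can start from any page, row, or staging record and find the PDF it came from.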
2. Page Extraction (PDF → Page Images)
The system splits the PDF into individual pages and converts each page into an image. These images are used for table detection and extraction.
- PageExtractionRecord — one row per page, including page number and extraction status.
If page extraction fails for a page, no table rows or cells can be produced for that page.
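A reviewer-side check for failed pages might look like the sketch below. The status values `"Succeeded"` and `"Failed"`, the field names, and the `failed_pages` helper are assumptions for illustration; the real PageExtractionRecord may use different names.

```python
from dataclasses import dataclass


@dataclass
class PageExtractionRecord:
    document_id: str
    page_number: int
    status: str  # assumed values: "Succeeded" or "Failed"


def failed_pages(records):
    """Return the page numbers that did not extract successfully."""
    return sorted(r.page_number for r in records if r.status != "Succeeded")


# Made-up sample data for a three-page document.
records = [
    PageExtractionRecord("doc-1", 1, "Succeeded"),
    PageExtractionRecord("doc-1", 2, "Failed"),
    PageExtractionRecord("doc-1", 3, "Succeeded"),
]
missing = failed_pages(records)
```

Any page number returned here is a page for which no ExtractedTableRow or ExtractedTableCell rows should be expected.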
3. Table Extraction (Page Images → Rows & Cells)
Each page image is analyzed for tables. Detected tables are broken into rows and cells. This is the raw, uncleaned representation of the PDF’s tabular content.
- ExtractedTableRow — one row per detected table row.
- ExtractedTableCell — one row per cell, including column index and extracted text.
Reviewers can inspect these rows and cells to confirm that the extraction engine correctly interpreted the PDF. This raw output is the baseline to compare against the source PDF before any mapping or domain logic is applied.
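To compare extracted content against the PDF, it can help to reassemble cells into rows ordered by column index. The sketch below assumes each cell carries a `row_id`, a `column_index`, and its `text`; those field names and the `cells_to_rows` helper are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ExtractedTableCell:
    row_id: int        # assumed link to the parent ExtractedTableRow
    column_index: int
    text: str


def cells_to_rows(cells):
    """Group cells by row and order each row's text by column index."""
    grouped = defaultdict(list)
    for cell in cells:
        grouped[cell.row_id].append(cell)
    return {
        row_id: [c.text for c in sorted(row_cells, key=lambda c: c.column_index)]
        for row_id, row_cells in sorted(grouped.items())
    }


# Made-up cells, deliberately out of order.
cells = [
    ExtractedTableCell(1, 1, "25.00"),
    ExtractedTableCell(2, 0, "Pool pass"),
    ExtractedTableCell(1, 0, "Parking permit"),
    ExtractedTableCell(2, 1, "40.00"),
]
rows = cells_to_rows(cells)
```

Each reassembled row can then be read side by side with the corresponding table line in the PDF.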
4. Mapping & Staging (Rows → Domain JSON)
Each extracted row is transformed into a domain-specific JSON object using the DomainModelDefinition and MappingRules. The result is stored in the universal staging table.
- DomainModelDefinition — defines the target entity type and JSON schema.
- MappingRules — defines how extracted columns map to domain properties.
- UniversalStagingRow — one row per mapped extracted row, including JSON data and validation errors.
Staging is the primary reviewer checkpoint. Reviewers can see what the system thinks each row represents, whether required fields are present, and whether the row is ready for promotion.
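The mapping step can be sketched as below. The rule format (column index to property name), the required-field list, the property names `feeName` and `amount`, and the `map_row` helper are all assumptions made for illustration; the actual MappingRules and DomainModelDefinition formats are not described on this page.

```python
import json

# Hypothetical mapping rules: extracted column index -> domain property.
MAPPING_RULES = {0: "feeName", 1: "amount"}
# Hypothetical required properties taken from the domain model definition.
REQUIRED_FIELDS = ["feeName", "amount"]


def map_row(cell_texts):
    """Turn one extracted row into a staging record with validation errors."""
    data = {
        prop: cell_texts[index].strip()
        for index, prop in MAPPING_RULES.items()
        if index < len(cell_texts)
    }
    errors = [
        f"Missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not data.get(name)
    ]
    return {"json_data": json.dumps(data), "validation_errors": errors}


good = map_row(["Parking permit", " 25.00 "])
bad = map_row(["Pool pass"])  # the amount column is missing
```

In this sketch a staging row is ready for promotion only when its `validation_errors` list is empty, which mirrors what reviewers check at this stage.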
5. Promotion (Staging → Domain Entities)
Validated staging rows are promoted into real domain entities (e.g., ResidentFee). Promotion is metadata-driven: the target entity type comes from the DomainModelDefinition, and the field values come from the JSON stored in staging.
- Domain Entities (e.g., ResidentFee, ResidentFeeVersion) — the final civic data.
- PromotionLog — links staging rows to promoted entities and records timestamps and errors.
Promotion is the final step where data becomes official. Promotion logs ensure full traceability and auditability across the entire ingestion lifecycle.
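A minimal sketch of promotion with logging follows. It assumes staging rows are dictionaries with `id`, `entity_type`, `json_data`, and `validation_errors` keys, and that skipped rows are also logged with an error message. Both the shapes and that logging behavior are assumptions, not confirmed details of the pipeline.

```python
import json
from datetime import datetime, timezone


def promote(staging_rows):
    """Promote valid staging rows and log every attempt, including skips."""
    entities, promotion_log = [], []
    for row in staging_rows:
        entry = {"staging_row_id": row["id"], "entity_id": None,
                 "promoted_at": None, "error": None}
        if row["validation_errors"]:
            entry["error"] = "Not promoted: staging row has validation errors"
        else:
            entity = {"id": len(entities) + 1,
                      "type": row["entity_type"],
                      "data": json.loads(row["json_data"])}
            entities.append(entity)
            entry["entity_id"] = entity["id"]
            entry["promoted_at"] = datetime.now(timezone.utc).isoformat()
        promotion_log.append(entry)
    return entities, promotion_log


# Made-up staging rows: one valid, one with a validation error.
staging_rows = [
    {"id": 1, "entity_type": "ResidentFee",
     "json_data": '{"feeName": "Parking permit", "amount": "25.00"}',
     "validation_errors": []},
    {"id": 2, "entity_type": "ResidentFee",
     "json_data": '{"feeName": "Pool pass"}',
     "validation_errors": ["Missing required field: amount"]},
]
entities, promotion_log = promote(staging_rows)
```

Each log entry ties a staging row to the entity it produced, so an auditor can walk from a promoted entity back through staging to the original PDF.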
End-to-End Table Flow
| Phase | Tables | What You Should See |
|---|---|---|
| Document Registration | DocumentRecord | One row per PDF |
| Page Extraction | PageExtractionRecord | One row per page |
| Table Extraction | ExtractedTableRow, ExtractedTableCell | Raw rows and cells |
| Mapping & Staging | DomainModelDefinition, MappingRules, UniversalStagingRow | Domain JSON + validation |
| Promotion | Domain Entities, PromotionLog | Final civic data + audit trail |
Reviewer Checklist
- Confirm the PDF is registered in DocumentRecord.
- Verify all pages extracted successfully in PageExtractionRecord.
- Inspect raw rows and cells in ExtractedTableRow and ExtractedTableCell.
- Review mapped JSON and validation results in UniversalStagingRow.
- Confirm promoted entities and audit trail in PromotionLog and domain tables.
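The checklist above can be approximated with a few count queries. The sketch below uses an in-memory SQLite database with a simplified, assumed schema: the column names (`DocumentId`, `Status`, `StagingRowId`, and so on), the status values, and the sample rows are hypothetical and will differ from the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DocumentRecord (Id INTEGER PRIMARY KEY, FileName TEXT);
CREATE TABLE PageExtractionRecord (
    Id INTEGER PRIMARY KEY, DocumentId INTEGER, PageNumber INTEGER, Status TEXT);
CREATE TABLE ExtractedTableRow (Id INTEGER PRIMARY KEY, DocumentId INTEGER);
CREATE TABLE UniversalStagingRow (
    Id INTEGER PRIMARY KEY, DocumentId INTEGER, JsonData TEXT);
CREATE TABLE PromotionLog (
    Id INTEGER PRIMARY KEY, StagingRowId INTEGER, EntityId INTEGER);

-- Made-up sample data for one two-page document.
INSERT INTO DocumentRecord VALUES (1, 'fee-schedule.pdf');
INSERT INTO PageExtractionRecord VALUES (1, 1, 1, 'Succeeded');
INSERT INTO PageExtractionRecord VALUES (2, 1, 2, 'Failed');
INSERT INTO ExtractedTableRow VALUES (1, 1);
INSERT INTO UniversalStagingRow VALUES (1, 1, '{"feeName": "Parking permit"}');
INSERT INTO PromotionLog VALUES (1, 1, 10);
""")


def run_checklist(conn, document_id):
    """Run one count query per checklist item for a single document."""
    def count(sql):
        return conn.execute(sql, (document_id,)).fetchone()[0]

    return {
        "registered": count(
            "SELECT COUNT(*) FROM DocumentRecord WHERE Id = ?") == 1,
        "failed_pages": count(
            "SELECT COUNT(*) FROM PageExtractionRecord "
            "WHERE DocumentId = ? AND Status <> 'Succeeded'"),
        "extracted_rows": count(
            "SELECT COUNT(*) FROM ExtractedTableRow WHERE DocumentId = ?"),
        "staging_rows": count(
            "SELECT COUNT(*) FROM UniversalStagingRow WHERE DocumentId = ?"),
        "unpromoted_rows": count(
            "SELECT COUNT(*) FROM UniversalStagingRow s "
            "LEFT JOIN PromotionLog p ON p.StagingRowId = s.Id "
            "WHERE s.DocumentId = ? AND p.Id IS NULL"),
    }


report = run_checklist(conn, 1)
```

Counts only show that rows exist at each stage; reviewers still need to open the rows themselves to judge whether the extracted and mapped content is correct.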