directory-pipeline

Plan: automatic section detection for multi-section directories

Status: Planned — Phase 0 not yet started; best implemented in a local session.

What this is: an implementation plan (no code yet) for a detect_sections stage that analyzes a volume’s per-page OCR output and drafts a sections.txt marking the structural sections of a city directory (alphabetical name list, street/reverse directory, classified business directory, front matter, ads). Created 2026-06-15. Intended to be implemented by a local instance with the actual output/ data on disk (it is gitignored, so cloud sessions can’t see it).


TL;DR for the implementer


Background: the gap

sections.txt marks the first page of each structural section in a volume and routes per-section prompts:

# filename-of-first-page    label
0015_p15020coll12:2453.jpg  alphabetical
0181_p15020coll12:2619.jpg  street
0311_p15020coll12:2749.jpg  business

Today this file is written by hand (see the artifact table in docs/pipeline-stages.md). Everything downstream already reads it:

| Consumer | What it does with sections | Reference |
|---|---|---|
| utils/section_utils.py | parse file → label per page, per-section prompt path, boundary test | load_sections, section_for_page, prompt_for_page, is_section_boundary |
| pipeline/extract_entries.py | resets carry-forward context at each boundary; switches to ner_prompt_{label}.md | extract_entries.py:1112-1123, :1167-1177 |
| pipeline/run_gemini_ocr.py | switches to ocr_prompt_{label}.md per section | (section lookup via section_utils) |
| pipeline/generate_prompt.py | “sections mode”: samples pages from each run, generates per-section prompts | generate_prompt.py (sections branch) |
| pipeline/select_pages.py | seeds selection from section boundaries | (section lookup) |
| main.py / cli/main.py | --sections flag + _resolve_sections() plumbing to declarative stages via from_ctx | main.py:208 _resolve_sections, stages.py Opt("sections", "--sections", from_ctx=True) |

So the feature is unusually well-scoped: produce a draft sections.txt; consume nothing new.
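
To make the contract concrete, here is a minimal sketch of how a boundaries-only file maps to a label for every page. It is not the code in utils/section_utils.py: the names parse_sections_text and label_for_page are hypothetical, and the sketch assumes page filenames sort in page order (zero-padded numeric prefix) and that pages before the first boundary have no label.

```python
from bisect import bisect_right


def parse_sections_text(text):
    """Parse sections.txt content into (first_page_filename, label) pairs."""
    boundaries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment/header line
        filename, label = line.split(None, 1)
        boundaries.append((filename, label.strip()))
    return boundaries


def label_for_page(boundaries, page_filename, default=None):
    """Return the label of the run that contains page_filename.

    Assumption: filenames compare in page order, so the governing boundary
    is the last one whose filename is <= page_filename.
    """
    starts = sorted(boundaries)
    names = [name for name, _ in starts]
    idx = bisect_right(names, page_filename) - 1
    return starts[idx][1] if idx >= 0 else default
```

With the three-line example above, a page such as 0100_p15020coll12:2538.jpg resolves to alphabetical, and any page from 0311 onward resolves to business. A draft produced by the new stage only has to satisfy this same lookup.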


Why this is worth doing (the payoff)

Separate NER prompts matter because a city directory stacks sections whose entry schemas genuinely differ — sometimes inverted:

| Section | Entry shape | Context model | Why one global prompt fails it |
|---|---|---|---|
| alphabetical (resident/business name list) | Surname, Given (spouse), occupation, employer, h/r/bds address: a person record, dense abbreviations | alphabetical letter (≈irrelevant to fields) | Green-Book-style STATE→CITY→CATEGORY prompt has no purchase here |
| classified (business / buyers’ guide) | business_name, address, phone under ALL-CAPS category headings | CATEGORY carries down | Maps almost directly onto prompts/examples/ner_prompt_greenbook.md |
| street (reverse directory) | house_number → occupant under a street heading: address-first, name-second | street heading + cross-streets | An inverse of the alphabetical prompt; a name-first prompt mis-segments it, and carry-forward context from the alphabetical section actively corrupts it |
| frontmatter / register | officials, churches, lodges: many micro-formats | | Usually excluded from extraction, not NER’d |

Value scales with schema divergence, not with boundary count. Two flavors of business list want one prompt; alphabetical-vs-street want two. Tulsa 1921 (mature 20th-c. Polk-style directory: alphabetical + classified + street + register, likely 4 runs) is a strong case. Hearne’s Brooklyn 1852 (earlier, flatter: register/ads + one big alphabetical list, likely 2 runs) is a weaker but real case. Display ads, if sprinkled one page at a time, are a per-page flag (is_advertisement), not a contiguous section — don’t force them into runs.

A second payoff beyond prompt quality: scoping. Knowing the front-matter/register run lets you exclude it from extraction (cleaner CSV, lower cost) the same way included_pages.txt already trims frontmatter.


Goal / non-goals

Goal. A pipeline detect-sections <DIR> stage that reads per-page OCR artifacts, classifies each page into a section type, smooths the per-page labels into contiguous runs, and writes:

  1. sections_report.csv — one row per page: predicted label, confidence, and every feature (so a human can eyeball and tune).
  2. sections_draft.txt — the existing sections.txt format, boundaries only, with a # DRAFT — review then rename to sections.txt header.
  3. (optional) sections_review.html — self-contained thumbnails-per-boundary report, mirroring tools/review_ocr.py.
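
The smooth-then-emit step could look roughly like the sketch below. It is one possible reading, not the planned implementation: the helper names (smooth_labels, to_runs, merge_short_runs, draft_text) are hypothetical, the majority-vote filter is a swapped-in technique, and the meanings given to MIN_RUN and SMOOTH_WINDOW are assumptions. The header line uses a plain hyphen as a placeholder for the DRAFT header wording.

```python
from collections import Counter

MIN_RUN = 3        # assumed meaning: shortest run kept as its own section
SMOOTH_WINDOW = 3  # assumed meaning: odd window size for the majority vote


def smooth_labels(labels, window=SMOOTH_WINDOW):
    """Majority-vote each page's label over a centered window."""
    half = window // 2
    out = []
    for i, label in enumerate(labels):
        votes = Counter(labels[max(0, i - half): i + half + 1])
        best, count = votes.most_common(1)[0]
        # On ties keep the page's own label so the filter stays conservative.
        out.append(label if votes[label] == count else best)
    return out


def to_runs(labels):
    """Collapse per-page labels into [start_index, label, length] runs."""
    runs = []
    for i, label in enumerate(labels):
        if runs and runs[-1][1] == label:
            runs[-1][2] += 1
        else:
            runs.append([i, label, 1])
    return runs


def merge_short_runs(runs, min_run=MIN_RUN):
    """Fold runs shorter than min_run into the preceding run."""
    merged = []
    for start, label, length in runs:
        if merged and (length < min_run or merged[-1][1] == label):
            merged[-1][2] += length
        else:
            merged.append([start, label, length])
    return merged


def draft_text(filenames, labels):
    """Render run boundaries in the sections.txt format with a DRAFT header."""
    runs = merge_short_runs(to_runs(smooth_labels(labels)))
    lines = ["# DRAFT - review then rename to sections.txt"]
    lines += [f"{filenames[start]}  {label}" for start, label, _ in runs]
    return "\n".join(lines) + "\n"
```

A single stray street page inside a long alphabetical run is voted back to alphabetical, so the draft lists only the boundaries a reviewer needs to confirm.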

Non-goals.


Where it sits: the two-pass calibration workflow

Detection needs OCR text, and per-section prompt generation needs detected sections, so a sectioned volume is calibrated in two passes (this is expected and matches the existing calibrate-once pattern):

Pass 1 (calibrate):
  download → (surya/gemini) OCR → [align] → detect-sections → sections_draft.txt
                                              │
                                       human reviews / promotes → sections.txt
                                              │
                             generate-prompts --sections   → ner_prompt_{label}.md ×N
Pass 2 (run):
  extract-entries --sections sections.txt   → per-section routing + context resets

Place the StageDef after align_ocr and before extract_entries in stages.py STAGES order. Like detect_columns, it is opt-in (only runs when --detect-sections is passed); it is not added to the default run/guided chains, so automated runs are unaffected. It writes a draft artifact and changes nothing about extraction unless the user later passes --sections.


Input tiers (degrade gracefully)

Read the richest artifact available per page; fall back cleanly:

  1. {stem}_{model}_aligned.json (best) — lines[].bbox (geometry) + gemini_text (clean text). All features available. Parse with the same shape align_ocr.py writes (see its output schema in docs/pipeline-stages.md).
  2. {stem}_surya.json: lines[].bbox + text (uncorrected, fine for structure). All geometry + text features available.
  3. {stem}_{model}.txt (Gemini plain text) — text features + line-count density only; no per-line geometry. For col_count, fall back to columns_report.csv if present (it is produced by detect-columns/surya-detect, which run on images independent of OCR — a free geometry signal even in this tier).

Auto-detect the model slug exactly as the other scripts do (utils/models.py discover_ocr_slug() + pipeline_state.json); accept --model override. Respect included_pages.txt scope if present (see tools/review_ocr.py:_load_scope).
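
The fallback order can be sketched as a simple existence check per page. This is illustrative only: pick_input_tier is a hypothetical name, the artifacts are assumed to sit directly in the volume directory, and the model slug is passed in rather than discovered via utils/models.py.

```python
from pathlib import Path


def pick_input_tier(volume_dir, stem, model):
    """Return (tier, path) for the richest artifact present, else (0, None)."""
    root = Path(volume_dir)
    candidates = [
        (1, root / f"{stem}_{model}_aligned.json"),  # geometry + clean text
        (2, root / f"{stem}_surya.json"),            # geometry + raw text
        (3, root / f"{stem}_{model}.txt"),           # text only, no geometry
    ]
    for tier, path in candidates:
        if path.is_file():
            return tier, path
    return 0, None  # nothing usable for this page
```

Returning the tier number alongside the path lets the feature extractor record which features were unavailable for a page instead of silently reporting zeros.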


Per-page features (the meat)

Put these in importable pure functions so they unit-test without images/network (mirror how tests/test_align_parse.py imports parse_surya from align_ocr.py).

Geometry (needs bbox; tier 1–2, or columns_report.csv for col_count):

Text (needs line text; all tiers):

Keep all thresholds as named module constants (repo convention — cf. fix_entries.py:_INFER_RULES), so Phase 0 can tune them in one place.
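
As a rough illustration of the pure-function style, the text features named in Phase 0 (pct_digit_leading, pct_allcaps_short, alpha_monotone_frac) might be computed as below. The definitions are assumptions made for this sketch, since the plan does not spell them out, and ALLCAPS_MAX_WORDS is a hypothetical constant.

```python
import re

ALLCAPS_MAX_WORDS = 4  # hypothetical: a "short" heading has at most this many words


def _non_empty(lines):
    """Strip lines and drop the empty ones."""
    return [t.strip() for t in lines if t.strip()]


def pct_digit_leading(lines):
    """Fraction of non-empty lines whose first character is a digit."""
    texts = _non_empty(lines)
    if not texts:
        return 0.0
    return sum(t[0].isdigit() for t in texts) / len(texts)


def pct_allcaps_short(lines, max_words=ALLCAPS_MAX_WORDS):
    """Fraction of non-empty lines that are short and entirely upper-case."""
    texts = _non_empty(lines)
    if not texts:
        return 0.0
    hits = sum(t.isupper() and len(t.split()) <= max_words for t in texts)
    return hits / len(texts)


def alpha_monotone_frac(lines):
    """Fraction of adjacent line pairs whose leading words are in A-Z order."""
    keys = []
    for t in lines:
        m = re.match(r"[A-Za-z]+", t.strip())
        if m:
            keys.append(m.group(0).lower())
    if len(keys) < 2:
        return 0.0
    ordered = sum(a <= b for a, b in zip(keys, keys[1:]))
    return ordered / (len(keys) - 1)
```

Each function takes a plain list of strings, so a test can feed it a handful of hand-written lines with no images, OCR output, or network access.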


Label taxonomy

The controlled vocabulary the stage emits (these become ner_prompt_{label}.md suffixes, so keep them short and filename-safe):

frontmatter · alphabetical · classified · street · advertisements · unknown

Starting classification rules (calibrate in Phase 0 — these are hypotheses, not law):

Score each candidate; confidence = top_score - second_score (margin). Low margin → flag for human review and/or model adjudication (Phase 5).
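
The margin rule is small enough to sketch directly. The function name pick_label and the review_margin threshold are hypothetical; only the "top minus second" definition of confidence comes from the plan.

```python
def pick_label(scores, review_margin=0.15):
    """Return (label, confidence, needs_review) from per-label scores.

    confidence is the margin between the best and second-best score;
    review_margin is a hypothetical threshold below which a page is flagged.
    """
    if not scores:
        return "unknown", 0.0, True
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    confidence = top_score - second_score
    return top_label, confidence, confidence < review_margin
```

A margin has the useful property that a page scoring high on two labels at once is reported as uncertain, which a raw top score would hide.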


Phases

Phase 0 — Validate & calibrate (read-only; do this first, ~1–2 hrs)

Write scripts/dump_section_features.py (or a notebook cell) that runs the feature extractor only over a volume dir and prints/plots the per-page series: col_count, density (n_lines vs neighbor median), pct_digit_leading, pct_allcaps_short, alpha_monotone_frac, plus the first ~3 lines of each page as a content sniff. Run on:

Gate: the section runs should be visually obvious in those series. If they are, the deterministic path is confirmed and the model is just an adjudicator. If they are not cleanly separable, lean harder on Phase 5 and report back before wiring. This script is reused verbatim as the stage’s feature extractor — not throwaway.
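
For the density series, one way to express "n_lines vs neighbor median" is a per-page ratio, sketched here. The function name density_ratio and the radius parameter are assumptions, not part of the plan.

```python
from statistics import median


def density_ratio(line_counts, radius=2):
    """For each page, n_lines divided by the median n_lines of nearby pages.

    radius is a hypothetical parameter: how many pages on each side count
    as neighbors. The page itself is excluded from its own baseline.
    """
    ratios = []
    for i, n in enumerate(line_counts):
        lo = max(0, i - radius)
        hi = min(len(line_counts), i + radius + 1)
        neighbors = line_counts[lo:i] + line_counts[i + 1:hi]
        base = median(neighbors) if neighbors else 0
        # Fall back to a neutral 1.0 when there is no usable baseline.
        ratios.append(n / base if base else 1.0)
    return ratios
```

A sparse page between dense neighbors (a divider or full-page ad, say) shows up as a ratio well below 1, while pages inside a uniform run stay near 1.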

Phase 1 — Feature extractor module

pipeline/detect_sections.py with importable pure functions:

Phase 2 — Classify, smooth, emit

Phase 3 — (optional) HTML review report

sections_review.html, self-contained, mirroring tools/review_ocr.py: a per-page label strip (color per label) + a thumbnail card for each detected boundary page so the user can confirm/correct the first-page-of-run picks at a glance. This is the human-confirmation surface that makes a 90%-right draft a 30-second edit.

Phase 4 — Wire as an opt-in stage

Mirror detect_columns at every touchpoint (grep detect_columns / detect-columns to find them all — current list):

Phase 5 — (optional) Model adjudication at the margins

For pages with low classification margin or near a candidate boundary only, escalate to a section_type enum classification. Two interchangeable backends, gated so the deterministic core never depends on either:

Phase 6 — (optional) Per-section prompt generation polish

Confirm generate_prompt.py --sections sections.txt samples pages from each run and emits ner_prompt_{label}.md (+ ocr_prompt_{label}.md) well for these volumes; tune the per-section meta-prompt if the street/reverse section needs an address-first schema hint. Mostly exists — verify and refine, don’t rebuild.


Verification ground rules

Tests to add

File / function map

pipeline/detect_sections.py
  PageInput                      # dataclass: filename, idx, lines[{bbox,text}], page_w, page_h
  load_page_text_and_geometry()  # tiered loader: aligned.json > surya.json > .txt (+columns_report.csv)
  extract_page_features()        # -> dict of the features above
  classify_page()                # -> (label, confidence)
  smooth_runs()                  # per-page labels -> contiguous runs
  emit_boundaries()              # runs -> [(first_page_filename, label)]
  write_report()                 # sections_report.csv
  write_draft()                  # sections_draft.txt (never sections.txt)
  build_review_html()            # (Phase 3) sections_review.html
  main()                         # CLI mirroring detect_columns.py

scripts/dump_section_features.py # (Phase 0) read-only feature dump; reuses the extractor

tests/test_detect_sections.py    # pure-function unit tests

Constants (one block, top of module): TALL_LINE_RATIO=1.4, CENTERED_TOL=0.10, STREET_DIGIT_FRAC=0.45, CLASSIFIED_CAPS_FRAC=0.12, ALPHA_MONOTONE_FRAC=0.6, MIN_RUN=3, SMOOTH_WINDOW=3, … (all tuned in Phase 0).
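
A possible shape for the top of the module, assuming plain dicts for lines and numeric page dimensions. The field types, the defaults, and the has_geometry helper are illustrative assumptions; the constant values are the starting points listed above.

```python
from dataclasses import dataclass, field

# Starting values from the plan; all are expected to be tuned in Phase 0.
TALL_LINE_RATIO = 1.4
CENTERED_TOL = 0.10
STREET_DIGIT_FRAC = 0.45
CLASSIFIED_CAPS_FRAC = 0.12
ALPHA_MONOTONE_FRAC = 0.6
MIN_RUN = 3
SMOOTH_WINDOW = 3


@dataclass
class PageInput:
    """One page as seen by the feature extractor."""
    filename: str
    idx: int
    lines: list = field(default_factory=list)  # dicts with "bbox" and "text"
    page_w: float = 0.0
    page_h: float = 0.0

    @property
    def has_geometry(self) -> bool:
        """True when at least one line carries a bbox (hypothetical helper)."""
        return any(line.get("bbox") for line in self.lines)
```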

Risks & caveats


Session log

(Append one entry per working session: date, phase touched, what landed, what’s next.)