directory-pipeline

Turn a public digital archive URL into a structured, browsable CSV — no manual transcription, no custom code per collection type.

Give it a URL from the Library of Congress, Internet Archive, or any institution that publishes a public IIIF manifest (CONTENTdm repositories included). It downloads the scans, OCRs them, and extracts entries into a structured CSV. With the enrichment steps, every row links back to the exact location in the original scan.

Built for digitized historical directories — city directories, gazetteers, trade directories — but works on just about any historical document with regular entry-like structure.

Interactive data explorer with facet filters, field distribution charts, and a detail panel with IIIF source thumbnail

The auto-generated data explorer. Categorical facets generated from the data, full-text search, IIIF page thumbnails, and a “View in source document” deep link for every row.


Quick start

Requires Python 3.11+ and uv.

uv sync        # installs dependencies and the `pipeline` command

# One-time calibration for a new collection type — generates OCR and NER
# prompts that will work for any item with the same entry structure
pipeline ingest    https://archive.org/details/ldpd_11290437_000/
pipeline calibrate https://archive.org/details/ldpd_11290437_000/

# Automated run — produces the entries CSV + browsable HTML explorer
pipeline run       https://archive.org/details/ldpd_11290437_000/

# Or the human-in-the-loop version of the same run: adds page scoping,
# #xywh= bounding boxes, and interactive alignment review (needs GPU or Apple Silicon)
pipeline guided    https://archive.org/details/ldpd_11290437_000/

Calibrate once, run many. pipeline calibrate is a one-time step per collection type: it opens a browser UI where you pick 4–10 representative pages, then has Gemini analyze them and write tailored OCR and extraction prompts. For any additional volume in the same series, point to the first volume’s NER prompt and skip calibration entirely:

pipeline run https://archive.org/details/ldpd_11290437_001/ \
  --ner-prompt output/ldpd_11290437_000/ner_prompt.md
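For a multi-volume series, the same command can be generated for each volume from a short script. The sketch below only builds and prints the command lines. The helper name `run_commands` is illustrative and not part of the pipeline, and the volume list should hold whatever additional volumes exist in your series.

```python
import shlex

# Prompt written by the one-time calibration of the first volume
# (path taken from the example above).
NER_PROMPT = "output/ldpd_11290437_000/ner_prompt.md"

# Additional volumes in the same series; extend this list as needed.
VOLUME_URLS = [
    "https://archive.org/details/ldpd_11290437_001/",
]


def run_commands(volume_urls, ner_prompt):
    """Build one `pipeline run` argument list per volume, reusing one NER prompt."""
    return [
        ["pipeline", "run", url, "--ner-prompt", ner_prompt]
        for url in volume_urls
    ]


commands = run_commands(VOLUME_URLS, NER_PROMPT)
for cmd in commands:
    # Print a copy-pasteable shell line; nothing is executed here.
    print(shlex.join(cmd))
```

To launch the runs instead of printing them, each list can be passed to `subprocess.run(cmd, check=True)`, which stops at the first failing volume.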

Requires GEMINI_API_KEY. The pipeline can run on the free tier, with no billing required for collections up to ~150 pages. Flex inference (~50% cheaper API calls, 1–15 min latency per request, best-effort availability) is on by default; pass --no-flex for time-sensitive runs.


What you get

A CSV where every row is one extracted entry. Field names are driven entirely by your NER prompt — no code changes necessary for a new document or volume of the same type. An example entry from a travel guide:

name | address | city | state | category | canvas_fragment
Mrs. Simmons Tourist Home | 418 Johnson St | Augusta | GA | Tourist Home | https://...#xywh=142,890,1240,68

The canvas_fragment column is a IIIF URI pointing back to the source scan. With the precision upgrade, it includes a #xywh= bounding box pinpointing the exact line. The data explorer and map use this to link directly to the highlighted entry in the original document.
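Downstream code can read the CSV with the standard library and split each canvas_fragment into its canvas URI and bounding box. The sketch below is not the pipeline's own code: the sample row is modeled on the example entry, the example.org canvas URL is a placeholder, and `Box` and `split_canvas_fragment` are illustrative names. It assumes the fragment uses the plain `xywh=x,y,w,h` form with integer pixel values.

```python
import csv
import io
from typing import NamedTuple, Optional
from urllib.parse import urldefrag

# Hypothetical one-row CSV modeled on the example entry; the canvas URL is a
# placeholder. The fragment contains commas, so the field is quoted in the CSV.
SAMPLE_CSV = (
    "name,address,city,state,category,canvas_fragment\n"
    "Mrs. Simmons Tourist Home,418 Johnson St,Augusta,GA,Tourist Home,"
    '"https://example.org/iiif/canvas/p12#xywh=142,890,1240,68"\n'
)


class Box(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def split_canvas_fragment(uri: str) -> tuple[str, Optional[Box]]:
    """Return (canvas URI, Box), or (canvas URI, None) without an xywh fragment."""
    canvas, fragment = urldefrag(uri)
    if not fragment.startswith("xywh="):
        return canvas, None
    x, y, w, h = (int(v) for v in fragment[len("xywh="):].split(","))
    return canvas, Box(x, y, w, h)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
canvas, box = split_canvas_fragment(rows[0]["canvas_fragment"])
print(rows[0]["name"], canvas, box)
```

Rows produced without the precision upgrade carry no `#xywh=` fragment, which is why the function returns `None` for the box in that case rather than raising.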

Alongside the extracted data CSV, pipeline run also generates a self-contained HTML data explorer (shown above). With geocoding enabled on materials with address fields, you also get:

Leaflet map with clustered markers, category filter sidebar, and a popup with IIIF source thumbnail

Markers clustered and color-coded by category. Popups include a IIIF thumbnail fetched directly from the source institution’s image server.
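Thumbnails and line crops like these can be requested from any server that implements the IIIF Image API, whose URLs follow the pattern `{service}/{region}/{size}/{rotation}/{quality}.{format}`. The sketch below shows how such URLs are formed; it is not taken from the pipeline. The service base URL is a placeholder, the function names are illustrative, and it assumes the server supports Image API 2.1 or 3.0 and that canvas coordinates match the image's pixel coordinates.

```python
def thumbnail_url(image_service: str, width: int = 200) -> str:
    """Whole page, scaled to the given width (height follows the aspect ratio)."""
    base = image_service.rstrip("/")
    return f"{base}/full/{width},/0/default.jpg"


def crop_url(image_service: str, x: int, y: int, w: int, h: int) -> str:
    """Only the region x,y,w,h, at the largest size the server will deliver."""
    base = image_service.rstrip("/")
    return f"{base}/{x},{y},{w},{h}/max/0/default.jpg"


# Placeholder image service; a real one is listed in the IIIF manifest.
SERVICE = "https://example.org/iiif/image/p12"

print(thumbnail_url(SERVICE))
print(crop_url(SERVICE, 142, 890, 1240, 68))
```

The crop coordinates here are the ones from the `#xywh=` fragment in the example entry, so the second URL would return just that line of the page.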


Examples

Two published collections built with this pipeline:

Tulsa 1921 city directory explorer: entry detail with a crop of the source scan highlighting the matched line

Green Book explorer: faceted entry list with a detail panel showing cross-year listings and a location map

Entry detail views from the published explorers. Left: each Tulsa entry renders the exact line from the source scan via its canvas_fragment URI. Right: a Green Book establishment with its listing history across volumes.

Both explorers received additional front-end design work beyond what the pipeline generates. The pipeline produces the entry CSVs, IIIF manifests, and a baseline HTML explorer; these published sites build on that output.


Installation

Requires Python 3.11+ and uv.

uv sync                  # core: Gemini OCR + entry extraction
uv sync --extra gpu      # add Surya OCR (GPU or Apple Silicon recommended)
uv sync --extra geo      # add geocoding + map generation
uv sync --all-extras     # everything

This installs the pipeline command (run pipeline --help for all subcommands):

pipeline run    <URL>          # automated: download → OCR → extract → explore
pipeline guided <URL>          # human-in-loop: page selection + alignment review
pipeline ingest <URL>          # download only
pipeline calibrate <URL|DIR>   # select sample pages + generate prompts (once per collection type)
pipeline ocr    <DIR>          # Surya OCR + Gemini OCR + align bboxes
pipeline review  <DIR>         # interactive alignment review (browser UI)
pipeline extract <DIR>         # NER extraction + explorer
pipeline geo     <DIR>         # geocode entries + build map (needs address fields)
pipeline postprocess <DIR>     # fix + combine volumes + rebuild explorer

Each subcommand wraps the underlying python main.py <URL> [flags] stage interface — see docs/pipeline-stages.md for the flag-level reference and every artifact each stage produces.

Set your API keys (or copy .env.template to .env):

export GEMINI_API_KEY=your_key_here
export GOOGLE_MAPS_API_KEY=your_key_here   # optional; enables address-level geocoding
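
A quick preflight check can confirm the keys are visible before a long run starts. This is a minimal sketch, not part of the pipeline: the `missing` helper is an illustrative name, and it only inspects the process environment, so keys that live solely in a .env file will show as missing unless something has loaded that file first.

```python
import os

REQUIRED_KEYS = ("GEMINI_API_KEY",)
OPTIONAL_KEYS = ("GOOGLE_MAPS_API_KEY",)  # only needed for address-level geocoding


def missing(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if missing(REQUIRED_KEYS):
    print("Missing required keys:", ", ".join(missing(REQUIRED_KEYS)))
if missing(OPTIONAL_KEYS):
    print("Optional keys not set:", ", ".join(missing(OPTIONAL_KEYS)))
```
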

Going further

Goal | Command
Add spatial bounding boxes to every row | pipeline ocr output/<vol>/ → details
Interactively fix unmatched lines | pipeline review output/<vol>/
Improve accuracy on complex layout-dependent materials | pipeline extract output/<vol>/ --mode multimodal → details
Geocode entries and build a map | pipeline geo output/<vol>/
Full pipeline with page scoping + alignment review | pipeline guided <URL>
Clean + merge volumes after extraction | pipeline postprocess output/<collection>/
Use pipeline pieces from a notebook or script | from pipeline.api import iter_canvases, … → details
Export W3C/IIIF annotations | python -m pipeline.iiif.export_annotations

Multimodal extraction

By default extraction sends the OCR text to Gemini. Adding --mode multimodal also sends the page image, which lets the model see section headers, column boundaries, and layout cues that are often lost after OCR normalization:

pipeline extract output/<vol>/ --mode multimodal

This is most valuable in three cases: materials where geographic or thematic section headings fall mid-page (the model can see the heading rather than relying on text order), multi-column layouts where reading order is ambiguous, and collections where state or category context shifts frequently within a page. In testing on Green Book volumes, it eliminated the mid-page geographic attribution errors that occurred in text-only mode.

The cost increase is negligible — each page image is resized to ≤768 px and counts as one tile (~258 input tokens), adding roughly $0.00006 per page at standard rates. See docs/costs.md for a full breakdown.


Precision upgrade

The core pipeline run path gives you a canvas URI per row. Adding Surya OCR and alignment upgrades every canvas_fragment to a #xywh= bounding box — the exact line on the page, usable by any IIIF viewer:

pipeline ocr    output/<vol>/    # Surya bboxes + alignment (requires GPU or Apple Silicon)
pipeline review output/<vol>/    # optional: fix unmatched lines interactively

NW alignment result drawn on a source page — orange bounding boxes on matched lines, unmatched Gemini lines listed in the margin in red
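
"NW alignment" presumably refers to Needleman-Wunsch global sequence alignment, used here to pair Gemini's text lines with the lines Surya detected on the page. The sketch below illustrates that idea under that assumption. It is not the pipeline's implementation: the similarity function, the gap penalty, and the sample lines are all illustrative choices.

```python
from difflib import SequenceMatcher

GAP = -0.5  # assumed penalty for leaving a line without a partner


def similarity(a: str, b: str) -> float:
    """Map difflib's 0..1 ratio onto -1..1 so poor matches score below zero."""
    return 2.0 * SequenceMatcher(None, a, b).ratio() - 1.0


def align(gemini_lines, surya_lines):
    """Globally align two line sequences in reading order.

    Returns (gemini_index, surya_index) pairs; None marks an unmatched line.
    """
    n, m = len(gemini_lines), len(surya_lines)
    sim = [[similarity(g, s) for s in surya_lines] for g in gemini_lines]

    # Fill the dynamic-programming table of best cumulative scores.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + GAP
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # pair the two lines
                score[i - 1][j] + GAP,                    # Gemini line unmatched
                score[i][j - 1] + GAP,                    # Surya line unmatched
            )

    # Trace back from the bottom-right corner to recover the pairing.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + GAP):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


# Made-up lines: Surya's text is slightly noisy and misses the middle line.
gemini = ["Smith John, grocer", "Handwritten note", "Taylor Ann, teacher"]
surya = ["Smith John, grocor", "Taylor Ann. teacher"]

pairs = align(gemini, surya)
unmatched_gemini = [g for g, s in pairs if s is None]
print(pairs, unmatched_gemini)
```

Under this reading, a matched pair would let a Gemini line inherit the bounding box of its Surya partner, and the unmatched Gemini lines are the ones left for interactive review.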


Costs

A single ~80-page volume costs roughly $0.30 in Gemini API charges with --flex (the default) or $0.60 at standard rates, and can run within the free tier’s daily quota (~15–20 minutes at 15 RPM). See docs/costs.md for a full breakdown including platform costs (Surya OCR on Mac, Colab, and GPU).
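
The figures quoted in this README can be turned into per-page numbers with a few lines of arithmetic. The sketch below uses only the approximate values stated above (80 pages, $0.30 with flex, $0.60 at standard rates, and $0.00006 per page for multimodal images), so the results are rough estimates and not billing figures.

```python
PAGES = 80                     # approximate volume size quoted above
COST_FLEX = 0.30               # USD per volume with flex inference
COST_STANDARD = 0.60           # USD per volume at standard rates
MULTIMODAL_PER_PAGE = 0.00006  # USD added per page by --mode multimodal

per_page_flex = COST_FLEX / PAGES
per_page_standard = COST_STANDARD / PAGES
flex_saving = 1 - COST_FLEX / COST_STANDARD

# Extra cost of sending page images for a whole volume, and its share
# of the standard-rate bill.
multimodal_per_volume = MULTIMODAL_PER_PAGE * PAGES
multimodal_share = multimodal_per_volume / COST_STANDARD

print(f"per page: ${per_page_flex:.5f} flex, ${per_page_standard:.5f} standard")
print(f"flex saving: {flex_saving:.0%}")
print(f"multimodal: ${multimodal_per_volume:.4f} per volume ({multimodal_share:.1%})")
```

On these numbers, multimodal extraction adds well under 1% to the cost of a volume, which is consistent with describing the increase as negligible.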


Docs

Key design decisions

See docs/key-design-decisions.md for full technical notes.


Prior work and inspirations

See docs/prior-work.md for full annotated citations.