Turn a public digital archive URL into a structured, browsable CSV — no manual transcription, no custom code per collection type.
Give it a URL from the Library of Congress, Internet Archive, NYPL Digital Collections, or any institution that publishes a public IIIF manifest. It downloads the scans, OCRs them, and extracts entries into a structured CSV. With the enrichment steps, every row links back to the exact location in the original scan.
Built for digitized historical directories (city directories, gazetteers, trade directories), but works on just about any historical document with a regular, entry-like structure.

The auto-generated data explorer. Categorical facets generated from the data, full-text search, IIIF page thumbnails, and a “View in source document” deep link for every row.
# One-time calibration steps for a new collection type — generates OCR and NER prompts that will work for any item with the same entry structure
python main.py https://archive.org/details/ldpd_11290437_000/ --download
python main.py https://archive.org/details/ldpd_11290437_000/ --select-pages
python main.py https://archive.org/details/ldpd_11290437_000/ --generate-prompts
# Automated shortcut: --extract produces the entries CSV + browsable HTML explorer (also used for any subsequent volumes in the same series)
python main.py https://archive.org/details/ldpd_11290437_000/ --extract
Calibrate once, run many. The first three commands are a one-time step per collection type: --select-pages opens a browser UI where you pick 4–10 representative pages; --generate-prompts has Gemini analyze them and write tailored OCR and extraction prompts. For any additional volume in the same series, point to the relevant NER prompt, and skip calibration entirely:
python main.py https://archive.org/details/ldpd_11290437_001/ --extract \
--ner-prompt output/ldpd_11290437_000/ner_prompt.md
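To process several volumes of one series with the same calibrated prompt, the command above can be wrapped in a short loop. This is a minimal sketch, not part of the project: the `_002` volume URL and the helper names `build_extract_command` and `run_series` are hypothetical, and it assumes you run it from the repository root where `main.py` lives.

```python
import subprocess
import sys

# Prompt produced by the one-time calibration run shown above.
NER_PROMPT = "output/ldpd_11290437_000/ner_prompt.md"

# Hypothetical list of later volumes in the same series.
VOLUME_URLS = [
    "https://archive.org/details/ldpd_11290437_001/",
    "https://archive.org/details/ldpd_11290437_002/",
]


def build_extract_command(url: str, ner_prompt: str) -> list[str]:
    """Return the argv list for one --extract run that reuses an NER prompt."""
    return [sys.executable, "main.py", url, "--extract", "--ner-prompt", ner_prompt]


def run_series(urls: list[str], ner_prompt: str) -> None:
    """Run each volume in turn and stop at the first failing command."""
    for url in urls:
        subprocess.run(build_extract_command(url, ner_prompt), check=True)


# Call run_series(VOLUME_URLS, NER_PROMPT) to start the batch.
```

Running volumes sequentially keeps the request rate predictable, which matters if you are staying inside the free tier quota.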
Requires GEMINI_API_KEY. Can run on the free tier — no billing required for collections up to ~150 pages.
A CSV where every row is one extracted entry. Field names are driven entirely by your NER prompt — no code changes necessary for a new document or volume of the same type:
| name | address | city | state | category | canvas_fragment |
|---|---|---|---|---|---|
| Mrs. Simmons Tourist Home | 418 Johnson St | Augusta | GA | Tourist Home | https://...#xywh=142,890,1240,68 |
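The idea of prompt-driven columns can be illustrated with a small sketch. It is not the code in `extract_entries.py`; it only shows one way to derive CSV columns from whatever keys the extracted entries contain. The function names and the second sample row are made up for illustration.

```python
import csv
import io


def infer_columns(entries: list[dict]) -> list[str]:
    """Collect field names across all entries, preserving first-seen order."""
    columns: dict[str, None] = {}
    for entry in entries:
        for key in entry:
            columns.setdefault(key, None)
    return list(columns)


def entries_to_csv(entries: list[dict]) -> str:
    """Serialize entries to CSV text; fields missing from a row become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=infer_columns(entries),
        restval="",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


entries = [
    {"name": "Mrs. Simmons Tourist Home", "city": "Augusta", "state": "GA"},
    {"name": "Example Hotel", "city": "Augusta", "category": "Hotel"},  # made-up row
]
print(entries_to_csv(entries))
```

Because the header is computed from the data, a prompt that adds or renames a field changes the CSV without touching the writer.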
The canvas_fragment column is a IIIF URI pointing back to the source scan. With the precision upgrade, it includes a #xywh= bounding box pinpointing the exact line. The data explorer and map use this to link directly to the highlighted entry in the original document.
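If you want to consume these URIs in your own tooling, the fragment can be split into the canvas URI and its pixel region. The sketch below uses only the standard library; the `Region` type, the function name, and the `example.org` canvas URI are illustrative assumptions, while the `xywh=` syntax follows the W3C Media Fragments convention used by IIIF.

```python
from typing import NamedTuple, Optional
from urllib.parse import urldefrag


class Region(NamedTuple):
    """Bounding box in image pixel coordinates."""
    x: int
    y: int
    w: int
    h: int


def split_canvas_fragment(uri: str) -> tuple[str, Optional[Region]]:
    """Split a canvas_fragment value into (canvas URI, region or None)."""
    canvas, fragment = urldefrag(uri)
    if not fragment.startswith("xywh="):
        return canvas, None  # plain canvas URI without a bounding box
    value = fragment[len("xywh="):]
    if value.startswith("pixel:"):  # optional unit prefix allowed by the spec
        value = value[len("pixel:"):]
    x, y, w, h = (int(part) for part in value.split(","))
    return canvas, Region(x, y, w, h)


canvas, region = split_canvas_fragment(
    "https://example.org/iiif/canvas/p12#xywh=142,890,1240,68"
)
print(canvas, region)
```

Rows produced by the core path (no bounding box) come back with `None` as the region, so the same function handles both cases.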
Alongside the extracted data CSV, --extract also generates a self-contained HTML data explorer (shown above). With --geocode --map run on materials with address fields, you also get:

Markers clustered and color-coded by category. Popups include a IIIF thumbnail fetched directly from the source institution’s image server.
Requires Python 3.11+ and uv.
uv sync # core: Gemini OCR + entry extraction
uv sync --extra gpu # add Surya OCR (GPU or Apple Silicon recommended)
uv sync --extra geo # add geocoding + map generation
uv sync --all-extras # everything
Set your API keys (or copy .env.template to .env):
export GEMINI_API_KEY=your_key_here
export NYPL_API_TOKEN=your_token_here # only needed for NYPL API access
export GOOGLE_MAPS_API_KEY=your_key_here # optional; enables address-level geocoding
| Goal | Flags / command |
|---|---|
| Add spatial bounding boxes to every row | --surya-ocr --align-ocr |
| Interactively fix unmatched lines | --review-alignment |
| Geocode entries and build a map | --geocode --map |
| Full pipeline with page scoping + alignment review | --guided |
| Export W3C/IIIF annotations | pipeline/iiif/export_annotations.py |
The core path (--download --gemini-ocr --extract-entries) gives you a canvas URI per row. Adding --surya-ocr --align-ocr upgrades every canvas_fragment to a #xywh= bounding box — the exact line on the page, usable by any IIIF viewer:
python main.py URL --surya-ocr --align-ocr
python main.py URL --review-alignment # optional: fix unmatched lines interactively

A single 100-page volume costs roughly $0.60 in Gemini API charges with a paid account but can run within the free tier’s daily quota (~15–20 minutes at 15 RPM). See docs/costs.md for a full breakdown including platform costs (Surya OCR on Mac, Colab, and GPU).
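The figures above can be turned into a rough planning estimate for other volume sizes. This is back-of-the-envelope arithmetic only: it assumes the run is bound by the request rate limit and that time and cost scale linearly with page count, neither of which is confirmed by the project docs. The function name is hypothetical.

```python
# Figures quoted above for a 100-page volume.
REFERENCE_PAGES = 100
REFERENCE_COST_USD = 0.60
REFERENCE_MINUTES = (15, 20)
RPM = 15  # free tier requests per minute

# Implied request counts if the run is purely rate-limit bound.
implied_requests = tuple(m * RPM for m in REFERENCE_MINUTES)
requests_per_page = tuple(r / REFERENCE_PAGES for r in implied_requests)


def estimate(pages: int) -> dict:
    """Linearly scale the reference figures to a volume of `pages` pages."""
    low, high = (pages * rpp / RPM for rpp in requests_per_page)
    return {
        "minutes_low": low,
        "minutes_high": high,
        "paid_cost_usd": pages * REFERENCE_COST_USD / REFERENCE_PAGES,
    }


print(implied_requests, requests_per_page)
print(estimate(150))
```

Under these assumptions the quoted timing corresponds to roughly two to three API requests per page; treat the output as an order-of-magnitude guide and defer to docs/costs.md for actual numbers.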
extract_entries.py hard-codes nothing. The NER prompt defines all field names; CSV columns are inferred dynamically, with no code changes for a new collection type.
canvas_fragment (#xywh=) URI in natural image pixel coordinates, directly consumable by IIIF viewers and annotation tools.
--iiif-csv and --download accept any public IIIF Presentation v2 or v3 manifest URL, not just NYPL/LoC/IA. IIIF Collection manifests are enumerated automatically, writing one CSV row per child manifest.
See docs/key-design-decisions.md for full technical notes.
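Supporting both Presentation API versions mostly comes down to where canvases and child manifests sit in the JSON. The sketch below is not the project's loader; it shows the structural difference on already-parsed manifests, with hypothetical function names and `example.org` identifiers. It does not handle the v2.1 `members` property or nested collections.

```python
def canvas_ids(manifest: dict) -> list[str]:
    """Return canvas URIs from a parsed IIIF Presentation v2 or v3 manifest."""
    if "items" in manifest:
        # v3: canvases are listed directly under "items" and use "id"/"type".
        return [c["id"] for c in manifest["items"] if c.get("type") == "Canvas"]
    # v2: canvases are nested under "sequences" and use "@id".
    ids: list[str] = []
    for sequence in manifest.get("sequences", []):
        ids.extend(c["@id"] for c in sequence.get("canvases", []))
    return ids


def child_manifest_ids(collection: dict) -> list[str]:
    """Return child manifest URIs from a parsed v2 or v3 IIIF Collection."""
    if "items" in collection:
        # v3: children are "items" with type "Manifest".
        return [i["id"] for i in collection["items"] if i.get("type") == "Manifest"]
    # v2: children are listed under "manifests".
    return [m["@id"] for m in collection.get("manifests", [])]


v2_manifest = {"sequences": [{"canvases": [{"@id": "https://example.org/canvas/1"}]}]}
v3_manifest = {"items": [{"id": "https://example.org/canvas/1", "type": "Canvas"}]}
print(canvas_ids(v2_manifest) == canvas_ids(v3_manifest))
```

Each URI returned by `child_manifest_ids` would correspond to one CSV row when a Collection is enumerated.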
align_ocr.py. (Int. J. Digital Libraries)
See docs/prior-work.md for full annotated citations.