directory-pipeline

Plan: Run directory-pipeline entirely on AWS

Status: Planned — decisions captured; implementation deferred.

Context

The directory-pipeline currently depends on two Google services: Gemini (multimodal OCR, NER extraction, and prompt generation — all via the google-genai SDK) and Google Maps (geocoding). Gemini has no AWS-hosted offering, so an upcoming AWS-exclusive project cannot run the pipeline as-is. This plan ports the pipeline to run entirely on AWS infrastructure, replacing Gemini with Amazon Bedrock (Claude) and Google Maps with Amazon Location Service, while preserving the pipeline’s resumable, filesystem-staged, ~21-stage architecture.

Decisions captured from the user:

The intended outcome: the same CLI and stage outputs, with zero non-AWS LLM/geo calls, authenticated by IAM (no API keys), backed by S3, scalable to hundreds of parallel volumes.

Note on model IDs below: the exact Bedrock model / inference-profile IDs (region prefix such as us., and version suffix) must be confirmed per-region against the Bedrock console when implementing. The IDs here are illustrative of the tier choice, not literal strings to paste.


Strategy: a thin provider adapter, not a rewrite

Every Gemini call funnels through get_client() / generate_with_retry() in utils/gemini.py (re-exported by pipeline/api.py). Call sites depend on only four things from google-genai:

  1. those two functions (unchanged signatures),
  2. a response object exposing .text and .candidates[0].finish_reason,
  3. config/content builders: GenerateContentConfig(...), Part.from_bytes/from_text, ThinkingConfig, MediaResolution, FinishReason,
  4. the model_slug() filename contract in utils/models.py.

So the abstraction boundary is a provider adapter inside one module. Reimplement those helpers and types over boto3 bedrock-runtime (Converse API), keeping the same names/signatures. Call sites then change only their import lines — their retry ladders, quality checks, JSON-recovery, and context logic are untouched.


Part A — Code changes

A1. utils/gemini.pyutils/llm.py (the provider layer; highest-value change)

Rename to utils/llm.py; leave utils/gemini.py as a one-release re-export shim so colab/ and external notebooks keep working.

Route the two direct-call bypasses through generate_with_retry too: generate_prompt._call_gemini (pipeline/generate_prompt.py:246) and pipeline/compare_ocr.py:100.

A2. utils/models.py — model IDs + slug/filename contract (subtle, resumability-critical)

A3. Call sites — import-line changes only

A4. --flex semantics (orchestration: cli/main.py, pipeline/stages.py, app.py)

--flex meant “Gemini flex tier.” Repurpose to “cheap/batch tier.”

A5. pipeline/geo/geocode_entries.py — Google Maps → Amazon Location

Keep geocode_rows, the geocache.json cache, dedup, and the address→city fallback levels intact.

A6. pyproject.toml

Remove google-genai; add boto3>=1.34. Keep python-dotenv (region/override config). In the geo extra, drop geopy (Nominatim no longer used). The gpu extra (surya-ocr, transformers) is unchanged — Surya stays as the bbox detector.

A7. Secrets / IAM

Delete GEMINI_API_KEY and GOOGLE_MAPS_API_KEY from code and .env.template. Bedrock + Location authenticate via the instance/job IAM role (boto3 picks it up with zero code). Pipeline role policy: bedrock:InvokeModel (+ InvokeModelWithResponseStream if NER streaming; CreateModelInvocationJob/ GetModelInvocationJob if Batch tier built), geo:SearchPlaceIndexForText (or geo-places:Geocode), s3:GetObject/PutObject/ListBucket on the artifact bucket, secretsmanager:GetSecretValue only if a residual private-IIIF credential survives. load_dotenv() becomes dev-only.

Critical files: utils/gemini.pyutils/llm.py, utils/models.py, pipeline/extract_entries.py, pipeline/run_gemini_ocr.py, pipeline/geo/geocode_entries.py (plus supporting: pipeline/generate_prompt.py, pipeline/compare_ocr.py, analysis/compare_extraction.py, analysis/review_entries.py, pipeline/api.py, pipeline/stages.py, cli/main.py, app.py, pyproject.toml, new Dockerfile, new pipeline/storage_sync.py).


Part B — Infrastructure topology (two phases)

Phase 1 — Single GPU EC2 + S3-backed artifacts (validate the migration)

A g5.xlarge (A10G, 24 GB) running the existing CLI verbatim — matches the one-orchestrator, subprocess-stage, local-filesystem, resumable design with the least change.

Phase 2 — AWS Batch (GPU) + Step Functions (scale to 100s of volumes in parallel)

Chosen over SageMaker because the workload is heterogeneous-compute batch ETL (not training): Batch maps 1:1 to the subprocess-per-stage model, isolates GPU cost to Surya, and supports Spot.


Cost estimate — one 500-page city directory

Rates (per million tokens): Sonnet $3 in / $15 out, Opus $5 in / $25 out. Bedrock matches Anthropic’s published per-token rates (AWS sets them per region — confirm in the Bedrock pricing console). Batch inference = 50% off; prompt caching serves cache hits at ~0.1× input.

These are planning numbers. Per-page token counts vary a lot with page density (a sparse name column vs. a dense classified-ads page). Calibrate with count_tokens on a few real pages or a 10-page pilot before committing a budget. LLM tokens dominate the total — GPU/storage/geo are rounding error.

Per-page token assumptions (defaults; tune after a pilot):

Stage Input (image + system prompt + context) Output
Bedrock OCR (multimodal, Sonnet) ~4,500 tok (≈3,000 image + ~1,500 OCR prompt) ~2,500 tok transcript
Bedrock NER (text-only, Sonnet) ~4,200 tok (~2,500 transcript + ~1,500 NER prompt + context) ~4,000 tok JSON entries

Per-page cost (Sonnet, standard pricing):

Full single-volume run (Phase 1, single g5 box):

Component Standard pricing Optimized (batch NER + prompt caching + Spot GPU)
Bedrock OCR + NER (Sonnet) ~$62 ~$33 (NER batched −50%; system-prompt cache-hit trims input)
Opus fallback escalations (~5% of pages re-run) ~$2 ~$2
Surya GPU + orchestration (g5.xlarge, few hrs) ~$5 (on-demand) ~$1–2 (Spot)
Amazon Location geocoding (optional, if addresses) ~$3 (≈5–10k lookups @ ~$0.50/1k, dedup-cached) ~$3
S3 storage + transfer (~0.5–1 GB) <$1 <$1
Total per 500-page volume ≈ $70 ≈ $40

Sensitivities:


Part C — Trade-offs & risks


Part D — Verification (end-to-end)

  1. Unit (mocked boto3): in tests/test_api.py, assert generate_with_retry builds the right Converse payload from Part/GenerateContentConfig, maps ThrottlingException→429 backoff and ServiceUnavailableException→503 backoff, and that the response adapter exposes .text and .candidates[0].finish_reason with content_filtered → RECITATION. Assert model_slug/ discover_ocr_slug round-trip new IDs and still find legacy gemini-* files.
  2. Single-page smoke: run run_gemini_ocr.py + extract_entries.py on one known page against real Bedrock; confirm .txt and _entries.json with sane content and correct new slug filenames.
  3. Quality parity: reuse analysis/compare_ocr.py / analysis/compare_extraction.py (Bedrock-backed) on held-out pages that previously hit the recitation/dot-leader/repetition paths; compare to archived Gemini outputs; re-tune thresholds on regression.
  4. Resumability: run a small volume, kill mid-stage, restart; confirm pipeline_state.json + S3-synced output/ resume and mtime cache-skip still works.
  5. Geocoding: run geocode_entries.py; confirm Amazon Location address-level hits, city fallback, and geocache.json reuse (second run = zero network calls).
  6. Infra integration: deploy the Docker image to the g5 instance with the IAM role (no API keys in env); confirm Bedrock/Location succeed via the role, Flask reviewers reachable via ALB, explorer/map served from CloudFront. Then validate one Map fan-out of a few volumes through Step Functions + Batch.

Suggested implementation order

  1. A1 provider layer (utils/llm.py) + shim types + retry mapping + response adapter; unit tests (D1).
  2. A2 model IDs + slug regex generalization (keep legacy branch).
  3. A3 call-site import swaps; single-page smoke (D2).
  4. A5 Amazon Location geocoder.
  5. A6/A7 deps + IAM + remove API keys.
  6. Quality parity eval (D3); re-tune.
  7. Dockerfile + storage_sync.py; Phase 1 EC2 deploy (D4–D6).
  8. Phase 2 Batch + Step Functions for parallel scale.