STORY-F-020: Staging + Production host Postgres + Finnest deploy + smoke test

Epic: Infrastructure
Priority: Must Have
Story Points: 3
Status: Not Started
Assigned To: Unassigned
Created: 2026-04-17
Sprint: 4

User Story

As an Ops SRE and commercial stakeholder, I want Finnest deployed to staging and production hosts alongside Laravel (per ADR-014-F co-deploy), with an end-to-end smoke test proving login → org setup → agent chat → Cmd+K works on all 3 envs, so that Phase 0 deliverable is complete and Scout + Verify go-live work in Sprint 5+ has a validated deploy pipeline across all target environments.

Description

Background

F-011 landed integration deploy. This story extends the same pattern to staging (staging-finnest.agentic-ai.au) and production (app-finnest.agentic-ai.au). Per ADR-014-F §Host strategy Option A, Finnest co-deploys on the existing Laravel hosts (staging t3.medium, production t3.large). The co-deploy pattern and risks are documented in ADR-014-F §Mitigations.

This story is also the Phase 0 acceptance gate — the end-to-end smoke test runs against production-finnest and validates every deliverable from all prior stories. Pass here = Phase 0 done; begin Scout+Verify go-live sprint planning.

Scope

In scope:

Production-ready Dockerfile (Dockerfile at repo root + Dockerfile.prod if they diverge). Multi-stage Elixir 1.18/OTP 27 release build with Phoenix 1.8 asset pipeline (esbuild + tailwind). Umbrella releases: config in root mix.exs. Runtime image: Debian 12 slim + libssl3 + libstdc++6 + libncursesw6 + libsodium23 (argon2 NIF). Matches IN-01 + architecture §Infrastructure L279 (Debian-based, non-Alpine); preserves NIF ABI compatibility from build to runtime (Argon2, Cloak, pgcrypto all link glibc/OpenSSL). This was originally in scope for F-011 but was deferred out of F-011 when Phase C couldn't complete (see F-011 Post-sprint close-out for context). This Dockerfile unblocks the first actual Finnest deploy to integration.
First integration deploy end-to-end (picked up from F-011 Phase C defer): gh workflow run ci-finnest.yml -f destination=integration completes Kamal builder + push + deploy steps; kamal-proxy list shows finnest-app-web-integration alongside AgenticAI-app; Let's Encrypt cert issued for integration-finnest.agentic-ai.au. Smoke suite (scripts/deploy-finnest-smoke.sh integration) green including the Laravel regression guard. Flip .github/workflows/ci-finnest.yml deploy-integration gate to auto-on-main. Host RAM/capacity check per ADR-014-F §Capacity.
Terraform apply finnest_enabled: true to staging and production envs:
Reuses the agenticai module's finnest_enabled + finnest_env_label vars landed in F-011 Phase B.
Provisions finnest-${env}-backups S3 buckets + IAM users (DNS stays on Cloudflare per F-011 Phase B; no Route 53 resources here).
Staging + production env files (infrastructure/staging/main.tf, infrastructure/production/main.tf) get finnest_enabled = true and finnest_env_label = "staging" / "production".
Kamal configs config/deploy.staging.yml, config/deploy.production.yml — same shape as F-011's integration config, different hosts + URLs + env values (proxy.host: staging-finnest.agentic-ai.au / app-finnest.agentic-ai.au)
Staging + production host kamal-proxy — both hosts already run kamal-proxy 2.x (same pattern as integration per F-011 Phase B discovery). Finnest registers with the shared proxy on first deploy; no per-host Caddy setup needed.
GitHub Actions workflow ci-finnest.yml extended:
Auto-deploy to staging on merge to main (same pattern as F-011 for integration — can run in parallel)
Manual trigger deploy-finnest-production.yml for production (ADR-014-F §CI/CD separation — manual promotion, same as AgenticAI-app production)
Secrets — per-env Bitwarden secrets for staging + production (separate from integration; DATABASE_URL, CLOAK_KEY_V1 etc. regenerated per env)
Nightly sync from production — staging job runs bin/sync_staging nightly 02:00 AEST (pattern mirrors AgenticAI-app): dump production Postgres, anonymise PII (name, email, phone, TFN → sentinel values), restore to staging. Simple sync; full DR testing in Phase 2+.
Observability baseline — Phase 0 exits with minimum observability:
logger_json structured logs to CloudWatch Logs (per env log group)
Sentry SaaS wired for commercial; IRAP self-hosted deferred
Oban Web mounted at /admin/oban (behind admin auth)
Basic Prometheus + Grafana: /metrics endpoint; reuse bastion Grafana OR set up per-env Prometheus exporter (defer Grafana dashboards to Phase 1)
End-to-end smoke test — automated Playwright script runs post-deploy against production-finnest:
Login as smoke-test user (seeded via migration)
Navigate home; confirm DaisyUI styles load
Open Cmd+K; type "go to home"; Enter; confirm navigation
Open Cmd+K; type "who am I"; Enter (handoff to agent chat)
Agent chat streams response via Channel
Close session; confirm agents.sessions.ended_at populated
Check events.domain_events has entries for login, session events
Hash chain verify for smoke-test org passes
Phase 0 retrospective note — create stories/phase-0-retro-2026-05-15.md stub for end-of-phase review
Sprint-status.yaml updated to status: completed for Sprint 4 at end of this story

Out of scope:

Full Grafana dashboards (Phase 1)
Alertmanager / PagerDuty wiring (Phase 1 or 2)
Production cutover from Laravel (Migration Phase X)
IRAP environment (Phase 3)
PgBouncer (Phase 3 per ADR-014-F)
Read replicas (Phase 2)

Technical Notes

Per ADR-014-F §Capacity planning: staging t3.medium co-deploy is TIGHT on RAM (~4 GB total with both apps + both DBs). Monitor sustained >85% RAM during Sprint 4; trigger instance upgrade if breached. Production t3.large has ~7.5 GB projected; more headroom.
Two-person approval for production deploys — implement via GitHub environment protection rules (production-finnest environment with 2 reviewer requirement)
DNS propagation: pre-provision the Cloudflare records (DNS stays on Cloudflare per F-011 Phase B) 24h before deploys so DNS is warm
Smoke test seed user: created via mix run priv/repo/seeds/smoke_test_user.exs — idempotent; credentials in Bitwarden under smoke_test_user
Anonymisation script for staging sync: replace name → FakerName, email → fake{id}@staging-anon.local, phone → 0400000000, TFN → 000000000, address → sentinel address. Event store tamper-evidence means we can't modify events; accept that staging event store is still hashed correctly (staging has its own chain, not prod's chain)
Laravel regression test: before + after deploy, hit staging.agentic-ai.au/up and app.agentic-ai.au/up — confirm 200 (co-deploy not regressed)
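First-time cert issuance per Let's Encrypt: issuance is rate-limited (5 duplicate certificates per week for the same exact set of hostnames, and 50 certificates per week per registered domain), so avoid repeated re-issuance after failed deploys and stagger deploys by env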
Dockerfile runtime image — IN-01 compliance: the base image is Debian 12 slim, not Alpine. argon2_elixir (F-006), Cloak AES-256-GCM (F-007 vault), and pgcrypto-backed event hash (F-016) all link against glibc/OpenSSL at the NIF boundary — musl builds need rebuild-against-musl or static linking, and the argon2_elixir port has a documented history of musl/glibc incompatibility. Architecture §Technology Stack — Infrastructure (L279) also mandates Debian-based images. Estimated image size with the Debian slim runtime: ~180–220 MB (Phoenix asset bundle is the long pole, not the base layer).
New dep — {:oban_web, "~> 2.11"}: Oban core (already in deps since F-008) ships no UI; oban_web is the official dashboard, maintained by the Oban authors, and is the standard Elixir-community choice. Mounts at /admin/oban behind the role-admin authorisation plug. Alternative considered: bespoke LiveView dashboard (~200 LOC of listing/filter/retry UI) — rejected as wheel-reinvention for a single story's benefit. Licence: free for the open-source core (Oban Web basic features); Oban Pro is a paid tier with features Phase 0 does not need. Revisit if Phase 2+ telemetry requirements outgrow the basic dashboard.
Production promotion discipline: the deploy-finnest-production.yml manual-only workflow enforces 2-reviewer GitHub environment protection (production-finnest environment) + pre-deploy + post-deploy smoke gates per ADR-014-F §Production promotion discipline (added 2026-04-18 alongside this story's gate-check resolution). Same pattern as AgenticAI-app production.
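
The anonymisation rules in the notes above could be expressed as a per-row transform along the following lines. This is a minimal sketch, not the actual bin/sync_staging implementation: the row shape (a dict with id, name, email, phone, tfn and address keys), the function name, and the sentinel address text are all assumptions, and a fixed placeholder name stands in for a Faker-generated one so the example stays deterministic.

```python
# Hypothetical sketch of the staging-sync PII rules; column names are assumed.
SENTINEL_PHONE = "0400000000"
SENTINEL_TFN = "000000000"
SENTINEL_ADDRESS = "1 Example Street, Sampletown"  # assumed sentinel value


def anonymise_row(row: dict) -> dict:
    """Return a copy of `row` with PII fields replaced by sentinel values."""
    clean = dict(row)  # copy so the source row is never mutated
    clean["name"] = f"Anon User {row['id']}"  # stand-in for a Faker name
    clean["email"] = f"fake{row['id']}@staging-anon.local"
    clean["phone"] = SENTINEL_PHONE
    clean["tfn"] = SENTINEL_TFN
    clean["address"] = SENTINEL_ADDRESS
    return clean
```

Non-PII columns pass through unchanged, which keeps foreign keys and org scoping intact in the restored staging data.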

Dependencies

Blocked by: all of F-001 through F-019 (Phase 0 work must be complete and deployed to integration successfully first)

Acceptance Criteria

Infrastructure provisioning:
- [ ] Terraform apply clean for staging + production with finnest_enabled: true
- [ ] DNS records resolve (Cloudflare, per F-011 Phase B): staging-finnest.agentic-ai.au → staging IP, app-finnest.agentic-ai.au → production IP
- [ ] Postgres accessories running on both hosts; distinct from Laravel MySQL
- [ ] S3 buckets provisioned: finnest-staging-storage, finnest-production-storage
- [ ] kamal-proxy serves both staging.agentic-ai.au (Laravel) AND staging-finnest.agentic-ai.au (Finnest), confirming the co-deploy works

Deploy pipeline:
- [ ] Dockerfile builds finnest:<sha> image locally (docker build -t finnest:test .) in under 10 min; image size <500 MB
- [ ] gh workflow run ci-finnest.yml -f destination=integration completes end-to-end on a Dockerfile-containing main; curl https://integration-finnest.agentic-ai.au/health returns 200 (F-011 Phase C defer picked up here)
- [ ] ci-finnest.yml deploy-integration gate flipped to auto-on-main; next merge to main auto-deploys
- [ ] scripts/deploy-finnest.sh staging succeeds; curl https://staging-finnest.agentic-ai.au/health returns 200
- [ ] scripts/deploy-finnest.sh production with 2-person approval succeeds; curl https://app-finnest.agentic-ai.au/health returns 200
- [ ] Auto-deploy on merge to main updates staging
- [ ] Production deploy requires manual approval + 2 reviewers per GitHub env protection
- [ ] SSL certs valid on all 3 Finnest subdomains
- [ ] docker exec kamal-proxy kamal-proxy list on each host shows AgenticAI + Finnest services registered with correct hosts + TLS=yes

Co-deploy regression:
- [ ] curl https://integration.agentic-ai.au/up (Laravel) still 200
- [ ] curl https://staging.agentic-ai.au/up still 200
- [ ] curl https://app.agentic-ai.au/up (Laravel production) still 200; this one is critical because Laravel is LIVE
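
The regression guard amounts to probing the three Laravel health URLs before and after each Finnest deploy and failing if any of them stops returning 200. A minimal sketch of that check follows; the function names and the injectable `fetch` parameter are illustrative assumptions, not the contents of scripts/deploy-finnest-smoke.sh.

```python
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Laravel health endpoints named in the acceptance criteria.
LARAVEL_HEALTH_URLS = [
    "https://integration.agentic-ai.au/up",
    "https://staging.agentic-ai.au/up",
    "https://app.agentic-ai.au/up",
]


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for `url`, or 0 if the request fails outright."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:  # non-2xx responses still carry a status code
        return err.code
    except (URLError, OSError):  # DNS failure, refused connection, timeout
        return 0


def failing_endpoints(
    urls: Iterable[str], fetch: Callable[[str], int] = http_status
) -> Dict[str, int]:
    """Map each URL that did not return 200 to the status actually observed."""
    statuses = {url: fetch(url) for url in urls}
    return {url: code for url, code in statuses.items() if code != 200}
```

Running `failing_endpoints(LARAVEL_HEALTH_URLS)` once before the deploy and once after, and aborting on a non-empty result, would give the before/after comparison; passing a fake `fetch` keeps the logic testable without network access.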

Nightly sync:
- [ ] bin/sync_staging runs; anonymises PII; staging Postgres receives copy of production data
- [ ] Staging smoke test passes against post-sync data
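
This story does not say which scheduler triggers the nightly sync. If it is one that evaluates cron expressions in UTC (GitHub Actions `schedule` triggers do), the 02:00 AEST target has to be converted first. The sketch below shows the arithmetic under that assumption; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# AEST is a fixed UTC+10 offset with no daylight saving.
AEST = timezone(timedelta(hours=10), "AEST")


def aest_to_utc_cron(hour: int, minute: int = 0) -> str:
    """Return a daily five-field cron expression in UTC for an AEST wall-clock time."""
    local = datetime(2026, 1, 1, hour, minute, tzinfo=AEST)  # the date is arbitrary
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} * * *"


print(aest_to_utc_cron(2))  # 0 16 * * *
```

So 02:00 AEST is 16:00 UTC on the previous calendar day. If the intended time is really Sydney local time, the job would drift by an hour during daylight saving (AEDT, UTC+11) unless the expression is adjusted.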

Observability baseline:
- [ ] CloudWatch Logs receives structured logs from production-finnest (sample query returns recent entries)
- [ ] Sentry receives a deliberate test exception (trigger via the admin-only /admin/test-exception route, then remove the route after verification)
- [ ] /metrics endpoint responds with Prometheus format metrics
- [ ] /admin/oban (behind admin auth) shows Oban Web dashboard
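
One way to make "responds with Prometheus format metrics" checkable is to parse the response body and reject anything that is not a comment or a sample line. The parser below is a deliberately small sketch of the text exposition format: it does not handle label values containing `}`, and the function name is an assumption.

```python
import re

# One sample line: name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>[^}]*)\})?"
    r"\s+(?P<value>\S+)"
    r"(?:\s+(?P<ts>-?\d+))?\s*$"
)


def parse_metrics(body: str) -> dict:
    """Parse exposition text into {metric name: [values]}; raise ValueError on a bad line."""
    samples = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # blank, HELP or TYPE line
            continue
        match = _SAMPLE.match(line)
        if match is None:
            raise ValueError(f"not a valid sample line: {line!r}")
        # float() also accepts the NaN, +Inf and -Inf spellings the format allows
        samples.setdefault(match["name"], []).append(float(match["value"]))
    return samples
```

A smoke check could fetch /metrics, call `parse_metrics` on the body, and assert that the result is non-empty; an HTML error page would raise `ValueError` instead of passing silently.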

End-to-end smoke test (the Phase 0 deliverable):
- [ ] Smoke test user can log in to production-finnest
- [ ] Home page renders with DaisyUI styles
- [ ] Cmd+K opens; "go to home" navigates
- [ ] "Who am I" via agent chat returns user info (Tier-1 pattern match, $0 cost)
- [ ] Agent chat streams a Claude response (Tier-2 LLM, via AnthropicDirect)
- [ ] Event store has entries; hash chain verify passes for smoke-test org
- [ ] tenant_query_raises_test.exs architecture test still passes in production build
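
For readers unfamiliar with what "hash chain verify" checks, the following generic sketch illustrates the idea: each event's hash covers the previous event's hash plus its own payload, so altering any earlier event invalidates everything after it. This is not the F-016 implementation (which is pgcrypto-backed); the field names, the SHA-256 choice, the JSON canonicalisation and the genesis value are all assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first event in an org's chain


def event_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def verify_chain(events: list) -> bool:
    """Return True if every event links to its predecessor and its hash recomputes."""
    prev = GENESIS
    for event in events:
        if event["prev_hash"] != prev:
            return False  # broken link: an event was removed or reordered
        if event["hash"] != event_hash(prev, event["payload"]):
            return False  # payload was altered after it was hashed
        prev = event["hash"]
    return True
```

Under this model the smoke test would load the smoke-test org's events in order and expect `verify_chain` to return `True`.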

Capacity monitoring:
- [ ] Sustained RAM <85% on staging host after 24h (ADR-014-F upgrade trigger)
- [ ] Sustained RAM <85% on production host after 24h
- [ ] If breached, upgrade ticket opened; t3.medium → t3.large scheduled
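
"Sustained" is not defined numerically in this story. One workable interpretation, shown here as a sketch, treats the trigger as breached when a run of consecutive RAM samples all exceed 85%. The sampling interval and the window length (for example, 12 samples at 5-minute intervals, i.e. one hour) are assumptions to be agreed against ADR-014-F.

```python
from typing import Iterable


def sustained_breach(
    samples: Iterable[float], threshold: float = 85.0, window: int = 12
) -> bool:
    """Return True if `window` consecutive samples all exceed `threshold` percent."""
    run = 0  # length of the current streak of over-threshold samples
    for pct in samples:
        run = run + 1 if pct > threshold else 0
        if run >= window:
            return True
    return False
```

A single dip below the threshold resets the streak, so short spikes during deploys or the nightly sync would not by themselves open an upgrade ticket.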

Phase 0 wrap:
- [ ] sprint-status.yaml Sprint 4 marked completed; velocity recorded
- [ ] stories/phase-0-retro-2026-05-15.md stub created for retro
- [ ] Scout + Verify sprint planning scheduled for 2026-05-15 (Sprint 5 kickoff) with execution learnings from Phase 0

Testing Requirements

Integration (live): deploy → staging; run smoke test Playwright script; all assertions pass
Integration (live): deploy → production; repeat smoke test
Co-deploy regression: pre- + post-deploy Laravel health checks from external monitor
Observability: exception test → confirm Sentry receives it; structured log → confirm CloudWatch receives it
Capacity: 1-hour stress --cpu 4 on staging while monitoring RAM — ensure Laravel services don't OOM

References

../adrs/adr-014-F-infrastructure-reuse-from-agenticai-app.md §Implementation Plan steps 9 + 10–16
../architecture/architecture.md Part 11 Deployment Architecture
../10-GUARDRAILS.md IN-01 through IN-18, OP-01 through OP-04
AgenticAI-app bin/sync_staging, config/deploy.production.yml, .github/workflows/deploy-production.yml (reference patterns)
../brainstorms/brainstorm-10-development-roadmap.md §Phase 0 deliverable