STORY-F-020: Staging + Production host Postgres + Finnest deploy + smoke test

Epic: Infrastructure
Priority: Must Have
Story Points: 3
Status: Not Started
Assigned To: Unassigned
Created: 2026-04-17
Sprint: 4


User Story

As an Ops SRE and commercial stakeholder, I want Finnest deployed to staging and production hosts alongside Laravel (per ADR-014-F co-deploy), with an end-to-end smoke test proving login → org setup → agent chat → Cmd+K works on all 3 envs, so that Phase 0 deliverable is complete and Scout + Verify go-live work in Sprint 5+ has a validated deploy pipeline across all target environments.


Description

Background

F-011 landed integration deploy. This story extends the same pattern to staging (staging-finnest.agentic-ai.au) and production (app-finnest.agentic-ai.au). Per ADR-014-F §Host strategy Option A, Finnest co-deploys on the existing Laravel hosts (staging t3.medium, production t3.large). The co-deploy pattern and risks are documented in ADR-014-F §Mitigations.

This story is also the Phase 0 acceptance gate — the end-to-end smoke test runs against production-finnest and validates every deliverable from all prior stories. Pass here = Phase 0 done; begin Scout+Verify go-live sprint planning.

Scope

In scope:

  • Production-ready Dockerfile (Dockerfile at repo root + Dockerfile.prod if they diverge). Multi-stage Elixir 1.18/OTP 27 release build with Phoenix 1.8 asset pipeline (esbuild + tailwind). Umbrella releases: config in root mix.exs. Runtime image: Debian 12 slim + libssl3 + libstdc++6 + libncursesw6 + libsodium23 (argon2 NIF). Matches IN-01 + architecture §Infrastructure L279 (Debian-based, non-Alpine); preserves NIF ABI compatibility from build to runtime (Argon2, Cloak, pgcrypto all link glibc/OpenSSL). The Dockerfile was deferred out of F-011 when Phase C couldn't complete (see F-011 Post-sprint close-out for context). It unblocks the first actual Finnest deploy to integration.
  • First integration deploy end-to-end (picked up from F-011 Phase C defer): gh workflow run ci-finnest.yml -f destination=integration completes Kamal builder + push + deploy steps; kamal-proxy list shows finnest-app-web-integration alongside AgenticAI-app; Let's Encrypt cert issued for integration-finnest.agentic-ai.au. Smoke suite (scripts/deploy-finnest-smoke.sh integration) green including the Laravel regression guard. Flip .github/workflows/ci-finnest.yml deploy-integration gate to auto-on-main. Host RAM/capacity check per ADR-014-F §Capacity.
  • Terraform apply finnest_enabled: true to staging and production envs:
    • Reuses the agenticai module's finnest_enabled + finnest_env_label vars landed in F-011 Phase B.
    • Provisions finnest-${env}-backups S3 buckets + IAM users (DNS stays on Cloudflare per F-011 Phase B; no Route 53 resources here).
    • Staging + production env files (infrastructure/staging/main.tf, infrastructure/production/main.tf) get finnest_enabled = true + finnest_env_label = "staging" / "production".
  • Kamal configs config/deploy.staging.yml, config/deploy.production.yml — same shape as F-011's integration config, different hosts + URLs + env values (proxy.host: staging-finnest.agentic-ai.au / app-finnest.agentic-ai.au)
  • Staging + production host kamal-proxy — both hosts already run kamal-proxy 2.x (same pattern as integration per F-011 Phase B discovery). Finnest registers with the shared proxy on first deploy; no per-host Caddy setup needed.
  • GitHub Actions workflow ci-finnest.yml extended:
    • Auto-deploy to staging on merge to main (same pattern as F-011 for integration; can run in parallel)
    • Manual trigger deploy-finnest-production.yml for production (ADR-014-F §CI/CD separation: manual promotion, same as AgenticAI-app production)
  • Secrets — per-env Bitwarden secrets for staging + production (separate from integration; DATABASE_URL, CLOAK_KEY_V1 etc. regenerated per env)
  • Nightly sync from production — staging job runs bin/sync_staging nightly 02:00 AEST (pattern mirrors AgenticAI-app): dump production Postgres, anonymise PII (name, email, phone, TFN → sentinel values), restore to staging. Simple sync; full DR testing in Phase 2+.
  • Observability baseline: Phase 0 exits with minimum observability:
    • logger_json structured logs to CloudWatch Logs (per env log group)
    • Sentry SaaS wired for commercial; IRAP self-hosted deferred
    • Oban Web mounted at /admin/oban (behind admin auth)
    • Basic Prometheus + Grafana: /metrics endpoint; reuse bastion Grafana OR set up per-env Prometheus exporter (defer Grafana dashboards to Phase 1)
  • End-to-end smoke test: automated Playwright script runs post-deploy against production-finnest:
    • Login as smoke-test user (seeded via migration)
    • Navigate home; confirm DaisyUI styles load
    • Open Cmd+K; type "go to home"; Enter; confirm navigation
    • Open Cmd+K; type "who am I"; Enter (handoff to agent chat)
    • Agent chat streams response via Channel
    • Close session; confirm agents.sessions.ended_at populated
    • Check events.domain_events has entries for login, session events
    • Hash chain verify for smoke-test org passes
  • Phase 0 retrospective note — create stories/phase-0-retro-2026-05-15.md stub for end-of-phase review
  • Sprint-status.yaml updated to status: completed for Sprint 4 at end of this story
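
The per-environment hostnames used throughout this scope could be kept in one place so that deploy and smoke scripts do not hard-code them. A minimal sketch, assuming a small helper file that those scripts source; the function names finnest_host and finnest_health_url are hypothetical and not part of the existing scripts, while the hostnames and the /health path are taken from this story:

```shell
# Hypothetical helper: maps an environment name to its Finnest hostname.
# finnest_host ENV -> prints the hostname, or fails for an unknown ENV.
finnest_host() {
  case "$1" in
    integration) echo "integration-finnest.agentic-ai.au" ;;
    staging)     echo "staging-finnest.agentic-ai.au" ;;
    production)  echo "app-finnest.agentic-ai.au" ;;
    *)           echo "unknown env: $1" >&2; return 1 ;;
  esac
}

# finnest_health_url ENV -> prints the health-check URL for ENV.
finnest_health_url() {
  health_host="$(finnest_host "$1")" || return 1
  echo "https://${health_host}/health"
}
```

A caller could then run, for example, curl --fail --silent "$(finnest_health_url staging)" after a staging deploy.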

Out of scope:

  • Full Grafana dashboards (Phase 1)
  • Alertmanager / PagerDuty wiring (Phase 1 or 2)
  • Production cutover from Laravel (Migration Phase X)
  • IRAP environment (Phase 3)
  • PgBouncer (Phase 3 per ADR-014-F)
  • Read replicas (Phase 2)

Technical Notes

  • Per ADR-014-F §Capacity planning: staging t3.medium co-deploy is TIGHT on RAM (~4 GB total with both apps + both DBs). Monitor sustained >85% RAM during Sprint 4; trigger instance upgrade if breached. Production t3.large has ~7.5 GB projected; more headroom.
  • Two-person approval for production deploys — implement via GitHub environment protection rules (production-finnest environment with 2 reviewer requirement)
  • DNS propagation: pre-provision records 24h before deploys so DNS is warm (records live on Cloudflare per F-011 Phase B, not Route 53)
  • Smoke test seed user: created via mix run priv/repo/seeds/smoke_test_user.exs — idempotent; credentials in Bitwarden under smoke_test_user
  • Anonymisation script for staging sync: replace name → FakerName, email → fake{id}@staging-anon.local, phone → 0400000000, TFN → 000000000, address → sentinel address. Event store tamper-evidence means we can't modify events; accept that staging event store is still hashed correctly (staging has its own chain, not prod's chain)
  • Laravel regression test: before + after deploy, hit staging.agentic-ai.au/up and app.agentic-ai.au/up — confirm 200 (co-deploy not regressed)
  • First-time cert issuance via Let's Encrypt is rate-limited (5 duplicate certificates per exact set of hostnames per week; 50 certificates per registered domain per week), so stagger deploys by env
  • Dockerfile runtime image — IN-01 compliance: the base image is Debian 12 slim, not Alpine. argon2_elixir (F-006), Cloak AES-256-GCM (F-007 vault), and pgcrypto-backed event hash (F-016) all link against glibc/OpenSSL at the NIF boundary — musl builds need rebuild-against-musl or static linking, and the argon2_elixir port has a documented history of musl/glibc incompatibility. Architecture §Technology Stack — Infrastructure (L279) also mandates Debian-based images. Estimated image size with the Debian slim runtime: ~180–220 MB (Phoenix asset bundle is the long pole, not the base layer).
  • New dep — {:oban_web, "~> 2.11"}: Oban core (already in deps since F-008) ships no UI; oban_web is the official dashboard, maintained by the Oban authors, and is the standard Elixir-community choice. Mounts at /admin/oban behind the role-admin authorisation plug. Alternative considered: bespoke LiveView dashboard (~200 LOC of listing/filter/retry UI) — rejected as wheel-reinvention for a single story's benefit. Licence: free for the open-source core (Oban Web basic features); Oban Pro is a paid tier with features Phase 0 does not need. Revisit if Phase 2+ telemetry requirements outgrow the basic dashboard.
  • Production promotion discipline: the deploy-finnest-production.yml manual-only workflow enforces 2-reviewer GitHub environment protection (production-finnest environment) + pre-deploy + post-deploy smoke gates per ADR-014-F §Production promotion discipline (added 2026-04-18 alongside this story's gate-check resolution). Same pattern as AgenticAI-app production.
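
The sentinel mapping for the staging sync in the notes above can be illustrated as a stream filter. This is only a sketch under stated assumptions: the tab-separated column order (id, name, email, phone, tfn) and the function name anonymise_tsv are hypothetical, "FakerName" is treated as a literal placeholder (the real script may generate names with a Faker library), and the address column is omitted because the story does not give its sentinel value.

```shell
# anonymise_tsv: reads tab-separated rows on stdin and writes them back
# with PII replaced by the sentinels listed in the Technical Notes.
# Assumed (hypothetical) column order: id, name, email, phone, tfn.
anonymise_tsv() {
  awk 'BEGIN { FS = OFS = "\t" }
       {
         $2 = "FakerName"                      # name  -> placeholder
         $3 = "fake" $1 "@staging-anon.local"  # email -> fake{id}@staging-anon.local
         $4 = "0400000000"                     # phone -> sentinel
         $5 = "000000000"                      # TFN   -> sentinel
         print
       }'
}
```

Deriving the email from the row id keeps addresses unique after anonymisation, which matters if the staging schema has a unique index on email.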

Dependencies

  • Blocked by: all of F-001 through F-019 (Phase 0 work must be complete and deployed to integration successfully first)

Acceptance Criteria

Infrastructure provisioning:

  - [ ] Terraform apply clean for staging + production with finnest_enabled: true
  - [ ] DNS records resolve (Cloudflare per F-011 Phase B): staging-finnest.agentic-ai.au → staging IP, app-finnest.agentic-ai.au → production IP
  - [ ] Postgres accessories running on both hosts; distinct from Laravel MySQL
  - [ ] S3 buckets provisioned: finnest-staging-storage, finnest-production-storage
  - [ ] kamal-proxy serves both staging.agentic-ai.au (Laravel) AND staging-finnest.agentic-ai.au (Finnest); co-deploy working

Deploy pipeline:

  - [ ] Dockerfile builds finnest:<sha> image locally (docker build -t finnest:test .) in under 10 min; image size <500 MB
  - [ ] gh workflow run ci-finnest.yml -f destination=integration completes end-to-end on a Dockerfile-containing main; curl https://integration-finnest.agentic-ai.au/health returns 200 (F-011 Phase C defer picked up here)
  - [ ] ci-finnest.yml deploy-integration gate flipped to auto-on-main; next merge to main auto-deploys
  - [ ] scripts/deploy-finnest.sh staging succeeds; curl https://staging-finnest.agentic-ai.au/health returns 200
  - [ ] scripts/deploy-finnest.sh production with 2-person approval succeeds; curl https://app-finnest.agentic-ai.au/health returns 200
  - [ ] Auto-deploy on merge to main updates staging
  - [ ] Production deploy requires manual approval + 2 reviewers per GitHub env protection
  - [ ] SSL certs valid on all 3 Finnest subdomains
  - [ ] docker exec kamal-proxy kamal-proxy list on each host shows AgenticAI + Finnest services registered with correct hosts + TLS=yes
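
The image-size criterion above could be checked mechanically after a local build. A minimal sketch, assuming that "500 MB" means 500,000,000 bytes (the story does not say MB versus MiB); the function names are hypothetical, and docker image inspect with a Go-template --format is the only Docker call used:

```shell
# image_within_limit SIZE_BYTES LIMIT_MB -> exit 0 when SIZE_BYTES is below LIMIT_MB.
# Assumption: 1 MB = 1,000,000 bytes.
image_within_limit() {
  [ "$1" -lt $(( $2 * 1000000 )) ]
}

# image_size_bytes IMAGE -> prints the local image size in bytes as reported by Docker.
image_size_bytes() {
  docker image inspect --format '{{.Size}}' "$1"
}
```

For example: image_within_limit "$(image_size_bytes finnest:test)" 500 && echo "image size ok".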

Co-deploy regression:

  - [ ] curl https://integration.agentic-ai.au/up (Laravel) still 200
  - [ ] curl https://staging.agentic-ai.au/up still 200
  - [ ] curl https://app.agentic-ai.au/up (Laravel production) still 200 (critical: Laravel is LIVE)
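
The three Laravel checks above lend themselves to a single guard that runs before and after each Finnest deploy. A minimal sketch, not the contents of scripts/deploy-finnest-smoke.sh: the function names http_status and laravel_guard and the 10-second timeout are assumptions, while the URLs are the ones listed in the criteria.

```shell
# Laravel health endpoints from the co-deploy regression criteria.
LARAVEL_URLS="https://integration.agentic-ai.au/up https://staging.agentic-ai.au/up https://app.agentic-ai.au/up"

# http_status URL -> prints the HTTP status code (curl prints 000 if the connection fails).
http_status() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1" || true
}

# laravel_guard -> exit 0 only when every Laravel endpoint answers 200.
laravel_guard() {
  guard_failed=0
  for guard_url in $LARAVEL_URLS; do
    guard_code="$(http_status "$guard_url")"
    if [ "$guard_code" = "200" ]; then
      echo "ok   $guard_url"
    else
      echo "FAIL $guard_url (status $guard_code)"
      guard_failed=1
    fi
  done
  return "$guard_failed"
}
```

Running laravel_guard once before the deploy and once after makes it clear whether a failure predates the Finnest change or was caused by it.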

Nightly sync:

  - [ ] bin/sync_staging runs; anonymises PII; staging Postgres receives copy of production data
  - [ ] Staging smoke test passes against post-sync data

Observability baseline:

  - [ ] CloudWatch Logs receives structured logs from production-finnest (sample query returns recent entries)
  - [ ] Sentry receives a deliberate test exception (trigger via the admin-only route /admin/test-exception, then remove the route after verification)
  - [ ] /metrics endpoint responds with Prometheus format metrics
  - [ ] /admin/oban (behind admin auth) shows Oban Web dashboard
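
The /metrics criterion above can be approximated with a simple shape check on the response body. A heuristic sketch only: the function name looks_like_prometheus is hypothetical, and the check relies on the "# TYPE" metadata lines of the Prometheus text exposition format, which exporters commonly emit but which are technically optional.

```shell
# looks_like_prometheus: reads a /metrics response body on stdin; exit 0 when it
# contains at least one "# TYPE <metric_name> <kind>" metadata line.
looks_like_prometheus() {
  grep -Eq '^# TYPE [a-zA-Z_:][a-zA-Z0-9_:]* (counter|gauge|histogram|summary|untyped)'
}
```

For example: curl --silent https://staging-finnest.agentic-ai.au/metrics | looks_like_prometheus. A login page or an HTML error returned with status 200 fails this check, which a bare status-code test would miss.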

End-to-end smoke test (the Phase 0 deliverable):

  - [ ] Smoke test user can log in to production-finnest
  - [ ] Home page renders with DaisyUI styles
  - [ ] Cmd+K opens; "go to home" navigates
  - [ ] "Who am I" via agent chat returns user info (Tier-1 pattern match, $0 cost)
  - [ ] Agent chat streams a Claude response (Tier-2 LLM, via AnthropicDirect)
  - [ ] Event store has entries; hash chain verify passes for smoke-test org
  - [ ] tenant_query_raises_test.exs architecture test still passes in production build

Capacity monitoring:

  - [ ] Sustained RAM <85% on staging host after 24h (ADR-014-F upgrade trigger)
  - [ ] Sustained RAM <85% on production host after 24h
  - [ ] If breached, upgrade ticket opened; t3.medium → t3.large scheduled
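
The 85% trigger above can be evaluated from periodic samples on a Linux host. A minimal sketch with hypothetical function names; it assumes "sustained" means every sample in the window exceeds 85%, which ADR-014-F may define differently, and it derives usage from MemTotal and MemAvailable in /proc/meminfo.

```shell
# ram_used_pct TOTAL_KB AVAILABLE_KB -> prints the integer percent of RAM in use (rounded down).
ram_used_pct() {
  echo $(( ($1 - $2) * 100 / $1 ))
}

# host_ram_used_pct -> the same figure for the local Linux host, read from /proc/meminfo.
host_ram_used_pct() {
  ram_total_kb="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
  ram_avail_kb="$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)"
  ram_used_pct "$ram_total_kb" "$ram_avail_kb"
}

# sustained_breach PCT... -> exit 0 only when every sample is above 85.
# Assumption: "sustained" = all samples in the window breach the threshold.
sustained_breach() {
  [ "$#" -gt 0 ] || return 1
  for breach_pct in "$@"; do
    [ "$breach_pct" -gt 85 ] || return 1
  done
  return 0
}
```

A cron job could append host_ram_used_pct to a file every few minutes and pass the last 24 hours of samples to sustained_breach to decide whether the upgrade ticket is needed.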

Phase 0 wrap:

  - [ ] sprint-status.yaml Sprint 4 marked completed; velocity recorded
  - [ ] stories/phase-0-retro-2026-05-15.md stub created for retro
  - [ ] Scout + Verify sprint planning scheduled for 2026-05-15 (Sprint 5 kickoff) with execution learnings from Phase 0


Testing Requirements

  • Integration (live): deploy → staging; run smoke test Playwright script; all assertions pass
  • Integration (live): deploy → production; repeat smoke test
  • Co-deploy regression: pre- + post-deploy Laravel health checks from external monitor
  • Observability: exception test → confirm Sentry receives it; structured log → confirm CloudWatch receives it
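  • Capacity: 1-hour stress --cpu 4 on staging while monitoring RAM; ensure Laravel services don't OOM. Note that --cpu spawns CPU-bound workers only, so add --vm workers if the run should also apply memory pressure.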

References