STORY-F-020: Staging + Production host Postgres + Finnest deploy + smoke test¶
Epic: Infrastructure
Priority: Must Have
Story Points: 3
Status: Not Started
Assigned To: Unassigned
Created: 2026-04-17
Sprint: 4
User Story¶
As an Ops SRE and commercial stakeholder, I want Finnest deployed to staging and production hosts alongside Laravel (per ADR-014-F co-deploy), with an end-to-end smoke test proving login → org setup → agent chat → Cmd+K works on all 3 envs, so that Phase 0 deliverable is complete and Scout + Verify go-live work in Sprint 5+ has a validated deploy pipeline across all target environments.
Description¶
Background¶
F-011 landed integration deploy. This story extends the same pattern to staging (staging-finnest.agentic-ai.au) and production (app-finnest.agentic-ai.au). Per ADR-014-F §Host strategy Option A, Finnest co-deploys on the existing Laravel hosts (staging t3.medium, production t3.large). The co-deploy pattern and risks are documented in ADR-014-F §Mitigations.
This story is also the Phase 0 acceptance gate — the end-to-end smoke test runs against production-finnest and validates every deliverable from all prior stories. Pass here = Phase 0 done; begin Scout+Verify go-live sprint planning.
Scope¶
In scope:
- Production-ready Dockerfile (`Dockerfile` at repo root + `Dockerfile.prod` if they diverge). Multi-stage Elixir 1.18/OTP 27 release build with Phoenix 1.8 asset pipeline (esbuild + tailwind). Umbrella `releases:` config in root `mix.exs`. Runtime image: Debian 12 slim + `libssl3 libstdc++6` + `libncursesw6` + `libsodium23` (argon2 NIF). Matches IN-01 + architecture §Infrastructure L279 (Debian-based, non-Alpine); preserves NIF ABI compatibility from build to runtime (Argon2, Cloak, pgcrypto all link glibc/OpenSSL). This was originally listed as out-of-scope for F-011 but deferred out of F-011 when Phase C couldn't complete (see F-011 Post-sprint close-out for context). This Dockerfile unblocks the first actual Finnest deploy to integration.
- First integration deploy end-to-end (picked up from F-011 Phase C defer): `gh workflow run ci-finnest.yml -f destination=integration` completes Kamal builder + push + deploy steps; `kamal-proxy list` shows `finnest-app-web-integration` alongside AgenticAI-app; Let's Encrypt cert issued for `integration-finnest.agentic-ai.au`. Smoke suite (`scripts/deploy-finnest-smoke.sh integration`) green including the Laravel regression guard. Flip `.github/workflows/ci-finnest.yml` deploy-integration gate to auto-on-main. Host RAM/capacity check per ADR-014-F §Capacity.
- Terraform apply `finnest_enabled: true` to staging and production envs:
  - Reuses the `agenticai` module's `finnest_enabled` + `finnest_env_label` vars landed in F-011 Phase B.
  - Provisions `finnest-${env}-backups` S3 buckets + IAM users (DNS stays on Cloudflare per F-011 Phase B; no Route 53 resources here).
  - Staging + production env files (`infrastructure/staging/main.tf`, `infrastructure/production/main.tf`) get `finnest_enabled = true` and `finnest_env_label = "staging"` / `"production"`.
- Kamal configs `config/deploy.staging.yml`, `config/deploy.production.yml`: same shape as F-011's integration config, different hosts + URLs + env values (`proxy.host: staging-finnest.agentic-ai.au` / `app-finnest.agentic-ai.au`)
- Staging + production host kamal-proxy: both hosts already run kamal-proxy 2.x (same pattern as integration per F-011 Phase B discovery). Finnest registers with the shared proxy on first deploy; no per-host Caddy setup needed.
- GitHub Actions workflow `ci-finnest.yml` extended:
  - Auto-deploy to staging on merge to main (same pattern as F-011 for integration; can run in parallel)
  - Manual trigger `deploy-finnest-production.yml` for production (ADR-014-F §CI/CD separation: manual promotion, same as AgenticAI-app production)
- Secrets: per-env Bitwarden secrets for staging + production (separate from integration; `DATABASE_URL`, `CLOAK_KEY_V1` etc. regenerated per env)
- Nightly sync from production: staging job runs `bin/sync_staging` nightly at 02:00 AEST (pattern mirrors AgenticAI-app): dump production Postgres, anonymise PII (name, email, phone, TFN → sentinel values), restore to staging. Simple sync; full DR testing in Phase 2+.
- Observability baseline (Phase 0 exits with minimum observability):
  - `logger_json` structured logs to CloudWatch Logs (per env log group)
  - Sentry SaaS wired for commercial; IRAP self-hosted deferred
  - Oban Web mounted at `/admin/oban` (behind admin auth)
  - Basic Prometheus + Grafana: `/metrics` endpoint; reuse bastion Grafana OR set up per-env Prometheus exporter (defer Grafana dashboards to Phase 1)
- End-to-end smoke test (automated Playwright script, run post-deploy against production-finnest):
  - Login as smoke-test user (seeded via migration)
  - Navigate home; confirm DaisyUI styles load
  - Open Cmd+K; type "go to home"; Enter; confirm navigation
  - Open Cmd+K; type "who am I"; Enter (handoff to agent chat)
  - Agent chat streams response via Channel
  - Close session; confirm `agents.sessions.ended_at` populated
  - Check `events.domain_events` has entries for login, session events
  - Hash chain verify for smoke-test org passes
- Phase 0 retrospective note: create `stories/phase-0-retro-2026-05-15.md` stub for end-of-phase review
- `sprint-status.yaml` updated to `status: completed` for Sprint 4 at end of this story
Out of scope:
- Full Grafana dashboards (Phase 1)
- Alertmanager / PagerDuty wiring (Phase 1 or 2)
- Production cutover from Laravel (Migration Phase X)
- IRAP environment (Phase 3)
- PgBouncer (Phase 3 per ADR-014-F)
- Read replicas (Phase 2)
Technical Notes¶
- Per ADR-014-F §Capacity planning: staging t3.medium co-deploy is TIGHT on RAM (~4 GB total with both apps + both DBs). Monitor sustained >85% RAM during Sprint 4; trigger instance upgrade if breached. Production t3.large has ~7.5 GB projected; more headroom.
- Two-person approval for production deploys: implement via GitHub environment protection rules (`production-finnest` environment with 2 reviewer requirement)
- DNS propagation (Cloudflare per F-011 Phase B, not Route 53): pre-provision records 24h before deploys so DNS is warm
- Smoke test seed user: created via `mix run priv/repo/seeds/smoke_test_user.exs` (idempotent); credentials in Bitwarden under `smoke_test_user`
- Anonymisation script for staging sync: replace name → `FakerName`, email → `fake{id}@staging-anon.local`, phone → `0400000000`, TFN → `000000000`, address → sentinel address. Event store tamper-evidence means we can't modify events; accept that staging event store is still hashed correctly (staging has its own chain, not prod's chain)
- Laravel regression test: before + after deploy, hit `staging.agentic-ai.au/up` and `app.agentic-ai.au/up` and confirm 200 (co-deploy not regressed)
- First-time cert issuance per Let's Encrypt: rate-limited to ~5 per week per domain, so stagger deploys by env
- Dockerfile runtime image, IN-01 compliance: the base image is Debian 12 slim, not Alpine. `argon2_elixir` (F-006), Cloak AES-256-GCM (F-007 vault), and pgcrypto-backed event hash (F-016) all link against glibc/OpenSSL at the NIF boundary; musl builds need rebuild-against-musl or static linking, and the `argon2_elixir` port has a documented history of musl/glibc incompatibility. Architecture §Technology Stack, Infrastructure (L279) also mandates Debian-based images. Estimated image size with the Debian slim runtime: ~180–220 MB (Phoenix asset bundle is the long pole, not the base layer).
- New dep `{:oban_web, "~> 2.11"}`: Oban core (already in deps since F-008) ships no UI; `oban_web` is the official dashboard, maintained by the Oban authors, and is the standard Elixir-community choice. Mounts at `/admin/oban` behind the role-admin authorisation plug. Alternative considered: bespoke LiveView dashboard (~200 LOC of listing/filter/retry UI), rejected as wheel-reinvention for a single story's benefit. Licence: free for the open-source core (Oban Web basic features); Oban Pro is a paid tier with features Phase 0 does not need. Revisit if Phase 2+ telemetry requirements outgrow the basic dashboard.
- Production promotion discipline: the `deploy-finnest-production.yml` manual-only workflow enforces 2-reviewer GitHub environment protection (`production-finnest` environment) + pre-deploy + post-deploy smoke gates per ADR-014-F §Production promotion discipline (added 2026-04-18 alongside this story's gate-check resolution). Same pattern as AgenticAI-app production.
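The anonymisation rules in the notes above could be applied with a single SQL statement. The sketch below only prints that statement so it can be reviewed or piped to `psql`. The function name `anonymise_sql`, the table name passed to it, and the column names (`id`, `name`, `email`, `phone`, `tfn`) are hypothetical and not taken from the Finnest schema; the address sentinel is left out because the story does not give its value.

```shell
# Sketch only: print the SQL that applies the sentinel values from the notes.
# ASSUMPTION: table and column names are hypothetical; adjust to the real schema.
anonymise_sql() {
  local table="$1"   # hypothetical table holding the PII columns
  cat <<SQL
UPDATE ${table}
SET name  = 'FakerName',
    email = 'fake' || id || '@staging-anon.local',
    phone = '0400000000',
    tfn   = '000000000';
SQL
}

# Example (not run here; SCRATCH_DATABASE_URL is a hypothetical variable):
#   anonymise_sql people | psql "$SCRATCH_DATABASE_URL"
```

Printing the statement instead of executing it keeps the function side-effect free, so the sync job can log exactly what was applied before the data is restored to staging.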
Dependencies¶
- Blocked by: all of F-001 through F-019 (Phase 0 work must be complete and deployed to integration successfully first)
Acceptance Criteria¶
Infrastructure provisioning:
- [ ] Terraform apply clean for staging + production with finnest_enabled: true
- [ ] DNS records resolve (Cloudflare per F-011 Phase B): staging-finnest.agentic-ai.au → staging IP, app-finnest.agentic-ai.au → production IP
- [ ] Postgres accessories running on both hosts; distinct from Laravel MySQL
- [ ] S3 buckets provisioned: finnest-staging-storage, finnest-production-storage
- [ ] kamal-proxy serves both staging.agentic-ai.au (Laravel) AND staging-finnest.agentic-ai.au (Finnest); co-deploy working
Deploy pipeline:
- [ ] Dockerfile builds finnest:<sha> image locally (docker build -t finnest:test .) in under 10 min; image size <500 MB
- [ ] gh workflow run ci-finnest.yml -f destination=integration completes end-to-end on a Dockerfile-containing main; curl https://integration-finnest.agentic-ai.au/health returns 200 (F-011 Phase C defer picked up here)
- [ ] ci-finnest.yml deploy-integration gate flipped to auto-on-main; next merge to main auto-deploys
- [ ] scripts/deploy-finnest.sh staging succeeds; curl https://staging-finnest.agentic-ai.au/health returns 200
- [ ] scripts/deploy-finnest.sh production with 2-person approval succeeds; curl https://app-finnest.agentic-ai.au/health returns 200
- [ ] Auto-deploy on merge to main updates staging
- [ ] Production deploy requires manual approval + 2 reviewers per GitHub env protection
- [ ] SSL certs valid on all 3 Finnest subdomains
- [ ] docker exec kamal-proxy kamal-proxy list on each host shows AgenticAI + Finnest services registered with correct hosts + TLS=yes
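The kamal-proxy registration check above can be scripted per host. This is a minimal sketch, assuming only that `kamal-proxy list` prints each registered service name somewhere in its output; the column layout is not relied on, and the TLS column is not inspected. The function name `proxy_has_services` is illustrative.

```shell
# Sketch: succeed only if every named service appears in `kamal-proxy list`.
# ASSUMPTION: the listing contains the service name verbatim on some line.
proxy_has_services() {
  local listing svc
  # Capture the listing once; return 2 if the proxy container cannot be queried.
  listing="$(docker exec kamal-proxy kamal-proxy list)" || return 2
  for svc in "$@"; do
    if ! printf '%s\n' "$listing" | grep -qF -- "$svc"; then
      echo "missing service: $svc" >&2
      return 1
    fi
  done
}

# Example (run on the integration host; add the Laravel service name as a second argument):
#   proxy_has_services finnest-app-web-integration
```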
Co-deploy regression:
- [ ] curl https://integration.agentic-ai.au/up (Laravel) still 200
- [ ] curl https://staging.agentic-ai.au/up still 200
- [ ] curl https://app.agentic-ai.au/up (Laravel production) still 200 — this one is critical, Laravel is LIVE
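The three Laravel checks above are the same probe run before and after each Finnest deploy. A minimal sketch of such a guard follows; the function names `http_status` and `laravel_guard` are illustrative, and the 10-second timeout is an assumption rather than a value from the story.

```shell
# Sketch: fail if any Laravel /up endpoint does not return HTTP 200.
http_status() {
  # Prints only the HTTP status code; curl prints "000" when the request fails.
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1" || true
}

laravel_guard() {
  local url status rc=0
  for url in "$@"; do
    status="$(http_status "$url")"
    if [ "$status" != "200" ]; then
      echo "REGRESSION: $url returned $status" >&2
      rc=1   # keep checking so every failing endpoint is reported
    fi
  done
  return "$rc"
}

# Example (run once before and once after the deploy):
#   laravel_guard https://integration.agentic-ai.au/up \
#                 https://staging.agentic-ai.au/up \
#                 https://app.agentic-ai.au/up
```

Checking every URL before returning, instead of stopping at the first failure, makes the post-deploy log show the full blast radius in one run.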
Nightly sync:
- [ ] bin/sync_staging runs; anonymises PII; staging Postgres receives copy of production data
- [ ] Staging smoke test passes against post-sync data
Observability baseline:
- [ ] CloudWatch Logs receives structured logs from production-finnest (sample query returns recent entries)
- [ ] Sentry receives a deliberate test exception (trigger via /admin/test-exception — admin-only route — then remove after verification)
- [ ] /metrics endpoint responds with Prometheus format metrics
- [ ] /admin/oban (behind admin auth) shows Oban Web dashboard
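The `/metrics` criterion above can be checked mechanically by testing the response body against the Prometheus text exposition format. The sketch below is a rough shape check, not a full parser: it only looks for at least one `# TYPE` line and one sample line. The function name `looks_like_prometheus` is illustrative.

```shell
# Sketch: succeed if stdin looks like Prometheus text exposition format.
looks_like_prometheus() {
  local body
  body="$(cat)"
  # At least one "# TYPE <metric> <kind>" line.
  printf '%s\n' "$body" |
    grep -Eq '^# TYPE [a-zA-Z_:][a-zA-Z0-9_:]* (counter|gauge|histogram|summary|untyped)$' || return 1
  # At least one sample line: metric name, optional {labels}, then a value.
  printf '%s\n' "$body" |
    grep -Eq '^[a-zA-Z_:][a-zA-Z0-9_:]*(\{.*\})? [^ ]+' || return 1
}

# Example (host shown for illustration; use whichever env is being verified):
#   curl -fsS https://staging-finnest.agentic-ai.au/metrics | looks_like_prometheus
```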
End-to-end smoke test (the Phase 0 deliverable):
- [ ] Smoke test user can log in to production-finnest
- [ ] Home page renders with DaisyUI styles
- [ ] Cmd+K opens; "go to home" navigates
- [ ] "Who am I" via agent chat returns user info (Tier-1 pattern match, $0 cost)
- [ ] Agent chat streams a Claude response (Tier-2 LLM, via AnthropicDirect)
- [ ] Event store has entries; hash chain verify passes for smoke-test org
- [ ] tenant_query_raises_test.exs architecture test still passes in production build
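To illustrate what "hash chain verify passes" means in the criteria above, here is a toy verification loop. The hashing scheme is an assumption made for this sketch: each event's hash is taken to be `sha256(previous_hash + ":" + payload)`, with an empty previous hash for the first event. The real Finnest scheme (F-016) is not described in this story and may differ; the function names are illustrative.

```shell
# Sketch of hash-chain verification.
# ASSUMPTION: hash = sha256(prev_hash ":" payload); the real F-016 scheme may differ.
chain_hash() {
  # $1 = previous hash (empty for the first event), $2 = payload
  printf '%s:%s' "$1" "$2" | sha256sum | cut -d' ' -f1
}

verify_chain() {
  # Reads "hash<TAB>payload" lines on stdin, in chain order.
  local prev="" hash payload expected
  while IFS="$(printf '\t')" read -r hash payload; do
    expected="$(chain_hash "$prev" "$payload")"
    # Any edited payload, or any reordered or removed row, breaks the match here.
    [ "$hash" = "$expected" ] || return 1
    prev="$hash"
  done
  return 0
}
```

Because each hash depends on the one before it, changing any earlier event invalidates every later row, which is why the staging chain has to be its own chain and cannot reuse production's hashes after anonymisation.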
Capacity monitoring:
- [ ] Sustained RAM <85% on staging host after 24h (ADR-014-F upgrade trigger)
- [ ] Sustained RAM <85% on production host after 24h
- [ ] If breached, upgrade ticket opened; t3.medium → t3.large scheduled
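The 85% threshold can be evaluated from `/proc/meminfo` on each host. The sketch below takes a single sample; "sustained" would mean repeating it over the 24h window, which is left to the monitoring setup. Usage is defined here as `(MemTotal - MemAvailable) / MemTotal`, which is an assumption about how the ADR measures RAM; the function names are illustrative.

```shell
# Sketch: compute used-RAM percentage from /proc/meminfo-style text on stdin.
# ASSUMPTION: "used" means MemTotal minus MemAvailable.
ram_used_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) exit 1
      printf "%d\n", (total - avail) * 100 / total
    }
  '
}

ram_breached() {
  # $1 = percent in use, $2 = threshold (defaults to 85).
  # Succeeds (exit 0) when usage is above the threshold.
  local pct="$1" threshold="${2:-85}"
  [ "$pct" -gt "$threshold" ]
}

# Example (on the host):
#   pct="$(ram_used_pct < /proc/meminfo)"
#   ram_breached "$pct" && echo "RAM at ${pct}%: open upgrade ticket"
```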
Phase 0 wrap:
- [ ] sprint-status.yaml Sprint 4 marked completed; velocity recorded
- [ ] stories/phase-0-retro-2026-05-15.md stub created for retro
- [ ] Scout + Verify sprint planning scheduled for 2026-05-15 (Sprint 5 kickoff) with execution learnings from Phase 0
Testing Requirements¶
- Integration (live): deploy → staging; run smoke test Playwright script; all assertions pass
- Integration (live): deploy → production; repeat smoke test
- Co-deploy regression: pre- + post-deploy Laravel health checks from external monitor
- Observability: exception test → confirm Sentry receives it; structured log → confirm CloudWatch receives it
- Capacity: 1-hour `stress --cpu 4` on staging while monitoring RAM; ensure Laravel services don't OOM
References¶
- `../adrs/adr-014-F-infrastructure-reuse-from-agenticai-app.md` §Implementation Plan steps 9 + 10–16
- `../architecture/architecture.md` Part 11 Deployment Architecture
- `../10-GUARDRAILS.md` IN-01 through IN-18, OP-01 through OP-04
- AgenticAI-app `bin/sync_staging`, `config/deploy.production.yml`, `.github/workflows/deploy-production.yml` (reference patterns)
- `../brainstorms/brainstorm-10-development-roadmap.md` §Phase 0 deliverable