ADR-014-F: Infrastructure Reuse Strategy — Co-Deploy Finnest on Existing AgenticAI-app Hosts¶
Status: Accepted
Date: 2026-04-17
Decision Makers: Gautham Chellappa
Depends on: ADR-001-F (Elixir), ADR-010-F (Strangler Fig migration), ADR-0011 (inherited — Elixir migration)
Supersedes: — (clarifies the deployment intent sketched in ADR-0011)
Context¶
AgenticAI-app has four mature AWS ap-southeast-2 EC2 environments (local, integration, staging, production) plus a CI bastion. Each host runs Docker + Kamal, Caddy + Let's Encrypt, MySQL and Redis as Kamal accessories, with Bitwarden Secrets Manager for secret retrieval. Combined operational investment is significant — terraform modules, deploy scripts, GitHub Actions runners on bastion, SSL automation, per-env .env patterns.
ADR-0011 sketched a transition intent: "repurpose integration + staging for AgenticAI-finnest, production stays Laravel until cutover." This ADR formalises the concrete reuse strategy across all four envs and commits to a co-deploy pattern (Laravel + Finnest on the same hosts during transition), with a deliberate plan to manage resource contention, security separation, and cutover risk.
Three host strategies were considered:
- Option A — co-deploy all envs (this ADR) — both apps on each existing host during transition, Laravel decommissioned post-cutover by stopping its services
- Option B — new EC2 instances for every Finnest env (3 extra instances during transition)
- Option C — hybrid (reuse integration + staging; new production for Finnest)
Option A was selected for cost-minimisation and operational consistency (one host pattern to manage across envs). The concerns Option A raises — resource contention, security scoping, cutover risk — are explicitly addressed in §Consequences and §Mitigations below.
Decision¶
Finnest deploys alongside Laravel AgenticAI-app on the existing four EC2 hosts during the transition window. Laravel services remain active throughout go-live and migration phases; Finnest services are added incrementally. Post-cutover (per ADR-010-F Strangler Fig, Migration Phase X decommission), Laravel services are stopped on each host and its accessories (MySQL, Redis, Horizon) retired. Finnest then has the hosts to itself.
Reuse inventory (what's shared)¶
| Component | Reuse |
|---|---|
| 4 × EC2 hosts in ap-southeast-2 | Same hosts; both apps deploy via Kamal |
| AWS accounts (separate per env) | Same accounts; new IAM roles for Finnest |
| CI bastion with 3 GitHub Actions runners | Same runners serve both ci.yml (Laravel) and ci-finnest.yml (Elixir) |
| Bitwarden Secrets Manager | Same org; new finnest project alongside agenticai project |
| `bws` CLI + `scripts/deploy.sh` pattern | Ported pattern: `scripts/deploy-finnest.sh` with identical shape |
| Terraform modules (`app-host`, `ci-bastion`) | Reused as-is; parametrised per-env with Finnest additions |
| Caddy + Let's Encrypt | Same Caddy instance; new virtual hosts per Finnest subdomain |
| Docker runtime | Same; Finnest uses Debian-slim Elixir images |
| Kamal orchestration | Same; new config/deploy.finnest.*.yml files |
| Route 53 zone `agentic-ai.au` | Same zone; new Finnest subdomains |
| Sentry + Grafana + Prometheus stack | Same self-hosted observability pattern per ADR-010 Part 10 |
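The per-subdomain Caddy virtual hosts can be sketched as Caddyfile blocks. This is a minimal sketch, assuming Caddy v2 (where `reverse_proxy` forwards WebSocket upgrade requests transparently, so the LiveView paths mostly need correct routing to the Finnest upstream) and assuming upstream ports — 4000 is Phoenix's conventional default, and the Laravel port is purely illustrative:

```
# Hedged Caddyfile sketch — upstream ports are assumptions, not confirmed values.
integration-finnest.agentic-ai.au {
    reverse_proxy 127.0.0.1:4000   # Finnest container; Let's Encrypt cert automatic
}

integration.agentic-ai.au {
    reverse_proxy 127.0.0.1:8080   # existing Laravel container (port assumed)
}
```

Both blocks live in the same Caddy instance, which is what makes side-by-side URL testing possible without any DNS change.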
New provisioning (what's added per host)¶
| Component | Added |
|---|---|
| PostgreSQL 17 | New Kamal accessory per host (alongside existing MySQL) |
| Finnest app container (Elixir release) | New service |
| Oban-owned background processing | No separate accessory (Postgres-backed) |
| S3 buckets: `finnest-{integration,staging,production}-storage` | New buckets in same AWS accounts |
| Caddy config blocks for WebSocket upgrade (`/live/websocket`, `/socket/*`) | New config |
| Route 53 records: `integration-finnest.agentic-ai.au`, `staging-finnest.agentic-ai.au`, `app-finnest.agentic-ai.au` | New A/CNAME records |
| Postgres read replica on staging + production | New (post-Phase 1) |
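The Postgres 17 accessory above can be sketched as a Kamal accessory entry. This is a hedged fragment of a hypothetical `config/deploy.finnest.production.yml` — the host IP (a TEST-NET placeholder), database/user names, and volume path are assumptions, not confirmed values:

```yaml
# Hedged sketch of the Postgres 17 Kamal accessory; names and host are assumed.
accessories:
  postgres:
    image: postgres:17
    host: 192.0.2.10                 # placeholder — real host comes from Terraform
    port: "127.0.0.1:5432:5432"      # bind loopback only; co-located app connects locally
    env:
      clear:
        POSTGRES_DB: finnest_production
        POSTGRES_USER: finnest
      secret:
        - POSTGRES_PASSWORD          # resolved from Bitwarden at deploy time
    directories:
      - data:/var/lib/postgresql/data
```

Binding the published port to loopback keeps the accessory reachable from the co-deployed Finnest container without exposing Postgres on the host's public interface.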
What's NOT reused¶
| Component | Replacement |
|---|---|
| MySQL | Postgres (AR-13). MySQL remains active on each host for Laravel + Finnest V2Repo read-only access during Strangler Fig; retired when Laravel is retired. |
| Redis + Horizon | Not needed. Oban is Postgres-backed (AR-11); ETS covers cache (B02). Redis retired when Laravel is retired. |
| Horizon dashboard | Oban Web |
| `/up` healthcheck path | `/health` (shallow) + `/ready` (deep) per architecture Part 9 |
| `.env.docker` per-env file pattern | `config/runtime.exs` env branching (Elixir convention) |
| `php artisan schedule:work` | Oban cron |
Domain naming during transition¶
Finnest subdomains in the existing agentic-ai.au zone:
| Env | Laravel URL (unchanged) | Finnest URL |
|---|---|---|
| Integration | `integration.agentic-ai.au` | `integration-finnest.agentic-ai.au` |
| Staging | `staging.agentic-ai.au` | `staging-finnest.agentic-ai.au` |
| Production | `app.agentic-ai.au` | `app-finnest.agentic-ai.au` |
At cutover (per ADR-010-F Migration Phase X): app.agentic-ai.au CNAME-swaps to Finnest host; Laravel services stopped but image retained on host for 30-day rollback window; finally, zone-level migration to hexis.au (commercial name per brainstorm-11 Topic 4) is executed separately when commercial launch is ready.
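The cutover CNAME swap above can be expressed as a Route 53 change batch (applied with `aws route53 change-resource-record-sets`). A sketch, in which the TTL value is an assumption — lowering it ahead of cutover day shortens DNS propagation for both the swap and any rollback:

```json
{
  "Comment": "ADR-014-F cutover: point app.agentic-ai.au at Finnest",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.agentic-ai.au",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{ "Value": "app-finnest.agentic-ai.au" }]
      }
    }
  ]
}
```

Reversing the swap during the 72h parallel run is the same `UPSERT` with the old target, which is what makes DNS-level rollback cheap while Laravel is still running.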
Capacity planning¶
Current host utilisation (Laravel only):
| Host | Instance | Approx. utilisation |
|---|---|---|
| Integration (t3.medium) | 2 vCPU / 4 GB | ~35% CPU, 55% RAM |
| Staging (t3.medium) | 2 vCPU / 4 GB | ~40% CPU, 60% RAM |
| Production (t3.large) | 2 vCPU / 8 GB | ~55% CPU, 70% RAM |
| CI Bastion (t3.large) | 2 vCPU / 8 GB | ~40% CPU, 50% RAM |
Projected co-deploy utilisation (Laravel + Finnest, Phase 1):
| Host | Target RAM budget |
|---|---|
| Integration + Staging (t3.medium, 4 GB) | Laravel ~1.5 GB + Postgres ~1 GB + Finnest BEAM ~1 GB + OS ~500 MB = ~4 GB — TIGHT. Upgrade trigger: sustained >85% RAM or OOM events. Upgrade target: t3.large (8 GB) before Phase 1 end. |
| Production (t3.large, 8 GB) | Laravel ~2 GB + Postgres ~3 GB + Finnest BEAM ~2 GB + OS ~500 MB = ~7.5 GB. Some headroom; upgrade to t3.xlarge if Phase 2 load exceeds. |
Upgrade budget: ~$50/month per env if t3.medium → t3.large, ~$100/month per env if t3.large → t3.xlarge. Absorbed into Phase 1 infra costs; re-evaluated at each phase boundary.
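The RAM budgets above can be recomputed as a back-of-envelope check. All figures are the ADR's own estimates (in GB), not measurements:

```python
# Back-of-envelope recomputation of the Phase 1 RAM budgets above.
# Component figures (GB) are the ADR's own estimates, not measurements.
HOSTS = {
    "integration/staging (t3.medium)": {
        "total_gb": 4.0,
        "budget_gb": {"laravel": 1.5, "postgres": 1.0, "finnest_beam": 1.0, "os": 0.5},
    },
    "production (t3.large)": {
        "total_gb": 8.0,
        "budget_gb": {"laravel": 2.0, "postgres": 3.0, "finnest_beam": 2.0, "os": 0.5},
    },
}

def ram_budget(host: dict) -> tuple[float, float]:
    """Return (budgeted GB, remaining headroom GB) for a host entry."""
    used = sum(host["budget_gb"].values())
    return used, host["total_gb"] - used

for name, host in HOSTS.items():
    used, headroom = ram_budget(host)
    print(f"{name}: {used:.1f} GB budgeted, {headroom:.1f} GB headroom")
```

The t3.medium budget sums to the full 4 GB with zero headroom — the arithmetic behind the "TIGHT" label and the pre-emptive t3.large upgrade trigger.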
Production promotion discipline¶
Added 2026-04-18 via Sprint 4 solutioning gate check (D3). Same pattern as AgenticAI-app production deploy.
Production deploys (both Laravel and Finnest) require human gating. Auto-on-main applies only to integration + staging. Production promotion enforces four controls:
- Manual trigger only — production deploys fire from a dedicated `workflow_dispatch` workflow (`.github/workflows/deploy-finnest-production.yml`), never on push or merge. The workflow accepts a required `confirmation` input (e.g. matching the target host name) to prevent accidental triggers.
- GitHub environment protection with 2 reviewers — the `production-finnest` GitHub environment (and the equivalent `production` environment for Laravel) requires two distinct reviewers with write access to approve the job before the workflow proceeds past the protection gate. The reviewer list is maintained in GitHub org settings; rotations are tracked in the ops runbook.
- Pre-deploy smoke gate — the workflow fails fast if the staging smoke suite has not run clean within the prior 24 hours. `scripts/deploy-finnest-smoke.sh staging` must pass; the timestamp is checked against `git log --first-parent staging...main`.
- Post-deploy smoke gate — the workflow runs `scripts/deploy-finnest-smoke.sh production` immediately after the `kamal deploy` step; a failing post-deploy smoke auto-triggers `scripts/deploy-finnest.sh --rollback production` and raises an alert in the ops channel. No silent regressions.
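The four controls above can be sketched as a workflow skeleton. This is a hedged sketch, not the actual file — the expected confirmation string and step contents are assumptions for illustration:

```yaml
# Hedged sketch of .github/workflows/deploy-finnest-production.yml.
# The expected confirmation value and step bodies are assumptions.
name: deploy-finnest-production
on:
  workflow_dispatch:                  # control 1: manual trigger only
    inputs:
      confirmation:
        description: "Type the production host name to confirm"
        required: true
jobs:
  deploy:
    runs-on: self-hosted              # existing bastion runners
    environment: production-finnest   # control 2: 2-reviewer protection gate
    steps:
      - uses: actions/checkout@v4
      - name: Guard against accidental triggers
        run: |
          [ "${{ inputs.confirmation }}" = "app-finnest.agentic-ai.au" ] \
            || { echo "confirmation mismatch"; exit 1; }
      - name: Deploy                  # pre-deploy smoke freshness check (control 3)
        run: scripts/deploy-finnest.sh production   # assumed to enforce the staging smoke gate
      - name: Post-deploy smoke with auto-rollback   # control 4
        run: |
          scripts/deploy-finnest-smoke.sh production || {
            scripts/deploy-finnest.sh --rollback production
            exit 1
          }
```

The `environment:` key is what attaches GitHub's reviewer-approval gate; the job pauses there until two reviewers approve, before any step runs.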
Rationale: production is live customer workload for Laravel and will be live customer workload for Finnest post-cutover. The cost of an accidental bad deploy — data loss, Laravel regression on the shared host, customer-facing downtime — is asymmetrically higher than the ~5 min of human latency that two-reviewer gating adds. Matches the AgenticAI-app production policy adopted at Laravel launch.
Alternatives Considered¶
| Alternative | Rejected because |
|---|---|
| Option B — new EC2 instances for every Finnest env | +3 instances during transition (~$150/mo); operational overhead of managing two parallel infrastructure footprints; Laravel decommission becomes an instance termination exercise per env rather than a service-level stop |
| Option C — hybrid (reuse integration+staging, new production) | Reasonable but still 1 extra instance; operational inconsistency (production differs from lower envs in transition strategy); minor cost saving not worth the extra complexity |
| Option D — reuse all + cutover in place on same host at cutover day | Highest cutover risk (if Finnest boot fails on cutover day, Laravel is already stopped on same host); rollback requires re-deploying Laravel from image in emergency conditions |
| New Route 53 zone `hexis.au` from Phase 0 | Commercial Hexis branding isn't ready at Phase 0 start; purchasing + SSL provisioning + zone migration adds unnecessary Phase 0 work. Zone migration is an independent late-phase task |
| Different secrets manager per app (Laravel on Bitwarden, Finnest on SSM) | Bitwarden carry-forward decided in architecture Part 8; one tool simpler operationally |
Consequences¶
Positive:
- Zero new EC2 instances during transition — direct cost saving ~$70-150/month vs Options B/C
- One set of hosts, one set of SSL configs, one set of Terraform modules, one CI bastion to know
- Laravel decommission = stop services on each host (no instance termination / data migration)
- Postgres co-located with MySQL lets `V2Repo` MyXQL queries traverse loopback (low latency) during Strangler Fig
- Team learns one host topology; on-call procedures apply to both apps
- Caddy routing enables immediate side-by-side URL testing without DNS cutover
Negative:
- Resource contention risk on t3.medium hosts (integration + staging). Peak load could cause either app to impact the other. Mitigation: upgrade to t3.large before any real traffic on Finnest beyond smoke tests. Monitor via CloudWatch CPUUtilization + MemoryUtilization.
- Security blast radius is larger — a compromise of the host compromises both apps' data. Mitigation: separate DB users with minimum privileges (the Finnest Postgres user cannot read MySQL; the Finnest `V2Repo` MyXQL connection uses a read-only user with explicit `SELECT` grants only); secrets scoped per project in Bitwarden.
- Operational coordination during transition — a Kamal deploy of one app requires care not to disturb the other's accessories. Mitigation: Kamal's per-app namespacing (`agenticai-app-*` containers vs `finnest-*` containers) is already isolation-friendly; documented playbook for each host's expected services.
- Cutover rollback window — if Finnest prod goes wrong post-cutover, rollback requires re-enabling Laravel on the same host. Mitigation: Laravel image + DB state retained for 30 days post-cutover; scripted re-enable (`scripts/restore-laravel.sh`) prepared in advance and tested in staging; cutover checklist requires a successful staging cutover rehearsal first.
- Dependency drift risk — upgrades to shared components (Docker, Caddy, OS packages) now affect both apps. Mitigation: lower-env upgrades always run a day ahead of production; dependency upgrades tracked in a shared changelog per host.
Mitigations summary¶
| Risk | Mitigation |
|---|---|
| t3.medium RAM exhaustion | Monitor; upgrade to t3.large before Phase 1 end |
| Finnest Postgres accidentally writes to Laravel MySQL | Separate DB users; V2Repo connection string uses a role with SELECT grants only (DA-01); architecture test asserts no INSERT/UPDATE/DELETE statements against V2Repo |
| Cross-app secret leakage via Bitwarden | Separate agenticai and finnest Bitwarden projects; machine account tokens per project; neither app's deploy script reads the other's tokens |
| Caddy config breaking WebSocket for LiveView | Staged Caddy config change with rollback; synthetic WebSocket test from CI before merge |
| Production cutover failure | Parallel run pattern: on cutover day, DNS switches app.agentic-ai.au to Finnest but Laravel services remain running for 72h; Laravel only stopped after 72h of clean Finnest operation |
| Kamal deploy of one app disturbs the other | Per-app Docker network namespacing; accessory containers prefixed per app (agenticai-app-mysql vs finnest-postgres); shutdown-order documented in runbooks |
| IRAP deployment bleeds into commercial hosts | IRAP is entirely separate (ADR-007-F). None of this ADR applies to IRAP. IRAP hosts are provisioned independently in a dedicated VPC with no peering to these commercial hosts. |
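The read-only `V2Repo` mitigation (DA-01) can be sketched as MySQL DDL. A hedged sketch — the user name `finnest_ro`, the `agenticai` schema name, and the Bitwarden placeholder are all hypothetical:

```sql
-- Hedged sketch: read-only MySQL user for Finnest's V2Repo (DA-01).
-- 'finnest_ro', the 'agenticai' schema, and the password placeholder are hypothetical.
CREATE USER 'finnest_ro'@'127.0.0.1' IDENTIFIED BY '<password-from-bitwarden>';
GRANT SELECT ON agenticai.* TO 'finnest_ro'@'127.0.0.1';
-- No INSERT/UPDATE/DELETE grants: writes fail at the DB layer,
-- which backs the architecture test asserting read-only V2Repo usage.
FLUSH PRIVILEGES;
```

Restricting the user to `127.0.0.1` also enforces the loopback-only access path noted under Consequences — the grant is unusable from off-host.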
Review triggers¶
Re-evaluate this ADR if any of:
- Finnest production load causes sustained >75% RAM on production host even after instance upgrade
- Security incident traces to cross-app state leakage
- Cutover timeline materially changes such that long-term co-deploy (>12 months) becomes expected
- Hexis commercial launch requires production environment rename before Finnest is ready for cutover
Implementation Plan¶
Phase 0 (Weeks 1–4):
- Add `finnest` project to Bitwarden Secrets Manager; generate machine account token
- Extend Terraform `app-host` module to accept a `finnest_enabled: true` flag; provision Postgres Kamal accessory when set
- Apply Terraform to integration host first — provisions Postgres + Finnest S3 bucket + Route 53 `integration-finnest.agentic-ai.au`
- Create `config/deploy.finnest.integration.yml` Kamal config
- Create `scripts/deploy-finnest.sh` (mirror of the `scripts/deploy.sh` pattern)
- Create `.github/workflows/ci-finnest.yml` using existing bastion runners
- Add Caddy virtual host + WebSocket upgrade config
- First Finnest deploy to integration — smoke test
- Repeat for staging and production hosts
Phase 1+ (per ADR-010-F):
- Strangler Fig migration phases run; Finnest takes over v2 capabilities domain-by-domain
- Capacity monitoring runs continuously; upgrade triggers acted on within 1 week
- Cutover rehearsal in staging before production cutover
Migration Phase X (post-cutover):
- Laravel services stopped on each host; images retained 30 days
- MySQL and Redis Kamal accessories stopped; volumes retained 30 days
- Post-30-day: MySQL + Redis volumes purged; Laravel images removed; host renamed in Terraform (optional)
- Zone migration `agentic-ai.au` → `hexis.au` as a separate commercial workstream
Relationship to Guardrails¶
Enforces / is enforced by: ADR-001-F (Elixir), ADR-010-F (Strangler Fig phases), ADR-007-F (IRAP is separate — this ADR excluded from IRAP scope), DA-01 (V2Repo read-only), IN-01 through IN-04 (Docker + Elixir releases), IN-08 (health check), SE-11/12 (CI security scanning), SE-15 (Postgres TLS).