You’re now asking platform-level questions — good. This is where most “we run EKS” claims fall apart. Multi-cluster is not just more clusters — it’s control-plane separation + cross-cluster data flows.

I’ll break this into real production patterns, not theory:

1️⃣ Multi-cluster networking patterns
2️⃣ Platform cluster vs prod cluster responsibilities
3️⃣ How GitOps (Argo CD) controls prod clusters
4️⃣ Cross-cluster logging & monitoring
5️⃣ How Fluent Bit sends logs to another EKS cluster (actual flow)

No fluff — just how big shops really wire this.

🧭 Multi-Cluster Networking Patterns (EKS at Scale)

✅ Pattern A — Hub & Spoke (Most Common)

Shared Services VPC (Hub)
 ├── platform EKS
 ├── logging stack
 ├── monitoring stack
 └── gitops controllers

Spoke VPCs
 ├── prod EKS
 ├── staging EKS
 └── dev EKS

Connected using:

AWS Transit Gateway  ← preferred
or
VPC Peering          ← small setups

Why TGW wins:

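route scale: one attachment per VPC instead of a mesh of pairwise, non-transitive peering connections
easier expansion: a new cluster VPC is one more attachment, not N new peerings
central inspection: traffic can be routed through a shared inspection VPC
security control: TGW route tables decide which spokes can reach each other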
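
As a rough sketch of the hub-and-spoke wiring, the CloudFormation fragment below attaches one spoke VPC to a Transit Gateway and routes hub-bound traffic through it. It assumes a single AWS account; every parameter name and the 10.0.0.0/16 hub CIDR are placeholders, and the hub VPC would need its own attachment and return route.

```yaml
# Sketch only: parameter names and the hub CIDR are assumptions.
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  ProdVpcId:
    Type: AWS::EC2::VPC::Id
  ProdPrivateSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
  ProdPrivateRouteTableId:
    Type: String
Resources:
  FleetTransitGateway:
    Type: AWS::EC2::TransitGateway
    Properties:
      Description: Hub for platform and workload VPCs
  ProdAttachment:
    # Connects the prod (spoke) VPC to the Transit Gateway
    Type: AWS::EC2::TransitGatewayAttachment
    Properties:
      TransitGatewayId: !Ref FleetTransitGateway
      VpcId: !Ref ProdVpcId
      SubnetIds: !Ref ProdPrivateSubnetIds
  ProdToHubRoute:
    # Sends traffic for the hub VPC through the Transit Gateway
    Type: AWS::EC2::Route
    DependsOn: ProdAttachment
    Properties:
      RouteTableId: !Ref ProdPrivateRouteTableId
      DestinationCidrBlock: 10.0.0.0/16   # hub VPC CIDR (assumed)
      TransitGatewayId: !Ref FleetTransitGateway
```

Adding another spoke means repeating only the attachment and route resources, which is the "easier expansion" point in practice.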

✅ Pattern B — Private API + Bastion / SSM

Prod cluster API is private only.

Access path:

Platform cluster tools
→ SSM / bastion / runner
→ private endpoint
→ prod cluster

If your prod API endpoint is open to the internet at large scale, that is weak design. At minimum restrict it to known CIDRs; preferably make it private-only.
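
A minimal sketch of a private-only API endpoint, assuming the cluster is managed with eksctl. The cluster name and region are placeholders.

```yaml
# eksctl ClusterConfig sketch; name and region are assumptions.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod
  region: us-east-1
vpc:
  clusterEndpoints:
    privateAccess: true    # API reachable from inside the VPC and networks routed to it
    publicAccess: false    # no internet-facing API endpoint
```

With public access off, anything that talks to the API (runners, Argo CD, operators' kubectl) has to sit on a network that can route to the cluster VPC, which is why the SSM / bastion / runner hop exists.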

🧱 Platform EKS Cluster — What Lives There

This is not for business apps.

This cluster runs platform controllers:

GitOps controllers
Cluster fleet managers
Observability backends
Security scanners
Policy engines

Typical platform cluster workloads:

Argo CD
Thanos / Cortex
Loki / OpenSearch
Central Prometheus
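Falco event aggregation (the Falco agents themselves run on every cluster's nodes)
OPA Gatekeeper policy library (the admission controller itself runs in each cluster it enforces)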
ExternalDNS controller
Cert-manager (sometimes centralized)

Think of it as the control and visibility plane.

🚀 How GitOps Controls Prod Clusters

Using: 👉 Argo CD

Argo CD does NOT need to run inside every prod cluster.

Two models:

Model 1 — Central Argo CD (Platform Cluster)

ArgoCD (platform cluster)
   ↓
connects to
   ↓
Prod EKS API server
Staging EKS API server
Dev EKS API server

How?

argocd cluster add <context>

This creates:

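a service account (argocd-manager) in the target cluster
RBAC bindings for that service account
a cluster secret in Argo CD holding the API endpoint and bearer token

Argo then deploys remotely.

Security requirement:

Scope that service account tightly. By default, argocd cluster add binds it to a wildcard ClusterRole, which is effectively cluster-admin, so restrict it (for example to specific namespaces) unless cluster-wide access is really required.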
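
For teams that prefer to register clusters declaratively, Argo CD also reads cluster definitions from labeled Secrets. The sketch below assumes Argo CD runs in the argocd namespace; the server URL, namespace list, token, and CA data are placeholders.

```yaml
# Declarative cluster registration (sketch). All values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # marks this Secret as a cluster registration
type: Opaque
stringData:
  name: prod
  server: https://prod-api.example.internal
  # Limit Argo CD to these namespaces instead of the whole cluster
  namespaces: payments,checkout
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded-CA-certificate>"
      }
    }
```

The namespaces field only tells Argo CD where to look. The actual enforcement is the RBAC granted to the service account in the target cluster, so the two should agree.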

Model 2 — Argo Per Cluster (Highly Regulated Orgs)

Each cluster runs its own Argo CD. Platform team controls repos & policies.

Used when:

compliance is strict
cluster isolation is required
no cross-cluster API trust is allowed

📊 Cross-Cluster Monitoring Pattern

Metrics travel outward, not inward.

Prod clusters run:

Prometheus agent / remote_write

They push to:

Central metrics backend
(Thanos / AMP / Cortex)
in platform VPC

Avoid scraping prod from outside the cluster. It needs inbound paths into prod and tends to time out across VPCs. Push instead.
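
A minimal sketch of the prod-side Prometheus configuration for this push model. The hostname is a placeholder, and the /api/v1/receive path assumes a Thanos Receive backend; Cortex and AMP use different paths, and AMP additionally needs SigV4 signing.

```yaml
# prometheus.yml on the prod cluster (sketch; URL and label values are assumptions)
global:
  external_labels:
    cluster: prod            # stamped on every series so the central backend can tell clusters apart
remote_write:
  - url: https://metrics.platform.internal/api/v1/receive
    queue_config:
      max_shards: 20               # upper bound on parallel senders
      max_samples_per_send: 2000   # batch size per request
```

The external label plays the same role for metrics that the cluster label plays for logs: without it, series from different clusters collide in the central store.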

🪵 Cross-Cluster Logging — Fluent Bit Pattern

Using: 👉 Fluent Bit

Reality: Fluent Bit does NOT “send to another cluster”

It sends to a central log backend endpoint.

Example targets:

Loki
OpenSearch
Elasticsearch
Kafka
Kinesis
CloudWatch

That backend may be running inside platform EKS — but it’s exposed as a service endpoint.

✅ Real Flow — Fluent Bit → Central Loki (Platform Cluster)

Prod cluster:

FluentBit DaemonSet
  tail /var/log/containers
  enrich with k8s metadata
  output → https://loki.platform.internal

Platform cluster:

Loki gateway
Service type: LoadBalancer, provisioned as an internal NLB
Private DNS name
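
The platform-side pieces above could look roughly like this, assuming the AWS Load Balancer Controller and ExternalDNS are installed in the platform cluster. The namespace, selector labels, and target port are guesses about the Loki deployment, and TLS termination (at the NLB or at the gateway) is left out for brevity.

```yaml
# Internal NLB in front of the Loki gateway (sketch; names and ports are assumptions)
apiVersion: v1
kind: Service
metadata:
  name: loki-gateway
  namespace: logging
  annotations:
    # Let the AWS Load Balancer Controller provision an internal NLB with pod IP targets
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal
    # ExternalDNS creates the private DNS record for the NLB
    external-dns.alpha.kubernetes.io/hostname: loki.platform.internal
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: loki
    app.kubernetes.io/component: gateway
  ports:
    - name: http
      port: 3100
      targetPort: 8080   # gateway container port (assumed)
```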

Networking:

Prod VPC → Transit Gateway → Platform VPC → Loki NLB

No cluster-to-cluster kube traffic required. Just HTTPS to logging endpoint.

Example Fluent Bit Output Config

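[OUTPUT]
  # Route every record to Loki; narrow the pattern if you tag by source
  Name        loki
  Match       *
  Host        loki.platform.internal
  # Must match the listener port on the Loki endpoint (3100 is Loki's default HTTP port)
  Port        3100
  tls         on
  tls.verify  on
  # Static labels attached to every log stream from this cluster
  labels      cluster=prod,team=payments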

Add a cluster label. It is critical for multi-cluster search.

🔐 Security Controls You Must Add

At this scale — skipping these is negligence:

mTLS or TLS to logging backend
VPC-internal endpoints only
IAM auth if using OpenSearch/Kinesis
NetworkPolicy allowlist egress
Per-cluster log labels
Rate limiting
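
The egress allowlist item can be sketched as a NetworkPolicy on the Fluent Bit pods. This assumes a CNI that enforces NetworkPolicy, Fluent Bit running in a logging namespace with the label shown, a platform VPC CIDR of 10.0.0.0/16, and a prod VPC CIDR of 10.1.0.0/16; all of those are placeholders.

```yaml
# Egress allowlist for Fluent Bit (sketch; labels, namespace, and CIDRs are assumptions)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fluent-bit-egress
  namespace: logging
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: fluent-bit
  policyTypes:
    - Egress
  egress:
    # Log shipping to the central Loki endpoint in the platform VPC
    - to:
        - ipBlock:
            cidr: 10.0.0.0/16
      ports:
        - protocol: TCP
          port: 3100
    # Kubernetes API server, needed for metadata enrichment
    - to:
        - ipBlock:
            cidr: 10.1.0.0/16
      ports:
        - protocol: TCP
          port: 443
    # Cluster DNS, needed to resolve the Loki hostname
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Everything not listed is dropped, so a compromised or misconfigured log agent cannot ship data anywhere except the central backend.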

⚠️ Hard Truth — What Breaks First in Multi-Cluster

From real failures:

Central logging cluster becomes bottleneck
No per-cluster labels → log chaos
Argo CD given cluster-admin everywhere
No ingestion rate limits → one noisy cluster floods the logging backend
Scrape-based metrics across VPCs → timeouts
Public prod API endpoints → abused
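
If the central backend is Loki, the bottleneck and flooding failures above are usually contained with ingestion limits on the server side. A minimal sketch, assuming each cluster writes as its own tenant (Fluent Bit's tenant_id option sets the X-Scope-OrgID header); the numbers are illustrative, not recommendations.

```yaml
# Loki server config fragment (sketch; limit values are illustrative)
auth_enabled: true                  # tenants are identified by the X-Scope-OrgID header
limits_config:
  ingestion_rate_mb: 8              # sustained ingest rate per tenant, in MB/s
  ingestion_burst_size_mb: 16       # short burst allowance per tenant
  per_stream_rate_limit: 3MB        # cap for any single log stream
  per_stream_rate_limit_burst: 15MB
```

With per-tenant limits, a runaway cluster gets rejected at the gateway instead of starving every other cluster's log ingestion.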

If you want, next I can go deeper into:

exact Transit Gateway route tables for multi-cluster
Argo CD multi-cluster RBAC design
Loki vs OpenSearch tradeoffs at scale
cross-cluster service discovery patterns
zero-trust multi-cluster networking model

Pick one — we’ll go surgical.
