You're now asking platform-level questions, which is good. This is where most "we run EKS" claims fall apart. Multi-cluster is not just more clusters: it's control-plane separation plus cross-cluster data flows.
I'll break this into real production patterns, not theory:
1. Multi-cluster networking patterns
2. Platform cluster vs prod cluster responsibilities
3. How GitOps (Argo CD) controls prod clusters
4. Cross-cluster logging & monitoring
5. How Fluent Bit sends logs to another EKS cluster (actual flow)
No fluff, just how big shops really wire this.
Multi-Cluster Networking Patterns (EKS at Scale)
Pattern A: Hub & Spoke (Most Common)
Shared Services VPC (Hub)
├── platform EKS
├── logging stack
├── monitoring stack
└── gitops controllers
Spoke VPCs
├── prod EKS
├── staging EKS
└── dev EKS
Connected using:
AWS Transit Gateway (preferred)
or
VPC Peering (small setups)
Why TGW wins:
- route scale (peering is non-transitive, so every VPC pair needs its own connection)
- easier expansion
- central inspection
- security control
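To make the hub-and-spoke wiring concrete, here is a rough CloudFormation sketch of one spoke. It is an illustration only: the parameter names, the 10.100.0.0/16 platform CIDR, and the resource names are assumptions, and the platform VPC attachment plus the return route are left out.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of one spoke (prod VPC) attached to a Transit Gateway hub

Parameters:
  ProdVpcId:
    Type: AWS::EC2::VPC::Id
  ProdSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
  ProdRouteTableId:
    Type: String
  PlatformVpcCidr:
    Type: String
    Default: 10.100.0.0/16   # hypothetical platform VPC range

Resources:
  HubTransitGateway:
    Type: AWS::EC2::TransitGateway
    Properties:
      Description: Shared services hub

  # Attach the prod VPC (spoke) to the hub
  ProdAttachment:
    Type: AWS::EC2::TransitGatewayVpcAttachment
    Properties:
      TransitGatewayId: !Ref HubTransitGateway
      VpcId: !Ref ProdVpcId
      SubnetIds: !Ref ProdSubnetIds

  # Send traffic destined for the platform VPC through the hub
  ProdToPlatformRoute:
    Type: AWS::EC2::Route
    DependsOn: ProdAttachment
    Properties:
      RouteTableId: !Ref ProdRouteTableId
      DestinationCidrBlock: !Ref PlatformVpcCidr
      TransitGatewayId: !Ref HubTransitGateway
```

Adding another spoke means one more attachment and one more route, not a new peering per VPC pair.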
Pattern B: Private API + Bastion / SSM
Prod cluster API is private only.
Access path:
Platform cluster tools
→ SSM / bastion / runner
→ private endpoint
→ prod cluster
If your prod API is public at large scale, that's weak design.
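If the cluster is managed with eksctl (an assumption, nothing above depends on it), the private-only API endpoint could be declared roughly like this. The cluster name and region are placeholders.

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: prod            # placeholder cluster name
  region: us-east-1     # placeholder region

vpc:
  clusterEndpoints:
    publicAccess: false   # no internet-facing API endpoint
    privateAccess: true   # API reachable only from inside the VPC / connected networks
```

With this in place, anything that talks to the prod API (runners, bastions, a central Argo CD) needs a network path into the VPC, which is exactly what the access path above describes.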
Platform EKS Cluster: What Lives There
This is not for business apps.
This cluster runs platform controllers:
GitOps controllers
Cluster fleet managers
Observability backends
Security scanners
Policy engines
Typical platform cluster workloads:
Argo CD
Thanos / Cortex
Loki / OpenSearch
Central Prometheus
Falco / security agents
OPA Gatekeeper
ExternalDNS controller
Cert-manager (sometimes centralized)
Think of it as the control & visibility plane.
How GitOps Controls Prod Clusters
Using: Argo CD
Argo CD does NOT need to run inside every prod cluster.
Two models:
Model 1: Central Argo CD (Platform Cluster)
Argo CD (platform cluster)
↓
connects to
↓
Prod EKS API server
Staging EKS API server
Dev EKS API server
How?
argocd cluster add <context>
This creates:
- service account in target cluster
- RBAC bindings
- token secret stored in Argo
Argo then deploys remotely.
Security requirement:
By default, argocd cluster add binds that service account to an admin-level ClusterRole. Scope it down to what Argo actually needs; don't leave it cluster-admin unless required.
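For reference, the "token secret stored in Argo" is an ordinary Kubernetes Secret in the Argo CD namespace, so it can also be managed declaratively. A minimal sketch follows; the secret name, the `argocd` namespace, and every angle-bracket value are placeholders.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prod-cluster            # placeholder name
  namespace: argocd             # assumes Argo CD runs in the "argocd" namespace
  labels:
    # This label is how Argo CD recognises the secret as a cluster definition
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: prod
  server: https://<prod-api-endpoint>
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded-ca-cert>"
      }
    }
```

The bearer token belongs to the service account in the target cluster, so whatever RBAC that account has is what Argo CD can do there.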
Model 2: Argo Per Cluster (Highly Regulated Orgs)
Each cluster runs its own Argo CD. Platform team controls repos & policies.
Used when:
- compliance is strict
- cluster isolation is required
- no cross-cluster API trust is allowed
Cross-Cluster Monitoring Pattern
Metrics travel outward, not inward.
Prod clusters run:
Prometheus agent / remote_write
They push to:
Central metrics backend
(Thanos / AMP / Cortex)
in platform VPC
Never scrape prod from outside; push instead.
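A minimal sketch of the push side, as a Prometheus config fragment for the prod cluster. The URL is hypothetical, and the path depends on the backend (Cortex, Thanos Receive, and AMP each expose a different remote-write path), so treat it as a placeholder.

```yaml
global:
  external_labels:
    cluster: prod        # stamped on every series so the central backend can tell clusters apart

remote_write:
  # Hypothetical internal endpoint; replace host and path with your backend's remote-write URL
  - url: https://metrics.platform.internal/api/v1/push
```

The `external_labels` entry does the same job for metrics that the cluster label does for logs.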
Cross-Cluster Logging: Fluent Bit Pattern
Using: Fluent Bit
Reality: Fluent Bit does NOT "send to another cluster"
It sends to a central log backend endpoint.
Example targets:
Loki
OpenSearch
Elasticsearch
Kafka
Kinesis
CloudWatch
That backend may be running inside platform EKS, but it's exposed as a service endpoint.
Real Flow: Fluent Bit → Central Loki (Platform Cluster)
Prod cluster:
Fluent Bit DaemonSet
tail /var/log/containers
enrich with k8s metadata
output → https://loki.platform.internal
Platform cluster:
Loki gateway
Service type: LoadBalancer (internal NLB)
Private DNS name
Networking:
Prod VPC → Transit Gateway → Platform VPC → Loki NLB
No cluster-to-cluster kube traffic required. Just HTTPS to the logging endpoint.
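On the platform side, the "internal NLB" piece could look roughly like the Service below. This sketch assumes the AWS Load Balancer Controller is installed; the Service name, namespace, and selector labels are made up for illustration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: loki-gateway          # hypothetical name
  namespace: logging          # hypothetical namespace
  annotations:
    # Hand the Service to the AWS Load Balancer Controller and request an NLB
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    # Internal scheme: no public IPs, reachable only over VPC / Transit Gateway routes
    service.beta.kubernetes.io/aws-load-balancer-scheme: internal
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: loki-gateway   # hypothetical pod label
  ports:
    - name: loki
      port: 3100
      targetPort: 3100
```

A private DNS record such as loki.platform.internal would then point at the NLB. Where TLS terminates (NLB listener or the gateway itself) is a separate choice not shown here.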
Example Fluent Bit Output Config
[OUTPUT]
    # Match selects which records this output receives; * means all of them
    Name    loki
    Match   *
    Host    loki.platform.internal
    Port    3100
    tls     on
    labels  cluster=prod,team=payments
Add the cluster label: it is critical for multi-cluster search.
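Putting the whole prod-side flow together (tail, enrich, ship), here is a sketch in Fluent Bit's YAML config format rather than the classic format used above. The `kube.*` tag and the multiline parser choice are assumptions; adjust them to your container runtime and log layout.

```yaml
pipeline:
  inputs:
    # Read container logs written by the kubelet / container runtime
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
      multiline.parser: docker, cri

  filters:
    # Enrich each record with pod, namespace, and label metadata
    - name: kubernetes
      match: kube.*
      merge_log: on

  outputs:
    # Ship to the central Loki endpoint in the platform VPC
    - name: loki
      match: kube.*
      host: loki.platform.internal
      port: 3100
      tls: on
      labels: cluster=prod,team=payments
```

Only the output stage knows about the platform cluster, and all it knows is a hostname and a port.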
Security Controls You Must Add
At this scale, skipping these is negligence:
mTLS or TLS to logging backend
VPC-internal endpoints only
IAM auth if using OpenSearch/Kinesis
NetworkPolicy allowlist egress
Per-cluster log labels
Rate limiting
Hard Truth: What Breaks First in Multi-Cluster
From real failures:
- Central logging cluster becomes bottleneck
- No per-cluster labels → log chaos
- Argo CD given cluster-admin everywhere
- No network quotas → logging DDoS
- Scrape-based metrics across VPCs → timeouts
- Public prod API endpoints → abused
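As one example of the controls listed above, the "NetworkPolicy allowlist egress" item could be sketched like this for the Fluent Bit pods in a prod cluster. The namespace, pod label, and 10.100.0.0/16 platform CIDR are assumptions, and the policy only takes effect if the cluster's CNI enforces NetworkPolicy.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fluent-bit-egress       # hypothetical name
  namespace: logging            # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: fluent-bit   # hypothetical pod label
  policyTypes: [Egress]
  egress:
    # Allow log shipping to the platform VPC only, on the Loki port
    - to:
        - ipBlock:
            cidr: 10.100.0.0/16   # hypothetical platform VPC CIDR
      ports:
        - protocol: TCP
          port: 3100
    # Allow DNS lookups so loki.platform.internal can be resolved
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

If the Kubernetes metadata filter is enabled, Fluent Bit also needs egress to the cluster's API server; that rule is omitted here because the right selector depends on the CNI.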
If you want, next I can go deeper into:
- exact Transit Gateway route tables for multi-cluster
- Argo CD multi-cluster RBAC design
- Loki vs OpenSearch tradeoffs at scale
- cross-cluster service discovery patterns
- zero-trust multi-cluster networking model
Pick one and we'll go surgical.