Platform Engineering
Multicluster Networking

You're now asking platform-level questions, which is good. This is where most "we run EKS" claims fall apart. Multi-cluster is not just more clusters: it is control-plane separation plus cross-cluster data flows.

I’ll break this into real production patterns, not theory:

  1. Multi-cluster networking patterns
  2. Platform cluster vs prod cluster responsibilities
  3. How GitOps (Argo CD) controls prod clusters
  4. Cross-cluster logging & monitoring
  5. How Fluent Bit sends logs to another EKS cluster (actual flow)

No fluff, just how big shops really wire this.


🧭 Multi-Cluster Networking Patterns (EKS at Scale)

✅ Pattern A: Hub & Spoke (Most Common)

Shared Services VPC (Hub)
 ├── platform EKS
 ├── logging stack
 ├── monitoring stack
 └── gitops controllers

Spoke VPCs
 ├── prod EKS
 ├── staging EKS
 └── dev EKS

Connected using:

AWS Transit Gateway  ← preferred
or
VPC Peering          ← small setups

Why TGW wins:

  • route scale
  • easier expansion
  • central inspection
  • security control

✅ Pattern B: Private API + Bastion / SSM

Prod cluster API is private only.

Access path:

Platform cluster tools
→ SSM / bastion / runner
→ private endpoint
→ prod cluster

If your prod API endpoint is public at large scale, that is weak design.
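
A quick way to audit this is to check the endpoint flags that EKS reports for each cluster. A minimal sketch, assuming the input dict has the shape of the `cluster` object returned by the EKS DescribeCluster API (for example via boto3's `describe_cluster`); the sample dicts below are made up for illustration:

```python
def is_private_only(cluster: dict) -> bool:
    """Return True only if the API endpoint is private and not public.

    Reads the resourcesVpcConfig flags from a DescribeCluster-style dict.
    Missing fields are treated as "not proven private".
    """
    vpc = cluster.get("resourcesVpcConfig", {})
    return (
        vpc.get("endpointPrivateAccess") is True
        and vpc.get("endpointPublicAccess") is False
    )


# Hypothetical sample data, not real clusters.
prod = {"name": "prod", "resourcesVpcConfig": {
    "endpointPrivateAccess": True, "endpointPublicAccess": False}}
dev = {"name": "dev", "resourcesVpcConfig": {
    "endpointPrivateAccess": True, "endpointPublicAccess": True}}

for c in (prod, dev):
    print(c["name"], "private-only" if is_private_only(c) else "PUBLIC ENDPOINT ENABLED")
```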


🧱 Platform EKS Cluster β€” What Lives There

This is not for business apps.

This cluster runs platform controllers:

GitOps controllers
Cluster fleet managers
Observability backends
Security scanners
Policy engines

Typical platform cluster workloads:

Argo CD
Thanos / Cortex
Loki / OpenSearch
Central Prometheus
Falco / security agents
OPA Gatekeeper
ExternalDNS controller
Cert-manager (sometimes centralized)

Think of it as the control and visibility plane.


🚀 How GitOps Controls Prod Clusters

Using: 👉 Argo CD

Argo CD does NOT need to run inside every prod cluster.

Two models:


Model 1: Central Argo CD (Platform Cluster)

ArgoCD (platform cluster)
   ↓
connects to
   ↓
Prod EKS API server
Staging EKS API server
Dev EKS API server

How?

argocd cluster add <context>

This creates:

  • service account in target cluster
  • RBAC bindings
  • token secret stored in Argo

Argo then deploys remotely.
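
The "token secret stored in Argo" is an ordinary Kubernetes Secret carrying a cluster label, so it can also be declared instead of created by the CLI. A minimal sketch of that Secret built as a Python dict, assuming Argo CD runs in a namespace called `argocd`; the cluster name, server URL, token, and CA value are placeholders:

```python
import json


def argocd_cluster_secret(name: str, server: str, bearer_token: str, ca_data_b64: str) -> dict:
    """Build a declarative Argo CD cluster Secret for a remote cluster."""
    config = {
        "bearerToken": bearer_token,
        "tlsClientConfig": {"insecure": False, "caData": ca_data_b64},
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"cluster-{name}",  # naming scheme is an assumption
            "namespace": "argocd",      # assumes the default install namespace
            # This label is how Argo CD recognizes cluster credentials.
            "labels": {"argocd.argoproj.io/secret-type": "cluster"},
        },
        "type": "Opaque",
        "stringData": {"name": name, "server": server, "config": json.dumps(config)},
    }


secret = argocd_cluster_secret(
    "prod", "https://prod-api.example.internal", "<token>", "<base64-ca>")
print(json.dumps(secret, indent=2))  # kubectl accepts JSON manifests
```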

Security requirement:

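That service account should be tightly scoped, not cluster-admin unless required. By default `argocd cluster add` grants cluster-wide permissions, so the scoping has to be done deliberately.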
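
One way to scope it is to bind the Argo service account to namespaced Roles instead of a cluster-wide role. A minimal sketch, assuming the service account is `argocd-manager` in `kube-system` (the name `argocd cluster add` normally uses) and a hypothetical `payments` namespace; the resource list is illustrative and a real setup must cover everything Argo CD syncs and watches:

```python
import json


def scoped_role(namespace: str) -> dict:
    """Namespaced Role: manage common workload objects, nothing cluster-wide."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "argocd-deployer", "namespace": namespace},  # hypothetical name
        "rules": [{
            "apiGroups": ["", "apps"],  # "" is the core API group
            "resources": ["configmaps", "services", "deployments"],
            "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
        }],
    }


def role_binding(namespace: str, sa_name: str = "argocd-manager",
                 sa_namespace: str = "kube-system") -> dict:
    """Bind the Role above to the service account Argo CD authenticates as."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "argocd-deployer", "namespace": namespace},
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                    "kind": "Role", "name": "argocd-deployer"},
        "subjects": [{"kind": "ServiceAccount", "name": sa_name,
                      "namespace": sa_namespace}],
    }


for manifest in (scoped_role("payments"), role_binding("payments")):
    print(json.dumps(manifest, indent=2))
```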


Model 2: Argo Per Cluster (Highly Regulated Orgs)

Each cluster runs its own Argo CD. Platform team controls repos & policies.

Used when:

  • compliance is strict
  • cluster isolation required
  • no cross-cluster API trust allowed

📊 Cross-Cluster Monitoring Pattern

Metrics travel outward, not inward.

Prod clusters run:

Prometheus agent / remote_write

They push to:

Central metrics backend
(Thanos / AMP / Cortex)
in platform VPC

Never scrape prod from outside; push instead.
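
The push model boils down to two Prometheus settings per prod cluster: a `remote_write` target and a `cluster` external label so the central backend can tell senders apart. A minimal sketch that builds that config as a dict mirroring the YAML structure; the URL is a made-up internal name, the path depends on the backend (Thanos, Cortex, and AMP each use a different one), and scrape jobs are omitted:

```python
import json


def remote_write_config(cluster: str, url: str) -> dict:
    """Minimal Prometheus config fragment for pushing metrics outward."""
    return {
        "global": {
            # Attached to every series sent, so the backend can tell clusters apart.
            "external_labels": {"cluster": cluster},
        },
        "remote_write": [{"url": url}],
    }


cfg = remote_write_config(
    "prod", "https://metrics.platform.internal/api/v1/receive")  # hypothetical endpoint
print(json.dumps(cfg, indent=2))
```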


🪵 Cross-Cluster Logging: Fluent Bit Pattern

Using: 👉 Fluent Bit

Reality: Fluent Bit does NOT "send to another cluster"

It sends to a central log backend endpoint.

Example targets:

Loki
OpenSearch
Elasticsearch
Kafka
Kinesis
CloudWatch

That backend may be running inside platform EKS, but it is exposed as a service endpoint.


✅ Real Flow: Fluent Bit → Central Loki (Platform Cluster)

Prod cluster:

FluentBit DaemonSet
  tail /var/log/containers
  enrich with k8s metadata
  output → https://loki.platform.internal

Platform cluster:

Loki gateway
Service type: NLB / internal LB
Private DNS name

Networking:

Prod VPC → Transit Gateway → Platform VPC → Loki NLB

No cluster-to-cluster kube traffic required. Just HTTPS to logging endpoint.
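
Whether that path works depends on the prod VPC route table sending the platform CIDR to the Transit Gateway. A minimal sketch of the longest-prefix-match lookup a VPC route table performs; all CIDRs, the target names, and the Loki NLB address are hypothetical:

```python
import ipaddress
from typing import Optional


def next_hop(route_table: list, destination: str) -> Optional[str]:
    """Return the target of the most specific route covering destination."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table
        if dest in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no route: traffic is dropped
    # Longest prefix wins, as in a VPC route table.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Hypothetical prod VPC route table: local traffic stays local,
# everything else in 10.0.0.0/8 goes to the Transit Gateway.
PROD_VPC_ROUTES = [
    ("10.20.0.0/16", "local"),
    ("10.0.0.0/8", "tgw-hypothetical"),
]

print(next_hop(PROD_VPC_ROUTES, "10.10.4.25"))  # assumed Loki NLB IP in the platform VPC
print(next_hop(PROD_VPC_ROUTES, "10.20.1.5"))   # a pod or node inside the prod VPC
```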


Example Fluent Bit Output Config

[OUTPUT]
  Name   loki
  Match  *
  Host   loki.platform.internal
  Port   3100
  tls    on
  labels cluster=prod,team=payments

Always add a cluster label; it is critical for multi-cluster search.
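
If the output section is templated per cluster, the cluster label can be enforced at render time. A minimal sketch of such a renderer; the function is hypothetical, not part of Fluent Bit, and it refuses to produce a config without a `cluster` label:

```python
def loki_output(host: str, port: int, labels: dict) -> str:
    """Render a Fluent Bit [OUTPUT] section for the loki plugin."""
    if "cluster" not in labels:
        raise ValueError("a 'cluster' label is required for multi-cluster search")
    # Sort for stable output across runs.
    label_str = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    lines = [
        "[OUTPUT]",
        "  Name   loki",
        "  Match  *",
        f"  Host   {host}",
        f"  Port   {port}",
        "  tls    on",
        f"  labels {label_str}",
    ]
    return "\n".join(lines)


print(loki_output("loki.platform.internal", 3100,
                  {"cluster": "prod", "team": "payments"}))
```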


🔐 Security Controls You Must Add

At this scale, skipping these is negligence:

mTLS or TLS to logging backend
VPC-internal endpoints only
IAM auth if using OpenSearch/Kinesis
NetworkPolicy allowlist egress
Per-cluster log labels
Rate limiting
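
The egress allowlist item can be sketched as a NetworkPolicy that lets the log shipper reach only the logging backend and DNS. A minimal sketch built as a Python dict, assuming the cluster has a NetworkPolicy-enforcing CNI; the namespace, pod labels, and CIDR are made up. A real policy also needs a rule for the Kubernetes API server if the metadata filter is enabled:

```python
import json


def egress_allowlist(namespace: str, pod_labels: dict,
                     backend_cidr: str, port: int) -> dict:
    """NetworkPolicy allowing egress only to the log backend and DNS."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "fluent-bit-egress", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Egress"],
            "egress": [
                # Logging backend in the platform VPC.
                {"to": [{"ipBlock": {"cidr": backend_cidr}}],
                 "ports": [{"protocol": "TCP", "port": port}]},
                # DNS, so the backend hostname can be resolved.
                {"ports": [{"protocol": "UDP", "port": 53},
                           {"protocol": "TCP", "port": 53}]},
            ],
        },
    }


policy = egress_allowlist("logging", {"app": "fluent-bit"}, "10.10.0.0/16", 3100)
print(json.dumps(policy, indent=2))
```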

⚠️ Hard Truth β€” What Breaks First in Multi-Cluster

From real failures:

  1. Central logging cluster becomes bottleneck
  2. No per-cluster labels → log chaos
  3. Argo CD given cluster-admin everywhere
  4. No network quotas → logging DDoS
  5. Scrape-based metrics across VPCs → timeouts
  6. Public prod API endpoints → abused

If you want, next I can go deeper into:

  • exact Transit Gateway route tables for multi-cluster
  • Argo CD multi-cluster RBAC design
  • Loki vs OpenSearch tradeoffs at scale
  • cross-cluster service discovery patterns
  • zero-trust multi-cluster networking model

Pick one and we'll go surgical.

