VXLAN, EVPN, BGP ECMP — Deep Dive

Foundation

Switch configuration — underlay vs overlay

A modern data center switch runs two distinct planes simultaneously: the underlay and the overlay. Understanding the separation is the key to understanding why VXLAN and EVPN exist.

The underlay — IP reachability between all switches

The underlay is the physical IP network. Every switch has a loopback address that is reachable from every other switch. A routing protocol, typically eBGP or OSPF, distributes these loopback routes so that any VTEP (VXLAN Tunnel Endpoint) can reach any other VTEP. The lab already does this with BIRD2 and eBGP, so it is a complete underlay.

Cisco NX-OS — spine underlay config
! Spine underlay: advertise loopback into BGP for VTEP reachability
! (eBGP underlay, matching the lab; an OSPF underlay would instead add
!  "ip router ospf 1 area 0" to the loopback and fabric interfaces)
feature bgp

interface loopback0
  ip address 10.0.0.1/32

interface Ethernet1/1  ! spine to leaf link
  no switchport
  ip address 10.1.1.1/30
  no shutdown

router bgp 65001
  router-id 10.0.0.1
  address-family ipv4 unicast
    network 10.0.0.1/32    ! loopback reachable to all VTEPs
  neighbor 10.1.1.2        ! leaf SP4
    remote-as 65004
    address-family ipv4 unicast
      ! send-community extended is needed under l2vpn evpn, not here

The overlay — tenant traffic carried over the underlay

The overlay is the logical network. VXLAN tunnels between leaf switches carry tenant traffic (VMs, containers, servers) across the underlay without the physical switches needing to know about tenant MAC addresses at scale. The leaf switch encapsulates tenant frames in UDP packets and sends them across the underlay to the destination VTEP.

Cisco NX-OS — leaf VTEP config
! Enable VXLAN on the leaf (VTEP)
nv overlay evpn                 ! enable the EVPN control plane
feature bgp
feature vn-segment-vlan-based
feature nv overlay

! Loopback1 is the VTEP source IP (NVE interface)
interface loopback1
  ip address 10.1.0.1/32  ! unique per leaf

! NVE interface is the VXLAN tunnel endpoint
interface nve1
  no shutdown
  source-interface loopback1
  host-reachability protocol bgp  ! EVPN control plane
  member vni 10010               ! VNI for VLAN 10 tenant traffic
    mcast-group 239.1.1.1        ! multicast for BUM, or use ingress replication
  member vni 10020
    ingress-replication protocol bgp  ! EVPN unicast replication

! Map VLANs to VNIs
vlan 10
  vn-segment 10010  ! VXLAN Network Identifier for VLAN 10
vlan 20
  vn-segment 10020
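The VLAN-to-VNI mapping above follows a visible pattern (VLAN 10 maps to VNI 10010, VLAN 20 to VNI 10020). As a minimal sketch, assuming the convention is simply "VNI = 10000 + VLAN ID" (an inference from the two examples, not a rule stated by the lab), the stanzas could be generated like this. The function names are hypothetical.

```python
VNI_BASE = 10000  # assumed offset, inferred from vlan 10 -> vni 10010 above


def vlan_to_vni(vlan_id: int, base: int = VNI_BASE) -> int:
    """Map a VLAN ID to a VNI using a fixed offset (hypothetical convention)."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"VLAN ID out of range: {vlan_id}")
    vni = base + vlan_id
    if vni >= 2**24:
        raise ValueError(f"VNI does not fit in 24 bits: {vni}")
    return vni


def render_vlan_config(vlan_ids: list[int]) -> str:
    """Render NX-OS style vlan / vn-segment stanzas for the given VLANs."""
    lines = []
    for vlan_id in sorted(vlan_ids):
        lines.append(f"vlan {vlan_id}")
        lines.append(f"  vn-segment {vlan_to_vni(vlan_id)}")
    return "\n".join(lines)


print(render_vlan_config([10, 20]))
```

Generating the mapping from one rule keeps every leaf consistent, which matters because two VTEPs only share a segment when they agree on the VNI.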

In the lab, the containers do not run VXLAN. NVE interfaces are an NX-OS construct, and the nicolaka/netshoot containers are general-purpose Linux hosts that share the VM's kernel and are not set up as VTEPs. The lab demonstrates the underlay (eBGP routing between switches). A production deployment adds VXLAN and EVPN on top of this same underlay foundation.

Overlay protocol

VXLAN — why we need it and how it works

Traditional VLANs are limited to 4,096 IDs (a 12-bit field) and cannot span IP routed boundaries. A hyperscale data center with tens of thousands of tenants cannot fit into 4,096 VLANs. VXLAN (Virtual Extensible LAN) solves both problems.

The problem VXLAN solves

When a VM on Server A (connected to Leaf 1) needs to communicate with a VM on Server B (connected to Leaf 4), the data center needs to carry Layer 2 traffic across a Layer 3 routed network. Without VXLAN, you need stretched VLANs that rely on STP for loop prevention (which blocks redundant links and does not scale) or complex DCI (Data Center Interconnect) solutions.

  Server A (VLAN 10)          Server B (VLAN 10)
       |                            |
   [ Leaf 1 ]                  [ Leaf 4 ]
   VTEP: 10.1.0.1              VTEP: 10.4.0.1
       |                            |
       +-----[ Spine 1, 2, 3 ]------+
              (IP underlay)

VXLAN encapsulation:
  Outer:  Leaf1 10.1.0.1 → Leaf4 10.4.0.1 (UDP port 4789)
  Inner:  ServerA MAC → ServerB MAC (original frame)
  VNI:    10010 (identifies the tenant / VLAN)

How VXLAN works

VXLAN encapsulates the original Layer 2 Ethernet frame inside a UDP packet, behind an 8-byte VXLAN header that carries a 24-bit VXLAN Network Identifier (VNI). This gives about 16.7 million (2^24) possible segments instead of 4,096 VLANs. The outer IP header carries the packet across the routed underlay between VTEPs. The destination VTEP strips the outer headers and the VXLAN header and delivers the original frame to the correct server.

Property	Traditional VLAN	VXLAN
Segment limit	4,096	16 million (24-bit VNI)
Spans IP networks	No (STP boundary)	Yes (UDP/IP transport)
Loop prevention	STP (slow convergence)	IP routing (fast convergence)
Scales to hyperscale	No	Yes
Control plane	Flood and learn	Flood and learn, or EVPN (BGP-based, minimal flooding)
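To make the encapsulation concrete, here is a minimal sketch of the 8-byte VXLAN header layout from RFC 7348: one flags byte with the "VNI valid" bit set, 24 reserved bits, the 24-bit VNI, and 8 more reserved bits. The function names are hypothetical, and the sketch covers only the VXLAN header, not the outer Ethernet, IP, or UDP headers.

```python
import struct

VXLAN_UDP_PORT = 4789        # IANA-assigned destination port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


def build_vxlan_header(vni: int) -> bytes:
    """Pack the 8-byte VXLAN header for a given VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError(f"VNI must fit in 24 bits: {vni}")
    # Word 1: flags in the top byte, then 24 reserved bits.
    # Word 2: VNI in the top 24 bits, then 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes of a VXLAN payload."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return word2 >> 8


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Return the UDP payload: VXLAN header followed by the original frame."""
    return build_vxlan_header(vni) + inner_frame


payload = encapsulate(b"original-ethernet-frame", vni=10010)
print(len(payload) - len(b"original-ethernet-frame"))  # header overhead in bytes
print(parse_vni(payload))
```

The 24-bit field is where the segment count comes from: 2^24 is 16,777,216 possible VNIs.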

Control plane

EVPN — distributing MAC and IP information without flooding

VXLAN provides the data plane tunnel. But how does Leaf 1 know which VTEP to send a frame to when it does not know where Server B's MAC address is? The original answer was to flood the frame to all VTEPs and let them learn. At hyperscale, flooding traffic across thousands of VTEPs is catastrophic. EVPN replaces flood-and-learn with control-plane learning, so flooding is limited to the broadcast and multicast traffic that genuinely needs it.

What EVPN does

EVPN (Ethernet VPN) is a BGP address family (MP-BGP with address-family l2vpn evpn) that distributes MAC addresses, IP-to-MAC bindings, and VTEP reachability information through the BGP control plane. Instead of flooding a frame to learn where a MAC lives, a leaf switch learns it from a BGP EVPN route advertisement before any traffic is sent.
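The effect of control-plane learning can be sketched as a lookup table that is filled by route advertisements rather than by observing flooded frames. The class and field names below are hypothetical simplifications of an EVPN MAC/IP advertisement, and the MAC and host IP values are documentation placeholders, not lab addresses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MacIpRoute:
    """Simplified stand-in for an EVPN MAC/IP advertisement (hypothetical fields)."""
    vni: int
    mac: str
    ip: Optional[str]
    next_hop_vtep: str


class EvpnMacTable:
    """MAC-to-VTEP table populated from BGP updates, not from flooding."""

    def __init__(self) -> None:
        self._entries: dict[tuple[int, str], MacIpRoute] = {}

    def learn(self, route: MacIpRoute) -> None:
        self._entries[(route.vni, route.mac.lower())] = route

    def withdraw(self, vni: int, mac: str) -> None:
        self._entries.pop((vni, mac.lower()), None)

    def lookup(self, vni: int, mac: str) -> Optional[str]:
        """Return the remote VTEP for a MAC, or None if it is unknown."""
        route = self._entries.get((vni, mac.lower()))
        return route.next_hop_vtep if route else None


# Leaf 1 receives an advertisement from Leaf 4 about Server B.
table = EvpnMacTable()
table.learn(MacIpRoute(vni=10010, mac="00:00:5E:00:53:0B",
                       ip="192.0.2.20", next_hop_vtep="10.4.0.1"))
print(table.lookup(10010, "00:00:5e:00:53:0b"))  # known: unicast to this VTEP
print(table.lookup(10010, "00:00:5e:00:53:ff"))  # unknown MAC
```

The table is keyed by (VNI, MAC) because the same MAC address may legitimately appear in two different tenant segments.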

Cisco NX-OS — EVPN BGP config on leaf
router bgp 65004
  router-id 10.0.0.4
  address-family l2vpn evpn  ! EVPN address family

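  ! eBGP session to each spine for EVPN route distribution
  neighbor 10.1.1.1          ! Spine 1
    remote-as 65001
    address-family l2vpn evpn
      send-community extended  ! carries route targets, required for EVPN
  ! On the spines (not shown), the outbound policy toward each leaf must leave
  ! the next hop unchanged so the originating leaf's VTEP IP is preserved.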

evpn
  vni 10010 l2                ! EVPN instance for VNI 10010
    rd auto
    route-target import auto
    route-target export auto

EVPN route types

EVPN defines several route types to carry different kinds of information between VTEPs: RFC 7432 specifies Types 1 through 4, and the Type 5 IP prefix route was added later. The most important for a spine-leaf fabric are:

Route Type	What it carries	Used for
Type 2	MAC/IP advertisement	Distributing server MAC and IP addresses across the fabric so VTEPs know where to send frames without flooding
Type 3	Inclusive Multicast Ethernet Tag (IMET) route	BUM (Broadcast, Unknown unicast, Multicast) traffic reachability: each VTEP announces that it should receive flooded traffic for a VNI
Type 5	IP prefix route	Routing external prefixes (routes from outside the fabric) into the VXLAN overlay for inter-subnet routing

EVPN is why modern data centers can run thousands of VTEPs without flooding tearing apart the network. Every leaf switch knows where every MAC address lives before it needs to send a frame, because BGP told it in advance.
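Broadcast and multicast frames still have to reach every VTEP in the segment, and with ingress replication the Type 3 routes are what build that list. The following is a rough sketch of the idea, assuming head-end (ingress) replication rather than underlay multicast. The class name and the VTEP addresses other than 10.1.0.1 and 10.4.0.1 are hypothetical.

```python
from collections import defaultdict


class FloodList:
    """Per-VNI list of remote VTEPs, built from Type 3 style announcements."""

    def __init__(self, local_vtep: str) -> None:
        self.local_vtep = local_vtep
        self._members: dict[int, set[str]] = defaultdict(set)

    def add_imet_route(self, vni: int, originator_vtep: str) -> None:
        # A VTEP never replicates to itself.
        if originator_vtep != self.local_vtep:
            self._members[vni].add(originator_vtep)

    def remove_imet_route(self, vni: int, originator_vtep: str) -> None:
        self._members[vni].discard(originator_vtep)

    def replicate(self, vni: int, frame: bytes) -> list[tuple[str, bytes]]:
        """Return one (remote VTEP, frame) copy per member of the VNI."""
        return [(vtep, frame) for vtep in sorted(self._members.get(vni, ()))]


flood = FloodList(local_vtep="10.1.0.1")
flood.add_imet_route(10010, "10.4.0.1")
flood.add_imet_route(10010, "10.2.0.1")  # hypothetical second remote VTEP
flood.add_imet_route(10020, "10.3.0.1")  # different VNI, not in the 10010 list
for vtep, _ in flood.replicate(10010, b"arp-request"):
    print("send copy to", vtep)
```

Only VTEPs that announced interest in VNI 10010 receive a copy, so flooded traffic is scoped to the segment instead of the whole fabric.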

Load balancing

BGP ECMP — using all three spines simultaneously

In the lab, SP4 has eBGP sessions to SP1, SP2, and SP3. Each spine is a viable path to reach SP7's subnet. Without ECMP, the router picks one path and all traffic goes through a single spine. With ECMP (Equal-Cost Multi-Path), traffic is hashed across all three spines simultaneously, tripling the effective bandwidth and providing automatic failover.

How ECMP works with BGP

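By default, BGP selects a single best path and installs only that route in the forwarding table. ECMP requires explicitly telling BGP to install multiple equal-cost paths. Paths count as “equal cost” when they tie on the best-path attributes: same weight, same local preference, same AS path length, same origin, and same MED. Many implementations, NX-OS included, also require the AS path contents to match unless multipath-relax is enabled, which matters when each spine has its own AS number. In an AS-per-rack design, all three paths from a leaf through three spines to another leaf have equal AS path length, so they qualify for ECMP once that relaxation is in place.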
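The eligibility rule can be sketched as a small function. This is a simplified model, not a full best-path implementation: it ignores router IDs, IGP cost, and the rule that MED is normally compared only between paths from the same neighbor AS. The `relax_as_path` flag models the multipath-relax behavior that many platforms need when paths have the same AS path length but different AS numbers. The ASN 65007 for SP7 is an assumption based on the lab's numbering pattern.

```python
from dataclasses import dataclass

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass(frozen=True)
class BgpPath:
    next_hop: str
    as_path: tuple
    local_pref: int = 100
    med: int = 0
    weight: int = 0
    origin: str = "igp"


def _rank(path: BgpPath) -> tuple:
    """Lower is better: weight, local pref, AS path length, origin, MED."""
    return (-path.weight, -path.local_pref, len(path.as_path),
            ORIGIN_RANK[path.origin], path.med)


def select_multipath(paths, max_paths, relax_as_path=False):
    """Return the best path plus every path that ties with it, up to max_paths."""
    if not paths or max_paths < 1:
        return []
    best = min(paths, key=lambda p: (_rank(p), p.next_hop))
    eligible = [
        p for p in paths
        if _rank(p) == _rank(best)
        and (relax_as_path or p.as_path == best.as_path)
    ]
    eligible.sort(key=lambda p: p.next_hop)
    return eligible[:max_paths]


# SP4's three paths to SP7's subnet, one per spine.
paths = [
    BgpPath("10.1.1.1", (65001, 65007)),
    BgpPath("10.2.1.1", (65002, 65007)),
    BgpPath("10.3.1.1", (65003, 65007)),
]
print(len(select_multipath(paths, 3)))                      # strict AS path match
print(len(select_multipath(paths, 3, relax_as_path=True)))  # length-only match
```

With strict matching only the best path is installed, because each spine contributes a different AS number; with the relaxed rule all three paths qualify.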

Cisco NX-OS — enabling BGP ECMP on leaf
router bgp 65004
  bestpath as-path multipath-relax  ! needed when the spines use different AS numbers
  address-family ipv4 unicast
    maximum-paths 3        ! use up to 3 equal-cost eBGP paths
    maximum-paths ibgp 3   ! also for iBGP if mixed topology

! Result: leaf SP4 installs 3 routes to SP7's subnet:
! 10.4.4.0/30 via 10.1.1.1 (SP1) dev eth1  [BGP/32]
! 10.4.4.0/30 via 10.2.1.1 (SP2) dev eth2  [BGP/32]
! 10.4.4.0/30 via 10.3.1.1 (SP3) dev eth3  [BGP/32]

ECMP hashing

Traffic is distributed across the equal-cost paths using a hash of the flow's fields, typically the 5-tuple: source IP, destination IP, source port, destination port, and protocol. The same flow always takes the same path (so TCP sessions do not get reordered), but different flows take different paths. With a large number of flows, traffic distributes roughly evenly across all three spines; a few very large flows can still skew the balance.
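A minimal sketch of per-flow hashing follows. Real switches use a hardware hash function; SHA-256 is used here only because it is deterministic across runs and available in the standard library. The next-hop addresses are the spine addresses from the example output above, and the server addresses and ports are made up for illustration.

```python
import hashlib
from collections import Counter

SPINES = ["10.1.1.1", "10.2.1.1", "10.3.1.1"]


def pick_next_hop(flow: tuple, next_hops: list[str]) -> str:
    """Choose a next hop from the flow's 5-tuple; the same flow always maps the same way."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# One flow is pinned to one spine for its whole lifetime.
flow = ("10.4.1.10", "10.4.4.2", 40000, 443, "tcp")
assert pick_next_hop(flow, SPINES) == pick_next_hop(flow, SPINES)

# Many flows (different source ports) spread across all spines.
counts = Counter(
    pick_next_hop(("10.4.1.10", "10.4.4.2", 40000 + i, 443, "tcp"), SPINES)
    for i in range(3000)
)
for spine in SPINES:
    print(spine, counts[spine])
```

Note that a plain modulo remaps many flows when a spine is removed from the list; hardware implementations often use resilient hashing to limit that churn.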

In the lab, BIRD2 learns the same prefix from multiple eBGP peers and installs it as a multipath route; in BIRD this depends on the kernel protocol's merge paths option being enabled. The ip route show command on SP4 shows SP7's subnet reachable via SP1, SP2, and SP3 simultaneously, so ECMP is already running in the lab.

Architecture

Why VXLAN + EVPN + BGP ECMP are always used together

These three technologies are not independently useful at scale. They form a system where each one depends on the others.

  BGP ECMP (underlay)
  ─────────────────────────────────────────────────────
  Provides: IP reachability between all VTEPs
  Uses:     eBGP with maximum-paths, AS per rack design
  Result:   Every VTEP can reach every other VTEP
            via multiple equal-cost paths through spines
  ─────────────────────────────────────────────────────

  VXLAN (data plane overlay)
  ─────────────────────────────────────────────────────
  Provides: Layer 2 transport across Layer 3 underlay
  Uses:     UDP encapsulation, 24-bit VNI
  Depends:  Underlay ECMP for load-balanced tunnels
  Result:   Tenant traffic crosses fabric with no STP
  ─────────────────────────────────────────────────────

  EVPN (control plane)
  ─────────────────────────────────────────────────────
  Provides: MAC/IP distribution without flooding
  Uses:     MP-BGP l2vpn evpn address family
  Depends:  BGP sessions (same sessions as underlay)
  Result:   VTEPs know where to send frames before
            sending them. No flood-and-learn at scale.
  ─────────────────────────────────────────────────────

ECMP makes the underlay redundant and high-bandwidth. VXLAN tunnels run over the underlay and benefit from ECMP automatically: the outer IP source and destination are fixed for a given pair of VTEPs, but the sending VTEP derives the outer UDP source port from a hash of the inner frame's headers, so different tenant flows hash onto different spines. EVPN runs over the same BGP sessions as the underlay, which means no additional routing protocol and no additional peering relationships. The entire fabric is BGP.
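Between a given pair of VTEPs the outer IP addresses never change, so the per-flow entropy that underlay ECMP needs typically comes from the outer UDP source port, which RFC 7348 recommends deriving from a hash of the inner headers and placing in the dynamic port range. The sketch below illustrates that idea; CRC32 stands in for whatever hash the hardware actually uses, and the inner flow values are made up.

```python
import zlib

VXLAN_UDP_PORT = 4789
EPHEMERAL_LOW, EPHEMERAL_HIGH = 49152, 65535  # dynamic/private port range


def vxlan_source_port(inner_flow: tuple) -> int:
    """Derive the outer UDP source port from the inner flow's fields."""
    key = "|".join(str(field) for field in inner_flow).encode()
    span = EPHEMERAL_HIGH - EPHEMERAL_LOW + 1
    return EPHEMERAL_LOW + zlib.crc32(key) % span


def outer_five_tuple(src_vtep: str, dst_vtep: str, inner_flow: tuple) -> tuple:
    """Outer header fields that the spines see and hash on."""
    return (src_vtep, dst_vtep, vxlan_source_port(inner_flow),
            VXLAN_UDP_PORT, "udp")


# Two tenant flows between the same pair of leaves.
flow_a = ("00:00:5e:00:53:0a", "00:00:5e:00:53:0b", "192.0.2.10", "192.0.2.20", 51000, 443)
flow_b = ("00:00:5e:00:53:0a", "00:00:5e:00:53:0b", "192.0.2.10", "192.0.2.20", 51001, 443)
print(outer_five_tuple("10.1.0.1", "10.4.0.1", flow_a))
print(outer_five_tuple("10.1.0.1", "10.4.0.1", flow_b))
```

Both outer packets share the same VTEP addresses and destination port; only the source port differs, and that difference is what lets the underlay spread tunnel traffic across spines.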

This is why BGP-based spine-leaf designs are so widely adopted in large data centers. One protocol (BGP) handles underlay routing, overlay control plane, and external routing. One team of engineers with BGP expertise can operate the entire network.

Scaling

How to scale from 13 nodes to a real enterprise or hyperscale network

The lab runs 13 nodes on a single VM. A real enterprise data center runs hundreds to thousands of switches. Here is the progression.

Lab scale — 13 nodes (current)

3 spines, 4 leaves, 4 servers, 1 edge router. Single VM in Containerlab. eBGP only. Demonstrates the underlay design and automation patterns. Full BGP convergence in under 30 seconds. Suitable for learning, development, and CI/CD testing.

Small enterprise — 50 to 200 nodes

Add more leaves (one per rack) and more servers per leaf. Introduce VXLAN and EVPN for multi-tenancy. If the EVPN overlay runs over iBGP, add route reflectors (dedicated or on the spines) to avoid a full iBGP mesh; an all-eBGP fabric like the lab does not need them. Add a second pair of spines for redundancy. The NetBox source of truth remains the same, and automation scales by adding more devices to the device list. The n8n workflow adds more device checks to the monitoring loop.

Large enterprise / service provider — 500 to 5,000 nodes

Add a second tier of spine aggregation (super-spines). Multiple pods, each with its own spines, connect via the super-spines. Each pod maintains local ECMP, and inter-pod traffic goes through the super-spines. Add dedicated border leaves for WAN and internet connectivity. Containerlab cannot simulate this scale, but the design and automation patterns are identical; only the device count changes.

Hyperscale — tens of thousands of nodes

Meta's data centers run spine-leaf at this scale. The BGP design is the same: AS per rack, eBGP underlay, ECMP through multiple spines. What changes is the automation infrastructure: ZTP (Zero Touch Provisioning) for auto-onboarding new switches, Ansible at scale for config push, network-as-code with Git as source of truth, and monitoring via streaming telemetry (gNMI) rather than SNMP polling. The AIOps pipeline scales by connecting to the telemetry stream rather than polling devices individually.

What stays the same at every scale

The eBGP design principle, the AS-per-rack model, the ECMP requirement, and the NetBox-as-source-of-truth pattern are consistent from 13 nodes to 30,000. The lab is a faithful miniature of the real thing. Every automation script, every BGP config template, and every Python script in this lab could be pointed at a production network with minimal changes.
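One detail of the AS-per-rack model does change with scale: the AS number pool. The 2-byte private range (64512 to 64534 plus the rest up to 65534, 1,023 values in total) covers the lab, but not tens of thousands of racks, where the 4-byte private range from RFC 6996 is needed. The sketch below checks that arithmetic; the allocation scheme (rack index as an offset into the pool) is a hypothetical convention.

```python
# Private-use ASN ranges from RFC 6996 (upper bounds are exclusive in range()).
PRIVATE_ASN_16 = range(64512, 65535)            # 64512..65534
PRIVATE_ASN_32 = range(4200000000, 4294967295)  # 4200000000..4294967294


def pool_fits(rack_count: int, pool: range) -> bool:
    """True if the pool has at least one private ASN per rack."""
    return rack_count <= len(pool)


def allocate_rack_asn(rack_index: int, pool: range = PRIVATE_ASN_32) -> int:
    """Assign the Nth ASN in the pool to the Nth rack (hypothetical scheme)."""
    if not 0 <= rack_index < len(pool):
        raise ValueError(f"rack index {rack_index} exceeds the ASN pool")
    return pool[rack_index]


print(len(PRIVATE_ASN_16))               # size of the 2-byte private pool
print(pool_fits(13, PRIVATE_ASN_16))     # lab scale
print(pool_fits(30000, PRIVATE_ASN_16))  # hyperscale with 2-byte ASNs
print(pool_fits(30000, PRIVATE_ASN_32))  # hyperscale with 4-byte ASNs
print(allocate_rack_asn(0), allocate_rack_asn(29999))
```

Some large designs instead reuse the same leaf AS numbers in every pod, which avoids 4-byte ASNs at the cost of extra policy to accept repeated AS numbers.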

AI Agent

How the AI anomaly detection agent maps to a real production network

The lab runs the AIOps pipeline against 8 containers via docker exec. A production network would run the same logic against real switches via NETCONF, gNMI streaming telemetry, or SSH. The fundamental design — collect logs, redact sensitive data, analyze with AI, alert with root cause and recommendation — is identical.

Lab (current implementation)

Log source: docker exec inside containers
Protocol: subprocess + shell commands
Commands: birdc show protocols, ip route show
Devices: 8 (SP1-SP7, router1)
Interval: 15 minutes via n8n
Alerting: Gmail via n8n SMTP
Remediation: fix_bird_configs.py via Flask API
Scale: single VM, free tier

Production (equivalent design)

Log source: gNMI streaming or NETCONF
Protocol: gNMI (pygnmi), SSH (netmiko), or RESTCONF
Data: BGP peer state, route table, interface counters
Devices: hundreds to thousands
Interval: real-time or 1-5 minute cycle
Alerting: PagerDuty, Slack, or NOC dashboard
Remediation: Ansible playbook via AWX/Tower
Scale: Kubernetes + horizontal scaling

What changes in production

The log collection method changes from docker exec to a proper network management protocol. Instead of running shell commands inside containers, the production version would subscribe to gNMI telemetry streams from each switch, which pushes data in real time without polling. The Flask API would be replaced with a Kafka consumer that processes telemetry events as they arrive.

The AI analysis layer does not change at all. Gemini receives the same redacted structured log regardless of whether it was collected via docker exec or gNMI. The prompt, the response format, the severity parsing — all identical.

The remediation layer changes from running a Python script directly to triggering an Ansible playbook via AWX or Red Hat Ansible Automation Platform. The playbook applies the fix across the affected device and verifies the result, with full audit trail logging. The human-in-the-loop approval step through n8n remains the same design pattern.

Why this AI agent design is production-correct

The three architectural decisions in the lab that make it production-transferable are the same decisions production teams make:

Redaction before AI. No sensitive data reaches an external AI service. This is not optional in a production network, because customer IPs, device hostnames, and routing policy information are confidential. The redaction layer is not a convenience feature; it is a security control.
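A redaction step of this kind might look like the following sketch. It is not the lab's actual implementation: the class name, the placeholder format, and the choice to redact only IPv4 addresses and a supplied list of hostnames are all assumptions. The key property it illustrates is that placeholders are stable, so the AI can still see that two log lines refer to the same address.

```python
import re

# IPv4 address with an optional prefix length, e.g. 10.4.4.0/30
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?:/\d{1,2})?\b")


class Redactor:
    """Replace addresses and hostnames with stable placeholders."""

    def __init__(self, hostnames=()):
        self._hostnames = list(hostnames)
        self._ip_map = {}  # real value -> placeholder, kept locally only

    def _ip_placeholder(self, match):
        value = match.group(0)
        if value not in self._ip_map:
            self._ip_map[value] = f"IP_{len(self._ip_map) + 1}"
        return self._ip_map[value]

    def redact(self, text: str) -> str:
        text = IPV4_RE.sub(self._ip_placeholder, text)
        for index, name in enumerate(self._hostnames, start=1):
            text = re.sub(rf"\b{re.escape(name)}\b", f"DEVICE_{index}", text)
        return text

    def mapping(self) -> dict:
        """Placeholder -> real value, for restoring context after analysis."""
        return {placeholder: real for real, placeholder in self._ip_map.items()}


redactor = Redactor(hostnames=["SP4"])
line = "SP4: 10.4.4.0/30 via 10.1.1.1 dev eth1, 10.4.4.0/30 via 10.2.1.1 dev eth2"
print(redactor.redact(line))
```

The mapping never leaves the local process, so the alert shown to the operator can be rewritten with real names after the AI response comes back.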

Human approval before remediation. The AI detects and recommends but does not act autonomously on production infrastructure. The approve-before-fix pattern through the webhook is the correct design for any automated remediation in a real environment. Fully autonomous remediation without human approval is an anti-pattern in production networks where a wrong automated fix can cause more damage than the original fault.

Structured output from AI. The prompt instructs Gemini to respond in a parseable format with fixed field names (SEVERITY, ANOMALY_DETECTED, RECOMMENDED_ACTION). This makes the AI output machine-readable so downstream systems (n8n, alerting, ticketing) can process it programmatically rather than treating it as free text. This is the difference between an AI tool and an AI agent.
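Parsing such a response could be as simple as the sketch below. The three field names come from the prompt described above; the "KEY: value" line layout, the sample values, and the function name are assumptions made for illustration. The parser rejects responses with missing fields so that downstream systems never act on a partial answer.

```python
REQUIRED_FIELDS = ("SEVERITY", "ANOMALY_DETECTED", "RECOMMENDED_ACTION")


def parse_ai_response(text: str) -> dict[str, str]:
    """Extract the fixed fields from a 'KEY: value' formatted response."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, separator, value = line.partition(":")
        key = key.strip().upper()
        if separator and key in REQUIRED_FIELDS:
            fields[key] = value.strip()
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError(f"AI response missing fields: {', '.join(missing)}")
    return fields


sample = """\
SEVERITY: HIGH
ANOMALY_DETECTED: YES
RECOMMENDED_ACTION: Check the BGP session DEVICE_1 to DEVICE_2: state is Idle
"""
result = parse_ai_response(sample)
if result["SEVERITY"] == "HIGH":
    print("page on-call:", result["RECOMMENDED_ACTION"])
```

Treating an unparseable response as an error, rather than passing free text along, is what lets the alerting and approval steps branch on severity reliably.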

At the scale of a company like TikTok, the same pattern would run at thousands of times this volume. The tools change (Kafka instead of the Flask API, Ansible instead of Python scripts, gNMI instead of docker exec) but the architecture is identical: collect, redact, analyze, alert with context, approve, remediate, verify. This lab demonstrates all seven steps working end to end.
