Switch configuration — underlay vs overlay
A modern data center switch runs two distinct planes simultaneously: the underlay and the overlay. Understanding the separation is the key to understanding why VXLAN and EVPN exist.
The underlay — IP reachability between all switches
The underlay is the physical IP network. Every switch has loopback addresses that are reachable from every other switch. The underlay uses eBGP or OSPF to distribute these loopback routes so that any VTEP (VXLAN Tunnel Endpoint) can reach any other VTEP. This is what the lab already does with BIRD2 and eBGP — it is a complete underlay.
! Spine underlay: advertise loopback into BGP for VTEP reachability
interface loopback0
  ip address 10.0.0.1/32
  ip router ospf 1 area 0        ! or use BGP
interface Ethernet1/1            ! spine to leaf link
  no switchport
  ip address 10.1.1.1/30
  ip ospf network point-to-point
  no shutdown
router bgp 65001
  router-id 10.0.0.1
  address-family ipv4 unicast
    network 10.0.0.1/32          ! loopback reachable to all VTEPs
  neighbor 10.1.1.2              ! leaf SP4
    remote-as 65004
    address-family ipv4 unicast
      send-community extended    ! required for EVPN
The overlay — tenant traffic carried over the underlay
The overlay is the logical network. VXLAN tunnels between leaf switches carry tenant traffic (VMs, containers, servers) across the underlay without the physical switches needing to know about tenant MAC addresses at scale. The leaf switch encapsulates tenant frames in UDP packets and sends them across the underlay to the destination VTEP.
! Enable VXLAN on the leaf (VTEP)
feature vn-segment-vlan-based
feature nv overlay
! Loopback1 is the VTEP source IP (NVE interface)
interface loopback1
  ip address 10.1.0.1/32                 ! unique per leaf
! NVE interface is the VXLAN tunnel endpoint
interface nve1
  no shutdown
  source-interface loopback1
  host-reachability protocol bgp         ! EVPN control plane
  member vni 10010                       ! VNI for VLAN 10 tenant traffic
    mcast-group 239.1.1.1                ! multicast for BUM, or use ingress replication
  member vni 10020
    ingress-replication protocol bgp     ! EVPN unicast replication
! Map VLANs to VNIs
vlan 10
  vn-segment 10010                       ! VXLAN Network Identifier for VLAN 10
vlan 20
  vn-segment 10020
In the lab, the containers do not run VXLAN because nicolaka/netshoot does not have the kernel modules for NVE interfaces. The lab demonstrates the underlay (eBGP routing between switches). A production deployment adds VXLAN and EVPN on top of this same underlay foundation.
VXLAN — why we need it and how it works
Traditional VLANs are limited to 4,096 IDs and cannot span across IP routed boundaries. A hyperscale data center with tens of thousands of tenants cannot fit into 4,096 VLANs. VXLAN (Virtual Extensible LAN) solves both problems.
The problem VXLAN solves
When a VM on Server A (connected to Leaf 1) needs to communicate with a VM on Server B (connected to Leaf 4), the data center must carry Layer 2 traffic across a Layer 3 routed network. Without VXLAN, you need either VLANs stretched across the fabric under STP (which blocks redundant links, converges slowly, and risks loops when it fails) or complex DCI (Data Center Interconnect) solutions.
Server A (VLAN 10)            Server B (VLAN 10)
        |                             |
    [ Leaf 1 ]                    [ Leaf 4 ]
  VTEP: 10.1.0.1                VTEP: 10.4.0.1
        |                             |
        +------[ Spine 1, 2, 3 ]------+
                 (IP underlay)
VXLAN encapsulation:
  Outer: Leaf1 10.1.0.1 → Leaf4 10.4.0.1 (UDP port 4789)
  Inner: ServerA MAC → ServerB MAC (original frame)
  VNI:   10010 (identifies the tenant / VLAN)
How VXLAN works
VXLAN encapsulates the original Layer 2 Ethernet frame inside a UDP packet and adds an 8-byte VXLAN header that carries a 24-bit VXLAN Network Identifier (VNI). This gives about 16.7 million (2^24) possible segments instead of 4,096 VLANs. The outer IP header carries the packet across the routed underlay between VTEPs. The destination VTEP strips the outer headers and the VXLAN header and delivers the original frame to the correct server.
| Property | Traditional VLAN | VXLAN |
|---|---|---|
| Segment limit | 4,096 | 16 million (24-bit VNI) |
| Spans IP networks | No (STP boundary) | Yes (UDP/IP transport) |
| Loop prevention | STP (slow convergence) | IP routing (fast convergence) |
| Scales to hyperscale | No | Yes |
| Control plane | Flood and learn | EVPN (BGP-based, no flooding) |
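To make the encapsulation concrete, here is a minimal Python sketch of the 8-byte VXLAN header as laid out in RFC 7348: an 8-bit flags field with the I bit (0x08) set, 24 reserved bits, the 24-bit VNI, and 8 more reserved bits. The function names are illustrative, and the sketch covers only the VXLAN header itself, not the outer Ethernet, IP, or UDP headers that a real VTEP also adds.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08      # "I" flag: the VNI field carries a valid value
MAX_VNI = (1 << 24) - 1          # largest value that fits in the 24-bit VNI field


def build_vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VXLAN header for the given VNI."""
    if not 0 <= vni <= MAX_VNI:
        raise ValueError(f"VNI must fit in 24 bits, got {vni}")
    # Word 1: flags in the top byte, 24 reserved bits.
    # Word 2: VNI in the top 24 bits, 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes of a VXLAN payload."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return vni_word >> 8


def encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prefix an inner Ethernet frame with its VXLAN header (UDP payload only)."""
    return build_vxlan_header(vni) + inner_frame
```

For example, `parse_vni(encapsulate(10010, frame))` recovers the VNI used for VLAN 10 in the configuration above, which is how the destination VTEP decides which segment the inner frame belongs to.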
EVPN — distributing MAC and IP information without flooding
VXLAN provides the data plane tunnel. But how does Leaf 1 know which VTEP to send a frame to when it does not know where Server B's MAC address is? The original answer was to flood the frame to all VTEPs and let them learn. At hyperscale, flooding traffic across thousands of VTEPs is catastrophic. EVPN removes the need to flood in order to learn where a MAC address lives.
What EVPN does
EVPN (Ethernet VPN) is a BGP address family (MP-BGP with address-family l2vpn evpn) that distributes MAC addresses, IP-to-MAC bindings, and VTEP reachability information through the BGP control plane. Instead of flooding a frame to learn where a MAC lives, a leaf switch learns it from a BGP EVPN route advertisement before any traffic is sent.
router bgp 65004
  router-id 10.0.0.4
  address-family l2vpn evpn                  ! EVPN address family
  ! eBGP session to each spine for EVPN route distribution
  neighbor 10.1.1.1                          ! Spine 1
    remote-as 65001
    address-family l2vpn evpn
      send-community extended
      route-map NEXT-HOP-UNCHANGED out       ! preserve VTEP IP
evpn
  vni 10010 l2                               ! EVPN instance for VNI 10010
    rd auto
    route-target import auto
    route-target export auto
EVPN route types
EVPN defines several route types to carry different kinds of information between VTEPs; Types 1 through 5 are the ones usually discussed. The most important for a spine-leaf fabric are:
| Route Type | What it carries | Used for |
|---|---|---|
| Type 2 | MAC/IP advertisement | Distributing server MAC and IP addresses across the fabric so VTEPs know where to send frames without flooding |
| Type 3 | Inclusive multicast route | BUM (Broadcast, Unknown unicast, Multicast) traffic reachability — tells other VTEPs where to send flooded traffic |
| Type 5 | IP prefix route | Routing external prefixes (routes from outside the fabric) into the VXLAN overlay for inter-subnet routing |
EVPN is why modern data centers can run thousands of VTEPs without flooding tearing apart the network. Every leaf switch knows where every MAC address lives before it needs to send a frame, because BGP told it in advance.
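The lookup behavior described above can be sketched in Python as a table that is populated from Type 2 advertisements rather than from flooded frames. This is a simplified model with hypothetical names (`MacIpRoute`, `EvpnMacTable`); a real implementation also handles route distinguishers, route targets, MAC mobility sequence numbers, and BGP-driven withdrawals.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class MacIpRoute:
    """Simplified stand-in for an EVPN Type 2 (MAC/IP advertisement) route."""
    vni: int
    mac: str
    ip: Optional[str]
    vtep: str  # BGP next hop, i.e. the VTEP behind which the MAC lives


class EvpnMacTable:
    """Per-leaf view of remote MACs, keyed by (VNI, MAC)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, str], MacIpRoute] = {}

    def advertise(self, route: MacIpRoute) -> None:
        # A newer advertisement for the same (VNI, MAC) replaces the old one.
        self._entries[(route.vni, route.mac.lower())] = route

    def withdraw(self, vni: int, mac: str) -> None:
        self._entries.pop((vni, mac.lower()), None)

    def lookup_vtep(self, vni: int, mac: str) -> Optional[str]:
        """Return the remote VTEP for a MAC, or None if it is still unknown."""
        route = self._entries.get((vni, mac.lower()))
        return route.vtep if route else None


# Leaf 1 receives a Type 2 route for Server B from Leaf 4 (addresses are examples).
table = EvpnMacTable()
table.advertise(MacIpRoute(10010, "00:00:5E:00:53:0B", "192.0.2.20", "10.4.0.1"))
```

With the entry installed, `table.lookup_vtep(10010, "00:00:5e:00:53:0b")` returns the Leaf 4 VTEP address, so the first frame to Server B can be sent as a unicast tunnel packet. A `None` result corresponds to the unknown-unicast case that falls back to BUM handling.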
BGP ECMP — using all three spines simultaneously
In the lab, SP4 has eBGP sessions to SP1, SP2, and SP3. Each spine is a viable path to reach SP7's subnet. Without ECMP, the router picks one path and all traffic goes through a single spine. With ECMP (Equal-Cost Multi-Path), traffic is hashed across all three spines simultaneously, tripling the effective bandwidth and providing automatic failover.
How ECMP works with BGP
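By default, BGP selects a single best path and installs only that route in the forwarding table. ECMP requires explicitly telling BGP to install multiple equal-cost paths. Paths are “equal cost” when they tie on the attributes BGP compares before its final tie-breakers: weight, local preference, AS path length, origin, and MED. In an AS-per-rack design, all three paths from a leaf through three spines to another leaf have equal AS path length, so ECMP works naturally. Many implementations also require the AS path contents to match, not just the length, unless an option such as bestpath as-path multipath-relax is enabled; this matters when each spine has its own AS number.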
router bgp 65004
  address-family ipv4 unicast
    maximum-paths 3          ! use up to 3 equal-cost eBGP paths
    maximum-paths ibgp 3     ! also for iBGP if mixed topology
! Result: leaf SP4 installs 3 routes to SP7's subnet:
!   10.4.4.0/30 via 10.1.1.1 (SP1) dev eth1 [BGP/32]
!   10.4.4.0/30 via 10.2.1.1 (SP2) dev eth2 [BGP/32]
!   10.4.4.0/30 via 10.3.1.1 (SP3) dev eth3 [BGP/32]
ECMP hashing
Traffic is distributed across the equal-cost paths using a hash of the flow's 5-tuple: source IP, destination IP, source port, destination port, and protocol. The same flow always takes the same path (so TCP sessions do not get reordered), but different flows take different paths. With a large number of flows, traffic distributes evenly across all three spines.
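The per-flow behavior can be illustrated with a short Python sketch. It is only a model: real switches compute the hash in hardware with vendor-specific functions and seeds, and the use of SHA-256 here, the function name `pick_next_hop`, and the flow values in the usage note are assumptions chosen for a deterministic example. The next-hop addresses are the three spine addresses from the routes shown earlier.

```python
import hashlib
from typing import Sequence

# Next hops toward SP7's subnet via SP1, SP2, and SP3.
SPINE_NEXT_HOPS = ("10.1.1.1", "10.2.1.1", "10.3.1.1")


def pick_next_hop(
    src_ip: str,
    dst_ip: str,
    src_port: int,
    dst_port: int,
    protocol: str,
    next_hops: Sequence[str] = SPINE_NEXT_HOPS,
) -> str:
    """Map a flow's 5-tuple onto one of the equal-cost next hops."""
    flow_key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    digest = hashlib.sha256(flow_key).digest()
    # The same 5-tuple always yields the same index, so a flow never changes path.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]
```

Calling `pick_next_hop("10.4.1.10", "10.4.4.2", 40000, 443, "tcp")` repeatedly returns the same spine every time, while varying the source port across many flows spreads them over all three spines. A single very large flow still uses one path, which is why the even distribution depends on having many flows.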
In the lab, BIRD2 installs multiple routes automatically when it learns the same prefix from multiple eBGP peers. The ip route show command on SP4 shows three BGP routes to SP7's subnet via SP1, SP2, and SP3 simultaneously — ECMP is already running in the lab.
Why VXLAN + EVPN + BGP ECMP are always used together
These three technologies are not independently useful at scale. They form a system where each one depends on the others.
BGP ECMP (underlay)
─────────────────────────────────────────────────────
Provides: IP reachability between all VTEPs
Uses: eBGP with maximum-paths, AS per rack design
Result: Every VTEP can reach every other VTEP
via multiple equal-cost paths through spines
─────────────────────────────────────────────────────
VXLAN (data plane overlay)
─────────────────────────────────────────────────────
Provides: Layer 2 transport across Layer 3 underlay
Uses: UDP encapsulation, 24-bit VNI
Depends: Underlay ECMP for load-balanced tunnels
Result: Tenant traffic crosses fabric with no STP
─────────────────────────────────────────────────────
EVPN (control plane)
─────────────────────────────────────────────────────
Provides: MAC/IP distribution without flooding
Uses: MP-BGP l2vpn evpn address family
Depends: BGP sessions (same sessions as underlay)
Result: VTEPs know where to send frames before
sending them. No BUM flooding at scale.
─────────────────────────────────────────────────────
ECMP makes the underlay redundant and high-bandwidth. VXLAN tunnels run over the underlay and benefit from ECMP automatically: the outer IP addresses are the same for every flow between a given pair of VTEPs, but the source VTEP derives the outer UDP source port from a hash of the inner frame, so different tenant flows hash onto different spine paths. EVPN runs over the same BGP sessions as the underlay, which means no additional routing protocol and no additional peering relationships. The entire fabric is BGP.
This is why hyperscale data centers at Meta, Google, and Amazon converged on this design. One protocol (BGP) handles underlay routing, overlay control plane, and external routing. One team of engineers with BGP expertise can operate the entire network.
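As a rough illustration of how tenant flows inside one VXLAN tunnel can still spread across spines: RFC 7348 recommends that the source VTEP compute the outer UDP source port from a hash of inner-frame fields, preferably within the dynamic port range 49152-65535. The Python sketch below models that idea. The function name, the choice of CRC32, and the set of hashed fields are assumptions for illustration, not what any particular switch does.

```python
import zlib

VXLAN_DST_PORT = 4789                 # outer UDP destination port is fixed
PORT_MIN, PORT_MAX = 49152, 65535     # dynamic/private port range


def outer_udp_source_port(
    inner_src_mac: str,
    inner_dst_mac: str,
    inner_src_ip: str,
    inner_dst_ip: str,
) -> int:
    """Derive the outer UDP source port from fields of the inner frame."""
    key = "|".join([inner_src_mac, inner_dst_mac, inner_src_ip, inner_dst_ip]).encode()
    span = PORT_MAX - PORT_MIN + 1
    # Same inner flow -> same port; different inner flows -> (usually) different ports.
    return PORT_MIN + zlib.crc32(key) % span
```

Because the underlay's ECMP hash includes the outer source port, two tenant flows between the same pair of VTEPs will usually receive different ports and can therefore land on different spines, even though their outer IP addresses are identical.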
How to scale from 13 nodes to a real enterprise or hyperscale network
The lab runs 13 nodes on a single VM. A real enterprise data center runs hundreds to thousands of switches. Here is the progression.
Lab scale — 13 nodes (current)
3 spines, 4 leaves, 4 servers, 1 edge router. Single VM in Containerlab. eBGP only. Demonstrates the underlay design and automation patterns. Full BGP convergence in under 30 seconds. Suitable for learning, development, and CI/CD testing.
Small enterprise — 50 to 200 nodes
Add more leaves (one per rack) and more servers per leaf. Introduce VXLAN and EVPN for multi-tenancy. Add a Route Reflector or spine-based RR to avoid full iBGP mesh. Add a second pair of spines for redundancy. The NetBox source of truth remains the same; automation scales by adding more devices to the device list, and the n8n workflow adds more device checks to the monitoring loop.
Large enterprise / service provider — 500 to 5,000 nodes
Add a second tier of spine aggregation (super-spines). Multiple pods, each with its own spines, connect via the super-spines. Each pod maintains local ECMP, and inter-pod traffic goes through the super-spines. Add dedicated border leaves for WAN and internet connectivity. Containerlab cannot simulate this scale, but the design and automation patterns are identical; only the device count changes.
Hyperscale — tens of thousands of nodes
Meta's data centers run spine-leaf at this scale. The BGP design is the same: AS per rack, eBGP underlay, ECMP through multiple spines. What changes is the automation infrastructure: ZTP (Zero Touch Provisioning) for auto-onboarding new switches, Ansible at scale for config push, network-as-code with Git as source of truth, and monitoring via streaming telemetry (gNMI) rather than SNMP polling. The AIOps pipeline scales by connecting to the telemetry stream rather than polling devices individually.
What stays the same at every scale
The eBGP design principle, the AS-per-rack model, the ECMP requirement, and the NetBox-as-source-of-truth pattern are consistent from 13 nodes to 30,000. The lab is a faithful miniature of the real thing. Every automation script, every BGP config template, and every Python script in this lab could be pointed at a production network with minimal changes.
How the AI anomaly detection agent maps to a real production network
The lab runs the AIOps pipeline against 8 containers via docker exec. A production network would run the same logic against real switches via NETCONF, gNMI streaming telemetry, or SSH. The fundamental design — collect logs, redact sensitive data, analyze with AI, alert with root cause and recommendation — is identical.
Lab (current implementation)
- Log source: docker exec inside containers
- Protocol: subprocess + shell commands
- Commands: birdc show protocols, ip route show
- Devices: 8 (SP1-SP7, router1)
- Interval: 15 minutes via n8n
- Alerting: Gmail via n8n SMTP
- Remediation: fix_bird_configs.py via Flask API
- Scale: single VM, free tier
Production (equivalent design)
- Log source: gNMI streaming or NETCONF
- Protocol: pygnmi, netmiko, or RESTCONF
- Data: BGP peer state, route table, interface counters
- Devices: hundreds to thousands
- Interval: real-time or 1-5 minute cycle
- Alerting: PagerDuty, Slack, or NOC dashboard
- Remediation: Ansible playbook via AWX/Tower
- Scale: Kubernetes + horizontal scaling
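The lab-side collection listed above can be sketched in Python roughly as follows. This is not the lab's actual script: the function names, the 30-second timeout, and the container name in the usage note are assumptions, and the runner is passed in as a parameter so that the same collection loop could later be pointed at a gNMI or NETCONF client instead of docker exec.

```python
import subprocess
from typing import Callable, Dict, Iterable, List

# Commands named in the lab description; the labels are illustrative.
COMMANDS: Dict[str, List[str]] = {
    "bgp_protocols": ["birdc", "show", "protocols"],
    "routes": ["ip", "route", "show"],
}


def run_in_container(container: str, command: List[str]) -> str:
    """Run one command inside a container via docker exec and return its output."""
    result = subprocess.run(
        ["docker", "exec", container, *command],
        capture_output=True,
        text=True,
        timeout=30,
    )
    if result.returncode != 0:
        return f"ERROR (exit {result.returncode}): {result.stderr.strip()}"
    return result.stdout


def collect(
    containers: Iterable[str],
    runner: Callable[[str, List[str]], str] = run_in_container,
) -> Dict[str, Dict[str, str]]:
    """Gather the output of every command from every device."""
    return {
        name: {label: runner(name, cmd) for label, cmd in COMMANDS.items()}
        for name in containers
    }
```

A call such as `collect(["SP4"])` returns a nested dictionary keyed by device and then by command label. The real container names depend on how Containerlab names the nodes in the lab.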
What changes in production
The log collection method changes from docker exec to a proper network management protocol. Instead of running shell commands inside containers, the production version would subscribe to gNMI telemetry streams from each switch, which pushes data in real time without polling. The Flask API would be replaced with a Kafka consumer that processes telemetry events as they arrive.
The AI analysis layer does not change at all. Gemini receives the same redacted structured log regardless of whether it was collected via docker exec or gNMI. The prompt, the response format, the severity parsing — all identical.
The remediation layer changes from running a Python script directly to triggering an Ansible playbook via AWX or Red Hat Ansible Automation Platform. The playbook applies the fix to the affected device and verifies the result, with full audit trail logging. The human-in-the-loop approval step through n8n remains the same design pattern.
Why this AI agent design is production-correct
The three architectural decisions in the lab that make it production-transferable are the same decisions production teams make:
Redaction before AI. No sensitive data reaches an external AI service. This is not optional in a production network — customer IPs, device hostnames, and routing policy information are confidential. The redaction layer is not a convenience feature, it is a security control.
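A redaction step of this kind might look like the Python sketch below. It is an assumption about the approach, not the lab's actual code: it replaces IPv4 addresses and prefixes, plus a caller-supplied list of hostnames, with stable placeholders, and returns the mapping so that alerts can be translated back locally. A production control would also need to cover IPv6, AS numbers, descriptions, and any other identifying fields.

```python
import re
from typing import Dict, Iterable, Tuple

# IPv4 address with an optional prefix length, e.g. 10.1.1.1 or 10.4.4.0/30.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?:/\d{1,2})?\b")


def redact(text: str, hostnames: Iterable[str] = ()) -> Tuple[str, Dict[str, str]]:
    """Replace IPs and hostnames with placeholders; return text and mapping."""
    mapping: Dict[str, str] = {}

    def placeholder(kind: str, value: str) -> str:
        # Reuse the same placeholder for a repeated value so that the
        # relationships between log lines survive redaction.
        if value not in mapping:
            count = sum(1 for v in mapping.values() if v.startswith(f"<{kind}_"))
            mapping[value] = f"<{kind}_{count + 1}>"
        return mapping[value]

    redacted = IPV4_RE.sub(lambda m: placeholder("IP", m.group(0)), text)
    # Longest names first so that "router10" is not partially matched by "router1".
    for name in sorted(hostnames, key=len, reverse=True):
        pattern = rf"\b{re.escape(name)}\b"
        redacted = re.sub(pattern, lambda m, n=name: placeholder("HOST", n), redacted)
    return redacted, mapping
```

Only the redacted text would be sent to the external AI service. The mapping stays inside the network boundary and is used to restore real names in the alert that reaches the engineer.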
Human approval before remediation. The AI detects and recommends but does not act autonomously on production infrastructure. The approve-before-fix pattern through the webhook is the correct design for any automated remediation in a real environment. Fully autonomous remediation without human approval is an anti-pattern in production networks where a wrong automated fix can cause more damage than the original fault.
Structured output from AI. The prompt instructs Gemini to respond in a parseable format with fixed field names (SEVERITY, ANOMALY_DETECTED, RECOMMENDED_ACTION). This makes the AI output machine-readable so downstream systems (n8n, alerting, ticketing) can process it programmatically rather than treating it as free text. This is the difference between an AI tool and an AI agent.
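A parser for such a response could be as small as the Python sketch below. The three field names come from the description above, but the exact layout is an assumption: the sketch expects one `FIELD: value` pair per line, and the function name `parse_ai_response` is hypothetical.

```python
from typing import Dict

REQUIRED_FIELDS = ("SEVERITY", "ANOMALY_DETECTED", "RECOMMENDED_ACTION")


def parse_ai_response(text: str) -> Dict[str, str]:
    """Extract the fixed fields from a 'FIELD: value' style response."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().upper()
        # Lines without a colon, or with an unknown field name, are ignored.
        if sep and key in REQUIRED_FIELDS:
            fields[key] = value.strip()
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        # Fail loudly rather than alerting on a response that cannot be trusted.
        raise ValueError(f"AI response missing fields: {', '.join(missing)}")
    return fields
```

Rejecting a response with missing fields, instead of guessing, keeps a malformed model reply from silently turning into a wrong severity or an empty recommendation downstream.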
At TikTok's scale, the same pattern runs at thousands of times this volume. The tools change (Kafka instead of n8n, Ansible instead of Python scripts, gNMI instead of docker exec) but the architecture is identical: collect, redact, analyze, alert with context, approve, remediate, verify. This lab demonstrates all seven steps working end to end.