Spine-leaf topology and BGP design
The lab runs a 13-node enterprise spine-leaf topology in Containerlab on a single Oracle Cloud ARM VM (4 CPU, 24 GB RAM, free tier). The routing protocol is eBGP with an AS-per-rack design, the pattern described in RFC 7938 for large-scale data centers and associated with operators such as Meta, Amazon, and TikTok. Each rack gets its own autonomous system number, the spines share AS65001, and the edge router represents an upstream ISP or WAN connection at AS65000.
IP addressing
| Link | Side A | Side B | Subnet |
|---|---|---|---|
| router1 → SP1 | router1:eth1 = 10.0.1.1 | SP1:eth1 = 10.0.1.2 | 10.0.1.0/30 |
| router1 → SP2 | router1:eth2 = 10.0.2.1 | SP2:eth1 = 10.0.2.2 | 10.0.2.0/30 |
| router1 → SP3 | router1:eth3 = 10.0.3.1 | SP3:eth1 = 10.0.3.2 | 10.0.3.0/30 |
| SP1 → SP4 | SP1:eth2 = 10.1.1.1 | SP4:eth1 = 10.1.1.2 | 10.1.1.0/30 |
| SP1 → SP5 | SP1:eth3 = 10.1.2.1 | SP5:eth1 = 10.1.2.2 | 10.1.2.0/30 |
| SP2 → SP4 | SP2:eth2 = 10.2.1.1 | SP4:eth2 = 10.2.1.2 | 10.2.1.0/30 |
| SP3 → SP4 | SP3:eth2 = 10.3.1.1 | SP4:eth3 = 10.3.1.2 | 10.3.1.0/30 |
| SP4 → SRV1 | SP4:eth4 = 10.4.1.1 | SRV1:eth1 | 10.4.1.0/30 |
| SP7 → SRV4 | SP7:eth4 = 10.4.4.1 | SRV4:eth1 | 10.4.4.0/30 |
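The addressing plan in the table can be sanity-checked with a short script. The following is a minimal sketch using Python's standard ipaddress module; LINKS and check_links are illustrative names, not part of the lab's scripts. The server-facing rows are left out because the table does not list the server-side addresses.

```python
import ipaddress

# (link name, side A address, side B address, subnet), copied from the table above
LINKS = [
    ("router1-SP1", "10.0.1.1", "10.0.1.2", "10.0.1.0/30"),
    ("router1-SP2", "10.0.2.1", "10.0.2.2", "10.0.2.0/30"),
    ("router1-SP3", "10.0.3.1", "10.0.3.2", "10.0.3.0/30"),
    ("SP1-SP4", "10.1.1.1", "10.1.1.2", "10.1.1.0/30"),
    ("SP1-SP5", "10.1.2.1", "10.1.2.2", "10.1.2.0/30"),
    ("SP2-SP4", "10.2.1.1", "10.2.1.2", "10.2.1.0/30"),
    ("SP3-SP4", "10.3.1.1", "10.3.1.2", "10.3.1.0/30"),
]


def check_links(links):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    seen = {}  # link name -> network, used for the overlap check
    for name, side_a, side_b, subnet in links:
        net = ipaddress.ip_network(subnet)
        usable = set(net.hosts())  # a /30 has exactly two usable host addresses
        for addr in (side_a, side_b):
            if ipaddress.ip_address(addr) not in usable:
                problems.append(f"{name}: {addr} is not a usable host in {subnet}")
        for other_name, other_net in seen.items():
            if net.overlaps(other_net):
                problems.append(f"{name}: {subnet} overlaps {other_name}")
        seen[name] = net
    return problems
```

Calling check_links(LINKS) returns an empty list when every endpoint sits inside its /30 and no two links share address space.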
All 19 cable connections are stored in NetBox as the source of truth. The add_netbox_cables.py script reads the topology definition and creates every cable via the NetBox dcim.cables API automatically. Interface types are set to 1000base-t since NetBox requires physical interfaces for cable terminations.
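As a rough illustration of what one cable creation request might contain, the sketch below builds the request body for a single cable. It assumes NetBox 3.3 or later, where each cable end is a list of terminations identified by object type and object ID; cable_payload is a hypothetical helper, not a function from add_netbox_cables.py.

```python
def cable_payload(a_interface_id, b_interface_id):
    """Build the body for one cable between two interface IDs.

    Assumption: NetBox 3.3+ termination format (object_type + object_id).
    """
    return {
        "a_terminations": [
            {"object_type": "dcim.interface", "object_id": a_interface_id}
        ],
        "b_terminations": [
            {"object_type": "dcim.interface", "object_id": b_interface_id}
        ],
    }
```

With pynetbox, a body like this would typically be passed as keyword arguments, for example nb.dcim.cables.create(**cable_payload(a.id, b.id)), where a and b are interface records looked up beforehand.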
BGP configuration with BIRD2
Each node runs BIRD2 as its routing daemon. Each device originates its own connected subnets into BGP through static blackhole routes: the static protocol places each subnet in BIRD's master routing table, and the BGP protocols then export those routes to all peers. Without the static routes, the BGP sessions establish but no prefixes are exchanged.
    router id 10.0.0.1;
    protocol kernel { ipv4 { export all; import all; }; }
    protocol device { scan time 10; }

    # Advertise connected subnets into BGP via static blackhole routes
    protocol static {
        ipv4;
        route 10.1.1.0/30 blackhole;  # SP1-SP4 link
        route 10.1.2.0/30 blackhole;  # SP1-SP5 link
        route 10.1.3.0/30 blackhole;  # SP1-SP6 link
        route 10.1.4.0/30 blackhole;  # SP1-SP7 link
    }

    # eBGP session to edge router (upstream)
    protocol bgp upstream {
        local 10.0.1.2 as 65001;
        neighbor 10.0.1.1 as 65000;
        ipv4 { import all; export all; };
    }

    # eBGP sessions to each leaf switch (AS per rack)
    protocol bgp sp4 {
        local 10.1.1.1 as 65001;
        neighbor 10.1.1.2 as 65004;
        ipv4 { import all; export all; };
    }
    protocol bgp sp5 { local 10.1.2.1 as 65001; neighbor 10.1.2.2 as 65005; ipv4 { import all; export all; }; }
    protocol bgp sp6 { local 10.1.3.1 as 65001; neighbor 10.1.3.2 as 65006; ipv4 { import all; export all; }; }
    protocol bgp sp7 { local 10.1.4.1 as 65001; neighbor 10.1.4.2 as 65007; ipv4 { import all; export all; }; }

The fix_bird_configs.py Python script writes this configuration directly into each running container using docker exec -i ... tee /etc/bird.conf, then restarts BIRD. This is the key automation step that makes the lab self-healing after a restart.
    import subprocess

    def write_config(device, config):
        """Pipe the config text into /etc/bird.conf inside the container."""
        container = f"clab-enterprise-spine-leaf-{device}"
        result = subprocess.run(
            ["sudo", "docker", "exec", "-i", container,
             "sh", "-c", "tee /etc/bird.conf > /dev/null"],
            input=config, text=True, capture_output=True
        )
        return result.returncode == 0

    def restart_bird(device):
        """Stop any running BIRD process, then start it with the new config."""
        container = f"clab-enterprise-spine-leaf-{device}"
        subprocess.run(
            ["sudo", "docker", "exec", container,
             "sh", "-c", "pkill bird 2>/dev/null; sleep 1; bird -c /etc/bird.conf"]
        )
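End-to-end verification: a ping from SP4 (AS65004, 10.4.1.0/30) to SP7's subnet (10.4.4.1) crosses the spine layer using routes learned over eBGP. A successful ping confirms leaf-to-leaf connectivity through the fabric. Under the AS plan above, the eight routing devices span six autonomous systems (AS65000, AS65001, and AS65004 through AS65007).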
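Beyond the ping, session state can be checked per device. The sketch below parses the text returned by birdc show protocols and reports BGP sessions that are not Established. It assumes the usual column order (Name, Proto, Table, State, Since, Info) with a single-token Since column; both function names are hypothetical.

```python
def bgp_session_states(birdc_output):
    """Map each BGP protocol name to its session state.

    Assumes rows of the form: Name Proto Table State Since [Info ...],
    where the first Info token is the BGP state (e.g. Established, Active).
    """
    states = {}
    for line in birdc_output.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[1] == "BGP":
            # Fall back to the State column when no Info column is printed
            states[fields[0]] = fields[5] if len(fields) > 5 else fields[3]
    return states


def sessions_not_established(birdc_output):
    """Return the sorted names of BGP protocols that are not Established."""
    return sorted(
        name
        for name, state in bgp_session_states(birdc_output).items()
        if state != "Established"
    )
```

The input would be the captured stdout of a docker exec call that runs birdc show protocols inside a container; an empty result means every BGP session on that device is up.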
LLDP validation against NetBox
LLDP (Link Layer Discovery Protocol) is a vendor-neutral Layer 2 protocol where every network device advertises its own identity and connected port to directly attached neighbors. By collecting this data from all devices and comparing it against the 19 cables stored in NetBox, the lab can detect any mismatch between the expected topology and the actual running state.
This is how enterprise network teams catch cable mistakes after a hardware replacement, detect rogue devices plugged in without authorization, and verify that their CMDB (NetBox) accurately reflects the physical infrastructure.
1. Install lldpd on all containers

Alpine-based containers (spine/leaf) use apk; Ubuntu-based servers use apt. lldpd starts advertising to all connected neighbors as soon as it runs.

    sudo docker exec clab-enterprise-spine-leaf-SP1 \
        sh -c "apk add lldpd -q && lldpd"
2. Collect live LLDP neighbors via docker exec

Python runs lldpcli show neighbors inside each container and parses the output into a structured list of local interface, neighbor name, and remote interface.

    def parse_lldp_output(output):
        neighbors = []
        current = {}
        for line in output.split("\n"):
            if "Interface:" in line and "via:" in line:
                # A new neighbor block starts; save the previous one first
                if current:
                    neighbors.append(current)
                current = {"local_interface": line.split("Interface:")[1].split(",")[0].strip()}
            elif "SysName:" in line:
                current["neighbor"] = line.split("SysName:")[1].strip()
            elif "PortDescr:" in line:
                current["remote_interface"] = line.split("PortDescr:")[1].strip()
        # Save the last neighbor block
        if current:
            neighbors.append(current)
        return neighbors
3. Fetch expected connections from NetBox

pynetbox looks up the cables attached to each device's interfaces (19 cables across the lab). The result uses the same structure as the LLDP data: local interface, neighbor device, remote interface.

    nb = pynetbox.api(NETBOX_URL, token=NETBOX_TOKEN)
    interfaces = list(nb.dcim.interfaces.filter(device=device))
    for iface in interfaces:
        if iface.cable:
            cable = nb.dcim.cables.get(iface.cable.id)
            for term in cable.a_terminations + cable.b_terminations:
                # Keep only the far end of the cable
                if str(term.object.device) != device:
                    connections.append({...})
4. Compare and flag mismatches

Every expected connection is checked against live LLDP. MISSING_LINK means the link is in NetBox but not seen in LLDP. UNEXPECTED_LINK means the link is seen in LLDP but not in NetBox. Both are flagged as HIGH severity.

    Results for SP1:
    SP1:eth2 ↔ SP4:eth1  ✓ match
    SP1:eth3 ↔ SP5:eth1  ✓ match
    SP1:eth4 ↔ SP6:eth1  ✓ match
    SP1:eth5 ↔ SP7:eth1  ✓ match
    SP1:eth1 ↔ router1   ✓ match
    Total: 30 matches, 4 missing (servers without lldpd)
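The comparison in the last step comes down to two set differences. The sketch below is one possible way to express it, assuming both inputs are lists of dictionaries with the local_interface, neighbor, and remote_interface keys used in the earlier steps; compare_links and the shape of its return value are illustrative, not the lab's actual code.

```python
def compare_links(expected, live):
    """Compare NetBox links (expected) with LLDP links (live).

    Returns (matches, findings): matches is a sorted list of link tuples seen
    in both sources, findings lists every mismatch with HIGH severity.
    """
    def key(link):
        return (link["local_interface"], link["neighbor"], link["remote_interface"])

    expected_keys = {key(link) for link in expected}
    live_keys = {key(link) for link in live}

    findings = []
    # In NetBox but not seen in LLDP
    for link in sorted(expected_keys - live_keys):
        findings.append({"type": "MISSING_LINK", "severity": "HIGH", "link": link})
    # Seen in LLDP but not recorded in NetBox
    for link in sorted(live_keys - expected_keys):
        findings.append({"type": "UNEXPECTED_LINK", "severity": "HIGH", "link": link})

    matches = sorted(expected_keys & live_keys)
    return matches, findings
```

Using sets keeps the check independent of the order in which LLDP or NetBox returns neighbors. A device without lldpd, such as the servers in the results above, shows up as MISSING_LINK entries on its leaf.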
Automated switch config generation and replacement planning
Switch replacement is one of the most error-prone operations in network engineering. Generating a new switch config manually means transcribing VLANs, IP addresses, BGP neighbors, interface descriptions, and port channels from the old device to the new one. A single missed VLAN or wrong subnet causes a production outage.
This automation pipeline pulls the complete device record from NetBox (interfaces, IPs, roles, connected devices), generates the replacement configuration using Python and Jinja2 templates, validates it against the live topology using Netmiko, and prepares a cutover plan. The GitHub Actions pipeline can then deploy it automatically after human approval.
    import os

    import pynetbox
    import yaml

    NETBOX_TOKEN = os.environ["NETBOX_TOKEN"]  # assumption: token supplied via the environment
    nb = pynetbox.api("https://netbox.networkforai.com", token=NETBOX_TOKEN)
    os.makedirs("host_vars", exist_ok=True)

    for device in nb.dcim.devices.all():
        interfaces = list(nb.dcim.interfaces.filter(device=device.name))
        ips = list(nb.ipam.ip_addresses.filter(device=device.name))
        cables = [i for i in interfaces if i.cable]
        host_vars = {
            "hostname": device.name,
            "role": str(device.role),
            # primary_ip4 is a device-level field, so per-interface addresses
            # are taken from the IPAM records assigned to each interface
            "interfaces": [{"name": ip.assigned_object.name, "ip": str(ip.address)}
                           for ip in ips if ip.assigned_object],
            "bgp_peers": [...],  # derived from cable endpoints
        }
        path = f"host_vars/{device.name}.yml"
        with open(path, "w") as f:
            yaml.dump(host_vars, f)
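To show how a host_vars record could turn into device configuration, the sketch below renders BIRD BGP stanzas from a list of peers. It uses plain Python string formatting as a stand-in for the Jinja2 templates mentioned above, so it runs without extra dependencies. The peer fields (name, local_ip, local_as, neighbor_ip, neighbor_as) are assumed for illustration; the real bgp_peers structure is not shown in this article.

```python
# Stand-in for a Jinja2 template; doubled braces are literal braces in str.format
BGP_STANZA = (
    "protocol bgp {name} {{\n"
    "    local {local_ip} as {local_as};\n"
    "    neighbor {neighbor_ip} as {neighbor_as};\n"
    "    ipv4 {{ import all; export all; }};\n"
    "}}\n"
)


def render_bgp(peers):
    """Render one BIRD 'protocol bgp' block per peer dictionary."""
    return "\n".join(BGP_STANZA.format(**peer) for peer in peers)


# Hypothetical peer record matching the SP1 to SP4 session shown earlier
example_peers = [
    {"name": "sp4", "local_ip": "10.1.1.1", "local_as": 65001,
     "neighbor_ip": "10.1.1.2", "neighbor_as": 65004},
]
print(render_bgp(example_peers))
```

Keeping the template separate from the data means a replacement switch only needs a corrected host_vars file; the stanza text itself never has to be retyped.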
The startup.py master script runs configure_devices.py followed by fix_bird_configs.py on every deployment. This means the entire lab can be fully reconfigured from source in about 60 seconds after any container restart or redeploy.
n8n workflow — automated monitoring, AI analysis, and email alerting
n8n is a self-hosted workflow automation platform, similar to Zapier but source-available, and here it runs on the Oracle VM. The workflow connects the schedule trigger, the Flask API, Gemini AI, and Gmail into a fully automated monitoring loop. Nobody needs to watch a dashboard: the pipeline runs on its own every 15 minutes and sends an email only when something is actually wrong.
1. Schedule Trigger: every 15 minutes

n8n fires the workflow automatically on a fixed interval. No cron jobs, no external scheduler. The timing is configured inside the n8n UI and persists across VM reboots.
2. HTTP Request POST: trigger the Flask API

n8n sends a POST request to https://anomaly.networkforai.com/collect-and-analyze with a 300-second timeout. The API runs the full collection and analysis pipeline and returns a JSON result.

    POST /collect-and-analyze
    Content-Type: application/json

    Response:
    {
      "total_devices_checked": 8,
      "anomalies_found": 2,
      "anomalies": [
        {
          "device": "SP1",
          "severity": "CRITICAL",
          "analysis": "BGP sessions sp5, sp6, sp7 in Active state...",
          "recommendation": "Check BIRD config, verify neighbor IPs..."
        }
      ]
    }
3. Log collection inside Flask API

The Flask API runs three commands on each of the 8 devices via docker exec: birdc show protocols (BGP session status), birdc show route count (route table health), and ip route show (kernel routing table). Raw output is collected for all 8 devices before analysis begins. A 4-second sleep between devices prevents hitting Gemini's free tier rate limit.

    import time

    DEVICES = ["SP1", "SP2", "SP3", "SP4", "SP5", "SP6", "SP7", "router1"]

    for device in DEVICES:
        raw_logs = collect_device_logs(device)  # docker exec birdc/ip commands
        redacted = redact_log(raw_logs)         # mask IPs, MACs, credentials
        analysis = analyze_with_gemini(device, redacted["redacted"])
        time.sleep(4)                           # rate limit buffer
4. Log redaction before AI analysis

Before any log data is sent to Gemini, all sensitive fields are replaced with tokens. IP addresses become [IP_001], [IP_002], and so on, consistently throughout the log. MAC addresses become [MAC_REDACTED]. Credentials are replaced. This keeps customer and infrastructure identifiers out of the data sent to the external AI service, while the redacted log stays readable and analyzable.

    Original: neighbor 10.1.1.2 as 65004 Established
    Redacted: neighbor [IP_003] as AS1 Established

    Original: 10.4.4.0/30 via 10.1.1.1 dev eth1 proto bird
    Redacted: [IP_012]/30 via [IP_003] dev eth1 proto bird
5. Gemini 2.5 Flash analysis

The redacted log is sent to Google Gemini with a structured prompt that instructs it to act as a senior network operations engineer. Gemini returns its response in a strict format that includes severity level, whether an anomaly was detected, issues found, root cause, and recommended action. The response is parsed with simple string matching.

    Prompt instructions to Gemini:
    - You are a senior network operations engineer
    - Analyze BGP session state, route table health, kernel routes
    - Respond in EXACT format:
        SEVERITY: [LOW/MEDIUM/HIGH/CRITICAL]
        ANOMALY_DETECTED: [YES/NO]
        ISSUES_FOUND: [description]
        ROOT_CAUSE: [likely cause]
        RECOMMENDED_ACTION: [steps to resolve]
        URGENCY: [immediate/scheduled/monitoring]
6. IF node: conditional branching

n8n evaluates anomalies_found > 0. The True branch triggers the Gmail alert. The False branch does nothing. No email is sent when all devices are healthy, which keeps the signal-to-noise ratio high.
7. Gmail alert with full analysis

The email includes the device name, the severity, and the full Gemini analysis (root cause and recommended action), plus a link to approve automatic remediation. The approve link triggers the /remediate endpoint, which runs fix_bird_configs.py to restore the network to the correct state without any manual intervention.

    Subject: Network Anomaly Detected - SP1 [CRITICAL]

    Device: SP1
    Severity: CRITICAL
    Time: 2025-12-15T14:30:00

    ISSUES FOUND:
    SP1 BGP sessions to sp5, sp6, sp7 in Active state. Route table empty (0 of 0 routes in master4).

    ROOT CAUSE:
    BIRD config missing static protocol block. BGP sessions established but no routes being originated.

    RECOMMENDED ACTION:
    Run fix_bird_configs.py to restore static route advertisements. Verify birdc show protocols after reload.

    [ Approve Auto-Fix ] → https://n8n.networkforai.com/webhook/approve
8. Webhook approve: one-click remediation

Clicking the approve link in the email hits the n8n webhook, which then calls POST /remediate on the Flask API. The API runs fix_bird_configs.py, which rewrites the BIRD configs on all 8 devices and restarts BIRD. The network is fully restored in about 45 seconds with no terminal access needed.
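The redaction and response-parsing steps in the workflow above can be sketched in a few lines. This is a simplified illustration, not the lab's implementation: it tokenizes only IPv4 and MAC addresses (credentials and AS numbers are left out), and the ip_map key and the parse_analysis function are hypothetical additions. The returned dictionary keeps the "redacted" key that the collection loop reads.

```python
import re

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")

# Field names from the strict response format in the prompt
FIELDS = ("SEVERITY", "ANOMALY_DETECTED", "ISSUES_FOUND",
          "ROOT_CAUSE", "RECOMMENDED_ACTION", "URGENCY")


def redact_log(raw):
    """Replace IPv4 and MAC addresses with tokens; the same IP always gets the same token."""
    ip_tokens = {}

    def ip_token(match):
        ip = match.group(0)
        if ip not in ip_tokens:
            ip_tokens[ip] = f"[IP_{len(ip_tokens) + 1:03d}]"
        return ip_tokens[ip]

    redacted = MAC_RE.sub("[MAC_REDACTED]", raw)
    redacted = IP_RE.sub(ip_token, redacted)
    return {"redacted": redacted, "ip_map": ip_tokens}


def parse_analysis(text):
    """Parse 'FIELD: value' lines from the model response with plain string matching."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            result[key.strip().lower()] = value.strip()
    # Boolean convenience flag for the anomaly count
    result["anomaly"] = result.get("anomaly_detected", "").upper() == "YES"
    return result
```

Because tokens are assigned per address in order of first appearance, a neighbor IP that appears in both the BGP output and the kernel routes maps to the same token, so the model can still correlate the two. The ip_map stays on the VM and could be used to translate the analysis back before it is emailed.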
GitHub Actions CI/CD pipeline
A self-hosted GitHub Actions runner runs on the Oracle VM as a systemd service. It connects to GitHub and listens for push events on the main branch. Every code commit triggers the deployment pipeline which reconfigures the entire lab and runs a connectivity test.
    name: Network Config Pipeline
    on:
      push:
        branches: [ main ]
    jobs:
      deploy:
        runs-on: self-hosted
        steps:
          - name: Checkout code
            uses: actions/checkout@v4
          - name: Configure device interfaces
            run: |
              source /home/ubuntu/venv/bin/activate
              python3 /home/ubuntu/network-automation/scripts/configure_devices.py
          - name: Write BGP configs with static routes
            run: |
              source /home/ubuntu/venv/bin/activate
              python3 /home/ubuntu/network-automation/scripts/fix_bird_configs.py
          - name: Verify end-to-end connectivity
            run: |
              sleep 30
              sudo docker exec clab-enterprise-spine-leaf-SP4 \
                ping -c 3 10.4.4.1  # SP4 to SP7 subnet via 3 spines
The runner requires sudo access to docker for the ping test step. This is configured in /etc/sudoers.d/github-runner on the Oracle VM so the runner can execute docker exec commands without a password prompt during CI runs.