Network Automation Lab | NetworkForAI

AIOps check interval: 15m

Infrastructure

Spine-leaf topology and BGP design

The lab runs a 13-node enterprise spine-leaf topology in Containerlab on a single Oracle Cloud ARM VM (4 CPU, 24 GB RAM, free tier). Routing is eBGP with an AS-per-rack design, the approach described in RFC 7938 and the way large-scale data centers like those at Meta, Amazon, and TikTok are built. Each rack's leaf switch gets its own autonomous system number, the spines share AS65001, and the edge router, which represents an upstream ISP or WAN connection, uses AS65000.

Edge router: router1 (AS65000, mgmt 172.20.20.2)

↓ eBGP: eth1=10.0.1.x eth2=10.0.2.x eth3=10.0.3.x ↓

Spines (AS65001): SP1 (.11), SP2 (.12), SP3 (.13)

↓ eBGP, AS per rack: /30 subnets per spine-leaf link ↓

Leaves / ToR (unique AS): SP4 (AS65004, .14), SP5 (AS65005, .15), SP6 (AS65006, .16), SP7 (AS65007, .17)

↓ access layer ↓

Servers: SRV1 (.21), SRV2 (.22), SRV3 (.23), SRV4 (.24)

IP addressing

Link	Side A	Side B	Subnet
router1 → SP1	router1:eth1 = 10.0.1.1	SP1:eth1 = 10.0.1.2	10.0.1.0/30
router1 → SP2	router1:eth2 = 10.0.2.1	SP2:eth1 = 10.0.2.2	10.0.2.0/30
router1 → SP3	router1:eth3 = 10.0.3.1	SP3:eth1 = 10.0.3.2	10.0.3.0/30
SP1 → SP4	SP1:eth2 = 10.1.1.1	SP4:eth1 = 10.1.1.2	10.1.1.0/30
SP1 → SP5	SP1:eth3 = 10.1.2.1	SP5:eth1 = 10.1.2.2	10.1.2.0/30
SP2 → SP4	SP2:eth2 = 10.2.1.1	SP4:eth2 = 10.2.1.2	10.2.1.0/30
SP3 → SP4	SP3:eth2 = 10.3.1.1	SP4:eth3 = 10.3.1.2	10.3.1.0/30
SP4 → SRV1	SP4:eth4 = 10.4.1.1	SRV1:eth1	10.4.1.0/30
SP7 → SRV4	SP7:eth4 = 10.4.4.1	SRV4:eth1	10.4.4.0/30
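
The spine-leaf rows above appear to follow a regular pattern: the second octet is the spine number, the third octet is the leaf's position (SP4 is 1, SP5 is 2, and so on), side A takes .1 and side B takes .2. A minimal Python sketch of that pattern, assuming it holds for every spine-leaf link; the helper name spine_leaf_link is hypothetical and not one of the lab's scripts:

```python
import ipaddress


def spine_leaf_link(spine: int, leaf: int) -> dict:
    """Return the /30 for the link between SP<spine> and SP<leaf>.

    Assumes the pattern seen in the table: 10.<spine>.<leaf - 3>.0/30,
    with the spine on the first host address and the leaf on the second.
    """
    subnet = ipaddress.ip_network(f"10.{spine}.{leaf - 3}.0/30")
    spine_ip, leaf_ip = list(subnet.hosts())  # a /30 has exactly two hosts
    return {"subnet": str(subnet), "spine_ip": str(spine_ip), "leaf_ip": str(leaf_ip)}


print(spine_leaf_link(1, 4))
# {'subnet': '10.1.1.0/30', 'spine_ip': '10.1.1.1', 'leaf_ip': '10.1.1.2'}
```

Deriving addresses from node numbers rather than listing them by hand keeps the BIRD configs and the NetBox records from drifting apart.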

All 19 cable connections are stored in NetBox as the source of truth. The add_netbox_cables.py script reads the topology definition and creates every cable via the NetBox dcim.cables API automatically. Interface types are set to 1000base-t since NetBox requires physical interfaces for cable terminations.
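
As a rough illustration of what one cable creation looks like, the sketch below builds the request body that recent NetBox versions expect for dcim.cables, where each end is a list of terminations identified by object type and ID. The function name cable_payload and the interface IDs are hypothetical; the real add_netbox_cables.py may be structured differently.

```python
def cable_payload(a_interface_id: int, b_interface_id: int) -> dict:
    """Build the body for one NetBox cable between two interfaces."""
    def termination(interface_id: int) -> dict:
        # Cables attach to interface objects, referenced by their NetBox ID.
        return {"object_type": "dcim.interface", "object_id": interface_id}

    return {
        "a_terminations": [termination(a_interface_id)],
        "b_terminations": [termination(b_interface_id)],
        "status": "connected",
    }


# Hypothetical IDs; in practice they come from nb.dcim.interfaces.get(...).id
payload = cable_payload(11, 42)
print(payload)

# With a pynetbox client the payload would be sent roughly like this:
# nb.dcim.cables.create(**payload)
```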

Routing

BGP configuration with BIRD2

Each routing node (the edge router, the spines, and the leaves) runs BIRD2 as its routing daemon. Every device originates its own connected subnets into BGP through static blackhole routes: the static protocol places each subnet in BIRD's master routing table, and the export all setting on each BGP session then advertises those routes to every peer. Without the static routes, the BGP sessions still establish but no prefixes are exchanged.

/etc/bird.conf — SP1 (spine AS65001)
router id 10.0.0.1;

protocol kernel { ipv4 { export all; import all; }; }
protocol device { scan time 10; }

# Advertise connected subnets into BGP via static blackhole routes
protocol static {
    ipv4;
    route 10.1.1.0/30 blackhole;  # SP1-SP4 link
    route 10.1.2.0/30 blackhole;  # SP1-SP5 link
    route 10.1.3.0/30 blackhole;  # SP1-SP6 link
    route 10.1.4.0/30 blackhole;  # SP1-SP7 link
}

# eBGP session to edge router (upstream)
protocol bgp upstream {
    local 10.0.1.2 as 65001;
    neighbor 10.0.1.1 as 65000;
    ipv4 { import all; export all; };
}

# eBGP sessions to each leaf switch (AS per rack)
protocol bgp sp4 {
    local 10.1.1.1 as 65001;
    neighbor 10.1.1.2 as 65004;
    ipv4 { import all; export all; };
}
protocol bgp sp5 { local 10.1.2.1 as 65001; neighbor 10.1.2.2 as 65005; ipv4 { import all; export all; }; }
protocol bgp sp6 { local 10.1.3.1 as 65001; neighbor 10.1.3.2 as 65006; ipv4 { import all; export all; }; }
protocol bgp sp7 { local 10.1.4.1 as 65001; neighbor 10.1.4.2 as 65007; ipv4 { import all; export all; }; }
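
Because the four leaf sessions differ only in the leaf number, blocks like these lend themselves to generation. The following is a minimal sketch of how the spine-to-leaf protocol blocks could be rendered in Python, assuming the addressing and AS numbering shown in the config above; bgp_block and spine_leaf_sessions are hypothetical names, not functions taken from fix_bird_configs.py.

```python
def bgp_block(name, local_ip, local_as, neighbor_ip, neighbor_as):
    """Render one BIRD2 'protocol bgp' block as text."""
    return (
        f"protocol bgp {name} {{\n"
        f"    local {local_ip} as {local_as};\n"
        f"    neighbor {neighbor_ip} as {neighbor_as};\n"
        f"    ipv4 {{ import all; export all; }};\n"
        f"}}\n"
    )


def spine_leaf_sessions(spine, leaves=(4, 5, 6, 7)):
    """Render the eBGP sessions from SP<spine> to every leaf."""
    blocks = []
    for position, leaf in enumerate(leaves, start=1):
        blocks.append(bgp_block(
            name=f"sp{leaf}",
            local_ip=f"10.{spine}.{position}.1",     # spine side of the /30
            local_as=65001,                          # shared spine AS
            neighbor_ip=f"10.{spine}.{position}.2",  # leaf side of the /30
            neighbor_as=65000 + leaf,                # AS per rack: SP4 is 65004
        ))
    return "\n".join(blocks)


print(spine_leaf_sessions(1))
```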

The fix_bird_configs.py Python script writes this configuration directly into each running container using docker exec -i ... tee /etc/bird.conf, then restarts BIRD. This is the key automation step that makes the lab self-healing after a restart.

fix_bird_configs.py — writing config into container
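import subprocess

def write_config(device, config):
    """Write a BIRD config into the device's container; return True on success."""
    container = f"clab-enterprise-spine-leaf-{device}"
    # The config text is piped on stdin (-i) and tee writes it to /etc/bird.conf
    result = subprocess.run(
        ["sudo", "docker", "exec", "-i", container,
         "sh", "-c", "tee /etc/bird.conf > /dev/null"],
        input=config, text=True, capture_output=True
    )
    return result.returncode == 0

def restart_bird(device):
    """Stop any running BIRD process, then start it with the new config."""
    container = f"clab-enterprise-spine-leaf-{device}"
    subprocess.run(
        ["sudo", "docker", "exec", container,
         "sh", "-c", "pkill bird 2>/dev/null; sleep 1; bird -c /etc/bird.conf"]
    )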

End-to-end verification: a ping from SP4 (AS65004, 10.4.1.0/30) to 10.4.4.1 on SP7 has to cross the spine layer over eBGP, and any of the three spines is a valid path. A successful reply confirms that prefixes originated in one rack's AS reach another rack's AS through the spines.

Topology Validation

LLDP validation against NetBox

LLDP (Link Layer Discovery Protocol) is a vendor-neutral Layer 2 protocol where every network device advertises its own identity and connected port to directly attached neighbors. By collecting this data from all devices and comparing it against the 19 cables stored in NetBox, the lab can detect any mismatch between the expected topology and the actual running state.

This is how enterprise network teams catch cable mistakes after a hardware replacement, detect rogue devices plugged in without authorization, and verify that their CMDB (NetBox) accurately reflects the physical infrastructure.

1. Install lldpd on all containers

Alpine-based containers (spine/leaf) use apk; Ubuntu-based servers use apt. Once started, lldpd begins advertising to all directly connected neighbors.
```
sudo docker exec clab-enterprise-spine-leaf-SP1 \
  sh -c "apk add lldpd -q && lldpd"
```

2. Collect live LLDP neighbors via docker exec

Python runs lldpcli show neighbors inside each container and parses the output into a structured list of local interface, neighbor name, and remote interface.

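def parse_lldp_output(output):
    """Parse lldpcli show neighbors text into a list of neighbor dicts."""
    neighbors = []
    current = {}
    for line in output.split("\n"):
        if "Interface:" in line and "via:" in line:
            # A new neighbor block starts: save the previous one first
            if current:
                neighbors.append(current)
            current = {"local_interface": line.split("Interface:")[1].split(",")[0].strip()}
        elif "SysName:" in line:
            current["neighbor"] = line.split("SysName:")[1].strip()
        elif "PortDescr:" in line:
            current["remote_interface"] = line.split("PortDescr:")[1].strip()
    if current:
        neighbors.append(current)  # save the last block
    return neighbors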

3. Fetch expected connections from NetBox

For each device, pynetbox fetches its interfaces and follows every attached cable to the far end. The result has the same structure as the LLDP data: local interface, neighbor device, remote interface.

import pynetbox

nb = pynetbox.api(NETBOX_URL, token=NETBOX_TOKEN)
connections = []
interfaces = list(nb.dcim.interfaces.filter(device=device))
for iface in interfaces:
    if iface.cable:
        cable = nb.dcim.cables.get(iface.cable.id)
        # Walk both ends of the cable and keep the end on the other device
        for term in cable.a_terminations + cable.b_terminations:
            if str(term.object.device) != device:
                connections.append({...})  # local interface, neighbor device, remote interface

4. Compare and flag mismatches

Every expected connection is checked against live LLDP. MISSING_LINK = in NetBox but not seen in LLDP. UNEXPECTED_LINK = in LLDP but not in NetBox. Both are flagged as HIGH severity.

Results for SP1:
  SP1:eth2 ↔ SP4:eth1  ✓ match
  SP1:eth3 ↔ SP5:eth1  ✓ match
  SP1:eth4 ↔ SP6:eth1  ✓ match
  SP1:eth5 ↔ SP7:eth1  ✓ match
  SP1:eth1 ↔ router1   ✓ match

Total: 30 matches, 4 missing (servers without lldpd)
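
The comparison step can be sketched as a set difference over (local interface, neighbor, remote interface) tuples. This is a minimal illustration rather than the lab's actual script: compare_links is a hypothetical name, the example data is made up, and it assumes the LLDP port description matches the NetBox interface name exactly.

```python
def compare_links(expected, live):
    """Compare NetBox links with LLDP neighbors for one device.

    Both arguments are lists of dicts with the keys local_interface,
    neighbor, and remote_interface. Returns (match_count, findings).
    """
    def key(link):
        return (link["local_interface"], link["neighbor"], link["remote_interface"])

    expected_keys = {key(link) for link in expected}
    live_keys = {key(link) for link in live}

    findings = []
    for link in sorted(expected_keys - live_keys):   # in NetBox, not in LLDP
        findings.append({"type": "MISSING_LINK", "severity": "HIGH", "link": link})
    for link in sorted(live_keys - expected_keys):   # in LLDP, not in NetBox
        findings.append({"type": "UNEXPECTED_LINK", "severity": "HIGH", "link": link})
    return len(expected_keys & live_keys), findings


# Made-up data for SP4: NetBox expects a server link that LLDP does not see
expected = [
    {"local_interface": "eth1", "neighbor": "SP1", "remote_interface": "eth2"},
    {"local_interface": "eth4", "neighbor": "SRV1", "remote_interface": "eth1"},
]
live = [{"local_interface": "eth1", "neighbor": "SP1", "remote_interface": "eth2"}]

matches, findings = compare_links(expected, live)
print(matches, findings)
```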

Switch Replacement Automation

Automated switch config generation and replacement planning

Switch replacement is one of the most error-prone operations in network engineering. Generating a new switch config manually means transcribing VLANs, IP addresses, BGP neighbors, interface descriptions, and port channels from the old device to the new one. A single missed VLAN or wrong subnet causes a production outage.

This automation pipeline pulls the complete device record from NetBox (interfaces, IPs, roles, connected devices), generates the replacement configuration using Python and Jinja2 templates, validates it against the live topology using Netmiko, and prepares a cutover plan. The GitHub Actions pipeline can then deploy it automatically after human approval.

generate_host_vars.py — pulling device data from NetBox
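import os

import pynetbox
import yaml

# Token is read from the environment (variable name assumed)
NETBOX_TOKEN = os.environ["NETBOX_TOKEN"]
nb = pynetbox.api("https://netbox.networkforai.com", token=NETBOX_TOKEN)

os.makedirs("host_vars", exist_ok=True)

for device in nb.dcim.devices.all():
    interfaces = list(nb.dcim.interfaces.filter(device=device.name))
    ips        = list(nb.ipam.ip_addresses.filter(device=device.name))
    cables     = [i for i in interfaces if i.cable]  # cabled interfaces, used to derive bgp_peers

    # Interfaces have no primary_ip4 field (that belongs to the device),
    # so map each assigned IP back to its interface ID
    ip_by_iface = {ip.assigned_object_id: ip.address for ip in ips}

    host_vars = {
        "hostname":   device.name,
        "role":       str(device.role),
        "interfaces": [{"name": i.name, "ip": ip_by_iface[i.id]} for i in interfaces if i.id in ip_by_iface],
        "bgp_peers":  [...],  # derived from cable endpoints
    }

    path = f"host_vars/{device.name}.yml"
    with open(path, "w") as f:
        yaml.dump(host_vars, f)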

The startup.py master script runs configure_devices.py followed by fix_bird_configs.py on every deployment. This means the entire lab can be fully reconfigured from source in about 60 seconds after any container restart or redeploy.

AIOps Pipeline

n8n workflow — automated monitoring, AI analysis, and email alerting

n8n is a self-hosted workflow automation platform, similar to Zapier but with its source code available, and here it runs on the Oracle VM. The workflow connects the schedule trigger, the Flask API, Gemini AI, and Gmail into a fully automated monitoring loop. Nobody has to watch a dashboard: the pipeline runs on its own every 15 minutes and sends an email only when something is actually wrong.

1. Schedule Trigger: every 15 minutes

n8n fires the workflow automatically on a fixed interval. No cron jobs, no external scheduler. The timing is configured inside the n8n UI and persists across VM reboots.

2. HTTP Request POST: trigger the Flask API

n8n sends a POST request to https://anomaly.networkforai.com/collect-and-analyze with a 300-second timeout. The API runs the full collection and analysis pipeline and returns a JSON result.

POST /collect-and-analyze
Content-Type: application/json

Response:
{
  "total_devices_checked": 8,
  "anomalies_found": 2,
  "anomalies": [
    {
      "device": "SP1",
      "severity": "CRITICAL",
      "analysis": "BGP sessions sp5, sp6, sp7 in Active state...",
      "recommendation": "Check BIRD config, verify neighbor IPs..."
    }
  ]
}

3. Log collection inside Flask API

The Flask API runs three commands on each of the 8 devices via docker exec: birdc show protocols (BGP session status), birdc show route count (route table health), and ip route show (kernel routing table). Each device's raw output is redacted and analyzed before the loop moves on to the next device, and a 4-second sleep between devices keeps the requests under Gemini's free-tier rate limit.

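import time

DEVICES = ["SP1", "SP2", "SP3", "SP4", "SP5", "SP6", "SP7", "router1"]

# collect_device_logs, redact_log, and analyze_with_gemini are defined elsewhere in the API
results = []
for device in DEVICES:
    raw_logs = collect_device_logs(device)     # docker exec birdc/ip commands
    redacted = redact_log(raw_logs)            # mask IPs, MACs, credentials
    analysis = analyze_with_gemini(device, redacted["redacted"])
    results.append({"device": device, "analysis": analysis})
    time.sleep(4)                              # rate limit buffer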
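
The body of collect_device_logs is not shown in the original. One plausible shape, assuming the container naming used by fix_bird_configs.py and the three commands listed above, is sketched here; the run parameter and the section header format are illustrative choices that make the function testable without Docker.

```python
import subprocess

# The three commands named in the text
COMMANDS = ["birdc show protocols", "birdc show route count", "ip route show"]


def collect_device_logs(device, run=subprocess.run):
    """Run each command in the device's container and join the raw output."""
    container = f"clab-enterprise-spine-leaf-{device}"
    sections = []
    for command in COMMANDS:
        result = run(
            ["sudo", "docker", "exec", container, "sh", "-c", command],
            capture_output=True, text=True, timeout=30,
        )
        # Label each section so the analysis step can tell the outputs apart
        sections.append(f"### {command}\n{result.stdout}")
    return "\n".join(sections)
```

Passing a stand-in for run lets the collection logic be exercised in a unit test while the default still calls docker exec on the VM.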

4. Log redaction before AI analysis

Before any log data is sent to Gemini, all sensitive fields are replaced with tokens. IP addresses become [IP_001], [IP_002], etc. consistently throughout the log. MAC addresses become [MAC_REDACTED]. Credentials are replaced. This ensures that no customer or infrastructure data is sent to an external AI service. The redacted log is still fully readable and analyzable.
```
Original:  neighbor 10.1.1.2 as 65004  Established
Redacted:  neighbor [IP_003] as AS1     Established

Original:  10.4.4.0/30 via 10.1.1.1 dev eth1 proto bird
Redacted:  [IP_012]/30 via [IP_003] dev eth1 proto bird
```
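
A minimal sketch of consistent IP tokenization, assuming IPv4 addresses and colon-separated MACs only. It returns a dict with a "redacted" key to match how redact_log is used in the collection loop, but the regexes and the ip_map key are assumptions, and credential masking is left out.

```python
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MAC = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def redact_log(text):
    """Replace IPs with numbered tokens and MACs with a fixed marker."""
    tokens = {}

    def ip_token(match):
        ip = match.group(0)
        if ip not in tokens:
            # Number tokens in order of first appearance: [IP_001], [IP_002], ...
            tokens[ip] = f"[IP_{len(tokens) + 1:03d}]"
        return tokens[ip]

    redacted = MAC.sub("[MAC_REDACTED]", text)
    redacted = IPV4.sub(ip_token, redacted)
    return {"redacted": redacted, "ip_map": tokens}


sample = "neighbor 10.1.1.2 as 65004\n10.4.4.0/30 via 10.1.1.2 dev eth1"
print(redact_log(sample)["redacted"])
# neighbor [IP_001] as 65004
# [IP_002]/30 via [IP_001] dev eth1
```

Because the same address always maps to the same token within one log, relationships such as "this route's next hop is that BGP neighbor" remain visible in the redacted text.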

5. Gemini 2.5 Flash analysis

The clean redacted log is sent to Google Gemini AI with a structured prompt that instructs it to act as a senior network operations engineer. Gemini returns its response in a strict format that includes severity level, whether an anomaly was detected, issues found, root cause, and recommended action. The response is parsed with simple string matching.

Prompt instructions to Gemini:
  - You are a senior network operations engineer
  - Analyze BGP session state, route table health, kernel routes
  - Respond in EXACT format:
    SEVERITY: [LOW/MEDIUM/HIGH/CRITICAL]
    ANOMALY_DETECTED: [YES/NO]
    ISSUES_FOUND: [description]
    ROOT_CAUSE: [likely cause]
    RECOMMENDED_ACTION: [steps to resolve]
    URGENCY: [immediate/scheduled/monitoring]
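
The string matching on that format could look something like the sketch below. It assumes each field sits on a single line as KEY: value; parse_analysis is a hypothetical name, and multi-line answers would need extra handling that is not shown.

```python
FIELDS = ["SEVERITY", "ANOMALY_DETECTED", "ISSUES_FOUND",
          "ROOT_CAUSE", "RECOMMENDED_ACTION", "URGENCY"]


def parse_analysis(response_text):
    """Extract the known KEY: value lines from the model's reply."""
    parsed = {}
    for line in response_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() in FIELDS:
            parsed[key.strip().lower()] = value.strip()
    # Normalize the YES/NO flag to a boolean; a missing field counts as no anomaly
    parsed["anomaly_detected"] = parsed.get("anomaly_detected", "").upper() == "YES"
    return parsed


reply = """SEVERITY: CRITICAL
ANOMALY_DETECTED: YES
ISSUES_FOUND: BGP sessions in Active state
ROOT_CAUSE: static protocol block missing
RECOMMENDED_ACTION: run fix_bird_configs.py
URGENCY: immediate"""
print(parse_analysis(reply))
```

Ignoring lines that do not start with a known key makes the parser tolerant of any preamble or closing remarks the model adds around the requested format.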

6. IF node: conditional branching

n8n evaluates anomalies_found > 0. The True branch triggers the Gmail alert. The False branch does nothing. No email is sent when all devices are healthy, which keeps the signal-to-noise ratio high.

7. Gmail alert with full analysis

The email includes the device name, the severity, the full Gemini analysis (issues found, root cause, and recommended action), and a link to approve automatic remediation. Approving leads to the /remediate endpoint, which runs fix_bird_configs.py to restore the intended BIRD configuration without any manual intervention.

Subject: Network Anomaly Detected - SP1 [CRITICAL]

Device:   SP1
Severity: CRITICAL
Time:     2025-12-15T14:30:00

ISSUES FOUND:
SP1 BGP sessions to sp5, sp6, sp7 in Active state.
Route table empty (0 of 0 routes in master4).

ROOT CAUSE:
BIRD config missing static protocol block.
BGP sessions established but no routes being originated.

RECOMMENDED ACTION:
Run fix_bird_configs.py to restore static route advertisements.
Verify birdc show protocols after reload.

[ Approve Auto-Fix ]  → https://n8n.networkforai.com/webhook/approve

8. Webhook approve: one-click remediation

Clicking the approve link in the email hits the n8n webhook which then calls POST /remediate on the Flask API. The API runs fix_bird_configs.py which rewrites the BIRD configs on all 8 devices and restarts BIRD. The network is fully restored in about 45 seconds with no terminal access needed.

CI/CD

GitHub Actions CI/CD pipeline

A self-hosted GitHub Actions runner runs on the Oracle VM as a systemd service. It connects to GitHub and listens for push events on the main branch. Every code commit triggers the deployment pipeline which reconfigures the entire lab and runs a connectivity test.

.github/workflows/deploy.yml
name: Network Config Pipeline

on:
  push:
    branches: [ main ]

jobs:
  deploy:
    runs-on: self-hosted

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure device interfaces
        run: |
          source /home/ubuntu/venv/bin/activate
          python3 /home/ubuntu/network-automation/scripts/configure_devices.py

      - name: Write BGP configs with static routes
        run: |
          source /home/ubuntu/venv/bin/activate
          python3 /home/ubuntu/network-automation/scripts/fix_bird_configs.py

      - name: Verify end-to-end connectivity
        run: |
          sleep 30
          sudo docker exec clab-enterprise-spine-leaf-SP4 \
            ping -c 3 10.4.4.1  # SP4 to SP7 subnet via 3 spines

The runner requires sudo access to docker for the ping test step. This is configured in /etc/sudoers.d/github-runner on the Oracle VM so the runner can execute docker exec commands without a password prompt during CI runs.

Infrastructure

Full tech stack

Oracle Cloud ARM (free tier), Ubuntu 22.04, Containerlab, BIRD2 BGP daemon, nicolaka/netshoot, Python 3.12, pynetbox, lldpd / lldpcli, NetBox v4.5, Flask + Gunicorn, n8n, Google Gemini 2.5 Flash, GitHub Actions, Nginx + Let's Encrypt SSL, Netmiko, Jinja2

Network Automation Lab — AIOps, LLDP Topology Validation, and Switch Replacement Automation