Using Kubernetes to scale physical networks: Introducing scalable BGP Hierarchical Route Reflectors with meshrr and Juniper cRPD


Most service provider network architectures rely on one set of elements to tie an entire autonomous system together, yet in many networks those elements aren't given much thought. I'm talking about BGP route reflectors, of course.

I've built a demo solution called meshrr. meshrr is built to scale out BGP route reflectors on Kubernetes using Juniper cRPD (Containerized Routing Protocol Daemon).

Using meshrr, route reflectors dynamically discover one another to form RR meshes and hierarchies thereof. If a route reflector stops behaving as expected, it can simply be deleted: a new one will be initialized in its place, and the other route reflectors will automatically discover the replacement and initiate a peering with it. RR clients in a physical network can reach outward-facing RRs via specific IPs or via anycast IPs applied to the RRs' hosts.

Boundaries between different meshrr groups become policy control points, establishing points at which routing information exchange can be minimized and inter-group adjustments can be put in place for attributes or filtering.

Why?

Too often, when talking about the BGP full mesh scaling problems, we engineers quickly jump to the conclusion, "Just use a couple route reflectors." They're simple. They're standard. They do everything you need.

But what about when they don't?

  • What if two route reflectors aren't enough to serve the number of routers in the topology? Possible solutions:
    • More route reflectors. 💸
  • What if some routers are closer to certain egress points than others? Possible solutions:
    • Diverse Path Route Reflectors. This can result in significant complexity.
    • Add-Path. This adds load across the entire install base. Furthermore, it's not always sufficient to solve the problem. What if there are 5 better paths from the perspective of the route reflector than the one a client should ideally select? add-path 6 would be required, which may defeat the entire purpose of route reflector scale reduction.
  • What if there's a need to maintain more routing knowledge within certain groups of routers? Possible solutions:
    • Fully meshed groups. One solution would be to fully mesh the routers that need to learn routes within those groups, but again - this leads to complexity. And depending on group size, this may revive the original scale challenges.

Trying to solve all these challenges while addressing the drawbacks of each potential solution generally leads down the road of hierarchical route reflectors (HRRs). This can be even more costly than simply deploying additional route reflectors in the same cluster for scale, as it requires additional routers for both scale and policy. Fortunately, we now have the capability to run routing daemons, and therefore route reflectors, in containers. Furthermore, we can deploy, upgrade, health-check, and restore them, all with Kubernetes.

Disclaimer: My background is in traditional hardware-based networks. I speak BGP, MPLS, IS-IS, OSPF, etc. Kubernetes, and especially the notion of using it as the glue for a BGP AS, is a huge paradigm shift. That being said, the more that I work with Kubernetes, the more I realize that certain VNFs, like RRs, could be ideal applications of the technology. Kubernetes can manage all of horizontal scaling, health checks, remediation, and backend networking.

The Solution

The Technologies

The Methodology

The Dockerfile

First, we must build our own container image based on the cRPD image. This enables us to install additional packages via apt-get (cron, python3), add necessary scripts, and create a default Jinja2 configuration template at /root/juniper.conf.j2.

The Kubernetes Components

Deployments are used for RR groups that scale to a certain number of replicas. This is generally appropriate for groups of RRs that have no clients outside the Kubernetes cluster.

DaemonSets are used for RR groups that need to be guaranteed to be deployed to a set of nodes. This is generally appropriate for groups of RRs that are serving clients outside the Kubernetes cluster.

Services index either Deployments or DaemonSets and are used to discover other RRs' IP addresses to establish peerings inside the cluster.

ConfigMaps are used to overwrite the default Jinja2 configuration template at /root/juniper.conf.j2.

Secrets are used to store secret data to be either mounted as volumes or mapped to environment variables. This includes the junos_sfnt.lic cRPD license file in the example deployment, though other licensing mechanisms are available.
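As a rough illustration of how a Service name can be turned into a list of peer addresses, here is a minimal Python sketch. It assumes the Service is headless (clusterIP: None), so that cluster DNS returns one A record per pod; the article does not say how the Services are defined. The function name and the Service name in the usage note are hypothetical, and this is not the project's actual code.

```python
import ipaddress
import socket

BGP_PORT = 179  # TCP port used by BGP


def discover_peers(service_fqdn, resolver=socket.getaddrinfo):
    """Return the unique IPv4 addresses behind a Service name, in numeric order.

    Assumption: the Service is headless, so DNS answers with pod IPs.
    The resolver is injectable so the function can be exercised without DNS.
    """
    try:
        records = resolver(service_fqdn, BGP_PORT, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        # No records yet, for example while the pods are still starting.
        return []
    # Each record is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    unique_ips = {record[4][0] for record in records}
    return sorted(unique_ips, key=ipaddress.ip_address)
```

Inside a pod, a call such as discover_peers("meshrr-core.default.svc.cluster.local") would return whatever pod IPs DNS currently reports for that (hypothetical) Service name.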

The Scripts

There are only three scripts required for this project in its base form.

runit-init.sh

  • Stores environment variables for later use.
  • Calls render_config.py
  • Installs a crontab to call update_peers.py every minute.

render_config.py

  • Creates a configuration at /config/juniper.conf from the Jinja2 configuration template at /root/juniper.conf.j2. This does not yet include meshrr managed peers.
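The rendering step can be pictured with the following Python sketch. To keep it runnable with only the standard library, it uses string.Template as a stand-in for Jinja2, so the placeholder syntax differs from the real .j2 templates. The template text and function name are illustrative assumptions; only MESHRR_CLIENTRANGE and the MESHRR / MESHRR-CLIENTS group names come from this article.

```python
import os
from pathlib import Path
from string import Template

# Illustrative template only; the project's real template is a Jinja2 file.
EXAMPLE_TEMPLATE = """\
groups {
    MESHRR {
        protocols {
            bgp {
                group MESHRR-CLIENTS {
                    type internal;
                    allow ${MESHRR_CLIENTRANGE};
                }
            }
        }
    }
}
"""


def render_config(template_text, output_path, variables=None):
    """Fill the template from environment-style variables and write the result.

    substitute() raises KeyError if a placeholder has no value, which surfaces
    a missing environment variable instead of writing a broken configuration.
    """
    values = dict(os.environ if variables is None else variables)
    rendered = Template(template_text).substitute(values)
    Path(output_path).write_text(rendered)
    return rendered
```

In the container, the output path would be /config/juniper.conf and the values would come from the pod's environment.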

update_peers.py

  • Manages the meshrr managed peers based on the discovered RRs via the Kubernetes Services. These services are discovered using Kubernetes CoreDNS, which is installed by default in many Kubernetes environments.
    • MESHRR-MESH: Discovered peers in the same RR mesh
    • MESHRR-UPSTREAM: Discovered peers in the upstream RR mesh. By default, only up to two upstream peers are configured at any given time. Peers that are down will be replaced with a random discovered peer that is up.
    • MESHRR-CLIENTS: Because a Kubernetes environment is dynamic, meshrr does not assume that each downstream RR client can be defined explicitly. Instead, it uses the Junos BGP group allow configuration statement to accept connections from any IP within the range defined in the MESHRR_CLIENTRANGE environment variable. However, if the allow range were, for example, 0/0 and a peer were explicitly defined in the MESHRR-MESH group, that peer would technically exist in both the MESHRR-CLIENTS and MESHRR-MESH groups, which predictably leads to unexpected results. Therefore, update_peers.py dynamically rewrites this allow statement to be the MESHRR_CLIENTRANGE range with all explicitly defined peers removed:
      ❯ k exec -t meshrr-lothlorien-a-4rc7m -- cli show configuration groups MESHRR protocols bgp group MESHRR-CLIENTS
      type internal;
      cluster 10.42.0.25;
      allow [ 0.0.0.0/5 8.0.0.0/7 10.0.0.0/11 10.32.0.0/13 10.40.0.0/15 10.42.0.0/28 10.42.0.16/32 10.42.0.18/31 10.42.0.20/30 10.42.0.24/29 10.42.0.32/27 10.42.0.64/26 10.42.0.128/25 10.42.1.0/24 10.42.2.0/24 10.42.3.0/30 10.42.3.4/31 10.42.3.6/32 10.42.3.8/29 10.42.3.16/28 10.42.3.32/27 10.42.3.64/26 10.42.3.128/25 10.42.4.0/30 10.42.4.4/31 10.42.4.6/32 10.42.4.8/29 10.42.4.16/28 10.42.4.32/27 10.42.4.64/26 10.42.4.128/25 10.42.5.0/30 10.42.5.4/31 10.42.5.6/32 10.42.5.8/30 10.42.5.12/31 10.42.5.14/32 10.42.5.16/28 10.42.5.32/27 10.42.5.64/26 10.42.5.128/25 10.42.6.0/23 10.42.8.0/21 10.42.16.0/20 10.42.32.0/19 10.42.64.0/18 10.42.128.0/17 10.43.0.0/16 10.44.0.0/14 10.48.0.0/12 10.64.0.0/10 10.128.0.0/9 11.0.0.0/8 12.0.0.0/6 16.0.0.0/4 32.0.0.0/3 64.0.0.0/2 128.0.0.0/1 ];
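Two pieces of logic from the list above can be sketched in Python: keeping at most two upstream peers while replacing ones that are down, and subtracting explicitly defined peers from the client range. This is a minimal sketch using the standard ipaddress module, not the project's actual update_peers.py; the function names and the is_up callback are hypothetical.

```python
import ipaddress
import random


def select_upstream(current, discovered, is_up, max_peers=2, rng=random):
    """Keep configured upstream peers that are up; fill free slots at random."""
    kept = [peer for peer in current if is_up(peer)][:max_peers]
    # Candidates are discovered peers that are up and not already kept.
    candidates = [p for p in dict.fromkeys(discovered) if p not in kept and is_up(p)]
    needed = max_peers - len(kept)
    return kept + rng.sample(candidates, min(needed, len(candidates)))


def client_allow_ranges(client_range, explicit_peers):
    """Return client_range as a list of prefixes with each explicit peer removed."""
    remaining = [ipaddress.ip_network(client_range)]
    for peer in explicit_peers:
        peer_net = ipaddress.ip_network(peer)  # a bare address becomes a /32
        next_remaining = []
        for net in remaining:
            if peer_net.subnet_of(net):
                # Split the prefix into the pieces that do not cover the peer.
                next_remaining.extend(net.address_exclude(peer_net))
            else:
                next_remaining.append(net)
        remaining = next_remaining
    return sorted(remaining)
```

The allow statement shown above covers everything except 10.42.0.17, 10.42.3.7, 10.42.4.7, 10.42.5.7, and 10.42.5.15, so calling client_allow_ranges("0.0.0.0/0", [...]) with those five addresses should yield a prefix list of the same shape. Note that ipaddress expects the fully written 0.0.0.0/0 rather than the 0/0 shorthand.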
      

The Demonstration

The examples/2regions-hrr directory of the GitHub project includes Kubernetes .yaml files and .j2 files for this example.

[Figure: 2regions-hrr-hierarchy.png]

Assume there are two regions, Mirkwood and Lothlorien, served by one ISP. The ISP wants to ensure that routes with the community tag 65000:101 are not advertised outside of the region in which they originate, and that routes with the community tag 65000:102 have a low local preference (20) outside of the region in which they originate.
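To make the intended outcome concrete, the following Python sketch models what should happen to a route as it crosses a region boundary. It is only a model of the behavior described above, not the Junos policy from the project's templates; the Route class, the function name, and the default local preference of 100 are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional

KEEP_IN_REGION = "65000:101"           # must not leave the originating region
LOW_PREF_OUTSIDE_REGION = "65000:102"  # deprioritized outside the originating region
LOW_LOCAL_PREF = 20


@dataclass(frozen=True)
class Route:
    prefix: str
    communities: FrozenSet[str]
    local_pref: int = 100  # assumed default, for illustration


def across_region_boundary(route: Route) -> Optional[Route]:
    """Return the route as seen outside its region, or None if it is filtered."""
    if KEEP_IN_REGION in route.communities:
        return None
    if LOW_PREF_OUTSIDE_REGION in route.communities:
        return replace(route, local_pref=LOW_LOCAL_PREF)
    return route
```

A route carrying neither community passes through unchanged, which matches the description: only the two tagged classes of routes are treated specially at the boundary.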

The ISP will use 172.19.1.1 and 172.19.1.2 as anycast route reflectors for Lothlorien physical routers, and 172.19.2.1 and 172.19.2.2 as anycast route reflectors for Mirkwood physical routers. They specifically want to ensure that each router peers with two separate physical nodes, and only want to build containers on nodes labelled for those containers. To do so, they:

  1. Set the .1 addresses as loopbacks on the a nodes in each region and the .2 addresses as loopbacks on the b nodes in each region, then add static routes to them on the routers connecting them and redistribute those routes into the IGP. (Note: This is why this is a demo. In a production environment you'd want something checking liveness to withdraw the route if necessary.)
  2. Build a custom container image from the project with docker build -t <private_registry>/meshrr:<tag> and publish it with docker push.
  3. Create ConfigMaps to overwrite the default configuration template for the Lothlorien and Mirkwood groups:
    ❯ k create configmap lothlorien-config \
    --from-file=config=../templates/lothlorien-config.j2 \
    -o yaml --dry-run=client |
    k apply -f -
    ❯ k create configmap mirkwood-config \
    --from-file=config=../templates/mirkwood-config.j2 \
    -o yaml --dry-run=client |
    k apply -f -
    
  4. Apply the YAML files:
    k apply -f meshrr-mirkwood.yaml
    k apply -f meshrr-core.yaml
    k apply -f meshrr-lothlorien.yaml
    
  5. Label each Kubernetes node with either redundancy_group=a or redundancy_group=b. Also label each node to signal which regions it is eligible for by applying meshrr_region_lothlorien, meshrr_region_core, and/or meshrr_region_mirkwood.
  6. Watch the route reflectors come up and peers establish:
