Using Kubernetes to scale physical networks: Introducing scalable BGP Hierarchical Route Reflectors with meshrr and Juniper cRPD
Most service provider networks rely on one set of elements to tie an entire autonomous system together, yet in many networks those elements aren't given much thought. I'm talking about BGP route reflectors, of course.
I've built a demo solution called meshrr. meshrr is built to scale out BGP route reflectors on Kubernetes using Juniper cRPD (Containerized Routing Protocol Daemon).
Using meshrr, route reflectors dynamically discover each other to form RR meshes and hierarchies thereof. If a route reflector stops behaving as expected, it can simply be deleted; a new one will be initialized in its place, and the other route reflectors will automatically discover the replacement and initiate a peering with it. RR clients in a physical network can reach outward-facing RRs via specific IPs or via anycast IPs applied to the RRs' hosts.
Boundaries between meshrr groups become policy control points, where routing information exchange can be minimized and inter-group adjustments to attributes or filtering can be applied.
Too often, when talking about BGP full-mesh scaling problems, we engineers quickly jump to the conclusion: "Just use a couple route reflectors." They're simple. They're standard. They do everything you need.
But what about when they don't?
- What if two route reflectors aren't enough to service the number of routers in the topology? Possible solutions:
- More route reflectors. 💸
- What if some routers are closer to certain egress points than others? Possible solutions:
- Diverse Path Route Reflectors. This can result in significant complexity.
- Add-Path. This adds load across the entire install base. Furthermore, it's not always sufficient to solve the problem. What if there are 5 better paths from the perspective of the route reflector than the one a client should ideally select? `add-path 6` would be required, which may defeat the entire purpose of route reflector scale reduction.
- What if there's a need to maintain more routing knowledge within certain groups of routers? Possible solutions:
- Fully meshed groups. One solution would be to fully mesh the routers that need to learn routes within those groups, but again - this leads to complexity. And depending on group size, this may revive the original scale challenges.
Trying to solve all of these challenges while addressing the drawbacks of the potential solutions generally leads down the road of hierarchical route reflectors (HRRs). This can be even more costly than simply deploying additional route reflectors in the same cluster for scale, as it requires additional routers for both scale and policy. Fortunately, we now have the capability to run routing daemons - and route reflectors - in containers. Furthermore, we can deploy, upgrade, health-check, and restore them, all with Kubernetes.
Disclaimer: My background is in traditional hardware-based networks. I speak BGP, MPLS, IS-IS, OSPF, etc. Kubernetes, and especially the notion of using it as the glue for a BGP AS, is a huge paradigm shift. That said, the more I work with Kubernetes, the more I realize that certain VNFs, like RRs, could be ideal applications of the technology. Kubernetes can manage horizontal scaling, health checks, remediation, and backend networking.
- Juniper cRPD (20.3R1.8)
- Kubernetes. For the purposes of the demonstration, I used v1.19.7-rancher1-1 with Canal as the networking provider. Different networking providers will expose different possibilities, especially if they are not overlay based.
- Juniper PyEZ for "on-box" (or in-container) scripting
First, we must build our own container image based on the cRPD image. This enables us to install additional packages (e.g., `python3`), add necessary scripts, and create a default Jinja2 configuration template at `/root/juniper.conf.j2`.
The Kubernetes Components
Deployments are used for RR groups that scale to a certain number of replicas. This is generally appropriate for groups of RRs that have no clients outside the Kubernetes cluster.
DaemonSets are used for RR groups that need to be guaranteed to be deployed to a set of nodes. This is generally appropriate for groups of RRs that are serving clients outside the Kubernetes cluster.
Services select the pods of either Deployments or DaemonSets and are used to discover other RRs' IP addresses for establishing peerings inside the cluster.
ConfigMaps are used to overwrite the default Jinja2 configuration template at `/root/juniper.conf.j2`.
Secrets are used to store secret data to be either mounted as volumes or mapped to environment variables. This includes the `junos_sfnt.lic` cRPD license file in the example deployment, though other licensing mechanisms are available.
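The Services above are what make peer discovery possible: a headless Service returns the pod IP of every ready RR in a group as a DNS A record, with no load-balancer VIP in the way. As a minimal sketch (all names here are hypothetical, not taken from the project):

```yaml
# Hypothetical headless Service for one RR mesh group.
# clusterIP: None makes CoreDNS answer queries for this Service name
# with one A record per ready pod, so an on-box script can resolve
# the name to discover its mesh peers.
apiVersion: v1
kind: Service
metadata:
  name: meshrr-mesh-example
spec:
  clusterIP: None
  selector:
    app: meshrr
    group: example-mesh
  ports:
    - name: bgp
      port: 179
```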
There are only three scripts required for this project in its base form.
- Stores environment variables for later use.
- Installs a crontab to call `update_peers.py` periodically.
- Creates a configuration at `/config/juniper.conf` from the Jinja2 configuration template at `/root/juniper.conf.j2`. This does not yet include meshrr managed peers.
- `update_peers.py`: Manages the meshrr managed peers based on the RRs discovered via the Kubernetes Services. These services are discovered using Kubernetes CoreDNS, which is installed by default in many Kubernetes environments.
`MESHRR-MESH`: Discovered peers in the same RR mesh.
`MESHRR-UPSTREAM`: Discovered peers in the upstream RR mesh. By default, only up to two upstream peers are configured at any given time. Peers that are down will be replaced with a random discovered peer that is up.
`MESHRR-CLIENTS`: For clients, meshrr does not expect it to be feasible to define each downstream RR client explicitly due to the dynamic nature of a Kubernetes environment. Therefore, meshrr uses the Junos BGP group `allow` configuration statement to permit connections from any IP within the range defined in the `MESHRR_CLIENTRANGE` environment variable. However, if, for example, `MESHRR_CLIENTRANGE` is `0/0` and a peer is explicitly defined in the `MESHRR-MESH` group with any IP, that peer technically exists in both the `MESHRR-CLIENTS` and `MESHRR-MESH` groups, which expectedly leads to unexpected results. Therefore, `update_peers.py` dynamically updates this `allow` statement to be the range defined in `MESHRR_CLIENTRANGE` with all explicitly defined peers removed:
```
❯ k exec -t meshrr-lothlorien-a-4rc7m -- cli show configuration groups MESHRR protocols bgp group MESHRR-CLIENTS
type internal;
cluster 10.42.0.25;
allow [ 0.0.0.0/5 8.0.0.0/7 10.0.0.0/11 10.32.0.0/13 10.40.0.0/15 10.42.0.0/28 10.42.0.16/32 10.42.0.18/31 10.42.0.20/30 10.42.0.24/29 10.42.0.32/27 10.42.0.64/26 10.42.0.128/25 10.42.1.0/24 10.42.2.0/24 10.42.3.0/30 10.42.3.4/31 10.42.3.6/32 10.42.3.8/29 10.42.3.16/28 10.42.3.32/27 10.42.3.64/26 10.42.3.128/25 10.42.4.0/30 10.42.4.4/31 10.42.4.6/32 10.42.4.8/29 10.42.4.16/28 10.42.4.32/27 10.42.4.64/26 10.42.4.128/25 10.42.5.0/30 10.42.5.4/31 10.42.5.6/32 10.42.5.8/30 10.42.5.12/31 10.42.5.14/32 10.42.5.16/28 10.42.5.32/27 10.42.5.64/26 10.42.5.128/25 10.42.6.0/23 10.42.8.0/21 10.42.16.0/20 10.42.32.0/19 10.42.64.0/18 10.42.128.0/17 10.43.0.0/16 10.44.0.0/14 10.48.0.0/12 10.64.0.0/10 10.128.0.0/9 11.0.0.0/8 12.0.0.0/6 16.0.0.0/4 32.0.0.0/3 64.0.0.0/2 128.0.0.0/1 ];
```
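That long prefix list is simply the client range with each explicitly configured peer's /32 punched out of it. The same computation can be expressed with Python's standard `ipaddress` module; this is an illustrative sketch of the idea, not the project's actual code:

```python
import ipaddress

def client_allow_list(client_range, explicit_peers):
    """Return client_range minus each peer address, as a sorted prefix list.

    Illustrative sketch of the set subtraction behind the Junos `allow`
    statement shown above; not meshrr's actual implementation.
    """
    nets = [ipaddress.ip_network(client_range)]
    for peer in explicit_peers:
        peer_net = ipaddress.ip_network(f"{peer}/32")
        next_nets = []
        for net in nets:
            if peer_net.subnet_of(net):
                # Split the covering prefix into subnets that exclude the peer.
                next_nets.extend(net.address_exclude(peer_net))
            else:
                next_nets.append(net)
        nets = next_nets
    return sorted(nets)

# Excluding one explicitly defined peer from a /16 client range yields
# the complement prefixes, analogous to the output above.
prefixes = [str(n) for n in client_allow_list("10.42.0.0/16", ["10.42.0.17"])]
```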
The `examples/2regions-hrr` directory of the GitHub project includes Kubernetes `.yaml` files and `.j2` files for this example.
Assume there are two nations - Mirkwood and Lothlorien - serviced by one ISP. The ISP wants to ensure that routes with the community tag `65000:101` are not advertised outside of the nation in which they originate, and that routes with the community tag `65000:102` have a low local preference (20) outside of the region in which they originate.
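One way to express those two rules is a policy applied at the boundary between the regional groups and the core. This is only a sketch with hypothetical policy and community names, not configuration taken from the example:

```
policy-options {
    /* Hypothetical names; match them to your own deployment. */
    community NATIONAL-ONLY members 65000:101;
    community REGIONAL-BEST members 65000:102;
    policy-statement EXPORT-OUTSIDE-REGION {
        /* Keep 65000:101 routes inside the originating nation. */
        term block-national {
            from community NATIONAL-ONLY;
            then reject;
        }
        /* Deprefer 65000:102 routes outside the originating region. */
        term deprefer-regional {
            from community REGIONAL-BEST;
            then {
                local-preference 20;
                accept;
            }
        }
    }
}
```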
The ISP will use 172.19.1.1 and 172.19.1.2 as anycast route reflectors for Lothlorien physical routers, and 172.19.2.1 and 172.19.2.2 as anycast route reflectors for Mirkwood physical routers. They specifically want to ensure that each router peers with two separate physical nodes, and only want to build containers on nodes labelled for those containers. To do so, they:
- Set the .1 addresses as loopbacks on the `a` nodes in each region and the .2 addresses as loopbacks on the `b` nodes in each region, then static route to them from the routers connecting them and redistribute the routes into the IGP. (Note: This is why this is a demo. In a production environment you'd want something ensuring liveness to withdraw the route if necessary.)
- Build a custom container image from the project using `docker build -t <private_registry>/meshrr:<tag>` and push it to the private registry.
- Create ConfigMaps to overwrite the default configuration template for the Lothlorien and Mirkwood groups:
```
❯ k create configmap lothlorien-config \
    --from-file=config=../templates/lothlorien-config.j2 \
    -o yaml --dry-run=client | k apply -f -
❯ k create configmap mirkwood-config \
    --from-file=config=../templates/mirkwood-config.j2 \
    -o yaml --dry-run=client | k apply -f -
```
- Apply the YAML files:
```
k apply -f meshrr-mirkwood.yaml
k apply -f meshrr-core.yaml
k apply -f meshrr-lothlorien.yaml
```
- Configure labels for the Kubernetes nodes as either `redundancy_group=a` or `redundancy_group=b`. Also configure labels for each of the Kubernetes nodes to signal that they are eligible for that region.
- Watch the route reflectors come up and peers establish.