Using Kubernetes to scale physical networks: Introducing scalable BGP Hierarchical Route Reflectors with meshrr and Juniper cRPD
Most service provider networks rely on one set of elements to tie an entire autonomous system together, yet in many networks those elements aren't given much thought. I'm talking about BGP route reflectors, of course.
I've built a demo solution called meshrr. meshrr is built to scale out BGP route reflectors on Kubernetes using Juniper cRPD (Containerized Routing Protocol Daemon).
Using meshrr, route reflectors dynamically discover each other to form RR meshes and hierarchies thereof. If a route reflector stops behaving as expected, it can simply be deleted; a new one will be initialized in its place, and the other route reflectors will automatically discover the replacement and initiate a peering with it. RR clients in a physical network can reach outward-facing RRs via specific IPs or via anycast IPs applied to the RRs' hosts.
Boundaries between meshrr groups become policy control points, where routing information exchange can be minimized and inter-group adjustments to attributes or filtering can be applied.
Too often, when talking about BGP full-mesh scaling problems, we engineers quickly jump to the conclusion: "Just use a couple route reflectors." They're simple. They're standard. They do everything you need.
But what about when they don't?
- What if two route reflectors aren't enough to service the number of routers in the topology? Possible solutions:
- More route reflectors. 💸
- What if some routers are closer to certain egress points than others? Possible solutions:
- Diverse Path Route Reflectors. This can result in significant complexity.
- Add-Path. This adds load across the entire install base. Furthermore, it's not always sufficient to solve the problem. What if there are 5 better paths from the perspective of the route reflector than the one a client should ideally select? `add-path 6` would be required, which may defeat the entire purpose of route reflector scale reduction.
- What if there's a need to maintain more routing knowledge within certain groups of routers? Possible solutions:
- Fully meshed groups. One solution would be to fully mesh the routers that need to learn routes within those groups, but again - this leads to complexity. And depending on group size, this may revive the original scale challenges.
Trying to solve all of these challenges while addressing the drawbacks of the potential solutions generally leads down the road of hierarchical route reflectors (HRRs). This can be even more costly than simply deploying additional route reflectors in the same cluster for scale, as it requires additional routers for both scale and policy. Fortunately, we now have the capability to run routing daemons - and route reflectors - in containers. Furthermore, we can deploy, upgrade, health-check, and restore them, all with Kubernetes.
Disclaimer: My background is in traditional hardware-based networks. I speak BGP, MPLS, IS-IS, OSPF, etc. Kubernetes, and especially the notion of using it as the glue for a BGP AS, is a huge paradigm shift. That said, the more I work with Kubernetes, the more I realize that certain VNFs, like RRs, could be ideal applications of the technology. Kubernetes can manage horizontal scaling, health checks, remediation, and backend networking.
- Juniper cRPD (20.3R1.8)
- Kubernetes. For the purposes of the demonstration, I used v1.19.7-rancher1-1 with Canal as the networking provider. Different networking providers will expose different possibilities, especially if they are not overlay based.
- Juniper PyEZ for "on-box" (or in-container) scripting
First, we must build our own container image based on the cRPD image. This enables us to install additional packages (e.g., `python3`), add necessary scripts, and create a default Jinja2 configuration template at `/root/juniper.conf.j2`.
The Kubernetes Components
Deployments are used for RR groups that scale to a certain number of replicas. This is generally appropriate for groups of RRs that have no clients outside the Kubernetes cluster.
DaemonSets are used for RR groups that need to be guaranteed to be deployed to a set of nodes. This is generally appropriate for groups of RRs that are serving clients outside the Kubernetes cluster.
Services select the pods of either Deployments or DaemonSets and are used to discover other RRs' IP addresses for establishing peerings inside the cluster.
ConfigMaps are used to overwrite the default Jinja2 configuration template at `/root/juniper.conf.j2`.
Secrets are used to store secret data to be either mounted as volumes or mapped to environment variables. This includes the `junos_sfnt.lic` cRPD license file in the example deployment, though other licensing mechanisms are available.
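The Services above are what make peer discovery possible: a headless Service returns the pod IP of every ready RR in a group as a DNS A record, with no load-balancer VIP in the way. As a minimal sketch (all names here are hypothetical, not taken from the project):

```yaml
# Hypothetical headless Service for one RR mesh group.
# clusterIP: None makes CoreDNS answer queries for this Service name
# with one A record per ready pod, so an on-box script can resolve
# the name to discover its mesh peers.
apiVersion: v1
kind: Service
metadata:
  name: meshrr-mesh-example
spec:
  clusterIP: None
  selector:
    app: meshrr
    group: example-mesh
  ports:
    - name: bgp
      port: 179
```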
There are only three scripts required for this project in its base form.
- Stores environment variables for later use.
- Installs a crontab to call `update_peers.py` periodically.
- Creates a configuration at `/config/juniper.conf` from the Jinja2 configuration template at `/root/juniper.conf.j2`. This does not yet include meshrr managed peers.
- `update_peers.py`: Manages the meshrr managed peers based on the RRs discovered via the Kubernetes Services. These services are discovered using Kubernetes CoreDNS, which is installed by default in many Kubernetes environments.
`MESHRR-MESH`: Discovered peers in the same RR mesh.
`MESHRR-UPSTREAM`: Discovered peers in the upstream RR mesh. By default, only up to two upstream peers are configured at any given time. Peers that are down will be replaced with a random discovered peer that is up.
`MESHRR-CLIENTS`: For clients, meshrr does not expect it to be feasible to define each downstream RR client explicitly due to the dynamic nature of a Kubernetes environment. Therefore, meshrr uses the Junos BGP group `allow` configuration statement to permit connections from any IP within the range defined in the `MESHRR_CLIENTRANGE` environment variable. However, if, for example, `MESHRR_CLIENTRANGE` is `0/0` and a peer is explicitly defined in the `MESHRR-MESH` group with any IP, that peer technically exists in both the `MESHRR-CLIENTS` and `MESHRR-MESH` groups, which expectedly leads to unexpected results. Therefore, `update_peers.py` dynamically updates this `allow` statement to be the range defined in `MESHRR_CLIENTRANGE` with all explicitly defined peers removed:
```
❯ k exec -t meshrr-lothlorien-a-4rc7m -- cli show configuration groups MESHRR protocols bgp group MESHRR-CLIENTS
type internal;
cluster 10.42.0.25;
allow [ 0.0.0.0/5 8.0.0.0/7 10.0.0.0/11 10.32.0.0/13 10.40.0.0/15 10.42.0.0/28 10.42.0.16/32 10.42.0.18/31 10.42.0.20/30 10.42.0.24/29 10.42.0.32/27 10.42.0.64/26 10.42.0.128/25 10.42.1.0/24 10.42.2.0/24 10.42.3.0/30 10.42.3.4/31 10.42.3.6/32 10.42.3.8/29 10.42.3.16/28 10.42.3.32/27 10.42.3.64/26 10.42.3.128/25 10.42.4.0/30 10.42.4.4/31 10.42.4.6/32 10.42.4.8/29 10.42.4.16/28 10.42.4.32/27 10.42.4.64/26 10.42.4.128/25 10.42.5.0/30 10.42.5.4/31 10.42.5.6/32 10.42.5.8/30 10.42.5.12/31 10.42.5.14/32 10.42.5.16/28 10.42.5.32/27 10.42.5.64/26 10.42.5.128/25 10.42.6.0/23 10.42.8.0/21 10.42.16.0/20 10.42.32.0/19 10.42.64.0/18 10.42.128.0/17 10.43.0.0/16 10.44.0.0/14 10.48.0.0/12 10.64.0.0/10 10.128.0.0/9 11.0.0.0/8 12.0.0.0/6 16.0.0.0/4 32.0.0.0/3 64.0.0.0/2 128.0.0.0/1 ];
```
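That long prefix list is simply the client range with each explicitly configured peer's /32 punched out of it. The same computation can be expressed with Python's standard `ipaddress` module; this is an illustrative sketch of the idea, not the project's actual code:

```python
import ipaddress

def client_allow_list(client_range, explicit_peers):
    """Return client_range minus each peer address, as a sorted prefix list.

    Illustrative sketch of the set subtraction behind the Junos `allow`
    statement shown above; not meshrr's actual implementation.
    """
    nets = [ipaddress.ip_network(client_range)]
    for peer in explicit_peers:
        peer_net = ipaddress.ip_network(f"{peer}/32")
        next_nets = []
        for net in nets:
            if peer_net.subnet_of(net):
                # Split the covering prefix into subnets that exclude the peer.
                next_nets.extend(net.address_exclude(peer_net))
            else:
                next_nets.append(net)
        nets = next_nets
    return sorted(nets)

# Excluding one explicitly defined peer from a /16 client range yields
# the complement prefixes, analogous to the output above.
prefixes = [str(n) for n in client_allow_list("10.42.0.0/16", ["10.42.0.17"])]
```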
The `examples/2regions-hrr` directory of the GitHub project includes Kubernetes `.yaml` files and `.j2` files for this example.
Assume there are two nations - Mirkwood and Lothlorien - serviced by one ISP. The ISP wants to ensure that routes with the community tag `65000:101` are not advertised outside of the nation in which they originate, and that routes with the community tag `65000:102` have a low local preference (20) outside of the region in which they originate.
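One way to express those two rules is a policy applied at the boundary between the regional groups and the core. This is only a sketch with hypothetical policy and community names, not configuration taken from the example:

```
policy-options {
    /* Hypothetical names; match them to your own deployment. */
    community NATIONAL-ONLY members 65000:101;
    community REGIONAL-BEST members 65000:102;
    policy-statement EXPORT-OUTSIDE-REGION {
        /* Keep 65000:101 routes inside the originating nation. */
        term block-national {
            from community NATIONAL-ONLY;
            then reject;
        }
        /* Deprefer 65000:102 routes outside the originating region. */
        term deprefer-regional {
            from community REGIONAL-BEST;
            then {
                local-preference 20;
                accept;
            }
        }
    }
}
```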
The ISP will use 172.19.1.1 and 172.19.1.2 as anycast route reflectors for Lothlorien physical routers, and 172.19.2.1 and 172.19.2.2 as anycast route reflectors for Mirkwood physical routers. They specifically want to ensure that each router peers with two separate physical nodes, and only want to build containers on nodes labelled for those containers. To do so, they:
- Set the .1 addresses as loopbacks on the `a` nodes in each region and the .2 addresses as loopbacks on the `b` nodes in each region, then static route to them from the routers connecting them and redistribute the routes into the IGP. (Note: This is why this is a demo. In a production environment you'd want something ensuring liveness to withdraw the route if necessary.)
- Build a custom container image from the project using `docker build -t <private_registry>/meshrr:<tag>` and push it to the private registry.
- Create ConfigMaps to overwrite the default configuration template for the Lothlorien and Mirkwood groups:
```
❯ k create configmap lothlorien-config \
    --from-file=config=../templates/lothlorien-config.j2 \
    -o yaml --dry-run=client | k apply -f -
❯ k create configmap mirkwood-config \
    --from-file=config=../templates/mirkwood-config.j2 \
    -o yaml --dry-run=client | k apply -f -
```
- Apply the YAML files:
```
k apply -f meshrr-mirkwood.yaml
k apply -f meshrr-core.yaml
k apply -f meshrr-lothlorien.yaml
```
- Configure labels for the Kubernetes nodes as either `redundancy_group=a` or `redundancy_group=b`. Also configure labels for each of the Kubernetes nodes to signal that they are eligible for that region.
- Watch the route reflectors come up and peers establish.