Who Orchestrates the Orchestrators?

Written by Greg Bryan | Nov 15, 2024 1:00:00 PM

Last week on the pod, we shared what network managers need to know about AI and machine learning. This week, we're back on the AI beat, specifically looking at how it can support corporate network automation.

Per usual, we had a little help. Our guest today is Jamie Pugh, CTO at Globalgig.

Jamie joined the show not only to ponder all things automation but also to discuss enterprise network orchestration in the era of an increasingly complex WAN.

We get into some of the difficulties of monitoring and managing the many elements of the modern WAN that can impact application performance and how Jamie saw the need for an “orchestrator of orchestrators.”

You can catch some highlights below or head right to the bottom to listen to the whole conversation.

Greg Bryan: Jamie, this is TeleGeography Explains the Internet. We always like to make sure that anybody across the industry—or even people who are only adjacent to it—can understand what we're talking about.

So I thought we'd start the episode with just your sort of basic definition of network orchestration. What does that mean to you at Globalgig?

Jamie Pugh: Yeah, so simply put, network orchestration is ultimately a process.

It's the process of commanding, controlling, managing all the different network functions, which are deployed across that enterprise network, right? So that's probably the simplest definition I could say.

Greg: Excellent. Yeah. And I think the context of where that sort of comes in with what we're going to talk about today, to back up for folks, comes from this world that you were talking about with startups.

SD-WAN comes along, and the enterprise or corporate wide area network goes through a lot of changes over the past decade or so. Folks who have been really reliant on MPLS as the sort of primary mode of connectivity on private networks move to this complex environment of introducing internet, or as you mentioned, Starlink, 5G—all kinds of alternative underlay technologies. And just really increasing the complexity.

There's lots of reasons why these changes happen that are beneficial, of course, right? But how do you see that from a network management perspective of keeping up with all of these changes that have happened to the sort of physical infrastructure of the enterprise network?

Jamie: Yeah, it's all those things that you mentioned.

The transformation of the enterprise WAN to support these business-critical applications, whether they're hosted in data centers, CSPs, cloud service providers, and then the cloud offerings, the SaaS providers. These applications, Microsoft 365, ServiceNow, CRM, Salesforce, Oracle ERP systems: they're always-on applications now.

The productivity comes to a halt when they're down. So that transformation has required the network to step up, right?

Now you can't just get by with an MPLS network, right? Because you need a second MPLS network and you need some internet access out and you need some security to get to those applications that are out in the cloud.

So we went from a very simple network architecture where we had a router at the site, and on the inside, you had a switch and then you had connected devices and they all went back to the data center.

If they needed to get out to the internet from the data center, they went out from there. And now we have, you know, so many components, right? Every site that we have has multiple routers, SD-WAN routers. You have multiple switches to be able to split those underlay circuits.

All of these have different management platforms. So they all have their SDN-orchestrated, you know, platforms where they look at their components. That's what you have. But now we have this sprawl of network orchestrators out there, right? So you've got, you know, your SD-WAN orchestrator, your switch orchestrator, your security platform, whether that's local firewalls or if it's in the cloud, SSEs, your secure service edge policies.

You've got all these different portals, and then you've got to go and manage the underlay telecoms as well through their portals. And what we found is that that presented a challenge.

The challenge was: how do I actually unify and/or see all of this holistically so that I can properly track the events and incidents that are being created and put them into a single holistic view of the network? And I kind of stumble on the words because there's so much going on there.

I've got a site that I just mentioned that has a circuit that goes down. Well, that circuit that goes down may demonstrate that one of the switches goes down. It may also bring one of the SD-WAN appliances down, depending on how it's architected. And in that world, if I go into each of those different portals, they're all going to show me that something is down. And when I am looking at it from a traditional monitoring platform, I'm going to get three tickets, right? I'm going to get three tickets that are all priority one. They're all super severe. And my network operations center or my engineering team, if I'm an enterprise, is going to suffer because everything's priority one. Everything's hair-on-fire.

We needed something that could look at it holistically. Our idea was: why don't we just talk to all of these devices directly over the WAN and over the public internet, wherever available? But also, let's talk to the orchestrators. And let's holistically make decisions based on what these events are all saying, and then let's use logic, starting with manual logic.

So we created the conditions that said that, hey, if I see a circuit go down, it's the root cause of the switch going down and the SD-WAN appliance.

So I have one ticket. It's going to the service provider for the telecom underlay.

And I know that my network is still up. I've already checked power because other devices are up in the network. And so that's the logic that we had built manually.
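
For readers who want to see the shape of that manual logic, here is a minimal sketch in Python. It is not Globalgig's implementation. The `Event` and `Ticket` types, the device names, and the `"telecom-provider"` and `"noc"` assignees are all hypothetical, made up for illustration. The rule it encodes is the one Jamie describes: if a circuit is down at a site and at least one other device at that site is still up (so power is ruled out), treat the circuit as the root cause and fold the other outages into a single ticket for the underlay provider.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One status report from a device or orchestrator (hypothetical shape)."""
    site: str
    device: str   # e.g. "circuit-1", "switch-1", "sdwan-1"
    kind: str     # "circuit", "switch", or "sdwan"
    status: str   # "down" or "up"


@dataclass
class Ticket:
    site: str
    root_cause: str
    assignee: str
    symptoms: list = field(default_factory=list)


def correlate(events):
    """Turn raw events into tickets, one per root cause rather than one per alarm."""
    by_site = {}
    for event in events:
        by_site.setdefault(event.site, []).append(event)

    tickets = []
    for site, site_events in by_site.items():
        down = [e for e in site_events if e.status == "down"]
        circuits_down = [e for e in down if e.kind == "circuit"]
        something_is_up = any(e.status == "up" for e in site_events)

        if circuits_down and something_is_up:
            # Assumed rule: the first down circuit is the root cause, and every
            # other outage at the site is recorded as a symptom on that ticket.
            root = circuits_down[0]
            symptoms = [e.device for e in down if e is not root]
            tickets.append(Ticket(site, root.device, "telecom-provider", symptoms))
        else:
            # No circuit to blame (or nothing is up, so power is suspect):
            # fall back to one ticket per outage for the operations team.
            for e in down:
                tickets.append(Ticket(site, e.device, "noc"))
    return tickets
```

Fed a down circuit, a down switch, a down SD-WAN appliance, and one device that is still up at the same site, `correlate` returns one ticket instead of three.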

But enter AI, we now have the ability to drive that conversation or drive that workflow. So you have targeted workflows that say: hey, I see a circuit's down. That's probably going to create some impact. Let me go log into those devices to find out what kind of impact. Let me enrich the ticket. Let me audit what I'm doing in the ticket. And then get the ticket directly to the people that can actually resolve the issue.
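
The steps in that targeted workflow can be sketched as a plain scripted pipeline. This is only an illustration under stated assumptions: the ticket is a simple dictionary, `probe` is a hypothetical callable standing in for "log into the device and check impact", and the `"telecom-provider"` assignee is invented. It does not model the AI component itself, only the sequence of check, enrich, audit, and route that Jamie lists.

```python
def run_circuit_down_workflow(ticket, devices, probe):
    """Run the circuit-down steps in order and record each one on the ticket.

    ticket  -- dict with at least a "circuit" key (hypothetical shape)
    devices -- names of the devices at the affected site
    probe   -- callable(name) -> dict of findings; a stand-in for logging
               into a device, supplied by the caller
    """
    audit = ticket.setdefault("audit", [])
    audit.append(f"triggered: circuit {ticket['circuit']} down")

    # Step 1: check each device at the site to find out what the impact is.
    findings = {}
    for name in devices:
        findings[name] = probe(name)
        audit.append(f"probed {name}")

    # Step 2: enrich the ticket with what was found.
    ticket["impact"] = findings
    audit.append("enriched ticket with impact findings")

    # Step 3: send the ticket to whoever can actually resolve the issue.
    ticket["assignee"] = "telecom-provider"
    audit.append("routed to telecom-provider")
    return ticket
```

Keeping the audit trail on the ticket itself means whoever picks it up can see which checks were already run before it reached them.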

Listen to the full episode below.
