Instrumenting AI for Network Operations at a Federal Healthcare Agency
AI and network data
Artificial Intelligence (AI) is transforming how we retrieve, analyze and utilize data. By simply posing a question as you would to a colleague, AI can sift through vast amounts of information to provide the most pertinent answer. The outcome? Swift, precise insights that enhance productivity and efficiency.
This transformation extends to network operations as well. Consider asking, "Why is port 0/0/3 down on this router?" With access to network data, AI can analyze and respond to such inquiries. For instance, NetBrain's Version 12 introduced AI capabilities that facilitate advanced network data retrieval, analysis and automated reporting to diagnose network issues.
The next logical progression is moving beyond basic data queries like "Show me this information" to intelligent actions: "Why is port 0/0/3 down—and can you fix it?" This is precisely the direction a Federal Healthcare Agency is taking with its next-generation network operations—transitioning from "Tell me about the network" to "Help me fix the network."
However, integrating AI into network operations comes with its own set of risks. While automation has become a fundamental aspect of modern network operations—allowing teams to achieve more with fewer resources, enhancing accuracy, and minimizing outages caused by human error—there are cautionary tales of automation gone wrong. A single incorrect configuration, when deployed across numerous devices, can result in widespread disruptions.
So, how can we empower AI to take action while ensuring it causes no harm? The Agency is addressing this challenge by equipping its AI with proven, production-tested workflows. Instead of granting direct access to the network, AI interacts solely with these workflows—ensuring that actions are controlled, validated and safe.
The power of large language models and workflows
We've all experienced the remarkable capabilities of well-trained Large Language Models (LLMs) when using natural language to seek information. Now, this capability is being honed to work with smaller, specialized datasets, allowing LLMs to deliver expert-level responses on specific topics.
A notable example is code assistants. Red Hat paired Ansible with IBM's watsonx Code Assistant to create the Lightspeed feature. Embedded in coding tools like VS Code, Lightspeed enables users to generate Ansible playbook content from a simple text outline. This is achievable because the underlying model has been trained on extensive Ansible data, making it proficient at translating user intent into functional playbooks.
The Agency is adopting a similar approach with its network data. By linking an LLM to its network data lake and other sources, the Agency empowers operators to swiftly retrieve summarized network data, key statistics, and correlations with telemetry information. Tasks that once required hours of manual CLI queries can now be accomplished in seconds—simply by asking a question.
As part of this initiative, they have developed an internal system called Optimus, built on the open-source network automation platform eNMS. Optimus streamlines and automates network administration tasks, enhancing the efficiency and reliability of processes like switch provisioning.
Beyond "Day 0" tasks, such as initial switch setup, Optimus also automates "Day 2" operations, including troubleshooting connectivity issues. By reducing manual effort and increasing accuracy, Optimus has significantly improved network operations. It encapsulates complex network analysis and break/fix actions into reusable, reliable workflows—ensuring tasks are completed thoroughly, accurately, and with minimal human intervention.
The true game-changer, however, is the integration of AI with Optimus. This advancement enables genuine "ChatOps"-style network administration, allowing operators to interact with the system using natural language, further enhancing efficiency and operational agility.
How they did it
A network automation consultant from WWT, working on contract, is supporting the Agency as one of the lead developers on its automation team.
He describes the system:
"We are modifying the backend to run in a clustered multisite environment under Docker Swarm. It can connect to multiple devices (multithreaded) and automate a complex implementation workflow or decision tree. With the benefits of automation... no mistakes... no skipped steps... and it runs faster than a large team of engineers performing tasks manually. We recently started integrating the AI chatbot capabilities, and I added n8n in addition to Flowise to test the agentic workflows. Many of the capabilities are based on effective AI tool use.
Basically, the AI agent detects whether what you're asking for can be accomplished by one of the programmed tools. It then sends a REST API call to our automation platform (Optimus), which takes the payload sent by the AI agent and implements it by running a specified workflow. The benefit of this is that all components are sandboxed off and the flow is completely controlled. The LLM is local... as well as the vector datastores... and data repositories. The AI can only target the workflows that we define... and can only perform exactly what we have designed them to do.
The additional benefit is that there is no expense to running the LLM... no worry about token limits or fees paid to public language models and public vector stores. All the data and flow of information is local, controlled and sandboxed."
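The tool-to-workflow handoff he describes can be sketched roughly as follows. Everything here is illustrative: the endpoint URL, workflow names, and payload shape are invented for this example, not the Agency's actual Optimus API.

```python
import json
import urllib.request

# Hypothetical endpoint; the real Optimus API is internal to the Agency.
OPTIMUS_URL = "https://optimus.example.internal/rest/run_service"


def build_workflow_payload(tool_name: str, arguments: dict) -> dict:
    """Map an AI-agent tool call onto a workflow request.

    The agent never touches devices directly: it can only name one of the
    pre-approved workflows, and the automation platform executes the
    validated steps.
    """
    allowed = {"interface_status", "device_inventory"}  # the guardrail
    if tool_name not in allowed:
        raise ValueError(f"tool {tool_name!r} is not an approved workflow")
    return {"name": tool_name, "payload": arguments}


def run_workflow(payload: dict, token: str) -> bytes:
    """POST the payload to the automation platform (sketch only)."""
    req = urllib.request.Request(
        OPTIMUS_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The allowlist is the point: even if the LLM hallucinates a tool name, nothing outside the approved workflow set can be invoked.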
When asked to describe the interaction between the LLM and the workflow engine, he said:
"With the AI workflow you can target one or more LLMs. You would target a particular LLM depending upon its strength (coding ability, text-to-speech intelligence, vector embedding effectiveness, etc.). The AI agent would then be aware of the TOOLS it has available, which are components you program. With effective prompt engineering, the AI agent will then understand that it needs to parse the user's query for the required data elements... format it into structured data... and then target a defined workflow in Optimus that can execute the task."
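The parse-and-format step he mentions typically relies on the agent emitting structured JSON that is validated before any workflow runs. Here is a minimal sketch with a hypothetical tool registry; the tool name and parameters are invented for illustration.

```python
import json

# Hypothetical registry the agent is prompted with; one entry per programmed
# tool. The agent may only reference tools listed here.
TOOLS = {
    "get_interface_status": {
        "description": "Return the operational status of a device interface.",
        "parameters": {"device": "hostname", "port": "interface id"},
    },
}


def parse_agent_reply(reply: str) -> tuple[str, dict]:
    """Validate the structured JSON the LLM emits before anything executes.

    Rejects unknown tools and incomplete argument sets, so a malformed or
    hallucinated reply fails here instead of reaching the network.
    """
    data = json.loads(reply)
    tool = data["tool"]
    if tool not in TOOLS:
        raise ValueError(f"unknown tool {tool!r}")
    missing = set(TOOLS[tool]["parameters"]) - set(data["arguments"])
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    return tool, data["arguments"]
```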
When asked about training the LLM, he said:
"Training LLMs is expensive in terms of time and hardware. At this time we are not training the LLMs... we are enhancing them with RAG and AI tool use. RAG (Retrieval Augmented Generation) is a method where you have the AI pre-read the documents you store at a specified point. The AI workflow then vectorizes this data and embeds it, using an embedding LLM, into a local vector datastore. Subsequently, when you ask the AI a question, it is able to retrieve this additional information from the vector datastore and answer questions that the LLM didn't previously know. So, you've enhanced it with localized knowledge.
Tool use is a method of letting the AI agent know that it can launch a connected tool to get answers which the LLM doesn't know (or to which the LLM provides unreliable or weak answers).
A simple example is asking it the time. The LLM is a static, local, precompiled brain, with the knowledge it was trained on up to a specified date. It can understand that it should query a tool to get the time... which then executes a JavaScript/Python function that returns the exact date and time back to the AI agent. Now the AI agent can respond with the exact date and time.
Tool use can be extended to incorporate networking functions. Example: what is the status of device X? The AI agent understands to use a tool that responds with device information. It executes the tool, passing it the device in question, which calls an Optimus workflow. The Optimus workflow then launches a predefined automation workflow that returns the exact answer back to the AI agent, which can now respond to the question.
This can eventually be enhanced so the AI agent can not only answer questions... but implement advanced networking tasks, and this all depends upon how we've implemented the tasks in Optimus."
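The RAG pattern he describes (embed documents ahead of time, then retrieve the most relevant chunks at question time) can be sketched with a toy similarity function standing in for a real embedding LLM. The sample documents are invented; a production system would use an actual local embedding model and vector datastore.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding LLM: bag-of-words term counts.

    A real deployment would call a local embedding model and store dense
    vectors instead.
    """
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# "Vector datastore": documents embedded ahead of time (illustrative data).
DOCS = [
    "Port 0/0/3 on rtr1 is admin down pending maintenance window 42.",
    "Switch provisioning uses the Day 0 bootstrap workflow.",
]
STORE = [(doc, embed(doc)) for doc in DOCS]


def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k most relevant chunks to prepend to the LLM prompt."""
    q = embed(question)
    ranked = sorted(STORE, key=lambda d: cosine(q, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

At answer time, the retrieved chunks are inserted into the prompt ahead of the user's question, which is how the LLM gains the "localized knowledge" described above.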
When asked about the capabilities and use cases, he answered:
"Capabilities of the network automation platform... it can directly interact with a wide range of networking gear (I believe it is over 250 different device targets), and it can target a multitude of backends for data collection (e.g., Mongo, MSSQL, SQL) and use REST API calls to interact with other networking tools (e.g., Cisco ISE, Juniper Mist, etc.).
We are currently using it to automate tasks across the enterprise. The AI integration is still being tested to enhance the accuracy of the network tasks it performs."
When asked how the system is impacting current operations:
"It is saving engineering time. Instead of having a team of engineers manually log into thousands of devices to perform network tasks, the automation platform will execute a workflow... which targets multiple devices simultaneously and executes commands at the speed of a computer. Projects which would have taken months to execute can be done in minutes."
AI and the future of network operations and the Agency
The Agency is spearheading several forward-looking initiatives, with automation and AI at the forefront of modernizing operations. Central to this transformation is the utilization of network data as a foundation for AI-driven operations. WWT is assisting the Agency in leveraging its existing efforts in data collection and workflow-connected toolsets to enable AI to achieve the following:
- Establishing a secure set of "guardrails" for AI to safely execute network change actions.
- Creating a robust and impactful dataset for AI to analyze and utilize.
- Implementing applied agentic AI for network operations and ChatOps.
- Reducing manual effort and minimizing errors.
- Accelerating "Day 0" network equipment deployments.
- Decreasing the Mean Time to Repair (MTTR).