Introduction

Modern IT operations are becoming increasingly complex, and IT operations teams face numerous challenges.  Legacy monitoring systems struggle to keep up with the evolving IT landscape – hybrid cloud, distributed applications, Kubernetes, and more – resulting in inefficiencies, excessive costs, and a lack of flexibility to change.  The composable observability architecture presents a solution: by integrating telemetry pipelines and a visualization layer, it allows organizations to enhance operational efficiency, reduce tool lock-in, and unlock greater value from their telemetry data.

Challenges with legacy monitoring

There are common challenges when dealing with legacy monitoring solutions, including:

  • Monitoring silos:  Different teams use different tools, preventing seamless access and correlation of data.
  • Increasing costs:  Traditional logging solutions keep logs within the vendor's tools, often resulting in high storage costs and limited flexibility.
  • Burdensome migrations:  Shifting to new monitoring tools is difficult, often requiring major operational overhauls to both process workflows and training.
  • Difficulties isolating outages:  Legacy systems lack correlation between disparate telemetry sources, making root cause analysis a time-consuming effort.
  • Tool lock-in:  Monitoring solutions impose rigid UIs, fixed storage formats, and restricted data access, limiting flexibility and making it challenging to adopt new tools or modernize your observability stack.

The interconnection of assets, automation and observability

Let's take a quick step back and understand the IT landscape.  

Figure:  The assets, automation, and observability triangle

Assets, automation, and observability are deeply interconnected, forming the foundation of modern IT operations.  Assets represent the infrastructure, applications, services, and devices that generate telemetry data.  Without a comprehensive understanding of these assets, organizations cannot effectively monitor or manage them; how do you monitor something you don't know you own?  Automation, in turn, relies on that same understanding of assets so that it knows which devices and applications it can act upon.  Automation is also essential for ensuring that telemetry data is collected, processed, and acted upon in real time, reducing the need for manual remediation and increasing operational efficiency.  The pull-through effect of automation is standard definitions expressed as code, which ensures changes are applied uniformly and repeatably.

Observability relies on both assets and automation to provide meaningful insights into system health, performance, and availability.  Without a well-managed asset inventory, observability lacks awareness of what to monitor, and automation has no mechanism to understand what needs to be provisioned, changed, or removed.  A truly effective observability strategy therefore requires an integrated approach where these three components continuously feed into and enhance one another, enabling organizations to respond proactively to operational challenges.

Understanding the composable observability architecture

A composable observability architecture consists of five layers:

  1. Telemetry Producers – These are the devices and systems that generate raw telemetry data, including metrics, logs, events, and traces. Telemetry producers are foundational as they serve as the sources of data, such as servers, routers, applications, IoT devices, services, etc.
  2. Telemetry Pipeline / Observability Pipeline – This layer serves as the backbone for routing, enriching, and transporting telemetry data efficiently to various destinations.  It decouples data generation from consumption, allowing multiple tools to ingest and analyze data.  The pipeline ensures data immutability and integrity, applies any necessary transformations and enrichment, and provides a robust mechanism for scalable and efficient telemetry distribution and sharing across the organization.  Additionally, the pipeline functions as a service bus and enables an event-driven architecture, decoupling integrations not only between telemetry producers and consumers but also among the various telemetry consumers that depend on data or actions from other telemetry-associated systems.  (A toy sketch of how these layers decouple from one another follows this list.)
  3. Telemetry Consumers – These are the various analytics, security, and operational platforms that derive actionable insights from telemetry data.  These would include network monitoring systems (NMS), application performance management (APM) platforms, logging platforms, and many others.  Consumers process and analyze data to drive business decisions and automated responses. Effective telemetry consumers rely heavily on a structured and well-managed telemetry pipeline to ensure accurate and timely access to telemetry data.
  4. Analytics Layer – This layer is responsible for deriving insights from telemetry data and analytics conducted by downstream tools, e.g. APM, NMS and other tools, by applying event correlation, policy management, event analysis, and AI/ML-driven functions.  It enhances event management by prioritizing alerts, identifying patterns, and predicting failures before they impact operations. This layer enables root cause analysis, anomaly detection, and automated response actions, ensuring IT teams can proactively optimize performance and reliability across the enterprise.
  5. Visualization Layer – This is the interface that provides end-users with a clear and unified view of observability data.  By consolidating data from multiple telemetry consumers, it ensures different IT teams (e.g. network engineers, service desk analysts, application owners) can work from a shared, reliable perspective.  The visualization layer minimizes operational burden with customized, persona-based dashboards and subsequently enables seamless integration of new tools and decommissioning of old ones without requiring disruptive workflow changes, user interface changes, process re-engineering, broad-sweeping skills training, or re-writing of SOP documents.
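
To make the layering concrete, the following is a minimal, illustrative Python sketch of how the five layers decouple from one another.  It uses an in-memory topic map as a stand-in for a real telemetry pipeline such as Kafka; all names and values are hypothetical and exist only to show the shape of the interaction.

    from collections import defaultdict

    # Layer 2: a toy telemetry pipeline -- topics decouple producers from consumers.
    topics = defaultdict(list)

    def publish(topic, record):
        topics[topic].append(record)      # "write once..."

    def subscribe(topic):
        return list(topics[topic])        # "...read many"

    # Layer 1: a telemetry producer emits a raw metric.
    publish("telemetry.metrics", {"host": "core-rtr-01", "metric": "ifInErrors", "value": 42})

    # Layer 3: two independent consumers read the same data without knowing the producer.
    nms_view = subscribe("telemetry.metrics")
    apm_view = subscribe("telemetry.metrics")   # same data, no extra collection

    # Layer 4: a trivial analytics rule turns raw metrics into an alert.
    for record in nms_view:
        if record["value"] > 10:
            publish("telemetry.alerts", {"host": record["host"], "severity": "warning"})

    # Layer 5: the visualization layer reads only curated topics.
    print(subscribe("telemetry.alerts"))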

The value of the telemetry pipeline

The telemetry pipeline offers significant advantages:

  • Data access and reuse – A centralized telemetry pipeline eliminates the need for duplicated data collection, making telemetry data, both real-time and historical, accessible to multiple teams across the organization. This approach improves efficiency, fosters collaboration, and ensures that all departments, from IT operations to security and development, are working from the same reliable data source (a minimal consumer-group sketch of this reuse follows this list).
  • Event-driven architecture – facilitates a decoupled system design, allowing applications to operate independently while communicating through event streams. This differs from traditional request-response architectures, where service dependencies create bottlenecks, single points of failure, and tight coupling between applications.  With an event-driven architecture for IT Ops, operations teams benefit from enhanced scalability, recovery from system failures without data loss, and asynchronous processing.
  • Centralized telemetry management – By serving as a single source of truth for telemetry data, the pipeline ensures consistency and immutability of telemetry data, eliminating data fragmentation and knowledge silos across teams.
  • Flexible data storage – Unlike legacy monitoring solutions that impose rigid storage structures, a telemetry pipeline allows organizations to adopt a flexible storage approach.  IT teams can decide whether to store data on-premises, in the cloud, or a hybrid model, optimizing cost and accessibility according to business needs.  Telemetry data can also be transformed to conform to the data formats expected by the application requesting access.
  • Loosely coupled integrations – Traditional monitoring solutions often require tight integration between producers and consumers, making it difficult to adopt new tools.  A telemetry pipeline introduces loose coupling, allowing IT teams to switch or add new observability tools without requiring major infrastructure changes and risky cutovers from the out-going tool to the new one.  This results in significantly greater agility and innovation.
  • Historical state and recovery – A well-designed telemetry pipeline retains historical telemetry data across all the telemetry producers, allowing operations teams to conduct forensic analysis, track trends, investigate long-term performance degradations or even feed future AI tools in their preferred data structure for analysis.
  • Vendor independence – Many OEM vendors lock organizations into proprietary ecosystems and data structures, limiting flexibility and exportability and driving up costs.  A telemetry pipeline ensures data ownership is retained by the organization and remains within organizational control.  In this way, telemetry data is not held captive within a proprietary vendor tool or an external partner system, e.g. a managed services partner.
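
As a concrete illustration of the "write once, read many" and loose-coupling points above, the hedged Python sketch below shows two independent tools reading the same Kafka topic under different consumer groups, so adding or removing a consumer never touches the producer.  It assumes the confluent-kafka Python client; the broker address, group names, and topic name are placeholders.

    from confluent_kafka import Consumer

    # Placeholder connection details -- substitute your own cluster and credentials.
    BASE_CONFIG = {
        "bootstrap.servers": "kafka.example.com:9092",
        "auto.offset.reset": "earliest",
    }

    def make_consumer(group_id):
        # Each consumer group keeps its own offsets, so tools read independently.
        return Consumer({**BASE_CONFIG, "group.id": group_id})

    nms_consumer = make_consumer("nms-tool")        # e.g. network monitoring
    siem_consumer = make_consumer("siem-tool")      # e.g. security analytics

    for consumer in (nms_consumer, siem_consumer):
        consumer.subscribe(["telemetry.metrics"])   # hypothetical topic name

    # Both tools receive the same records without any coordination with the producer.
    msg = nms_consumer.poll(timeout=1.0)
    if msg is not None and msg.error() is None:
        print(msg.value())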

Telemetry pipeline example

To illustrate how a composable observability architecture functions, we will walk through a real-world example showcasing how this is all integrated to provide a seamless, scalable observability solution.  This architecture ensures efficient data collection, processing, event correlation, automation and visualization for actionable insights. 

Example:  Real-World Observability Architecture 

Telemetry Producers – consist of network devices, applications, and infrastructure components generating various telemetry data, including SNMP metrics, logs, traces, and events.

Telemetry Pipeline – collects, processes, and transports telemetry data to downstream consumers, ensuring efficient data management, transformation, and event-driven processing.

  • Grafana Alloy
    • Polls network devices via SNMP over UDP to gather network performance metrics.
    • Uses remote_write to send structured time-series data to Cribl Stream over HTTP for further processing.
  • Cribl Stream
    • Ingests logs, metrics, and traces from various sources, including syslog via HTTP, TCP, and UDP.
    • Optimizes log data through aggregation and deduplication, reducing storage costs while maintaining valuable insights.
    • Routes processed telemetry data to Confluent Cloud Kafka, ensuring data is categorized into specific telemetry topics (a minimal producer sketch of this routing follows this list).
  • Confluent Cloud
    • Acts as the central service bus for telemetry data, enforcing a "write once, read many" model for seamless consumption.
    • Exposes Kafka topics for different telemetry types (metrics, logs, traces, events) to provide structured access for observability tools.
    • Uses processing rules to enrich and transform data before exposing it to telemetry consumers.
    • Exposes Kafka topics to support an event-driven architecture among the various observability analytics, correlation, incident, alerting, and notification systems.
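
The topic-per-telemetry-type routing described above could be exercised with a small producer like the hedged Python sketch below.  It assumes the confluent-kafka Python client; the broker address, topic names, and payload are placeholders, and in the architecture described here Cribl Stream performs this routing rather than custom code.

    import json
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": "kafka.example.com:9092"})  # placeholder broker

    # Hypothetical topic-per-telemetry-type layout, mirroring the categorization above.
    TOPIC_BY_TYPE = {
        "metric": "telemetry.metrics",
        "log": "telemetry.logs",
        "trace": "telemetry.traces",
        "event": "telemetry.events",
    }

    def route(record):
        # Write once to the right topic; any number of consumers can read it later.
        topic = TOPIC_BY_TYPE[record["type"]]
        producer.produce(topic, key=record["source"], value=json.dumps(record))

    route({"type": "metric", "source": "core-rtr-01", "name": "ifInErrors", "value": 42})
    producer.flush()  # block until the broker acknowledges delivery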

Telemetry Consumers – extract and analyze telemetry data from Confluent Cloud to drive insights and performance monitoring while supporting an event-driven architecture to streamline alerting, automation, and operational intelligence.

  • NMS, NPM, and Security Tools
    • Network Management Systems (NMS) and Network Performance Monitoring (NPM) tools subscribe to relevant Kafka topics to extract telemetry for network health analysis, traffic optimization, and SLA monitoring.
    • Security tools, including SIEM and IDS/IPS platforms, ingest logs and alerts from Confluent Cloud topics, allowing event-driven security analytics and real-time threat detection.
    • These tools publish processed events and alerts back to Confluent Cloud for further consumption by correlation, analysis, and automation tools.
  • Dynatrace (APM for Application Performance Monitoring)
    • Consumes application performance telemetry via Kafka topics to provide deep observability into applications, microservices, and distributed environments.
    • Uses AI-powered analytics to detect application performance bottlenecks and anomalous behaviors proactively.
    • Publishes the output of its analysis as alerts and events back to Confluent Cloud for integration with BigPanda and automated response workflows (a simplified consume-and-publish-back sketch follows this list).
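
The consume-analyze-publish-back pattern used by these telemetry consumers can be sketched as a simple Python loop.  This is illustrative only -- Dynatrace and the security tools do this through their own integrations -- and the threshold, topic names, and connection details below are assumptions.

    import json
    from confluent_kafka import Consumer, Producer

    conf = {"bootstrap.servers": "kafka.example.com:9092"}          # placeholder broker
    consumer = Consumer({**conf, "group.id": "apm-analyzer", "auto.offset.reset": "latest"})
    producer = Producer(conf)

    consumer.subscribe(["telemetry.metrics"])                       # hypothetical topic

    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        # Toy anomaly rule: flag slow application responses and publish an alert
        # back onto the bus for BigPanda (or any other subscriber) to correlate.
        if record.get("name") == "response_time_ms" and record.get("value", 0) > 500:
            alert = {"source": record.get("source"), "type": "latency_breach", "value": record["value"]}
            producer.produce("telemetry.alerts", value=json.dumps(alert))
            producer.flush()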

Analytics Layer – provides event correlation, deduplication, and automated incident generation, ensuring only meaningful alerts escalate to ITSM.

  • BigPanda
    • Acts as the event correlation and analytics platform that is authorized to generate incident tickets in the ITSM tool.
    • Consumes alerts from multiple observability tools via the Confluent Cloud HTTP Sink and applies AI-driven event correlation and analysis.
    • Compresses, deduplicates, and prioritizes alerts into actionable incidents, significantly reducing noise and false positives (a toy deduplication sketch follows this list).
    • Sends correlated events back into Confluent Cloud Kafka topics for consumption by the ITSM system to create incident tickets and/or initiate automated remediation from Ansible Automation Platform.
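
As a toy illustration of the compression and deduplication step, the Python sketch below collapses repeated alerts for the same host and check into a single candidate incident.  It is a drastic simplification of what BigPanda's correlation engine does, and the alert fields are made up.

    from collections import defaultdict

    def correlate(alerts):
        """Collapse raw alerts into candidate incidents keyed by (host, check)."""
        incidents = defaultdict(lambda: {"count": 0, "max_severity": 0})
        for alert in alerts:
            key = (alert["host"], alert["check"])
            incidents[key]["count"] += 1
            incidents[key]["max_severity"] = max(incidents[key]["max_severity"], alert["severity"])
        # Only the deduplicated, highest-severity view is escalated toward ITSM.
        return [{"host": h, "check": c, **data} for (h, c), data in incidents.items()]

    raw = [
        {"host": "core-rtr-01", "check": "bgp_peer_down", "severity": 3},
        {"host": "core-rtr-01", "check": "bgp_peer_down", "severity": 4},  # duplicate, kept once
        {"host": "app-01", "check": "high_cpu", "severity": 2},
    ]
    print(correlate(raw))   # two incidents instead of three raw alerts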

Visualization Layer – powered by Grafana Cloud, provides a single-pane-of-glass dashboarding and reporting solution across all observability components.

  • Grafana Cloud
    • Serves as the enterprise-wide reporting and dashboarding solution, visualizing key metrics, logs, traces, and event analytics from multiple sources.
    • Provides persona-driven dashboards tailored for network engineers, security analysts, application owners, and IT leadership, reducing noise and enhancing decision-making (a dashboards-as-code sketch follows this list).
    • Enables trend analysis, reporting and SLA tracking, ensuring proactive observability across all solution components.
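
Persona dashboards themselves can be managed as code, as the automation section below also notes.  The hedged Python sketch below posts a minimal dashboard definition to Grafana's HTTP API; the URL, token, and dashboard contents are placeholders, and in this architecture the same step would normally run from a CI/CD pipeline via Ansible rather than ad hoc.

    import requests

    GRAFANA_URL = "https://grafana.example.com"          # placeholder instance
    TOKEN = "glsa_xxx"                                   # placeholder service-account token

    payload = {
        "dashboard": {
            "id": None,
            "title": "Service Desk - Network Health",    # persona-specific view
            "tags": ["persona:service-desk"],
            "panels": [],                                 # panels omitted for brevity
        },
        "overwrite": True,                                # idempotent re-deploys from CI/CD
    }

    resp = requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        json=payload,
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())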

Automation Architecture – Although not depicted in the observability architecture above, the entire solution is delivered through Infrastructure-as-Code (IaC) and powered by Red Hat OpenShift and Ansible Automation Platform.

  • Red Hat OpenShift
    • The configurations and manifests for the telemetry collectors, Cribl Stream and Grafana Alloy, are managed in Git for source control and then deployed onto their respective distributed OpenShift clusters using a GitOps methodology with ArgoCD.
  • Ansible Automation Platform
    • As much of the observability solution as possible is managed as IaC.  For example, the Confluent integrations, Kafka topics, and Grafana dashboards are managed through Git source control and CI/CD pipelines and deployed by Ansible Automation Platform.  This enables rapid provisioning or reconfiguration of any solution component and lends itself well to automating the lifecycle of a managed device or service.
    • Listens to Kafka topics on Confluent, triggering automated remediation actions based on observed telemetry.  These actions can include infrastructure adjustments, service restarts, or configuration changes, ensuring rapid response to incidents.  Additionally, Ansible can create or update incident tickets within ITSM platforms, ensuring proper tracking and process compliance.  (A simplified remediation-listener sketch follows this list.)
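
A minimal stand-in for the listen-and-remediate loop described above is sketched below in Python: a consumer watches a remediation topic and shells out to ansible-playbook.  In the actual architecture this role is played by Ansible Automation Platform (for example through Event-Driven Ansible); the topic name, playbook path, and payload fields are hypothetical.

    import json
    import subprocess
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "kafka.example.com:9092",   # placeholder broker
        "group.id": "remediation-listener",
        "auto.offset.reset": "latest",
    })
    consumer.subscribe(["telemetry.remediation"])        # hypothetical topic

    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        incident = json.loads(msg.value())
        # Hypothetical mapping from incident action to a playbook held in Git.
        if incident.get("action") == "restart_service":
            subprocess.run(
                [
                    "ansible-playbook",
                    "playbooks/restart_service.yml",     # placeholder path
                    "-e", f"target_host={incident['host']} service={incident['service']}",
                ],
                check=True,
            )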

The Unified Visualization Layer

Imagine an IT Operations Director has just been informed that they must move to a new network monitoring tool, requiring a complete overhaul of how the service desk operates.  The new tool has a different UI, different alerting mechanisms, and an entirely new way of viewing network health.  Additionally, the team has spent years developing efficient workflows, writing standard operating procedures (SOPs), and ensuring that everyone—from the day shift to the night shift—knows how to triage issues effectively.  Now, all of that is about to change.

This is where a unified visualization layer transforms the experience.  Instead of forcing the service desk to fully adopt the new monitoring tool and abandon their existing processes and dashboards, the unified visualization layer integrates the new system's data into an already familiar, well-used dashboard.  The team does not have to rewrite every SOP, create new dashboards, or rework how they triage issues.  Instead, the new tool plugs into the pipeline and the old tool is removed, all without any change to the service desk's dashboards, processes, or workflows.  Essentially, the service desk works with a tailored visualization layer that abstracts away the underlying plumbing.

Example:  Grafana Labs Dashboard by Personas 

The Power of Persona-Based Dashboards

The unified visualization layer allows for persona-driven dashboards that provide:

  • Role-Specific Views – A service desk analyst sees different data than a network engineer or an application owner.  This reduces irrelevant alerts and improves efficiency.
  • Noise Reduction – By filtering out unnecessary information, analysts focus on what's important, minimizing alert fatigue.
  • Operational Consistency – Even as new monitoring tools are introduced, the visualization layer ensures IT teams continue to interact with data in a familiar, structured way.
  • Faster Decision-Making – Dashboards highlight key performance indicators, enabling engineers to diagnose and resolve issues more efficiently.

The IT operations teams can successfully integrate multiple tools seamlessly, avoid massive workflow disruptions, and ensure operational continuity. This approach saves time, resources, and frustration, allowing IT teams to focus on proactive network management rather than navigating disruptive tool changes.

Conclusion

The composable observability architecture, powered by telemetry pipelines and a unified visualization layer, represents a significant shift in how IT operations teams manage complexity, efficiency, and resilience in their observability solutions.  By breaking down traditional data silos, reducing vendor lock-in, and enabling seamless integration of monitoring tools, organizations can build a flexible, scalable observability framework that evolves with their needs.  IT organizations can future-proof their observability strategies, enabling continuous innovation and tool evaluation while maintaining stability and efficiency across their IT ecosystems.

Technologies