The Evolution of IT Operations Analytics

AUG 07, 2017

IT Operations Analytics (ITOA) is the practice of utilizing data science principles to perform pattern discovery, correlation, anomaly detection, and root cause analysis against data collected from underlying infrastructure and applications. To fully appreciate the value proposition of ITOA, you must go back in time to witness the transformation and evolution of Operations.

In the beginning, there was chaos. Organizations relied on best effort to keep services available for consumers. However, the lack of visibility into how these services were operating increased operational cost and risk, often resulting in poor customer satisfaction. Organizations that managed to stay in business quickly turned their attention to increasing visibility by acquiring various products that could monitor their infrastructure and applications.

This stovepipe approach gave birth to Enterprise Service Monitoring (ESM). As a result, Organizations transitioned from a posture of chaos to a reactive posture. Organizations could now be informed quickly when services were experiencing problems rather than waiting for customers to report an outage.

Unfortunately, the silos of monitoring lacked context. Operations needed to interact with multiple ESM platforms to understand where a problem resided. This is often referred to as the “Swivel Chair Effect.” To address this issue, ESM evolved into IT Infrastructure Management (ITIM), which added platforms that could monitor horizontally across an Organization, adding both context and a “Single Pane of Glass.” Operations could now monitor resources through a single system to visually correlate content and gain information about the environment.

While the ITIM approach vastly improved Operations' ability to swiftly tackle issues when services were impacted, it left Operations in a constant state of putting out fires. There had to be a better option that would enable Operations to predict and head off problems before they impacted services. This need for a proactive posture gave rise to IT Operations Analytics (ITOA). By leveraging data science principles and techniques, vendors that adopted an ITOA approach could collect data and apply machine learning to understand behavior, discover patterns, correlate events and detect anomalies through supervised and unsupervised learning, and perform root cause analysis. Together, these capabilities provide a means of forecasting or predicting probable end states that could negatively impact service performance.
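As an illustration of what anomaly detection can look like at its simplest, the sketch below flags metric samples that deviate sharply from a trailing window of recent values. It is a minimal example under stated assumptions, not any vendor's method: the function name `detect_anomalies`, the window size, the threshold, and the sample CPU readings are all hypothetical, and commercial ITOA products typically use far more sophisticated models.

```python
from statistics import mean, stdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return the indices of samples that deviate strongly from the trailing window.

    A sample is flagged when its z-score, computed against the previous
    `window` samples, exceeds `threshold`. All parameters are illustrative.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to score against; skip it.
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical CPU utilization readings (percent) with one obvious spike.
cpu = [41, 43, 42, 40, 44, 42, 43, 95, 42, 41]
print(detect_anomalies(cpu))  # prints [7]
```

A trailing window is used here so that the baseline adapts as normal behavior drifts. The trade-off is that a spike inflates the baseline for the next few samples, which is one reason production systems tend to use more robust statistics.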

Why ITOA is Important to IT Business Leaders

IT Operations Analytics applies data science principles and techniques, such as machine learning, to offer several value propositions to your Organization: pattern discovery, anomaly detection, event correlation, reduced Mean Time to Discover (MTTD), reduced Mean Time to Restore (MTTR), reduced operating cost, reduced risk, and increased customer satisfaction. By leveraging ITOA and its inherent benefits, IT Business Leaders help the Organization improve IT operational efficiency, mitigate impact to business revenue, and reduce operational cost while improving system effectiveness.

The Trace3 Approach to Effective ITOA

The Trace3 IT Operations Analytics Practice uses an approach that strategically places platforms to address six layers within Infrastructure and Operations (I&O). We refer to this reference architecture, or stack, as a System of Action. We have found that the layers of the System of Action encompass the various facets of interest for Operations and ultimately transform data into meaningful and actionable insights.

  1. Monitoring Ecosystem – The monitoring ecosystem represents the foundation for all the higher-level layers and is used to collect telemetry from infrastructure and applications. Products at this layer may or may not implement data science techniques to collect data, but all of them convert data into information by processing, classifying, and reporting up to the higher layers.
  2. System of Engagement – The system of engagement, or manager of managers (MoM), consumes events from all over the Organization. Utilizing data science techniques, the system of engagement can perform pattern discovery, anomaly detection, event correlation, and root cause analysis. The goal of the system of engagement is to reduce both the Mean Time to Discover (MTTD) and the Mean Time to Restore (MTTR). The system of engagement is also the entry point for Operators. It is here that Operators start their incident management process and collaborate with engineering staff through ChatOps to restore/resolve incidents.
  3. System of Automation – The system of automation utilizes products that can perform both automation and orchestration. When integrated with the other layers, the system of automation can alter the environment in response to events, which is commonly referred to as runbook automation (RBA). This enables the programmable data center to be self-healing and reduces the need for human interaction, which results in fewer human errors. Implementing a system of automation can also keep the environment consistent by returning resources to a known state when rogue changes occur. These events are then reported back to the system of engagement for visibility by Operations, thus completing a circular feedback loop between the two layers.
  4. System of Record – The system of record provides ticketing and configuration management. It can also provide a platform for knowledge management, problem management, and many other ITIL-centric processes. There is a close relationship between the system of record and the system of engagement; you will often find integration between these two systems as a feature in one or both platforms.
  5. Data Management – The data management layer, sometimes referred to as a data lake, provides a repository for data from all over the Organization, especially the monitoring ecosystem. Data collected here can be mined and is often used to conduct forensics when an event occurs. By leveraging machine learning, the data management layer can help predict and forecast issues, which are then reported to the system of engagement.
  6. Visualization – The highest layer, the visualization layer, provides a layer of abstraction from the underlying platforms within the System of Action. This abstraction allows for focused reports and dashboards that provide actionable insights while reducing the need for consumers to interact with the underlying systems. Thus, the underlying systems can be replaced with other systems in the future while minimizing the impact on the consumers. This is often what most people see when they interact with a provider – a dashboard that summarizes and informs from data collected from underlying systems.
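To make the interaction between the system of engagement and the system of automation more concrete, the sketch below groups events from several monitoring sources into incidents using a simple time window, then dispatches a runbook for each incident. Everything here is a hypothetical illustration: the `Event` and `Incident` types, the `correlate` and `automate` functions, the `RUNBOOKS` table, and the 60-second window are assumptions made for this example and do not represent a Trace3 or vendor API. Real platforms correlate on much richer signals, such as topology and learned patterns.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    source: str       # monitoring tool that emitted the event
    service: str      # affected service
    message: str


@dataclass
class Incident:
    service: str
    events: list = field(default_factory=list)


def correlate(events, window_seconds=60.0):
    """System of engagement: group events for the same service that arrive
    within `window_seconds` of the previous event for that service."""
    incidents = []
    latest = {}  # service -> (open incident, timestamp of its last event)
    for ev in sorted(events, key=lambda e: e.timestamp):
        current = latest.get(ev.service)
        if current is not None and ev.timestamp - current[1] <= window_seconds:
            current[0].events.append(ev)
            latest[ev.service] = (current[0], ev.timestamp)
        else:
            incident = Incident(service=ev.service, events=[ev])
            incidents.append(incident)
            latest[ev.service] = (incident, ev.timestamp)
    return incidents


def restart_service(incident):
    # Placeholder runbook; a real one would call an orchestration tool.
    return f"restarted {incident.service}"


# Hypothetical runbook table for the system of automation.
RUNBOOKS = {"web": restart_service}


def automate(incident):
    """System of automation: run a runbook if one exists and return a
    result that would be reported back to the system of engagement."""
    runbook = RUNBOOKS.get(incident.service)
    if runbook is None:
        return f"no runbook for {incident.service}; escalate to an Operator"
    return runbook(incident)


events = [
    Event(0.0, "apm", "web", "latency high"),
    Event(20.0, "infra", "web", "cpu saturated"),
    Event(30.0, "infra", "db", "disk 90% full"),
    Event(500.0, "apm", "web", "latency high"),
]
incidents = correlate(events)
results = [automate(incident) for incident in incidents]
for incident, result in zip(incidents, results):
    print(incident.service, len(incident.events), result)
```

In this sketch, the two early `web` events collapse into a single incident, which is the noise reduction that lowers MTTD. The returned result strings stand in for the feedback loop from the system of automation to the system of engagement.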

Conclusion

The IT Operations Analytics space is growing and evolving as vendors find new and exciting ways to apply data science principles and techniques to their products. Today, we see data science principles and techniques predominantly being used in the Monitoring Ecosystem, Data Management, and System of Engagement layers. We predict that over the next 18 to 24 months vendors will apply these same principles and techniques to the other layers of the System of Action stack: System of Automation, System of Record, and Visualization. We also predict that ITOA will significantly alter the traditional Network Operations Center (NOC). Under this new NOC model, the NOOP NOC, products leveraging ITOA concepts will reduce the need for Operators and allow these resources to be focused on other areas of the business. This is an exciting time with vast potential. We are excited to be a part of this evolution and believe we are positioned to help customers through this transformation.

IT Operations Analytics – System of Action

David Ishmael

TRACE3 | Director of IT Operations Analytics
