Think DevOps is hard to understand? Meet AIOps. Gartner coined the term five years ago, and it progressed from Algorithmic IT Operations to Artificial Intelligence for IT Operations, or AIOps for short. The idea of having general algorithms help IT operations teams is not that novel. One could say that AIOps is trying to ride the wave of Artificial Intelligence hype, and some makers of ops and monitoring tools are trying to ride along. If you ask five people to define AIOps, you'll probably get 10 different definitions. It’s similar to the time when no one could agree on what cloud meant and some vendors tried to “cloud-wash” older products.
Here’s how Eveline Oehrlich, research director for the DevOps Institute, describes it: “AIOps solutions equip IT operations and other teams to improve key processes, tasks, and decision-making through improved analysis of the volumes and categories of data coming its way. The adoption of these tools automates the fast ingestion of volumes of data. Machine learning is used to analyze the data and present findings that either predict or alert you to issues. This newfound knowledge can then be used for automation or decision-making.”
As an IT leader, you need to be able to cut through the hype and explain some common misconceptions about AIOps that your bosses, colleagues, partners, and customers may have. Doing so will help you explain why various AIOps approaches do or don’t fit a business goal.
Truths about AIOps: What it is and what it can do
So let’s dig deeper into what’s fueling the current AIOps momentum and what benefits IT teams are seeing. One example: If your teams are already working with containers and Kubernetes, you should like the idea of a self-driving cluster, and - spoiler alert - you probably can start practicing AIOps right now!
1. AIOps is not a product
If you want to introduce AIOps into your organization, you might be tempted to just buy an AIOps product, plan a one-year rollout, and be done. Congratulations, you just added another product to your operations stack and increased its complexity. But wait, didn’t you want to manage less complexity, not more?
First, have a closer look at your current toolset and evaluate where you have holes, considering the common features that current AIOps products offer:
- Baselining - for metrics and other time-series based data
- Root cause analysis - connect multiple information sources and drill down
- Anomaly detection - predict the future and alert on deviations
- Correlation - e.g. between metrics and tickets
- Simulation - "what if" scenarios
You see, AIOps is a feature or a capability rather than a standalone product. And as we’ll see later, for some of the capabilities a dedicated tool will not suffice, because the magic happens once you interconnect all your tools. A single neuron doesn’t make a brain, either.
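Two of the capabilities listed above, baselining and anomaly detection, can be illustrated in a few lines of Python. This is only a minimal sketch under simple assumptions, not how any particular AIOps product works: the baseline is the mean and standard deviation of a trailing window, and a point counts as anomalous when it deviates from that baseline by more than a chosen number of standard deviations. The function name, the window size, the threshold, and the latency values are all made up for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return the indices of values that deviate from a trailing baseline.

    The baseline for each point is the mean and standard deviation of the
    `window` values that precede it (a deliberately simple assumption).
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, one sample per minute.
latency_ms = [102, 99, 101, 100, 98, 103, 100, 250, 101, 99]
print(find_anomalies(latency_ms))  # prints [7], the index of the 250 ms spike
```

Note that the spike itself inflates the baseline for the points that follow it, which is one reason real systems tend to use more robust statistics than a plain mean.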
2. Before you AI you need to Ops
If you want to run, you need to learn how to walk. There’s nothing wrong with good old-fashioned monitoring. Having metrics, logs, and observability in your system landscape is what you need as a base. So the first task is to get your operations straight. If you suffer from too many alerts, identify the most important ones. If you don’t get metrics from a critical application, start implementing metrics. Start to define Service Level Indicators (SLIs) and some Service Level Objectives (SLOs) you want to meet.
While doing so, you’ll discover some blind spots in your monitoring setup and improve your visibility and operational capabilities on the job. Once you hit a barrier where the hand-crafted thresholds for alerts don’t work anymore, it’s time to reach for new tools.
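To make SLIs and SLOs concrete, here is a small sketch, assuming an availability SLI defined as the share of successful requests and an SLO expressed as a target for that share. The function names and the request counts are hypothetical; your own indicators may be defined quite differently (latency percentiles, freshness, and so on).

```python
def availability_sli(good_requests, total_requests):
    """Share of requests that were served successfully."""
    if total_requests == 0:
        # Assumption: with no traffic we treat the service as fully available.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli, slo):
    """Fraction of the error budget that is still unused.

    1.0 means the budget is untouched; 0 or less means it is exhausted.
    """
    allowed_failure = 1.0 - slo
    if allowed_failure <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical numbers: 100,000 requests in the period, 50 of them failed.
sli = availability_sli(good_requests=99_950, total_requests=100_000)
remaining = error_budget_remaining(sli, slo=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```

With a 99.9% objective, 100 failed requests would have been acceptable in this period, so 50 failures consume about half of the budget.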
Understanding your limitations is always the first step to improvement and so the journey of AIOps starts with data collection and making sense of that data.
When AI researchers venture into a new domain, the first thing they do is an EDA, an Exploratory Data Analysis. This includes understanding the data features: what the column names are, what the values mean, and what the semantic context is.
Similarly, the first step for AIOps work would be to ensure your organization can collect and access all operational data in an easy way and can visualize it. That means not just the current data, but also historical data.
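A first pass at exploring operational data does not need heavy tooling. The following sketch uses only the Python standard library and a tiny, invented CSV export; the column names (`host`, `cpu_percent`, `disk_errors`) and values are assumptions for illustration. It reports, per column, how many values are missing, how many distinct values exist, and basic statistics for numeric columns.

```python
import csv
import io
from statistics import mean

# Hypothetical export of operational data; in practice this would be a file.
SAMPLE = """timestamp,host,cpu_percent,disk_errors
2021-03-01T10:00:00,node-a,41.5,0
2021-03-01T10:00:00,node-b,38.0,
2021-03-01T10:05:00,node-a,44.5,0
2021-03-01T10:05:00,node-b,92.0,3
"""


def summarize(rows):
    """Build a per-column summary: missing values, distinct values, numeric stats."""
    summary = {}
    for column in rows[0].keys():
        raw = [row[column] for row in rows]
        present = [value for value in raw if value != ""]
        info = {"missing": len(raw) - len(present), "distinct": len(set(present))}
        try:
            numbers = [float(value) for value in present]
        except ValueError:
            numbers = None  # not a numeric column
        if numbers:
            info.update(min=min(numbers), mean=mean(numbers), max=max(numbers))
        summary[column] = info
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for column, info in summarize(rows).items():
    print(column, info)
```

Even this small summary raises the kind of questions an EDA is meant to surface: why is one `disk_errors` value missing, and is a missing value the same as zero?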
Only after doing this can you start the more advanced part of the journey – to try to find new signals and insights and put those into automated action.
3. AIOps is about a cultural shift
Some people see AIOps primarily as a cultural change in operations - very much as the DevOps movement is known for the associated culture change. DevOps combined the two cultural mentalities from development and operations teams to create a new culture, marked by speed and experimentation. Nowadays, we take it for granted that the DevOps professional uses tools from both the developer and operations toolsets. We see things like codified infrastructure or application development teams providing SLIs for operating their code.
Now add the data scientist persona to the mix and you get AIOps. In other words, using methods like EDA or tools like Jupyter Notebooks to improve your operations will propel an IT pro into the land of AIOps.
The same is actually true for the AI/ML community, which is still disconnected from the operational aspects of deploying models. What if data scientists became more like AI engineers and embraced and understood the benefits and challenges of DevOps? Then, over time, we would shift the attention to problems in the IT domain: I find it funny that we have artificial intelligence that is better than us at identifying cat images, but identifying a bad hard drive is still a challenge.
4. Integration is king and queen
So if AIOps is not a product - where is it happening then? Once some correlation between data sets has been found, or an outage happens repeatedly, you want AIOps to do something automatically or guide you in resolving the outage.
The magic happens in the fabric between your tools.
It can manifest in small connection layers, like chatbots providing you with links to relevant systems, making it easier to jump from a metrics dashboard to the debugging console.
But correlation is not causation. So, even if you find a correlation between two sets of metrics using an AIOps tool, you still need to verify it and decide if you want to act upon it in the future. Or maybe the correlation helps you to identify a cause for the outage.
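As a small illustration of the correlation step, the sketch below computes the Pearson correlation coefficient between two metric series in plain Python. The series (requests per second and latency) are invented, and Pearson is just one possible measure; the article does not prescribe a specific one. A value close to 1 only says that the two series move together. It does not say which one, if either, causes the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)


# Hypothetical samples taken at the same points in time.
requests_per_second = [100, 150, 200, 250, 300]
latency_ms = [20, 24, 31, 35, 41]

r = pearson(requests_per_second, latency_ms)
print(f"correlation: {r:.3f}")
```

Here the coefficient comes out very close to 1, which makes load a plausible suspect for the latency increase, but a shared third factor (a backup job running at peak hours, say) would produce the same picture.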
It’s all about getting better at understanding and managing the complexity of your setup and then integrating automated helpers and actions.
5. Use the open source, Luke
This is precisely where open source software shines. Compared to a closed source product, in an open source product, you can read the code at any level and deeply understand what is doing what. Translating that to the operations sphere, you’ll want to be able to expose metrics and tracing data at every layer of the software stack and understand what it means. The recent generation of data centers builds on top of Kubernetes, which makes heavy use of microservices and API-driven orchestration of software deployments. And monitoring of API calls is pretty much straightforward today. Check, you can observe your platform.
Observability means, in essence, that you can inspect your landscape at any level of detail, at any time. Using some data science tools to visualize and guide you through the data can help you do root cause analysis and troubleshooting.
Now use the same paradigm to deploy and manage your own applications, containerize them, and re-use the monitoring stack. Check, you can observe your application stack.
Because you used the same tooling, you can easily correlate metrics from the platform and the application. Prometheus has emerged as a de facto standard for monitoring in that domain and is itself heavily API-driven. Similar projects such as Loki and Jaeger help with logs and traces.
Then the organization can use a Kubernetes-native data science platform like Open Data Hub or Kubeflow to collect and analyze all the data.
The bottom-line benefit for IT teams: Standards reduce friction and enable deep integration, and standards are best implemented by open source tools.
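To give a feel for what API-driven means here, the sketch below builds a request URL for the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`) and parses a response in the documented instant-vector format. To keep it runnable without a server, it parses a canned response instead of making a network call. The host name `prometheus.example`, the instance labels, and the sample payload are invented for illustration.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload):
    """Turn an instant-vector response into (labels, value) pairs."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error', 'unknown error')}")
    data = body["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is [unix_timestamp, "value as string"].
    return [(series["metric"], float(series["value"][1])) for series in data["result"]]


# Canned response for illustration; a real one would come from the server.
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"__name__": "up", "instance": "node-a:9100"},
             "value": [1614592800, "1"]},
            {"metric": {"__name__": "up", "instance": "node-b:9100"},
             "value": [1614592800, "0"]}
          ]}}
"""

print(instant_query_url("http://prometheus.example:9090", "up"))
series = parse_instant_vector(SAMPLE_RESPONSE)
down = [labels["instance"] for labels, value in series if value == 0.0]
print("targets down:", down)
```

Against a live server, the same URL could be fetched with `urllib.request.urlopen` and the response body passed to `parse_instant_vector`; the result can then feed the kind of baselining or correlation described earlier.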
6. Data, data, data
Speaking of data - you can’t have enough of it. But it has to be clean and well understood, as seen above. So you can collect your own pool of data and train your own AI models - actually, all of the commercial AIOps tools require you to do just that, since there’s no pre-trained intelligence baked in.
But what if we could train some common models on public data and then use them as a baseline for training your own models? This way, nobody would need to start from scratch; everyone could stand on the shoulders of a community. A database application, for example, could come with its own model for common workloads and architectures. You would then transfer that learning to your specific setup with your specific needs. It would give you a kickstart and let you differentiate on your individual requirements.
This is part of the vision of the Operate First public cloud project, where the platform and workloads are operated in a community and the operational data, such as metrics, logs and tickets, are released under an open source license. This is to enable data scientists to build open and free models. All the building blocks are there and you can follow along, or better yet, actively participate, at operate-first.cloud.
How to start small with AIOps now
As any IT leader who has adopted the DevOps way of working knows, a change of habits requires continued practice. The same is true for IT teams adopting an AIOps mindset. The good news is that you can start small right now with a well-understood problem, and then go through the evolutionary cycle of developing AIOps capabilities, working toward more AI-assisted, AI-augmented, and finally AI-automated IT operations.
A change of culture requires champions, sponsors, and role models in an organization. Instead of getting caught up in AI hype, do the research, understand the first principles, and deconstruct problem statements. Every revolution starts small. You may be surprised what you can accomplish with a small group of open-minded engineers and operations professionals, and a state-of-the-art platform and operations stack.
According to the Gartner glossary, AIOps means "Artificial Intelligence for IT Operations", not Algorithmic IT Operations (ref: https://www.gartner.com/en/information-technology/glossary/aiops-artificial-intelligence-operations). And there are a number of products in this area, some even quite good. But like any other use of AI, they need care and feeding. Lots of feeding, actually, with data.