For decades, the tech industry has been enamored with artificial intelligence, the idea that machines can think like humans. Because these machines never get tired — and compute power increases 10x every five years — we expected them to quickly figure out the meaning of life, the universe and everything else. Around 10 years ago, companies such as Moogsoft emerged, claiming that they could make the lives of IT operations teams a whole lot easier by applying AI to the digital exhaust fumes emitted by infrastructure and applications. Blessed by leading analysts, the segment quickly became known as “AIOps.”
The prevailing thought behind AIOps was that patterns in how applications and infrastructure operate could, over time, be determined — and then used to predict when a problem would occur. Even better, maybe problems could even be fixed before anyone noticed. Magic.
Like most magic, though, not everything is as it appears.
Machine learning is like having 1,000 first graders at your disposal. They’re smart enough to be taught to identify a spike in a chart—or a kitten—and alert you when they see one. However, enterprise environments are inherently noisy. If you ask to be notified every time your first graders see a kitten (spike), get ready to receive a lot of notifications. This simple example highlights why ML is great for identifying kittens in consumer search but not so great for identifying problems with enterprise applications.
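The kitten-versus-spike problem is easy to see in miniature. The sketch below is illustrative only: the baseline, threshold, and sample counts are all invented assumptions, not any product's defaults. It shows how a simple threshold rule on noisy telemetry pages you constantly even when nothing is actually wrong:

```python
import random

random.seed(42)

# Hypothetical telemetry: 1,440 one-minute latency samples (ms) for one
# service, modeled as noise around a healthy 100 ms baseline. No real
# incident exists anywhere in this data.
baseline_ms = 100
samples = [baseline_ms + random.gauss(0, 15) for _ in range(1440)]

# A naive "first grader" rule: alert whenever a sample sits more than two
# standard deviations above the baseline (~130 ms).
threshold_ms = 130
alerts = [i for i, value in enumerate(samples) if value > threshold_ms]

# Ordinary noise alone trips the rule dozens of times a day, and every
# trip is a notification for somebody.
print(f"{len(alerts)} alerts in 24 hours from pure noise")
```

Tightening the threshold trades those false alarms for missed real spikes, which is exactly the bind enterprise alerting rules end up in.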
What makes matters worse is that modern applications change every day, which means there is no stable, predictable pattern in their operation. Put another way, the digital exhaust the application emits cannot be used to build a useful model for detecting anomalies.
Most AIOps projects became consulting exercises in writing complex alerting rules. They were often great at detecting and predicting things you already knew about, but singularly useless with “unknown” problems—the ones you’ve never seen before—which are exactly what SREs troubleshoot in distributed applications and struggle with daily.
The Dawn of Generative AI
Just as the sun appeared to be setting on AIOps, we witnessed the dawn of generative AI. AIOps and generative AI both have “AI” in their names, but that’s where the similarity ends. Generative AI uses large language models (LLMs) to better interpret—and respond to—inquiries. The humble chatbot, in the form of ChatGPT, quickly became the killer app. Powered by LLMs, ChatGPT talked sense most of the time on almost any topic.
Of course, just like with AIOps, it’s easy to jump to magical conclusions regarding the capabilities of this new technology. “Why can’t I just train an LLM on my telemetry data and then ask ChatGPT what the problem is?” That would imply that telemetry not only had the necessary structure—which it doesn’t—but also that LLMs can reason. Which they can’t.
Let’s dial back the expectations from magical to practical.
LLMs are amazing at interpreting natural language requests and returning coherent results. Every product should, by default, use a GPT-based interface to enable users to access help. In addition, most observability tools use a proprietary scripting language. Every product should also provide a “copilot” so users can generate scripts from natural language inputs. The combination of GPT Help and copilots can provide a 50% boost in productivity to new users.
But what about helping users with the troubleshooting process? LLMs are statistical engines: the next word in a sentence is predicted from the words that came before it and the probabilities learned during training. Imagine applying that to incidents instead of sentences. Instead of generating the next word in a sentence, perhaps you could generate the next step in the troubleshooting flow. That would mean SRE teams could benefit from dynamically created runbooks—always up-to-date and based on the latest and greatest insights.
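The “next step instead of next word” idea can be pictured with a deliberately simplified frequency model. A real system would use an LLM over rich incident data; this sketch just counts which troubleshooting step historically follows which, and every incident and step name below is invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical incident histories: each is the ordered list of steps an
# SRE actually took. All step names are made up for this example.
past_incidents = [
    ["check_error_rate", "inspect_recent_deploys", "rollback_deploy"],
    ["check_error_rate", "inspect_recent_deploys", "scale_out_service"],
    ["check_error_rate", "check_db_latency", "failover_replica"],
    ["check_error_rate", "inspect_recent_deploys", "rollback_deploy"],
]

# Count how often each step follows each other step, the same way a
# language model counts which token follows which context.
transitions = defaultdict(Counter)
for incident in past_incidents:
    for current, following in zip(incident, incident[1:]):
        transitions[current][following] += 1

def next_step(current: str) -> str:
    """Return the historically most likely follow-up to the current step."""
    return transitions[current].most_common(1)[0][0]

print(next_step("check_error_rate"))        # inspect_recent_deploys
print(next_step("inspect_recent_deploys"))  # rollback_deploy
```

Chain the predictions together and you have a dynamically generated runbook that updates itself every time a new incident is resolved.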
And we could take it one step further. Instead of simply creating a runbook, what about executing the steps in the runbook before the SRE engages in the incident? Imagine coming to an incident and being presented with a custom dashboard for the specific incident that contains all the data necessary to conduct the investigation. That alone saves an hour in most enterprise troubleshooting flows without promising any “magic.”
AIOps promised a deterministic answer to a non-deterministic problem. It just wasn’t possible to analyze terabytes of machine-generated data and accurately predict a problem with an application, let alone determine the root cause. And generative AI won’t do that, either.
But generative AI can go to so many places that AIOps could never reach. It provides a general-purpose approach that can be applied in many different ways. Generative AI will lower the barrier to using practically any new tooling, making it easier for everyone to perform common tasks and get up to speed on new scripting languages. Heck, you might never have to write a regex or look up an error message ever again! In addition, LLMs can statistically determine the best next step and, therefore, should be able to make troubleshooting flows more repeatable while benefiting from the most current incident knowledge.
Yes, the SRE still plays a huge role. Of course, LLMs can’t answer questions the underlying tooling is incapable of answering. But I’ll still make the bold statement that five years from now, the average MTTR (mean time to resolution) will be half of what it is today.