If you’re just getting into site reliability engineering (SRE) or platform engineering, you’ve probably come across a bunch of new terms, like SLI, SLA and SLO. These benchmarks are commonly referenced in the day-to-day life of an SRE but may seem foreign to outsiders. So, what are the differences between these abbreviations?
Well, one of the hallmarks of the SRE approach is to gauge performance and set targets for application metrics. These targets can be thought of as reliability thresholds. If a site’s latency reaches a certain point, for example, it may break the agreement the software provider has with its consumer.
SREs create and monitor service-level performance benchmarks for all sorts of metrics, such as uptime, latency, error count, mean time to recovery and throughput, among others. But not all of these performance goals are part of public agreements. Companies usually take a proactive approach, setting stricter internal targets so they stay well clear of breaking external agreements and eroding user trust. It’s become a best practice to set these baselines to retain functional, reliable systems.
As mentioned, these targets range from firm partner commitments to purely internal goals. Below, we’ll attempt to define, in simple terms, the differences between service-level agreement (SLA), service-level objective (SLO) and service-level indicator (SLI). While these concepts share many similarities, it’s important to understand how they are applied in practice. Keep in mind that each company may have a nuanced understanding of these terms and may conceive and apply them differently.
Service-Level Indicator (SLI)
First off, service-level indicators (SLIs) refer to the actual metrics produced by software services. An SLI is a direct measurement of a service’s behavior: the real numbers that indicate overall performance, such as error rate and latency over time.
One example of an SLI would be the number of successful requests out of total requests over a one-month period. Say an engineer used an application monitoring tool that ingests data from production logs. This data showed that out of one million requests made, ten failed. The availability SLI would thus be 99.999% (999,990 successful requests out of 1,000,000).
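As a minimal sketch in Python, using the hypothetical counts from the example above, this is how that availability SLI could be derived from raw request totals:

```python
# Hypothetical numbers from the example above; in practice these would come
# from production logs or a monitoring tool.
total_requests = 1_000_000
failed_requests = 10

successful_requests = total_requests - failed_requests
availability_sli = successful_requests / total_requests * 100  # percentage

print(f"Availability SLI: {availability_sli:.3f}%")  # -> 99.999%
```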
Service-Level Agreement (SLA)
Second, most technologists are familiar with service-level agreements (SLAs). Even if you’re not, you’ve likely agreed to many throughout your history as a digital user. SLAs are like a pact between the software provider and the software user or client. These binding commitments often spell out availability expectations that must be upheld. SLAs may also include responsiveness to incidents and bugs. It depends on the contract, but if an SLA is broken, some kind of penalty may be incurred, such as a refund or a service subscription credit.
These days, the average business relies on many cloud-based SaaS, PaaS and IaaS offerings. Initiating something as simple as an online payment might require hitting multiple remote servers. But if all these services just went offline whenever they felt like it, our digital operations would come to a grinding halt. SLAs are thus necessary for enterprise software contractual agreements to ensure both parties meet specific standards.
Once an engineer tracks SLIs and has an idea of typical behavior, they can then set an SLA that makes sense. If we take the example above, an SRE may consider guaranteeing a looser threshold than what is actually being observed. This ensures that application performance, at its current pace, doesn’t break any legal or contractual obligations. In this scenario, perhaps the SLA includes an availability commitment: no more than 100 failed requests per one million requests made, which essentially equates to 99.99% uptime.
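As a rough illustration (not any particular vendor’s implementation), here is how a measured SLI could be checked against that hypothetical 99.99% SLA floor:

```python
# Hypothetical SLA floor from the example above: at most 100 failed requests
# per one million, i.e., 99.99% availability.
SLA_AVAILABILITY_PCT = 99.99

def meets_sla(total_requests: int, failed_requests: int) -> bool:
    """Return True if measured availability stays at or above the SLA floor."""
    measured_pct = (total_requests - failed_requests) / total_requests * 100
    return measured_pct >= SLA_AVAILABILITY_PCT

print(meets_sla(1_000_000, 10))   # True: 99.999% measured, above the floor
print(meets_sla(1_000_000, 150))  # False: 99.985% measured, SLA breached
```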
Service-Level Objective (SLO)
Lastly, service-level objectives (SLOs) are similar to SLAs but explicitly refer to the performance or reliability targets. An SLA may refer to specific SLOs. Or SLOs may be tracked just for internal purposes. As Google described, “the availability SLO in the SLA is normally a looser objective than the internal availability SLO.”
You don’t want the end users to be the first people clamoring about a 400%+ latency rise in your mobile web apps. Thus, SRE teams typically keep a close eye on performance to ensure they never even get close to breaking SLAs. Some do this by setting and monitoring internal baselines that are more ambitious than the SLA threshold.
To continue the example: if an SLA guarantees a service uptime of 99.99%, the business may set an internal target of 99.995%. In other words, for every one million requests, no more than 50 should fail. If software systems aren’t hitting these marks, it’s a sign that the company must reevaluate designs and search for bottlenecks. Or perhaps engineering teams have goals to reduce average downtime over the next quarter; an internal objective could then be set at a higher standard than currently measured performance.
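Putting the illustrative SLA and SLO figures side by side, a quick sketch like the following (all numbers hypothetical) shows the error budget each target allows over a million requests:

```python
TOTAL_REQUESTS = 1_000_000
SLA_TARGET_PCT = 99.99    # external, contractual commitment
SLO_TARGET_PCT = 99.995   # stricter internal objective

def error_budget(target_pct: float, total_requests: int) -> int:
    """Failed requests the target tolerates over the given request volume."""
    return round(total_requests * (1 - target_pct / 100))

print(error_budget(SLA_TARGET_PCT, TOTAL_REQUESTS))  # 100 failures allowed
print(error_budget(SLO_TARGET_PCT, TOTAL_REQUESTS))  # 50 failures allowed
```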
As the Google Cloud blog described, “every service should have an availability SLO—without it, your team and your stakeholders cannot make principled judgments.” It’s a good reminder to be modest in your reliability commitments, they added, as consumers will come to expect this level of performance. On the other hand, setting more ambitious internal performance targets has the benefit of delivering a better result than the agreement on paper, increasing a software service’s competitiveness.
Real-World Performance Benchmarks
Keep in mind that the above numbers are simply for demonstration purposes. One interesting resource for real-world figures is API.Expert, a service that queries popular APIs and posts weekly performance statistics. Since APIs are the heart of many UI-based platforms (and our digital economy at large), these benchmarks stand as a good indicator of average uptimes and latencies in the industry.
For example, at the time of writing, API.Expert’s Enterprise APIs collection ranked Microsoft Office365 and Pivotal Tracker at the top, both with a 100.00% pass rate and with latencies of 220 ms and 248 ms, respectively. On the other end of the spectrum, Docusign sat at 99.93% with an 877 ms latency and Box at 99.99% with a 414 ms latency.
SLIs, SLAs and SLOs—Oh My!
Although it may sound good to an unpracticed ear, an SLA of 99.99% still equates to 52 minutes and 36 seconds of downtime per year. That’s nearly an hour of downtime in which customers are left scratching their heads or, worse, searching for other options. In critical health care situations, a loss of connectivity could be a matter of life and death.
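For the arithmetic behind that figure, here is a small sketch that converts an availability target into an annual downtime budget, assuming a 365.25-day year (which is what yields 52 minutes and 36 seconds):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 seconds

def allowed_downtime(availability_pct: float) -> str:
    """Annual downtime budget for a given availability percentage."""
    downtime_seconds = SECONDS_PER_YEAR * (1 - availability_pct / 100)
    minutes, seconds = divmod(round(downtime_seconds), 60)
    return f"{minutes} min {seconds} s"

print(allowed_downtime(99.99))   # -> 52 min 36 s
print(allowed_downtime(99.999))  # -> 5 min 16 s
```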
Although creating SLAs and SLOs is important to gauge system health, the reality is that it can be challenging to track and enforce them. “These agreements—generally written by people who aren’t in the tech trenches themselves—often make promises that are difficult for teams to measure,” according to the Atlassian knowledge center.
In summary, SLIs demonstrate the real behavior of software systems. These metrics inform the creation of SLAs, which must be met to uphold B2B agreements. Those SLAs often reference specific service-level objectives (SLOs), which typically leave some breathing room between the committed targets and the measured SLIs. Lastly, in a digital economy with accelerating expectations, it makes sense to monitor internal SLOs and improve baselines over time.