The SRE Approach: The conundrum of choosing your failures

At some point, most tech teams face the question of which approach to take: count the failures, or count the percentage of failures.

Swapnil Mondal
3 min read · Nov 6, 2023
Is uptime really equivalent to Availability?

When a team starts out, one of the first things it decides is how to treat issues and whether every problem is worth the time. Teams choose their approach based on the size, complexity and impact of the application. For example, if the application is not mission critical, i.e. a failure has no major or real-time effect on the user, then its problems don’t necessarily need to be solved first. Compare that with a mission-critical application such as a payment gateway, where every minute lost is a real-time loss of revenue and the problem calls for a more immediate solution. So teams usually work out their priorities, the availability their service needs and the cost they are willing to bear for any downtime. From these they derive two things: the SLO (Service Level Objective) and the SLA (Service Level Agreement).
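To make "the cost they are willing to bear for any downtime" concrete, here is a minimal sketch that turns an availability target into a downtime allowance. The function name, the 30-day window and the sample targets are illustrative assumptions, not values from this article.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime a service can have in the window and still meet its SLO.

    slo_target is a fraction (0.999 means 99.9%). The 30-day window is an
    assumed default; use whatever period your SLO is actually measured over.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


# Illustrative targets only: each extra "nine" cuts the allowance tenfold.
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} over 30 days allows about {minutes:.1f} minutes of downtime")
```

A 99.9% target over 30 days leaves roughly 43 minutes, which is the kind of number a team can weigh against revenue lost per minute.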

Once this is settled, another problem comes into the picture: how do we calculate it? Should we count the number of users being impacted? But what if more users are coming in and the error counts don’t reflect it? How do we quantify the impact, and how do we set up our alerts so that we know there is actually a problem?

There are two ways to go about this, and the choice depends on how your team decides to work or, for an existing application or system, on how you quantify your severity levels for outages: by the number of users impacted or by the percentage of users impacted. This is where the main concern arises.

Let’s look at two scenarios:

Example 1: Say your application has a small user base, so you decide to count the users impacted by your downtime (whether due to a network issue, bad code, bad config or complete application failure *worst nightmare!). Then one fine weekend you get 3x traffic and all your alerts trigger: “everything is broken”! You log in to your system to check how long this is going to take, and you figure out that your alert was set for 10 user failures for every 100 users coming in, but now there are 1000 users and you still only see 20 users facing issues! *What!
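Example 1 can be replayed in a few lines. This is a sketch assuming the alert is a plain comparison against a fixed count; the function names are hypothetical and not taken from any monitoring tool.

```python
FAILED_USER_THRESHOLD = 10  # from Example 1: sized for about 100 users


def count_alert_fires(failed_users: int, threshold: int = FAILED_USER_THRESHOLD) -> bool:
    """Count-based rule: fire once the number of failing users reaches the threshold."""
    return failed_users >= threshold


def failure_rate(failed_users: int, total_users: int) -> float:
    """Share of users who hit a failure; 0.0 when there is no traffic."""
    return failed_users / total_users if total_users else 0.0


# The weekend from Example 1: 1000 users, 20 of them failing.
total_users, failed_users = 1000, 20
print("alert fires:", count_alert_fires(failed_users))
print(f"actual failure rate: {failure_rate(failed_users, total_users):.0%}")
```

The rule fires because 20 is above 10, even though only 2% of users are affected, far below the 10-in-100 ratio the threshold was sized for.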

Example 2: Your application is set up so that any time 10% of users face an issue, your alerts trigger, and everything is perfect. Then one weekend your alerts trigger showing 50% failure! Is my application down??? You come in to check, and it turns out there were 10 users and a whopping 5 of them faced an issue. *bummer.
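The same kind of sketch for Example 2, assuming the alert compares the failure ratio against 10%. The function name is hypothetical, and the second call (50 failures out of 1000 users) is an added contrast that does not appear in the example above.

```python
def percentage_alert_fires(failed_users: int, total_users: int, threshold: float = 0.10) -> bool:
    """Percentage-based rule: fire once the share of failing users reaches the threshold."""
    if total_users == 0:
        return False  # no traffic, nothing to alert on
    return failed_users / total_users >= threshold


# The weekend from Example 2: 10 users, 5 of them failing (50%).
print("10 users, 5 failing:", percentage_alert_fires(5, 10))

# Hypothetical contrast: ten times as many failing users, but only 5% of traffic.
print("1000 users, 50 failing:", percentage_alert_fires(50, 1000))
```

With very little traffic a handful of failures is enough to page someone, while a much larger absolute number of failing users can stay under the threshold when traffic is high.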

From these two examples it is clear that choosing one of the approaches is tougher than it looks, and it can make or break your long-term view of monitoring and alerting systems. These problems get even more complicated when you introduce multiple applications and multiple alerting applications.

Answer? Pretty simple: look at your application *go figure. Figure out how big your customer base is, approximately how much money each customer brings in, what kind of traffic your website gets and how failures are trending (if you already have a trend to check; otherwise keep an eye on your SLO). Based on these, one of the approaches will be more sensible. For e-commerce, a percentage-based failure alert might work better than alerting on a hard number of impacted users. A blog where people come to read something from your newsletter would be better suited to user-count-based alerts, since each of those users brings in some value (maybe money) and hence is very important. *so there you go!

On an ending note, it’s very easy to lose track of the real impact in search of “perfect uptime” or a “perfect system”, so teams will be far better off if the maturity of their application, their potential user base and their revenue dictate their approach. Engineering teams starting out might find counting impacted users more useful for alert thresholds than the percentage of users impacted, which is easier to calculate once the application system is mature enough.

Written by Swapnil Mondal

Site Reliability Engineer, currently at AT&T. Cloud, e-commerce, Ops, architecture, food and music are my thing.
