The more I think about availability and reliability, the more I come to the conclusion that availability is a hard metric that can be calculated through mathematical functions whereas reliability is a state that is deeply tied to the emotional response of a customer.
As my SRE learning path continues, I think it’s interesting to go over some of the key concepts that serve as the cornerstone of my motivation.
The topic of the day is Reliability.
What do you think of when I ask you to name something that is reliable? Honestly, the first thing that comes to mind is how unreliable things are around us. The kitchen scale that is sometimes dead-on and other times slightly off (pastry is not sensitive to measurements, as you know), the streaming service that goes from 4K crispness into the dreadful “buffering” zone (notice how I’m blaming the streaming service and not the internet provider), or that time when service x went down and I really needed it (let us ignore that it’s up more than 99.99% of the time).
It’s hard to remember critical but extremely reliable equipment, like a simple watch calculator that is never wrong when we ask it how much 1+1 is, or banking software that (at least for me – knock on wood) has never credited me more than it should.
Reliability vs Availability
You may have noticed that I was using two lines of thought to define reliability: the correctness of the outcome, and whether the service/product was up or not. These are actually two different concepts that we should dive into right now.
You can measure the availability of a service by the time it is online/up and running, whereas a service is reliable when it is online and delivering the correct and expected content to the customer.
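To put rough numbers on that distinction, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Probe` record, the two helper functions and the sample data are hypothetical, and treating reliability as "the share of checks that were both up and correct" is just one simple proxy for the definition above, not a standard formula.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One hypothetical health check against the service."""
    up: bool       # did the service answer at all?
    correct: bool  # was the answer what the customer expected?


def availability(probes: list[Probe]) -> float:
    """Share of checks where the service was online."""
    return sum(p.up for p in probes) / len(probes)


def reliability(probes: list[Probe]) -> float:
    """Share of checks where the service was online AND gave the expected result."""
    return sum(p.up and p.correct for p in probes) / len(probes)


# Made-up sample: 8 good answers, 1 wrong answer while online, 1 outage.
probes = (
    [Probe(up=True, correct=True)] * 8
    + [Probe(up=True, correct=False)]
    + [Probe(up=False, correct=False)]
)

print(f"availability: {availability(probes):.0%}")
print(f"reliability:  {reliability(probes):.0%}")
```

With this sample the service looks more available than it is reliable, because the wrong answer still counts as "up".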
When thinking about availability, nothing is 100% available: even if your service has 100% uptime, the infrastructure it stands on does not, and when talking about uptime you’re only as strong as your weakest link. The number of 9s describes the uptime of your service/product. For example:
- (2 nines) 99% uptime – 87h:36m downtime per year
- (3 nines) 99.9% uptime – 8h:45m downtime per year
- (4 nines) 99.99% uptime – 52.6m downtime per year
- (5 nines) 99.999% uptime – 5m:15s downtime per year
- (6 nines) 99.9999% uptime – 31s downtime per year
- (7 nines) 99.99999% uptime – 3s downtime per year
Note the progression and imagine the effort and cost associated with increasing a single 9 of uptime.
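If you want to check these figures yourself, the arithmetic is short. The sketch below assumes a 365-day year and truncates to whole seconds; the function names are mine, picked for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def yearly_downtime_seconds(nines: int) -> float:
    """Allowed downtime per year for an uptime with `nines` nines (2 -> 99%)."""
    return SECONDS_PER_YEAR / 10 ** nines


def format_duration(seconds: float) -> str:
    """Render a duration as hours, minutes and seconds (fractions are dropped)."""
    total = int(seconds)  # truncate, do not round
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


for nines in range(2, 8):
    uptime = 100 - 100 / 10 ** nines
    downtime = format_duration(yearly_downtime_seconds(nines))
    print(f"{nines} nines ({uptime:.5f}% uptime): {downtime} downtime per year")
```

Each extra nine divides the allowed downtime by ten, which is why every step up gets so much harder to reach.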
Think for a bit: what’s better, a platform that has 6 nines of uptime or one that has 3 nines of uptime? Imagine that the users of that platform are all from a single region or country and that the service has no usage for 3 hours every day. That adds up to over 1000 idle hours a year in which nobody would notice downtime. Is it still worth the effort of reducing the downtime from 8h/year to 31s/year?
Why are you spending an incredible amount of money and resources on increasing your uptime to levels that no one will notice?
How available your service needs to be is something that needs to be properly analyzed. Obviously, everyone wants to have the best product and the best services, always available, but the cost and effort associated with that might not make sense.
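The idle-hours comparison can be sketched in a few lines. This is a best-case illustration only: it assumes a 365-day year and, more importantly, that downtime could land entirely inside the hours when nobody uses the service, which real outages do not promise. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def idle_hours_per_year(idle_hours_per_day: float) -> float:
    """Hours per year in which the service has no usage."""
    return idle_hours_per_day * 365


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Downtime per year, in hours, implied by an uptime percentage."""
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100


idle = idle_hours_per_year(3)                 # 3 unused hours every day
three_nines = downtime_hours_per_year(99.9)
six_nines = downtime_hours_per_year(99.9999)

print(f"idle hours per year:      {idle:.0f}")
print(f"3 nines downtime (hours): {three_nines:.2f}")
print(f"6 nines downtime (hours): {six_nines:.4f}")
print(f"3 nines downtime fits in the idle window: {three_nines <= idle}")
```

Under those assumptions, even the 3 nines budget is a small fraction of the time nobody is looking, which is the point of the question above.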
I then start to think about the sense of betrayal one feels when something you take for granted fails.
Let’s say something fundamental like hot water or electricity. You’re counting on hot water when you step into the shower in the morning (I’m a morning shower guy), and when that fails I definitely go through all the phases of grief:
- Denial – Oh no no no no, this is not happening.
- Anger – Oh you little *****, give me hot water!
- Bargaining – Come on! Just for a minute…
- Depression – Of all the days for this to happen it had to happen today?!
- Acceptance – Arrgh **** it! You either step up and go full-on cold shower mode or snap out of it and just go on without a shower for now. But move it.
Even though I didn’t have hot water – lack of reliability – the water heater was always connected and turned on – available.
Every time I think about me going through all these emotions when trying to use a product, a service or a utility, I know for sure that I don’t want a customer of mine to go through the same issues.
Focusing on reliability, I think you can place your service/platform/product into one of three categories.
Let me start with the place I hope none of us is in: an unreliable system. Good news? You can only go up from here! From my perspective, your customers are in one of these places:
- Your customers don’t care about your product’s lack of reliability. Imagine a website that eventually will produce the desired outcome. All you have to do is click refresh a couple or a dozen times. If your customer is ok with it, perhaps you’re ok with it as well?
- You are still in an early stage (let’s say Beta, wink wink, NBA Top Shot in mid-February) and your customers understand that reliability is not yet your priority. You’re focused on feature work and on creating more value for the customers.
- Your customers are constantly frustrated. Whether you don’t care about reliability or don’t care about the experience of using your product, the fact of the matter is that your customers are not having a good time, and eventually people will leave your platform or product for a competitor.
Does not fail. Period!
The holy grail! You have an incredibly stable and reliable platform. The way your services interact with each other is extremely complex, but you are lord and commander of your metrics and understand what affects your customer’s experience.
To me, it’s really important to note that you don’t have to be in a critical environment like banking or healthcare to provide a service that is considered reliable to the customer.
Every single time it fails I take it personally
I guess most of our platforms stand here. It’s not a bad place to be. Your customers truly believe in you and are expecting nothing less than perfect. The problem? You have outages every now and again.
Previously, I listed where your customers could be. Now I’ll try to describe where I believe you are:
- You trust your platform and your product and don’t believe or understand all the complaints you read regarding the reliability you provide. To you, no evolution is necessary.
- You understand your metrics and know there is still work to be done. You want to improve and focus on increasing your reliability, providing an ever-improving service to your customers.
- You know how reliable you are and are actively working on improving your services. You’re already working with key concepts like SLIs and SLOs and are starting to take advantage of them.
If we stop right now and analyze our products, our platforms, our services or even the things we use around us, we’ll quickly be able to tell what we consider reliable or not, and highly available or not.
This is part of my SRE learning journey and I’m looking to share it as much as I can. The concept behind this is that whenever I write these posts I rely solely on my own interpretation of the materials I’ve researched in blog posts, books, etc. The inspiration for this method comes from an amazing book by Austin Kleon – Show Your Work!
My views and my interpretation might change as I go along the journey but I think it’s extremely useful to document where I am today. I hope that my way of thinking resonates with you, either because you agree, because you’ve never thought of it like that or because you think I’m blatantly wrong. Please reach out on Twitter and get in touch!
See you next time! ✌️💪