One of the main challenges this year is picking up two teams: Observability and SRE (Site Reliability Engineering). These are crucial themes for any tech company, particularly those that run platforms and put their customers first.
Since I arrived at the challenge with little knowledge of the subject, I started going through the standard reading list:
- SRE Book – Online | Amazon
- SRE Workbook – Online | Amazon
- Building Secure and Reliable Systems – Online | Amazon
The whole idea here is to share where I am on my SRE learning path. I started researching this topic a couple of weeks ago and this is where I am at the moment.
I know I’m not 100% right but I’m also not 100% wrong so this is my understanding of topics I consider key to start.
Now, these are some thoughts on these concepts. I’ve tried not to go back to the books while writing this, so that I describe my own understanding in my own words.
SLA stands for Service Level Agreement and it’s usually defined by the contractual relationship between multiple parties. Let’s say that our company signs a contract with company A stating that we will deliver X items per month. That contract has a clause stating that if we fail to deliver those items we’ll be penalized Y dollars. This clause can be considered an SLA.
Even though agreements like these are the backbone of so many business relations, Engineering and Ops teams are usually not bothered with these contractual affairs apart from the key part of how many items per month we need to deliver. That takes us to the next definition.
Service Level Objectives (SLOs) are key to the daily activity of SRE teams, to their relationship with other teams and to the state of a platform.
In the previous example, the SLO is the number of items we have agreed to deliver to company A. We know that we need to hit that mark, otherwise there will be consequences. Now, setting our objective to the exact number we need to deliver is cutting it a bit close.
With that in mind, we create internal and external SLOs. Our external SLO would be the X items we need to deliver, but internally we want our alarms to sound a bit earlier, so we define our internal SLO as a number that is, let’s say, 10 or 20% more conservative. This gives us time to react and to change whatever we need to ensure that we still deliver and meet our external SLO.
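To make the internal versus external split concrete, here is a minimal sketch. The function names, the 10% default margin and the delivery numbers are my own assumptions for illustration, and I'm reading "more conservative" as "aim to deliver more than the contract requires".

```python
def internal_target(external_target: float, margin: float = 0.10) -> float:
    """Stricter internal objective: `margin` above the external one (assumed rule)."""
    return external_target * (1 + margin)


def slo_status(delivered: float, external_target: float, margin: float = 0.10) -> str:
    """Classify a delivery count against the internal and external objectives."""
    if delivered >= internal_target(external_target, margin):
        return "ok"
    if delivered >= external_target:
        # Internal SLO missed, external SLO still met: time to react.
        return "warning"
    # External SLO missed: this is where the SLA penalty would apply.
    return "breach"


# Hypothetical month: the contract asks for 1000 items.
for delivered in (1150, 1050, 950):
    print(delivered, slo_status(delivered, external_target=1000))
```

The "warning" state is the whole point of the internal SLO: the alarm sounds while the external commitment is still intact.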
An SLO must be designed with the end user in mind. If your user has an extremely fast internet connection it might not make a difference if the response of your service is improved from 12ms to 11ms, especially if you consider the cost associated with this amount of fine tuning. It could mean more expensive hardware, faster database clusters, etc. If the customer won’t feel it then it’s not a good objective, at least for an SLO.
Now look at your service and digest what it does. How does it reach the customer? What part of the process does it impact? Do you have dependencies? Is your service a dependency? Understand how you can improve the customer’s experience and define your SLOs based on it.
With this you can go one of two ways. You can first define SLAs and create your SLOs based on how not to breach the SLAs, or, especially for companies that are now implementing SLOs, you can find how to improve the user experience and create customised SLOs. This second approach will drive eventual SLAs and possible contract improvements since now you offer the guarantee of a service – something that wasn’t available before.
The measurements that tell us where we stand against our SLO are called Service Level Indicators (SLIs). We can measure a lot of different properties: the items we produce (throughput), the amount of time it takes for our customers to get a response from our platform (latency), and how many requests we failed to answer or answered with the wrong data (error rate). All of these indicators allow teams to improve their systems. Transparent exposure of these metrics, and of how they compare to the product we want to sell, sometimes changes the strategic vision of an organisation.
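As a rough illustration of the three indicators mentioned here, this sketch computes throughput, latency and error rate from a list of request records. The `Request` type, the sample values and the nearest-rank percentile are assumptions of mine, not something prescribed by the books.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time the customer waited for a response
    ok: bool           # False if we failed or replied with wrong data


def throughput(requests: list[Request], window_seconds: float) -> float:
    """Requests handled per second over the observation window."""
    return len(requests) / window_seconds


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that failed."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if not r.ok) / len(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of latency (one of several common definitions)."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical sample: ten requests seen in a 5-second window, one failure.
latencies = [10, 12, 11, 250, 13, 12, 11, 14, 12, 900]
sample = [Request(latency_ms=ms, ok=(ms != 900)) for ms in latencies]

print("throughput (req/s):", throughput(sample, window_seconds=5))
print("error rate:", error_rate(sample))
print("median latency (ms):", latency_percentile(sample, 50))
```

Using a percentile instead of an average for latency keeps a couple of very slow requests from hiding behind many fast ones.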
Now we arrive at an incredibly interesting concept called Error Budget. Let’s say our SLO is related to uptime and that we agreed with our partners that the uptime of our platform is 99.95%. It might seem like an incredible amount of uptime but in reality, over a year, it translates to more than 4 hours with the service unavailable (that amount might make you rethink things, especially if you are in a critical environment).
Those 4 hours are our Error Budget. Every minute our platform is unavailable, our Error Budget decreases. Now, what makes this concept interesting is what happens, or should happen, if we use up our budget. According to the literature, all feature work should be frozen and teams should focus only on improving the overall state of the platform.
The first time I read this I thought of it as common sense but… Imagine the implications that this “no more feature work” has for partner relations, committed deliveries and features expected by customers. It takes very strong companies to have this mindset and a resolute buy-in from top executives. This is one of those SRE ways of living and not so much an SRE-fad.
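The arithmetic behind the budget is simple enough to sketch. I'm assuming the 99.95% is measured over a full year (that is what gives a bit more than 4 hours), and the function names and downtime figures below are made up for the example.

```python
def error_budget_minutes(slo: float, period_days: int = 365) -> float:
    """Allowed downtime, in minutes, for an uptime SLO over the period."""
    return (1 - slo) * period_days * 24 * 60


def remaining_budget_minutes(slo: float, downtime_minutes: float,
                             period_days: int = 365) -> float:
    """Budget left after subtracting the downtime already spent."""
    return error_budget_minutes(slo, period_days) - downtime_minutes


def should_freeze_features(slo: float, downtime_minutes: float,
                           period_days: int = 365) -> bool:
    """True once the budget is used up (the 'no more feature work' rule)."""
    return remaining_budget_minutes(slo, downtime_minutes, period_days) <= 0


budget = error_budget_minutes(0.9995)
print(f"yearly budget: {budget / 60:.2f} hours")  # about 4.38 hours

# Hypothetical year so far: 100 minutes of downtime, then a long outage.
print(should_freeze_features(0.9995, downtime_minutes=100))
print(should_freeze_features(0.9995, downtime_minutes=300))
```

Changing `period_days` shows how tight the same SLO becomes on a shorter window: over 30 days, 99.95% leaves only about 21.6 minutes.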