What’s an SRE? Explaining the main concepts from IBM’s SRE learning path

Cristian Desivo
Aug 26, 2021


Learning path for IBM’s associate SRE certificate

I took the Site Reliability Engineering learning path from IBM, and going in I had a lot of questions. Perhaps the most important was:

What’s a site reliability engineer?

“An SRE leverages operations data and software engineering to automate IT operations tasks, and to accelerate software delivery while minimizing IT risk”

SRE is a new approach to operations and development. It focuses on automating IT operations through runbooks and on safely accelerating software delivery through continuous deployment.

A site reliability engineer has to pay close attention to the site’s performance metrics, including incident and troubleshooting data, to guide their priorities.

Using all these tools, an SRE helps the site meet its service level objectives (SLOs) and, in turn, its service level agreements (SLAs).

To become an SRE, one must understand and apply these concepts, which I’ll try to explain in the following sections.

Metrics, Objectives and Agreements

When running a site for a client, several agreements can be made that define the expected site performance and the penalties for falling short. These are called service level agreements (SLAs). They must reference objective, clearly defined service level metrics, for example, “the site must be available at least 95% of the time”. To comply with these agreements, it is recommended to set stricter service level objectives (SLOs) to aim for. That way there is some slack between missing an objective and breaching the actual agreement.
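To make that slack concrete, here is a small arithmetic sketch in Python. The 95% figure comes from the example above; the 99% objective and the 30-day window are assumptions picked only for illustration, not values from the IBM course.

```python
# Allowed downtime for an availability target over a fixed window.
# The 99% SLO and the 30-day window are illustrative assumptions.

def allowed_downtime_hours(target_percent: float, window_days: int = 30) -> float:
    """Hours the site may be down in the window without missing the target."""
    window_hours = window_days * 24
    return window_hours * (100 - target_percent) / 100

sla_budget = allowed_downtime_hours(95.0)   # agreement with the client
slo_budget = allowed_downtime_hours(99.0)   # stricter internal objective
slack = sla_budget - slo_budget             # room between missing the SLO and breaching the SLA

print(f"SLA budget: {sla_budget:.1f} h")
print(f"SLO budget: {slo_budget:.1f} h")
print(f"Slack: {slack:.1f} h")
```

Under these assumptions, the site could miss its objective by many hours in a month before the agreement itself is at risk.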

The most commonly used metrics to describe SLOs and SLAs are:

Availability: the percentage of time a service remains operational, as measured by response time and throughput.

Performance: a service’s ability to respond to requests and process transactions in a timely and correct manner, as measured by load and speed.

Reliability: the probability that a system will meet performance standards over a duration of time, as measured by the frequency and impact of failures.
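As a rough illustration of how the first of these metrics might be computed, the sketch below derives an availability percentage and a failure count from a list of outage durations. It simplifies availability to plain uptime, and the outage data, the 30-day window, and the function name are all hypothetical.

```python
# Availability as the percentage of a window in which the service was up.
# Outage durations and the window length are made-up sample values.

def availability_percent(outage_minutes: list[float], window_minutes: float) -> float:
    """Percentage of the window during which the service was operational."""
    downtime = sum(outage_minutes)
    return 100 * (window_minutes - downtime) / window_minutes

window = 30 * 24 * 60            # 30 days, in minutes
outages = [12.0, 45.0, 3.0]      # three hypothetical incidents

availability = availability_percent(outages, window)
failure_count = len(outages)     # frequency of failures, one input to reliability

print(f"Availability: {availability:.3f}%")
print(f"Failures in window: {failure_count}")
```

The resulting percentage can then be compared against the SLO and SLA targets.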

Incident management

Resiliency, the capacity to recover from failures, is an essential part of a reliable site.

Incidents happen all the time on every site, so it is important to be prepared for them and to have a plan to deal with them.

The SRE helps deal with incidents by taking operational responsibility for supporting applications and services. They monitor resources to detect outages, receive data from subject matter experts, and create and maintain troubleshooting and incident resolution runbooks for first responders to follow.

Runbooks and automation

A runbook is a document detailing step-by-step instructions to complete operational tasks and perform troubleshooting.

Script-based operational tasks are defined, built, orchestrated, automated, and managed as runbooks. The first-responder team is equipped with automation and well-defined runbooks to resolve issues instantly.

Runbooks guide the administrator through those steps.

One of the roles of an SRE is to create runbooks and keep them up to date, so that first responders can follow them to complete an operational task. This can be seen as a middle step between manual task resolution and full automation of the task. It is faster than manual resolution but less intrusive than an automated process.

When people are more comfortable with the runbook process, the invocation can be changed to a semi-automatic or fully automated mode.
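The progression from manual to semi-automatic to fully automated invocation could be modeled roughly like this. It is only a sketch: the `Mode`, `Step`, and `Runbook` names, the disk-full scenario, and the string-returning actions are all invented for illustration and do not come from any IBM tool.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class Mode(Enum):
    MANUAL = "manual"            # a person reads and performs each step
    SEMI_AUTOMATIC = "semi"      # scripts run, but a person confirms each one
    AUTOMATIC = "auto"           # scripts run with no confirmation

@dataclass
class Step:
    instruction: str             # what a first responder would read
    action: Callable[[], str]    # script equivalent of the instruction

@dataclass
class Runbook:
    name: str
    steps: list[Step] = field(default_factory=list)

    def run(self, mode: Mode, confirm: Callable[[str], bool] = lambda _: True) -> list[str]:
        """Return a log of what happened to each step under the given mode."""
        log = []
        for step in self.steps:
            if mode is Mode.MANUAL:
                log.append(f"TODO: {step.instruction}")
            elif mode is Mode.SEMI_AUTOMATIC and not confirm(step.instruction):
                log.append(f"SKIPPED: {step.instruction}")
            else:
                log.append(f"DONE: {step.action()}")
        return log

# Hypothetical runbook for a full disk; the actions only return strings here.
runbook = Runbook("disk-full", [
    Step("Delete old log files", lambda: "old logs deleted"),
    Step("Restart the web service", lambda: "web service restarted"),
])

manual_log = runbook.run(Mode.MANUAL)
auto_log = runbook.run(Mode.AUTOMATIC)
# In semi-automatic mode the responder declines the restart step.
semi_log = runbook.run(Mode.SEMI_AUTOMATIC, confirm=lambda text: "Restart" not in text)
```

The same runbook definition serves all three modes, which is what lets a team switch modes gradually without rewriting the steps.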

A key aspect of Incident Management is to restore the service as quickly as possible. Automated tasks can usually complete these steps faster than any administrator. To enhance the toolchain for Incident Management, it is recommended to include automated mitigation actions that are run after a defined alert is received. If no mitigation action is defined, automation can help in the investigation phase by taking a snapshot of critical system states, such as process lists and trace routes.
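A minimal sketch of that idea: run the mitigation defined for an alert, and if none is defined, fall back to a snapshot of system state. The alert names, the `MITIGATIONS` table, and the stand-in collectors are assumptions made up for this example; a real setup would hook into whatever alerting tool the site uses.

```python
from typing import Callable

# Hypothetical mapping from alert names to mitigation actions.
MITIGATIONS: dict[str, Callable[[], str]] = {
    "disk_full": lambda: "rotated logs",
    "service_down": lambda: "restarted service",
}

def take_snapshot(collectors: dict[str, Callable[[], str]]) -> dict[str, str]:
    """Capture critical system state for the investigation phase."""
    return {name: collect() for name, collect in collectors.items()}

def handle_alert(alert: str, collectors: dict[str, Callable[[], str]]) -> dict:
    """Run the mitigation defined for the alert, or fall back to a snapshot."""
    mitigation = MITIGATIONS.get(alert)
    if mitigation is not None:
        return {"alert": alert, "mitigation": mitigation()}
    return {"alert": alert, "snapshot": take_snapshot(collectors)}

# Stand-in collectors; real ones might call tools such as ps or traceroute.
collectors = {
    "process_list": lambda: "pid 1 init",
    "trace_route": lambda: "1 hop to gateway",
}

known = handle_alert("disk_full", collectors)        # has a mitigation
unknown = handle_alert("high_latency", collectors)   # no mitigation defined
```

Here `known` records the mitigation that ran, while `unknown` carries the snapshot for whoever investigates next.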

The end goal for a runbook is full automation, so that it can be added to the site’s pipeline and run when an incident occurs.

Continuous deployment

The roles of SREs

Continuous deployment begins with continuous integration.

The goal of Continuous Integration (CI) is to integrate and test the system on every change to minimize the time between injecting a defect and correcting it.

The next step towards continuous deployment is continuous delivery. Continuous delivery is the automation of the processes (mainly tests and integrations) that deploy software changes.

Continuous delivery picks up where continuous integration ends, automating the delivery of applications to selected infrastructure environments.

It automates pushing code changes to different environments, such as development, testing, and production.

Once every process in the deployment pipeline is completely automated, we have achieved continuous deployment.
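One way to picture the three stages is to label each pipeline step as automated or not and see which practice the pipeline has reached. The sketch below is my own simplification of the definitions above; the `Stage` class, the "integration" and "delivery" labels, and the classification rules are assumptions, not part of any CI/CD product.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    kind: str         # "integration" or "delivery" (labels chosen for this sketch)
    automated: bool   # False means a person has to trigger or approve the stage

def classify(pipeline: list[Stage]) -> str:
    """Name the practice a pipeline has reached, following the text above."""
    integration = [s for s in pipeline if s.kind == "integration"]
    delivery = [s for s in pipeline if s.kind == "delivery"]
    if not all(s.automated for s in integration):
        return "not yet continuous integration"
    if delivery and all(s.automated for s in delivery):
        return "continuous deployment"
    if any(s.automated for s in delivery):
        return "continuous delivery"
    return "continuous integration"

# Hypothetical pipeline: everything is automated except the production push.
pipeline = [
    Stage("build", "integration", True),
    Stage("test", "integration", True),
    Stage("deploy to development", "delivery", True),
    Stage("deploy to testing", "delivery", True),
    Stage("deploy to production", "delivery", False),
]

before = classify(pipeline)
pipeline[-1].automated = True   # automate the last manual step
after = classify(pipeline)

print(before)
print(after)
```

Automating the final manual step is what moves this example pipeline from continuous delivery to continuous deployment.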

Bonus: Pearson-Vue exams

Part of the IBM SRE learning path is taking an exam at Pearson-Vue to get the SRE certificate.

Pearson-Vue provides a platform for taking exams online.

The first thing you need to do is register and schedule the exam you want.

Then you have to download OnVue’s software, which checks whether your computer is suitable for taking the exam.

You can prepare by taking a sample training exam, but other than that you have to study on your own until the exam date arrives.

On the exam day you need to check in 30 minutes before the scheduled time. After verifying your identity and showing that your work environment complies with Vue’s rules, you can start your exam.

The SRE exam consisted of multiple choice questions (some with more than one answer) about the topics covered in the many courses of IBM’s learning path. You have 90 minutes to complete it, and once you finish you instantly learn your score and whether you passed!

My thoughts

The SRE learning path and the SRE role itself have a lot to offer the industry. I’d recommend this course to cloud users who are passionate about automation and want a new challenge.

It was a fun experience and I hope I was able to at least give you a good sense of what an SRE is. Chau!
