
Beyond SRE: What Is CRE (Customer Reliability Engineering)?

SparkFabrik Team, 6 min read

Does implementing the SRE approach seem impossible? With CRE it becomes feasible, thanks to the active support of a service provider that takes care of reliability. Discover how it works and when to choose it.

For several years now, we have been hearing more and more about CRE, an acronym for Customer Reliability Engineering. We could define CRE as a pragmatic, customer-oriented way of applying SRE practices.

In this model, the customer is provided with the skills to manage their own application, while the CRE provider takes care of managing the aspects most closely related to reliability.

We can therefore identify two profiles:

  • the company / customer who is responsible for the application;
  • the CRE provider who offers hands-on support and training to the company in addition to directly handling reliability.

The presence of two actors entails a functional division of tasks for the implementation of SRE practices, where each one brings their own specific expertise to strive for excellence.

To understand in more detail how CRE works, we need to introduce the concepts of Cloud Native Journey and SRE.

Cloud Native Journey: The Starting Point

It would not make sense to talk about CRE and SRE without framing them within a very specific paradigm: the Cloud Native one. In fact, Cloud Native applications are, if not reliable, at least ready to be evaluated in terms of reliability. Without this requirement, it would be impossible to apply either SRE logic or, even less so, CRE logic!

Adopting Cloud Native requires a technological adaptation, but not only that. Being truly Cloud Native also means undertaking a significant cultural and organizational change: it is in every respect a journey that we can imagine as being divided into various stages.

Starting a modernization project by planning a specific Cloud Native Journey allows you to map the current status and the roadmap to reach the desired state. Not all organizations start from the same situation; that is why it is essential, before talking about SRE or CRE, to start a customized transformation project.

Once you have reached the end of the journey, the question will naturally arise: “How do we manage reliability?”. Two paths emerge: collaborating with a CRE provider that takes on this aspect, or managing it entirely in-house. Further on, we will see which criteria to consider when making this choice.

SRE (Site Reliability Engineering) in Brief

As explained in our introductory article on SRE, Site Reliability Engineering refers to a set of principles, practices, and organizational constructs that make it possible both to increase system reliability and to keep innovating the features offered, based on business needs.

The SRE approach therefore presupposes a dual focus: on one hand, guaranteeing reliability and a certain percentage of uptime; on the other, continuously improving the application. A balance must be found between these two objectives.

This is where the so-called “error budget” comes into play, a concept effectively explained by Google in a blog post. Let’s assume that a system must guarantee 99.9% uptime: over a 30-day month, this leaves a maximum of about 43 minutes of downtime (the missing 0.1% of 43,200 minutes). As long as you stay within those 43 minutes, the team can focus on developing new features and improvements. However, from the moment the error budget is exceeded, 100% of the time must be dedicated to solving the problem causing the excessive downtime.
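The arithmetic behind the error budget can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a 30-day month, downtime measured in minutes, and function names (`error_budget_minutes`, `team_focus`) that are hypothetical and do not come from any specific tool.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability target such as 0.999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = period_days * 24 * 60  # 43,200 minutes in a 30-day month
    return (1.0 - slo) * total_minutes


def team_focus(slo: float, downtime_minutes: float, period_days: int = 30) -> str:
    """Return where the team should spend its time (hypothetical policy).

    Within the budget: new features. Budget exceeded: reliability work only.
    """
    remaining = error_budget_minutes(slo, period_days) - downtime_minutes
    return "features" if remaining >= 0 else "reliability"


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"Monthly error budget at 99.9%: {budget:.1f} minutes")
    print("20 min of downtime ->", team_focus(0.999, 20))
    print("50 min of downtime ->", team_focus(0.999, 50))
```

With a 99.9% target, 20 minutes of downtime leaves the team free to work on features, while 50 minutes puts it over budget.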

Thanks to SRE, it is therefore possible to understand where to focus the team’s efforts based on data. This is precisely why it is essential to create monitoring dashboards capable of flagging any anomalies.

CRE: What Problems Does It Solve?

We have covered all the necessary prerequisites, and now all that remains is to dive deeper into the topic of Customer Reliability Engineering.

This approach was created to solve a very common problem directly related to the migration of systems to the Cloud. Although this process brings undeniable benefits, the Cloud indirectly causes a drawback that must be taken into account.

In fact, regardless of the quality of the preliminary Cloud training, sooner or later every company reaches a phase of uncertainty and anxiety regarding system management. This is understandable: the Cloud is perceived as something over which you do not have full control, unlike on-premise machines.

Cloud providers usually respond to these uncertainties with white papers and checklists, which do little to dispel them. Basically, it is a bit like telling the CIO: “If you follow these best practices, you can be fairly confident that your application will run correctly.” But “fairly confident” is not enough when service quality and company revenue are at stake.

And this is where CRE solutions can truly make a difference. The CRE provider offers the technical and cultural tools needed to manage applications within the SRE framework. It does not stop at informational materials that the company must then act on independently. Alongside concrete, data-based information, it gives precise guidance on which Cloud Native products to adopt and on how to configure those software solutions and infrastructure components effectively.

But that’s not all: the CRE provider takes on the continuous assessment of infrastructure reliability, relieving the company’s teams of this concern. It is therefore the provider who creates and constantly monitors the dashboards and indicates where to focus attention (based, among other things, on the error budget parameter).
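As a rough illustration of how a dashboard might indicate where to focus attention, the sketch below ranks services by the share of their error budget already consumed. It assumes a request-based budget (failed requests versus allowed failures) rather than the time-based one described earlier. The `ServiceWindow` type, its fields, and the sample figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one service over one measurement window (hypothetical)."""
    name: str
    slo: float            # availability target, e.g. 0.999
    total_requests: int
    failed_requests: int


def budget_consumed(window: ServiceWindow) -> float:
    """Fraction of the error budget used; values above 1.0 mean it is exceeded."""
    allowed_failures = (1.0 - window.slo) * window.total_requests
    if allowed_failures == 0:
        # A 100% target (or no traffic) leaves no room for failures.
        return float("inf") if window.failed_requests else 0.0
    return window.failed_requests / allowed_failures


def prioritize(windows: list[ServiceWindow]) -> list[ServiceWindow]:
    """Order services so the one burning its budget fastest comes first."""
    return sorted(windows, key=budget_consumed, reverse=True)


if __name__ == "__main__":
    services = [
        ServiceWindow("search", slo=0.99, total_requests=500_000, failed_requests=1_000),
        ServiceWindow("checkout", slo=0.999, total_requests=1_000_000, failed_requests=1_500),
    ]
    for s in prioritize(services):
        print(f"{s.name}: {budget_consumed(s):.0%} of error budget consumed")
```

Ranking by budget consumed, rather than by raw error counts, keeps the priorities tied to each service's own reliability target.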

In essence, it becomes possible to obtain all the benefits of the SRE approach without having to invest in internal resources to handle it.

How Does CRE Work Exactly?

As already mentioned, from a CRE perspective, activities are divided between the company, which is responsible for innovation, and the provider, which is in charge of infrastructure reliability.

What does the company handle?

The company is responsible for developing the application by following a set of best practices that allow the application itself to be reliable.

It is not just a matter of avoiding or quickly fixing bugs, but also of implementing certain solutions that allow you to monitor efficiency. For example, you need to ensure that the application does not consume too many resources and therefore does not drive up costs.

What does the CRE service provider handle?

In summary, the activities the provider handles are:

  • Guidance along the journey toward Cloud Native, which, as noted above, is the essential step zero for doing SRE.
  • Monitoring infrastructure reliability, constant oversight, and sending clear guidance about the priorities IT teams should follow.
  • The production readiness review, an assessment of the current state and the changes needed to reach an acceptable level of reliability. In other words, the provider sets the parameters and guidelines to follow, personally verifying their implementation.

When Should You Choose a CRE Approach?

There are several situations where it is preferable to rely on a CRE service provider. A first example is that of companies whose Cloud Native Journey has just begun or that lack the internal skills necessary to manage SRE activities. In these situations, it is generally better to avoid the do-it-yourself approach, because it could prove very costly in terms of performance and effort invested.

A second consideration concerns the composition of the IT team. Where there is a strong imbalance of skills toward development, for example, it may be more cost-effective to rely on a CRE provider rather than hiring infrastructure-dedicated staff.

Finally, the intervention of a specialized provider is advisable where there are serious reliability problems. In this case, the priority is to resolve the critical issues in order to improve service quality; it is therefore necessary to understand and apply best practices, implement the right tools, and set up a monitoring system. The support of a CRE provider helps at every one of these steps.
