Site reliability engineering

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to operations, with the goal of creating ultra-scalable and highly reliable software systems. According to Ben Treynor, founder of Google's Site Reliability Team, SRE is "what happens when a software engineer is tasked with what used to be called operations."[1]

History

Site reliability engineering was created at Google around 2003, when Ben Treynor was hired to lead a team of seven software engineers running a production environment. The team was tasked with making Google's sites run smoothly, efficiently, and more reliably. Early on, Google's systems operated at a scale that had never existed before, which required the company to develop new paradigms for managing them while continuously introducing new features and maintaining a very high-quality end-user experience.

The SRE organization at Google now numbers more than 1,500 engineers. Many products are supported by small to medium-sized SRE teams, although not all products have SREs. The SRE processes honed over the years are being adopted by other, mainly large-scale, companies that are starting to implement this paradigm. Microsoft, Apple, Twitter, Facebook, Dropbox, Amazon, IBM, Xero and Oracle have all put together SRE teams.

Roles

A site reliability engineer (SRE) will ideally spend up to 50% of their time on "ops" related work such as handling issues, on-call duty, and manual intervention. Since the software system that an SRE oversees is expected to be highly automated and self-healing, the SRE should spend the other 50% of their time on development tasks such as new features, scaling, or automation. The ideal SRE candidate is a coder who also has operational, systems, or networking knowledge and likes to whittle down complex tasks.
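
As a rough illustration of the 50% guideline, the following Python sketch computes the share of logged time spent on ops work and flags when it exceeds the cap. The `WeekLog` record, the function names, and the example hours are hypothetical and are not taken from any published SRE tooling.

```python
from dataclasses import dataclass

OPS_CAP = 0.5  # the "up to 50%" ceiling on ops work described above


@dataclass
class WeekLog:
    """Hypothetical record of the hours one SRE logged in a week."""
    ops_hours: float  # issues, on-call, manual intervention
    dev_hours: float  # new features, scaling, automation


def ops_fraction(log: WeekLog) -> float:
    """Return the share of logged time spent on ops work (0.0 if nothing logged)."""
    total = log.ops_hours + log.dev_hours
    if total == 0:
        return 0.0
    return log.ops_hours / total


def over_ops_cap(log: WeekLog, cap: float = OPS_CAP) -> bool:
    """Return True when ops work exceeds the cap."""
    return ops_fraction(log) > cap


# Example with made-up numbers: 26 ops hours against 14 development hours.
week = WeekLog(ops_hours=26, dev_hours=14)
print(f"ops share: {ops_fraction(week):.0%}, over cap: {over_ops_cap(week)}")
```

A team that repeatedly exceeds the cap might treat that as a signal to invest more time in automation, in keeping with the split described above.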

DevOps vs SRE

DevOps, a term coined around 2008, is a practice that encompasses the automation of manual tasks, continuous integration, and continuous delivery. It applies to a wide range of companies, whereas SRE might be considered a subset of DevOps that calls for additional skill sets.

References

  1. Donald Fischer, "Are SRE the next data scientists?", TechCrunch, March 2, 2016.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.