Google SRE

Foreword

Google's story is a story of scaling up. It is one of the great success stories of the computing industry,

Preface

Software engineering has this in common with having children: the labor before the birth is painful

1. Introduction

Hope is not a strategy. It is a truth universally acknowledged that systems do not run thems

2. The Production Environment at Google, from the Viewpoint of an SRE

Google datacenters are very different from most conventional datacenters and small-scale server farms. These

3. Embracing Risk

You might expect Google to try to build 100% reliable services—ones that never fail. It turns out

4. Service Level Objectives

It’s impossible to manage a service correctly, let alone well, without understanding which behav

5. Eliminating Toil

If a human operator needs to touch your system during normal operations, you have a bug. The definition of n

6. Monitoring Distributed Systems

Google’s SRE teams have some basic principles and best practices for building successful monitoring and aler

7. The Evolution of Automation at Google

Besides black art, there is only automation and mechanization. For SRE, automation is a force multiplier

8. Release Engineering

Release engineering is a relatively new and fast-growing discipline of software engineering that can be conc

9. Simplicity

The price of reliability is the pursuit of the utmost simplicity. Software systems are inherentl

10. Practical Alerting

May the queries flow, and the pager stay silent. Monitoring, the bottom layer of the Hierarchy of Producti

11. Being On-Call

Being on-call is a critical duty that many operations and engineering teams must undertake in order to keep th

12. Effective Troubleshooting

Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is g

13. Emergency Response

Things break; that’s life. Regardless of the stakes involved or the size of an organization, one tra

14. Managing Incidents

Effective incident management is key to limiting the disruption caused by an incident and restoring normal bus

15. Postmortem Culture: Learning from Failure

The cost of failure is education. As SREs, we work with large-scale, complex, distributed system

16. Tracking Outages

Following Escalator’s example, where we added useful features to existing infrastructure, we created a sys

17. Testing for Reliability

If you haven't tried it, assume it's broken. One key responsibility of Site Reliability Engineer

18. Software Engineering in SRE

Ask someone to name a Google software engineering effort and they’ll likely list a consumer-facing product lik

19. Load Balancing at the Frontend

We serve many millions of requests every second and, as you may have already guessed, we use more than a sin

20. Load Balancing in the Datacenter

This chapter focuses on load balancing within the datacenter. Specifically, it discusses algorithms for dist

21. Handling Overload

Avoiding overload is a goal of load balancing policies. But no matter how efficient your load balancing poli

22. Addressing Cascading Failures

If at first you don't succeed, back off exponentially. Why do people always forget that you n

23. Managing Critical State: Distributed Consensus for Reliability

Processes crash or may need to be restarted. Hard drives fail. Natural disasters can take out several datacent

24. Distributed Periodic Scheduling with Cron

Written by Štěpán Davidovič. Edited by Kavita Guliani. This chapter describes Google's

25. Data Processing Pipelines

This chapter focuses on the real-life challenges of managing data processing pipelines of depth and complexi

26. Data Integrity: What You Read Is What You Wrote

What is "data integrity"? When users come first, data integrity is whatever users think it is.

27. Reliable Product Launches at Scale

Internet companies like Google are able to launch new products and features in far more rapid iterations t

28. Accelerating SREs to On-Call and Beyond

How Can I Strap a Jetpack to My Newbies While Keeping Senior SREs Up to Speed? You’ve Hired

29. Dealing with Interrupts

Humans are imperfect machines. They get bored, they have processors (and sometimes UIs) that aren’t very w

30. Embedding an SRE to Recover from Operational Overload

It's standard policy for Google's SRE teams to evenly split their time between projects and reac

31. Communication and Collaboration in SRE

The organizational position of SRE in Google is interesting, and has effects on how we communica

32. The Evolving SRE Engagement Model

SRE Engagement: What, How, and Why. We've discussed in most of the rest of this book what hap

33. Lessons Learned from Other Industries

A deep dive into SRE culture and practices at Google naturally leads to the question of how othe

34. Conclusion

I read through this book with enormous pride. From the time I began working at Excite in the early ’90s,