-
Foreword
Google's story is a story of scaling up. It is one of the great success stories of the computing industry,
-
Preface
Software engineering has this in common with having children: the labor before the birth is painful
-
1. Introduction
Hope is not a strategy. It is a truth universally acknowledged that systems do not run thems
-
2. The Production Environment at Google, from the Viewpoint of an SRE
Google datacenters are very different from most conventional datacenters and small-scale server farms. These
-
3. Embracing Risk
You might expect Google to try to build 100% reliable services—ones that never fail. It turns out
-
4. Service Level Objectives
It’s impossible to manage a service correctly, let alone well, without understanding which behav
-
5. Eliminating Toil
If a human operator needs to touch your system during normal operations, you have a bug. The definition of n
-
6. Monitoring Distributed Systems
Google’s SRE teams have some basic principles and best practices for building successful monitoring and aler
-
7. The Evolution of Automation at Google
Besides black art, there is only automation and mechanization. For SRE, automation is a force multiplier
-
8. Release Engineering
Release engineering is a relatively new and fast-growing discipline of software engineering that can be conc
-
9. Simplicity
The price of reliability is the pursuit of the utmost simplicity. Software systems are inherentl
-
10. Practical Alerting
May the queries flow, and the pager stay silent. Monitoring, the bottom layer of the Hierarchy of Producti
-
11. Being On-Call
Being on-call is a critical duty that many operations and engineering teams must undertake in order to keep th
-
12. Effective Troubleshooting
Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is g
-
13. Emergency Response
Things break; that’s life. Regardless of the stakes involved or the size of an organization, one tra
-
14. Managing Incidents
Effective incident management is key to limiting the disruption caused by an incident and restoring normal bus
-
15. Postmortem Culture: Learning from Failure
The cost of failure is education. As SREs, we work with large-scale, complex, distributed system
-
16. Tracking Outages
Following Escalator’s example, where we added useful features to existing infrastructure, we created a sys
-
17. Testing for Reliability
If you haven't tried it, assume it's broken. One key responsibility of Site Reliability Engineer
-
18. Software Engineering in SRE
Ask someone to name a Google software engineering effort and they’ll likely list a consumer-facing product lik
-
19. Load Balancing at the Frontend
We serve many millions of requests every second and, as you may have already guessed, we use more than a sin
-
20. Load Balancing in the Datacenter
This chapter focuses on load balancing within the datacenter. Specifically, it discusses algorithms for dist
-
21. Handling Overload
Avoiding overload is a goal of load balancing policies. But no matter how efficient your load balancing poli
-
22. Addressing Cascading Failures
If at first you don't succeed, back off exponentially. Why do people always forget that you n
-
23. Managing Critical State: Distributed Consensus for Reliability
Processes crash or may need to be restarted. Hard drives fail. Natural disasters can take out several datacent
-
24. Distributed Periodic Scheduling with Cron
Written by Štěpán Davidovič
Edited by Kavita Guliani
This chapter describes Google's
-
25. Data Processing Pipelines
This chapter focuses on the real-life challenges of managing data processing pipelines of depth and complexi
-
26. Data Integrity: What You Read Is What You Wrote
What is "data integrity"? When users come first, data integrity is whatever users think it is.
-
27. Reliable Product Launches at Scale
Internet companies like Google are able to launch new products and features in far more rapid iterations t
-
28. Accelerating SREs to On-Call and Beyond
How Can I Strap a Jetpack to My Newbies While Keeping Senior SREs Up to Speed?
You’ve Hired
-
29. Dealing with Interrupts
Humans are imperfect machines. They get bored, they have processors (and sometimes UIs) that aren’t very w
-
30. Embedding an SRE to Recover from Operational Overload
It's standard policy for Google's SRE teams to evenly split their time between projects and reac
-
31. Communication and Collaboration in SRE
The organizational position of SRE in Google is interesting, and has effects on how we communica
-
32. The Evolving SRE Engagement Model
SRE Engagement: What, How, and Why
We've discussed in most of the rest of this book what hap
-
33. Lessons Learned from Other Industries
A deep dive into SRE culture and practices at Google naturally leads to the question of how othe
-
34. Conclusion
I read through this book with enormous pride. From the time I began working at Excite in the early ’90s,