Site Reliability Engineering

How Google Runs Production Systems

First edition.
  • 3.8 (10 ratings)
  • 35 Want to read
  • 6 Currently reading
  • 13 Have read

Last edited by Drini
November 7, 2025

"The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient - lessons directly applicable to your organization. This book is divided into four sections: Introduction - Learn what site reliability engineering is and why it differs from conventional IT industry practices; Principles - Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE); Practices - Understand the theory and practice of an SRE’s day-to-day work: building and operating large distributed computing systems; Management - Explore Google's best practices for training, communication, and meetings that your organization can use."--Publisher's description.

Language
English
Pages
524

Edition Availability
Site Reliability Engineering: How Google Runs Production Systems
2017, Google, O'Reilly Media
Web Book in English
Site Reliability Engineering: How Google Runs Production Systems
2016, O'Reilly Media, Inc.
Paperback in English - First edition.

Book Details


Table of Contents

Introduction. The production environment at Google, from the viewpoint of an SRE
Principles. Embracing risk
Service level objectives
Eliminating toil
Monitoring distributed systems
The evolution of automation at Google
Release engineering
Simplicity
Practices. Practical alerting from time-series data
Being on-call
Effective troubleshooting
Emergency response
Managing incidents
Postmortem culture: learning from failure
Tracking outages
Testing for reliability
Software engineering in SRE
Load balancing at the frontend
Load balancing in the datacenter
Handling overload
Addressing cascading failures
Managing critical state: distributed consensus for reliability
Distributed periodic scheduling with Cron
Data processing pipelines
Data integrity: what you read is what you wrote
Reliable product launches at scale
Management. Accelerating SREs to on-call and beyond
Dealing with interrupts
Embedding an SRE to recover from operational overload
Communication and collaboration in SRE
The evolving SRE engagement model
Conclusions. Lessons learned from other industries.

Edition Notes

Includes bibliographical references (pages 501-512) and index.

Classifications

Dewey Decimal Class
620.00452 SIT
Library of Congress
HD9696.8.U64 G6666 2016, QA76.77

The Physical Object

Format
Paperback
Pagination
xxiv, 524 pages
Number of pages
524

Edition Identifiers

Open Library
OL27208603M
ISBN 10
149192912X
ISBN 13
9781491929124
OCLC/WorldCat
950479609, 930683030
Goodreads
27968891

Work Identifiers

Work ID
OL20028554W

Work Description

Members of the SRE team explain how their engagement with the entire software lifecycle has enabled Google to build, deploy, monitor, and maintain some of the largest software systems in the world.
