sre.google
Lessons Learned from Twenty Years of Site Reliability Engineering Or, Eleven things we have learned as Site Reliability Engineers at Google Authors Adrienne Walcer, Kavita Guliani, Mikel Ward, Sunny Hsiao, and Vrai Stacey Contributors Ali Biber, Guy Nadler, Luisa Fearnside, Thomas Holdschick, and Trevor Mattson-Hamilton Foreword A lot can happen in twenty years, especially when you're busy growing
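The two books Google has previously published, "Site Reliability Engineering" and "The Site Reliability Workbook", show how and why a commitment to the entire service lifecycle enables organizations to successfully build, deploy, monitor, and maintain software systems. In this report, Google Cloud Reliability Advocate Steve McGhee and Google Cloud Solutions Architect James Brookbank dig deep into the specific challenges engineers face when adopting SRE in their organizations. Despite the spread of SRE, many companies see a large gap between their initial enthusiasm for SRE and the degree to which they actually adopt it. As for this report, product owners and highly reliable ser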
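The Art of SLOs is a workshop developed by Google's Customer Reliability Engineering team. The goal of the workshop is to introduce participants to the way Google measures service reliability, namely service level indicators (SLIs) and service level objectives (SLOs), and to give them hands-on experience creating these measurements. These are important, foundational concepts. With an objective way to measure service reliability, it becomes far easier to have meaningful conversations about it. In the theory portion of the workshop, you learn how to resolve the organizational tension that often arises between development and operations teams by setting targets that express the desired reliability of a service. You also use SLOs and error budgets, in a data-driven, objective, and user-focused way, to measure service reliability and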
SRE Book Updates, by Topic Click on a chapter thumbnail to see relevant publications, conference talks, and workshops by Google SREs.
Anatomy of an Incident - Google - Site Reliability Engineering When it comes to system design, failure is inevitable. Scientists and engineers implement solutions based on the available information, without a complete knowledge of the future. You can’t always anticipate the next zero-day event, viral media trend, weather disaster, or shift in technology. But you can be prepared to respond when inc
SRE in the Cloud Learn how to put SRE principles into practice by leveraging cloud technology. Implement SRE in your organization through tooling, hands-on tutorials, videos, blogs, and other resources. Balance development velocity and reliability Manage reliability and drive alignment between developers and operators with baked-in SRE best practices. Create Service-Level Indicators (SLI), set Ser
Štěpán Davidovič Incident Metrics in SRE Critically Evaluating MTTR and Friends Boston Farnham Sebastopol Tokyo Beijing 978-1-098-10313-2 [LSI] Incident Metrics in SRE by Štěpán Davidovič Copyright © 2021 O’Reilly Media, Inc. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebas
Incident Metrics in SRE - Google - Site Reliability Engineering Measuring improvements as a result of a process change, product purchase, or a technological change is commonplace. In reliability engineering, statistics such as mean time to recovery (MTTR) or mean time to mitigation (MTTM) are often measured. These statistics are sometimes used to evaluate improvements, or track trends. In this rep
What is Site Reliability Engineering (SRE)? SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services — Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few — with an ever-watchful eye on their availability, latency, performance, and cap
Introducing Non-Abstract Large System Design By Salim Virji, James Youngman, Henry Robertson, Stephen Thorne, Dave Rensin, and Zoltan Egyed with Richard Bondi With responsibilities that span production operations and product engineering, SRE is in a unique position to align business case requirements and operational costs. Product engineering teams may not be aware of the maintenance cost of syste
Service Overview The Example Game Service allows Android and iPhone users to play a game with each other. The app runs on users’ phones, and moves are sent back to the API via a REST API. The data store contains the states of all current and previous games. A score pipeline reads this table and generates up-to-date league tables for today, this week, and all time. League table results are availabl
Implementing SLOs By Steven Thurgood and David Ferguson with Alex Hidalgo and Betsy Beyer Service level objectives (SLOs) specify a target level for the reliability of your service. Because SLOs are key to making data-driven decisions about reliability, they’re at the core of SRE practices. In many ways, this is the most important chapter in this book. Once you’re equipped with a few guidelines, s
Alerting on SLOs By Steven Thurgood with Jess Frame, Anthony Lenton, Carmela Quinito, Anton Tolchanov, and Nejc Trdin This chapter explains how to turn your SLOs into actionable alerts on significant events. Both our first SRE book and this book talk about implementing SLOs. We believe that having good SLOs that measure the reliability of your platform, as experienced by your customers, provides t
Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program Written by: Jennifer Petoff, JC van Winkel & Preston Yoshioka with Jessie Yang, Jesus Climent Collado & Myk Taylor Providing training and education for Site Reliability Engineers is universally important to set them up for success in your organization. However, the specific training needs of each enginee
Written by: Heather Adkins, Betsy Beyer, Paul Blankinship, Ana Oprea, Piotr Lewandowski, Adam Stubblefield Can a system be considered truly reliable if it isn't fundamentally secure? Or can it be considered secure if it's unreliable? Security is crucial to the design and operation of scalable systems in production, as it plays an important part in product quality, performance, and availability. In
If you’re rolling out a large-scale infrastructure change, you know it can be like swapping out a jet engine while flying. Staying aloft takes coordination and communication with many teams, good processes and documentation, risk identification and management, monitoring, and tracking of the change progress—not to mention dealing with the catastrophic challenges that crop up midflight. In this rep
Release Engineering Written by Dinah McNutt Edited by Betsy Beyer and Tim Harvey Release engineering is a relatively new and fast-growing discipline of software engineering that can be concisely described as building and delivering software [McN14a]. Release engineers have a solid (if not expert) understanding of source code management, compilers, build configuration languages, automated build too
Postmortem Culture: Learning from Failure By Daniel Rogers, Murali Suriar, Sue Lueder, Pranjal Deo, and Divya Sudhakar with Gary O’Connor and Dave Rensin Our experience shows that a truly blameless postmortem culture results in more reliable systems—which is why we believe this practice is important to creating and maintaining a successful SRE organization. Introducing postmortems into an organiza
Copyright © 2018 Google, Inc. Published by O'Reilly Media, Inc. Licensed under CC BY-NC-ND 4.0
Launch Coordination Checklist This is Google’s original Launch Coordination Checklist, circa 2005, slightly abridged for brevity: Architecture Architecture sketch, types of servers, types of requests from clients Programmatic client requests Machines and datacenters Machines and bandwidth, datacenters, N+2 redundancy, network QoS New domain names, DNS load balancing Volume estimates, capacity, and
Example Postmortem Shakespeare Sonnet++ Postmortem (incident #465) Date: 2015-10-21 Authors: jennifer, martym, agoogler Status: Complete, action items in progress Summary: Shakespeare Search down for 66 minutes during period of very high interest in Shakespeare due to discovery of a new sonnet. Impact: Estimated 1.21B queries lost, no revenue impact. Root Causes: Cascading failure due to com
Monitoring Distributed Systems Written by Rob Ewaschuk Edited by Betsy Beyer Google’s SRE teams have some basic principles and best practices for building successful monitoring and alerting systems. This chapter offers guidelines for what issues should interrupt a human via a page, and how to deal with issues that aren’t serious enough to trigger a page. Definitions There’s no uniformly shared voc
Service Level Objectives Written by Chris Jones, John Wilkes, and Niall Murphy with Cody Smith Edited by Betsy Beyer It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. To this end, we would like to define and deliver a given level of service to our users, whether they u
Postmortem Culture: Learning from Failure Written by John Lunney and Sue Lueder Edited by Gary O’Connor The cost of failure is education. Devin Carraway As SREs, we work with large-scale, complex, distributed systems. We constantly enhance our services with new features and add new systems. Incidents and outages are inevitable given our scale and velocity of change. When an incident occurs, we fi
Eliminating Toil Written by Vivek Rau Edited by Betsy Beyer If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow. Carla Geisser, Google SRE In SRE, we want to spend time on long-term engineering project work instead of operational work. Because the term operational work may be misinterpreted, we use a specifi
Chapter 2 - The Production Environment at Google, from the Viewpoint of an SRE Written by JC van Winkel Edited by Betsy Beyer Google datacenters are very different from most conventional datacenters and small-scale server farms. These differences present both extra problems and opportunities. This chapter discusses the challenges a