
Service Monitors and Observability

Published: ....
Last modified: ....


Thanks to Scott Kaye for reviewing an early draft of this post

A while back I made this post on X about being intentional when creating service monitors:

This became my go-to question when we were defining additional monitors for our services over the past month or so:

“If you get paged by this at 2am, do you know what it means/what to do?”

It’s a really solid litmus test to ensure monitors are clear, concise, and actionable https://t.co/YpFhbCW34G

— Matt Hamlin (@immatthamlin) July 29, 2023

I figured I would expand on it in a blog post, since I have thought about it a decent amount since posting it.

For those who don't know, service monitors are automatic "tests" that can be used to determine the health of a system in production. They can be configured for just about anything, though latency and uptime are generally the most popular monitors.

Traditionally, monitors are configured to automatically raise an incident, which usually pages whoever is currently on call for the service.

However, as Uncle Ben says: with great power comes great responsibility.

It can be nerve-racking to be on call for large and complex services, especially if you're new to the team that owns such a service. Now compound that with being paged by a monitor that is missing context about what's actually wrong with the service and what to do to resolve it.

This is what I was speaking to in the above post on Twitter/X. At work, we rapidly spun up a ton of new monitors for our core service after a particularly interesting series of incidents (maybe I'll write about those in the near future). However, at the start we weren't thinking about them through the framing of the original post. Instead, we were mainly thinking about adding more observability to the systems and service as a whole, to help cover some of the gaps identified in the previous incidents.

Fortunately, wiser minds on the team prevailed, and we started to become more critical about the monitors we were creating. We started to add more context to the monitors, outlining what it means when a particular monitor trips and what one should do to help resolve the issue when it does.
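To make that concrete, here is a minimal sketch of what "adding context" could look like if monitors were defined in code. The `MonitorDefinition` shape, its field names, the `missingContext` helper, and the example monitor are all hypothetical; they are not taken from any particular monitoring tool or from the setup described in this post.

```typescript
// Hypothetical monitor definition. The field names are illustrative only.
interface MonitorDefinition {
  name: string;
  // The condition that trips the monitor, in plain language.
  condition: string;
  // What it means for the service when the monitor trips.
  meaning: string;
  // First steps the on-call engineer should take.
  actions: string[];
}

// Returns the reasons a monitor would fail the "paged at 2am" test.
// An empty array means the monitor carries enough context.
function missingContext(monitor: MonitorDefinition): string[] {
  const problems: string[] = [];
  if (monitor.meaning.trim() === "") {
    problems.push("no explanation of what a trip means");
  }
  const realActions = monitor.actions.filter((a) => a.trim() !== "");
  if (realActions.length === 0) {
    problems.push("no actions for the on-call engineer");
  }
  return problems;
}

// An example monitor with made-up values.
const checkoutLatency: MonitorDefinition = {
  name: "checkout-p99-latency",
  condition: "p99 latency above 2s for 10 minutes",
  meaning: "Customers are likely seeing slow or timed-out checkouts.",
  actions: [
    "Check whether a recent deploy lines up with the spike and roll back if so.",
  ],
};

console.log(missingContext(checkoutLatency));
```

A check like this could run wherever monitors get reviewed, so a monitor without a meaning or a next step is flagged before it ever pages anyone.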

We even started to cull some of the many monitors we had created for the service. This may seem a bit counterintuitive, but another risk of creating monitors is adding noise for the engineers on call. We found that we'd be paged for issues that would auto-resolve in a short amount of time, or even for issues that we couldn't do anything about at the time of the incident.
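One way to picture those two kinds of noise is as a routing decision made before anyone gets paged. The sketch below is an assumption-heavy illustration: the `TrippedMonitor` shape, the `routeAlert` function, and the `gracePeriodMinutes` parameter are hypothetical names, not something from a specific tool or from what we actually built.

```typescript
// Hypothetical shape of a monitor that has tripped. Names are illustrative.
interface TrippedMonitor {
  name: string;
  // How long the condition has been failing, in minutes.
  failingForMinutes: number;
  // Whether the on-call engineer can do anything about it right now.
  actionable: boolean;
}

type Route = "page" | "ticket" | "wait";

// Decide what to do with a tripped monitor.
// gracePeriodMinutes is an assumed tuning knob: issues that clear up
// within this window never page anyone.
function routeAlert(alert: TrippedMonitor, gracePeriodMinutes: number): Route {
  // Nothing the on-call engineer can do: record it instead of waking them.
  if (!alert.actionable) {
    return "ticket";
  }
  // Might still auto-resolve: hold off until the grace period has passed.
  if (alert.failingForMinutes < gracePeriodMinutes) {
    return "wait";
  }
  // Sustained and actionable: this one is worth a page.
  return "page";
}

const blip: TrippedMonitor = { name: "queue-depth", failingForMinutes: 2, actionable: true };
console.log(routeAlert(blip, 10));
```

The point of the sketch is only that "should this page someone?" is a separate question from "did the monitor trip?", and the two cases above (self-healing blips and issues nobody can act on) both answer it with "no".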

All of these patterns made our services harder to support, not easier.

As with everything, there's nuance. Service monitors offer a lot of benefits and help improve overall service and system observability. However, they must be applied appropriately. Try to remember: what can this monitor tell me when I get paged at 2 or 3 in the morning?

