Service Monitors and Observability
Published: ....
Last modified: ....
Thanks to Scott Kaye for reviewing an early draft of this post
A while back I made this post on X about being intentional when
creating service monitors:
This became my go-to question when we were defining additional monitors for our services over the past month or so:
“If you get paged by this at 2am, do you know what it means/what to do?”
It’s a really solid litmus test to ensure monitors are clear, concise, and actionable https://t.co/YpFhbCW34G
- Matt Hamlin (@immatthamlin) July 29, 2023
I figured I would expand on it in a blog post, since I have thought about it a decent amount since posting it.
For those who don't know, service monitors are automatic "tests" that can be used to determine the health of a system in production. They can be configured for just about anything, though latency and uptime are generally the most popular.
Traditionally, monitors are configured to automatically raise an incident, which usually pages whoever is currently on call for the service.
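As a rough sketch of the idea (not any particular vendor's API), a latency monitor boils down to a threshold check over recent measurements. The `LatencyMonitor` class, its field names, the `checkout-api` service, and the 500ms p95 threshold are all hypothetical, made up for illustration:

```python
import math
from dataclasses import dataclass


@dataclass
class LatencyMonitor:
    """Hypothetical monitor: trips when p95 latency exceeds a threshold."""

    name: str
    threshold_ms: float

    def is_tripped(self, samples_ms: list[float]) -> bool:
        if not samples_ms:
            # No data at all; a real setup might treat this as its own alert.
            return False
        ordered = sorted(samples_ms)
        # Nearest-rank p95: the value at or below which 95% of samples fall.
        rank = math.ceil(95 * len(ordered) / 100)
        return ordered[rank - 1] > self.threshold_ms


monitor = LatencyMonitor(name="checkout-api p95 latency", threshold_ms=500)
healthy = [120.0] * 20
degraded = [120.0] * 10 + [900.0] * 10

print(monitor.is_tripped(healthy))
print(monitor.is_tripped(degraded))
```

In a real system, a tripped monitor would typically open an incident and page the on-call engineer rather than just return a boolean.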
However, as Uncle Ben says, with great power comes great responsibility.
It can be nerve-racking to be on call for large and complex services, especially if you're new to the team that owns such a service. Now compound that with being paged by a monitor that is missing context about what's actually wrong with the service and what to do to resolve it.
This is what I was speaking to in the above post on Twitter/X. At work, we rapidly spun up a ton of new monitors for our core service after a particularly interesting series of incidents (maybe I'll write about those in the near future). However, at the start we weren't necessarily thinking about them through the framing of the original post. Instead, we were mainly thinking about adding more observability to the systems and service as a whole to help cover some of the gaps identified in the previous incidents.
Fortunately, wiser minds on the team prevailed, and we started to become more critical about the monitors we were creating. We started to add more context to the monitors, outlining what it means for a particular monitor to be tripped, and what one should do to help resolve the issue if it is happening.
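To make the "2am test" a bit more concrete, here is a minimal sketch of how a team might check that each monitor carries that context before it is allowed to page anyone. The `MonitorDefinition` shape, its field names, and the example monitor text are assumptions for illustration, not how our tooling actually worked:

```python
from dataclasses import dataclass


@dataclass
class MonitorDefinition:
    """Hypothetical monitor definition with on-call context attached."""

    name: str
    meaning: str = ""      # what it means when this monitor trips
    first_steps: str = ""  # what the on-call engineer should do first


def missing_context(monitor: MonitorDefinition) -> list[str]:
    """Return the names of context fields that are blank."""
    required = {"meaning": monitor.meaning, "first_steps": monitor.first_steps}
    return [name for name, value in required.items() if not value.strip()]


vague = MonitorDefinition(name="Error rate high")
clear = MonitorDefinition(
    name="Checkout 5xx rate above 2% for 10 minutes",
    meaning="Customers are failing to complete purchases.",
    first_steps="Check the most recent deploy and roll back if it lines up.",
)

print(missing_context(vague))
print(missing_context(clear))
```

A check like this could run in code review or CI, so a monitor with blank context never reaches the person holding the pager.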
We even started to cull some of the many monitors we had created for the service. This may seem a bit counterintuitive, but another risk of creating monitors is adding noise for the engineers on call. We found that we'd be paged for issues that would auto-resolve in a short amount of time, or even for issues that we couldn't do anything about at the time of the incident.
All of these patterns made it more difficult to support our services rather than making it easier.
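One common way to cut down on pages for issues that auto-resolve is to require a breach to persist across several consecutive checks before paging. The sketch below assumes a hypothetical `should_page` helper and a default of three consecutive checks; both are illustrative choices, not values from our setup:

```python
def should_page(breaches: list[bool], required_consecutive: int = 3) -> bool:
    """Page only if the most recent `required_consecutive` checks all breached.

    `breaches` is ordered oldest to newest; True means the check was over
    its threshold at that evaluation.
    """
    if required_consecutive <= 0:
        raise ValueError("required_consecutive must be positive")
    if len(breaches) < required_consecutive:
        return False
    return all(breaches[-required_consecutive:])


blip = [False, True, False, False]      # spiked once, then recovered
sustained = [False, True, True, True]   # still breaching after three checks

print(should_page(blip))
print(should_page(sustained))
```

The trade-off is that a real incident pages a few check intervals later than it otherwise would, so the window is worth tuning per monitor.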
As with everything, there's nuance. Service monitors offer a lot of benefits and help improve overall service and system observability. However, they must be applied appropriately. Try to remember: what can this monitor tell me when I get paged at 2 or 3 in the morning?