Damage Control in Distributed Systems

It began with a squirrel, judging by the tail, an ill-conceived violation of the transformer’s inner sanctum greeted with righteous thunder and an otherwise minor blip across the United Illuminating Company’s grid. What took down the mighty Nasdaq exchange wasn’t the incident itself, but what came after. Not the blaze of glory or the grieving mother in the family nest on the Pequonnock River, but the surge of power flowing back down the lines.

Grid operators, systems thinkers, and grey-beard sysadmins will tell you: systems are hard. Defining, assigning, and synchronizing work is tricky enough on one computer, but even simple tools can yield surprising complexity when assembled into networks.

The failure states that arise in deeply interconnected systems are as difficult to anticipate as they are to resolve, and sometimes the best we can do is simply to stop them from spreading. Keep the rest of the system running? That’s a good day. No amount of thoughtful application design will resolve a problem begun somewhere else, but by embracing failure and owning its impact on the user, our services can at least strive not to make things worse.

What follows are several useful patterns for containing and responding to common faults. They’re presented in stripped-down form with neither the configurability nor runtime visibility they would need in production, but they’re still a useful place to start. Which is to say: study them, don’t use them. Caveat emptor.

The common currency

For the sake of example, let’s pretend that every action inside a big, imaginary network can be represented as a function–call it Action–consisting of an immediate Request and some eventual Response.

type Action = (r: Request) => Promise<Response>

There’s nothing stopping us from representing future actions as continuations, streams, or your favorite asynchronous programming model, but we’ll stick to just one of them here.
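For example, a minimal (and entirely hypothetical) Action might just echo its request back. The Request and Response shapes below are stand-ins for this sketch, not part of any real framework:

```typescript
// Hypothetical request/response shapes; a real service would use the
// types supplied by its HTTP framework.
type Request = { body: string };
type Response = { status: number; body: string };

type Action = (r: Request) => Promise<Response>;

// An Action that simply echoes the request body back.
const echo: Action = async (r) => ({ status: 200, body: r.body });
```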

Simple enough? Good. Let’s go make things fail.

Time limits: ending the neverending

Here’s one to ponder: how long can a long-running action go on before the customer (even a very patient, very digital customer) loses all interest in the outcome?

Pull up a chair. With no upper bound, we could be here a while.

The way to guarantee service within a reasonable window is to set that bound. “Service” might mean returning an error to our end user–but a timely error will almost always be an improvement over waiting until the heat death of the universe. And the implementation works like you’d expect. We’ll start a race between a timer and some pending promise. If the timer comes back first, we’ll declare the promise timed out and unblock it.

const delay = (t: number): Promise<void> =>
  new Promise(resolve => setTimeout(resolve, t));

class TimeoutError extends Error {
  readonly code = "ETIMEOUT";
}

type TimeLimiterOpts = {
  timeout: number;
};

function timeLimiter(opts: TimeLimiterOpts) {
  return function<T>(pending: Promise<T>) {
    return Promise.race<T>([
      pending,
      delay(opts.timeout).then(() =>
        Promise.reject(new TimeoutError("Timed out"))
      )
    ]);
  };
}

There’s one big caveat, though: while for all intents and purposes we’re carrying on as if we’ve “canceled” the promise, we’ve (in JavaScript, at least) only stopped waiting for the Response. We’ll see our TimeoutError, but any resources or computations attached to the pending promise may still be running to completion, and associated references may not yet be released.
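Here’s the limiter in use–a standalone sketch that repeats the definitions above, with a hypothetical 50 ms budget and delay-backed promises standing in for real work:

```typescript
const delay = (t: number): Promise<void> =>
  new Promise(resolve => setTimeout(resolve, t));

class TimeoutError extends Error {
  readonly code = "ETIMEOUT";
}

function timeLimiter(opts: { timeout: number }) {
  return function <T>(pending: Promise<T>): Promise<T> {
    return Promise.race<T>([
      pending,
      delay(opts.timeout).then(() =>
        Promise.reject(new TimeoutError("Timed out"))
      )
    ]);
  };
}

// A 50 ms budget: fast work settles normally, slow work times out.
const within50ms = timeLimiter({ timeout: 50 });

const fast = within50ms(delay(10).then(() => "made it"));
const slow = within50ms(delay(200).then(() => "too late"));
slow.catch(err => console.log(err.code)); // logs "ETIMEOUT"
```

In production we might also thread an AbortSignal through to the underlying work so it can actually stop; here we only abandon the result.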

Cutting off long-running requests is a very sensible step. But what about heading off the sort of congestion that might slow them down in the first place?

Rate limits: jamming the jam

If we’ve benchmarked our system’s performance–and before worrying too much about loaded behavior it’s worth benchmarking our system’s performance–we’ll already have a rough sense of where and how it’s likely to fail. With those numbers, we can try to keep traffic below some “known-good” threshold and preempt trouble before it starts. Here’s a pared-down implementation:

type Supplier<T> = () => Promise<T>;

type RateLimiterOpts = {
  limit: number;
  interval: number;
};

type RateLimiter<T> = (f: Supplier<T>) => Supplier<T>;

class RateLimitError extends Error {
  readonly code = "ERATELIMIT";
}

function rateLimiter<T>(opts: RateLimiterOpts): RateLimiter<T> {
  let count = 0;
  let periodStart = 0;

  return (supplier: Supplier<T>) => {
    return () => {
      const now = Date.now();
      if (now - periodStart > opts.interval) {
        periodStart = now;
        count = 0;
      }

      if (count >= opts.limit) {
        const err = new RateLimitError("Exceeded rate limit");
        return Promise.reject(err);
      }

      count += 1;
      return supplier();
    };
  };
}

In order to control invocation, this pattern wraps a Supplier (in lieu of a single Promise). By blocking invocations beyond a user-specified limit, we now have a crude lever to control supplier throughput.

This is useful enough, though there’s plenty to be done here to improve ease of use. In a more generous implementation we might warn customers as we approach the limit (allowing them to throttle requests accordingly) or even queue rejected requests to be re-processed later.
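In use, the limiter looks like this–a standalone sketch (so the definitions are repeated) with an invented two-calls-per-second budget and a canned supplier:

```typescript
type Supplier<T> = () => Promise<T>;

class RateLimitError extends Error {
  readonly code = "ERATELIMIT";
}

function rateLimiter<T>(opts: { limit: number; interval: number }) {
  let count = 0;
  let periodStart = 0;

  return (supplier: Supplier<T>): Supplier<T> => () => {
    const now = Date.now();
    if (now - periodStart > opts.interval) {
      periodStart = now;
      count = 0;
    }
    if (count >= opts.limit) {
      return Promise.reject(new RateLimitError("Exceeded rate limit"));
    }
    count += 1;
    return supplier();
  };
}

// At most two calls per second to a canned supplier; the third call
// inside the window is rejected with ERATELIMIT.
const limited = rateLimiter<string>({ limit: 2, interval: 1000 })(
  () => Promise.resolve("user-record")
);
```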

Which raises an interesting question: when failure happens, what comes next?

Retry: unfailing the failed

Well, something broke.

Maybe it was a self-imposed limit from a timer or rate-limiter, or maybe it was a genuine, honest-to-goodness exception in some action. In either case, we’ll need to decide whether to dutifully relay the failure to our customer or first attempt recovery on our own. If the error looks recoverable–a network flaked out, or a service was temporarily unavailable–we’ll usually start with the latter.

The easiest step may be simply to retry it:

type RetryerOpts = {
  limit: number;
};

class RetryError extends Error {
  readonly code = "ERETRY";
}

function retryer(opts: RetryerOpts) {
  return <T>(supplier: Supplier<T>): Promise<T> => {
    let attempt = 1;

    const onError = async () => {
      if (attempt >= opts.limit) {
        throw new RetryError("Exceeded retry limit");
      }

      await delay(1000);
      attempt += 1;
      return supplier().catch(onError);
    };

    return supplier().catch(onError);
  };
}

Once again we’ve spared some niceties in the name of a straightforward example, but that’s the heart of it. We watch for an error, and (provided we haven’t already exceeded a reasonable retry limit) we’ll go ahead and try it again. In a more robust implementation, we would likely want a way to filter “retryable” errors from the unrecoverable sort, as well as some way to review all of the errors related to both the initial request and our subsequent retry attempts. We may also want the ability to “cancel” retries in-flight, or even to check in on progress. But those, too, are projects for another day.
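Here’s the retryer in action–a standalone sketch in which the fixed 1,000 ms pause is shortened to 10 ms so the example runs quickly, and a deliberately flaky supplier stands in for a real call:

```typescript
type Supplier<T> = () => Promise<T>;

const delay = (t: number): Promise<void> =>
  new Promise(resolve => setTimeout(resolve, t));

class RetryError extends Error {
  readonly code = "ERETRY";
}

function retryer(opts: { limit: number }) {
  return <T>(supplier: Supplier<T>): Promise<T> => {
    let attempt = 1;

    const onError = async (): Promise<T> => {
      if (attempt >= opts.limit) {
        throw new RetryError("Exceeded retry limit");
      }
      await delay(10); // shortened from 1,000 ms for the example
      attempt += 1;
      return supplier().catch(onError);
    };

    return supplier().catch(onError);
  };
}

// A flaky supplier that fails twice before succeeding on the third call.
let calls = 0;
const flaky: Supplier<string> = () =>
  ++calls < 3 ? Promise.reject(new Error("flake")) : Promise.resolve("ok");
```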

One that’s worth taking on, however, is the schedule on which our retries are sent.

Backoff: a breath for the breathless

A retry policy is both a sensible first step and a terrific way to load artificial traffic onto an already-beleaguered service. If a failure may be due to request throttling or an overwhelmed service, it’s a good idea (as well as good form) to take a breath before any subsequent retries.

Here’s our original retryer, this time with room for different backoff strategies.

export type BackoffStrategy = (attempt: number) => Promise<void>;

type RetryerOpts = {
  limit: number;
  backoff: BackoffStrategy;
};

function retryer(opts: RetryerOpts) {
  return <T>(supplier: Supplier<T>): Promise<T> => {
    let attempt = 1;

    const onError = async () => {
      if (attempt >= opts.limit) {
        throw new RetryError("Exceeded retry limit");
      }

      await opts.backoff(attempt);
      attempt += 1;
      return supplier().catch(onError);
    };

    return supplier().catch(onError);
  };
}

const backoff = {
  fixed(d: number) {
    return (attempt: number) => delay(d);
  },
  linear(d: number) {
    return (attempt: number) => delay(d * attempt);
  }
};

Backoff isn’t quite enough on its own, though. Consider the case where a surge in requests causes many clients to fail near-simultaneously. If the clients share a common retry policy, no amount of backing off will save them from the coordinated surge that follows with each successive retry.

What will save them is to jitter the backoff calculation:

const jitter = (d: number) => (d / 2) + (Math.random() * d);

export const backoff = {
  linear(d: number) {
    return (attempt: number) =>
      delay(jitter(d * attempt));
  }
  // ...
};

We may still face down the original collision, but a bit of entropy sprinkled on our beautiful, deterministic backoff algorithm will at least prevent the (unintentionally) coordinated sort.
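A quick sanity check on the arithmetic: for a base delay d, jitter(d) always lands in the half-open interval [d/2, 3d/2), leaving the average delay unchanged while spreading simultaneous retries apart:

```typescript
const jitter = (d: number) => (d / 2) + (Math.random() * d);

// Sample the jittered value for a 100 ms base delay: every sample
// falls in [50, 150), and no two clients land on the same schedule.
const samples = Array.from({ length: 1000 }, () => jitter(100));
const inBounds = samples.every(s => s >= 50 && s < 150);
console.log(inBounds); // true
```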

Circuit Breaker: breaking the broken

So far we’ve put bounds around the sorts of faults that are relatively easy to specify. Requests need to settle within a certain window? Cut them off. Services will fail under a certain load? Limit the volume of requests.

Not all production faults are so accommodating. We may not know how something will fail–but when it does, we do want to detect it and avoid making it worse.

Enter the circuit breaker.

type CircuitBreakerOpts = {
  threshold: number;
  waitDuration: number;
};

type CircuitBreaker<T> = (f: Supplier<T>) => Promise<T>;

class CircuitOpenError extends Error {
  readonly code = "ECIRCUITOPEN";
}

type State = "CLOSED" | "OPEN" | "HALF_OPEN";

const circuitBreaker = (opts: CircuitBreakerOpts) => <T>(
  f: Supplier<T>
): (() => Promise<T>) => {
  let count = 0;
  let rejected = 0;
  let state: State = "CLOSED";
  let lastChangeAt = 0;

  const setState = (next: State) => {
    count = 0;
    rejected = 0;
    state = next;
    lastChangeAt = Date.now();
  };

  setState("CLOSED");

  return async () => {
    if (state === "OPEN") {
      if (Date.now() - lastChangeAt > opts.waitDuration) {
        setState("HALF_OPEN");
      } else {
        throw new CircuitOpenError("Circuit breaker is open");
      }
    }

    count += 1;

    try {
      const result = await f();
      if (state === "HALF_OPEN") {
        setState("CLOSED");
      }
      return result;
    } catch (err) {
      if (state === "HALF_OPEN") {
        setState("OPEN");
      }

      rejected += 1;
      if (state !== "OPEN") {
        const failureRate = rejected / count;
        if (failureRate > opts.threshold) {
          setState("OPEN");
        }
      }

      throw err;
    }
  };
};

Real-world circuit breakers prevent all sorts of incendiary unpleasantness by cutting off the power to a circuit that’s exceeded its design load. Our version will open its “circuit” when a certain signal (in this case, the error rate) exceeds a user-defined threshold. After waiting in the OPEN state, the breaker will relax to a HALF_OPEN position, at which point the success or failure of the next request will either restore normal operations or trip it back OPEN.
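To watch the state machine move, here’s the breaker wrapped around a supplier that always fails (the threshold and wait duration are invented for the example). The first failure pushes the error rate past 50%, opening the circuit; the next call is rejected immediately, without ever touching the supplier:

```typescript
type Supplier<T> = () => Promise<T>;

class CircuitOpenError extends Error {
  readonly code = "ECIRCUITOPEN";
}

type State = "CLOSED" | "OPEN" | "HALF_OPEN";

const circuitBreaker = (opts: { threshold: number; waitDuration: number }) =>
  <T>(f: Supplier<T>): (() => Promise<T>) => {
    let count = 0;
    let rejected = 0;
    let state: State = "CLOSED";
    let lastChangeAt = Date.now();

    const setState = (next: State) => {
      count = 0;
      rejected = 0;
      state = next;
      lastChangeAt = Date.now();
    };

    return async () => {
      if (state === "OPEN") {
        if (Date.now() - lastChangeAt > opts.waitDuration) {
          setState("HALF_OPEN");
        } else {
          throw new CircuitOpenError("Circuit breaker is open");
        }
      }

      count += 1;
      try {
        const result = await f();
        if (state === "HALF_OPEN") setState("CLOSED");
        return result;
      } catch (err) {
        if (state === "HALF_OPEN") setState("OPEN");
        rejected += 1;
        if (state !== "OPEN" && rejected / count > opts.threshold) {
          setState("OPEN");
        }
        throw err;
      }
    };
  };

// A guarded supplier that always fails: the first call surfaces the
// real error and trips the breaker; the second sees ECIRCUITOPEN.
const guarded = circuitBreaker({ threshold: 0.5, waitDuration: 1000 })(
  () => Promise.reject(new Error("downstream is down"))
);
```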

Once again, a production version of the circuit breaker would also need some ability to filter “safe” errors, as well as expose some insight into the present state of the state machine embedded inside our breaker. We’d likely surface an EventEmitter or push state changes, failure rates, or both to a metrics collector–just as we’d ship attempts, throughput, and latency from the rest of our fault-tolerance toolkit.

Putting it into practice

There are a couple of points worth considering before putting these patterns into practice. First, will the complexity added by any of these patterns solve a pressing problem within your system? If the answer is anything less than certain, leave it out.

Second, consider whether a proven library like Hystrix or resilience4j (or a port into your favorite language) will provide the features you need. It probably will. Leaning on it will save the trouble of verifying, benchmarking, and ironing out the kinks in your own homegrown safety equipment.

If you’re tired of chasing mysteries through a big, teetering network, though, patterns like these can help make its failures more obvious. Maybe you’ll combine them–by wrapping a circuit breaker around a rate-limited component, say–to layer on the sanity-checks and fallback behaviors. Maybe they’ll live in sidecar containers, in API middleware, or in proxies fronting for serverless functions.
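Because most of these patterns take and return a Supplier, layering them is just function composition. Here, for instance, is a sketch (names and numbers invented) of a retryer wrapped around a rate-limited supplier, so that an over-limit rejection simply gets retried after a short pause:

```typescript
type Supplier<T> = () => Promise<T>;

const delay = (t: number): Promise<void> =>
  new Promise(resolve => setTimeout(resolve, t));

// Rate limiter: at most `limit` calls per `interval` ms.
function rateLimiter<T>(opts: { limit: number; interval: number }) {
  let count = 0;
  let periodStart = 0;
  return (supplier: Supplier<T>): Supplier<T> => () => {
    const now = Date.now();
    if (now - periodStart > opts.interval) {
      periodStart = now;
      count = 0;
    }
    if (count >= opts.limit) {
      return Promise.reject(new Error("Exceeded rate limit"));
    }
    count += 1;
    return supplier();
  };
}

// Retryer: up to `limit` attempts with a short fixed pause between them.
function retryer(opts: { limit: number }) {
  return <T>(supplier: Supplier<T>): Supplier<T> => () => {
    let attempt = 1;
    const onError = async (err: unknown): Promise<T> => {
      if (attempt >= opts.limit) throw err;
      await delay(10);
      attempt += 1;
      return supplier().catch(onError);
    };
    return supplier().catch(onError);
  };
}

// Two layers: retries ride on top of the rate limit, so a second call
// inside the window is rejected once, then retried after the pause.
const resilient = retryer({ limit: 3 })(
  rateLimiter<string>({ limit: 1, interval: 15 })(
    () => Promise.resolve("payload")
  )
);
```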

But whatever and however you use them, the ability to recognize and handle errors is a critical part of coming to terms with the complexity of your distributed system. Assume failure. Own the bad news. And above all, don’t make it worse.
