
Operational excellence easy as counting 1-2-3

Ronald Chen, February 7th 2022

Production is down! Quick, where is the problem? The availability metric is relentlessly ticking away. Oops, there goes another nine.


A production system is typically a mix of internal and external services. We can treat these services as black boxes to figure out where the problem lies.

Production system as black boxes

Universal service black box

We can observe many metrics going into and out of a black box.

Black box metrics

The most common ones are:

  1. Count of requests in
  2. Count of responses out
  3. Count of requests out
  4. Count of responses in
  5. Latency (aka service time)
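To make the list above concrete, here is a minimal sketch of a record holding these five metrics for one service. The class name `BlackBoxMetrics` and its methods are hypothetical, chosen only for illustration; a real system would more likely emit these counters to a metrics backend.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlackBoxMetrics:
    """Counters observed at the edge of one service (hypothetical names)."""

    requests_in: int = 0
    responses_out: int = 0
    requests_out: int = 0
    responses_in: int = 0
    latencies_ms: List[float] = field(default_factory=list)

    def record_handled(self, latency_ms: float) -> None:
        """An incoming request was received and answered after latency_ms."""
        self.requests_in += 1
        self.responses_out += 1
        self.latencies_ms.append(latency_ms)

    def record_dropped(self) -> None:
        """An incoming request was received but never answered."""
        self.requests_in += 1

    def record_downstream_call(self, answered: bool) -> None:
        """The service called a dependency, which may or may not have replied."""
        self.requests_out += 1
        if answered:
            self.responses_in += 1


metrics = BlackBoxMetrics()
metrics.record_handled(latency_ms=12.5)
metrics.record_downstream_call(answered=True)
metrics.record_dropped()
```

After these three calls the service has seen two requests in but sent only one response out, which is exactly the kind of gap the rest of this post is about.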

We can already tell a lot about what is going on with these basic metrics. If the count of requests coming in does not match the count of responses going out, the service is likely failing.
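As a rough sketch, the check is a subtraction. The function names and the `tolerance` parameter are assumptions for illustration; the tolerance allows for requests that are still in flight when the counters are read.

```python
def missing_responses(requests_in: int, responses_out: int) -> int:
    """Number of incoming requests that have not been answered."""
    return max(requests_in - responses_out, 0)


def service_is_suspect(requests_in: int, responses_out: int, tolerance: int = 0) -> bool:
    """True when more requests are unanswered than the tolerance allows."""
    return missing_responses(requests_in, responses_out) > tolerance


healthy = service_is_suspect(requests_in=100, responses_out=100)
failing = service_is_suspect(requests_in=100, responses_out=90)
in_flight_ok = service_is_suspect(requests_in=100, responses_out=98, tolerance=5)
```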

Service missing responses

If the count of requests going out does not match the responses coming in, then a downstream service is likely failing.
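The same accounting can be kept per downstream dependency, so we know which one to look at next. This is a sketch under assumed inputs: the dependency names and counts are made up, and `failing_downstreams` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def failing_downstreams(
    calls: Dict[str, Tuple[int, int]], tolerance: int = 0
) -> List[str]:
    """Return dependencies whose unanswered requests exceed the tolerance.

    calls maps a dependency name to (requests_out, responses_in).
    """
    return sorted(
        name
        for name, (sent, received) in calls.items()
        if sent - received > tolerance
    )


# Hypothetical counts gathered over the same time window.
calls = {
    "database": (500, 500),
    "payments-api": (120, 80),
    "cache": (900, 899),
}
suspects = failing_downstreams(calls, tolerance=1)
```

Here only `payments-api` is flagged; the single missing `cache` response falls within the tolerance.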

Service not receiving responses

Latency is the difference between the request timestamp and the matching response timestamp. High latency usually indicates a problem with either the service itself or one of its downstream dependencies.
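A minimal sketch of that measurement, assuming each request carries an ID that also appears on its response and that timestamps are in milliseconds. The IDs, timestamps, and the 200 ms threshold are invented for the example.

```python
from typing import Dict, List


def latencies_ms(
    request_ts: Dict[str, int], response_ts: Dict[str, int]
) -> Dict[str, int]:
    """Latency per request ID: response timestamp minus request timestamp."""
    return {
        request_id: response_ts[request_id] - sent_at
        for request_id, sent_at in request_ts.items()
        if request_id in response_ts  # unanswered requests have no latency yet
    }


def slow_requests(latencies: Dict[str, int], threshold_ms: int) -> List[str]:
    """IDs of requests that took longer than threshold_ms."""
    return sorted(rid for rid, ms in latencies.items() if ms > threshold_ms)


request_ts = {"a": 1000, "b": 1010, "c": 1020}
response_ts = {"a": 1025, "b": 1310}  # "c" never got a response

latencies = latencies_ms(request_ts, response_ts)
slow = slow_requests(latencies, threshold_ms=200)
```

Request `c` has no latency at all, so it shows up in the request and response counts instead.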

Measuring service latency

Tracing the problem

With all these metrics in place, finding the problem is a matter of tracing the incoming traffic service by service to see which one is missing responses or has high latency.
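One way to sketch this trace is a walk over the service graph that keeps descending while a downstream service also looks unhealthy. The service names, numbers, and the 500 ms latency limit are assumptions for illustration, and the walk assumes a failing dependency makes its callers look unhealthy too.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class ServiceStats:
    requests_in: int
    responses_out: int
    latency_ms: float
    downstream: List[str]


def is_unhealthy(stats: ServiceStats, max_latency_ms: float) -> bool:
    """Missing responses or high latency both count as unhealthy."""
    return stats.responses_out < stats.requests_in or stats.latency_ms > max_latency_ms


def trace_problem(
    services: Dict[str, ServiceStats],
    entry: str,
    max_latency_ms: float = 500.0,
    _seen: Optional[Set[str]] = None,
) -> Optional[str]:
    """Follow traffic from entry and return the deepest unhealthy service."""
    seen = _seen if _seen is not None else set()
    if entry in seen or entry not in services:
        return None
    seen.add(entry)
    stats = services[entry]
    if not is_unhealthy(stats, max_latency_ms):
        return None
    for name in stats.downstream:
        culprit = trace_problem(services, name, max_latency_ms, seen)
        if culprit is not None:
            return culprit
    # Unhealthy, but every dependency looks fine: the problem is here.
    return entry


# Hypothetical production system.
services = {
    "gateway": ServiceStats(1000, 900, 800.0, ["auth", "tournaments"]),
    "auth": ServiceStats(1000, 1000, 20.0, []),
    "tournaments": ServiceStats(1000, 900, 780.0, ["database"]),
    "database": ServiceStats(3000, 3000, 15.0, []),
}
culprit = trace_problem(services, "gateway")
```

In this made-up system the gateway looks broken from the outside, but the trace ends at `tournaments`, whose own dependency is healthy.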

One service is unlike the other

We need to go deeper

Once we have determined which service is failing, we can look inside of it. Now the parts that make up the service can be treated as black boxes. We continue to trace the external service metrics to the internals.

Inside the service

Not all services will have internal metrics, and whether one should add them is a business value judgement.

Yer an accountant, Harry

When we are tracing production issues, what we are really doing is just accounting. Where is the latency being produced? Can we reconcile the requests with the responses?

Once we get a feel for what is “normal”, we can start creating alarms and detect failures before an actual outage.
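One simple way to encode "normal" is a baseline built from recent history, with an alarm when the current value drifts too far from it. This is only a sketch: the three-standard-deviation rule and the sample numbers are assumptions, and real alarms are usually tuned per metric.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(
    history: Sequence[float], current: float, max_sigmas: float = 3.0
) -> bool:
    """True if current is more than max_sigmas standard deviations from the mean.

    history needs at least two data points for a standard deviation.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > max_sigmas * spread


# Hypothetical latency samples in milliseconds from a normal period.
history = [100, 102, 98, 101, 99]

normal_reading = is_anomalous(history, 103)
alarming_reading = is_anomalous(history, 140)
```

The same check works for the request and response counts, not just latency.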

Do you want to achieve operational excellence? You’re in luck, Battlefy is hiring.
