Production is down! Quick, where is the problem? The availability metric is relentlessly ticking away. Oops, there goes another nine.
Source: https://en.wikipedia.org/wiki/High_availability#Percentage_calculation
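To make "another nine" concrete, here is a small sketch of the arithmetic behind availability percentages. It assumes a 365-day year, and the function name `allowed_downtime_minutes` is made up for this illustration.

```python
# Downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year that an availability target permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Each extra nine shrinks the yearly downtime budget by a factor of ten.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes/year")
```

At 99.9% the budget is about 525.6 minutes per year, so a long outage can burn through a nine quickly.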
A production system is typically a mix of internal and external services. We can treat these services as black boxes to figure out where the problem lies.
Production system as black boxes
We can observe many metrics going into and out of a black box.
Black box metrics
The most common ones are the counts of requests and responses, in each direction, and latency.
We can already tell a lot about what is going on with these basic metrics. If the count of requests coming in does not match the count of responses going out, the service is likely failing.
Service missing responses
If the count of requests going out does not match the responses coming in, then a downstream service is likely failing.
Service not receiving responses
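The two count checks above can be sketched as a small function. Everything here is hypothetical: the `BoxCounts` type, its field names, and the choice to check the downstream side first (on the assumption that a service missing downstream responses will usually miss its own responses too). It also assumes all four counts cover the same time window and that in-flight requests are negligible.

```python
from dataclasses import dataclass


@dataclass
class BoxCounts:
    """Counts observed at the edges of one black box over the same time window."""
    requests_in: int    # requests received from callers
    responses_out: int  # responses returned to callers
    requests_out: int   # requests sent to downstream services
    responses_in: int   # responses received from downstream services


def diagnose(counts: BoxCounts) -> str:
    """Apply the two reconciliation checks and report where the problem likely is."""
    # Check downstream first: missing downstream responses usually explain
    # missing responses from this service as well (an assumption of this sketch).
    if counts.requests_out != counts.responses_in:
        return "downstream service is likely failing"
    if counts.requests_in != counts.responses_out:
        return "this service is likely failing"
    return "counts reconcile"


print(diagnose(BoxCounts(requests_in=100, responses_out=90, requests_out=100, responses_in=100)))
print(diagnose(BoxCounts(requests_in=100, responses_out=90, requests_out=100, responses_in=80)))
```

In a real system the counts rarely match exactly at any instant, so a tolerance or a longer window would probably be needed.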
Latency is the difference between the request timestamp and the matching response timestamp. High latency usually indicates a problem in either the service itself or one of its downstream dependencies.
Measuring service latency
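As a rough sketch of that measurement, assume each request carries an ID that its response echoes back, and that timestamps are recorded in milliseconds. The IDs, the dictionaries, and the `latencies` function are invented for this example.

```python
def latencies(requests: dict[str, int], responses: dict[str, int]) -> dict[str, int]:
    """Map request ID to latency in milliseconds for requests with a matching response."""
    return {
        request_id: responses[request_id] - sent_at
        for request_id, sent_at in requests.items()
        if request_id in responses
    }


# Hypothetical timestamps in milliseconds, keyed by request ID.
requests = {"a": 1000, "b": 1050, "c": 1100}
responses = {"a": 1020, "b": 1250}

measured = latencies(requests, responses)
unanswered = [request_id for request_id in requests if request_id not in responses]

print(measured)    # latency per answered request
print(unanswered)  # requests that never got a response
```

The unanswered requests fall out of the same matching step, so one pass over the data gives both the latency and the missing responses.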
With all these metrics in place, finding the problem is a matter of tracing the incoming traffic service by service to see which service is missing responses or has high latency.
One service is unlike the other
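Here is a minimal sketch of that tracing, under some loud assumptions: the service names, the `ServiceMetrics` fields, and the 500 ms latency limit are all hypothetical, and a service counts as the likely culprit when it looks unhealthy while all of its downstream services look healthy.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceMetrics:
    requests_in: int
    responses_out: int
    p95_latency_ms: float
    downstream: list[str] = field(default_factory=list)  # names of services it calls


def is_unhealthy(metrics: ServiceMetrics, latency_limit_ms: float) -> bool:
    """A service looks unhealthy if it is missing responses or is too slow."""
    return (metrics.responses_out < metrics.requests_in
            or metrics.p95_latency_ms > latency_limit_ms)


def trace(services: dict[str, ServiceMetrics], entry: str,
          latency_limit_ms: float = 500.0) -> list[str]:
    """Follow traffic from the entry point; return unhealthy services whose
    downstream services all look healthy (the likely culprits)."""
    suspects: list[str] = []
    seen: set[str] = set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        metrics = services[name]
        if not is_unhealthy(metrics, latency_limit_ms):
            return  # healthy here, so stop following this path
        bad = [d for d in metrics.downstream
               if is_unhealthy(services[d], latency_limit_ms)]
        if not bad:
            suspects.append(name)  # unhealthy with healthy dependencies
        for dependency in bad:
            visit(dependency)

    visit(entry)
    return suspects


# Hypothetical system: api -> auth, orders; orders -> database.
services = {
    "api": ServiceMetrics(100, 90, 900.0, ["auth", "orders"]),
    "auth": ServiceMetrics(100, 100, 40.0),
    "orders": ServiceMetrics(100, 90, 850.0, ["database"]),
    "database": ServiceMetrics(100, 100, 20.0),
}
print(trace(services, "api"))
```

In this made-up system the trace passes through `api`, skips the healthy `auth`, and stops at `orders`, whose only dependency looks fine.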
Once we have determined which service is failing, we can look inside of it. Now the parts that make up the service can be treated as black boxes. We continue to trace the external service metrics to the internals.
Inside the service
Not all services will have internal metrics, and whether to add them is a business value judgement.
When we are tracing production issues, what we are really doing is just accounting. Where is the latency being produced? Can we reconcile the requests with the responses?
Once we get a feel for what is “normal”, we can start creating alarms and detect failures before an actual outage.
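One possible way to turn "normal" into an alarm is to derive a threshold from a baseline of past measurements. The rule used here (mean plus three standard deviations) is an assumption made for illustration, not something prescribed above, and the names `build_alarm` and `baseline_latency_ms` are hypothetical.

```python
from statistics import mean, stdev
from typing import Callable


def build_alarm(baseline: list[float], sigmas: float = 3.0) -> Callable[[float], bool]:
    """Return a check that fires when a value exceeds mean + sigmas * stdev of the baseline.

    The baseline needs at least two samples, since stdev is undefined for fewer.
    """
    threshold = mean(baseline) + sigmas * stdev(baseline)

    def should_alarm(value: float) -> bool:
        return value > threshold

    return should_alarm


# Hypothetical latency samples (ms) collected while the system was healthy.
baseline_latency_ms = [100.0, 110.0, 95.0, 105.0, 90.0]
latency_alarm = build_alarm(baseline_latency_ms)

print(latency_alarm(120.0))  # within the normal range
print(latency_alarm(200.0))  # well above the normal range
```

The same idea applies to the count metrics, for example alarming when the gap between requests and responses grows beyond its usual size.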
Do you want to achieve operational excellence? You’re in luck, Battlefy is hiring.