
Health-monitoring in production (beyond logs)

Here at Zopa, we are about halfway through our journey of breaking up our venerable monolith into microservices. It shouldn’t come as a surprise that whilst a microservice architecture can free up teams to reason about and release their systems much more quickly, this flexibility comes at a considerable support cost. We use a number of support tools for this, primarily Splunk for log aggregation, but sometimes a more active approach can be useful. In this brief post, I will introduce Zopa.ServiceDiagnostics, a tiny open-source library that we have released, which rationalises one aspect of system monitoring.

Finding the needle in the haystack

In our new microservice-based utopia, instead of one system to monitor, we now have a whole zoo of them, written on top of various stacks (at the last count: .Net, Java, Python, Ruby and Go). If something, somewhere starts misbehaving, how do you figure out which system it is? If that service has multiple external dependencies (be they other services, databases or messaging fabrics), how can you be sure it isn’t one of those that is misbehaving? Since Zopa is moving increasingly to a Kafka-based asynchronous communication pattern, the failure of one service will not necessarily cause an easy-to-spot failure cascade. And if we do notice an error, the sheer amount of logging data we generate can make determining the root cause time-consuming.

A rational monitoring solution

Although they are harnessed to varying degrees, HTTP bindings exist in every microservice in the Zopa estate, so we expose health and diagnostic information via two well-defined HTTP endpoints, namely:

/healthz – A super-lightweight call that tests whether a node is internally running (e.g. it can service HTTP requests and its messaging buses are active). Yields 200 if the service is internally happy, and 500 otherwise.

/diagnosticz – An in-depth report of the service’s relationship with the services it depends upon. Yields 200 or 500 depending on the health of its dependencies, along with a breakdown of each dependency and its perceived health.

Why the z, I hear you ask? No, it’s not because we are l33t h4x0rz; we use Kubernetes to host our new applications, and /healthz happens to be a standard health endpoint. The z essentially ensures that it won’t be inadvertently used by our applications for some other purpose.

With these endpoints, we can configure Splunk to proactively alert us to problems based on the outcomes of these calls.
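To make this concrete, here is a minimal sketch of the sort of probe such alerting rests on, written in Python (one of the stacks in use at Zopa). The in-process stand-in server and the check_health helper are our own illustrations, not part of any Zopa tooling:

```python
import http.server
import threading
import urllib.error
import urllib.request

class HealthzHandler(http.server.BaseHTTPRequestHandler):
    """A stand-in service: 200 on /healthz, 404 elsewhere."""
    def do_GET(self):
        self.send_response(200 if self.path == "/healthz" else 404)
        self.end_headers()
    def log_message(self, *args):
        pass  # keep the demo quiet

def check_health(url, timeout=2.0):
    """Return True iff the endpoint answers 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False  # covers HTTPError (non-2xx) and connection failures

# Spin up the stand-in service on an ephemeral port and probe it.
server = http.server.HTTPServer(("127.0.0.1", 0), HealthzHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
print(check_health(f"http://127.0.0.1:{port}/healthz"))  # True
```

A real alerting pipeline would of course record the outcome somewhere queryable (Splunk, in our case) rather than printing it.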

Diagnostic information

The /diagnosticz endpoint returns a simple object comprising a correlationId and an array of healthcheck results. We make heavy use of correlation ids at Zopa to tie together logs generated by multiple systems in Splunk. We try to ensure every interaction logs a supplied correlation id, allowing us to view exactly what happened during an execution of a /diagnosticz call.

    {
      "correlationId": "c7eba331-caab-47d7-9a66-3f8cd7f6e436",
      "results": [
        {
          "exceptionMessage": null,
          "executionTime": "00:00:00.0302823",
          "additionalMessage": "connected to replica",
          "name": "some healthcheck against some messaging fabric",
          "passed": true
        },
        {
          "exceptionMessage": "Add your exception and stacktrace here...",
          "executionTime": "00:00:01",
          "additionalMessage": "connected to replica",
          "name": "some other healthcheck against some unhappy service",
          "passed": false
        }
      ]
    }
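For alerting, it can be useful to boil such a report down to just the failed checks. Below is a small illustrative Python sketch using a payload of the same shape; summarise_diagnostics is our own helper name, not part of Zopa.ServiceDiagnostics:

```python
import json

# A /diagnosticz-shaped payload, trimmed to the fields we need here.
payload = """{
  "correlationId": "c7eba331-caab-47d7-9a66-3f8cd7f6e436",
  "results": [
    {"name": "messaging fabric check", "passed": true},
    {"name": "unhappy service check", "passed": false}
  ]
}"""

def summarise_diagnostics(raw):
    """Return the names of the failed checks, for alerting."""
    report = json.loads(raw)
    return [r["name"] for r in report["results"] if not r["passed"]]

print(summarise_diagnostics(payload))  # ['unhappy service check']
```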

A .Net library

To simplify this system for .Net developers, we created Zopa.ServiceDiagnostics. It was designed to run various pre-defined health checks in parallel and report the results in the format above. Using it is as simple as implementing one of two interfaces (depending on how much granularity your healthcheck needs) per check, and registering it with a runner (we use IoC for this purpose).

The IAmAHealthCheck interface is as simple as:

    public interface IAmAHealthCheck
    {
        string Name { get; }

        /// <summary>
        /// Runs the actual health check
        /// </summary>
        /// <param name="correlationId">Each unique run of all checks will have an associated correlation id to help tie your logs together (assuming you log such things)</param>
        /// <returns>Anything if the check was successful. A health check's run is marked as unsuccessful if this method throws an exception</returns>
        Task ExecuteAsync(Guid correlationId);
    }

A successful execution is one that doesn’t throw an exception. After these have been implemented, simply hook them up into your IoC container of choice (or don’t, F# fans) and fire off a runner. Below is the non-IoC’d variant of the concept.

    IEnumerable<IAmAHealthCheck> healthchecks = SomehowInstantiateYourStandardHealthchecks();
    IEnumerable<IAmADescriptiveHealthCheck> descriptiveHealthchecks = SomehowInstantiateYourDescriptiveHealthchecks();

    var runner = new HealthCheckRunner(healthchecks, descriptiveHealthchecks);

    var result = await runner.DoAsync();

The runner makes use of the TPL to run as many healthchecks concurrently as the system will allow, and has no external dependencies. Because of this, it leaves serialising the output to JSON to the developer, which is a trivial task. It’s also available on NuGet.
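For readers on stacks other than .Net, the runner’s core idea translates directly: each check either completes (pass) or throws (fail), and all checks run concurrently into one report. Here is a hedged Python/asyncio sketch of that idea; run_checks and the report fields are our own illustration, not the library’s API:

```python
import asyncio
import time

async def run_checks(checks):
    """Run named async checks concurrently; a raised exception marks a failure."""
    async def run_one(name, check):
        start = time.perf_counter()
        try:
            await check()
            return {"name": name, "passed": True, "exceptionMessage": None,
                    "executionTime": time.perf_counter() - start}
        except Exception as exc:
            return {"name": name, "passed": False, "exceptionMessage": str(exc),
                    "executionTime": time.perf_counter() - start}
    return await asyncio.gather(*(run_one(n, c) for n, c in checks.items()))

async def database_ok():
    pass  # pretend we pinged a database successfully

async def queue_ok():
    raise RuntimeError("dependency down")  # pretend the broker is unreachable

results = asyncio.run(run_checks({"db": database_ok, "queue": queue_ok}))
print([(r["name"], r["passed"]) for r in results])  # [('db', True), ('queue', False)]
```

The important property, as with the .Net runner, is that one slow or broken check cannot hide the results of the others.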

A word of warning

Microservice-based solutions come in all shapes and sizes, and initial stabs at breaking up monoliths may lead to micromonoliths: highly coupled services relying on synchronous interconnections which succeed or fail together. In such environments, a /diagnosticz endpoint may not be useful, since one failing service may take down the others.

For example, if A depends on B, B depends on C, and so on down to Y, a failure of Y will take down every other service in the environment. If you must write healthchecks against peer services, ensure they are very light-touch; for example, consider a healthcheck that simply pings service B’s /healthz endpoint to check network connectivity, without worrying whether service B is healthy. This will protect against an explosion of failures, and make service B’s diagnostic failures much more visible.
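A light-touch peer check of that kind might look like the following Python sketch, where peer_reachable is an illustrative name of our own: any HTTP answer at all from the peer’s /healthz counts as a pass, and only a failure to connect fails the check.

```python
import http.server
import threading
import urllib.error
import urllib.request

def peer_reachable(url, timeout=2.0):
    """Pass if the peer answers at all, even with a 500."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # the peer answered; its health is its own affair
    except urllib.error.URLError:
        return False  # unreachable: refused, no route, DNS failure

# Demo: a peer that is reachable but unhealthy (always answers 500).
class UnhealthyPeer(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(500)
        self.end_headers()
    def log_message(self, *args):
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), UnhealthyPeer)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
print(peer_reachable(f"http://127.0.0.1:{port}/healthz"))  # True
```

Note the ordering of the except clauses: HTTPError is a subclass of URLError, so it must be caught first for the "answered but unhealthy" case to count as a pass.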

In summary

A robust microservice architecture demands a significant amount of supporting infrastructure, monitoring or otherwise, but I hope that this article gives you some insight into how we do it here at Zopa, and sparks some conversation around proactive monitoring and alerting.

(image credit to Pete Linforth)